Small comparison of Google Word2Vec vs Spark Word2Vec

Word2Vec (W2V) is an algorithm that takes in a text corpus and outputs a vector representation for each word. You can read more about it in my previous blog post.

There are various implementations of Word2Vec out there ready for you to use. Some of these include:

This post only briefly compares the first two, namely Google’s first implementation and Spark’s ml implementation. It is mainly useful if:

  • You are considering whether to move from Google’s first implementation to Spark’s
  • You are considering whether to use either of these

 

The following table summarises the differences between these:

| Attribute | Google implementation | Spark implementation (ml) |
|---|---|---|
| Architecture | Skip gram; continuous bag of words | Skip gram |
| Training algorithm | Hierarchical softmax; negative sampling | Hierarchical softmax |
| Notes | Highly optimized, but not distributed. | Depending on the parameters and the data, large speed gains are possible. I’ve seen a 60% speed-up with minimal change to the results. |
| Maintainability | No longer maintained | Still maintained and supported by Spark |
| Including in your project | Requires you to download the source code directly and save it into your project | The library already comes with Spark |
| Failure management | A failure will crash the process | A failure will cause Spark to try a second attempt. This is great for intermittent problems such as lost network connections. |
| Stability | Stable | Possible instabilities that emerge from moving to a distributed model. In other words, the standard Spark problems. |

Optimizable parameters:

| Parameter | Google parameter name | Spark parameter name |
|---|---|---|
| Vector size | size | setVectorSize |
| Learning rate | alpha | setStepSize |
| Input file | train | Pass your data frame |
| Output file | output | Save using Spark |
| Window size | window | setWindowSize |
| Use hierarchical softmax | hs | No other option, this is the default |
| Distribute into x parallel processes | threads | setNumPartitions |
| Total iterations | iter | setMaxIter |
| Minimum occurrences of a word to be considered | min-count | setMinCount |
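
If you are moving an existing Google word2vec command line over to Spark, the parameter mapping above can be applied mechanically. The following is only a minimal sketch in plain Python (no Spark required): `translate`, `FLAG_TO_SETTER` and `FLAG_NOTES` are hypothetical names made up for this illustration, and the file names in the example (`text8`, `vectors.bin`) are placeholders. It only produces the setter calls as strings; it does not train anything.

```python
# Mapping taken from the parameter table above:
# Google word2vec flag -> Spark ml Word2Vec setter.
FLAG_TO_SETTER = {
    "size": "setVectorSize",
    "alpha": "setStepSize",
    "window": "setWindowSize",
    "threads": "setNumPartitions",
    "iter": "setMaxIter",
    "min-count": "setMinCount",
}

# Flags from the table that have no setter on the Spark side.
FLAG_NOTES = {
    "train": "pass your data frame instead of an input file",
    "output": "save the result using Spark instead of an output file",
    "hs": "no setter; hierarchical softmax is the only option",
}


def translate(argv):
    """Turn Google-style '-flag value' pairs into Spark setter calls.

    Hypothetical helper for illustration. Returns three lists:
    setter calls (as strings), notes for flags that have no setter,
    and flags that the table above does not cover.
    """
    calls, notes, unknown = [], [], []
    # Walk the arguments two at a time: a flag followed by its value.
    for flag, value in zip(argv[0::2], argv[1::2]):
        name = flag.lstrip("-")
        if name in FLAG_TO_SETTER:
            calls.append(f"{FLAG_TO_SETTER[name]}({value})")
        elif name in FLAG_NOTES:
            notes.append(f"{name}: {FLAG_NOTES[name]}")
        else:
            unknown.append(name)
    return calls, notes, unknown


if __name__ == "__main__":
    # Placeholder command line; adjust to your own run.
    args = ("-train text8 -output vectors.bin -size 200 -window 5 "
            "-hs 1 -threads 8 -iter 5 -min-count 5").split()
    calls, notes, unknown = translate(args)
    print(".".join(calls))
    for note in notes:
        print("note:", note)
```

Anything that ends up in the third list (for example a flag that is not in the table) needs a manual decision, since this post does not cover how it maps to Spark.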

 
