Word2Vec (W2V) is an algorithm that takes in a text corpus and outputs a vector representation for each word. You can read more about it in my previous blog post.
There are various implementations of Word2Vec out there ready for you to use. Some of these include:
- Google’s first implementation
- Spark’s MLlib and ML implementations
- TensorFlow’s implementation
- Gensim implementation
- A PyPI implementation
This post briefly compares the first two of these, namely Google’s first implementation and Spark’s ml implementation. It is mainly useful for you if:
- You are considering moving from Google’s first implementation to Spark’s
- You are deciding whether to use either of them
The following table summarises the differences between these:
| Attribute | Google implementation | Spark implementation (ml) |
|---|---|---|
| Architecture | Skip gram, continuous bag of words | Skip gram |
| Training algorithm | Hierarchical softmax, negative sampling | Hierarchical softmax |
| Notes | Highly optimized, but not distributed | Depending on the parameters and the data, large speed gains can be noticed. I’ve seen it get a 60% speed-up with minimal change to the results |
| Maintainability | No longer maintained | Still maintained and supported by Spark |
| Including in your project | Requires you to download the source code directly and save it into your project | The library comes with Spark already |
| Failure management | A failure will crash the process | A failure will cause Spark to try a second attempt. This is great for intermittent cases such as network connections being lost |
| Stability | Stable | Possible instabilities that emerge from moving to a distributed model. In other words, possible standard Spark problems |
Optimizable parameters:

| Parameter | Google implementation | Spark implementation (ml) |
|---|---|---|
| Vector size | size | setVectorSize |
| Learning rate | alpha | setStepSize |
| Input file | train | Pass your DataFrame |
| Output file | output | Save using Spark |
| Window size | window | setWindowSize |
| Use hierarchical softmax | hs | No other option; this is the default |
| Distribute into x parallel processes | threads | setNumPartitions |
| Total iterations | iter | setMaxIter |
| Minimum occurrences of a word to be considered | min-count | setMinCount |