Small comparison of Google Word2Vec vs Spark Word2Vec

Word2Vec (W2V) is an algorithm that takes in a text corpus and outputs a vector representation for each word. You can read more about it in my previous blog post.

There are various implementations of Word2Vec out there ready for you to use. Some of these include:

This post briefly compares only the first two: Google’s original implementation and Spark’s ml implementation. It is mainly useful if:

  • You are considering whether to move from Google’s original implementation to Spark’s
  • You are considering whether to use either of these

The following table summarises the differences between these:

| Attribute | Google implementation | Spark implementation (ml) |
|---|---|---|
| Architecture | Skip-gram; continuous bag of words | Skip-gram |
| Training algorithm | Hierarchical softmax; negative sampling | Hierarchical softmax |
| Notes | Highly optimized, but not distributed. | Depending on the parameters and the data, large speed gains are possible. I have seen a 60% speed-up with minimal change to the results. |
| Maintainability | No longer maintained | Still maintained and supported by Spark |
| Including in your project | Requires you to download the source code and save it into your project | The library already comes with Spark |
| Failure management | A failure crashes the process | A failure causes Spark to retry the task. This is great for intermittent problems such as lost network connections. |
| Stability | Stable | Possible instabilities that come from moving to a distributed model; in other words, the usual Spark problems. |

| Optimizable parameter | Parameter name (Google) | Parameter name (Spark) |
|---|---|---|
| Vector size | size | setVectorSize |
| Learning rate | alpha | setStepSize |
| Input file | train | Pass your DataFrame |
| Output file | output | Save using Spark |
| Window size | window | setWindowSize |
| Use hierarchical softmax | hs | No other option, this is the default |
| Distribute into x parallel processes | threads | setNumPartitions |
| Total iterations | iter | setMaxIter |
| Minimum occurrences of a word to be considered | min-count | setMinCount |
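
If you are moving an existing Google configuration over to Spark, the parameter mapping above can be written down as a small lookup. The sketch below is only an illustration, not part of either library: the helper names (`google_command`, `spark_setter_calls`, `apply_to_estimator`), the `./word2vec` binary path and the file names are all assumptions made for the example. The flag and setter names are the ones listed in the table.

```python
# Mapping taken from the table above:
# Google word2vec command-line flag -> Spark ml Word2Vec setter name.
GOOGLE_TO_SPARK = {
    "size": "setVectorSize",
    "alpha": "setStepSize",
    "window": "setWindowSize",
    "threads": "setNumPartitions",
    "iter": "setMaxIter",
    "min-count": "setMinCount",
}

# Flags with no setter equivalent: in Spark, input and output go through
# DataFrames, and hierarchical softmax is the only option.
NO_SPARK_SETTER = {"train", "output", "hs"}


def google_command(params, binary="./word2vec"):
    """Build the argument list for the Google tool from a flag -> value dict.

    The binary path is an assumption; adjust it to wherever you compiled the tool.
    """
    args = [binary]
    for flag, value in params.items():
        args += ["-" + flag, str(value)]
    return args


def spark_setter_calls(params):
    """Translate Google flags into (setter name, value) pairs for Spark."""
    calls = []
    for flag, value in params.items():
        if flag in NO_SPARK_SETTER:
            continue  # handled outside the estimator's setters
        if flag not in GOOGLE_TO_SPARK:
            raise KeyError(f"no known Spark equivalent for -{flag}")
        calls.append((GOOGLE_TO_SPARK[flag], value))
    return calls


def apply_to_estimator(estimator, params):
    """Call each mapped setter on an estimator object and return it."""
    for setter, value in spark_setter_calls(params):
        getattr(estimator, setter)(value)
    return estimator


# Hypothetical configuration, written with the Google flag names.
params = {
    "train": "corpus.txt",
    "output": "vectors.bin",
    "size": 100,
    "window": 5,
    "hs": 1,
    "threads": 8,
    "iter": 5,
    "min-count": 5,
}

print(google_command(params))
print(spark_setter_calls(params))
```

With Spark installed, `apply_to_estimator` could be handed a `pyspark.ml.feature.Word2Vec()` instance. The input and output columns still have to be set separately, since they have no counterpart among the Google flags.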