Word2Vec (W2V) is an algorithm that takes in a text corpus and outputs a vector representation for each word. You can read more about it in my previous blog post.
There are various implementations of Word2Vec out there ready for you to use. Some of these include:
- Google’s first implementation
- Spark’s MLlib and ML implementations
- TensorFlow’s implementation
- Gensim implementation
- A PyPI implementation
This post briefly compares the first two of these, namely Google’s first implementation and Spark’s ml implementation. It is mainly useful for you if:
- You are considering moving from Google’s first implementation to Spark’s
- You are deciding whether to use either of them
The following table summarises the differences between these:
| Attribute | Google implementation | Spark implementation (ml) |
|---|---|---|
| Architecture | Skip gram, continuous bag of words | Skip gram |
| Training algorithm | Hierarchical softmax, negative sampling | Hierarchical softmax |
| Notes | Highly optimized, but not distributed | Depending on the parameters and the data, large speed gains can be noticed. I’ve seen it get a 60% speed-up with minimal change to the results |
| Maintainability | No longer maintained | Still maintained and supported by Spark |
| Including in your project | Requires you to download the source code directly and save it into your project | The library comes with Spark already |
| Failure management | A failure will crash the process | A failure will cause Spark to try a second attempt. This is great for intermittent cases such as network connections being lost |
| Stability | Stable | Possible instabilities that emerge from moving to a distributed model. In other words, possible standard Spark problems |
Optimizable parameters:

| Parameter | Google implementation | Spark implementation (ml) |
|---|---|---|
| Vector size | size | setVectorSize |
| Learning rate | alpha | setStepSize |
| Input file | train | Pass your DataFrame |
| Output file | output | Save using Spark |
| Window size | window | setWindowSize |
| Use hierarchical softmax | hs | No other option; this is the default |
| Distribute into x parallel processes | threads | setNumPartitions |
| Total iterations | iter | setMaxIter |
| Minimum occurrences of a word to be considered | min-count | setMinCount |