Spark Word2Vec: lessons learned

This post summarises some of the lessons learned while working with Spark’s Word2Vec implementation. You may also be interested in the previous post, “Problems encountered with Spark ml Word2Vec”.

Lesson 1: Spark’s Word2Vec getVectors returns the unique embeddings

As mentioned in part 2, the transform function turns each input sentence into a single vector (the average of its word vectors). If you want the actual trained model, and therefore the unique word-to-vector representations, you should use the getVectors function.
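A minimal sketch of the difference (the app name and toy corpus are illustrative):

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("Word2VecLessons").getOrCreate()
import spark.implicits._

// Toy corpus: one row per sentence, each a sequence of words.
val sentences = Seq(
  "spark makes distributed computation easy".split(" ").toSeq,
  "word2vec learns dense word embeddings".split(" ").toSeq
).toDF("text")

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("features")
  .setVectorSize(50)
  .setMinCount(0)

val model = word2Vec.fit(sentences)

// transform: one averaged vector per input sentence.
val sentenceVectors = model.transform(sentences)

// getVectors: the trained model itself -- one row per unique word
// with its learned embedding (columns "word" and "vector").
val wordVectors = model.getVectors
```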

Lesson 2: More partitions == more speed == less quality

There is a balance you need to strike between a fast implementation and one with good quality. More Word2Vec partitions means the data is split into many smaller buckets, each losing the context of the words in the other buckets; the partial results are only brought together at the end of an iteration. For this reason, you don’t want to split your data into too many partitions. However, you also don’t want to lose out on parallelism – after all, you are using Spark because you want distributed computation. Play around with the total partitions – the right value for this parameter will differ depending on the problem. Also remember that fewer partitions means less parallelism and therefore a slower algorithm.
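The knob for this is setNumPartitions (the value below is purely illustrative; the default is 1):

```scala
// Illustrative value, not a recommendation: more partitions means more
// parallelism and speed, but each partition trains on its own slice of
// the data and only shares results at the end of an iteration.
val fastButCoarser = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("features")
  .setNumPartitions(4)
```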

Lesson 3: More iterations == less speed == more quality

As mentioned in lesson 2, the data from various partitions is brought together at the end of each iteration. Having more iterations means more context from the different buckets and more time training. This means that more iterations can lead to better results, but they do have an impact on the running time of the algorithm.
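The corresponding knob is setMaxIter; again, the values below are illustrative rather than recommendations:

```scala
// Illustrative values: more iterations give the partitions more chances
// to exchange results, improving quality at the cost of running time.
// The default is 1 iteration.
val slowerButBetter = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("features")
  .setNumPartitions(4)
  .setMaxIter(5)
```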

Lesson 4: Machine learning algorithms need a lot of hardware

This probably doesn’t come as a surprise, but it is still worth mentioning. You are running a machine learning algorithm on a distributed cluster, yet one component keeps needing more memory: your driver.
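In practice that usually means raising the driver memory when submitting the job. A sketch of such an invocation (the class name, jar, and memory sizes are all placeholders):

```
# Illustrative spark-submit invocation; memory sizes, class name and jar
# are placeholders to tune for your own job and cluster.
spark-submit \
  --class com.example.Word2VecJob \
  --driver-memory 8g \
  --executor-memory 4g \
  word2vec-job.jar
```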

Lesson 5: Save things to parquet

Why? Parquet’s efficient, columnar compression is built for handling bulk data, which leads to fewer memory issues.
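For example, the unique word vectors from lesson 1 can be written to and read back from Parquet (the path is a placeholder):

```scala
// Write the learned word vectors to Parquet and read them back.
model.getVectors.write.mode("overwrite").parquet("/tmp/word_vectors.parquet")
val reloaded = spark.read.parquet("/tmp/word_vectors.parquet")
```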

Lesson 6: Spark ml Word2Vec is not mockable

If you are writing tests for your Spark jobs – which you should be – you will probably try to mock out Spark’s Word2Vec implementation, as it is nondeterministic. You will soon be greeted by an error message stating that Word2Vec cannot be mocked, and you will quickly find out that this is because Word2Vec is a final class in the ml library. To get around this, you can wrap the call to Word2Vec in a function and inject it into the code that you are testing.
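A minimal sketch of that workaround, with illustrative names: hide the fit call behind a small trait, and have the code under test depend on the trait so tests can supply a deterministic stub.

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.DataFrame

// Word2Vec is final, so wrap it in a trait that tests can stub out.
trait Word2VecTrainer {
  def fit(sentences: DataFrame): Word2VecModel
}

class SparkWord2VecTrainer(vectorSize: Int) extends Word2VecTrainer {
  override def fit(sentences: DataFrame): Word2VecModel =
    new Word2Vec()
      .setInputCol("text")
      .setOutputCol("features")
      .setVectorSize(vectorSize)
      .fit(sentences)
}

// The code under test depends only on the trait, so a test can inject
// a deterministic fake instead of the real, nondeterministic trainer.
def buildEmbeddings(trainer: Word2VecTrainer, sentences: DataFrame): DataFrame =
  trainer.fit(sentences).getVectors
```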
