This post summarises some of the lessons learned while working with Spark’s Word2Vec implementation. You may also be interested in the previous post “Problems encountered with Spark ml Word2Vec”.
Lesson 1: Spark’s Word2Vec getVectors returns the unique embeddings
As mentioned in part 2, the transform function returns one vector per input sentence, computed by averaging the vectors of the words that sentence contains. If you want the actual trained model, i.e. the unique word-to-vector mapping, you should use the getVectors function.
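The difference can be sketched as follows. This is illustrative only: the SparkSession, DataFrame name and column names are assumptions, not part of the original post.

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("w2v-sketch").getOrCreate()
import spark.implicits._

// Assumed toy input: a column of tokenised sentences.
val sentences = Seq(
  "spark makes distributed computation easy".split(" ").toSeq,
  "word2vec learns word embeddings".split(" ").toSeq
).toDF("text")

val model: Word2VecModel = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("features")
  .setVectorSize(50)
  .fit(sentences)

// transform: one averaged vector per input sentence.
val sentenceVectors = model.transform(sentences)

// getVectors: the trained model itself -- a DataFrame with one row
// per unique vocabulary word ("word", "vector" columns).
val wordVectors = model.getVectors
```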
Lesson 2: More partitions == more speed == less quality
There is a balance to strike between a fast implementation and one with good quality. More Word2Vec partitions means the data is separated into many smaller buckets, each losing the context of the words in the other buckets; the partial results are only brought together at the end of an iteration. For this reason, you don’t want to split your data into too many partitions. However, you also don’t want to lose out on parallelism – after all, you are using Spark because you want distributed computation. Play around with the total partitions – the right value for this parameter differs from problem to problem. Also remember that fewer partitions means less parallelism and therefore a slower algorithm.
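The knob in question is numPartitions (default 1). A minimal sketch, with an illustrative value that you would tune per problem:

```scala
import org.apache.spark.ml.feature.Word2Vec

// numPartitions trades quality for speed: each partition trains on
// its own slice of the data, and the partial models are only merged
// at the end of each iteration.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("features")
  .setNumPartitions(4) // illustrative value, not a recommendation
```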
Lesson 3: More iterations == less speed == more quality
As mentioned in lesson 2, the data from various partitions is brought together at the end of each iteration. Having more iterations means more context from the different buckets and more time training. This means that more iterations can lead to better results, but they do have an impact on the running time of the algorithm.
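The corresponding parameter is maxIter (default 1), usually tuned together with numPartitions; again the values below are illustrative assumptions:

```scala
import org.apache.spark.ml.feature.Word2Vec

// More iterations = more merge-and-retrain rounds across partitions,
// so usually better vectors at the cost of a longer run.
val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("features")
  .setNumPartitions(4)
  .setMaxIter(5) // illustrative; higher = slower but often better
```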
Lesson 4: Machine learning algorithms need a lot of hardware
This probably doesn’t come as a surprise, but it is still worth mentioning. Even though you are running a machine learning algorithm on a distributed cluster, Spark’s Word2Vec keeps the vocabulary and its vectors on a single machine, so the one thing you keep having to give more memory is your driver.
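In practice that means raising the driver-side settings when submitting the job. The flags are real Spark options, but the values, class name and jar path below are placeholders:

```shell
spark-submit \
  --class com.example.Word2VecJob \
  --driver-memory 8g \
  --executor-memory 4g \
  --conf spark.driver.maxResultSize=4g \
  your-word2vec-job.jar
```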
Lesson 5: Save things to parquet
Why? Parquet is a columnar format with efficient data compression, built for handling bulk data, which leads to fewer memory issues.
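For example, the learned word vectors from getVectors can be written out once and read back later without retraining. A sketch, assuming the model and spark session from earlier and an illustrative path:

```scala
// Persist the unique word embeddings as Parquet.
model.getVectors.write
  .mode("overwrite")
  .parquet("/data/word2vec/vectors.parquet")

// Later, reload them without rerunning the (expensive) training.
val vectors = spark.read.parquet("/data/word2vec/vectors.parquet")
```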
Lesson 6: Spark ml Word2Vec is not mockable
If you are writing tests for your Spark jobs, which you should be doing, you will probably try to mock out Spark’s Word2Vec implementation as it is nondeterministic. You will soon be greeted by an error message stating that Word2Vec cannot be mocked, and quickly find out that this is because it is a final class in the ml library. To get around this you can wrap your call to Word2Vec in a function and inject it into the function that you are testing.
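One way to sketch that workaround (all names here are illustrative, not from the original post):

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.DataFrame

// Word2Vec is final, so mocking frameworks cannot subclass it.
// Instead, hide the training call behind a plain function value...
val realTrainer: DataFrame => DataFrame = df =>
  new Word2Vec()
    .setInputCol("text")
    .setOutputCol("features")
    .fit(df)
    .getVectors

// ...and make the code under test accept that function as a parameter.
def buildEmbeddings(df: DataFrame, train: DataFrame => DataFrame): DataFrame =
  train(df)
```

In production you pass realTrainer; in a test you pass a deterministic stub of type DataFrame => DataFrame that returns a canned DataFrame, so the nondeterministic training never runs.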