This post summarises some of the problems I ran into when using Spark’s ml Word2Vec implementation.
Out of memory exception
Spark’s Word2Vec implementation requires quite a bit of memory depending on the amount of data that you are dealing with. This is because the driver ends up having to do a lot of work. You may experience this problem with various machine learning implementations in Spark.
All you have to do is increase the total memory allocated to your driver using spark-submit’s driver-memory option. Note that your cluster may have an upper limit set, which you might need to raise. The error message you get if you set the driver memory to a value above this threshold is straightforward: it pretty much tells you to increase the limit by changing the value of the cluster’s yarn.scheduler.maximum-allocation-mb.
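As a rough sketch, a submission with an increased driver heap might look like this; the application jar, main class, and memory value below are placeholders, not names from this project:

```shell
# Give the driver a 40 GB heap at submission time.
# Jar name, class, and memory values are illustrative placeholders.
spark-submit \
  --master yarn \
  --driver-memory 40g \
  --class com.example.Word2VecJob \
  my-word2vec-job.jar
```

If the requested value exceeds yarn.scheduler.maximum-allocation-mb, the submission fails with the message described above until that cluster property is raised.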
In my case, the driver was using 30 GB, so I gave it 40 GB.
Total size of serialized results of X tasks (X MB) is bigger than spark.driver.maxResultSize (Y MB)
The Word2Vec algorithm produces result sizes larger than a typical data-cleaning job does. You can raise Spark’s limit by increasing the value of spark.driver.maxResultSize.
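For example, the limit can be raised at submission time; the 8g figure and the jar name here are illustrative, not values from this project:

```shell
# Raise the serialized-result limit on the driver; 8g is an example value.
# Setting it to 0 removes the limit entirely, at the risk of driver OOM.
spark-submit \
  --conf spark.driver.maxResultSize=8g \
  --driver-memory 40g \
  my-word2vec-job.jar
```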
Default column name not found
Spark’s ml Word2Vec implementation deals with Dataframes. This means that it relies on string names of columns rather than concrete types. You are getting this error because your Dataframe’s column name does not match the name expected by the Word2Vec training function. There are two options to fix this:
- Change the name expected by Word2Vec to the name of your input Dataframe’s column using Word2Vec’s setInputCol function. If you have not set a column name yourself, it is probably value.
- Change your input Dataframe’s column name to the one Word2Vec expects, i.e. the value of its inputCol parameter (which you can check with getInputCol).
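A minimal sketch of the first option, assuming a hypothetical sentences Dataframe whose tokenised column is called text (both names are placeholders):

```scala
import org.apache.spark.ml.feature.Word2Vec

// `sentences` is a hypothetical DataFrame with a column "text"
// of type Array[String] (one tokenised sentence per row).
val word2Vec = new Word2Vec()
  .setInputCol("text")    // point Word2Vec at our column name
  .setOutputCol("vector")
  .setVectorSize(100)
  .setMinCount(5)

val model = word2Vec.fit(sentences)
```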
OutOfMemoryError: GC overhead limit exceeded
As the driver is doing a lot of work, the default garbage collector seems to struggle to keep up with the cleanup. To fix this you can switch to concurrent garbage collection by enabling it through the Java options. You can do this by adding -XX:+UseConcMarkSweepGC to the driver’s Java options in your spark-submit.
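Concretely, the flag can be passed through spark.driver.extraJavaOptions; the jar name is again a placeholder:

```shell
# Enable the concurrent mark-sweep collector on the driver.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  my-word2vec-job.jar
```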
Cannot resolve ‘`X`’ given input columns: [value, w2v_993c88fe4732__output]
As you are dealing with Dataframes when managing the results of Word2Vec, you are probably trying to map them to your custom datatype after retrieval. You get an error like this when your custom type’s constructor expects the wrong parameters. Since you may be retrieving the vectors in two different ways, let’s look at the expectations of each one:
- Using the dataframe returned by transform: this expects a type that takes in two parameters: value: Array[String], vector: Vector
- Using the dataframe returned by getVectors: this expects a type that takes in two parameters: word: String, vector: Vector
Ensure that when you use <dataframe>.as[<customType>] that the custom type expects the above-mentioned parameter types.
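As a sketch, the two shapes might look like the case classes below; the class names are illustrative, and the transform case assumes its auto-generated output column has been renamed to vector so it lines up with the constructor parameter:

```scala
import org.apache.spark.ml.linalg.Vector

// For the DataFrame returned by transform (tokens plus their vector).
case class SentenceVector(value: Array[String], vector: Vector)

// For the DataFrame returned by getVectors (one row per vocabulary word).
case class WordVector(word: String, vector: Vector)

// Hypothetical usage, assuming spark.implicits._ is in scope for the Encoders:
// val sentenceVectors = transformed.as[SentenceVector]
// val wordVectors     = model.getVectors.as[WordVector]
```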
Duplicates in output from Word2Vec
When saving your model you may notice that you are getting duplicated words with different vectors in your word-vector representation. One word should have one vector representation. This may be especially confusing if you’re moving from Google’s implementation to Spark’s. It happens because you are using the transform function. This function takes in the sentences that you trained the model with and returns a word-vector representation for each word in the given set. This means that words repeated across different sentences will also appear repeatedly in your result, each with the vector representation most appropriate for its context at that point. If what you want is the single vector representation of a word, you should get the correct embeddings by using the getVectors function.
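A short sketch of the getVectors approach, assuming the model variable from the earlier training step:

```scala
// One row per vocabulary word: columns "word" (String) and "vector" (Vector).
val embeddings = model.getVectors

// Each word appears exactly once here, so no duplicates to deal with.
embeddings.show(5, truncate = false)
```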
Failed to register classes with Kryo
This is not specific to Word2Vec, but it did happen during the implementation. It generally means that your manual Kryo serialization registration, which is done for optimization reasons, is missing a type. Find the type that you are missing and register it using kryo.register(classOf[<myClass>]).
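If you register classes through SparkConf rather than a custom KryoRegistrator, the registration is a one-liner; MyClass below is a placeholder standing in for whatever type the error message names:

```scala
import org.apache.spark.SparkConf

// Placeholder type standing in for whatever class the error names.
case class MyClass(id: Long, tokens: Array[String])

val conf = new SparkConf()
  .setAppName("word2vec-job")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyClass]))
```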
Memory issues when saving the results of getVectors
Once you are almost done and all that is left is to save your trained Word2Vec embeddings for future use, you might be greeted by some memory issues. If you are, you are probably trying either to save the whole model into a single file or to save it into partitioned plain text files on HDFS. You have a couple of options here.
Word2VecModel has a save function which allows you to save the model in a format that can be re-loaded into a Word2VecModel using the load function. This wasn’t quite what I needed in my case, but it may be appropriate for your use case.
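For completeness, a sketch of that save/load round trip; the HDFS path is a placeholder:

```scala
import org.apache.spark.ml.feature.Word2VecModel

// Persist the fitted model; the path is an illustrative placeholder.
model.save("hdfs:///models/word2vec")

// Later, possibly in another job, load it back for further use.
val reloaded = Word2VecModel.load("hdfs:///models/word2vec")
```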
I needed to save the embeddings as plain data so that another Spark job could consume them as input to a second machine learning algorithm. For this reason, I went for my second file-saving option: saving to Parquet. This can be done with the following code snippet:
    model
      .repartition(partitions)
      .withColumnRenamed("_1", "word")
      .withColumnRenamed("_2", "vector")
      .select("word", "vector")
      .write.parquet("some output path")