Monitoring ML pipelines

I have spoken a lot in this blog about the process of bringing machine learning code to production. However, once the models are in production you are not done, you are just getting started. The model will have to face its worst enemy: The Real World! 

Image credit to: Tim Smit

This post focuses on what kinds of monitoring you can put in place in order to understand how your model is performing in the real world. This covers both continuous training and the usage of the trained model. It looks into:

  • Monitoring your infrastructure
  • Monitoring the data
  • Monitoring the training
  • Monitoring value in the real world

Monitoring your infrastructure

Engineers building any system generally have some monitoring in place in order to make sure that services are up and infrastructure is not at capacity. This kind of monitoring can be useful for your machine learning components as well, as it can help you spot:

  • Changes in the frequency and quantity of incoming data.
  • Changes and failures in third party systems that we may depend on for data or processing.
  • The need for upgrading the infrastructure with growing demand.
  • The need to optimise/distribute the code of your machine learning components with increased consumption of data.
  • The need to re-evaluate your chosen solution when certain limitations are reached.

Some of these situations can also have certain less obvious implications for your models. For example, an increase/decrease in the quantity/frequency of the data being received could indicate that some event has occurred in the real world (storm, earthquake) which may in turn skew your model’s results. It may also be a side effect of something like a security breach, such as an attempted DDoS attack, which may in turn negatively affect your model results (e.g. in a recommendations system).

There are a number of popular tools for this kind of monitoring, ranging from metrics dashboards to alerting systems.

 


Monitoring the data

The real world changes, and so will your data. As this is the most important component for training and using a model, we need to have something in place to understand when data has drifted in order to act on it.

This can be done by keeping track of some statistics (median, variance, standard deviation, etc.) about your data. You can calculate these for the data that you use to train your model, and keep track of them over time, with every new training run. You can then also do the same, at a certain frequency, over a batch of the latest data used for inference in the model. The two sets of statistics, tracked over time, will help you understand:

  • Whether input training data has changed over time, and how much.
  • How often data is changing, and whether you may need to train models more frequently.
  • Whether the data being selected for training is different from the data that is being used for inference. This could be a symptom of not training often enough, but it can just as likely be caused by the selection process for training data, which may be biased.

A scenario where data has changed will likely result in the need for you to investigate what the changes are and whether they need to be acted upon. Such changes may indicate that the relationship between the input and output has changed, or that something in the real world has affected the data either in an expected manner (e.g. price changes) or an unexpected manner (e.g. natural disasters). It may also be that the change is not having any effect on results and needs no further action.
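
As a minimal sketch of what this tracking could look like with pandas (the feature names and data frames are made up), you could compute a summary per batch and store it, together with a timestamp, for every training run and for every batch of inference data:

import pandas as pd

def batch_statistics(batch: pd.DataFrame, features):
    """Summary statistics for one batch of data (training or inference)."""
    return {
        feature: {
            'median': batch[feature].median(),
            'variance': batch[feature].var(),
            'std': batch[feature].std(),
            'min': batch[feature].min(),
            'max': batch[feature].max(),
            'missing_fraction': batch[feature].isna().mean(),
        }
        for feature in features
    }

# Hypothetical usage: computed for every training run and, at a certain
# frequency, for the latest batch of inference data.
# training_stats = batch_statistics(training_data, ['price', 'quantity'])
# inference_stats = batch_statistics(latest_inference_batch, ['price', 'quantity'])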

In addition to these statistical measures, one could implement a set of validations, which in turn provide alerts when data does not match expectations. For example, alerts triggered by minimum/maximum ranges for certain features being exceeded, an acceptable threshold of missing data being surpassed, or the difference between the statistics of the training and inference data being larger than a threshold.
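
Building on the statistics dictionary from the previous sketch, such validations could look roughly like this; the thresholds and feature names are illustrative:

def validate_feature(stats, feature, allowed_min, allowed_max,
                     max_missing=0.1, training_stats=None, max_median_shift=3.0):
    """Return a list of alert messages for a single feature."""
    alerts = []
    if stats[feature]['min'] < allowed_min or stats[feature]['max'] > allowed_max:
        alerts.append(f'{feature}: values outside the allowed range')
    if stats[feature]['missing_fraction'] > max_missing:
        alerts.append(f'{feature}: acceptable threshold of missing data surpassed')
    if training_stats is not None:
        drift = abs(stats[feature]['median'] - training_stats[feature]['median'])
        if drift > max_median_shift * (training_stats[feature]['std'] or 1.0):
            alerts.append(f'{feature}: inference statistics differ too much from training statistics')
    return alerts

# Hypothetical usage; in practice the alerts would go to your alerting system.
# for alert in validate_feature(inference_stats, 'price', allowed_min=0,
#                               allowed_max=10_000, training_stats=training_stats):
#     print(f'ALERT: {alert}')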


Image credit to: xkcd

Monitoring the training

You are probably training your model at a certain frequency in order to keep up with the changing world. There are certain things that you can keep track of during this training process which can help in spotting changes and investigating problems. These include:

  • Keeping track of feature importance, where the algorithm that you are using allows it. A change in which features are most important may indicate a possible drift in the data, or a change in the relationship between the input and output.
  • Keeping track of model quality measures (F1 score, accuracy, etc). It may be that the new model suddenly performs significantly worse when compared against an old version. This may indicate a change in the data which makes the current approach outdated or inappropriate for the problem being solved.
  • Keeping track of statistics about the input data being used to train the model. This was mentioned in the previous section.
  • Keeping track of any other model KPIs that may be indicative of the model’s performance within the product (e.g. use case validation)

A change in these measures may be indicative of a problem in the new input data, either because of data drift, or a problem in the data cleaning process. Additionally, it could also be an indicator of a need for a different approach that would work with the new state of the data.
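
A minimal sketch of what keeping track of these measures could look like – the metric names and the JSON history file are simply one possible choice:

import json

def record_training_run(current_metrics, history_path='metrics_history.json',
                        max_f1_drop=0.05):
    """Append this run's metrics to a history file and flag large regressions."""
    try:
        with open(history_path) as f:
            history = json.load(f)
    except (IOError, ValueError):
        history = []

    if history and history[-1]['f1'] - current_metrics['f1'] > max_f1_drop:
        print('WARNING: F1 score dropped significantly compared to the previous run')

    history.append(current_metrics)
    with open(history_path, 'w') as f:
        json.dump(history, f)

record_training_run({'f1': 0.87, 'accuracy': 0.91,
                     'top_features': ['change_in_naughty_robots']})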


Image credit to: kdnuggets

Monitoring value in the real world

For every software system in production, we need to understand how the software is being used by its users. We do this through analytics, which involve gathering data about what end users are doing in the application in order to infer the value of certain features. In the same way as with a rule-based system, analytics can help you understand what effect your machine learning components have on the product and its users.

A change in user behaviour on a feature that is driven by machine learning, for example, could indicate that a new model version is no longer providing positive value to the user. Alternatively it could also mean that the model is outdated given a change that has occurred in the real world.

This kind of data is often displayed on interactive dashboards that allow people to monitor certain important KPIs, as well as generate reports for interested stakeholders. In certain systems, this data is also used to pinpoint problems by providing alerts based on changes in these KPIs. The KPIs that may be affected by your machine-learning-driven features are the ones that you can leverage. These can help you understand the effect that changes to these features have on the overall user experience in the real world.


Image credit to: Dilbert

As you can see, you have many options for monitoring your machine learning components once they go out into the wild. You don’t have to implement all of these. Determine what makes sense for your team, the problem that you are solving, the possible problems that you foresee, the approach that you have chosen for the solution, and the maturity of the product.

If you have done something different that worked well for your team, please share it in the comments 🙂

Now go put some beautiful graphs on that abandoned screen!

The notebook anti-pattern

In the past few years there has been a large increase in tools trying to solve the challenge of bringing machine learning models to production. One thing that these tools seem to have in common is the incorporation of notebooks into production pipelines. This article aims to explain why this drive towards the use of notebooks in production is an anti-pattern, giving some suggestions along the way.

What is a notebook?

Let’s start by defining what these are, for those readers who haven’t been exposed to notebooks, or call them by a different name.

Notebooks are web interfaces that allow a user to create documents containing code, visualisations and text.

What are notebooks good for?

Contrary to what you might have gathered from the introduction, Notebooks are not all bad. They can be quite useful in certain scenarios, which will be described in the sub-sections below.

Data analysis

This is probably their most common use. When greeted by a new dataset, one needs to dig into the data and do certain visualisations in order to make sense of it. Notebooks are good for this because they allow us to:

  • quickly get started
  • see the raw data and visualisations in one place
  • have access to many existing cleaning and visualisation tools
  • document our progress and findings (which can be extracted as HTML)

Experimentation

When it comes to machine learning, a lot of experimentation takes place before a final approach to solving a problem is chosen. Notebooks are good for playing around with the data and various models in order to gain an understanding of what works with the given data and what does not.

One time tasks

Notebooks are also a good playground. Sometimes one needs to perform an automated task only once, whilst perhaps not being familiar or comfortable with writing bash scripts or using other similar tools.

Teaching or technical presentations

When teaching python, or performing a technical presentation for your peers, you may want to show code and the result of that code immediately after. Notebooks are great for this as they allow you to run code and show the result within the same document. They can show visualisations, represent sections with titles and provide additional documentation that may be needed by the presenter.

Code assessments

If your company provides code challenges to candidates, notebooks can be a useful tool. This also depends on what your company needs to assess. Notebooks allow a candidate to combine documentation, explanations and their solution into a single page. They are also easy to get running for the assessor, provided that the candidate has given the package requirements. However, what they cannot provide is a wide enough assessment of the candidate’s understanding of software engineering principles, as we will understand better from the next section.

What are notebooks bad at?

A lot of companies these days are trying to solve the challenge of bringing models to production. Data scientists within these companies may come from widely varying backgrounds, including statistics, pure mathematics, the natural sciences and engineering. One thing that they do have in common is that they are generally comfortable with using notebooks for analysis and experimentation, as the tool is designed for this purpose. Because of this, large companies that provide infrastructure have been focusing on bridging the “productionisation gap” by providing “one click deployment” tools within the notebook ecosystem, thereby encouraging the use of notebooks in production. Unfortunately, as notebooks were never designed to serve this purpose in the first place, this can lead to unmaintainable production systems.

The thought of notebooks in production always makes me think of the practicality of a onesie – it looks beautiful but is very impractical in certain scenarios.

Now that we know what Notebooks can do well, let’s dive into what they are really bad at in the following sections.

Continuous integration (CI)

Notebooks are not designed to be automatically run or handled via a CI pipeline, as they were built for exploration. They tend to involve documentation and visualisations, which would add unnecessary work to any CI pipeline. Though a notebook can be extracted as a normal Python script and then run on the CI pipeline, in most cases you will want to run the tests for the script, not the script itself (unless you are creating some artefact that needs to be exposed by the pipeline).

Testing

Notebooks are not testable, which is one of my main pain points with them. No testing framework has been created around them, because their purpose is to be playgrounds, not production systems. Contrary to popular belief, testing in data products is just as important and possible as in other software products. In order to test a notebook, the code from the notebook needs to be extracted to a script, which makes the notebook redundant anyway. It would need to be maintained to match the code in the extracted script, or diverge into yet more untested chaos.

If you want to learn more about testing ML pipelines, check out the article: Testing your ML pipelines.

Version control

If you have ever had to put a notebook on git or any other version control system and open a pull request, you may have noticed that this pull request is completely unreadable. That is because the notebook needs to keep track of the state of the cells, and therefore a lot of changes take place behind the scenes when it is run to create your beautiful HTML view. These changes also need to be versioned, causing the unreadable view.

Of course you may be in a team that uses pairing and not pull requests, so you may not care about the pull request being unreadable. However, you lose another advantage of version control through this readability decrease: when reverting code, or looking into old versions for a change that may have introduced or fixed a problem, you need to rely purely on the commit messages and manually revert to check a change.

This is a well known problem of notebooks, but also one that people are working on. There are some plugins that can be used in order to alleviate this at least in a web view of your version control system. One example of such a tool is Review Notebook App.

Collaboration

Collaboration in a notebook is hard. Your only viable collaboration options are to pair, or to take turns on the notebook like a game of Civilization. This is why:

  • Notebooks have a lot of state being managed in the background, therefore working asynchronously on the same notebook can lead to a lot of unmanageable merge conflicts. This can be a particular nightmare for remote teams.
  • All the code is also in the same place (other than imported packages), therefore there are continuous changes to the same code, making it harder to track the effect of changes. This is particularly bad due to the lack of testing (explained above)
  • The issues already mentioned in version control above

State

State has already been mentioned in both of the sections above, but it deserves its own point for emphasis. Notebooks have a notebook-wide state. This state changes every time that you run a cell, which may lead to the following issues:

  • Unmanageable merge conflicts within the state and not the code itself
  • Lack of readability in version control
  • Lack of reproducibility. You may have been working in the notebook with a state that is no longer reproducible, because the code that led to that state has been removed, yet the state has not been updated.

Engineering standards

Notebooks encourage bad engineering standards. I want to highlight the word encourage here, as a lot of these things are avoidable by the user of the notebook. Anti-patterns that are often seen in notebooks are:

  • Reliance on state: Notebooks rely heavily on state, especially because they generally involve some operations being performed on the data in the first few cells in order to feed this data into some algorithm. Reliance on state can lead to unintended consequences and side-effects throughout your code.
  • Duplication: One cannot import one notebook into another, therefore when trying multiple experiments in different notebooks one tends to copy-paste the common pieces across. Should one of these notebooks change, the others are immediately out of date. This can be improved by extracting the common parts of the code and importing them into the separate notebooks (see the sketch after this list). Duplication is also seen a lot within the notebooks themselves, though this is easily avoidable by just using functions.
  • Lack of testing: One cannot test a notebook, as we have seen in the above testing section.
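
As a small illustration of the extraction mentioned in the duplication point above (the module, function and dataset names are made up), the shared code would live in a plain Python module that every notebook imports:

# common_features.py – shared preprocessing extracted out of the notebooks
import pandas as pd

def add_rolling_mean(df: pd.DataFrame, column: str, window: int = 7) -> pd.DataFrame:
    """Add a rolling mean column, used by several experiments."""
    df[f'{column}_rolling_mean'] = df[column].rolling(window).mean()
    return df

# In each experiment notebook, instead of copy-pasting the function:
# from common_features import add_rolling_mean
# experiment_data = add_rolling_mean(load_data(), 'total_sales')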

Package management

There is no package management in notebooks. The notebook uses the packages installed in the environment that it is running in. One needs to manually keep track of the packages used by that specific notebook, as different notebooks running in the same environment may need different packages. One suggestion is to always run a new notebook in a fresh virtual environment, keeping track of that specific notebook’s requirements separately. Alternatively all the notebooks in an environment would rely on a single requirements file.

So what can we do?

Great, so now we know why notebooks in production are a bad idea and why we need to stop dressing up experimentation tools as productionisation tools. Where does that leave us, though? That depends on the skills and structure of your team. Your team most likely consists of either:

  • Data scientists with engineering skills
  • Or, data scientists focused on experimentation and ML/data engineers bringing models to production

So let’s take a look at the two scenarios below.

A team of data scientists with engineering skills

In this scenario, your data science team is in charge of the models end to end. That is, in charge of experimentation as well as productionization. These are some things to keep in mind:

Separation of engineering and data science skills

Some larger organisations prefer more specialised skillsets, where data scientists work on the experimental side and ML/data engineers bring the models to production. The points listed in the above scenario are still relevant, but I have one additional suggestion specific to this scenario:

Please, please, please don’t throw the model over the fence! Sit together, communicate, and pair/mob program the pipeline into production. The model doesn’t work unless it provides value to the end users.

Conclusion

Like with any tool, there are places to use notebooks and places to avoid using them. Let’s do one last recap of these.

Good:

  • Data analysis
  • Experimentation
  • One time tasks
  • Teaching/technical presentations
  • Code assessments

Bad:

  • Continuous integration
  • Testing
  • Version control
  • Collaboration
  • State
  • Engineering standards
  • Package management

In conclusion, there are two messages that I would like you to take from this article:

  • To the ML practitioner: notebooks are for experimentation, not for productionisation. Stick to software engineering principles and frameworks for bringing things to production; they have been designed from past learnings, which we should be leveraging.
  • To the people creating tooling: we appreciate you working on making things easier for everyone ♥, but please move away from the notebook anti-pattern. Focus on creating easier-to-use tools that encourage positive software engineering patterns. We want:
    • Testability
    • Versioning
    • Collaboration
    • Reproducibility
    • Scalability

Testing your machine learning (ML) pipelines

When it comes to data products, a lot of the time there is a misconception that these cannot be put through automated testing. Although some parts of the pipeline can not go through traditional testing methodologies due to their experimental and stochastic nature, most of the pipeline can. In addition to this, the more unpredictable algorithms can be put through specialised validation processes.

Let’s take a look at traditional testing methodologies and how we can apply these to our data/ML pipelines.

Testing Pyramid

Your standard simplified testing pyramid looks like this:

Testing pyramid

This pyramid is a representation of the types of tests that you would write for an application. We start with a lot of Unit Tests, which test a single piece of functionality in isolation of others. Then we write Integration Tests which check whether bringing our isolated components together works as expected. Lastly we write UI or acceptance tests, which check that the application works as expected from the user’s perspective.

When it comes to data products, the pyramid is not so different. We have more or less the same levels.


Note that the UI tests would still take place for the product, but this blog post focuses on tests most relevant to the data pipeline.

Let’s take a closer look at what each of these means in the context of machine learning, and with the help of some sci-fi authors.

Unit tests

“It’s a system for testing your thoughts against the universe, and seeing whether they match.” – Isaac Asimov

Most of the code in a data pipeline consists of the data cleaning process. Each of the functions used for data cleaning has a clear goal. Let’s say, for example, that one of the features that we have chosen for our model is the change of a value between the previous and current day. Our code might look somewhat like this:

def add_difference(asimov_dataset):
    asimov_dataset['total_naughty_robots_previous_day'] = (
        asimov_dataset['total_naughty_robots'].shift(1))

    asimov_dataset['change_in_naughty_robots'] = abs(
        asimov_dataset['total_naughty_robots_previous_day'] -
        asimov_dataset['total_naughty_robots'])

    return asimov_dataset[['total_naughty_robots', 'change_in_naughty_robots',
                           'robot_takeover_type']]

Here we know that for a given input we expect a certain output, therefore, we can test this with the following code:

import pandas as pd
from pandas.testing import assert_frame_equal
import numpy as np
from unittest import TestCase

def test_change():
    asimov_dataset_input = pd.DataFrame({
        'total_naughty_robots': [1, 4, 5, 3],
        'robot_takeover_type': ['A', 'B', np.nan, 'A']
    })

    expected = pd.DataFrame({
        'total_naughty_robots': [1, 4, 5, 3],
        'change_in_naughty_robots': [np.nan, 3, 1, 2],
        'robot_takeover_type': ['A', 'B', np.nan, 'A']
    })

    result = add_difference(asimov_dataset_input)

    assert_frame_equal(expected, result)

For each piece of independent functionality, you would write a unit test, making sure that each part of the data transformation process has the expected effect on the data. For each piece of functionality you should also consider different scenarios (is there an if statement? then all conditionals should be tested). These tests would then be run as part of your continuous integration (CI) pipeline on every commit.

In addition to checking that the code does what is intended, unit tests also give us a hand when debugging a problem. By adding a test that reproduces a newly discovered bug, we can ensure that the bug is fixed when we think it is fixed, and we can ensure that the bug does not happen again.

Lastly, these tests not only check that the code does what is intended, but also help us document the expectations that we had when creating the functionality.

Integration tests

Because “The unclouded eye was better, no matter what it saw.” – Frank Herbert

These tests aim to determine whether modules that have been developed separately work as expected when brought together. In terms of a data pipeline, these can check that:

  • The data cleaning process results in a dataset appropriate for the model
  • The model training can handle the data provided to it and outputs results (ensuring that the code can be refactored in the future)

So if we take the unit tested function above and we add the following two functions:

def remove_nan_size(asimov_dataset):
    return asimov_dataset.dropna(subset=['robot_takeover_type'])

def clean_data(asimov_dataset):
    asimov_dataset_with_difference = add_difference(asimov_dataset)
    asimov_dataset_without_na = remove_nan_size(asimov_dataset_with_difference)

    return asimov_dataset_without_na

Then we can test that combining the functions inside clean_data will yield the expected result with the following code:

def test_cleanup():
    asimov_dataset_input = pd.DataFrame({
        'total_naughty_robots': [1, 4, 5, 3],
        'robot_takeover_type': ['A', 'B', np.nan, 'A']
    })

    expected = pd.DataFrame({
        'total_naughty_robots': [1, 4, 3],
        'change_in_naughty_robots': [np.nan, 3, 2],
        'robot_takeover_type': ['A', 'B', 'A']
    }).reset_index(drop=True)

    result = clean_data(asimov_dataset_input).reset_index(drop=True)

    assert_frame_equal(expected, result)

Now let’s say that the next thing we do is feed the above data to a logistic regression model.

from sklearn.linear_model import LogisticRegression

def get_regression_training_score(asimov_dataset, seed=9787):
    clean_set = clean_data(asimov_dataset).dropna()

    input_features = clean_set[['total_naughty_robots', 
        'change_in_naughty_robots']]
    labels = clean_set['robot_takeover_type']

    model = LogisticRegression(random_state=seed).fit(input_features, labels)
    return model.score(input_features, labels) * 100

Although we don’t know the expected value up front, we can ensure that we always get the same result. It is useful for us to test this integration to ensure that:

  • The data is consumable by the model (a label exists for every input, the types of the data are accepted by the type of model chosen, etc)
  • We are able to refactor our code in the future, without breaking the end to end functionality.

We can ensure that the results are always the same by providing the same seed for the random generator. All major libraries allow you to set the seed (Tensorflow is a bit special, as it requires you to set the seed via numpy, so keep this in mind). The test could look as follows:

from numpy.testing import assert_equal

def test_regression_score():
    asimov_dataset_input = pd.DataFrame({
        'total_naughty_robots': [1, 4, 5, 3, 6, 5],
        'robot_takeover_type': ['A', 'B', np.nan, 'A', 'D', 'D']
    })

    result = get_regression_training_score(asimov_dataset_input, seed=1234)
    expected = 40.0

    assert_equal(result, expected)

There won’t be as many of these kinds of tests as unit tests, but they would still be part of your CI pipeline. You would use these to check the end to end functionality for a component and would therefore test more major scenarios.

ML Validation

Why? “To exhibit the perfect uselessness of knowing the answer to the wrong question.” Ursula K. Le Guin.

Now that we have tested our code, we also need to test that the ML component is solving the problem that we are trying to solve. When we talk about product development, the raw results of an ML model (however accurate based on statistical methods) are almost never the desired end outputs. These results are usually combined with other business rules before being consumed by a user or another application. For this reason, we need to validate that the model solves the user problem, and not only that the accuracy/F1-score/other statistical measure is high enough.

How does this help us?

  • It ensures that the model actually helps the product solve the problem at hand
    • For example, a model that classifies a snake bite as deadly or not with 80% accuracy is not a good model if the 20% that is incorrect leads to patients not getting the treatment that they need.
  • It ensures that the values produced by the model make sense in terms of the industry
    • For example, a model that predicts changes in price with 70% accuracy is not a good model, if the end price displayed to the user has a value that’s too low/high to make sense in that industry/market.
  • It provides an extra layer of documentation of the decisions made, helping engineers joining the team later in the process.
  • It provides visibility of the ML components of the product in a common language understood by clients, product managers and engineers in the same way.

This kind of validation should be run periodically (either through the CI pipeline or a cron job), and its results should be made visible to the organisation. This ensures that progress in the data science components is visible to the organisation, and that issues caused by changes or stale data are caught early.
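
As an illustration, a use-case validation for the price example above might look like the sketch below; the helper name, the bounds and the validation set are all hypothetical, and it would be run periodically rather than on every commit:

def validate_price_predictions(model, validation_inputs, min_price, max_price):
    """Check that predictions make sense for the market, beyond raw accuracy."""
    predictions = model.predict(validation_inputs)
    out_of_range = [p for p in predictions if p < min_price or p > max_price]

    assert not out_of_range, (
        f'{len(out_of_range)} predictions fall outside the plausible '
        f'price range [{min_price}, {max_price}]')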

Conclusion

After all, “Magic’s just science that we don’t understand yet.” – Arthur C. Clarke

ML components can be tested in various ways, bringing us the following advantages:

  • Resulting in a data driven approach to ensure that the code does what is expected
  • Ensuring that we can refactor and cleanup code without breaking the functionality of the product
  • Documenting functionality, decisions and previous bugs
  • Providing visibility of the progress and state of the ML components of a product

So don’t be afraid, if you have the skillset to write the code, you have the skillset to write the test and gain all of the above advantages 🙂

So long and thanks for all the testing!

 

Automated model serving to mobile devices

The most common approach to deploying machine learning models is to expose an API endpoint. This endpoint is generally called via a POST request containing the input data for the model in the body, and responds with the output of the model. However, an API endpoint is not always the most appropriate solution for your use case.

There are, for example, use cases that may require a machine learning model to be deployed on a mobile device, such as:

  • The need to use the model offline or in low connectivity areas.
  • The need to minimize the amount of data being transferred – perhaps the user lives in a place where data is neither cheap nor easily accessible, or the model requires large amounts of data as input.
  • The need to not be limited by network speed, requiring more immediate results

Many tools exist that allow for data cleaning, experimentation and deployment of models all in one. These tools, however, seem to only provide API endpoints or packaged model files.

This article aims to give an overview of how versioned models can be automatically deployed to mobile devices.

The article is based on the work of an awesome cross-functional team ❤

General architecture

Let’s start with a high level view of the model deployment process. The data pipeline in the image below probably looks familiar to anyone that has deployed a model to production.


We have:

  • Data cleaning: This is the step where incorrect data is filtered out from the input and data is set up to be consumable by the model
  • Feature selection: You might have such a step if the features that you are feeding into your model require some additional computation. You might also have merged the data cleaning and feature selection steps of the pipeline together.
  • Model training: This is where the data is consumed in order to train a model with some pre-selected parameters (either through automation or through thorough experimentation)
  • Model deployment: This is the first step that I will dive a bit more deeply into below.

Model deployment

Model deployment for mobile devices can consist of the following steps:

  • Generate a configuration file for model details
  • Save the model
  • Freeze the model
  • Serve the model

Each of these is described in more detail below.

Generate configuration file

We need to keep track of how we feed our data to the model in order to train it. This will:

  • Make it easier to tell the mobile device what features to extract and use for a newly downloaded model.
  • Contain information to keep track of what model is deployed on which device
  • Contain information on what the numeric output of the model means – such as corresponding labels
  • Contain information on how to access the relevant parts of the model – such as the names of the input and output layers of a neural network

The data representation recommended for this is JSON, as it is flexible, standard and readable. An example of such a configuration can be seen below:


{
    "model_name": "example_1",
    "version": 1.2,
    "features": [
        {
            "name": "measure",
            "grouping": "max",
            ...
        }
    ],
    "output_labels": ["cat", "dog"],
    "input_name": "input:0",
    "output_name": "output:0"
}

Save the model

Once a model is trained and saved, we have a representation of this model in a set of files. If you are using Tensorflow you would have the following files:

  • A meta file: Contains the graph information; in other words, it describes the architecture of the model
  • An index file: Contains metadata for each tensor that is part of the graph
  • A data file: Contains the values of the variables, such as the values of the weights

A Tensorflow model can be re-loaded using these 3 files and training can continue from where it stopped.

The image below shows the resulting files to be saved for a trained model:


Freeze the model

The saved files are not really deployable to a mobile device. The graph of a model needs to be frozen in order to be deployed. This means that the graph definition and the values for the variables need to be merged into a single binary (.pb) file consumable by the mobile device.

For more details on freezing a model, check out the tensorflow documentation here: https://www.tensorflow.org/lite/tfmobile/prepare_models

The merged file results in a model representation that cannot be re-loaded to train further, but can be used through tensorflow-lite or CoreML.
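
As a rough sketch of this step – assuming a TensorFlow 1.x style checkpoint saved under the prefix model, and a single output node named output as in the configuration example above – freezing could look like this:

import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

OUTPUT_NODES = ['output']  # assumed output node name, taken from your configuration

with tf.Session() as sess:
    # Re-load the saved model from the .meta/.index/.data files
    saver = tf.train.import_meta_graph('model.meta')
    saver.restore(sess, 'model')

    # Merge the graph definition and the variable values into a single graph
    frozen_graph = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, OUTPUT_NODES)

    # Write the single binary .pb file consumable by the mobile device
    with tf.gfile.GFile('model.pb', 'wb') as f:
        f.write(frozen_graph.SerializeToString())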

Note: In order to use the model in CoreML, the PB file needs an extra conversion step. For a blog post on this, check out https://medium.com/@jianshi_94445/convert-a-tensorflow-model-to-coreml-model-using-tfcoreml-8ce157f1bc3b

Serve the model

Once the mobile application starts, the following steps are taken to use the model:

  1. Get latest configuration for the OS – the service provides this in JSON format
  2. If the latest configuration has a higher version than the local backup configuration
    1. Download the latest model – the service provides this in a zip file
    2. Replace the local model and configuration with the latest downloaded
  3. If the latest configuration does not have a higher version than the local one OR there is no internet connection
    1. Use the locally stored model
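
On a real device this logic would live in Swift or Kotlin; the sketch below expresses the same decision flow in Python, with hypothetical endpoints, file names and the requests library standing in for the platform’s networking APIs:

import json
import requests  # stand-in for the platform's networking APIs

CONFIG_URL = 'https://example.com/models/latest-config'   # hypothetical endpoint
MODEL_URL = 'https://example.com/models/latest-model.zip'  # hypothetical endpoint

def load_local_config(path='local_config.json'):
    try:
        with open(path) as f:
            return json.load(f)
    except (IOError, ValueError):
        return {'version': 0}  # no usable local configuration yet

def maybe_update_model():
    local = load_local_config()
    try:
        latest = requests.get(CONFIG_URL, timeout=5).json()
    except requests.RequestException:
        return local  # no internet connection: use the locally stored model

    if latest['version'] > local['version']:
        with open('model.zip', 'wb') as f:
            f.write(requests.get(MODEL_URL, timeout=30).content)  # replace local model
        with open('local_config.json', 'w') as f:
            json.dump(latest, f)                                  # replace local config
        return latest

    return local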

Once the model has been downloaded the application can use the configuration JSON to determine:

  • What features to extract
  • The name of the input layer
  • The name of the output layer
  • The meaning of the output

The image below shows an overview of the serving process end to end:


Additional considerations

The above implementation covers the ‘happy path’ and the case of no internet connection. However, there are other things that could go wrong which need to be considered in order to fall back to an old model. This can be done by validating whether:

  • Model files are corrupted through network transfer
  • Configurations don’t fit the model deployed (for example 3 outputs, but only 2 described)
  • Features described are not understood by the mobile device
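
For the first of these points, a common approach (an assumption on my side, not something specific to this setup) is to ship a checksum with the configuration – say an extra sha256 field – and verify the downloaded file against it before replacing the old model:

import hashlib

def file_matches_checksum(path, expected_sha256):
    """Verify that a downloaded model file was not corrupted in transfer."""
    sha = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            sha.update(chunk)
    return sha.hexdigest() == expected_sha256

# Hypothetical usage: fall back to the old model if the download does not match.
# if not file_matches_checksum('model.zip', latest_config['sha256']):
#     keep_previous_model()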

Conclusion

Through JSON configuration and model versioning, models can dynamically be deployed to mobile devices to be run locally. That said, there is some additional validation which needs to be implemented to fail gracefully when there are problems in the deployed model.

Evolving Neural Networks

Traditionally, neural networks are trained by adjusting weights based on a measure of error being passed back through the network. This error is calculated by comparing the result of the input fed through the network against the expected value. The person creating the neural network would spend some time fiddling with the network’s parameters until the network can learn from the given data by adjusting its weights using this error.

This article is a high level introduction to how evolutionary algorithms can be used to ease this process.

Before I begin, let’s take a look at some existing techniques for parameter optimization.

Automatic parameter determination

There are already some existing methods to automatically derive the parameters for ML algorithms. You can read about some of the existing tools here:

  • Auto-Keras: A library for automated machine learning – does an automatic search for parameters and architecture.
  • Bayesian optimization: a statistical optimization technique for maximizing the performance of an algorithm through smart estimation of its parameters.
  • DataRobot: A tool that focuses on providing an end to end ML experience (from data preparation to ML model deployment). It also allows for automated ML and comparison of models.
  • Dataiku: Another tool focusing on providing an end to end ML experience with the option to do automated ML.

Evolutionary algorithms

Evolutionary algorithms are algorithms that model the process of evolution. This is done by having a population of individuals, each consisting of sets of genes. Each gene represents an attribute/feature of the randomly generated data that you are trying to evolve into something meaningful.

Unlike ML algorithms, evolutionary algorithms start with no data. Instead, we have a measure of what we are trying to achieve (for this article, for example, we want to maximize the accuracy of our neural network, based on the parameters that we use to train it). We then change the individuals to try to fit this requirement as best as possible.

The genetic algorithm is the most popular and common of the evolutionary algorithms. It evolves individuals through the following steps:

  1. Randomly initialize individuals
  2. For a limited number of generations 
    1. Perform mutation – replacing a gene randomly or through some more sophisticated method.
    2. Perform crossover – merging individuals together, resulting in new individuals with various genes from each parent.
    3. Calculate the fitness of each individual – This refers to the function representing your problem. This function is applied to each individual in order to determine how good each individual is.
    4. Selection – choosing which individuals survive to the next generation based on the fitness calculated above. This forms the population for the next generation in which these steps are repeated.
  3. Select the individual with the best result, that is, the result with the highest/lowest fitness.
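
In code, the steps above can be captured by a short, generic skeleton – a sketch only, where the random_individual, mutate, crossover and fitness functions are problem specific and supplied by you:

import random

def evolve(fitness, random_individual, mutate, crossover,
           population_size=20, generations=10):
    # 1. Randomly initialise the individuals
    population = [random_individual() for _ in range(population_size)]

    for _ in range(generations):
        # 2.1 Mutation and 2.2 crossover produce new candidate individuals
        offspring = [mutate(individual) for individual in population]
        offspring += [crossover(*random.sample(population, 2))
                      for _ in range(population_size)]

        # 2.3 Calculate fitness and 2.4 select the survivors for the next generation
        population = sorted(population + offspring, key=fitness,
                            reverse=True)[:population_size]

    # 3. Select the individual with the best result
    return max(population, key=fitness)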

For a deeper dive into genetic algorithms check this article out.

 

Evolving neural networks

There are various approaches where evolutionary algorithms can be used for neural networks. These approaches aim to ease the process of designing a neural network by automating a set of steps. This section gives a high-level overview of each.

 

Evolving neural network parameters

This refers to determining the training parameters of a neural network, such as the learning rate, activation function, etc.

Evolving neural network parameters using a genetic algorithm follows the same steps as above, where:

  • Each gene of an individual is a parameter
  • Each individual is a combination of parameters, like in the picture below
  • The fitness function consists of:
    • Training the neural network given the parameters represented by an individual
    • Calculating the accuracy/F1-score (or any other preferred measure of neural network performance) based on a test set as the result of the fitness function

The image below shows an example of how the above individual’s layer parameters are translated to a network:

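A rough sketch of such a fitness function, using scikit-learn’s MLPClassifier as a stand-in network; the individual’s keys (layers, activation, learning_rate, epochs) are illustrative:

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

def parameter_fitness(individual, X_train, y_train, X_test, y_test):
    """Train a network with the parameters encoded by the individual and score it."""
    model = MLPClassifier(
        hidden_layer_sizes=tuple(individual['layers']),   # e.g. (64, 32)
        activation=individual['activation'],              # e.g. 'relu'
        learning_rate_init=individual['learning_rate'],   # e.g. 0.001
        max_iter=individual['epochs'],                    # e.g. 200
    )
    model.fit(X_train, y_train)
    return f1_score(y_test, model.predict(X_test), average='weighted')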

Advantages

  • Automatically being able to go through many parameter combinations in a guided rather than brute force manner
  • Can add bounds to the parameters
  • Can evolve more sophisticated parameters, such as categorical ones (for example, the type of optimization), as the conversion from the individual to the neural network is managed by you.

Disadvantages

  • Can be slow as you are training total_individuals * total_generations neural networks
  • May need to decide on genetic algorithm parameters (though the defaults are usually ok for this purpose).

 

Evolving features for the neural network

Part of training a neural network is selecting the most appropriate data to feed into the network. Given a set of parameters that perform more or less OK with a large portion of the data, we can filter this data down to the more meaningful attributes through evolution. The process is the same as the genetic algorithm described above, where:

  • Each gene of the individual is an attribute
  • Each individual is a set of attributes which are inputted to the network. An example individual is shown below, where 1 represents a feature that should be fed to the network, and 0 one that should not.
  • The fitness function consists of:
    • Training the neural network given a set of predefined parameters and only feeding the selected features in while training
    • Calculating the accuracy/F1-score (or any other preferred measure of neural network performance) based on a test set as the result of the fitness function

What may work for some use cases is using the larger set of attributes to evolve the parameters of the neural network, and then use the resulting neural network parameters to evolve the features as described in this subsection.

The image below shows an example of how the above individual would translate to a network’s inputs:

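As a rough sketch, the fitness function for this approach could apply the binary genes as a mask over the candidate features before training; the model choice and pandas-style column selection are illustrative:

from sklearn.linear_model import LogisticRegression

def feature_mask_fitness(genes, feature_names, X_train, y_train, X_test, y_test):
    """genes is a list of 0/1 flags, one per candidate feature."""
    selected = [name for name, keep in zip(feature_names, genes) if keep == 1]
    if not selected:
        return 0.0  # an individual that selects no features is useless
    model = LogisticRegression(max_iter=1000).fit(X_train[selected], y_train)
    return model.score(X_test[selected], y_test)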

Advantages

  • Automatically being able to select features in a guided rather than brute force manner, which can be advantageous when the usefulness of a feature is not easily determinable through analysis.

Disadvantages

  • Can be slow as you are training total_individuals * total_generations neural networks
  • May need to decide on genetic algorithm parameters (though the defaults are usually ok for this purpose).

 

Evolving the weights directly (Neuroevolution)

Instead of determining good parameters to train the neural network, you can also evolve the weights themselves. This means that instead of using backpropagation to pass the error back and adjust the weights for some number of epochs, you can:

  • Choose an architecture (layers, size of layers, activation function)
  • Evolve weights
  • Test the new neural network

Represented in the same way as the other evolution options, we can look at it as follows:

  • Each gene of the individual is a weight
  • Each individual represents a neural network with a predefined architecture
  • The fitness function consists of:
    • Replacing the weights of the predefined neural network with the gene values of the individual
    • Calculating the accuracy/F1-score (or any other preferred measure of neural network performance) based on a test set as the result of the fitness function

The image below shows an example of how the above individual may be translated to the network’s weights:

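A minimal sketch of that translation, using plain numpy for a small fixed architecture and omitting biases for brevity (the layer sizes are made up):

import numpy as np

def decode_weights(genes, n_inputs=4, n_hidden=8, n_outputs=3):
    """Slice a flat gene vector into the weight matrices of a fixed architecture."""
    split = n_inputs * n_hidden
    w1 = np.array(genes[:split]).reshape(n_inputs, n_hidden)
    w2 = np.array(genes[split:split + n_hidden * n_outputs]).reshape(n_hidden, n_outputs)
    return w1, w2

def weight_fitness(genes, X_test, y_test):
    w1, w2 = decode_weights(genes)
    hidden = np.tanh(X_test @ w1)           # forward pass with the evolved weights
    predictions = np.argmax(hidden @ w2, axis=1)
    return np.mean(predictions == y_test)   # accuracy on the test set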

Advantages

  • Automatically being able to get a well-fitting network for the given data
  • Less likely to get stuck in local minima, due to sampling various parts of the search space.

Disadvantages

  • Can be slow as you are training total_individuals * total_generations neural networks
  • May need to decide on genetic algorithm parameters (though the defaults are usually ok for this purpose).

For more details on Neuroevolution check out this article.

More

There is also research on evolving neural network architectures together with the weights using evolutionary programming. I will not go into details in this article, as the approach is slightly different from a straightforward genetic algorithm and involves additional rules. I also don’t have first-hand experience with this method, whereas I do with the other three.

Conclusions

There are four ways in which one can use evolutionary algorithms, such as the genetic algorithm, to design neural networks, namely:

  • Evolving the parameters for the neural network training
  • Evolving the features to be fed into the network
  • Evolving the weights of the network with a predefined architecture
  • Evolving the network architecture together with the weights

The main advantage is that evolutionary algorithms allow for a guided exploratory search over the neural network’s characteristics. The main disadvantage is that this involves training many neural networks, as each individual is a different network – which could be slow depending on the complexity of the problem.

Terraforming a Spark cluster on Amazon

This post is about setting up the infrastructure to run your Spark jobs on a cluster hosted on Amazon.

Before we start, here is some terminology that you will need to know:

  • Amazon EMR – The Amazon service that provides a managed Hadoop framework
  • Terraform – A tool for setting up infrastructure using code

At the end of this post you should have an EMR 5.9.0 cluster that is set up in the Frankfurt region with the following tools:

  • Hadoop 2.7.3
  • Spark 2.2.0
  • Zeppelin 0.7.2
  • Ganglia 3.7.2
  • Hive 2.3.0
  • Hue 4.0.1
  • Oozie 4.3.0

By default EMR Spark clusters come with Apache Yarn installed as the resource manager.

We will need to set up an S3 bucket, a network, some roles, a key pair and the cluster itself. Let’s get started.

VPC setup

A VPC (Virtual private cloud) is a virtual network to which the cluster can be assigned. All nodes in the cluster will become part of a subnet within this network.

To set up a VPC in Terraform, first create a VPC resource:

resource "aws_vpc" "main" {
    cidr_block = "172.19.0.0/16"
    enable_dns_hostnames = true
    enable_dns_support = true

    tags {
        Name = "VPC_name"
    }
}

Then we can create a public subnet. The availability zone is generally optional, but for this exercise you should set it, as some of the settings that we choose are only compatible with eu-central-1a (such as the types of instances that we use).

resource "aws_subnet" "public-subnet" {
  vpc_id            = "${aws_vpc.main.id}"
  cidr_block        = "172.19.0.0/21"
  availability_zone = "eu-central-1a"

  tags {
    Name = "example_public_subnet"
  }
}

We then create a gateway for the public subnet.

resource "aws_internet_gateway" "gateway" {
  vpc_id = "${aws_vpc.main.id}"

  tags {
    Name = "gateway_name"
  }
}

A routing table is then needed to allow traffic to go through the gateway.

resource "aws_route_table" "public-routing-table" {
  vpc_id = "${aws_vpc.main.id}"

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = "${aws_internet_gateway.gateway.id}"
  }

  tags {
    Name = "gateway_name"
  }
}

Lastly, the routing table must be assigned to the subnet to allow traffic in and out of it.

resource "aws_route_table_association" "public-route-association" {
  subnet_id      = "${aws_subnet.public-subnet.id}"
  route_table_id = "${aws_route_table.public-routing-table.id}"
}

 

Roles

Next we need to set up some roles for the EMR cluster.

First a service role is necessary. This role defines what the cluster is allowed to do within the EMR environment.

Note that EOF tags imply content with a structure. These need to have no trailing spaces, which leads to strange indentation.

resource "aws_iam_role" "spark_cluster_iam_emr_service_role" {
    name = "spark_cluster_emr_service_role"

    assume_role_policy = <<EOF
{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": "elasticmapreduce.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF
}

This service role needs a policy attached. In this example we will simply use the default EMR role.

resource "aws_iam_role_policy_attachment" "emr-service-policy-attach" {
   role = "${aws_iam_role.spark_cluster_iam_emr_service_role.id}"
   policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
}

 

Next we need a role for the EMR profile.

resource "aws_iam_role" "spark_cluster_iam_emr_profile_role" {
    name = "spark_cluster_emr_profile_role"
    assume_role_policy = <<EOF
{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Service": "ec2.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
EOF
}

This role is assigned the EC2 default role, which defines what the cluster is allowed to do in the EC2 environment.

resource "aws_iam_role_policy_attachment" "profile-policy-attach" {
   role = "${aws_iam_role.spark_cluster_iam_emr_profile_role.id}"
   policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role"
}

Lastly, we create the instance profile, which is used to pass the role’s details to the EC2 instances.

resource "aws_iam_instance_profile" "emr_profile" {
   name = "spark_cluster_emr_profile"
   role = "${aws_iam_role.spark_cluster_iam_emr_profile_role.name}"
}

Key setup

Next you will need SSH keys that will allow you to SSH into the master node.

To create the ssh key and .pem file run the following command:

ssh-keygen -t rsa

Enter a key name, such as cluster-key, and enter no password. Then create a .pem file from the private key.

ssh-keygen -f cluster-key -e -m pem

Lastly create a key pair in terraform, linking to the key that you have created

resource "aws_key_pair" "emr_key_pair" {
  key_name   = "emr-key"
  public_key = "${file("/.ssh/cluster-key.pub")}"
}

S3

Next we need an S3 bucket. You may need more than one depending on your project requirements. In this example we will simply create one for the cluster logs.

resource "aws_s3_bucket" "logging_bucket" {
  bucket = "emr-logging-bucket"
  region = "eu-central-1"

  versioning {
    enabled = true
  }
}

Security groups

Next we need a security group for the master node. This security group should allow the nodes to communicate with the master node, but also to be accessed via certain ports from your personal VPN.

You can find your public IP address by simply going to this site.

Let’s assume that your public address is 123.123.123.123 with subnet /16.

resource "aws_security_group" "master_security_group" {
  name        = "master_security_group"
  description = "Allow inbound traffic from VPN"
  vpc_id      = "${aws_vpc.main.id}"

  # Avoid circular dependencies stopping the destruction of the cluster
  revoke_rules_on_delete = true

  # Allow communication between nodes in the VPC
  ingress {
    from_port   = "0"
    to_port     = "0"
    protocol    = "-1"
    self        = true
  }

  ingress {
      from_port   = "8443"
      to_port     = "8443"
      protocol    = "TCP"
  }

  egress {
    from_port   = "0"
    to_port     = "0"
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Allow SSH traffic from VPN
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "TCP"
    cidr_blocks = ["123.123.0.0/16"]
  }

  #### Expose web interfaces to VPN

  # Yarn
  ingress {
    from_port   = 8088
    to_port     = 8088
    protocol    = "TCP"
    cidr_blocks = ["123.123.0.0/16"]
  }

  # Spark History
  ingress {
      from_port   = 18080
      to_port     = 18080
      protocol    = "TCP"
      cidr_blocks = ["123.123.0.0/16"]
    }

  # Zeppelin
  ingress {
      from_port   = 8890
      to_port     = 8890
      protocol    = "TCP"
      cidr_blocks = ["123.123.0.0/16"]
  }

  # Spark UI
  ingress {
      from_port   = 4040
      to_port     = 4040
      protocol    = "TCP"
      cidr_blocks = ["123.123.0.0/16"]
  }

  # Ganglia
  ingress {
      from_port   = 80
      to_port     = 80
      protocol    = "TCP"
      cidr_blocks = ["123.123.0.0/16"]
  }

  # Hue
  ingress {
      from_port   = 8888
      to_port     = 8888
      protocol    = "TCP"
      cidr_blocks = ["123.123.0.0/16"]
  }

  lifecycle {
    ignore_changes = ["ingress", "egress"]
  }

  tags {
    name = "emr_test"
  }
}

We also need a security group for the rest of the nodes. These nodes should only communicate internally.

resource "aws_security_group" "slave_security_group" {
  name        = "slave_security_group"
  description = "Allow all internal traffic"
  vpc_id      = "${aws_vpc.main.id}"
  revoke_rules_on_delete = true

  # Allow communication between nodes in the VPC
  ingress {
    from_port   = "0"
    to_port     = "0"
    protocol    = "-1"
    self        = true
  }

  ingress {
      from_port   = "8443"
      to_port     = "8443"
      protocol    = "TCP"
  }

  egress {
    from_port   = "0"
    to_port     = "0"
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Allow SSH traffic from VPN
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "TCP"
    cidr_blocks = ["123.123.0.0/16"]
  }

  lifecycle {
    ignore_changes = ["ingress", "egress"]
  }

  tags {
    name = "emr_test"
  }
}

Note that when you create the two security groups, circular dependencies are created between them. When destroying the terraformed infrastructure in such a case, you need to delete the associations of the security groups before deleting the groups themselves. The revoke_rules_on_delete option takes care of this automatically.

Cluster

Finally, now that we have all the components, we can set up the cluster.

First add the provider

provider "aws" {
    region = "eu-central-1"
}

Then we add the cluster itself

resource "aws_emr_cluster" "emr-spark-cluster" {
   name = "EMR-cluster-example"
   release_label = "emr-5.9.0"
   applications = ["Ganglia", "Spark", "Zeppelin", "Hive", "Hue"]

   ec2_attributes {
     instance_profile = "${aws_iam_instance_profile.emr_profile.arn}"
     key_name = "${aws_key_pair.emr_key_pair.key_name}"
     subnet_id = "${aws_subnet.public-subnet.id}"
     emr_managed_master_security_group = "${aws_security_group.master_security_group.id}"
     emr_managed_slave_security_group = "${aws_security_group.slave_security_group.id}"
   }

   master_instance_type = "m3.xlarge"
   core_instance_type = "m2.xlarge"
   core_instance_count = 2

   log_uri = "s3://${aws_s3_bucket.logging_bucket.bucket}"

   tags {
     name = "EMR-cluster"
     role = "EMR_DefaultRole"
   }

  service_role = "${aws_iam_role.spark_cluster_iam_emr_service_role.arn}"
}

EDIT: An earlier version of this post referenced an outdated URI attribute for the bucket and mistakenly set subnet_id inside ec2_attributes to the VPC ID. The snippet above already uses the corrected references: log_uri = "s3://${aws_s3_bucket.logging_bucket.bucket}" and subnet_id = "${aws_subnet.public-subnet.id}".

You can add task nodes as follows

resource "aws_emr_instance_group" "task_group" {
    cluster_id = "${aws_emr_cluster.emr-spark-cluster.id}"
    instance_count = 4
    instance_type = "m3.xlarge"
    name = "instance_group"
}

Saving the file

Save the file under your preferred name with the extension .tf

Creating the cluster

To run the Terraform script, ensure that Terraform is installed and that your AWS credentials are configured.

Run the following to make sure that your setup is valid:

terraform plan

If there are no errors, you can run the following to create the cluster:

terraform apply

Destroying the cluster

To take down all the terraformed infrastructure run the following:

terraform destroy

You can add the following to your file if you want the Terraform state file to be saved to an S3 bucket. This file allows Terraform to know the last state of terraforming your infrastructure (what has been created or destroyed). Note that the S3 backend also requires a key – the path of the state file within the bucket (for example, terraform/state).

terraform {
  backend "s3" {
    bucket = "terraform-bucket-name"
    key    = "terraform/state"
    region = "eu-central-1"
  }
}

Code: Word2Vec in Spark

Here is a snippet that might be useful to you if you are looking to implement Word2Vec and save the embeddings of the trained model. I’ve added types to the variables, as well as some placeholder names, to make it easier to understand what is expected as an input to the various functions.

First you will need to import Word2Vec and Word2VecModel

    import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}

Then you will need to import spark session’s implicits as you will be working with Datasets.

    import sparkSession.implicits._

Then you need to prepare your input data for the algorithm. The training algorithm expects a Dataset of Sequences of Strings. In the example below “getTrainset” is a function that retrieves your training corpus and formats it into the required type.

    val trainset: Dataset[Seq[String]] = getTrainset() 

Now you are ready to start using Word2Vec. You need to first configure the Word2Vec algorithm with the parameters that you have selected.

    val wordToVec: Word2Vec = new Word2Vec()
        .setInputCol("column-name-of-your-trainset-dataset")
        .setOutputCol("output-column-name")
        .setStepSize(learningRateFloat)
        .setVectorSize(vectorSizeInt)
        .setWindowSize(windowSizeInt)
        .setMaxIter(numberOfIterationsInt)
        .setMinCount(minimumCountForDiscardingInt)
        .setNumPartitions(wordToVecPartitions)

Next comes the training step.

    val model = wordToVec.fit(trainset)

Once your model has been trained, you will want to process it and save it in the format that you want. Because you are dealing with a Dataset, you need to map the results to something more type-safe. You can do this with a case class as follows:

    // Note: toLong assumes the “words” in this corpus are numeric IDs;
    // keep the word as a String if you are working with plain text tokens
    case class MyCustomType(word: String, vector: Vector) {
        def toPair = (word.toLong, vector.toArray)
    }

    val vectors = model.getVectors.as[MyCustomType].map(_.toPair)

Lastly, you want to save your model. Note that Word2VecModel does have a save function that stores it in a special format which can easily be reloaded into a Word2VecModel. In this example we will save the mapped vectors to parquet instead, as you may need a more raw version of your model.

    vectors
        .repartition(partitionsToSaveModelInto)
        .withColumnRenamed("_1", "word")
        .withColumnRenamed("_2", "vector")
        .select("word", "vector")
        .write.parquet(options.outputFile)

Spark Word2Vec: lessons learned

This post summarises some of the lessons learned while working with Spark’s Word2Vec implementation. You may also be interested in the previous post, “Problems encountered with Spark ml Word2Vec”.

Lesson 1: Spark’s Word2Vec getVectors returns the unique embeddings

As mentioned in part 2, the transform function aims to return the vectors for the words within the given sentences. If you want the actual trained model, and therefore the unique word-to-vector representations, you should use the getVectors function.
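
As a rough sketch, reusing the model and trainset values from the training snippet earlier in this document:

    // getVectors returns the trained model itself: one row per vocabulary
    // word, with columns "word" and "vector"
    val embeddings = model.getVectors

    // transform instead appends one vector per input row (for multi-word
    // rows this is the average of that row's word vectors), so repeated
    // words reappear in the output
    val rowVectors = model.transform(trainset)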

Lesson 2: More partitions == more speed == less quality

There is a balance that you need to strike between having a fast implementation and one with good quality. Having more Word2Vec partitions means that the data is separated into many smaller buckets, losing the context of the words in other buckets. The data is only brought together at the end of an iteration. For this reason, you don’t want to split your data into too many partitions. However, you also don’t want to lose out on parallelism – after all, you are using Spark because you want distributed computation. Play around with the total partitions – the right value for this parameter will differ depending on the problem. Also remember that fewer partitions means less parallelism and therefore a slower algorithm.

Lesson 3: More iterations == less speed == more quality

As mentioned in lesson 2, the data from various partitions is brought together at the end of each iteration. Having more iterations means more context from the different buckets and more time training. This means that more iterations can lead to better results, but they do have an impact on the running time of the algorithm.

Lesson 4: Machine learning algorithms need a lot of hardware

This probably doesn’t come as a surprise, but it is still worth mentioning. When you run a machine learning algorithm on a distributed cluster, there is one component you will keep having to give more memory: the driver.

Lesson 5: Save things to parquet

Why? Parquet’s efficient data compression, built for handling bulk data, leads to fewer memory issues.

Lesson 6: Spark ml Word2Vec is not mockable

If you are writing tests for your Spark jobs, which you should be doing, you will probably try to mock out Spark’s Word2Vec implementation, as it is nondeterministic. You will soon be greeted by an error message stating that Word2Vec cannot be mocked, and you will quickly find out that it is a final class in the ml library. To get around this you can wrap your call to Word2Vec in a function and inject that function into the code that you are testing, as sketched below.
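
A minimal sketch of that workaround, assuming a Dataset[Seq[String]] trainset (the names trainWord2Vec and buildEmbeddings, and the column names, are hypothetical):

    import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
    import org.apache.spark.sql.Dataset

    // Thin wrapper around the final Word2Vec class
    def trainWord2Vec(trainset: Dataset[Seq[String]]): Word2VecModel =
        new Word2Vec()
            .setInputCol("value")
            .setOutputCol("vector")
            .fit(trainset)

    // The logic under test takes the trainer as a parameter, so a test
    // can inject a stub that returns a small, pre-built model instead
    def buildEmbeddings(trainset: Dataset[Seq[String]],
                        train: Dataset[Seq[String]] => Word2VecModel = trainWord2Vec) =
        train(trainset).getVectors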

Problems encountered with Spark ml Word2Vec

This post aims to summarise some of the problems experienced when trying to use Spark’s ml Word2Vec implementation.

Out of memory exception

Spark’s Word2Vec implementation requires quite a bit of memory depending on the amount of data that you are dealing with. This is because the driver ends up having to do a lot of work. You may experience this problem with various machine learning implementations in Spark.

All you have to do is increase the total memory allocated to your driver using spark-submit’s --driver-memory option. Note that your cluster may have an upper limit set which you might need to increase. The error message that you get if you set the driver memory to a value above this threshold is very straightforward: it pretty much tells you to increase the limit by changing the value of the cluster’s yarn.scheduler.maximum-allocation-mb.

In my case, the driver was using 30 GB, so I gave it 40 GB.
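
As an illustration, at submit time this looks something like the following (40G is just the figure mentioned above, and the placeholders stand in for the rest of your submit command):

spark-submit --driver-memory 40G <other options> <your application jar>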

Total size of serialized results of X tasks (Y MB) is bigger than spark.driver.maxResultSize (Y MB)

The Word2Vec algorithm has to deal with result sizes larger than those of a typical data-cleaning job, so you can hit Spark’s default limit. You can raise it by increasing the value of spark.driver.maxResultSize.
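
For example, at submit time (the 4g value here is an arbitrary illustration, not a recommendation):

spark-submit --conf spark.driver.maxResultSize=4g <other options> <your application jar>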

Default column name not found

Spark’s ml Word2Vec implementation deals with Dataframes. This means that it relies on the string names of columns rather than on concrete types. You are getting this error because the Dataframe’s column name does not match the default name expected by the Word2Vec training function. There are 2 options to fix this:

  1. Change the name expected by Word2Vec to the name of your input Dataframe’s column, using Word2Vec’s setInputCol function (see the sketch below). If you have not explicitly set a column name, it is probably “value”.
  2. Change your input Dataframe’s column name to the one expected by Word2Vec, i.e. whatever has been set on its inputCol parameter.
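
For example, option 1 could be sketched as follows, assuming the usual Word2Vec import and a Dataset that kept the default “value” column name:

    val wordToVec = new Word2Vec()
        .setInputCol("value")   // the column holding your Seq[String] input
        .setOutputCol("vector")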

OutOfMemoryError: GC overhead limit exceeded

As the driver is doing a lot of work, the default garbage collector seems to struggle to keep up with the cleanup. To fix this you can switch to concurrent garbage collection by enabling it through the Java options: add -XX:+UseConcMarkSweepGC to the driver’s Java options in your spark-submit.
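
One way to pass this, assuming you are not already setting other driver Java options, is:

spark-submit --conf "spark.driver.extraJavaOptions=-XX:+UseConcMarkSweepGC" <other options> <your application jar>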

Cannot resolve ‘`X`’ given input columns: [value, w2v_993c88fe4732__output]

As you are dealing with Dataframes when managing the results of Word2Vec, you are probably trying to map them to a custom datatype after retrieval. You get an error like this if your custom type’s constructor expects the wrong parameters. As you may be retrieving the vectors in two different ways, let’s look at the expectations of each one:

  • Using the dataframe returned by transform: this expects a type that takes in two parameters -> value: Array[String], vector: Vector
  • Using the dataframe returned by getVectors: this expects a type that takes in two parameters -> word: String, vector: Vector

Ensure that when you use <dataframe>.as[<customType>] the custom type expects the above-mentioned parameter types.
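
For instance, a type matching the getVectors dataframe could be sketched as follows (assuming sparkSession.implicits._ and org.apache.spark.ml.linalg.Vector are in scope):

    // Field names and types match the columns returned by getVectors
    case class WordVector(word: String, vector: Vector)

    val typed = model.getVectors.as[WordVector]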

Duplicates in output from Word2Vec

When saving your model you may notice that you are getting duplicated words with different vectors in your word-vector representation. One word should have one vector representation. This may be especially confusing if you are moving from Google’s implementation to Spark’s. This happens because you are using the transform function. This function takes in the sentences that you trained the model with and returns a word-vector representation for each word in the given set. This means that repeated words across different sentences will also appear in your result, with the vector representations most appropriate for their context at that point. If what you want is the single vector representation of each word, you should get the correct embeddings by using the getVectors function.

Failed to register classes with Kryo

This is not specific to Word2Vec, but it did happen during the implementation. It generally means that your manual Kryo serialization registration, which is done for optimization reasons, is missing a type. Find out which type you are missing and register it using kryo.register(classOf[<myClass>]).
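
If you register your classes through SparkConf rather than a custom registrator, the fix could be sketched as follows (DenseVector is only an illustration; register whichever class the error names):

    import org.apache.spark.SparkConf
    import org.apache.spark.ml.linalg.DenseVector

    val conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        // add the missing type alongside the ones you already register
        .registerKryoClasses(Array(classOf[DenseVector]))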

Memory issues when saving the results of getVectors

Once you are almost done and all you need to do is save your trained Word2Vec embeddings for future use, you might be greeted by some memory issues. If you are, you are probably trying either to save the whole model into a single file or to save it into partitioned plain-text files on HDFS. You have a couple of options here.

Word2VecModel has a function save which allows you to save the model in a format that can be re-loaded into a Word2VecModel using the load function. This wasn’t quite what I needed in my case, but it may be appropriate for your use case.
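
If that reloadable format fits your use case, it is a one-liner in each direction (“path-to-save-model” is a placeholder, and Word2VecModel is assumed to be imported):

    // Save in the model's own reloadable format...
    model.save("path-to-save-model")

    // ...and load it back in a later job
    val reloaded = Word2VecModel.load("path-to-save-model")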

I needed to save the embeddings as plain word/vector records that another Spark job could consume as input to a second machine learning algorithm. For this reason, I went for my second file-saving option: saving to parquet. This can be done with the following code snippet:

// “model” here is a Dataset of (word, vector) pairs, e.g. the mapped
// output of getVectors, rather than the Word2VecModel itself
model
    .repartition(partitions)
    .withColumnRenamed("_1", "word")
    .withColumnRenamed("_2", "vector")
    .select("word", "vector")
    .write.parquet("some output path")

An overview of Word2Vec

Word2Vec (W2V) is an algorithm that takes in a text corpus and outputs a vector representation for each word, as depicted in the image below:

blog - w2v.png

There are two algorithms that can generate word-to-vector representations, namely the Continuous Bag-of-Words and the Continuous Skip-gram models.

In the Continuous Bag-of-Words model, the task of the neural network is to predict a word given its context. In the Skip-gram model the task of the neural network is the opposite: to predict the context given the word.

This post will focus on the Skip-gram model. For more information on the continuous bag of words check out this article.

The literature describes the algorithm in two main ways: the first as a neural network and the second as a matrix factorisation problem. This post will follow the neural network based description.

If you are already familiar with neural networks, you can think of Word2Vec as a neural network where the input is a word and the output is the probability of that word forming part of a particular context. The resulting vector for each word is then given by the weights leading from that word’s input node towards the hidden layer.

The Skip-gram model takes in a corpus of text and creates a hot vector for each word. A hot vector is a vector representation of a word where the vector is the size of the vocabulary (the total number of unique words). All dimensions are set to 0 except the dimension representing the word used as input at that point in time. Here is an example of a hot vector:

hot vecotr.png

The above input is given to a neural network with a single hidden layer which looks as follows:

neural net (1).jpg

Each dimension of the input passes through each node of the hidden layer, where it is multiplied by the weight leading to that hidden node. Because the input is a hot vector, only one of the input nodes will have a non-zero value (namely the value of 1), which means that only the weights connected to that input node will be activated in the hidden layer. An example of the weights that will be taken into account is depicted below for the second word in the vocabulary:

neural net (2).jpg

The vector representation of the second word in the vocabulary (shown in the neural network above) will look as follows, once activated in the hidden layer:

vector rep.jpg

Those weights start off as random values. The network is then trained in order to adjust the weights to represent the input words. This is where the output layer becomes important. Now that we are in the hidden layer with a vector representation of the word we need a way to determine how well we have predicted that a word will fit in a particular context. The context of the word is a set of words within a window around it, as shown below:

Untitled drawing (3).jpg

The above image shows that the context for Friday includes words like “cat” and “is”. The aim of the neural network is to predict that “Friday” falls within this context.

We activate the output layer by multiplying the vector that we passed through the hidden layer (which was the input hot vector * weights entering hidden node) with a vector representation of the context word (which is the hot vector for the context word * weights entering the output node).  The state of the output layer for the first context word can be visualised below:

Output.png

The above multiplication is done for each word and context-word pair. We then calculate the probability that a word belongs with a set of context words using the values resulting from the hidden and output layers. Lastly, we apply stochastic gradient descent to change the values of the weights in order to get a more desirable value for the calculated probability.
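
In the standard Skip-gram formulation this probability is a softmax over the whole vocabulary, using exactly the two sets of weights described above:

P(c | w) = exp(u_c · v_w) / Σ_i exp(u_i · v_w)

where v_w is the hidden-layer (input) vector of the word, u_c is the output-layer vector of the context word, and the sum runs over every word i in the vocabulary.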

In gradient descent we need to calculate the gradient of the function being optimised at the point representing the weight that we are changing. The gradient is then used to choose the direction in which to make a step to move towards the local optimum, as shown in the minimisation example below.

gradient desc (1).png

The weight is changed by making a step in the direction of the optimal point (in the above example, the lowest point in the graph). The new value is calculated by subtracting from the current weight value the derivative of the error function at that weight, scaled by the learning rate.
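
Written out, with η as the learning rate and E as the error being minimised, the update is:

w_new = w_old − η · ∂E/∂w_old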

The next step uses backpropagation to adjust the weights across multiple layers. The error computed at the output layer is passed back from the output layer to the hidden layer by applying the chain rule, and gradient descent is used to update the weights between these two layers. The error is then adjusted at each layer and sent back further. Here is a diagram representing backpropagation:

propagation (1).jpg

I’m not going to go into the details of backpropagation or gradient descent in this post. There are many great resources out there explaining the two. If you are interested in the details, Stanford University tends to have great freely available lectures and resources on machine learning topics.

The final vector representation of a word is made up of the weights (after training) that connect that word’s input node to the hidden layer. The weights connecting the hidden and output layers are representations of the context words. However, each context word is also in the vocabulary and therefore has an input representation of its own. For this reason, the vector representation of a word is taken only from the input weights.