Machine Learning in the Cloud

In most machine learning projects, there is a common workflow that, at a minimum, consists of data preparation, model training, and model deployment. Still in its infancy, the Data Science community is testing various methodologies to streamline this process with varying degrees of success. This is the market that companies like Microsoft and Amazon are pursuing with their recent forays into machine learning platforms.

Products like Amazon SageMaker, Azure Machine Learning Studio, and Azure Machine Learning Service were designed to streamline the machine learning process by automating many of the most common Data Science tasks. One of the most difficult tasks for the average data scientist is productionizing models so that they generate real-time, actionable inferences that solve a meaningful business problem. Integrating models into production systems is no small feat, but all three of these products seek to make this process as simple as possible. This means that data scientists spend less time building and deploying models and more time figuring out how machine learning can be used to solve business problems.

Amazon SageMaker

In 2017, Amazon released its machine learning platform, Amazon SageMaker. SageMaker is a machine learning service that allows developers to “easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment.” It provides integration with Jupyter notebooks for data preparation, interfaces for training models with pre-built algorithms, and the ability to set up endpoints that serve inferences from these models.

Data Preparation

“An Amazon SageMaker notebook instance is a fully managed ML compute instance running the Jupyter Notebook App.” This means that common data preprocessing tasks such as normalization, missing value imputation, and scaling or transforming can be completed in a Jupyter Notebook – a tool that most data scientists have already incorporated into their daily workflow. These Jupyter Notebooks allow data scientists to include code and text in the same document, so the user can iteratively explore the data while documenting the whole process. This documentation provides an easy way to revisit past analysis.
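The preprocessing steps named above can be sketched in a few lines of the kind one might run in a notebook cell. This is a minimal illustration using only the Python standard library, not SageMaker-specific code; the `ages` column and the function names are made up for the example.

```python
from statistics import mean, pstdev

def impute_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Center to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

ages = [20, None, 40, 60]        # hypothetical column with one missing value
filled = impute_missing(ages)    # the gap is filled with the mean, 40
scaled = min_max_scale(filled)   # values now lie between 0 and 1
```

In practice a data scientist would more likely reach for a library such as pandas or scikit-learn for these steps; the plain-Python version is only meant to show what each transformation does.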

More recently, Amazon introduced a new product offering called SageMaker Ground Truth. Ground Truth helps label training data in two ways. The first is to outsource the manual labeling task to human labelers via Mechanical Turk. The second is automated: once there is sufficient labeled data, a machine learning model is trained to assign labels to the remaining training data.

Train a Model

SageMaker provides the option to perform model training within the AWS console or the ability to call their high-level library directly from your Jupyter Notebook. The AWS console provides a visual interface that is designed for business professionals with less experience coding, while the high-level libraries allow for an environment more conducive to developers and data scientists who are more familiar working with code.

Amazon SageMaker can also perform hyperparameter tuning. Hyperparameters are settings that the developer specifies before training begins and that control the training process itself, as opposed to the parameters the model learns from the data. Hyperparameter tuning runs the training process with different hyperparameter values to find the set that produces the best results. Once the best set of hyperparameters is found, the developer can use it in the model training phase to learn the model parameters associated with it.
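The distinction between hyperparameters and learned parameters can be made concrete with a small sketch. The example below uses an exhaustive grid search, which is a simpler stand-in and not necessarily the search strategy SageMaker uses; the toy model, data, and grid values are all assumptions chosen for illustration.

```python
from itertools import product

def train(xs, ys, learning_rate, n_steps):
    """Fit y ~ w * x by gradient descent; w is the learned parameter."""
    w = 0.0
    n = len(xs)
    for _ in range(n_steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grid_search(train_set, valid_set, grid):
    """Train once per hyperparameter combination; keep the best validation score."""
    best = None
    for lr, steps in product(grid["learning_rate"], grid["n_steps"]):
        w = train(*train_set, learning_rate=lr, n_steps=steps)
        score = mse(w, *valid_set)
        if best is None or score < best["score"]:
            best = {"learning_rate": lr, "n_steps": steps, "w": w, "score": score}
    return best

# Toy data generated from y = 2x, split into training and validation sets.
train_set = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
valid_set = ([4.0, 5.0], [8.0, 10.0])
grid = {"learning_rate": [0.001, 0.01, 0.1], "n_steps": [10, 100]}

best = grid_search(train_set, valid_set, grid)
print(best["learning_rate"], best["n_steps"], round(best["w"], 4))
```

Here `learning_rate` and `n_steps` are the hyperparameters, fixed before each training run, while `w` is the parameter that each run learns from the data.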

SageMaker comes with several out-of-the-box algorithms. Below are some of the most common:

Linear Learner Algorithm – Supervised machine learning method that can be used for classification or regression
XGBoost Algorithm – Gradient boosted tree
Factorization Machine Algorithm – Similar to Support Vector Machine but models all interactions between variables using factorized parameters
K-Means Algorithm – Unsupervised algorithm that groups observations into K groups based on the features
Principal Component Analysis – Unsupervised method to reduce the number of features in a dataset while retaining as much information as possible
Image Classification Algorithm – Supervised algorithm that uses a convolutional neural network to classify images
Sequence-to-Sequence Algorithm – Supervised learning algorithm that takes a sequence of tokens and returns another sequence of tokens. This can be used for language translation, summarization of text, or speech-to-text.
Latent Dirichlet Allocation Algorithm – Unsupervised algorithm to categorize text data.
Neural Topic Model – Unsupervised algorithm that categorizes text documents into a ‘topic.’
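
To make one entry in the list above concrete, here is a minimal sketch of the K-Means idea (Lloyd's algorithm) in plain Python. It is not SageMaker's implementation, which is built for much larger datasets; the points and starting centroids are made up for the example.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```

With K = 2, the six points separate into one group near the origin and one group near (8, 8).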

Deploy the Model

Now that the model is created, it is necessary to establish an HTTPS endpoint in order to call on the model to make inferences based on input data. The client can host their model with SageMaker Hosting Services in order to avoid the cost of integrating the model with their application. To do so, the client specifies the model components and the ML compute instances to host each production variant.
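
Calling such an endpoint from application code might look like the sketch below. It assumes the boto3 `sagemaker-runtime` client and its `invoke_endpoint` call; the endpoint name is hypothetical, and the CSV request and JSON response formats are assumptions, since the actual formats depend on the deployed model.

```python
import json

def predict(runtime_client, endpoint_name, rows):
    """Send rows as CSV to a hosted endpoint and return the parsed reply.

    Assumes the model accepts text/csv and answers with JSON; both
    depend on the deployed model and may differ in practice.
    """
    body = "\n".join(",".join(str(v) for v in row) for row in rows)
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=body,
    )
    return json.loads(response["Body"].read())

# Usage (requires AWS credentials and a deployed endpoint):
#   import boto3
#   client = boto3.client("sagemaker-runtime")
#   predict(client, "my-endpoint", [[5.1, 3.5, 1.4, 0.2]])  # hypothetical name
```

Passing the client in as an argument keeps the function easy to test with a stand-in object that mimics `invoke_endpoint`.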

SageMaker allows for the use of multiple model variants for each endpoint configuration. The developer can specify different weights for each variant. This is useful because it allows an organization to test variations of the same model on subsets of their users. For instance, if an organization wishes to replace an older pre-existing model with one of three new experimental models, they can assign 85% of traffic to the pre-existing model. The remaining 15% can be divided among the new models in order to compare performance on real-time production data. Once the test is concluded, the weights can be adjusted to assign 100% of traffic to the desired model.
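
The 85/15 split described above can be simulated locally to see how variant weights translate into traffic. This is only an illustration of weighted routing, not SageMaker's routing code; the variant names are hypothetical, and the sketch assumes each variant's share of traffic is its weight divided by the sum of all weights.

```python
import random
from collections import Counter

# One existing model and three experimental ones (hypothetical names).
variant_weights = {
    "current": 85,
    "candidate_a": 5,
    "candidate_b": 5,
    "candidate_c": 5,
}

def traffic_share(weights):
    """Fraction of requests each variant should receive."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def route(weights, rng):
    """Pick the variant that serves one request, in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)  # seeded so the simulation is repeatable
counts = Counter(route(variant_weights, rng) for _ in range(100_000))

print(traffic_share(variant_weights))
print({name: counts[name] / 100_000 for name in variant_weights})
```

Ending the test then amounts to changing the weights, for example setting the winning variant to 1 and the others to 0.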

Azure Machine Learning

There are two main products in the Azure suite catered towards developing and deploying machine learning models in the cloud: Azure Machine Learning Service and Azure Machine Learning Studio.

Azure Machine Learning Studio provides an interface better suited towards business professionals with no coding experience. It features an interactive, visual workspace that can connect datasets to machine learning solutions on an interactive canvas. Machine Learning Studio is limited to training datasets under 10 GB and has no offline development environment.

Azure Machine Learning Service is a cloud service geared towards developers and data scientists. It supports most common Python packages and includes features for data preparation and model generation that can be used locally, while still providing the flexibility to use any framework and compute resource.

Data Preparation

Just as in SageMaker, the first step is to create a workspace that provides centralized access to all artifacts needed for a machine learning project. The workspace also maintains history of various logs and metrics generated during the process.

In Azure Machine Learning Studio, the developer can import the raw data to use and drag this onto the canvas as the starting step of their workflow. Studio supplies various modules that handle the most common data cleaning tasks. These are dragged to the canvas in the desired order.

For users more comfortable with coding, Microsoft offers the Azure Machine Learning Data Prep SDK which includes automatic file type detection, functions to assist with data munging, cross-platform functionality, and methods to assist with the creation of summary statistics.

Train a Model

Azure Machine Learning Studio includes modules to split data into train and test sets. The train datasets are then sent to a training module that contains the desired algorithm. Some of these algorithms request the developer to select features while others have their own method for feature selection. Studio then allows for model scoring and evaluation with their own modules.
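
For readers who prefer code to a canvas, the split step can be sketched as follows. This is a plain-Python illustration of what a random train/test split does, not the Studio module itself; the function name, the 25% test fraction, and the seed are assumptions for the example.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut it into train and test sets."""
    shuffled = rows[:]                     # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))                     # stand-in for 20 data rows
train_rows, test_rows = train_test_split(rows, test_fraction=0.25)
print(len(train_rows), len(test_rows))
```

Shuffling before the cut matters when the raw data is ordered, for example by date or by class, since an unshuffled split would give the model a test set unlike its training data.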

In Azure Machine Learning Service, the developer must register the model in the model registry, which tracks models in the Azure workspace by name and version. Similar to Amazon SageMaker, Azure Machine Learning Service has recently added the capability for automated hyperparameter tuning.
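
The idea of tracking models by name and version can be illustrated with a toy in-memory registry. This is a conceptual sketch only and does not use or mirror the Azure SDK; the class, method names, and artifact paths are hypothetical.

```python
class ModelRegistry:
    """Toy registry: each name maps to a list of artifacts, one per version."""

    def __init__(self):
        self._versions = {}

    def register(self, name, artifact):
        """Store a new version under the name and return its version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(artifact)
        return len(versions)  # version numbers start at 1

    def get(self, name, version=None):
        """Fetch a specific version, or the latest when none is given."""
        versions = self._versions[name]
        if version is None:
            version = len(versions)
        if not 1 <= version <= len(versions):
            raise KeyError(f"{name} has no version {version}")
        return versions[version - 1]

registry = ModelRegistry()
v1 = registry.register("churn-model", "model_v1.pkl")  # hypothetical artifacts
v2 = registry.register("churn-model", "model_v2.pkl")
latest = registry.get("churn-model")
```

Registering under an existing name creates a new version instead of overwriting the old one, so an earlier model can still be retrieved and redeployed.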

Deploy the Model

Azure Machine Learning Studio offers deployment as a web service. This can be done simply by selecting the ‘Deploy Web Service’ option from a drop-down list. This web service accepts input data in the same format as the data the model was trained on. The developer can configure the service to return the entire dataset with the predicted value, just the predicted value, or any combination of columns. Requests and responses can be handled one at a time or in batches. After deployment, the web service can be managed through the Azure Machine Learning Web Services portal.

All three of these products streamline the machine learning process for data scientists. Their purpose is to assist with the details of such projects and allow the developer to focus on the business problem being solved. Most companies in the early stages of data science development find productionizing models to be an intimidating task that will be out-of-reach for quite some time. These products greatly simplify this process and ensure that even smaller companies can incorporate state-of-the-art machine learning solutions with minimal costs.

Additional Resources

Here are some resources to learn more about these products: