Machine learning models are essentially mathematical functions that learn patterns from data. They fall into two main categories – supervised learning and unsupervised learning.
In supervised learning, we provide the algorithm/model/function with pairs of inputs and known outputs, and the model learns to produce the desired output for a given input. For example:
- Identifying the zip code from handwritten digits on an envelope
- Detecting fraudulent activity in credit card transactions
In unsupervised learning, we are given only input data, with no corresponding output data to go with it. For example:
- Grouping blog posts with similar themes
- Detecting abnormal access patterns to a website
Coming back to supervised learning, we can further break it down into classification and regression problems. In a classification problem we tag a data element with a specific label – an animal might be labelled ‘cat’, or a document might be labelled ‘technical’. Regression predicts a numeric value, which can be any real number.
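As a rough illustration of the difference, here is a minimal scikit-learn sketch; the features, labels and values are invented purely for illustration:

```python
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: predict a discrete label (0 = 'cat', 1 = 'dog', say)
X_cls = [[4.0, 1.2], [30.0, 5.5], [3.5, 1.0], [28.0, 6.0]]  # e.g. weight, height
y_cls = [0, 1, 0, 1]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[5.0, 1.3]]))  # -> a label from {0, 1}

# Regression: predict a continuous value (say, a house price)
X_reg = [[50.0], [80.0], [120.0]]  # e.g. floor area in square metres
y_reg = [150_000.0, 240_000.0, 360_000.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[100.0]]))  # -> any real number
```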
Data is central to any machine learning problem, and can be categorised as training, validation or test data (see the split sketched after this list):
- Training data, as the name suggests, is fed to the model during training
- Validation data is used to decide when to stop training
- Test data is used to evaluate how the trained model performs
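One common way to carve a single dataset into these three sets is scikit-learn's train_test_split, applied twice; the 60/20/20 ratio below is just an example, not a rule:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)            # 100 rows, 4 features (synthetic data)
y = np.random.randint(0, 2, size=100)

# First hold out 20% as the test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
# ...then split the remainder into training (60% overall) and validation (20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42)
```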
Data (whether training, validation or test) can also be thought of as structured or unstructured.
- Structured data, loosely speaking, can be represented neatly in tables, or fits into a finite set of categorical values – say, cell phone brands.
- Unstructured data is more free-form and real-world – say, images, videos or audio recordings.
Some more terminology around data: ‘training examples’, ‘labels’ and ‘features’. A ‘training example’ is a single row or instance of data fed into the machine learning model. A ‘label’ is either the target column in the existing dataset or the model’s output/prediction. A ‘feature’ is a single input column after it has gone through data processing.
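To make these terms concrete, here is a tiny, made-up housing table; all column names and values are invented for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "area_sqm": [50.0, 80.0, 120.0],          # feature
    "bedrooms": [1, 2, 3],                    # feature
    "price":    [150_000, 240_000, 360_000],  # label (the target column)
})

X = df[["area_sqm", "bedrooms"]]  # each row of X is one training example
y = df["price"]                   # the label the model learns to predict
```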
Machine Learning Workflow

Now we come to the machine learning workflow, which can be split into six steps.
Step 1 Data Collection: As stated before, machine learning is all about the data, so collecting the relevant data is crucial. The quality of the data will in turn determine the quality of the machine learning model we create. If data is not available in the format or the quantity we need, we may have to create it ourselves.
Step 2 Pre-processing/Feature Engineering/Data Transformation: numerical data is scaled, and non-numerical data is first converted to a numerical form and then, possibly, scaled. In other words, we normalise the data into a form the machine learning model can understand.
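A minimal sketch of what this step might look like with scikit-learn, assuming one numeric and one categorical column (both invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "area_sqm":    [50.0, 80.0, 120.0],
    "phone_brand": ["acme", "globex", "acme"],  # a categorical column
})

preprocess = ColumnTransformer([
    # Scale the numeric column to zero mean and unit variance
    ("scale", StandardScaler(), ["area_sqm"]),
    # Turn the string category into numeric indicator columns
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["phone_brand"]),
])
X_processed = preprocess.fit_transform(df)
print(X_processed)
```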
Step 3 Training is the process of passing training data to a model so that it can learn to identify patterns. Mathematically, the aim is to optimize a function to have the least cost – meaning the least deviation between the predicted and the known values.
Step 4 Evaluation is where we fine-tune our machine learning model. We may go through multiple iterations of training and evaluation before we settle on the most optimized function.
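Here is a minimal sketch of Steps 3 and 4 together on synthetic data: each iteration trains a candidate model on the training set and evaluates it on the validation set, and we keep the one with the lowest cost (mean squared error here):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

best_alpha, best_mse = None, float("inf")
for alpha in [0.01, 0.1, 1.0, 10.0]:                 # iterate: train, then evaluate
    model = Ridge(alpha=alpha).fit(X_train, y_train)  # minimise cost on training data
    mse = mean_squared_error(y_val, model.predict(X_val))  # predicted vs known
    if mse < best_mse:
        best_alpha, best_mse = alpha, mse
print(best_alpha, best_mse)
```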
Step 5 Testing is actually taking our machine learning model for a spin on near-production or real-life data.
Step 6 Deploying the Model: as with all software in the world, a piece of code has no value unless it is deployed to production and starts delivering value. But with machine learning, the science here is still a bit hazy.
Once deployed, machine learning models can predict (infer) in either an online or a batch mode. Logically, one would use the online mode to deliver a low-latency, real-time result. Batch mode is for precomputing results and serving them out later; since there is no urgency to return results in real time, we can use that time to work on a much larger data set.
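A rough sketch of the two modes with a scikit-learn model; in a real system the online path would sit behind an API endpoint and the batch path inside a scheduled job, but that plumbing is omitted here:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# A trivially fitted model standing in for something trained earlier
model = LinearRegression().fit(np.array([[1.0], [2.0], [3.0]]), [2.0, 4.0, 6.0])

# Online: one request, one low-latency prediction
single_request = np.array([[2.5]])
print(model.predict(single_request))

# Batch: precompute predictions over a large dataset, then serve from storage
nightly_batch = np.arange(10_000).reshape(-1, 1).astype(float)
precomputed = model.predict(nightly_batch)
```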
Lastly, while on the subject of deployment, I would be remiss not to mention machine learning pipelines: an orchestrated, systematic way of working through all the steps in a machine learning workflow – feature engineering, training, evaluation and prediction.
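As one concrete (if simplified) example of the idea, scikit-learn's Pipeline chains the workflow steps so that data flows through them in order:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("preprocess", StandardScaler()),   # Step 2: feature engineering
    ("model", LogisticRegression()),    # Step 3: training
])

X = np.random.rand(100, 4)             # synthetic data for illustration
y = np.random.randint(0, 2, size=100)
pipe.fit(X, y)                         # runs preprocessing, then training, in order
print(pipe.predict(X[:5]))             # predictions flow through the same steps
```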