Building software systems presents a wide variety of challenges, and machine learning is no different. The challenges can be broadly grouped into five categories: quality of the data, repeatability, data variance, tooling, and conflicting vision.
Let's take each of these in turn.
Quality of the training data: This follows the classic maxim of software engineering: garbage in, garbage out. Care should be taken to ensure that the data used is complete, accurate and consistent. It all starts with the sourcing of data. Ask questions: Am I harvesting this data from the right source? Can this be considered the golden source of truth? Once satisfied, move on to analysing the data. Things to look for include typos, duplicated entries, and inconsistencies in measurement units, e.g. miles versus kilometres. There are costs to be paid for bad quality data. For example, if entries are duplicated, the model will incorrectly assign a higher weight to those data elements. If data labels (note the emphasis on labels) are marked incorrectly, the model learns the wrong associations and its predictions on production data can be wildly inaccurate. Data also needs to be complete: if the job of the model is to predict the weather for the whole of California, we cannot afford to collect data for just two counties, LA and San Francisco.
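The duplicate and unit checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`trip_id`, `distance`, `unit`) and the function name `clean_records` are made up for this example, and real data would need many more checks.

```python
MILES_TO_KM = 1.609344  # one international mile in kilometres


def clean_records(records):
    """Convert all distances to kilometres and drop duplicate entries.

    Assumes each record is a dict with hypothetical keys
    'trip_id', 'distance' and 'unit' ('km' or 'mi').
    """
    seen = set()
    cleaned = []
    for rec in records:
        distance = rec["distance"]
        if rec["unit"] == "mi":
            distance = distance * MILES_TO_KM
        elif rec["unit"] != "km":
            # Surface unknown units instead of silently guessing.
            raise ValueError(f"unknown unit: {rec['unit']!r}")

        distance_km = round(distance, 3)
        key = (rec["trip_id"], distance_km)
        if key in seen:
            continue  # duplicate entry, keep only the first copy
        seen.add(key)
        cleaned.append({"trip_id": rec["trip_id"], "distance_km": distance_km})
    return cleaned


raw = [
    {"trip_id": "a1", "distance": 10.0, "unit": "km"},
    {"trip_id": "a1", "distance": 10.0, "unit": "km"},  # duplicated entry
    {"trip_id": "b2", "distance": 5.0, "unit": "mi"},   # different unit
]
cleaned = clean_records(raw)
print(cleaned)
```

Raising an error on an unknown unit is a deliberate choice here: a bad value that stops the pipeline is usually cheaper than one that quietly ends up in the training set.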
Repeatability: We are all used to the unshakable consistency of machine behaviour, especially when it comes to programming. Given the same set of inputs, a conventional program is expected to give the same set of outputs every time. Not so much with machine learning. The number of variables involved in standing up an ML model throws a wrench into this expectation. Randomness is built into the practice of machine learning. It starts with model weights being initialised with random values. Each weight then goes through multiple training iterations before converging on a final value. If we change the value the weight started with, we will generally end up with a different value when the iterations have finished. The solution is to fix the random value, which sounds like an oxymoron; in practice this means fixing the seed of the random number generator so that the same "random" values are produced on every run. Fixing the initial weights is just one part. Everything else that goes into a model needs to be held constant too: the test data must be the same between experiments, the way in which we split the data into training and test sets must be the same, and the hyperparameters must be set the same way.
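As a minimal sketch of the idea, the snippet below uses Python's standard `random` module to make both the train/test split and the initial weights depend on a single seed. The function name `make_experiment`, the toy data and the weight range are assumptions made for illustration; an actual ML framework has its own seeding calls that would need to be set as well.

```python
import random


def make_experiment(seed, data, test_fraction=0.2, n_weights=3):
    """Return (train, test, initial_weights), all determined by `seed`."""
    rng = random.Random(seed)  # private generator, unaffected by other code

    # Deterministic shuffle, then split into test and training sets.
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]

    # "Random" initial weights that are identical for identical seeds.
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_weights)]
    return train, test, weights


data = list(range(10))
run_a = make_experiment(seed=42, data=data)
run_b = make_experiment(seed=42, data=data)
run_c = make_experiment(seed=7, data=data)

print(run_a == run_b)        # same seed: same split and same weights
print(run_a[2] == run_c[2])  # different seed: different starting weights
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps the experiment reproducible even if other code draws random numbers in between.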
Data Variance: Change is the only constant, as they say, and the rule applies to the data used in machine learning too. We could have taken all the care in the world to ensure that the data we used for training our model was accurate and complete in all respects. But with time, the nature of the data may have changed. For example, let's say you had built a machine learning model which, given a set of pictures, categorises them as either business formal or business semi-formal. Societal norms change, and what was considered business formal before could be outdated today. Likewise, as technology and data gathering mechanisms advance, more features of the same data set may become available, which could produce a more accurate model. So the idea here is to periodically refresh the data we use to train our model.
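One simple way to notice that a refresh is due is to compare recent data against the data the model was trained on. The sketch below is an assumption of this rewrite, not a method from the text above: it flags a numeric feature when its recent mean has moved more than a chosen number of training standard deviations. The function name `has_drifted`, the threshold of 2.0 and the sample values are all hypothetical.

```python
from statistics import mean, stdev


def has_drifted(train_values, recent_values, threshold=2.0):
    """Return True if the recent mean has moved more than `threshold`
    training standard deviations away from the training mean."""
    baseline_mean = mean(train_values)
    baseline_std = stdev(train_values)
    if baseline_std == 0:
        # No spread in the training data: any change in mean counts.
        return mean(recent_values) != baseline_mean
    shift = abs(mean(recent_values) - baseline_mean) / baseline_std
    return shift > threshold


train_feature = [10, 11, 9, 10, 12, 8]   # values seen at training time
stable_feature = [10, 11, 9, 10]         # recent values, similar range
shifted_feature = [15, 16, 14, 15]       # recent values, clearly higher

print(has_drifted(train_feature, stable_feature))
print(has_drifted(train_feature, shifted_feature))
```

A check like this only looks at one feature's average, so it will miss many kinds of change; it is meant as a cheap early warning rather than a complete answer.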
Tooling: We don't use a hammer to shoo away a mosquito. In other words, use the right tool for the right purpose. Different use cases need varying amounts of infrastructure: an image-oriented machine learning model will typically require more horsepower than a model based on tabular data. Similarly, different phases of the machine learning modelling process require different sets of tools. A team of machine learning engineers and scientists prototyping a model will need less horsepower than the same model running in production against real-world data loads. The challenge here is developing the ability to choose the right set of tools for each use case.
Conflicting vision: Too many cooks spoil the broth. If the ML engineer, the product manager and the executive leader on a team have different visions of what the model is supposed to do, those views need to be reconciled. There has to be a compromise and an agreed path forward.