Solving the project management dilemma
The key difference between a development project and a machine learning project is that the latter requires considerably more research, and you can't estimate the amount of research beforehand. So applying a fixed price to a machine learning project isn't going to work.
Agile works slightly better. But it can still be difficult to break down the research work into sprints in the early stages, and deliver something (other than the intermediate research results) at the end of each sprint.
In our experience, a time and materials approach, with regular client workshops, works best. The client most likely knows their data better than you do, so asking them the right questions can save you hours of digging through the dataset. Communication is as important for a machine learning project as it is for a development one.
Avoid risk by asking the important questions
Three questions lay the foundation for the initial research phase.
Can the problem be solved by machine learning?
Over the last few decades we've collected tons of precise historical data about earthquakes. But the best we can do is say: "The probability of a magnitude seven or more earthquake, in this area, over the next year, is 15 percent".
Sure, it's better than nothing, but it's definitely not impressive given the amount and quality of the data. This isn't because our models are poor, but because of the nature of the data itself: what happened in the past doesn't precisely determine the future. So, is it theoretically possible to train a usable model to solve the problem? The answer depends on the project.
What about the quality of the data?
Is it practically possible to train a usable model on the existing data? This depends on whether the data actually represent the real world, or some artificial pattern created by the logic behind the data collection. It's important to work with the client and investigate how their data was collected and processed.
What data will be available for the model?
Forget about machine learning for a minute. Imagine you already have a model and you're happy with its characteristics. What are the use cases of the model, and what environment will it run in? Most likely, it'll be integrated into the client's infrastructure or run as an API service. If so, are all the features the model needs as input available at the moment of an API call? It's possible you have trained a pretty good model on the client's historical data, which was post-processed and is now complete and of decent quality. But the raw data the model has to work with in production is different, and simply doesn't have half the features your model uses. This is another example of why communicating with the client matters.
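One way to catch this early is to compare the model's expected inputs against a sample of raw production requests. The sketch below is only an illustration: the feature names, the `missing_features` helper, and the shape of the request are all hypothetical, and a real check would depend on how the client's service is built.

```python
def missing_features(required, payload):
    """Return the required feature names that are absent or None in the payload."""
    return sorted(name for name in required if payload.get(name) is None)


# Hypothetical feature names, for illustration only.
REQUIRED_FEATURES = {"age", "account_balance", "days_since_last_order"}

# What a raw production request might look like before any post-processing:
# one feature is present but empty, another is not sent at all.
raw_request = {"age": 42, "account_balance": None}

missing = missing_features(REQUIRED_FEATURES, raw_request)
print(missing)  # ['account_balance', 'days_since_last_order']
```

Running a check like this over a handful of real requests during the research phase is far cheaper than discovering the gap after the model is trained.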
The most obvious metric for a classification model is its accuracy. This is simply the number of correct decisions divided by the total number of decisions made by the model. Is it always applicable, though?
Let's look at the example of cancer diagnostics with machine learning. In this problem, the model tries to recognise whether the tumour on an MRI image is malignant or benign. As an example, let's assume one in 10,000 people actually has cancer. Also assume our model always makes a negative decision (tumour is benign, patient is healthy) without actually finding any patterns in the data. The accuracy of such a model would be 99.99 percent, even though it's practically useless.
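The arithmetic behind this example can be checked in a few lines. The sketch below uses the same assumed numbers (one positive case in 10,000, a model that always answers "negative") and adds recall, the share of actual positives the model finds, as one possible complementary metric. Recall is our choice for the illustration; the right metric depends on the project.

```python
def accuracy(y_true, y_pred):
    """Fraction of decisions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives (label 1) that the model flagged as positive."""
    true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_positives = sum(y_true)
    return true_positives / actual_positives if actual_positives else 0.0


# 1 = malignant, 0 = benign. One positive case in 10,000, as assumed above.
y_true = [1] + [0] * 9999
# A "model" that always predicts benign.
y_pred = [0] * 10000

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy = {acc:.2%}")  # accuracy = 99.99%
print(f"recall   = {rec:.2%}")  # recall   = 0.00%
```

The accuracy looks excellent, while the recall shows the model never finds the one patient it exists to find.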
Define the metrics and procedures to measure the performance of a model before starting to work on the model. Otherwise, you might be pursuing the wrong goals and simply wasting your time and your client’s.
Feature engineering is a big part of the job, but it's outside the scope of this post. I do want to share one thought, though, which might help you decide whether you need to spend a lot of time on it, or whether you can let your awesome, deep, feed-forward, fully-connected neural network do it for you. There are two types of classification problems: objective and subjective.
In an objective problem, the outcome is known for a fact and can easily be verified. Image recognition and cancer diagnostics are objective. We have the labelled data and the class of each example from the dataset is known.
Content approval, on the other hand, is subjective. The decision is made by a human, and humans can make different decisions about the same content. In an objective problem, a machine learning model can outperform a human because it can learn to extract features from the data that humans can't. This isn't always the case for a subjective problem, because quite often we're assuming that the human decision is always correct, so the best the model can do is reproduce it.
For subjective kinds of problems it's definitely important to do some research on how humans actually solve them. Look at what features they use, and remove ones that might potentially add more bias to the model – rather than help it perform better. Again, research and communication with a client are important.
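A quick way to see how subjective a problem is would be to have two people label the same items and measure how often they agree. The sketch below is an assumption on our part rather than a prescribed method: the reviewers, the labels, and the `agreement_rate` helper are made up for illustration, and it reports raw agreement only, without correcting for agreement by chance.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two reviewers made the same decision."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both reviewers must label the same items")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Made-up decisions from two hypothetical reviewers on the same five items.
reviewer_a = ["approve", "reject", "approve", "approve", "reject"]
reviewer_b = ["approve", "approve", "approve", "reject", "reject"]

rate = agreement_rate(reviewer_a, reviewer_b)
print(f"agreement = {rate:.0%}")  # agreement = 60%
```

If the humans agree with each other only 60 percent of the time, a model trained on their labels is unlikely to look much better than that when it's measured against them.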
Train, validate, repeat
As the heading suggests, split your dataset not into two parts (training and testing) but into three: training, validation, and testing.
Use the validation data to measure the accuracy (and/or other metrics) of the model iteratively while adjusting models and training parameters. Experiment with different neural net architectures and feature sets.
Test the model on the testing data to actually verify its performance. Most likely, it'll be slightly worse on the testing set than on the validation set, in the same way that it's worse on the validation set than on the training set. If you use only a training/testing split, you risk overfitting the model to the testing data, because every adjustment you make is an attempt to maximise its performance on that set. The validation set is sometimes called the development set. So, train and develop on the training and development data, then test the final model on the testing data.
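A three-way split can be sketched with nothing but the standard library. The 70/15/15 proportions, the fixed seed, and the `three_way_split` name are assumptions for illustration. The right proportions depend on how much data you have, and time-ordered data usually needs a chronological split rather than a random shuffle.

```python
import random


def three_way_split(examples, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle the examples and split them into train, validation, and test lists."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between experiments.
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)

    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Stand-in dataset: 1,000 example ids.
train, val, test = three_way_split(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

Set the test list aside and touch it only once, for the final model. Everything you do while iterating should look at the training and validation lists alone.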
Finding the perfect machine learning model
There are lots of neat techniques that might help you along your way to the perfect model for the problem you’re trying to solve.
- Anomaly detection
- Principal component analysis
- Dimensionality reduction
- Data visualisations
- Good old statistics
Research, feature engineering, and adjusting the training process can sometimes be more art or magic than exact science. So it's sometimes helpful to simply play with the data without any clear purpose. Visualising the distribution of a certain feature, or calculating means, medians, and correlations, might not directly answer your questions. But it will at least give you more insight into the structure of your datasets and more ideas about which approaches are worth trying.
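As a small illustration of "good old statistics", the sketch below computes a mean, a median, and a Pearson correlation using only the standard library. The two columns are made-up toy values and `pearson` is a helper written for this example. On a real dataset you'd more likely reach for a data analysis library.

```python
import math
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Made-up toy columns, for illustration only.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

print("mean score:  ", statistics.fmean(score))
print("median score:", statistics.median(score))
r = pearson(hours, score)
print(f"correlation:  {r:.2f}")
```

A gap between the mean and the median hints at a skewed distribution, and a strong correlation between two features suggests they carry overlapping information.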