People often get stuck when they are asked to improve the performance of predictive models. What they usually do is try different algorithms and compare the results, but they often end up with no improvement. Today I will walk you through what we can do to improve our models.
You can build a predictive model in many ways. There is no ‘must-follow’ rule. But if you follow the approaches shared below, you are likely to achieve higher accuracy in your models (given that the data provided is sufficient to make predictions).
- Add more data: More data is almost always useful. It helps the model capture more of the variance present in the data.
I understand that we don’t always get the option to add more data. For example, we cannot increase the size of the training data in data science competitions. But when working on a company project, I suggest you ask for more data if possible. This will reduce the pain of working with limited data sets.
- More Features: Adding new features decreases the bias of the model at the expense of its variance. A new feature can help the algorithm capture an effect it would otherwise miss. For example, when predicting daily withdrawals from ATMs, people may follow a different pattern at the start of the month by withdrawing higher amounts. So it’s better to create a new feature that flags the start of the month.
- Feature selection: This is also one of the most important aspects of predictive modeling. If we keep all the features in the data, the model might overfit and perform poorly on unseen data. So it’s advisable to identify the important and significant features and rebuild the model using only those.
- Missing value and Outlier Treatment: Outliers can distort your model so badly that treating them sometimes becomes essential. Some of the data may be wrong or illogical. For example, I once worked on airline industry data in which some passengers had an age above 100, and some were listed as 2000 years old. It is illogical to use such values as they are. The cause is hard to pin down, but it is likely that some users intentionally entered their age incorrectly for privacy reasons. Another possibility is that they entered their birth year in the age column. Either way, these values appear to be errors that need to be addressed. Likewise, missing values should also be addressed.
- Ensemble Models: Ensemble models produce better results most of the time. Bagging (Bootstrap Aggregating) and Boosting are two of the methods that can be used. These methods are generally more complex and more of a black box.
We can also combine several weak models and get better results by taking the simple average or a weighted average of their predictions. The idea is that one model might capture only the variance in the data while another might be better at capturing only the trend. In such cases, an ensemble works well.
- Using a suitable machine learning algorithm: Choosing the right algorithm is a crucial step in building a better model. I was once working with a Holt-Winters model for prediction, but it performed badly for real-time forecasting, so I had to move to neural network models. Some algorithms are simply better suited to some data sets than others. Identifying the right type of model can be really tricky, though!
- Auto feature generation: There is a lot of buzz around the term “deep learning”. The quality of the features is critical to the accuracy of the resulting machine learning model; no machine learning method will work well with poorly chosen features. However, for large and complex problems there is, in theory, an infinite number of potential features to choose from. If you are doing image classification or handwriting classification, then deep learning is for you. Deep learning does not require you to provide the best possible features; it learns them on its own. Image processing tasks have seen amazing results using deep learning.
- Miscellaneous: It is always better to explore the data thoroughly. The distribution of the data might suggest a transformation. The data might follow a Gaussian or some other family of distributions, in which case we can apply the algorithm after a small transformation to get better predictions. Once we have the right data distribution, the algorithm can work efficiently. Another thing we can do is fine-tune the parameters of the algorithm.
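To make the "More Features" point concrete, here is a minimal sketch of a start-of-month flag for the ATM example. The function name `is_start_of_month`, the five-day window, and the sample records are all assumptions made for illustration; the right window depends on your own data.

```python
from datetime import date


def is_start_of_month(day: date, window: int = 5) -> int:
    """Return 1 if `day` falls within the first `window` days of its month, else 0."""
    return 1 if day.day <= window else 0


# Hypothetical daily withdrawal records: (date, total amount withdrawn).
records = [
    (date(2023, 1, 2), 5400.0),
    (date(2023, 1, 17), 2100.0),
    (date(2023, 2, 1), 6100.0),
]

# Append the new flag as a third column for each row.
features = [(d, amount, is_start_of_month(d)) for d, amount in records]
```

A model trained on `features` can then learn a separate effect for the first days of the month instead of treating every day alike.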
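For the airline age example under "Missing value and Outlier Treatment", one possible cleanup is sketched below. It rests on assumptions of mine, not on what was done in the original project: values that look like a birth year are converted to an age, anything still outside 0 to 100 is treated as missing, and missing values are filled with the median of the valid ages. The cutoff `MAX_PLAUSIBLE_AGE`, the birth-year range, and the function name `clean_ages` are hypothetical.

```python
from statistics import median
from typing import List, Optional

MAX_PLAUSIBLE_AGE = 100  # assumed cutoff; adjust for your data


def clean_ages(ages: List[Optional[float]], reference_year: int) -> List[float]:
    """Convert likely birth years to ages, blank out implausible values,
    and fill every blank with the median of the remaining valid ages."""
    cleaned: List[Optional[float]] = []
    for age in ages:
        # Assumption: a value between 1900 and the reference year is a birth year.
        if age is not None and 1900 <= age <= reference_year:
            age = reference_year - age
        if age is None or not (0 <= age <= MAX_PLAUSIBLE_AGE):
            cleaned.append(None)  # treat as missing
        else:
            cleaned.append(age)

    valid = [a for a in cleaned if a is not None]
    if not valid:
        raise ValueError("no valid ages to compute a median from")
    fill = median(valid)
    return [fill if a is None else a for a in cleaned]


# Hypothetical raw column: one birth year, one missing value, one impossible age.
raw_ages = [34, 1985, None, 150, 28, 41]
ages = clean_ages(raw_ages, reference_year=2015)
```

Median imputation is only one option. Dropping the affected rows or adding a "was missing" indicator feature may suit some data sets better.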
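The simple and weighted averaging described under "Ensemble Models" can be sketched in a few lines. The helper `weighted_average` and the two prediction lists are hypothetical; in practice the predictions would come from models you have already fitted, and the weights would be chosen on a validation set.

```python
from typing import List, Sequence


def weighted_average(
    predictions: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Combine per-model predictions into one prediction per observation.

    `predictions[m][i]` is model m's prediction for observation i.
    Equal weights give the simple average.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all models must predict the same observations")
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n)
    ]


# Hypothetical forecasts from two models for the same three days.
trend_model = [100.0, 110.0, 120.0]
seasonal_model = [90.0, 130.0, 100.0]

simple = weighted_average([trend_model, seasonal_model], [1, 1])
weighted = weighted_average([trend_model, seasonal_model], [3, 1])
```

Here `simple` treats both models equally, while `weighted` trusts the first model three times as much as the second.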
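As one illustration of the transformation idea under "Miscellaneous", a log transform is a common choice for right-skewed, non-negative values such as amounts. This is my own example of a transformation rather than one prescribed above, and the function names and sample values are hypothetical. The sketch transforms the values before modeling and maps predictions back to the original scale afterwards.

```python
import math
from typing import List, Sequence


def log_transform(values: Sequence[float]) -> List[float]:
    """Apply log(1 + x), which compresses large values and is defined at zero."""
    if any(v < 0 for v in values):
        raise ValueError("this sketch assumes non-negative values")
    return [math.log1p(v) for v in values]


def inverse_log_transform(values: Sequence[float]) -> List[float]:
    """Undo log_transform, mapping values back to the original scale."""
    return [math.expm1(v) for v in values]


# Hypothetical right-skewed amounts: most are small, one is very large.
amounts = [0.0, 12.0, 15.0, 20.0, 5000.0]
transformed = log_transform(amounts)
restored = inverse_log_transform(transformed)
```

Whether a log, square root, or other transformation is appropriate depends on the distribution you actually observe, so it is worth plotting the data before and after.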