- Posted by Intent Media 12 May
At Intent Media we collect vast amounts of data on travel and ecommerce, which we leverage to develop amazing products for the online travel marketplace. In particular, our data team develops models which predict user behaviors like clicks or transactions. We have built up a lot of experience with the complex world of large scale machine learning. While we have iterated on our infrastructure and implementations throughout the years, we have recently begun to productionize some of our models in Apache Spark. Here are a few thoughts on machine learning, based on empirical testing and experience.
1. Make production mirror development
In the past we often had a hard time testing our ML infrastructure at both small and large scale. For certain datasets, tools like SVMLight or Scikit-Learn are powerful. It is not difficult to build unit and integration tests around them in a development system. The problem is that when you use another tool at a larger scale (perhaps Mahout or our ADMM Hadoop implementation) you can no longer trust your small tests to tell you whether a large scale change will break or not. Spark MLlib works really nicely in a development setting and is suitable for tests that run models on a few training examples with each build. We have found that scaling it up to millions of training examples is quite reliable. There are certainly memory settings and optimizations that vary by environment, but those generally affect performance, not correctness. When the same code executes locally and on a cluster, you can iterate and ship faster.
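One way to keep development and production on the same code path is to confine the environment difference to a single piece of configuration, such as the Spark master URL. A minimal sketch of that idea, assuming a hypothetical `SPARK_MASTER` environment variable (not Intent Media's actual setup):

```python
import os

def spark_master(env=os.environ):
    """Pick a Spark master URL from the environment.

    Unit tests and local development fall back to local mode; the
    production deploy sets SPARK_MASTER to the cluster URL. The
    model-training code itself never branches on environment.
    """
    return env.get("SPARK_MASTER", "local[*]")

# The same training entry point then runs unchanged everywhere, e.g.:
#   spark = SparkSession.builder.master(spark_master()).getOrCreate()
#   model = LogisticRegressionWithLBFGS.train(training_rdd)
```

Because the tests exercise the identical training code in `local[*]` mode, a green build is meaningful evidence that the cluster job will behave the same way, up to performance tuning.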
2. Thinking of weights or betas as equivalent to feature ‘value’ is dangerous
The absolute value of a weight is only a very loose proxy for feature contribution. Often business team members will come and ask “which features in our model are most valuable?” and it’s tempting to look at the absolute values of the weights. This can cause you to misunderstand your data. When using a linear model it is really hard to perceive encoded feature interactions, and how frequently a feature actually fires has to be analyzed in the underlying data. We have some relatively rare features that can be predictive, but they require the website user to be on a specific page, to have entered some data, or to have taken some other low-frequency action. We have other features which capture similar but slightly different notions, and often many of them are used on each prediction. In aggregate they represent a valuable signal in the model, but individually they have low weights.
Models with weights that have large positive or negative values can often end up canceling each other out in odd ways, but can still be quite predictive.
For example, let’s suppose we have a very simple linear model which looks like this:
HAS_FOO: -3.0, HAS_BAR: 2.5, HAS_BAZ: 1.5
- User A visits with HAS_FOO: true, HAS_BAR: true, HAS_BAZ: true (-3.0*1 + 2.5*1 + 1.5*1 = score 1.0)
- User B visits with HAS_FOO: false, HAS_BAR: false, HAS_BAZ: true (-3.0*0 + 2.5*0 + 1.5*1 = score 1.5)
Supposing our classification threshold is 1, our model would calculate that both users are “positive” or likely to perform some action. But suppose HAS_FOO and HAS_BAR frequently occur together: their weights then largely cancel (-3.0 + 2.5 = -0.5), so HAS_FOO's large weight rarely moves the score much on its own. We cannot say that HAS_FOO is more important than HAS_BAZ, even though HAS_FOO has a much higher absolute weight than any other feature.
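The example above is small enough to run directly. This sketch uses the feature names and weights straight from the post:

```python
# Toy linear model from the example above; names and weights come
# straight from the post.
weights = {"HAS_FOO": -3.0, "HAS_BAR": 2.5, "HAS_BAZ": 1.5}

def score(features):
    """Dot product of the weight vector with a binary feature vector."""
    return sum(w for name, w in weights.items() if features.get(name))

user_a = {"HAS_FOO": True, "HAS_BAR": True, "HAS_BAZ": True}
user_b = {"HAS_FOO": False, "HAS_BAR": False, "HAS_BAZ": True}

score(user_a)  # -3.0 + 2.5 + 1.5 = 1.0
score(user_b)  # 1.5
# With a threshold of 1, both users come out "positive", even though
# HAS_FOO carries the largest absolute weight in the model.
```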
Of course, if your model suffers from any overfitting or if training has settled in a local optimum then these issues will be even more pronounced, since not only are the features not indicative on an individual basis, they are not even generalizable in aggregate.
3. Be careful making linearity assumptions about your data
Assuming that your data is linear is a little bit like assuming a normal distribution in statistics or assuming that random variables are independent in probability. The linearity assumption is really tempting because you can apply simple methods and run simple algorithms that perform really well. Linear and logistic regression are well studied, perform well at scale, have some great libraries available, and are conceptually simple to work with. The problem with this approach is that your data may interact in unforeseen ways.
We have attacked the linearity problem on a couple of fronts: in our logistic models, we know that our data interacts in a variety of ways, so we develop higher-order or quadratic features that combine data or apply mathematical transformations. There are also whole families of algorithms that learn feature interactions better, especially those based on decision trees. Which strategy will work best for your domain of course depends on the properties and size of your dataset.
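As a concrete illustration of the quadratic-feature idea, one cheap trick is to augment each example with pairwise products of its raw features, which lets an otherwise linear model pick up some interactions. A minimal sketch (the feature names here are hypothetical, not from our models):

```python
from itertools import combinations

def add_quadratic_features(row):
    """Augment a feature dict with pairwise products so a linear model
    can weight interactions between raw features."""
    out = dict(row)
    for a, b in combinations(sorted(row), 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out

row = {"price": 120.0, "is_mobile": 1.0}
augmented = add_quadratic_features(row)
# {'price': 120.0, 'is_mobile': 1.0, 'is_mobile*price': 120.0}
```

In practice you would generate these combinations only for features you have reason to believe interact; the full cross product grows quadratically and can bloat the feature space.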
4. A lot of the work is just getting the right data
We spend a lot of time tuning models to get performance just right, but our model performance is fundamentally restricted by the data we are able to collect. Sometimes the best way to improve a model is to get a new field or new dataset that can be joined in to create features you have not even considered. New data points can often provide the largest benefit to helping us boost partner revenue. Getting more data is usually not a data science or engineering problem as much as it is about relationships with other parties; we at Intent Media succeed by earning our clients’ trust, respecting privacy, ensuring data security, and demonstrating real value contributions.
This underscores the importance of contributions from all team members in predictive analytics. Scientists, engineers, product, and partnership relationship managers are all fundamental roles. Each individual and practice area relies on the others for success.
5. Regularization doesn’t matter so much when you have big data
Machine learning is not all mathematical traps and pitfalls. Sometimes it will give you a free lunch. It was fun to confirm the “unreasonable effectiveness of data” empirically by seeing higher model performance from larger training sets. While testing different optimizers on a large dataset, I wanted to make sure I was regularizing appropriately to avoid the overfitting mentioned above. I wrote a simple script that would train the model several times overnight with various regularization parameter values and methods. Regularization had a small performance benefit, but it did not have nearly the impact I was expecting. When you have lots of data, it becomes difficult to overfit.
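The overnight script boiled down to a sweep over regularization strengths. This self-contained sketch reproduces the shape of that experiment on synthetic data, using one-feature ridge regression in closed form rather than our actual models (all names and numbers here are illustrative):

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge regression, one feature, no intercept:
    w = sum(x*y) / (sum(x*x) + lam)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of predictions w*x against targets y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
n = 50_000                       # "big data" stand-in; true weight is 2.0
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [2.0 * x + random.gauss(0, 1) for x in xs]
train_x, test_x = xs[: n // 2], xs[n // 2 :]
train_y, test_y = ys[: n // 2], ys[n // 2 :]

# Sweep regularization strengths and record held-out error for each.
results = {lam: mse(fit_ridge_1d(train_x, train_y, lam), test_x, test_y)
           for lam in (0.0, 0.1, 1.0, 10.0)}
# With this much data, held-out error barely moves across lambdas.
```

On a training set this large, the penalty term is dwarfed by the data term, so the fitted weight and the held-out error are nearly identical across the sweep, which matches what we saw overnight at much larger scale.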
Hopefully some of these results will prove useful in your machine learning projects. If one thing is clear, it is that applied machine learning has not only come a long way in the last few years, but there is much more to be built. It is an exciting time for data science. More data will improve models, defining new features will become easier, and there will be improved tooling to streamline development and deployment pipelines to production.
Chet Mancini is a Data Engineer at Intent Media, Inc., where he works on the data science team to store and process terabytes of web travel data to build predictive models of shopper behavior. He enjoys functional programming, downhill skiing, and cycling around Brooklyn. Chet has a master’s degree in computer science from Cornell University. You can find him on Twitter and on GitHub.