Learning from Real World Machine Learning

  • Posted by Intent Media

At Intent Media we collect vast amounts of data on travel and ecommerce, which we leverage to develop amazing products for the online travel marketplace. In particular, our data team develops models that predict user behaviors like clicks or transactions. We have built up a lot of experience with the complex world of large-scale machine learning. While we have iterated on our infrastructure and implementations over the years, we have recently begun to productionize some of our models in Apache Spark. Here are a few thoughts on machine learning, based on empirical testing and experience.

1. Make production mirror development

In the past we often had a hard time testing our ML infrastructure at both small and large scale. For certain datasets, tools like SVMLight or Scikit-Learn are powerful, and it is not difficult to build unit and integration tests around them in a development system. The problem is that when you use another tool at a larger scale (perhaps Mahout or our ADMM Hadoop implementation), you can no longer trust your small tests to tell you whether a large-scale change will break something. Spark MLlib works really nicely in a development setting and is … Continue reading

Making Information Manageable with Fluentd

  • Posted by Intent Media

Why are metrics and log consumption treated as separate things? By creating and maintaining an artificial separation between log data and metrics, we add complexity and fragility to the information gathering stack. In this post, I would like to write about some of the difficulties that this line of thought causes. Then I will go on to demonstrate a specific example of how you might manage information regardless of the source. My hope is that by the conclusion of this article you will have a different perspective on information management and an example implementation that you can use to evaluate your own practices.

The difficulty with segregating information sources is that we have built processes and configuration around treating these data sources as separate things. Logs and metrics are often collected by different processes, which are configured in different locations using different syntaxes. We then configure each tool to interface with third-party vendors, often in such a way that changing metrics or log aggregation vendors becomes a daunting task for any large infrastructure.

Solutions

Let’s imagine an ideal implementation that doesn’t have these constraints. Configuration for both metrics and log consumption would be in the same … Continue reading

In Search of Diversity: Part 2

  • Posted by Intent Media

This is the second post in a two-part series. See the first post here.

Creating diversity

Neither in machine learning nor in startups do we need to take things as they come. When we find a lack of diversity, we can act to preserve the diversity that we do find, or even induce more of it.

To do this for ML models all we really need is some sort of metric to evaluate the amount of diversity in a given set of models. Some examples might include simple percentage agreement, variance, separation, or any one of several common distance metrics for comparing vectors of predictions. From there, we can try to construct the set of models with the maximum amount of diversity through things like random selection of training instances, varying fitness function parameters, etc. Depending on how you approach this problem, you might run into computation explosion, but there are definitely “good enough” solutions possible that can consistently produce diverse ensembles of models.
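As a rough illustration of the "simple percentage agreement" idea above, the sketch below scores an ensemble by the average fraction of instances on which each pair of models disagrees. The class and method names are hypothetical, and it assumes each model's output is a vector of integer class labels over the same instances; it is one possible metric, not the one used in the post.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: pairwise disagreement as a simple diversity score. */
final class EnsembleDiversity {

    /** Fraction of instances on which two prediction vectors differ. */
    static double disagreement(int[] a, int[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException("prediction vectors must be non-empty and equal length");
        }
        int differing = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                differing++;
            }
        }
        return (double) differing / a.length;
    }

    /** Mean disagreement over all distinct pairs of models; 0.0 means every model predicts the same labels. */
    static double meanPairwiseDisagreement(List<int[]> predictions) {
        if (predictions.size() < 2) {
            throw new IllegalArgumentException("need at least two models");
        }
        double total = 0.0;
        int pairs = 0;
        for (int i = 0; i < predictions.size(); i++) {
            for (int j = i + 1; j < predictions.size(); j++) {
                total += disagreement(predictions.get(i), predictions.get(j));
                pairs++;
            }
        }
        return total / pairs;
    }

    public static void main(String[] args) {
        // Two identical models plus one that differs on half of the instances.
        List<int[]> models = Arrays.asList(
                new int[] {1, 0, 1, 1},
                new int[] {1, 0, 1, 1},
                new int[] {0, 0, 1, 0});
        System.out.println(meanPairwiseDisagreement(models));
    }
}
```

A score like this could then be used to compare candidate sets of models, for example ones trained on different random samples of the training instances, and keep the more diverse set.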

So then, what are the corollaries for a startup or a tech team? If preserving and inducing diversity is a tractable problem for ML models, how about something more complex, like people? How … Continue reading

In Search of Diversity: Part 1

  • Posted by Intent Media

Ensemble models

In machine learning (ML), we build models. These are programs that make decisions based on data. Typically, they learn from some set of past data and are used on new data as it comes in. Over the lifetime of the field, researchers have developed many approaches to learning from data. These techniques (i.e. algorithms) are things you may have heard of: linear regression, naive Bayes, decision trees, and so on.
One of the most important developments in the history of machine learning has been the rise of ensemble learning methods. At their simplest, ensemble methods just use more than one model to make a single decision.
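To make "more than one model, one decision" concrete, here is a minimal sketch of a majority-vote ensemble. It assumes each model is a binary classifier over a single numeric feature, represented as a `DoublePredicate`; the class name and the threshold models in `main` are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.DoublePredicate;

/** Hypothetical majority-vote ensemble over binary classifiers. */
final class MajorityVoteEnsemble {
    private final List<DoublePredicate> models;

    MajorityVoteEnsemble(List<DoublePredicate> models) {
        if (models.isEmpty()) {
            throw new IllegalArgumentException("need at least one model");
        }
        this.models = models;
    }

    /** Predicts positive when strictly more than half of the models vote positive. */
    boolean predict(double x) {
        int positiveVotes = 0;
        for (DoublePredicate model : models) {
            if (model.test(x)) {
                positiveVotes++;
            }
        }
        return 2 * positiveVotes > models.size();
    }

    public static void main(String[] args) {
        // Three toy threshold models that disagree for inputs between 0.3 and 0.7.
        MajorityVoteEnsemble ensemble = new MajorityVoteEnsemble(Arrays.<DoublePredicate>asList(
                v -> v > 0.3,
                v -> v > 0.5,
                v -> v > 0.7));
        System.out.println(ensemble.predict(0.6)); // two of three models vote positive
        System.out.println(ensemble.predict(0.4)); // only one of three votes positive
    }
}
```

Note that if all three models were the same predicate, the vote would always equal that single model's answer, so the ensemble would add nothing.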

One of the more interesting consequences of using multiple models in an ensemble is that you may not be getting the supposed benefits of an ensemble, even though you nominally have multiple models. How could this happen? A lack of diversity.

Classifying cute

Let’s take for an example a simple classification problem. We’ll try to determine if my dog looks cute in this dress:

[Image: nom nom]

I’ve asked a bunch of people, and not one of them has ever said that she looks anything but very, very cute in this dress. So while … Continue reading

Safely Upgrading Hadoop and Cascalog

  • Posted by Intent Media

The Intent Media data team uses Hadoop extensively as a way of doing distributed computation. The big data ecosystem is evolving very quickly, and the patterns and versions we had been using are being replaced with faster and more stable platforms. This post is about how we successfully and safely upgraded our Hadoop jobs and paid down some technical debt.

Background

Many of Intent Media’s backend job flows use Hadoop to manage the sheer size of the input datasets. We have jobs that range from basic aggregation to learning models. Hadoop is an impressive framework, but we had increasingly been confronting issues with stability, missing features, and a lack of documentation. A number of indicators pointed to our version of Hadoop being too far out of date. 

For our original MapReduce jobs, we were using Hadoop 0.2, which was state of the art at the time those jobs were written, but now the framework has matured through several major versions and Hadoop 2.x is gaining adoption. We needed to at least get to 1.x to give us a path for future upgrades. 

On the data team, we write most of our MapReduce jobs in Cascalog, a Clojure domain specific language (DSL) … Continue reading

Bayesian Inference for A/B tests when metrics have categorical distributions

  • Posted by Intent Media

Most A/B testing happens within the classical statistical framework that involves p-values, statistical significance and confidence intervals. More recently, though, there has been much interest in using Bayesian inference techniques in setting up and interpreting A/B tests. Terms such as “Bayesian A/B tests” and “Bayesian bandits” have been used to describe these techniques. I will not reinvent the wheel here and, as an introduction to the topic, refer you to other sources [1][2]. In particular, I would recommend Sergey’s post as a minimum prerequisite to follow the rest of this post.

The most common example discussed in the context of Bayesian A/B tests is that of binary metrics such as click through rate (CTR) and conversion rate, which is natural given how ubiquitous these metrics are. The case where ads are either clicked or not, or where users either convert or don’t, can be modelled as independent Bernoulli trials, and the likelihood of the data under this model follows a binomial distribution. The beta distribution comes in handy as a conjugate prior and the story is soon wrapped up, as demonstrated in Sergey’s post.
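For readers who want to see the conjugate update in code, here is a small sketch of the binary case. With a Beta(α, β) prior and s successes and f failures observed, the posterior is Beta(α + s, β + f). The class name, the uniform Beta(1, 1) prior, and the click counts in `main` are illustrative assumptions, not figures from this post.

```java
/** Hypothetical value class for a beta distribution used as a prior/posterior on a rate. */
final class BetaDistribution {
    final double alpha;
    final double beta;

    BetaDistribution(double alpha, double beta) {
        if (alpha <= 0 || beta <= 0) {
            throw new IllegalArgumentException("alpha and beta must be positive");
        }
        this.alpha = alpha;
        this.beta = beta;
    }

    /** Conjugate update for Bernoulli trials: successes add to alpha, failures add to beta. */
    BetaDistribution update(long successes, long failures) {
        if (successes < 0 || failures < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        return new BetaDistribution(alpha + successes, beta + failures);
    }

    /** Mean of the distribution, alpha / (alpha + beta). */
    double mean() {
        return alpha / (alpha + beta);
    }

    /** Variance of the distribution, alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)). */
    double variance() {
        double n = alpha + beta;
        return alpha * beta / (n * n * (n + 1));
    }

    public static void main(String[] args) {
        // Uniform prior, then 3 clicks observed in 10 impressions (made-up numbers).
        BetaDistribution posterior = new BetaDistribution(1, 1).update(3, 7);
        System.out.println("posterior mean CTR: " + posterior.mean());
    }
}
```

Running the same update separately for each variant gives one posterior per variant, which is the starting point for comparing them.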

Metrics however do not exist in a vacuum and must be tailored to the way the user interacts with … Continue reading

Engineering Your Own Education: Part 2

  • Posted by Intent Media

In the previous post, I went through a few of the ways that an engineer can develop the skills that help get that great job in tech.  Mostly, I talked about things that you can do outside of your working life to get better.  In this post, I’m going to get into how learning can be part of your job.

I think that most smart engineers have an appreciation for the material in the previous post. Most promising candidates I meet understand that they need to be as well prepared as they can be to snag a fun, challenging tech job. The point I see many miss, though, is that their tech job is the beginning of the next phase of their education, not the end of it.

One of the most important things a professor ever said to me was the following.

Everything that I’m going to teach you in this class was invented after I left school.

The professor was Dennis Kroll, and the topic of the course was Java development (that thing I now do for a living). His point was about the duration of our educational journey as engineers. If we wanted to stay useful … Continue reading

Engineering Your Own Education: Part 1

  • Posted by Intent Media

One of the fun parts of being on a rocket ship of a startup is that I get to spend some of my time recruiting. I know that not every engineer enjoys recruiting, but personally, I really love it. When I recruit for Intent Media, I’m helping to introduce some smart motivated people to awesome jobs where they will have every chance to succeed in their professional goals and have a great time along the way. It’s sort of like being Santa Claus for good engineers.

One of the most salient aspects of recruiting for me is the difference between people working in data science and engineering and those looking to get started in the field. I got my first job in a data science group back in 2008, before there was a lot of clarity around this emerging profession.  Now, when I talk to mathematically inclined people who can write some code, they all want to talk about data science. Recent grads with degrees in things like electrical engineering, physics, and biomedical engineering are all rushing to brand themselves as data scientists.

I understand the technical person’s desire to align themselves with a hot field, overflowing with well-paying … Continue reading

Testing a Dropwizard application with Cucumber-JVM and jUnit

  • Posted by Intent Media

Dropwizard has quickly become the framework of choice for implementing new web services at Intent Media.

Because we take testing seriously, one of the first things we do when developing a new Dropwizard application is write some integration tests to make sure it will interact with its ecosystem in the way we expect.

Suppose you want to write some integration tests for a Dropwizard server application using Cucumber-JVM, the Java implementation of the popular BDD test framework.

How do you start your application and make it available within Cucumber’s test context? Dropwizard provides easy support for unit testing with jUnit, and fortunately Cucumber-JVM works with jUnit as well. Add a DropwizardAppRule with an @ClassRule annotation to your test runner class like so:

package your.dropwizard.app;

import cucumber.api.junit.Cucumber;
import io.dropwizard.testing.junit.DropwizardAppRule;
import org.junit.ClassRule;
import org.junit.runner.RunWith;

@RunWith(Cucumber.class)
public class RunCukesTest {

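    // Starts the application once before any feature runs and stops it afterwards.
    // YourConfiguration is a placeholder for your application's Configuration subclass.
    @ClassRule
    public static final DropwizardAppRule<YourConfiguration> RULE =
            new DropwizardAppRule<>(YourApplication.class, "path/to/config/file");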
}

Run the test, and you’ll see your Dropwizard app starting up before your feature steps execute. It will then shut down automatically after the tests are finished. Here’s a good tutorial if you need help getting your step definitions wired up to your feature tests.

So far, so good. But there’s … Continue reading

Upgrading Dropwizard 0.6 to 0.7

  • Posted by Intent Media

At Intent Media we started using Dropwizard last year when creating an external API. Dropwizard is a great lightweight Java framework (they really exist!) for building RESTful web services, tying together Jetty, Jersey, Jackson, JDBI and a powerful metrics library.   

Everything was working great with the API in production until a few months ago, when we needed to run our application on both http and https to process SSL requests. Unfortunately, with Dropwizard 0.6 this wasn’t possible so we decided to upgrade to the next version, 0.7. Dropwizard 0.7 isn’t a drop-in replacement for 0.6 and we couldn’t find a decent resource online, so we decided to write this guide based on our upgrade experience.

The changes we had to make to upgrade to 0.7 were the result of three changes to Dropwizard. First, the resources moved to new packages. Second, core Dropwizard objects were renamed. Third, the way that the application is configured through code has changed.
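As a rough sketch of what the first two kinds of change look like at the import level, the helper below rewrites a few 0.6 import lines to their 0.7 equivalents. The class name is hypothetical and the table is a small, non-exhaustive sample of renames that should be checked against the Dropwizard release notes for your exact versions; it is not a complete migration tool.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that maps a handful of Dropwizard 0.6 imports to 0.7 names. */
final class DropwizardImportMigrator {
    // Assumed, non-exhaustive sample of renamed and relocated classes.
    private static final Map<String, String> RENAMES = new LinkedHashMap<>();

    static {
        RENAMES.put("com.yammer.dropwizard.Service", "io.dropwizard.Application");
        RENAMES.put("com.yammer.dropwizard.config.Configuration", "io.dropwizard.Configuration");
        RENAMES.put("com.yammer.dropwizard.config.Bootstrap", "io.dropwizard.setup.Bootstrap");
        RENAMES.put("com.yammer.dropwizard.config.Environment", "io.dropwizard.setup.Environment");
    }

    /** Rewrites a single "import a.b.C;" line if it is in the table; any other line is returned unchanged. */
    static String migrateImport(String line) {
        String trimmed = line.trim();
        if (!trimmed.startsWith("import ") || !trimmed.endsWith(";")) {
            return line;
        }
        String name = trimmed.substring("import ".length(), trimmed.length() - 1).trim();
        String replacement = RENAMES.get(name);
        return replacement == null ? line : "import " + replacement + ";";
    }

    public static void main(String[] args) {
        System.out.println(migrateImport("import com.yammer.dropwizard.Service;"));
        System.out.println(migrateImport("import java.util.List;"));
    }
}
```

Renaming imports is only the mechanical part; classes that extended the old `Service` type still need their method signatures and configuration code updated by hand.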

Structural Changes 

Class packages have all changed

Our previous ivy.xml file was quite sparse, but in 0.7 there are more distinct dependencies. More importantly, the package names changed from com.yammer.dropwizard and com.codahale.metrics to io.dropwizard and io.dropwizard.metrics.

0.6
<dependency org="com.yammer.dropwizard" name="dropwizard-client" rev="0.6.2"/>		    
<dependency 
Continue reading