The Purginator: Dynamically throttled writes to DynamoDB

  • Posted by Intent Media

https://github.com/intentmedia/purginator

The purginator was designed to enable dynamically throttled write operations against a local or remote DynamoDB database. It also has the ability to generate test data in a local database to validate the correctness of a planned operation.

Our initial use case was to purge several hundred thousand records from a database that was in production use, without exceeding the provisioned write capacity. DynamoDB can be an incredible tool, and we have found some great uses for it. Ensuring that your DynamoDB usage doesn’t exceed your provisioned capacity can be a challenge, especially when multiple applications share the same database. In this case, the purge operation could not be allowed to affect normal application activity, since we didn’t want our production application’s database access to be throttled or interrupted.
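To make the pacing idea concrete, here is a minimal sketch in JavaScript. It is not the purginator's implementation (which is written in Clojure); the function names, the safety margin, and the assumption that every purged item is at most 1 KB (so each delete costs one write capacity unit) are ours, for illustration only.

```javascript
// Hypothetical pacing helpers, not the purginator's actual code.
// Assumption: every item is at most 1 KB, so one delete costs one
// write capacity unit (WCU).

// Milliseconds to wait after a batch so that average throughput stays at
// `targetFraction` of the table's provisioned write capacity.
function delayMsForBatch(batchSize, provisionedWcu, targetFraction) {
  if (batchSize <= 0 || provisionedWcu <= 0) {
    throw new RangeError('batchSize and provisionedWcu must be positive');
  }
  if (targetFraction <= 0 || targetFraction > 1) {
    throw new RangeError('targetFraction must be in (0, 1]');
  }
  const allowedWritesPerSecond = provisionedWcu * targetFraction;
  return Math.ceil((batchSize / allowedWritesPerSecond) * 1000);
}

// The "dynamic" part: shrink our share when other applications are busy.
// `otherConsumedWcu` is assumed to come from monitoring; `safetyMargin`
// is the fraction of the remaining headroom that we leave unused.
function adjustedFraction(provisionedWcu, otherConsumedWcu, safetyMargin) {
  const headroom = Math.max(0, provisionedWcu - otherConsumedWcu);
  return (headroom * (1 - safetyMargin)) / provisionedWcu;
}
```

With 100 provisioned WCUs, 60 of them consumed by production traffic, and a 25% safety margin, `adjustedFraction(100, 60, 0.25)` leaves the purge 30% of the table's capacity. A caller would have to pause and re-check when the fraction drops to zero, since `delayMsForBatch` rejects a zero fraction.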

The purginator was developed in Clojure, using open source libraries including Faraday and Commons Math. Today, we’re open sourcing it, and we welcome comments, questions, bug reports, and pull requests.

John Chapin is a Data Engineer at Intent Media, where he is building the next generation of predictive analysis tools alongside a world-class team. When he’s not hacking on a functional programming project, he can be … Continue reading

Modeling Madly: Machine learning at hackathons

  • Posted by Intent Media

I recently wrapped up another hackathon here at Intent Media. You can see my summary of one of our previous hackathons here. At the past two hackathons, I’ve taken on a slightly different challenge than people usually go after in a hackathon: developing new machine learning (ML) models.

I’ve been working on data science and machine learning systems for a while, but I’ve found that doing so under extreme constraints can be a distinctly different experience. A good data hacker can easily find themselves with a great idea at a hackathon but with little to nothing to demo at the end. Accepting that my personal experience is just my own, let me offer three tips for building new models at a hackathon.

Time is not on your side.

When you’re building something like a web app at a hackathon, you can almost run out of time and still come up with something pretty good, as long as you get that last bug fixed before the demo. This characteristic of hack day is great to build into the plan for your web app hack, but it does not apply to a machine learning hack.

Think about what happens when you do … Continue reading

Acceptance Testing Data Collection: Asserting On Outbound Resource Requests Using CasperJS and PhantomJS

  • Posted by Intent Media

At Intent Media, we run ad networks, which means that on the data science team, we don’t just science data, we also collect it.

One way in which we accomplish this collection is through the use of beacons on our partners’ sites. And because we are nearly as excited about software testing as we are about statistical modeling and Hadoop-ing, we also want to do proper acceptance testing of our beacons.

Acceptance testing for us means that we need a way to assert that a browser, upon executing our JavaScript, will make a request to our beacon URL. It turns out that our pal PhantomJS has a handy method for handling this kind of thing, and our favorite PhantomJS wrapper CasperJS emits an event on resource requests to help us out.

Setting Up An Event Listener For Resource Requests

Armed with these tools, we can set up a listener on the 'resource.requested' event and do something useful when it happens:

casper.onResourceRequestFor = function(requestRegexp, callback) {

  var resourceListener = function(request) {
    if (request.url.match(requestRegexp)) {
      callback();
    }
  };

  this.on('resource.requested', resourceListener);
};
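To see the listener's matching logic in isolation, the sketch below attaches the same function to a small hand-written stand-in for the casper object. The `fakeCasper` object, its `emit` helper, and the example.com URLs are all hypothetical; a real CasperJS run would emit 'resource.requested' itself as the page loads.

```javascript
// Minimal stand-in for the casper object: just enough to register
// listeners and fire events by hand. Not part of CasperJS.
const fakeCasper = {
  listeners: {},
  on: function (eventName, listener) {
    if (!this.listeners[eventName]) {
      this.listeners[eventName] = [];
    }
    this.listeners[eventName].push(listener);
  },
  emit: function (eventName, payload) {
    (this.listeners[eventName] || []).forEach(function (listener) {
      listener(payload);
    });
  }
};

// Same listener as above, attached to the stand-in.
fakeCasper.onResourceRequestFor = function (requestRegexp, callback) {
  this.on('resource.requested', function (request) {
    if (request.url.match(requestRegexp)) {
      callback();
    }
  });
};

// Count requests that look like beacon calls (hypothetical URL pattern).
let beaconRequests = 0;
fakeCasper.onResourceRequestFor(/beacon\.example\.com/, function () {
  beaconRequests += 1;
});

// Simulate the browser requesting two resources; only one is the beacon.
fakeCasper.emit('resource.requested', { url: 'https://cdn.example.com/app.js' });
fakeCasper.emit('resource.requested', { url: 'https://beacon.example.com/pixel.gif?e=view' });
```

Only the second simulated request matches the pattern, so the callback runs once. An acceptance test can assert on exactly that kind of counter or flag.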

Now we can require our little library in our test, which we’ll write in CoffeeScript:

require('./casper.intentmedia.js')

And we can make … Continue reading

Cross Browser User Bridging with DynamoDB

  • Posted by Intent Media

Rationale & Background

As the web evolves, user identification has become key when it comes to research, privacy, product customization and engineering. Companies are always balancing the need to respect users’ data collection wishes with the product and economic benefits of providing a customized user experience. The days of relying on third-party cookies are gone. Web companies continue to need more effective, trustworthy ways of identifying visitors as previously seen customers or households.

Anonymous user shopping patterns form the core of our ecommerce predictive decisioning platform. We have several classification and regression models that predict the probability of conversion and expected purchase price. We also model a visitor’s expected CTR (Click Through Rate) to perform customized ad selection. To accomplish all this, site page visit history is saved in a “user profile” which represents the best view we have of a site visitor.

When we see a new user, we generate a new UUID and store it in a first-party cookie on a partner’s site. That becomes a partner-specific identifier for that user’s browser.
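The generate-or-reuse step can be sketched as follows. This is an illustration under our own assumptions, not the production code: the cookie name, the one-year lifetime, the cookie attributes, and the choice to do this server-side in Node.js are all hypothetical.

```javascript
const { randomUUID } = require('crypto');

// Hypothetical cookie name and lifetime; neither comes from the post.
const VISITOR_COOKIE = 'visitor_id';
const ONE_YEAR_SECONDS = 365 * 24 * 60 * 60;

// Parse a Cookie request header ("a=1; b=2") into a plain object.
function parseCookies(cookieHeader) {
  const cookies = {};
  (cookieHeader || '').split(';').forEach(function (pair) {
    const index = pair.indexOf('=');
    if (index > 0) {
      cookies[pair.slice(0, index).trim()] = pair.slice(index + 1).trim();
    }
  });
  return cookies;
}

// Reuse the browser's existing ID, or mint a new UUID for a first visit.
// `setCookie` is the Set-Cookie header value to send, or null if the
// browser already carries an ID.
function resolveVisitor(cookieHeader) {
  const existingId = parseCookies(cookieHeader)[VISITOR_COOKIE];
  if (existingId) {
    return { visitorId: existingId, setCookie: null };
  }
  const visitorId = randomUUID();
  const setCookie = VISITOR_COOKIE + '=' + visitorId +
    '; Max-Age=' + ONE_YEAR_SECONDS + '; Path=/; Secure; SameSite=Lax';
  return { visitorId: visitorId, setCookie: setCookie };
}
```

Because the cookie is scoped to the partner's own domain, the same person visiting a second partner, or using a second browser, would get a different UUID.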

While first-party visitor cookies are more effective than typical third-party advertising cookies at maintaining a persistent ID for a user, they suffer from … Continue reading

Clojure NYC: Vert.x Intro

  • Posted by Intent Media

Vert.x is a lightweight, high performance application platform for the JVM. It enables polyglot modules to communicate asynchronously across a shared event bus in a distributed environment. In this talk at the Clojure NYC Meetup, John Chapin gives an overview of Vert.x concepts, practical usage, and Clojure support. He also presents a live demonstration of several key Vert.x concepts in the context of an example client/server chat application.

The slides for the presentation can be found here and the code can be found here.

John Chapin is a Data Engineer at Intent Media, where he is building the next generation of predictive analysis tools alongside a world-class team. When he’s not hacking on a functional programming project, he can be found running along the Hudson River or planning his next trip abroad. John has a bachelor’s degree in computer science from James Madison University.

Summer @ Intent Media

  • Posted by Intent Media

I am a graduate student at Columbia University studying Operations Research (OR). I came to Intent Media (IM) as a data scientist intern this summer with excitement as well as a bit of nervousness. I have great interest in data science, and I wanted to see what it would be like to work in the area; however, I was anxious about whether I would be able to contribute well. Also, I had never worked at a startup before, so I was unsure as to what lay ahead of me. Plus, I am a little scared of dogs, so it did not help that I saw Ben as soon as I walked into the office.

[Image: Ben]

I was given a Macintosh laptop, something I had never used before (I mean c’mon, Apple is a fruit!), and was asked to set up everything myself. It is with a bit of abashedness that I admit that I have never been good at figuring out where wires must go. Yes, I have an engineering background, but I only learned how to efficiently fix bugs I created myself, when I was a software engineer. Also, the enormous amount of technical jargon that was flying … Continue reading

Random Assumptions will make an A– Part 2

  • Posted by Intent Media

Overview

In our previous post, Random Assumptions can make an A– Part 1, we discussed a solution that we built to generate sticky random numbers. In this blog post we’ll examine in detail the bug that we introduced, our efforts to fix it, and our final solution to the problem.

The Initial (not so good) solution

We wanted to generate several sticky random numbers for a user based off a user_id for our multivariate tests. We used a combination of the user_id, the multivariate test attribute id, and some salt to generate a seed for the Random generator. Our solution looked something like this:

// Seed java.util.Random with a hash of the user id plus a salt
int seed = String.format("%s some salt", userId).hashCode();
double diceRoll = new Random(seed).nextDouble();

Having implemented this we patted ourselves on the back and went for a beer.

A few weeks later we found out that our solution was buggy. During an analysis of some data, we found a high degree of covariance between our tests. What is covariance, you might ask? A formal definition is available at Wikipedia of course, but we can demonstrate the problem using our original example.

We have a test with two designs Design … Continue reading

Machine Learning with Spark MLlib on Elastic MapReduce

  • Posted by Intent Media

At Intent Media we use Amazon Elastic MapReduce with Spark for some of our data processing and large scale machine learning tasks. Here we share some details and an example of how to set up Spark 1.0.0 MLlib on EMR for machine learning.

There are materials available that explain how to set up Spark 0.8.1 for data processing on EMR. To demonstrate running a machine learning job with MLlib, we created a project that adds a couple of additional elements to those examples:

  1. A Scala main class for a simple EMR-based machine learning job. This class follows Spark’s built-in BinaryClassification example closely, with small changes that allow the user to specify the Spark executor memory (something we have found useful on larger datasets; it defaults to 512m), and a modified SparkConf setup to work with EMR.
  2. Spark 1.0.0 support. Spark’s MLlib module has seen a number of improvements in recent releases, including sparse feature vectors, decision trees, naive Bayes classification, and distributed linear algebra routines for SVD and PCA. The 1.0+ jars available on the Spark website unfortunately do not work out of the box on EMR, but for previous releases AWS has provided EMR-compatible jars in the S3 bucket at … Continue reading

TL;DQA

  • Posted by Intent Media

I am a Quality Assurance Engineer here at INTENT MEDIA. Here is a little video about how I became a QA: a unique tale of a cockeyed optimist and her unbridled enthusiasm. Here is IM’s YouTube channel, which has MANY other gems, and you can learn more about me here.

There are a lot of interesting things about being a QA, but my favorite is the necessity to think big picture, so that the dev is free to focus on the task at hand. I like thinking through the branching paths of possibility that a new feature can create for the user to follow, then using logic to pare down those branches so that my test plan is optimally thorough, yet succinct and non-redundant.

In fact this strategy reminds me a lot of the approach I took to studio architecture, which I majored in in college, and math, my minor. That way of thinking exemplifies what I like about QA, architecture, math, Virginia Woolf, Walt Whitman, what I know of Buddhism through Wu Tang, and pretty much my general outlook on the world.

Proof of the System

My philosophy is that there is an underlying system in everything that scales … Continue reading

Hack Day: March 2014

  • Posted by Intent Media

One of the best things about working in tech at Intent Media is the fun we all have while working. I’m not just talking about the obviously fun stuff like playing foosball or watching the dogs chase their own shadows. We also seem to have an inordinate amount of fun battling with checkstyle and debugging EMR jobs. I think it’s a great sign for our future how much we all seem to enjoy doing the stuff that is, in fact, our real jobs.

All of this goes double for when we’re really trying to have a bit of fun with things during my favorite regular IM event, Hack Day. We have Hack Days about once every quarter. The goal is to work on something outside what we would normally do. Of course, we’ve had lots of great products come out of Hack Days and make it into customers’ hands. We’ve also had a ton of success with other Hack Day projects that focused on improving our office, our kitchen, and just our experience at work. This latest Hack Day was no exception to our long track record of successful Hack Days with tons of awesome projects. Here’s a recap.

Day 0: … Continue reading