- Posted by Intent Media 11 Nov
I recently wrapped up another hackathon here at Intent Media. You can see my summary of one of our previous hackathons here. These past two hackathons I’ve taken on some slightly different challenges than people usually go after in a hackathon: developing new machine learning (ML) models.
I've been working on data science and machine learning systems for a while, but I've found that doing so under extreme constraints can be a distinctly different experience. A good data hacker can easily find themselves with a great idea at a hackathon but little to nothing to demo at the end. With the caveat that this is just my own experience, let me offer three tips for building new models at a hackathon.
Time is not on your side.
When you're building something like a web app, you can almost run out of time at a hackathon and still come up with something pretty good, as long as you fix that last bug before the demo. That characteristic of a hack day is great to build into the plan for your web app hack, but it does not apply to a machine learning hack.
Think about what happens when you do find that last bug in a machine learning project. You might still need to:
1. Reprocess your raw data.
2. Regenerate your signals.
3. Retrain your model.
4. Evaluate your new model on sample data.
5. Calculate performance statistics.
6. Draw your conclusions.
You can’t just hit refresh. Even with a well-oiled workflow, some of those tasks can take all of the time scheduled in your average one-day hackathon. Take #3, for example: training a production-grade model using, say, Hadoop can take a long time, even if you have the cash to spin up a fair-sized cluster of EC2 instances.
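The steps above form a chain in which any upstream fix forces everything downstream to rerun. Here's a toy, pure-Python sketch of that shape (all names and data are illustrative, and the "model" is deliberately trivial); the point is that fixing a bug in any one function means re-executing every function after it:

```python
# Hypothetical sketch of the "fix a bug, rerun everything" loop.
# Each step must rerun after any upstream change -- no refresh button.

def reprocess_raw_data(raw):
    # Step 1: parse/clean raw records.
    return [r.strip().split(",") for r in raw]

def regenerate_signals(rows):
    # Step 2: turn cleaned rows into numeric features and labels.
    X = [[float(x) for x in row[:-1]] for row in rows]
    y = [row[-1] for row in rows]
    return X, y

def retrain_model(X, y):
    # Step 3: fit a trivial majority-class "model".
    majority = max(set(y), key=y.count)
    return lambda features: majority

def evaluate(model, X, y):
    # Steps 4-5: score on sample data, compute accuracy.
    correct = sum(model(x) == label for x, label in zip(X, y))
    return correct / len(y)

raw = ["1.0,2.0,yes", "0.5,1.5,yes", "3.0,0.1,no"]
rows = reprocess_raw_data(raw)
X, y = regenerate_signals(rows)
model = retrain_model(X, y)
print(evaluate(model, X, y))  # 2/3 on this toy data
```

In a real project, each of those functions can take minutes to hours rather than milliseconds, which is exactly why the refresh-loop mentality doesn't transfer.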
What that means for your hack can vary, but you're asking for trouble if you don't account for latency in the scope and goals of your project from the start. A solid project design is crucial if you hope to complete all of the little steps involved in getting your model ready to demo.
Which leads me to my next point…
Bite off less than you can chew.
One of the best things about working in data science is all of the really smart people. But, of course, the corollary is that one of the worst things about working in data science is all of the really smart people. Sharp engineers and data scientists can take the nugget of an idea and envision a useful, powerful suite of products that would take years to build. This obviously is not so useful when you only have a day or two.
Mature dataists know just how much ambition is too much and plan accordingly. I happen to be lucky enough to work with some very smart and very mature data scientists and engineers, so this has not been a problem in my last few hacks. But I'm just lucky that way. You might not be so lucky.
Unrealistic ambitions are a constant danger in a machine learning hack. These ambitions run along the edge of all activities like a precipice beckoning you to dive off and see where you land. If you take one thing away from this post, let it be this: don’t dive off the cliff. Just don’t do it. You won’t like where you land. You’ll wind up with more questions than answers and you’ll have nothing to show come demo time. Moreover, your fellow developers who worked on apps and not ML models will simply not understand what you spent your time on.
What does a precipice look like? It could be a novel distance metric. It could be a fundamental improvement to a widely used technique like support vector regression (SVR). Or it could be something really benign-sounding, like a longer training set. I would say that even choosing to pose the problem as regression instead of classification could qualify. Having something like a confusion matrix instead of mean squared error (MSE) as your guide can give you a lot of focus for some quick iterations on your model. That's true even if your problem feels like it might be better posed as a regression problem…given more time.
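To make the contrast concrete, here's a tiny sketch on made-up predictions. A confusion matrix tells you *which* kind of mistake the model is making (false positives vs. false negatives), which is immediately actionable; MSE collapses everything into one number:

```python
from collections import Counter

# Toy labels and predictions, purely illustrative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

# Classification framing: a 2x2 confusion matrix pinpoints the
# kind of error -- here, one false positive and one false negative.
counts = Counter(zip(y_true, y_pred))
tp, fn = counts[(1, 1)], counts[(1, 0)]
fp, tn = counts[(0, 1)], counts[(0, 0)]
print([[tn, fp], [fn, tp]])  # [[2, 1], [1, 2]]

# Regression framing: MSE gives one aggregate score, with less
# guidance about what to fix on the next iteration.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
print(mse)
```

The classification framing hands you a next step ("attack the false positives"); the regression framing mostly hands you a number to stare at.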
The danger of ambition originates in the intrinsic tension between the rigorous and exploratory mode of academic data science/machine learning education and the pedal-to-the-metal pace mandated by a hackathon. They are different modes of working, and you will have to suspend some of your good habits for a day or so, if you want to have something to demo.
Use lightweight tools.
This last point can be the trickiest to put into practice, but I think it can totally be the difference between a project that feels like a hack and one that feels like just getting warmed up on a weeklong story. Even if you've scoped your project appropriately and designed something you can build in a day or two, you can still fail to build it. The difference can easily come down to technology choices.
For example, I currently work mostly in Cascalog, Clojure, and Java on top of Hadoop to process files stored in S3. I know these tools well enough to pay my rent, but I would definitely think carefully about trying to use any of them in a tight-paced context. I have spent weeks trying to understand a single Cascalog bug. Seriously.
Much of this is down to the inherent difficulty of MapReduce and distributed systems, but it’s dangerously hairy territory any way you look at it. Clojure could be an incredibly good prototyping tool in this context, but the difficulty is going to stem a lot from the available libraries. With a young JVM language, it’s pretty common to find yourself reaching for a library that only exists in Java. This forces you to get into interop, which is a tricky enough topic on its own.
As for using Java, I think it's a reasonable choice, but much of what Java and its ecosystem are trying to do for you with things like static typing, class hierarchies, etc. is focused on building software that can last. Remember, you'll probably trash your result in mere hours.
If you know the language, Python offers an unbeatable value proposition for this use case. scikit-learn has nearly everything you could imagine needing. pandas, NumPy, and SciPy are all sitting there to be brought in when appropriate. And don’t forget how awesome it can be to prototype in a purpose-built exploratory development environment like IPython.
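As a sketch of that value proposition: with scikit-learn, an entire train-and-evaluate prototype can fit in a dozen lines. The dataset and model choice here are just illustrative stand-ins for whatever your hack actually needs:

```python
# A minimal scikit-learn prototype -- the kind of loop that fits
# comfortably inside a hackathon timebox.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a built-in toy dataset (stand-in for your real data).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit a simple classifier and score it on held-out data.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # holdout accuracy
```

Swapping in a different model is a one-line change, which is exactly the kind of iteration speed a one-day hack demands.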
But this is machine learning, and sometimes our data is just big. Maybe even web scale. Some people hate these phrases, but they serve a purpose. We don’t all use Hadoop out of a love for complex Java applications.
Data Science is statistics on a Mac.
— Big Data Borat (@BigDataBorat)
@BigDataBorat is big data stats on a Mac Pro?
— Duncan Foster (@GoneCaving)
Big data is not just statistics on a Mac Pro, although it can often look like that. Scale can be a real necessity even in a hackathon.
When it is, there are no easy answers. If you’re lucky, maybe you can actually work with multiple hour model learning times. If you’re really lucky, you might be using Spark and not Hadoop, in which case it might not take hours to learn your model.
My point is that, insofar as you have a choice, choose the leaner, meaner tool: the one that lets you do more with less input required from you. Don't use that C++ library that promises awesome runtime but whose Python bindings you've never tried. You'll never figure out its quirks in time. Write as little data-cleanup code as you can manage. Commands like dropna can save you precious minutes to hours.
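For instance, pandas' dropna replaces a whole hand-written loop of null handling with one call (the column names here are made up):

```python
import pandas as pd

# Illustrative frame with missing values scattered across columns.
df = pd.DataFrame({
    "clicks": [3.0, None, 7.0],
    "price": [1.2, 0.8, None],
})

# One line of cleanup: keep only fully populated rows.
clean = df.dropna()
print(len(clean))  # 1
```

It's a blunt instrument (you may throw away rows you'd keep in production), but bluntness is exactly the trade a hackathon rewards.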
And if you can get your data from a database or an API instead of files, then, for the love of Cthulhu, do it. Hell, even if you have to load your data from files to a database first, it might be worth your time. SQL is one of the highest-productivity rapid-prototyping tools I know.
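Even Python's built-in sqlite3 module is enough to get that productivity. This sketch (table and columns are made up) loads some rows into an in-memory database and lets one GROUP BY do the work of a hand-rolled aggregation loop:

```python
import sqlite3

# In-memory SQLite database: zero setup, instant SQL prototyping.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user_id INTEGER, n INTEGER)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [(1, 3), (1, 5), (2, 2)])

# One SQL statement replaces a loop of grouping/summing code.
rows = conn.execute(
    "SELECT user_id, SUM(n) FROM clicks "
    "GROUP BY user_id ORDER BY user_id").fetchall()
print(rows)  # [(1, 8), (2, 2)]
```

Once the data is in a table, every new question is one query away instead of one script away.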
And while Hadoop is tough to use as a rapid-prototyping tool, there are even ways of making that elephant run a bit faster. Depending on what you're doing, Elastic MapReduce or PredictionIO can get you to the point of being productive much faster.
I could have danced all night.
I love hackathons and their variations. They remind me of the fun old days in grad school, furiously hacking away to come up with something interesting to say about definitionally uncertain stuff.
The furious pace and the pragmatic compromises are part of the fun. Compared to things like pitch events, hackathons have far fewer problems (even if they have their issues as well). At their best they're about the love of unconstrained creation. I've tried to do machine learning hacks because it's just so damn cool to go from zero to having a program that makes decisions. It amazes me every time it works, and doubly so when I can manage to get something working on a deadline.
Taking on a challenge like building a new model in a hackathon is also a great learning experience, especially if you get to work as part of a strong team. Machine learning in the real world is an even larger topic than its academic cousin, and there are always interesting things to learn. Hackathons can be great places to rapidly iterate through approaches and learn from your teammates how to build things better and faster. That's pretty likely to come in handy sometime.
Jeff Smith is a data engineer at Intent Media working on large-scale machine learning systems. He has a background in AI and bioinformatics. Intent Media is the fifth startup he's worked at, and it's easily the most fun one. Jeff has a master's degree in computer science from the University of Hong Kong. You can find him on Twitter, Medium, and his own blog.