If you are looking for the smartest data scientists to help you with a project, the go-to place is Kaggle.com, where they “make data science into a sport.” More than 82,000 people from 100 countries have signed on, and many of them have submitted at least one entry to the more than 250 contests held since it opened its doors in 2010.
I have to say that I am a big fan (from afar) of Kaggle, mainly because of my training. One of my hardest but most fun classes when I was an engineering graduate student was a class in building mathematical models, which is what we called data science back in the day.
Each Kaggle problem set is run as a competition, with prizes, deadlines, and rules aplenty. Kaggle takes a percentage off the top to administer the contest. It has a blue-chip roster of customers who also conduct privately sponsored contests. “This is because some of their data is too sensitive to be public,” Kaggle CEO Anthony Goldbloom told me. Examples include Microsoft, which used Kaggle to improve gesture recognition on the Xbox; NASA, for better dark matter imaging tools; and GE, for more accurate airline arrival time estimation.
Kaggle offers “companies a cost-effective way to harness the cognitive surplus of the world’s best data scientists,” according to its website. “There are some pretty amazing people who compete,” says Goldbloom. “And some enter 80 or more times per contest, devoting a lot of their time.” Even Goldbloom has tried his hand at a few, although he isn’t highly ranked.
Kaggle has been so successful that other contest providers have come online, including India-based CrowdAnalytix.com, Innocentive.com for the life sciences, and TunedIT.org, mainly for education and research projects. But Kaggle has been around the longest and has the largest talent pool to draw on.
Here are five contests that are somewhat off the beaten path and illustrate the depth and breadth of Kaggle’s reach and influence.
1. Identify the best-performing models to predict personality traits based on Twitter usage.
This contest awarded just $500, but almost 100 teams entered, showing that it isn’t always about the dough. One of the top entries was from Jason Karpeles, a marketing forecaster from Texas who ranks in the top ten of all Kagglers and has participated in 36 different contests. I spoke to him about his accomplishments. Karpeles isn’t your typical data scientist: he has economics degrees and an MBA from Duke and works in marketing. “I don’t know if it is impressive or pathetic the number of contests that I have entered,” he said. He signed up early in Kaggle’s history and admits that he is “obsessed with the site.” What is interesting is that his total dollar winnings are minuscule, especially when you compare them to the total time he has spent on various contests. On one contest that drew more than a thousand entrants, he spent many hours working on the problem.
Why enter so many contests? Mainly for his own self-education. “Being in a Kaggle contest is a lot like getting a post-graduate education,” he said. “It is also a good way to sharpen my skills, expand my knowledge and see how to manipulate particular data sets that I don’t often come into contact with. I was afraid that I might fall behind in the marketplace because data science is moving so quickly.”
Karpeles also mentioned something that is very interesting. “I am very introverted, and I don’t market myself very well, so this has been a way for me to get out there. Kaggle has been great for me to see how I perform globally across industries.” He tells potential contestants to just “get out and start doing something, just to try it. Don’t be afraid of failure, or your ranking. Experience is the best teacher.”
During World War II, the science of operations research got its start in efforts to track German submarine movements and keep Allied ships from getting torpedoed. So it is somewhat fitting that a current Kaggle contest, which ends in April, is doing something similar. Only this time, instead of German subs, contestants are analyzing audio recordings of whales to keep transatlantic ships from hitting them. Cornell University’s Bioacoustic Research Program has extensive experience in identifying endangered whale species and has deployed a 24/7 buoy network to steer ships away from collisions with the last 400 of a particular species of whale. The contest will pay out $10,000 for the best detection algorithm, and so far 137 teams are hard at work on it, including two graduate students who have inevitably called their team Free Willyzx and another team named Herman Melville.
A holiday-themed contest paid out $3,000 to a Slovakian entrant and was a bit of fun. “Santa needs help choosing the route he takes when delivering presents around the globe. Every year, Santa has to visit every boy and girl on his list. It’s a tough challenge, and Santa admits he scored a B- on his combinatorial optimization final.” The winner had to find two shortest-distance paths through a route of chimneys.
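For readers curious what a route-finding entry might look like, here is a minimal Python sketch of one common starting point, the nearest-neighbor heuristic. It is an illustration only, not the winner's method: it builds a single path rather than the two the contest asked for, it gives a quick but usually not optimal answer, and the `chimneys` coordinates and function names are made up for the example.

```python
import math


def path_length(points, order):
    """Total straight-line distance of visiting points in the given order."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))


def nearest_neighbor_path(points, start=0):
    """Greedy heuristic: from the current stop, always go to the closest
    unvisited stop. Fast, but generally not the shortest possible path."""
    unvisited = set(range(len(points)))
    unvisited.remove(start)
    order = [start]
    while unvisited:
        last = points[order[-1]]
        # Break distance ties by index so the result is deterministic.
        nxt = min(unvisited, key=lambda i: (math.dist(last, points[i]), i))
        unvisited.remove(nxt)
        order.append(nxt)
    return order


# Hypothetical chimney coordinates, invented for this example.
chimneys = [(0, 0), (5, 0), (1, 0), (3, 0)]
order = nearest_neighbor_path(chimneys)
print(order, path_length(chimneys, order))
```

A serious entry would treat a path like this as a first draft and then improve it with local search, since the greedy choice at each stop can leave long detours for the end of the route.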
How many of us have been insulted by a comment posted online? What, are you stupid or something? Exactly. So this contest was to predict when something would be considered insulting to someone else. Or as the contest introduction states, “create a generalizable single-class classifier which could operate in a near real-time mode, scrubbing the filth of the Internet away in one pass.” It wasn’t all that altruistic. Security vendor Impermium sponsored the contest. They were looking to “identify new ways to defend against malicious language and social spam online, and help clean up the web by scrubbing away unwanted obscenities from user-generated content.” Not surprisingly, the competition found that people tend to be most abusive between 9:00 pm and 10:00 pm.
This was big money, with a prize of $10,000, and it drew 50 entries. The winner was Vivek Sharma, who has entered numerous Kaggle contests. He and other top finishers were offered a job interview at the company along with the prize purse. While the company ultimately did not hire anyone, “the Kaggle competition was useful and we were able to examine many interesting algorithms,” said its PR rep via email. The contest gave the engineering team a fresh perspective on the problem and “helped ensure against tunnel vision.”
An automated essay-scoring competition held last year and sponsored by the William and Flora Hewlett Foundation awarded its top prize of $60,000 to a team called “SirGuessalot,” whose entry could match the average of two human teachers grading high school essays. The team submitted more than 140 different attempts before winning the top prize. “It almost sounds like science fiction,” says Goldbloom.
Maybe some of these will stimulate your imagination and get you to try your hand at a contest. Good luck!