Data Science Roundup #50: A Superintelligent Hedge Fund, 100 Billion Lessons Learned, and the Rise of the Digital Fingerprint

By The Data Science Roundup • Issue #50

This week's best data science articles
Numerai is synthesizing machine intelligence to command the capital of an American hedge fund using crowdsourcing and ensembles. This is interesting stuff—I don’t feel qualified to have an opinion on whether it’s as big as they claim, but you definitely need to read this. Here are two more great posts from the team if you’re a finance data geek.
If you’re anything like me, you know that deep learning isn’t a new concept and that its roots go back many decades. You’re probably familiar with the name Marvin Minsky, and you know that today’s advances rely on theoretical breakthroughs from the past. This article fills in the gaps, giving you a primer on the history of what may be the most important technology of our time. And it’s only a six-minute read. Do yourself a favor: read it.
These Python libraries will make the crucial task of data cleaning a bit more bearable—from anonymizing datasets to wrangling dates and times. I’m personally going to check out PrettyPandas, as I definitely need more formatting control over the data tables I output for my clients.
This post, written by the VP Engineering at Stitch, goes deep into the challenges faced in building a data pipeline that has delivered 100 billion records over its first 10 months. My personal takeaway: data engineers may sometimes be too quick to incorporate open source tools. The Stitch team has now removed Spark from their stack and reduced their usage of Kafka. Very interesting lessons that may save you thousands of engineering hours.
We all know that asking questions is an important skill, but have you ever had someone actually attempt to teach you how to ask a good question? Much of data science is figuring out how to formulate the best possible question, and as it turns out, you might have a lot to learn. 
Much of the source data for data science comes from user clickstream data. And the key to the entire clickstream is the cookie: a clever hack invented in the ’90s. But in the ad tech industry, cookies are gradually being pushed aside in favor of fingerprinting. Read this article if you do (or plan to do) any work with clickstream data.
Data viz of the week
Far better usage of this chart type than GA's Behavior Flow tab.
Pay it forward!
I curate the Roundup on my nights and weekends because of the amazing support I get from readers. Know any data scientists who would enjoy reading it? Please send them here (or forward this email). Thanks!
Thanks to our sponsors :D
Fishtown Analytics is a boutique analytics consultancy serving high-growth, venture-funded startups. Have analytics questions? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The Data Science Roundup
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.