This is really one of my favorite articles on ML ever. There are so many walkthroughs on how to throw a bunch of code together that roughly accomplishes a goal; there are far fewer guides on how not to screw it up (which requires knowledge of the Real World). That gets you here:
You end up with a project where the metrics randomly jump up or down, do not reflect the actual quality, and you are not able to improve them. The only way out is to rewrite the entire project from scratch. That is when you know: you shot yourself in the foot with a bazooka.
The post touches on three particularly common areas of technical debt: feedback loops, correction cascades, and “hobo features”.
The more time we at Fishtown Analytics spend on data science, the more interested I get in all of the non-algorithmic parts of the process. This just-released post summarizes it incredibly well:
Building and optimizing the predictor is easy. What is hard is finding the business problem and the KPI that it will improve, hunting and transforming the data into digestible instances, defining the steps of the workflow, putting it into production, and organizing the model maintenance and regular update.
So good. Every product is a data product, and classical UX skills need to be augmented (quickly) with much deeper data skills in our PMs.
Working with data at the core of a product requires a level of understanding of data modeling, data infrastructure, and statistics and machine learning. It goes beyond understanding the results of experiments and reading dashboards — it requires a deep appreciation for what is possible and what will soon be possible by taking full advantage of the flow of data.
The author has built an impressive set of benchmarks comparing Theano, TensorFlow, and CNTK, running on three different GPUs. His summary:
The accuracies of Theano, TensorFlow and CNTK backends are similar across all benchmark tests, while speeds vary a lot.
Relevant if you’re making production decisions today, but potentially more valuable as a way to follow the evolution of the space. In other high-level languages, the broad trend is to sacrifice execution efficiency for programmer efficiency. Given the intense computational needs of deep learning, it’s not clear that things will play out the same way here.
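If you want to run a comparison like this yourself, here's a minimal sketch. Keras selects its compute backend from the `KERAS_BACKEND` environment variable at import time, so each backend needs a fresh process; the benchmark script name below is a hypothetical stand-in for whatever model you want to time.

```python
import os
import subprocess
import time

def time_backend(backend, cmd):
    """Run the same Keras script under a given backend and return wall time.

    Keras reads KERAS_BACKEND once, when it is first imported, so the
    cleanest way to compare backends is to launch a fresh Python process
    per backend with the variable set in its environment.
    """
    env = dict(os.environ, KERAS_BACKEND=backend)
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start

# Hypothetical usage — "mnist_mlp.py" stands in for any benchmark model:
# for backend in ("tensorflow", "theano", "cntk"):
#     print(backend, time_backend(backend, ["python", "mnist_mlp.py"]))
```

Wall-clock timing of a whole process is crude (it includes import and compilation overhead, which differs a lot between these backends), but it's enough to reproduce the broad speed gaps the article reports.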
This is insanely cool. The push for open data in government has been happening for a while, with plenty of cool results. Frequently, though, CSV datasets reside on dingy webservers waiting for an analyst to download them with an R script. This is certainly better than nothing, but it’s far from open data living up to its true potential. For that to happen, open data needs to be active, integrated into our lives.
This If This Then That project does just that: it lets people turn open datasets into active tools. It’s a glimpse of the beginning of a very big trend.
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.