I recently linked to Part I of this guide, and Part II is even better. It covers designing star schemas, partitioning data (an oft-overlooked topic!), Airflow design patterns, and ends with a top list of best practices.
This is an absolute must-read—probably the single best data eng post I’ve ever read. I can’t add anything; just read it.
We are proud to announce the beta release series of JupyterLab, the next-generation web-based interface for Project Jupyter.
This is a big deal. Jupyter has millions of users, 1.7 million public notebooks on GitHub, and support for over 100 languages. While there are many notebook products today, the sheer size of Jupyter’s community means that it matters.
I haven’t had a chance to play around with JupyterLab yet but plan to in the near future; would love to hear your thoughts / reactions.
Lots of good insights in here on Stitch Fix’s model of hiring / training “full stack data scientists”: data scientists that can create a mathematical model, write production code, and maintain the model in production. While it’s probably not possible to hire a whole team of people who can do everything, it is possible to train new hires across the entire stack.
Also, completely agree on explicit data team representation at the C level:
When data science is your company’s competitive edge, having a CAO represent data science at the executive level is more effective than being represented by an engineering head, like a CTO. The CAO has a deeper and more nuanced understanding of data science(…)
Bunch of good stuff from Stitch Fix this week. This post announces a new open source project called Flotilla:
Today we’re excited to introduce Flotilla, our latest open source project. Flotilla is a human friendly service for task execution. It allows you to focus on the work you’re doing rather than how to do it. In other words, Flotilla takes the struggle out of defining and running containerized jobs.
Related to the above post, this is the type of tooling that allows data scientists to deploy production code without investing all of their time in devops.
[Tensor Comprehensions] will allow researchers and programmers to write layers in a notation that is similar to the maths they use in their papers and communicate concisely the intent of their program. They will also be able to take that notation and translate it easily into a fast implementation in a matter of minutes rather than days.
Really cool mechanism to generate fast code from high-level network descriptions.
Algorithmia just implemented a neural network in Solidity, the Ethereum scripting language. Is it a Good Idea to have a neural network running on-chain? I’m not smart enough to answer that question, but it certainly is novel and noteworthy. This paragraph was amusing:
And with a moment’s notice, 22 thousand machines ran the first neural network on the Ethereum blockchain. What looked like machine code to these everyday miners, was actually a fully functioning neural network. Feb 15th was a good day.
We are looking for someone to fill an upcoming gap in our business model. We are not exactly sure what you will be doing, but we are sure that our shareholders will love the idea that we have Data Scientists. You will report to someone that does not understand what you do, and you will often be met with skepticism when you present your solutions to management.
Amusing, poignant. I think the article overstates the point just a bit (technical skills are important!) but it’s a worthwhile and memorable read.
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.