This post is a gold standard for the kind of information you should expect to have shared with you when applying or interviewing for a data science role. The failure modes for DS often arise out of a failure to answer these questions well, and sharing them publicly with every applicant up front is an extremely strong signal.
- How is Data Science structured?
- What sort of problems do you work on?
- How do you decide what to work on?
- What sort of tech do you use?
Are you hiring a data scientist right now? You should write your own version of this doc. Are you job hunting? You should demand one.
You’re getting two posts from Neal Lathia of Monzo this week! (It had been a while since I caught up on his blog…) This is a fantastic post about how to create hybrid ML/rules-based systems, which often end up being both more robust and far simpler to build.
I recently wrote up an internal doc as a guide when there is appetite to add machine learning into an existing rule engine. This blog post pulls out three key questions from it: a) Can we use machine learning in a rule engine? b) How and where do we add the “machine learning bit?” c) How do we architect this type of system?
My takeaway: start with rules, then replace individual decision nodes with narrow ML models when doing so improves the system's performance. I like this approach a lot; it's distinctly boring yet practical :)
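The pattern is simple enough to sketch in a few lines. Everything below is hypothetical (names, thresholds, the shape of the input); the point is just the structure: deterministic rules first, with one decision node delegated to a narrow model.

```python
def looks_risky(score, threshold=0.8):
    """Stand-in for a narrow ML model. In practice this would wrap a
    trained classifier's predicted probability for this one decision."""
    return score >= threshold

def decide(txn):
    # Deterministic rules handle the easy, auditable cases first.
    if txn["amount"] <= 0:
        return "reject"       # malformed input
    if txn["amount"] < 10:
        return "approve"      # too small to be worth a review
    # A single decision node in the rule engine delegates to the model.
    if looks_risky(txn["model_score"]):
        return "review"
    return "approve"
```

The nice property: most decisions remain fully explainable rules, and the model's blast radius is limited to the one node it replaced.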
The folks @ Elementl built a first-class dbt-Dagster integration! It’s pretty great—I got a live demo from the data team at Good Eggs recently. In short, you can:
- Explicitly model the dependencies between your dbt models and processes that use other technologies;
- Schedule and execute pipelines that include both dbt and other technologies;
- Monitor your dbt models in the same tool you use to monitor your other processes, with historical and longitudinal views of your operations.
The monitoring part is particularly novel and useful IMO. Having great processes around metadata capture and analysis is really the next level of maturity of the dbt workflow—with that information, you can direct your attention as a maintainer so much more effectively.
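To make the first bullet concrete (deliberately *not* the Dagster API, just a stdlib toy): the core idea is that dbt models and non-dbt processes become nodes in one dependency graph, which the orchestrator executes in topological order.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Toy pipeline: each key maps to the set of steps it depends on.
# Hypothetical step names; "dbt:" marks steps that would run dbt models.
pipeline = {
    "extract_events": set(),                # Python ingestion step
    "dbt:stg_events": {"extract_events"},   # dbt staging model
    "dbt:fct_orders": {"dbt:stg_events"},   # dbt mart model
    "train_model":    {"dbt:fct_orders"},   # downstream ML step
}

order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['extract_events', 'dbt:stg_events', 'dbt:fct_orders', 'train_model']
```

Dagster layers scheduling, execution, and monitoring on top of exactly this kind of graph; the integration's contribution is that dbt models show up in it as first-class nodes.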
Really neat. How do you reduce a millions-of-pixels image down to a set of component colors and have those colors represent something useful? The above image does a great job of showing why the problem is hard: both the left and the right candidate solutions are “wrong,” but for different reasons.
I really enjoyed the narrative of how the team went through the construction of this solution. Also something neat about this post: the author included a section for “what else we tried.” So rare and yet so useful!
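I won't spoil their solution, but the classic baseline this problem statement starts from is k-means clustering over pixel colors. A toy sketch (plain RGB tuples, naive initialization; real palette extractors, including theirs, are more sophisticated):

```python
import random

def kmeans_palette(pixels, k=3, iters=10, seed=0):
    """Reduce a list of (r, g, b) tuples to k representative colors
    via naive k-means. Assumes at least k distinct colors exist."""
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(pixels)), k)  # k distinct seed colors
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pixels:
            # Assign each pixel to its nearest center (squared distance).
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        for i, c in enumerate(clusters):
            if c:  # recompute each center as the mean of its cluster
                centers[i] = tuple(sum(ch) / len(c) for ch in zip(*c))
    return centers
```

The "why it's hard" part of the post is precisely that this baseline optimizes pixel-count dominance, not perceptual usefulness, which is why both candidate solutions in the image can be "wrong."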
Ok, this is cool. A team at Google built a detector for sign language—not to translate that sign language into audio or text, but just to detect that a given set of frames was in fact sign language. Then, they do something that I wouldn’t have thought of:
When the sign language detection model determines that a user is signing, it passes an ultrasonic audio tone through a virtual audio cable, which can be detected by any video conferencing application as if the signing user is “speaking.” The audio is transmitted at 20kHz, which is normally outside the hearing range for humans. Because video conferencing applications usually detect the audio “volume” as talking rather than only detecting speech, this fools the application into thinking the user is speaking.
How frustrating it must be to be on a call with a bunch of folks signing only to have the camera never focus on the correct person! I like this solution so much because it’s excellent problem and intervention selection: a beautiful example of how to improve a product using ML.
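For the curious, generating that tone is straightforward (a hypothetical sketch, not Google's code): a 20kHz sine sits just under the 22.05kHz Nyquist limit of standard 44.1kHz audio, so it survives the audio pipeline while staying above typical adult hearing range.

```python
import math

def ultrasonic_tone(freq_hz=20_000, sample_rate=44_100, seconds=0.05, amplitude=0.5):
    """Generate samples for a near-ultrasonic sine tone. In the demo's
    setup, a tone like this is routed through a virtual audio cable
    whenever the sign language detector fires, so the conferencing app
    registers 'volume' and treats the signer as the active speaker."""
    n = int(sample_rate * seconds)
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n)]
```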
Facebook finally joins Airbnb, Lyft, Netflix, and Uber (and many others) in creating its own in-house data catalog. If you’re following the space (as I very much am), this isn’t a revolutionary release—it’s hitting on the same themes as other similar in-house products. And because it’s built on top of Facebook’s proprietary social graph search utility, Unicorn, it’s unlikely to be open sourced at any point.
There are a lot of nice touches though. Here’s my favorite paragraph:
Nemo indexing is generally aware of our data ecosystem. For example, if a data pipeline duplicates a column into a downstream table, the original column’s description and the upstream table’s name are also stored for the downstream artifact. Presto queries of data artifacts are noted, so if an engineer performs a Presto query, that will increase the Nemo score both generally, for that table, and for the specific engineer who performed the search.
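That lineage-aware behavior is easy to model in miniature (a toy, not Nemo's actual data structures): when a column is copied downstream, carry the upstream description along, and bump a popularity signal each time a table is queried.

```python
from collections import defaultdict

class ToyCatalog:
    """Toy model of lineage-aware indexing: copied columns inherit
    upstream metadata, and queries increase a per-table score that a
    search ranker could use."""
    def __init__(self):
        self.columns = {}              # (table, column) -> metadata
        self.score = defaultdict(int)  # table -> popularity score

    def add_column(self, table, column, description):
        self.columns[(table, column)] = {"description": description,
                                         "upstream": None}

    def copy_downstream(self, src_table, column, dst_table):
        # Propagate the original description; remember the upstream table.
        meta = self.columns[(src_table, column)]
        self.columns[(dst_table, column)] = {"description": meta["description"],
                                             "upstream": src_table}

    def record_query(self, table):
        self.score[table] += 1  # queried tables rank higher in search
```

The real system also personalizes the score per engineer, but even this much gets you the key win: documentation written once upstream stays discoverable everywhere downstream.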