There’s so much to like here. The meat of the argument is summed up in this passage:
There is no correct win rate waiting to be unearthed; one version isn’t true while another is false. Each version is equally accurate because they are tautological: They measure precisely what they say they measure, no more and no less. Our job as analysts isn’t to do the math right so that we can figure out which answer is in the back of the book; it’s to determine which version, out of a subjective set of options, helps us best run a business.
This gets right at what I’ve found to be the single hardest thing to mentor other data analysts on: business context! What is interesting? What should make you curious to dig in deeper? What change would a given piece of information lead you to make in the world?
Another thing I’ve been thinking about recently is explicitly creating separate workspaces for curated (production) and messy (experimental, not-yet-production) work. I think one of the challenges Benn is describing isn’t just that analytics is messy…it’s that teams often co-mingle the messy with the clean parts of the process.
That ends up coming down to environment management. Creating end-to-end workflows that facilitate environment management is harder than it should be today. How could we make that easier as an ecosystem?
Good DS starts simple, ships, and then iterates. Bad DS starts with the most advanced technique they know.
This is just one of many fantastic nuggets in this post. If you are, or know, someone who is starting out their career in data science, please share this with them. These insights are the ones rarely focused on and yet far more determinative of success than the specific programming languages and statistical techniques in your tool belt.
This tweet is a very succinct summary of one of the most important posts that I’ve ever read: Engineers Shouldn’t Write ETL. If you find yourself saying, “hey, data engineers are valuable!” you’re not wrong–it’s that the org structure that they typically operate in leads to very poor outcomes and rampant mediocrity. Read the above post to understand why that’s the case. It’s just as true today as the day it was penned in 2016.
Heh…wow. This is a post that I had queued up to read for the past couple of months and am only now getting back to. It’s one of the deeper blog posts on the link between the SQL you wrote and the explain plan your database ran.
This is one of the biggest areas that I see new analytics engineers struggle with and is probably the deepest that an AE has to go on a purely technical / CS fundamentals continuum. In fact, you can skip this knowledge and just kinda cross your fingers that the optimizer will give you good results for a little while…but if you want to truly feel confident traversing any dataset, this is knowledge you need.
The post focuses on how different database engines optimize correlated subqueries. Here’s just the tip of the iceberg to give you a taste:
The easiest way to execute this is to run the subquery once for each row in the outer query, but this is potentially very inefficient. Databases rely on being able to collect, reorder and batch operations to reduce interpreter overhead and optimize memory access patterns. Running the same query many many times in a nested loop reduces that optimization freedom.
The author is a true expert in the field and is quite good at making somewhat arcane concepts accessible IMO.
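If you want to poke at this yourself, here is a minimal sketch using Python's built-in sqlite3 module. It is not taken from the post: the `orders` table, its columns, and its rows are made up for illustration. It runs a correlated subquery, then a hand-decorrelated rewrite that computes each customer's average once and joins to it, and prints the plan SQLite chose for each.

```python
import sqlite3

# Hypothetical toy table; the names and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
    [(1, 10.0), (1, 30.0), (2, 5.0), (2, 7.0), (2, 20.0)],
)

# Correlated form: the inner query references o.customer_id from the outer row,
# so the naive strategy is to re-run it once per outer row.
correlated = """
SELECT o.id
FROM orders AS o
WHERE o.amount > (
    SELECT AVG(i.amount) FROM orders AS i WHERE i.customer_id = o.customer_id
)
ORDER BY o.id
"""

# Hand-decorrelated form: compute every customer's average once, then join.
decorrelated = """
SELECT o.id
FROM orders AS o
JOIN (
    SELECT customer_id, AVG(amount) AS avg_amount
    FROM orders
    GROUP BY customer_id
) AS a ON a.customer_id = o.customer_id
WHERE o.amount > a.avg_amount
ORDER BY o.id
"""

correlated_ids = [row[0] for row in conn.execute(correlated)]
decorrelated_ids = [row[0] for row in conn.execute(decorrelated)]

# EXPLAIN QUERY PLAN reports how SQLite decided to run each version;
# the last column of each row is the human-readable step description.
for label, sql in (("correlated", correlated), ("decorrelated", decorrelated)):
    print(label)
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("  ", row[-1])
```

Both queries return the same orders; what differs is the plan. The exact plan text depends on your SQLite version, and other engines will make different choices for the same SQL, which is the point the post digs into.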
Popular Dev Tools aren't just solving a problem. They solve core emotional needs
* HuggingFace makes you feel smart
* Unity makes you feel like a kid again
* Github makes you feel seen
* Fastai makes you feel like you belong
* VSCode makes you feel like a tinkerer