TL;DR: pipeline debt is a species of technical debt that infests backend data systems. It drags down productivity and puts analytic integrity at risk. The best way to beat pipeline debt is a new twist on automated testing: pipeline tests, which are applied to data (instead of code) and at batch time (instead of compile or deploy time).
I really love this. Analytics and data science rely on quality data pipelines, and it’s shocking just how infrequently these pipelines are appropriately tested. The article does a wonderful job discussing just how challenging it can be to get this stuff right and the importance of doing so. We do a ton of data testing at Fishtown Analytics and are huge believers.
My only quibble with the tool itself is the technical approach it takes. Brian and Josh from dbt slack put it better than I could (below). Even so, I give this team a ton of credit for pushing forward a very important topic.
We believe these values and principles, taken together, describe the most effective, ethical, and modern approach to data teamwork.
Too often, a data team’s efficacy is limited by its organization’s willingness to buy into good data practices. This manifesto is a way for the entire community to rally around how data should be done, providing tremendous social proof for individual data teams implementing these practices.
This short and sweet article goes through five principles of web design as applied to dashboard construction. The decisions you make when laying out your dashboard really matter: don’t just vomit up a bunch of charts.
While we’re on the topic of visual design, font is another area worthy of attention that is often overlooked. The simple example above goes to show how big of a readability impact font choice can have: which of the two ingredients lists above do your eyes find easier to read?
Here’s a quote that I loved, on an aspect of fonts I had never considered:
[fonts built for tables] have a second and equally important characteristic: they maintain their equal widths across a range of weights. (This runs counter to the typical behavior in a typeface, in which heavier weights become progressively wider.) Known as “duplexing,” this is one of the essential characteristics of tabular figures, because it allows designers to highlight individual lines in boldface without disrupting the width of the column.
Earlier this month, I had the exciting opportunity to moderate a discussion between Professors Yann LeCun and Christopher Manning, titled “What innate priors should we build into the architecture of deep learning systems?” This discussion topic – about the structural design decisions we build into our neural architectures, and how those correspond to certain assumptions and inductive biases – is an important one in AI right now.
This is a fascinating topic in AI design at the highest level. This will be an ongoing conversation over the coming decade; if it’s new to you this is a great intro.
We had recently published a large-scale machine learning benchmark using word2vec, comparing several popular hardware providers and ML frameworks in pragmatic aspects such as their cost, ease of use, stability, scalability and performance.
I was surprised at how some of the more niche providers beat out AWS. I had previously been unaware of PaperSpace, but they seem worth further investigation.
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.