The biggest architectural shift happening in modern distributed systems is the move to log-based communication among large numbers of microservices.
If the above sentence made your eyes glaze over, you are not alone. Data scientists and analysts typically don’t spend a ton of time thinking about the technical implementation of how their data arrives at their doorstep, but this adoption of log-based communication will have very significant impacts on your work in coming years. You should get ahead of it.
Imagine that, instead of querying a customers table, you query a customers event stream that contains every update to a customer ever. You’ll find that some queries get more complicated: if you want the current state of a customer, you’ll have to reconstruct it yourself. But you’ll also have more power: you’ll be able to reconstruct that state at any given point in time. You’ll also likely have access to data much closer to real time.
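To make that concrete, here’s a minimal sketch (in Python, with invented field names and dates) of what “reconstructing state” from an event stream looks like. The same fold gives you either the current state or the state as of any point in time:

```python
from datetime import datetime

# Hypothetical event stream: every update ever made to a customer, in order.
events = [
    {"customer_id": 1, "ts": datetime(2017, 1, 5), "email": "a@example.com", "plan": "free"},
    {"customer_id": 1, "ts": datetime(2017, 3, 2), "plan": "pro"},
    {"customer_id": 1, "ts": datetime(2017, 6, 9), "email": "b@example.com"},
]

def state_as_of(events, customer_id, as_of):
    """Fold a customer's events up to `as_of` into a single state dict."""
    state = {}
    for event in events:
        if event["customer_id"] == customer_id and event["ts"] <= as_of:
            # Later events overwrite earlier values, field by field.
            state.update({k: v for k, v in event.items() if k not in ("customer_id", "ts")})
    return state

current = state_as_of(events, 1, datetime.now())           # latest state
historical = state_as_of(events, 1, datetime(2017, 4, 1))  # state as of April 1
```

In a warehouse you’d express the same fold in SQL (window functions over the event table), but the idea is identical: current state is just a derived view over the log.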
This article is an excellent intro to log-based architectures, as seen through the experience the New York Times had in their migration. If this isn’t a topic you’re familiar with, spend the time to digest it.
Modern systems must be aware of the quality of the incoming data and capable of identifying, reporting and handling erroneous cases accordingly.
Modern data organizations have dramatically increased their ability to ingest and process new data streams, but they frequently have no mechanism to validate that data either at the outset of a project or, just as importantly, on an ongoing basis. This article walks through the why and how of creating high-quality data.
We think a ton about this problem at Fishtown Analytics and it’s exactly what we’ve built dbt data tests to help with. We’re starting to see more companies beginning to take this issue seriously, but it’s still rare.
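As a sketch of what an ongoing validation check can look like — this is generic Python, not dbt’s actual test syntax, and the rules and field names are invented — a data test can simply return the rows that violate a rule, so that an empty result means the batch passes:

```python
# A hypothetical batch of incoming order rows.
rows = [
    {"order_id": 1, "amount": 25.00, "customer_id": 7},
    {"order_id": 2, "amount": -3.00, "customer_id": None},  # violates both rules
]

def failing_rows(rows):
    """Return every row violating our (invented) quality rules:
    amounts must be non-negative and customer_id must be present.
    An empty result means the batch passes."""
    return [
        r for r in rows
        if r["amount"] < 0 or r["customer_id"] is None
    ]

bad = failing_rows(rows)
# In a pipeline, a non-empty `bad` would fail the run and alert someone,
# rather than letting the bad rows flow silently downstream.
```

Running a check like this on every load — not just at project kickoff — is the part most teams skip.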
Even once your data obeys all of your technical rules, it’ll still have outliers. This post gives my favorite definition yet of what exactly an outlier is:
[An outlier is an] observation which deviates so much from other observations as to arouse suspicion it was generated by a different mechanism.
The reason I like this definition is that it focuses on the data creation process, not the data point itself. Outliers are not data points that you don’t like—that disconfirm your hypothesis—they’re data points whose variance suggests they were captured erroneously. We see this frequently in web analytics: because the data collection environment is so heterogeneous, some outliers show a time on site of over a decade.
If you’re going to provide high-quality analytics to your end consumers on an ongoing basis, outlier detection and correction need to be part of your pipeline.
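One lightweight way to fold detection into a pipeline — sketched here in plain Python with the standard library; the threshold and the example values are invented — is a median-absolute-deviation (MAD) filter, which is far more robust to extreme values than a mean-based z-score:

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the median absolute
    deviation) exceeds `threshold`. Returns a parallel list of booleans."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return [False] * len(values)
    # 0.6745 scales the MAD so the score is comparable to a standard z-score.
    return [abs(0.6745 * (v - median) / mad) > threshold for v in values]

# Time on site, in seconds; the last value claims roughly a decade on site.
sessions = [34, 120, 87, 15, 210, 64, 315_000_000]
flags = mad_outliers(sessions)  # only the decade-long "session" is flagged
```

The design point is that median and MAD barely move when one wild value enters the batch, so a single broken collection event can’t drag the threshold along with it the way it would with a mean and standard deviation.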
Today, we’re announcing a new resource for the data science community to help wistful data talent and managers pining for their next superstar. Mode’s new Data Jobs Board lists great jobs for analysts, data scientists and data engineers at exciting companies doing cutting-edge work.
There exists a hidden gap between the more idealized view of the world given to data-science students and recent hires, and the issues they often face getting to grips with real-world data science problems in industry.
It is critical to our mission to enable machine learning researchers with the most powerful training scenarios, and for us to give back to the gaming community by enabling them to utilize the latest machine learning technologies. As the first step in this endeavor, we are excited to introduce Unity Machine Learning Agents.
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.