Data Science Roundup #83: OCR @ Dropbox, The Nature of Knowledge, Data Validation & more!

By The Data Science Roundup • Issue #83
We just got back from DataEngConf in San Francisco last week. Congrats to our friends at Hakka Labs for putting on a great event! Highly recommended.
- Tristan

Two Posts You Can't Miss
Fascinating post. The author goes back to the very earliest writing we have on what it means to know something:
Back at the beginning of Western culture’s discovery of knowledge, Plato told us that it’s not enough for a belief to be true…knowledge in the West has consisted of justifiable true beliefs — opinions we hold for a good reason.
Here’s my favorite line in the piece:
The machine-learned way of seeing might be more reflective of how the world actually is than purely human knowledge could ever be.
What if most knowledge isn’t easily reducible to symbolic logic but requires a network with billions of weights to know?
The truth is that, as we begin to discover more about our own neural processes, we actually don’t understand nearly as much about our own mechanisms of reasoning as we had previously believed. Knowledge, even our own knowledge, has always been incomprehensible to us.
Long read, but very worthwhile. Also worth a look: The Myth of a Superhuman AI.
It turns out that a large corpus of stories plus some fairly straightforward sentiment analysis can tell us quite a lot about the human preference for narrative structure. The chart below is a rather profound summarization of 112,000 stories: the relative sentiment of words based on where they appear within a story. You can tell exactly when Sam and Frodo enter Mordor and when they throw the Ring into Mount Doom.
I’m fascinated by this not because it tells us something we didn’t know, but because of what amazing choices the author made in his analysis to arrive at such clear conclusions. Aspirational.
Every story ever.
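For anyone curious how an analysis like this could work mechanically, here is a minimal sketch of averaging word sentiment by relative position in a story. It is not the author's method: the tiny `LEXICON`, the function name `sentiment_by_position`, and the bin count are all assumptions made up for illustration, and a real analysis would use a proper sentiment lexicon and a far larger corpus.

```python
from statistics import mean

# Toy sentiment lexicon, invented for this sketch (not from the original post).
LEXICON = {
    "happy": 1.0, "love": 1.0, "joy": 1.0,
    "dark": -1.0, "fear": -1.0, "death": -1.0,
}

def sentiment_by_position(stories, n_bins=10):
    """Average lexicon sentiment of words, grouped by where they fall in a story.

    Each story is split on whitespace; a word at index i of a story with
    n words lands in bin i * n_bins // n, so bin 0 is the opening of every
    story and the last bin is the ending.
    """
    bins = [[] for _ in range(n_bins)]
    for story in stories:
        words = story.lower().split()
        for i, word in enumerate(words):
            if word in LEXICON:
                idx = min(i * n_bins // len(words), n_bins - 1)
                bins[idx].append(LEXICON[word])
    # Bins with no scored words default to a neutral 0.0.
    return [mean(b) if b else 0.0 for b in bins]

curve = sentiment_by_position(["happy love dark fear"], n_bins=2)
```

Plotting the returned list against bin index gives the kind of curve the chart shows, with every story normalized to the same length.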
This Week's Top Posts
From crowdsourcing to convolutional layers to training to production, the Dropbox engineering team outlines every step in their OCR pipeline. Impressive.
Data validation is a topic I care a lot about and it doesn’t get nearly enough attention. This package has some wonderful validation constructs: even if you don’t spend a lot of time in R, it’s worth reading this piece purely for its approach to scalable data validation.
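To make the idea of declarative validation concrete, here is a rough Python sketch of the general pattern: rules are named predicates, and a single function reports which rows fail which rule. This is not the R package's API. The `validate` function, the rule names, and the sample rows are all hypothetical.

```python
def validate(rows, rules):
    """Apply each named rule to each row.

    Returns a dict mapping rule name to the list of row indices that failed it.
    """
    failures = {name: [] for name in rules}
    for i, row in enumerate(rows):
        for name, check in rules.items():
            if not check(row):
                failures[name].append(i)
    return failures

# Hypothetical rules and data, for illustration only.
rules = {
    "age_non_negative": lambda r: r["age"] >= 0,
    "email_has_at": lambda r: "@" in r["email"],
}
rows = [
    {"age": 34, "email": "a@example.com"},
    {"age": -1, "email": "b@example.com"},
    {"age": 20, "email": "no-at-sign"},
]
report = validate(rows, rules)
```

Keeping the rules separate from the data is what lets this approach scale: new checks are added as entries in `rules` without touching the code that runs them.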
Now that you’re already a data scientist, maybe it’s time to consider moving over to product? The article points out the similarities between the roles:
  • Data scientists and product managers make decisions with data.
  • Data scientists and product managers work cross-functionally.
  • Data scientists and product managers choose an objective function and ruthlessly optimize for it.
Definitely agree. I’d be curious if any readers have made this transition.
The purpose of exploratory data analysis is the finding, not the exploring
In perceptual classification, the analyst looks at the data and matches what they see against familiar patterns. In perceptual clustering, the analyst finds groups of similar patterns without necessarily leveraging known patterns.
I covered this article several weeks ago. Now, a new author has picked up the dataset and created an entire mapping of the similarity of the top 10k subreddits. Great map, great walkthrough. Includes code.
The distance between Espresso and Cappuccino
Customer: Espresso? But I ordered a cappuccino!
Robot: Don’t worry, the cosine distance between them is so small, that they are almost the same thing.
I can’t tell if this is funny or not (maybe a little?), but I’m fascinated that people are producing data science humor. I can’t imagine a less funny topic.
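For readers who want the joke spelled out: cosine distance is one minus the cosine of the angle between two vectors, so vectors pointing in nearly the same direction score close to zero. Here is a small sketch. The three-dimensional "embeddings" for the two drinks are made-up numbers, not output from any real model.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Invented vectors, for illustration only.
espresso = [0.9, 0.1, 0.8]
cappuccino = [0.8, 0.3, 0.7]

distance = cosine_distance(espresso, cappuccino)
```

With these invented numbers the distance comes out very small, which is exactly the robot's defense.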
Data viz of the week
More story data, different source! He kidnaps, she screams.
Thanks to our sponsors!
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The Data Science Roundup
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.