Data Science Roundup #77: Artificial Agents Create Their Own Language, Annotated Audio Data, & more!

By The Data Science Roundup • Issue #77
Happy Sunday! Thanks, as always, for reading. 
If you enjoy reading the Data Science Roundup, I’d appreciate it if you could forward this email to three friends. It’s your referrals that keep us growing! 🙏🙏
- Tristan
Referred by a friend? Sign up here!

Two Posts You Can't Miss
OpenAI agents invented a language from scratch:
Our approach yields agents that invent a (simple!) language which is grounded and compositional. Grounded means that words in a language are tied to something directly experienced by a speaker in their environment, for example, a speaker forming an association between the word “tree” and images or experiences of trees. Compositional means that speakers can assemble multiple words into a sentence to represent a specific idea, such as getting another agent to go to a specific location.
Must read.
Google just released a new version of SyntaxNet, incorporating the results of over a year of NLP research. Consider the following sentence: “The gostak distims the doshes.”
This sentence was originally coined by Andrew Ingraham who explained: “You do not know what this means; nor do I. But if we assume that it is English, we know that the doshes are distimmed by the gostak. We know too that one distimmer of doshes is a gostak.” Systematic patterns in morphology and syntax allow us to guess the grammatical function of words even when they are completely novel: we understand that ‘doshes’ is the plural of the noun ‘dosh’ (similar to the ‘cats’ example above) or that ‘distims’ is the third person singular of the verb distim. Based on this analysis we can then derive the overall structure of this sentence even though we have never seen the words before.
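To make the idea concrete, here is a toy sketch of guessing grammatical roles for made-up words from suffixes and word order alone. It is a hand-written heuristic for illustration, not how SyntaxNet works; the function name, tag names, and rules are all assumptions of this sketch.

```python
# Toy heuristic (an assumption of this sketch, not SyntaxNet's method):
# guess a coarse grammatical role from word order and an "-s" suffix.
DETERMINERS = {"the", "a", "an"}


def tag_sentence(sentence):
    """Return (word, tag) pairs using only position and suffix cues."""
    words = sentence.strip(".").split()
    tags = []
    prev_tag = None
    for word in words:
        w = word.lower()
        if w in DETERMINERS:
            tag = "DET"
        elif prev_tag == "DET":
            # A word right after a determiner is treated as a noun;
            # a trailing "s" suggests a plural.
            tag = "NOUN_PLURAL" if w.endswith("s") else "NOUN"
        elif prev_tag in ("NOUN", "NOUN_PLURAL") and w.endswith("s"):
            # A word ending in "s" right after a noun is treated as a
            # third-person singular verb.
            tag = "VERB_3SG"
        else:
            tag = "UNKNOWN"
        tags.append((word, tag))
        prev_tag = tag
    return tags


for word, tag in tag_sentence("The gostak distims the doshes."):
    print(word, tag)
```

Even these two crude rules recover the structure Ingraham describes: a gostak (noun) distims (verb) the doshes (plural noun). A real parser learns far richer patterns from data rather than relying on hard-coded rules.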
This Week's Top Posts
I crunched the numbers on eight measures of 917 cities to learn what constitutes a typical city in America. Here’s what I found.
An almost surprisingly interesting post, given what a common dataset the author is working with. Great reminder of how important storytelling is.
“Data scientist” is certainly a term that takes its fair share of criticism. My main problem with the term is that it is actually too broad: the variance in skillset for someone with a data scientist title is incredibly high.
Companies in the market for data science talent should think long and hard about which of these profiles they’re actually looking for.
There are many, many startups today incorporating AI into their products and services. This article presents 50 of the largest and best-funded, and the list is well worth a look.
AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and common everyday environmental sounds.
By releasing AudioSet, we hope to provide a common, realistic-scale evaluation task for audio event detection, as well as a starting point for a comprehensive vocabulary of sound events.
Google is on a roll: this dataset could potentially be as important as ImageNet.
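For a rough feel of what "an ontology specified as a hierarchical graph" means in practice, here is a minimal sketch. The records, IDs, and field names (`id`, `name`, `child_ids`) are made up for illustration and are an assumption about the layout, not a copy of the released files.

```python
# Hypothetical miniature ontology: each record names a sound class and
# lists the IDs of its child classes. These entries are invented for
# illustration and are not taken from AudioSet itself.
ontology = [
    {"id": "animal", "name": "Animal", "child_ids": ["dog", "cat"]},
    {"id": "dog", "name": "Dog", "child_ids": ["bark"]},
    {"id": "bark", "name": "Bark", "child_ids": []},
    {"id": "cat", "name": "Cat", "child_ids": []},
    {"id": "music", "name": "Music", "child_ids": []},
]


def descendants(nodes, root_id):
    """Return the sorted names of every class below root_id."""
    by_id = {node["id"]: node for node in nodes}
    seen = set()
    stack = list(by_id[root_id]["child_ids"])
    while stack:
        node_id = stack.pop()
        if node_id in seen:
            continue  # a graph may reach the same class by two paths
        seen.add(node_id)
        stack.extend(by_id[node_id]["child_ids"])
    return sorted(by_id[node_id]["name"] for node_id in seen)


print(descendants(ontology, "animal"))
```

Walking the graph this way is what lets a clip labeled with a narrow class such as "Bark" also count as an example of its broader ancestors.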
This post by data scientist Nikolas Markou is making the rounds right now. In it, he presents an exhaustive bulleted list of detailed recommendations for how to tune a neural network. All substance, no fluff.
Have you ever been asked “What exactly is a tensor?” and wished you had a more coherent answer? If so, this post is for you.
Data viz of the week
The data for the visualization below comes from an analysis of 770,000 tubes of saliva. It’s hard to get a sense from the embedded version, but there are some great stories played out in the details. Click through to see the larger version.
Thanks to our sponsors!
Fishtown Analytics works with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.
Developers shouldn’t have to write ETL scripts. Consolidate your data in minutes. No API maintenance, scripting, cron jobs, or JSON wrangling required.
The Data Science Roundup
The internet's most useful data science articles. Curated with ❤️ by Tristan Handy.