In this post I hope to explain as concisely as I can some of the key problems with pandas’s internals and how I’ve been steadily planning and building pragmatic, working solutions for them.
Wes McKinney knows a thing or two about pandas—he started writing it in his spare time almost a decade ago and continues to be deeply involved. In this post, he tells a story that starts with pandas but goes further, leading eventually to him pulling together a coalition around an Apache project called Arrow.
The motivating force behind the story has been performance at scale. Wes talks about how a number of early design decisions in pandas still plague the project today. Arrow aims to be “columnar data middleware” enabling zero-copy data access between tools like Impala, Kudu, Spark, and Parquet. In his own words: “I strongly feel that Arrow is a key technology for the next generation of data science tools.”
I first mentioned Arrow here almost two years ago, when the project started. It’s made a ton of headway since then, and it’s well worth checking out if it’s new to you.
Advice on moving into data science after a PhD in the natural sciences, from someone who made the transition. My favorite quote is from a section titled “Why tech companies shouldn’t hire you”:
[Tech firms] know that, left alone, a typical science PhD cannot build robust, complex software systems. More fundamentally, science PhDs are often ignorant about the basic tools and conventions of collaborative software development.
In a survey of more than 2,000 psychologists, Leslie John of Harvard Business School found that more than half had decided whether to collect more data only after checking the significance of their results, in effect letting them keep collecting until their hypotheses were confirmed.
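A quick simulation shows why this kind of optional stopping is a problem. The sketch below is a toy example, not the survey’s methodology: it uses a two-sided z-test on data with no real effect, with an invented peek-every-ten-points schedule, to show that stopping at the first significant result pushes the false-positive rate well above the nominal 5%.

```python
import math
import random

def pvalue(xs):
    """Two-sided z-test for mean 0 with known sd 1 (toy statistic)."""
    z = sum(xs) / math.sqrt(len(xs))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(0)
trials = 2000
hits = 0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(10)]  # null is true: mean really is 0
    significant = pvalue(xs) < 0.05
    for _ in range(9):                            # peek, then collect 10 more points
        if significant:
            break
        xs += [random.gauss(0, 1) for _ in range(10)]
        significant = pvalue(xs) < 0.05
    if significant:
        hits += 1

rate = hits / trials
print(rate)  # well above the nominal 0.05, despite no real effect
```

Each individual test is valid at the 5% level; it’s the repeated peeking that inflates the error rate.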
Working on your own projects, solving a problem from beginning to end, is the best way to build your data science skills. That very frequently starts with web scraping to collect your initial dataset. Scrapy, a Python library for building crawlers, is extremely capable, and this tutorial is 👌.
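The core of any crawler is fetching pages and pulling links or fields out of them. As a stdlib-only illustration of that extraction step (Scrapy layers request scheduling, retries, and item pipelines on top of this; the HTML snippet here is invented for the example):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags as the parser walks the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A made-up page standing in for a fetched response body:
page = '<html><body><a href="/page1">One</a> <a href="/page2">Two</a></body></html>'
extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['/page1', '/page2']
```

A real crawler would feed those links back into a download queue; that bookkeeping is exactly what Scrapy handles for you.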
A new idea is helping to explain the puzzling success of today’s artificial-intelligence algorithms — and might also explain how human brains learn. From Geoff Hinton:
It’s extremely interesting. I have to listen to it another 10,000 times to really understand it, but it’s very rare nowadays to hear a talk with a really original idea in it that may be the answer to a really major puzzle.
You’ve likely heard that the flooding in Houston following Hurricane Harvey is reportedly a 500-year or even 1000-year flood event. You’ve perhaps also heard that this is the third 500-year flood that Houston has experienced in a three-year span, which calls into serious question the usefulness or accuracy of the “500-year flood” designation. This made me wonder: what’s our actual best estimate of the yearly risk for a Harvey-level flood, according to the data? That is the question I will attempt to answer here.
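Before digging into the data, it’s worth seeing just how implausible the official rate makes the observation. This back-of-the-envelope sketch assumes independent years and takes the nominal 1/500 annual probability at face value, both assumptions the question above puts under strain:

```python
from math import comb

p = 1 / 500  # nominal annual chance of a "500-year" flood
n = 3        # the three-year span in question
# If the 1/500 rate were right and years were independent, the chance of
# seeing three (or more) such floods in a three-year span would be:
prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3, n + 1))
print(prob)  # ~8e-09, roughly 1 in 125 million
```

The absurdly small number is the point: either Houston has been astronomically unlucky, or the 1/500 figure is off.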
After years of being left for dead, SQL today is making a comeback. How come? And what effect will this have on the data community?
IMO this is a somewhat flawed perspective: both SQL and NoSQL have roles to play in current and future data processing. That said, it presents a useful history of the industry that’s well worth reading if you’re not already familiar with it.
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.