Data Science Roundup reader Vicki Boykis has a great new post on the idiosyncrasies of SparkR’s dataframe. For a data scientist using R, SparkR is an incredibly powerful way to extend your existing skillset into the world of parallelized computing, but it’s important to understand what’s going on under the hood. Vicki’s article does a great job of showing exactly that.
Also, I’m embarrassed by just how much I enjoyed this joke from the article:
Some people, when confronted with a problem, think “I know, I’ll use multithreading”. Nothhw tpe yawrve o oblems.
It’s a good point: use Spark only when you have to.
Reader Michelangelo D'Agostino built a cool Jupyter utility that notifies you on the completion of a long-running cell. Stop alt-tabbing back into your browser to see if your model has finished training!
SQL optimization is such a critical topic, and one that too few data professionals go deep on. If you’re not familiar with how a query optimizer works, how to read an explain plan, or what linear time is, this article is a great getting started guide. Share this post with your colleagues who are bogging down your Redshift cluster 😉
Note that much of this post was written in the context of a traditional relational engine like MySQL. The core concepts are very relevant for modern analytic databases, although the specific recommendations are somewhat less applicable.
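If you’ve never read an explain plan, here is a minimal sketch of what one tells you. It uses Python’s built-in sqlite3 module as a stand-in (the linked post is about MySQL-style engines, and Redshift’s plans look different); the `orders` table, the index name, and the `plan` helper are all made up for illustration. The point is just to watch the plan change from a full scan, which is linear in the number of rows, to an index lookup.

```python
import sqlite3

# In-memory database with a hypothetical orders table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

QUERY = "SELECT total FROM orders WHERE customer_id = 42"


def plan(connection, sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[-1] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]


# Without an index, the engine has to look at every row.
plan_before = plan(conn, QUERY)

# With an index on the filtered column, it can jump straight to matching rows.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = plan(conn, QUERY)

print(plan_before)
print(plan_after)
```

The exact wording of the plan varies by SQLite version, but the first plan should describe a scan of `orders` and the second a search using `idx_orders_customer`. Other engines format their plans differently, so treat this as a way to build intuition rather than a Redshift recipe.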
Nodebook is a fascinating extension to Jupyter Notebooks that makes it easier to treat notebooks like real code: it maintains state and relationships across cells, ensuring that regardless of the execution order, you always get coherent results. Usable today—check it out.
At this point in the development of the SaaS business model, the metrics one uses to evaluate a SaaS business are fairly well-known. This article is a wonderful new take on one of the core metrics in SaaS: the LTV (lifetime value) to CAC (customer acquisition cost) ratio.
If you work at a subscription-based business, this is a must-read.
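For reference, here is a rough sketch of how the ratio is conventionally computed. This is the common simplified form (margin-adjusted revenue per account divided by churn, over blended acquisition cost), not the article’s new take, and every number and function name below is hypothetical.

```python
def lifetime_value(monthly_revenue_per_account, gross_margin, monthly_churn_rate):
    """Simplified LTV: margin-adjusted monthly revenue over expected lifetime.

    Expected lifetime in months is approximated as 1 / monthly churn rate.
    """
    return monthly_revenue_per_account * gross_margin / monthly_churn_rate


def customer_acquisition_cost(sales_and_marketing_spend, new_customers):
    """Blended CAC: total acquisition spend divided by customers acquired."""
    return sales_and_marketing_spend / new_customers


# Hypothetical numbers, purely for illustration.
ltv = lifetime_value(
    monthly_revenue_per_account=100.0, gross_margin=0.8, monthly_churn_rate=0.02
)
cac = customer_acquisition_cost(sales_and_marketing_spend=50_000.0, new_customers=40)
ratio = ltv / cac

print(f"LTV: {ltv:.0f}, CAC: {cac:.0f}, LTV:CAC = {ratio:.1f}")
```

The simplification hides a lot (churn is rarely constant, and CAC depends heavily on which spend you attribute to acquisition), which is why a fresh look at the metric is worth reading.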
Sometimes it’s better to tell people they’re being stupid while making them laugh at the same time. There are tons of xkcd comics that illustrate the strange ways we can all become confused, and this post pulls together many of the best. Use them in your PowerPoints 😊
This is super freaking cool: a utility that allows you to connect Presto (an open source SQL engine) to the Ethereum blockchain. Complete installation instructions plus example queries.
While I don’t recommend plugging this into Mode Analytics and becoming a cryptocurrency day trader, there are so many fascinating and profitable things to be done with this… Want some ideas? Check this out.
We’ve found that adding adaptive noise to the parameters of reinforcement learning algorithms frequently boosts performance.
Randomness helps neural networks get unstuck from local minima. The deeper we go into AI research, the more I find that I can take personal life lessons from the findings. Everyone needs some change, even your RNNs.
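To make “adaptive noise on the parameters” a bit more concrete, here is a toy sketch. It is not the implementation from the linked post: the linear policy, the distance measure, and all constants are assumptions chosen for illustration. The idea is to perturb the policy’s weights (rather than its actions), then scale the noise up or down depending on how much the perturbed policy’s behavior actually changes.

```python
import random


def perturb(params, sigma, rng):
    """Return a copy of params with Gaussian noise of scale sigma added."""
    return [p + rng.gauss(0.0, sigma) for p in params]


def action(params, state):
    """Toy linear policy: the action is the dot product of params and state."""
    return sum(p * s for p, s in zip(params, state))


def adapt_sigma(sigma, distance, target, alpha=1.01):
    """Grow sigma if the perturbed policy acts too similarly, else shrink it."""
    return sigma * alpha if distance <= target else sigma / alpha


rng = random.Random(0)
params = [0.5, -0.2, 0.1]  # hypothetical policy weights
states = [[rng.uniform(-1, 1) for _ in params] for _ in range(32)]
sigma, target = 0.5, 0.1  # hypothetical starting noise scale and target distance

for _ in range(200):
    noisy = perturb(params, sigma, rng)
    # Distance in action space: root mean squared difference over sampled states.
    squared = [(action(params, s) - action(noisy, s)) ** 2 for s in states]
    distance = (sum(squared) / len(squared)) ** 0.5
    sigma = adapt_sigma(sigma, distance, target)

print(f"adapted sigma: {sigma:.3f}")
```

In a real agent the perturbed weights would be used to collect experience and the unperturbed weights would be trained on it; this sketch only shows the noise-scaling loop.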
If you don’t care about finance and investing, skip this. I found it fascinating. Here are a couple of quotes:
…broad acceptance of [AI / ML] is slow due to various factors, the most important being that AI requires investment in new tools and human talent. The majority of funds use fundamental analysis because this is what managers learn in their MBA programs. There are not many hedge funds that rely solely on AI.
…I believe the transition for most traders will not be possible. The combination of skills required for understanding and applying AI rules out 95% of traders used to drawing lines on charts and watching moving averages.
In short: there is plenty of work to be done (and money to be made) in bringing data science to financial markets.
At Fishtown Analytics, we work with venture-funded startups to implement Redshift, Snowflake, Mode Analytics, and Looker. Want advanced analytics without needing to hire an entire data team? Let’s chat.