Most data tech in production is pull-based: you have to go looking for an answer. Notifications and stream-based analysis are topics with a lot of interest, but significantly less deployment. With Airbnb having made this investment, hopefully many more companies will have the leverage they need to get serious about real-time.
Highly recommended if you are (or will be) considering a project in this area.
If you plug and play ML models without understanding the math under the hood, you’ll make costly mistakes. Choose the wrong algorithm. Choose the wrong hyperparameters. Underfit. Overfit. Mis-estimate your confidence intervals. Pain and suffering will ensue.
Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior, called outliers. It has many applications in business, from intrusion detection (identifying strange patterns in network traffic that could signal a hack) to system health monitoring (spotting a malignant tumor in an MRI scan), and from fraud detection in credit card transactions to fault detection in operating environments.
This overview will cover several methods of detecting anomalies, as well as how to build a detector using a simple moving average (SMA) or a low-pass filter.
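The moving-average approach mentioned above can be sketched in a few lines: flag any point that strays too far from the rolling mean of its recent history. This is an illustrative sketch, not the article's code; the window size and sigma threshold are arbitrary assumptions.

```python
# Sketch of a simple-moving-average anomaly detector.
# `window` and `n_sigmas` are assumed defaults, not values from the article.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(series, window=5, n_sigmas=2.0):
    """Flag indices where a point deviates from the SMA of the
    preceding `window` points by more than `n_sigmas` rolling std devs."""
    anomalies = []
    history = deque(maxlen=window)
    for i, x in enumerate(series):
        if len(history) == window:
            sma = mean(history)
            sd = stdev(history)
            if sd > 0 and abs(x - sma) > n_sigmas * sd:
                anomalies.append(i)
        history.append(x)
    return anomalies

# A flat series with one spike: only the spike is flagged.
data = [10, 11, 10, 9, 10, 11, 50, 10, 11, 10]
print(detect_anomalies(data))  # → [6]
```

A rolling standard-deviation band like this is essentially a crude low-pass filter: the SMA smooths out high-frequency noise, and anything far outside the smoothed signal is treated as an outlier.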
I did some fun anomaly detection this past week—detecting website traffic anomalies caused by TV advertising. This stuff is fun.
This post is awesome. If you still haven’t made the jump from Excel to R in your day-to-day, read this. It highlights why the jump is actually quite hard to make, and what the rewards are once you’ve made it.
There are hordes of Excel users out there; I’m fascinated by the problem of getting these users to learn and use more sophisticated tech.
Do you know what stemming and lemmatization are? No? You may not have had to tackle any NLP yet, but there’s no way you’ll be able to stay away from it for long. There’s just too much text out there. This is a solid intro to familiarize you with the key concepts.
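For a flavor of the two concepts: stemming crudely chops suffixes off words (fast, but can yield non-words), while lemmatization maps a word to its dictionary form and so needs a vocabulary. This toy sketch is hand-rolled to make the contrast visible; a real project would use a library like NLTK or spaCy, and the tiny lemma table here is just a stand-in for a real lexicon.

```python
# Toy illustration of stemming vs. lemmatization (not production code).

def stem(word):
    """Crude rule-based stemmer: strip common suffixes.
    Fast, but can produce non-words ("running" -> "runn")."""
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization needs a dictionary; this lookup table is a stand-in
# for a real lexicon such as WordNet.
LEMMAS = {"studies": "study", "running": "run", "mice": "mouse"}

def lemmatize(word):
    return LEMMAS.get(word, word)

for w in ("studies", "running", "mice"):
    print(f"{w}: stem={stem(w)!r}, lemma={lemmatize(w)!r}")
```

Note how "mice" slips past the suffix rules entirely but lemmatizes cleanly to "mouse" — exactly the kind of case where lemmatization earns its extra cost.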
Quora recently released its first public dataset. It includes 404,351 question pairs, with a label column indicating whether each pair is a duplicate. In this post, I’d like to investigate this dataset and propose at least a baseline method using deep learning.
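The post's baseline is a deep-learning model; as an even simpler point of comparison for this task, a word-overlap (Jaccard) classifier takes a few lines. This sketch is my own illustration, not code from the post, and the 0.5 threshold is an arbitrary assumption you would tune on the labeled pairs.

```python
# Word-overlap baseline for duplicate-question detection (illustrative).

def jaccard(q1, q2):
    """Jaccard similarity between the word sets of two questions."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_duplicate(q1, q2, threshold=0.5):
    # threshold is an assumed value; tune it against the labels.
    return jaccard(q1, q2) >= threshold

pair = ("How do I learn Python?", "How can I learn Python?")
print(is_duplicate(*pair))  # → True
```

Cheap baselines like this are worth running first: if a deep model can't beat bag-of-words overlap on the labeled pairs, something is wrong with the model or the evaluation.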