You’re likely familiar with Dask, the parallelization framework for Python. If you’re like me, you haven’t had to use it yourself and so haven’t had the opportunity to go deep. This podcast is an easy way to get an overview; it gets satisfyingly technical. Here’s my favorite chunk of the transcript:
(Q) In Dask, if I want to instantiate a really, really big distributed array, what kinds of work are you doing in Dask to instantiate that array?
(A) (…) we’ve got these thousand machines each holding maybe 10 NumPy arrays and now we need to sort of map and figure out which for this particular NumPy array, where does it fit in the broader picture? Maybe this is the NumPy array that corresponds to the temperature over France, for example. On this other computer is a NumPy array corresponding to the block of temperature over Italy. We know that if we want to sort of look at the Italy-France connection, we need to have those two machines.
Dask is a really a system that’s watching all those machines and is tracking all those Python objects and is as necessary telling those machines what to do, “Okay. It’s now time for the machine holding France to compute its sum. It’s now time for the machine holding Italy to transfer that array over to the machine holding France so that we can do some interaction.”
There’re two problems here. One is figuring out a plan of which arrays need to talk to each other and then executing that plan, which is a lot of talking to all the machines, make sure they’re doing the right thing. If one machine goes down, making sure the work that was on it gets replaced.
Cool! Dask literally lays on top of Numpy and adds cluster support. Got it.