Similtaneously that you just would possibly perhaps maintain a huge sequence of files and cherish to achieve equivalent computations on every and every ingredient, records parallelism is a uncomplicated system to speedup computation the utilization of a pair of CPUs and machines as neatly as GPU(s). Whereas that is never any longer the one invent of parallelism, it covers an huge class of compute-intensive functions. A serious hurdle for the utilization of files parallelism is that that you just would possibly perhaps maintain to unlearn some habits devoted in sequential computation (i.e., patterns raze result in mutations of files constructing). Particularly, it is a ways vital to portray libraries that will well presumably allow you to portray what to compute in portray of how to compute. Almost, it system to portray generalized bag of route of and decrease operations and bid your computation with reference to them. Fortunately, at the identical time as you realize the instrument to write iterator comprehensions, there would possibly perhaps be now no longer any longer extremely efficient more to be taught for having bag entry to to a huge class of files parallel computations.
This introduction most foremost specializes within the Julia functions that I (Takafumi Arakaki @tkf
) bag developed. This is why, it at this time specializes in thread-primarily primarily based entirely entirely parallelism. There would possibly perhaps be uncomplicated dispensed computing toughen. GPU toughen is a customarily requested characteristic but it hasn’t been performed but. Survey moreover entirely varied parallel-computation libraries in Julia.
Also stamp that this introduction does no longer discuss learn the technique to portray threading primitives similar to Threads.@spawn
since it is a ways too low-stage and mistake-inclined. For records parallelism, the next-stage description is instrument more acceptable. It moreover helps you write more reusable code; e.g., the utilization of the equivalent code for single-threaded, multi-threaded, and dispensed computing.
julia
and librariesMost of the examples right here would possibly perhaps well presumably presumably effectively moreover merely work in all Julia 1.x releases. On the assorted hand, for the handiest raze result, it is a ways extremely urged to construct primarily primarily the most up-to-date launched version (1.5.2 as of writing). Which you’re going to be ready to earn it at https://julialang.org/.
In case you construct julia
, that you just’re going to be ready to effectively construct the dependencies required for this tutorial by working the utilization of Pkg; Pkg.add(["Transducers", "ThreadsX", "OnlineStats", "FLoops", "MicroCollections", "BangBang", "Plots", "BenchmarkTools"])
in Julia REPL.
Similtaneously you clutch the utilization of precisely the equivalent atmosphere venerable for attempting out this tutorial, wing the next instructions
git clone https://github.com/JuliaFolds/records-parallelism
cd records-parallelism
julia --conducting
after which within the Julia REPL:
julia> the utilization of Pkg
julia> Pkg.instantiate()
julia
To portray multi-threading in Julia, that you just would possibly perhaps maintain to launch it with a pair of execution threads. Similtaneously that you just would possibly perhaps maintain Julia 1.5 or higher, that you just’re going to be ready to effectively delivery it with the -t auto
(or, equivalently, --threads auto
) option:
$ julia -t auto
_
_ _ _(_)_ | Documentation: https://doctors.julialang.org
(_) | (_) (_) |
_ _ _| |_ __ _ | Build "?" for help, "]?" for Pkg help.
| | | | | | |/ _` | |
| | |_| | | | (_| | | Model 1.5.2 (2020-09-23)
_/ |__'_|_|_|__'_| | Obedient https://julialang.org/ liberate
|__/ |
julia> Threads.nthreads() # preference of core that you just would possibly perhaps maintain
8
The repeat line option -t
/--threads
would possibly perhaps well presumably presumably effectively moreover moreover clutch the preference of threads to be venerable. In older Julia releases, portray the JULIA_NUM_THREADS
atmosphere variable. As an illustration, on Linux and macOS, JULIA_NUM_THREADS=4 julia
begins juila
with 4 execution threads.
For more records, be taught about Initiating Julia with a pair of threads within the Julia instruction manual.
julia
with a pair of employee processesAbout a examples beneath repeat masks Disbursed.jl-primarily primarily based entirely entirely parallelism. Delight in how multi-threading is setup, that you just would possibly perhaps maintain to setup a pair of employee processes to construct speedup. Which you’re going to be ready to open julia
with -p auto
(or, equivalently, --procs auto
). Disbursed.jl moreover potential that you just can add employee processes after taking off Julia with addprocs
:
the utilization of Disbursed
addprocs(8)
For more records, be taught about Initiating and managing employee processes part within the Julia instruction manual.
Mapping is presumably primarily primarily the most customarily venerable characteristic in records parallelism. Snatch how Julia’s sequential route of
works:
a1 = route of(string, 1: 9, 'a': 'i')
9-ingredient Array{String,1}:
"1a"
"2b"
"3c"
"4d"
"5e"
"6f"
"7g"
"8h"
"9i"
We can merely change it with ThreadsX.route of
for thread-primarily primarily based entirely entirely parallelism (be taught about moreover entirely varied libraries):
the utilization of ThreadsX
a2 = ThreadsX.route of(string, 1: 9, 'a': 'i')
@express a1 == a2
Julia’s outmoded library Disbursed.jl comprises pmap
as a dispensed version of route of
:
the utilization of Disbursed
a3 = pmap(string, 1: 9, 'a': 'i')
@express a1 == a3
🔬 Take a look at Code
the utilization of Take a look at
@testset open
@take a look at a1 == a2
@take a look at a1 == a3
pause
☑ Pass
Take a look at Summary: | Pass Whole
take a look at space | 2 2
As a a diminutive of more « realistic » example, let’s play with the Collatz conjecture which states that recursive application the Collatz characteristic defined as
collatz(x) =
if iseven(x)
x ÷ 2
else
3x + 1
pause
reaches the #1 for all evident integers.
I am going to skip the mathematical background of it (as I bag no belief extremely efficient about it) but let me repeat masks that there are many of fun-to-perceive explanations in YouTube 🙂
If the conjecture is nice, the preference of iteration required for the preliminary cost is finite. In Julia, we’re in a space to calculate it with
characteristic collatz_stopping_time(x)
n = 0
whereas nice
x == 1 && return n
n += 1
x = collatz(x)
pause
pause
Stunning for fun, let’s portray the stopping time of the preliminary values from 1 to 10,000:
the utilization of Plots
plt = scatter(
route of(collatz_stopping_time, 1: 10_000),
xlabel = "Preliminary cost",
ylabel = "Stopping time",
stamp = "",
markercolor = 1,
markerstrokecolor = 1,
markersize = 3,
dimension = (450, 300),
)
We can with out problems parallelize route of(collatz_stopping_time, 1: 10_000)
and construct an supreme speedup:
julia> Threads.nthreads()
4
julia> the utilization of BenchmarkTools
julia> @btime route of(collatz_stopping_time, 1: 100_000);
18.116 ms (2 allocations: 781.33 KiB)
julia> @btime ThreadsX.route of(collatz_stopping_time, 1: 100_000);
5.391 ms (1665 allocations: 7.09 MiB)
Julia’s iterator comprehension syntax is a sturdy instrument for composing mapping, filtering, and knocking down. Snatch that mapping would possibly perhaps well presumably presumably effectively moreover moreover be written as an array or iterator comprehension:
b1 = route of(x -> x + 1, 1: 3)
b2 = [x+1 for x in 1:3]
b3 = bag(x + 1 for x in 1: 3)
@express b1 == b2 == b3
b1
3-ingredient Array{Int64,1}:
2
3
4
The iterator comprehension would possibly perhaps well presumably presumably effectively moreover moreover be utilized with threads by the utilization of ThreadsX.bag
:
b4 = ThreadsX.bag(x + 1 for x in 1: 3)
@express b1 == b4
🔬 Take a look at Code
the utilization of Take a look at
@testset open
@take a look at b1 == b2 == b3
pause
☑ Pass
Take a look at Summary: | Pass Whole
take a look at space | 1 1
Label that more evolved composition of mapping, filtering, and knocking down would possibly perhaps well presumably presumably effectively moreover moreover be utilized in parallel:
c1 = ThreadsX.bag(y for x in 1: 3 if isodd(x) for y in 1:x)
4-ingredient Array{Int64,1}:
1
1
2
3
Transducers.dcollect
is for the utilization of iterator comprehensions with a dispensed backend:
the utilization of Transducers
c2 = dcollect(y for x in 1: 3 if isodd(x) for y in 1:x)
@express c1 == c2
🔬 Take a look at Code
@take a look at c1 == c2 == [1, 1, 2, 3]
Capabilities similar to sum
, prod
, most
, and all
are the examples of discount (aka fold) that will well presumably presumably effectively moreover moreover be parallelized. They are very tall devices when mixed with iterator comprehensions. The portray of ThreadsX.jl, a sum of a iterator created by the comprehension syntax
d1 = sum(x + 1 for x in 1: 3)
9
can with out problems be parallelized by
d2 = ThreadsX.sum(x + 1 for x in 1: 3)
9
🔬 Take a look at Code
@take a look at d1 == d2
For the beefy checklist of pre-defined reductions and entirely varied parallelized capabilities, invent ThreadsX.
and press TAB within the REPL.
We can portray most
to compute primarily primarily the most stopping time of Collatz characteristic on a given the form of preliminary values
max_time = ThreadsX.most(collatz_stopping_time, 1: 100_000)
350
🔬 Take a look at Code
@take a look at max_time == 350
We construct a speedup similar to the route of
example above:
julia> @btime most(collatz_stopping_time, 1: 100_000)
17.625 ms (0 allocations: 0 bytes)
350
julia> @btime ThreadsX.most(collatz_stopping_time, 1: 100_000)
5.024 ms (1214 allocations: 69.17 KiB)
350
OnlineStats.jl offers a very rich and composable space of reductions. Which you’re going to be ready to dart it because the first argument to ThreadsX.decrease
:
the utilization of OnlineStats: Mean
e1 = ThreadsX.decrease(Mean(), 1: 10)
Mean: n=10 | cost=5.5
🔬 Take a look at Code
the utilization of OnlineStats; @take a look at e1 == match!(Mean(), 1: 10)
💡 Label
Whereas OnlineStats.jl customarily does no longer present the quickest system to compute the given statistics when the entire intermediate records can slot in reminiscence, in about a circumstances you attain now no longer indubitably clutch on completely the handiest performance. On the assorted hand, it goes to be cost pondering entirely varied programs to compute statistics if ThreadsX.jl + OnlineStats.jl turns into the bottleneck.
For non-trivial parallel computations, that you just would possibly perhaps maintain to write a custom discount. FLoops.jl offers a concise space of syntax for writing custom reductions. As an illustration, that is learn the technique to compute sums of two quantities in one sweep:
the utilization of FLoops
@floop for (x, y) in zip(1: 3, 1: 2: 6)
a = x + y
b = x - y
@decrease(s += a, t += b)
pause
(s, t)
(15, -3)
🔬 Take a look at Code
@take a look at (s, t) == (15, -3)
On this case, we attain no longer initialize s
and t
; but it no doubt indubitably is never any longer a typo. In parallel sum, the one low-cost cost of the preliminary portray of the accumulators adore s
and t
is zero. So, @decrease(s += a, t
works as if
+= b)s
and t
are initialized to acceptable invent of zero. On the assorted hand, since there are an total lot zeros in Julia (0::Int
, 0.0::Drag with the dart with the circulation64
, (0x00 + 0x00im)::Developed{UInt8}
, …), s
and t
are undefined if the enter sequence (i.e., the associated cost of xs
in for
) is empty.
x in xs
To administration the invent of the accumulators and likewise to manual certain of UndefVarError
within the empty case, that you just’re going to be ready to effectively space the preliminary cost with accumulator = initial_value op enter
syntax
@floop for (x, y) in zip(1: 3, 1: 2: 6)
a = x + y
b = x - y
@decrease(s2 = 0.0 + a, t2 = 0im + b)
pause
(s2, t2)
(15.0, -3 + 0im)
🔬 Take a look at Code
@take a look at (s2, t2) === (15.0, -3 + 0im)
To correct cherish the computation of @floop
with @decrease(accumulator =
syntax, that you just’re going to be ready to effectively construct a rough belief by nice ignoring
initial_value op enter)@decrease(
and corresponding ,
s and )
. More concretely:
Extract expressions accumulator = initial_value
(« initializers ») from accumulator = initial_value op enter
and save them in entrance of the for
loop.
Convert accumulator = initial_value op enter
to inplace update accumulator = accumulator op enter
.
Strip off @decrease
.
So, the above example of @floop
is similar to
let
s2 = 0.0
t2 = 0im
for (x, y) in zip(1: 3, 1: 2: 6)
a = x + y
b = x - y
s2 = s2 + a
t2 = t2 + b
pause
(s2, t2)
pause
(15.0, -3 + 0im)
🔬 Take a look at Code
@take a look at (s3, t3) === (s2, t2)
The hastily-hand version @decrease(s += a, t += b)
is performed by the utilization of the first ingredient of the enter sequence because the preliminary cost.
This transformation is venerable for producing the contaminated case that is utilized in a single Process
. Moderately masses of outcomes from tasks are mixed by the operators and capabilities specified by @decrease
. More explicitly, (s2_right, t2_right)
is mixed into (s2_left,
by
t2_left)
s2_left = s2_left + s2_right
t2_left = t2_left + t2_right
⚠ Warning
Don’t portray locks or atomics! (until what you are doing)
Particularly, attain no longer write
acc = Threads.Atomic{Int}(0)
Threads.@thread fors x in xs
Threads.atomic_add!(acc, x + 1)
pause
Locks and atomics will allow you to write just correct concurrent functions when venerable accurately. On the assorted hand, they attain so by limiting parallel execution. The portray of files parallel sample is fully the terminate system to construct extreme performance.
findmin
/findmax
with @decrease() attain
@decrease() attain
syntax is especially primarily the most versatile system in FLoops.jl for expressing custom reductions. It is miles terribly devoted when one variable can affect entirely varied variable(s) in discount (e.g., index and cost within the instance beneath). Label moreover that @decrease
would possibly perhaps well presumably presumably effectively moreover moreover be venerable a pair of cases within the loop physique. Right here’s a system to compute findmin
and findmax
in parallel:
@floop for (i, x) in pairs([0, 1, 3, 2])
@decrease() attain (imin = -1; i), (xmin = Inf; x)
if xmin > x
xmin = x
imin = i
pause
pause
@decrease() attain (imax = -1; i), (xmax = -Inf; x)
if xmax < x
xmax = x
imax = i
terminate
terminate
terminate
@repeat imin xmin imax xmax
imin = 1
xmin = 0
imax = 3
xmax = 3
🔬 Take a look at Code
@take a look at (imin, xmin, imax, xmax) == (1, 0, 3, 3)
We can stamp the computation of @floop
roughly by ignoring the lines with @gash support() attain
and corresponding terminate
. More concretely:
Extract expressions accumulator = initial_value
(« initializers ») from (accumulator = initial_value; enter)
or (accumulator; enter)
and build them in entrance of the for
loop.
Opt away @gash support() attain ...
and corresponding terminate
.
let
imin2 = -1
xmin2 = Inf
imax2 = -1
xmax2 = -Inf
for (i, x) in pairs([0, 1, 3, 2])
if xmin2 > x
xmin2 = x
imin2 = i
pause
if xmax2 < x
xmax2 = x
imax2 = i
terminate
terminate
@repeat imin2 xmin2 imax2 xmax2
terminate
imin2 = 1
xmin2 = 0
imax2 = 3
xmax2 = 3
🔬 Take a look at Code
@take a look at (imin2, xmin2, imax2, xmax2) == (1, 0, 3, 3)
The above computation is broken-down for every partition of the enter collection and mixed by the reducing fair defined by @gash support()
block. That is to dispute,
attain(imin2_right, xmin2_right, imax2_right,
is mixed into
xmax2_right)(imin2_left, xmin2_left, imax2_left,
by
xmax2_left)
if xmin_left > xmin_right
xmin_left = xmin_right
imin_left = imin_right
pause
if xmax_left < xmax_right
xmax_left = xmax_right
imax_left = imax_right
terminate
Deliver: Search that x
and i
of the first @gash support() attain
block are modified with xmin_right
and imin_right
while x
and i
of the 2nd @gash support() attain
block are modified with xmax_right
and imax_right
. Right here’s why we damaged-down two @gash support() attain
blocks; we maintain to « pair » x
/i
with xmin
/imin
or with xmax
/imax
reckoning on which if
block we’re in.
findmin
/findmax
with ThreadsX.gash support
(slack!)Demonstrate that it is no longer important to make utilize of @floop
for writing a custom discount. As an illustration, you’re going to be ready to write an equivalent code with ThreadsX.gash support
:
(imin3, xmin3, imax3, xmax3) = ThreadsX.gash support(
((i, x, i, x) for (i, x) in pairs([0, 1, 3, 2]));
init = (-1, Inf, -1, -Inf)
) attain (imin, xmin, imax, xmax), (i1, x1, i2, x2)
if xmin > x1
xmin = x1
imin = i1
pause
if xmax < x2
xmax = x2
imax = i2
terminate
return (imin, xmin, imax, xmax)
terminate
@explain (imin3, xmin3, imax3, xmax3) == (imin, xmin, imax, xmax)
🔬 Take a look at Code
@take a look at (imin3, xmin3, imax3, xmax3) == (imin, xmin, imax, xmax)
On the assorted hand, as you would possibly perhaps well presumably earn out about, it is a ways a ways more verbose and blunder-prone (e.g., the preliminary values and the variables are declared in varied space).
gash support
mapreduce
and gash support
are priceless when combining pre-existing operations. As an illustration, we can with out problems put into effect histogram by combining mapreduce
, Dict
, and mergewith!
:
str = "dbkgbjkahbidcbcfhfdeedhkggdigfecefjiakccjhghjcgefd"
f1 = mapreduce(x -> Dict(x => 1), mergewith!(+), str)
Dict{Char,Int64} with 11 entries:
'f' => 5
'd' => 6
'e' => 5
'j' => 4
'h' => 5
'i' => 3
'expedient' => 4
'a' => 2
'c' => 6
'g' => 6
'b' => 4
Label that this code has a performance enviornment: Dict(x => 1)
allocates an object for every and every iteration. Right here is imperfect namely in threaded Julia code attributable to it customarily invokes garbage sequence. To place away from this map back, we’re in a space to change Dict
with MicroCollections.SingletonDict
which does no longer allocate the dictionary within the heap. SingletonDict
would possibly perhaps well presumably presumably effectively moreover moreover be « upgraded » to a Dict
by calling BangBang.mergewith!!
. This would well presumably presumably effectively moreover merely then earn a mutable object for every and every job to mutate. We can then fabricate an environment apt parallel histogram operation:
the utilization of BangBang: mergewith!!
the utilization of MicroCollections: SingletonDict
f2 = ThreadsX.mapreduce(x -> SingletonDict(x => 1), mergewith!!(+), str)
@express f1 == f2
🔬 Take a look at Code
@take a look at f1 == f2
(For more records, be taught about Transducers.jl’s advert-hoc histogram tutorial.)
Let’s compute the histogram of collatz_stopping_time
over some fluctuate of preliminary values. Not just like the histogram example above, all of us is attentive to that the stopping time is a evident integer. So, it is a ways miles realistic to portray an array because the pointers constructing that maps a bin (index) to a depend. There would possibly perhaps be now no longer any longer a pre-defined reducing characteristic adore mergewith!
we’re in a space to portray. Fortunately, it is a ways miles straightforward to write it the utilization of @decrease() attain
syntax in @floop
:
the utilization of FLoops
the utilization of MicroCollections: SingletonDict
maxkey(xs:: AbstractVector) = lastindex(xs)
maxkey(xs::SingletonDict) = first(keys(xs))
characteristic collatz_histogram(xs, executor = ThreadedEx())
@floop executor for x in xs
n = collatz_stopping_time(x)
n > 0 || proceed
obs = SingletonDict(n => 1)
@decrease() attain (hist = Int[]; obs)
l = dimension(hist)
m = maxkey(obs)
if l < m
resize!(hist, m)
absorb!(gaze(hist, l+1:m), 0)
terminate
for (k, v) in pairs(obs)
@inbounds hist[k] += v
terminate
terminate
terminate
return hist
terminate
As we talked about above, @gash support() attain
blocks are damaged-down in two contexts; for the sequential substandard case and for combining the accrued outcomes from two substandard circumstances. Thus, for combining hist_left
and hist_right
, we maintain to change hist_right
to obs
. Right here’s why we maintain to take care of the circumstances the bag obs
is a SingletonDict
and a Vector
. As a result of a pair of dispatch, it is a ways terribly straightforward to absorb the adaptation within the two containers. We can correct utilize what Cross
defines for pairs
and easiest maintain to define maxkey
for arresting the final difference.
💡 Demonstrate
When writing @gash support() attain (L₁ = I₁; R₁), (L₂ = I₂; R₂), ..., (Lₙ =
, be determined that the
Iₙ; Rₙ)attain
block body can take care of arbitrary conceivable price of Lᵢ
substituted to Rᵢ
and no longer correct Rᵢ
s which are calculated right away within the for
loop body.
Instance utilization:
the utilize of Plots
plt = space(
collatz_histogram(1: 1_000_000),
xlabel = "Stopping time",
ylabel = "Counts",
brand = "",
size = (450, 300),
)
We utilize @floop executor for ...
syntax in convey that it is a ways uncomplicated to swap between varied form of execution mechanisms; i.e., sequential, threaded, and dispensed execution:
hist1 = collatz_histogram(1: 1_000_000, SequentialEx())
hist2 = collatz_histogram(1: 1_000_000, ThreadedEx())
hist3 = collatz_histogram(1: 1_000_000, DistributedEx())
@explain hist1 == hist2 == hist3
🔬 Take a look at Code
@take a look at hist1 == hist2 == hist3
As an illustration, we can with out problems study the performance of sequential and threaded execution:
julia> @btime collatz_histogram(1: 1_000_000, SequentialEx());
220.022 ms (9 allocations: 13.81 KiB)
julia> @btime collatz_histogram(1: 1_000_000, ThreadedEx());
58.271 ms (155 allocations: 60.81 KiB)
@threads
and @dispensed
Julia itself has Threads.@threads
macro for threaded for
loop and @dispensed
macro for dispensed for
loop. They are expedient for uncomplicated portray circumstances but advance with some boundaries. As an illustration, @threads
does no longer bag constructed-in reducing characteristic toughen. Though @dispensed
macro has reducing characteristic toughen, it is a ways restricted to pre-defined capabilities and it is a ways wearisome to pick out care of a pair of variables. Every of these macros easiest bag straightforward static scheduler and lacks an option adore basesize
supported by FLoops.jl and ThreadsX.jl to tune load balancing. Moreover, the code written with @threads
can no longer be reused for @dispensed
and vice versa.
Hopefully, this tutorial covers a unadorned minimal for you to launch writing records-parallel functions and the documentations of FLoops.jl and ThreadsX.jl are in actuality a diminutive more accessible. These two libraries are in accordance with the protocol designed for Transducers.jl which moreover comprises entirely varied devices for records parallelism.
Transducers.jl’s parallel processing tutorial covers a equivalent topic with explanations for more low-stage minute print. Parallel inspect depend tutorial in accordance with Man L. Steele Jr.’s 2009 ICFP talk is more improved but I salvage it a terribly horny example to portray for belief what’s doable with a wise construct of the reducing characteristic.
Label that guidelines equipped on this tutorial are very entire and must be acceptable moreover when the utilization of entirely varied libraries. As an illustration, the premise of custom discount is devoted in GPU computing when the utilization of mapreduce
on CuArray
.
💡 Label
Work in development. TODO: Add more tutorials and how-to guides.
3 Commentaires
eigenspace
Takafumi has really built up some amazing infrastructure in the package ecosystem. His Transducers.jl [1] is really interesting and powerful and lately he's done a lot of work with things like FLoops.jl [2] and ThreadsX.jl [3] to try and bring the benefits of transducers to more 'regular' familiar representations so more people can enjoy the benefits. The basic idea behind all of it is that he has an efficient and modular way of describing various 'looping' constructs that can be stuck together, optimized and parallelized automatically.
It'd be quite interesting to see this stuff extended to GPUs.
[1] https://github.com/JuliaFolds/Transducers.jl
[2] https://github.com/JuliaFolds/FLoops.jl
[3] https://github.com/tkf/ThreadsX.jl
melling
There's a current MIT course call Introduction to Computational Thinking that's using Julia. I've watch a handful of videos so far. Good introduction to Julia.
https://computationalthinking.mit.edu/Fall20/
tkf
Hi, the author here.
One of the hidden messages of the introduction is: watch Guy Steele's talks [1][2][3] if you are interested in data parallelism! These talks are not Julia-specific and the idea is applicable and very useful in any languages. My libraries are heavily inspired by these talks.
Of course, if you haven't used Julia yet, it'd be great if (data) parallelism gives you an excuse to try it out! It has a fantastic foundation for composable multi-threaded parallelism [4].
[1] How to Think about Parallel Programming: Not! https://www.infoq.com/presentations/Thinking-Parallel-Progra…
[2] Four Solutions to a Trivial Problem https://www.youtube.com/watch?v=ftcIcn8AmSY
[3] Organizing Functional Code for Parallel Execution; or, foldl and foldr Considered Slightly Harmful https://vimeo.com/6624203
[4] Announcing composable multi-threaded parallelism in Julia https://julialang.org/blog/2019/07/multithreading/