At a high level we can view computers as von Neumann machines, with an area of main memory (RAM) and a CPU for performing computations. When the CPU executes instructions it isn’t actually operating on data that lives in main memory, but on copies of that data in a CPU-local memory called registers.
Typically a CPU will have only enough registers to store a handful of numbers.
Since the CPU can only hold a few numbers at a time, when it receives an instruction it usually must fetch the data all the way from main memory, load it into its registers, perform the computation, store the result in another register, and finally copy the result back into main memory. This means the simple-looking code z = x + y actually becomes:
load x from RAM into register r1 <- slow
load y from RAM into register r2 <- slow
r3 = r1 + r2
store r3 into RAM as z <- slow
Copying from RAM into registers is a very slow operation compared to performing an arithmetic operation on the CPU. For a human, it is the difference between blinking and brushing your teeth.
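To put very rough numbers on it: on typical modern hardware an arithmetic operation on registers takes around a single CPU cycle, while a round trip to main memory can cost on the order of a hundred or more cycles (the exact figures vary from machine to machine).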
For the majority of programs this is fine, as modern architectures have caches which hold recently-used values close to the CPU. The CPU can also request data before it actually needs it (prefetching), so that by the time it does, the data is already there.
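You can glimpse the effect of the cache and prefetcher from ordinary software. The following is a small, illustrative experiment of my own (it assumes NumPy is installed, and the exact numbers vary wildly between machines): it sums the same number of elements once contiguously and once with a large stride. The strided version touches a new cache line for almost every element, so it typically comes out several times slower per element.

import time
import numpy as np

# An array far larger than any CPU cache (~512 MB of float64).
a = np.arange(64_000_000, dtype=np.float64)

def ns_per_element(view):
    start = time.perf_counter()
    view.sum()
    return (time.perf_counter() - start) / view.size * 1e9

print("contiguous:", ns_per_element(a[:4_000_000]), "ns/element")  # prefetcher-friendly
print("strided:   ", ns_per_element(a[::16]), "ns/element")        # each element on its own cache line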
When writing performance-critical software, however, this traffic to and from main memory is often what dominates execution time. Fetching data all the way from RAM is slow and CPUs themselves are very fast, so for much of a program's runtime the CPU is not computing at all but sitting idle, waiting for data. Dataflow architectures provide one way to alleviate this.
Consider a neural network, where we repeatedly perform the same series of computations on different input vectors. The only thing that varies is the input values; the computation itself is identical every time:
def run_through_network(input_vector: Vector):
    output_vector = run_layer_1(input_vector)
    output_vector = run_layer_2(output_vector)
    output_vector = run_layer_3(output_vector)
    ...
    return output_vector
Ignoring caches and other technicalities, this might look something like the following at the machine level:
load the input vector from RAM into CPU registers <- slow
run layer 1
store the result in RAM <- slow
load the result from RAM <- slow
run layer 2
store the result in RAM <- slow
load the result from RAM <- slow
run layer 3
store the result in RAM <- slow
...
But this is incredibly inefficient. Why would we store the result in RAM (slow and far away) and fetch it again when we already have it in the CPU's registers? Admittedly, modern computers would cache something this trivially simple, but since caches are implemented in hardware, they can catch only simple patterns.
Dataflow architectures avoid this. One might specify a series of instructions as a graph, where we model how data flows through the program. Each operation (layer 1, layer 2, …) could then have its own unit of hardware, and we just funnel data through these units. Something more like:
fetch input vector
send input vector to layer 1
forward the result to layer 2
forward the result to layer 3
...
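To make the graph idea concrete, here is a minimal software sketch of a dataflow graph (my own illustration, in Python rather than hardware; the layer operations are hypothetical stand-ins for run_layer_1 and friends). Each node owns one operation and forwards its result directly to its successors, never going back through a shared memory:

from dataclasses import dataclass, field
from typing import Callable, List

Vector = List[float]

@dataclass
class Node:
    # One unit in the graph: an operation plus the units it feeds.
    op: Callable[[Vector], Vector]
    successors: List["Node"] = field(default_factory=list)

    def receive(self, data: Vector) -> None:
        result = self.op(data)       # compute locally, on data already at hand
        for nxt in self.successors:  # forward the result straight to the next units
            nxt.receive(result)

# Wire up three hypothetical layers and a sink that prints the final output.
sink   = Node(op=lambda v: print("output:", v) or v)
layer3 = Node(op=lambda v: [x - 1.0 for x in v], successors=[sink])
layer2 = Node(op=lambda v: [x + 1.0 for x in v], successors=[layer3])
layer1 = Node(op=lambda v: [x * 2.0 for x in v], successors=[layer2])

layer1.receive([1.0, 2.0, 3.0])  # "fetch input vector, send it to layer 1"

Note that no intermediate result ever returns to a central store: each receive call hands its output straight to the next node, mirroring how the hardware units would be wired together.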
This model means we aren't wasting time storing data in and fetching it from a far-away main memory; instead, each computation has the data it needs close by. In practice, dataflow architectures are best suited to hardware designs, where each operation can be an independent circuit. Another benefit is that if the operations are independent enough, you can achieve significant parallelism, with tasks running at the same time.
For our example, once a vector has gone through layer 1, we can send another vector to layer 1 immediately whilst the first is now being processed in layer 2, rather than waiting for it to go through the whole pipeline.
This can happen throughout the whole network, meaning that with n stages we can have n pieces of data in flight at the same time, which can make the program up to n times faster.
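A quick back-of-the-envelope sketch (my own, assuming every stage takes one time step) shows where that speedup comes from: a serial machine pays the full pipeline depth for every input, whereas the pipelined version pays it once to fill the pipe and then produces one result per step.

def serial_steps(n_stages: int, n_inputs: int) -> int:
    # Each input walks the whole pipeline on its own.
    return n_stages * n_inputs

def pipelined_steps(n_stages: int, n_inputs: int) -> int:
    # Fill the pipeline once, then one result comes out every step.
    return n_stages + (n_inputs - 1)

print(serial_steps(3, 1000))     # 3000 steps
print(pipelined_steps(3, 1000))  # 1002 steps: roughly 3x faster for a 3-stage pipeline

The gap between the two numbers grows with the number of inputs, approaching the n-times speedup described above.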
It would be expensive to produce new hardware units every time we want to edit our programs, so usually we use a hybrid architecture that is “dataflow-ish”, such as Field Programmable Gate Arrays (FPGAs).
FPGAs are one type of hardware that can be reprogrammed at any time. That is, you can describe circuits in code and then load them onto the FPGA whenever you like. Whilst not as fast as fabricated silicon designs, they are much faster than software, and are great for prototyping real hardware or for implementing smaller dataflow programs in production.
Another type, more suited to large-scale production, is architectures like the Tenstorrent ML accelerators. These machines have many general-purpose cores, each with its own large local memory, arranged in a grid so that they can pass data directly to one another rather than through a global RAM. Here you can program each core as an independent computation, and write data-movement programs to route data through these cores.
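As a rough software analogy (my own sketch, not Tenstorrent's actual programming model, and with Python's global interpreter lock it gains no real parallelism), you can picture each core as an independent worker with its own inbox, passing results directly to its neighbour rather than through a shared memory:

import threading
import queue

def core(stage_fn, inbox: queue.Queue, outbox: queue.Queue) -> None:
    # A "core": pull data from the neighbour, compute locally, pass the result on.
    while True:
        item = inbox.get()
        if item is None:       # sentinel: shut down and tell the next core to do the same
            outbox.put(None)
            return
        outbox.put(stage_fn(item))

# Three stages wired in a line, each running on its own worker.
q0, q1, q2, q3 = (queue.Queue() for _ in range(4))
stages = [lambda v: [x * 2 for x in v],
          lambda v: [x + 1 for x in v],
          lambda v: [x - 3 for x in v]]
for fn, inp, out in zip(stages, (q0, q1, q2), (q1, q2, q3)):
    threading.Thread(target=core, args=(fn, inp, out), daemon=True).start()

for vec in ([1, 2], [3, 4], [5, 6]):   # stream several input vectors through the line
    q0.put(vec)
q0.put(None)

while (result := q3.get()) is not None:
    print("output:", result)

Each worker only ever touches its own inbox and outbox; writing those small per-core programs and the connections between them is the flavour of data-movement programming described above.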
With the right programming, dataflow architectures can give significant speedups to programs that spend the majority of their time moving data to and from main memory (memory-bound programs). Recently a lot of dedicated machine-learning hardware has been released in the form of highly parallel dataflow architectures, suggesting we may have to adapt the way we think about programming more generally if we want to keep seeing increases in performance.