Scenes from a graphical, parallel biology (Presentation)

 This talk was originally prepared for the Embryo Physics course in Second Life on April 4, 2012.

So, what IS a graphical, parallel approach to computational biology? To be clear, "graphical, parallel" biology means simulating biological systems using graphics processing hardware, with the computation done in parallel. The state-of-the-art technology for this is the GPU (graphics processing unit), which I will now describe.
The G stands for graphics. This technology is commonly applied to rendering (drawing and updating in real time) computer graphics for movie special effects and video games. The power of these processors has increased greatly over the past 10-15 years, especially when they are arrayed in parallel. The P stands for processing. While this hardware is generally used for computing graphics, researchers in the physical, biochemical, and computational sciences have increasingly been using GPU processing capabilities to solve "hard" problems.

Processing also refers to the need for special strategies and data structures when handling datasets in parallel. What we typically do in science is analyze data using serial processing (e.g. algorithms, statistical routines). Processing data in parallel requires us to rethink how we pre-process data and solve problems. The U stands for units: the single cores that process data in parallel. GPU systems are unique in their design. Each core handles a small part of a given problem; collectively, however, all the cores solve the problem much faster than a conventional processor with greater bandwidth operating on the data as a single stream (e.g. a CPU).

In comparing GPUs to CPUs, there is a speed-up for certain (but not all) tasks. Tasks that take advantage of the GPU chip design do particularly well. One example is a fast kNN search. In this graph, the y-axis represents the speed-up factor for GPUs relative to CPUs. For each version of the kNN algorithm, GPUs outperform CPUs, although the advantage decreases with increasing n.
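To make the comparison concrete, here is a minimal serial sketch (in C++, not actual GPU code) of the brute-force distance scan at the heart of a kNN search. The function name and one-dimensional data are my own illustration, not from the benchmark shown; on a GPU, each iteration of the distance loop would be assigned to its own thread.

```cpp
#include <cmath>
#include <vector>

// Brute-force 1-nearest-neighbor search in one dimension. The distance
// loop has no dependencies between iterations, which is exactly what
// makes it a good candidate for one-thread-per-point GPU execution.
int nearestNeighbor(const std::vector<float>& refs, float query) {
    int best = 0;
    float bestDist = std::fabs(refs[0] - query);
    for (std::size_t i = 1; i < refs.size(); ++i) {
        float d = std::fabs(refs[i] - query);  // one GPU thread per i
        if (d < bestDist) { bestDist = d; best = static_cast<int>(i); }
    }
    return best;
}
```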

This is due to two unique features of GPU processor design. One is that all units (or cores) connect to a global memory, which can be used to partition data. The other is that each core has a multithreaded architecture and on-chip memory, which can be used to optimize local processing. A typical GPGPU system will have cores arrayed in multiples of 8 (e.g. 4096 cores).

These differences translate into differences in how data is handled on the GPU. At left is a fractal based on the Julia set. In the center is the CPU code; at far right is the CUDA code for the GPU. The biggest difference is that GPU code must map the input data to a grid-and-block structure before distributing the job to the array of cores. This is done by specifying thread (threadIdx) and block (blockIdx) indices.
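As a sketch of that mapping, written serially in C++ rather than CUDA (the function name is my own), each thread's global index is computed from its block and thread indices:

```cpp
#include <vector>

// Emulates a 1-D CUDA launch: each thread's global index is
// blockIdx.x * blockDim.x + threadIdx.x. The nested loops stand in
// for the hardware enumerating blocks and threads.
std::vector<int> globalIndices(int gridDim, int blockDim) {
    std::vector<int> idx;
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx)
            idx.push_back(blockIdx * blockDim + threadIdx);
    return idx;
}
```

With 2 blocks of 4 threads, the indices 0 through 7 cover the data exactly once.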

This multithreaded architecture allows for stream processing, in which a function (the kernel) operates on a set of input data and produces a set of output data. The consequence of this is that there are no dependencies between elements -- meaning that processing does not directly allow for recursion. This is done in parallel fashion, so that each core takes a small chunk of data and produces an output simultaneously.

Before we go further, we need to understand that analyzing a problem in parallel is not the same as doing so serially. When we add cores, we should not expect a linear increase in performance. For example, we should not expect a 4-fold gain in performance by using four processors instead of one. There are practices, such as load balancing, which can optimize performance scaling, however.
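This sub-linear scaling is captured by Amdahl's law (the function name here is my own): for a parallel fraction p of the work and n processors, the speedup is 1 / ((1 - p) + p/n).

```cpp
// Amdahl's law: if a fraction p of the work is parallelizable and the
// rest is serial, n processors give a speedup of 1 / ((1 - p) + p / n).
// With p = 0.9 and n = 4, the speedup is only about 3.1x, not 4x.
double amdahlSpeedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}
```

Only in the idealized case p = 1 does quadrupling the cores quadruple the performance.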

At the top of this slide is a typical CUDA architecture. Note that each group of 8 processors shares a 16 KB buffer of shared (local) memory. There is also communication with off-chip (global) memory. Each function is characterized by a kernel, which maps the data to where it is processed. Keep in mind that we are distributing the problem, updating, and then re-assembling the outputs in global memory.
Here is a "hello world" (i.e. basic) example of vector addition. On a CUDA architecture, this would be solved in the following way: A and B would be added together element by element, from index 0 to 4. The output, C, would be returned to global memory and reassembled as a vector. This is different from a serial approach because all of these additions can be performed at the same time.
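For reference, here is the same vector addition written as a serial C++ function; on the GPU, the body of the loop would be the kernel and each index i would belong to its own thread.

```cpp
#include <vector>

// Serial equivalent of the CUDA vector-addition kernel. On the GPU,
// each element C[i] = A[i] + B[i] is computed by a separate thread;
// here the loop stands in for the thread grid.
std::vector<int> vecAdd(const std::vector<int>& A, const std::vector<int>& B) {
    std::vector<int> C(A.size());
    for (std::size_t i = 0; i < A.size(); ++i)
        C[i] = A[i] + B[i];  // one thread per i on the GPU
    return C;
}
```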

Next I will show you the addressing system on a CUDA architecture. In the example on this slide, there are four threads for every block. They are indexed in a row, column format. Threads can cooperate because they reside in the same block. Blocks are also organized in a row, column manner, but they cannot cooperate. So why do we have blocks? One use for blocks is to enable multigrid methods, which provide a grid that is resized as an algorithm converges. This approach is often used for computations that involve multiple "scales" of analysis.
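The row, column addressing can be sketched as follows (serial C++; the struct and function names are my own illustration): a thread's global position combines its block's position with its position inside the block.

```cpp
// A thread's global (row, col) in a 2-D CUDA launch:
//   row = blockIdx.y * blockDim.y + threadIdx.y
//   col = blockIdx.x * blockDim.x + threadIdx.x
struct Coord { int row, col; };

Coord globalCoord(int bIdxY, int bIdxX, int tIdxY, int tIdxX,
                  int blockDimY, int blockDimX) {
    return { bIdxY * blockDimY + tIdxY, bIdxX * blockDimX + tIdxX };
}
```

For example, with 2x2 blocks, the thread at (1, 1) inside block (1, 0) sits at global row 3, column 1.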

Now I will briefly describe a way to optimize GPU performance for datasets with multiple features (e.g. variables). In the example shown, a single variable is being mapped (i.e. decomposed) into a computational scheme (tiles). This is just one way datasets can be processed to take advantage of the CUDA architecture. On the bottom right of the slide are some "tips and tricks" for handling data.

Finally, I want to share with you a typical algorithm that has been modified for a graphical, parallel (GPU) system. This is the mergesort algorithm, which would produce a sorted deck of playing cards after a bout of 52-card pickup. What is special about this implementation is that recursion is replaced by the iterative transfer of strings between memory stores. Since these local sorts are all done in parallel, the process is faster than a serial sort with recursion.
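A minimal serial sketch of that idea is a bottom-up (iterative) merge sort, which replaces recursion with passes over runs of doubling width; on a GPU, all the merges within one pass can run in parallel. This C++ version is my own illustration, not the implementation from the slide.

```cpp
#include <algorithm>
#include <vector>

// Bottom-up merge sort: merge runs of width 1, 2, 4, ... until the
// whole array is one sorted run. Within a pass, the merges touch
// disjoint chunks, so a GPU can assign each merge to its own block.
void mergeSortIterative(std::vector<int>& a) {
    std::vector<int> buf(a.size());
    for (std::size_t width = 1; width < a.size(); width *= 2) {
        for (std::size_t lo = 0; lo < a.size(); lo += 2 * width) {  // parallel on a GPU
            std::size_t mid = std::min(lo + width, a.size());
            std::size_t hi  = std::min(lo + 2 * width, a.size());
            std::merge(a.begin() + lo, a.begin() + mid,
                       a.begin() + mid, a.begin() + hi, buf.begin() + lo);
        }
        std::copy(buf.begin(), buf.end(), a.begin());  // write pass back
    }
}
```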

 Done with the technical part of the talk! Now we are ready to see what biology in parallel looks like!

 In my initial investigations into GPU computing, I focused on two areas of personal interest: models of distributed behavior (so-called flocking and swarming), and phylogenetics (how to reconstruct evolutionary relationships).

In the phylogenetic case, we can see that parallel methods are capable of implementing classic algorithms (the Peeling algorithm, which allows us to build a tree of life one branch at a time) in new ways. However, we can also see how graphical, parallel architectures are not always the way to go.

However, I am interested in graphical, parallel computing for reasons other than "speeding things up". One of these is the use of texture maps to represent repeating patterns, or motifs, in the data. Motifs are usually used in sequence analysis, but they may also serve as a way to partition large data sets and processes. Texture maps were originally used in computer graphics to tile 3D objects with a 2D surface. In this sense, they provide a level of detail to a surface.

This could potentially address a host of open biological problems in fundamentally new ways. But I have identified four areas of biology that might benefit from the use of graphical, parallel computing.

From my own work, I am interested in how signaling molecules (especially so-called leaderless mRNAs and proteins) behave collectively and participate in paracrine signaling.

 While this is a tricky problem, at least one paper has characterized them as being part of a parallel distributed process.
In a recent poster, I proposed a model for paracrine signaling and its relationship to cellular reprogramming (the conversion of one cell type to another). This has not been adapted to a graphical, parallel world as of yet, but serves as a prime candidate for future research.

In my case, I am trying to map a hybrid model (genetic algorithm + cellular automata) to CUDA architectures. One thing this may enable is for interactions in many different contexts to be tested simultaneously, with the resulting solutions being assembled and compared in global memory.
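As a hedged sketch of the cellular-automaton half of such a hybrid (rule 110 is chosen purely as an illustration, not taken from the model itself), here is a single update step of a 1-D binary automaton in C++. Because each cell's next state depends only on its current neighborhood, every cell can be updated by its own GPU thread.

```cpp
#include <vector>

// One synchronous update of a 1-D binary cellular automaton with
// wrap-around boundaries. The rule number's bits encode the next state
// for each 3-cell neighborhood pattern; 110 is used here as an example.
std::vector<int> caStep(const std::vector<int>& cells) {
    std::size_t n = cells.size();
    std::vector<int> next(n);
    for (std::size_t i = 0; i < n; ++i) {        // one thread per cell on a GPU
        int l = cells[(i + n - 1) % n], c = cells[i], r = cells[(i + 1) % n];
        int pattern = (l << 2) | (c << 1) | r;   // neighborhood as a 3-bit value
        next[i] = (110 >> pattern) & 1;          // rule 110 lookup
    }
    return next;
}
```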

Graphical, parallel computing is also used to better understand the expression and action of genes, the activity of RNA molecules, and the function of proteins. I have provided several citations for your reference.
Here is an example of an RNA folding algorithm implemented on a GPU. Notice the use of blocks and time flow to characterize the input data matrices.

One example of interest to this group is simulating morphogenesis. In this case, we are trying to approximate an emergent process between individual cells. This is a good fit for graphical, parallel approaches, as higher-dimensional geometry and parallel processes are key features of morphogenesis. In this example, models were specified in CompuCell 3D and then computed on a GPU.

Population modeling is another inherently parallel process that can be explored in a graphical, parallel world. In this case, agents or populations of agents could be distributed across nodes, while the best solution (optimization) would then be selected from the outputs. Here, I am envisioning a two-tiered optimization process.

 Using this approach, we may also be able to explore the potential of populations to act as averaging and filtering mechanisms for stochasticity at multiple levels of biological organization (e.g. an order from chaos-type effect).

Finally, I return to texture maps and how mipmaps could be used for dealing with multiscalar processes. In this case, I am proposing that mipmaps (applying textures at different scales) can be used to exploit the internal structure of a dataset.

This includes processes such as growth (allometry) and fractal growth (self-similarity). In this case, growth would involve the addition, subtraction, or size scaling of these mipmaps. Of course, this is more of an idea at this point.
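To make the idea concrete, here is a minimal sketch of how one mipmap level might be derived from the next finer one, using a 1-D signal and simple pair-averaging (the function name and averaging scheme are my illustrative assumptions, not an established method from the talk):

```cpp
#include <vector>

// Mipmap-style downsampling: each level halves the resolution by
// averaging adjacent pairs, exposing the data's structure at a
// coarser scale. Repeated calls build the full mipmap pyramid.
std::vector<double> nextMipLevel(const std::vector<double>& level) {
    std::vector<double> coarse(level.size() / 2);
    for (std::size_t i = 0; i < coarse.size(); ++i)
        coarse[i] = (level[2 * i] + level[2 * i + 1]) / 2.0;
    return coarse;
}
```

Growth by "addition or subtraction of mipmaps" would then amount to inserting or removing levels of this pyramid.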

So this talk was a bit different from most talks on graphical, parallel (or GPU) computing.

Again, I am interested in special attributes of CUDA (or similar) architectures other than speed-up. In general, "graphical" approaches excel at matrix-intensive calculations. These include physics calculations and mathematical models that require iterative updating and path prediction (e.g. Newtonian mechanics).

However, graphical, parallel approaches can also be used for multi-dimensional datasets and complex geometries. One example I didn't cover is from neuroimaging, and involves the coupling of anatomical and functional data (co-registration). Finally, it must be stressed that parallelism offers new opportunities and challenges for scientific computing that are only now beginning to be appreciated.

Thank you for your attention.