The triangularization stage requires additional computation: at each branch point, a node incorporates the information from all of its child nodes. This can be done simultaneously by assigning different processors to handle each branch in a particular level. Finally, the order of complexity for simulating one time step is:
Figure 1. Segment-based parallelism. The calculation for node 8 involves only segments 1-8, and is carried out in parallel according to Stone's algorithm. The calculation for node 12 involves segments 1-4 and 9-12 and is of the same nature as the calculation for node 8, but with different indexing.
Calculations for segments 1-4, which are marked by the dashed line, are the same on both sides and are carried out only once.

We expect that this approach would perform better for neurons that have many branches with a similar number of segments among the branches of the same level. When this method is used, the triangularization stage is the same as described above. The difference is in the back-substitution stage of the algorithm. A node's value depends solely on its parent's value. For an unbranched cable, the parent's index is the consecutive index.
Stone's algorithm solves the equations in log N parallel steps, where processor i after j steps has incorporated the values from the 2^j nodes preceding i, as shown in Figure 1. Thus, after log N steps each processor has integrated the values from all relevant nodes. A similar process is implemented for the branched cable back-substitution: the problem properties remain the same except for the unique path from node i to its root, which may include non-consecutive indices [the indexing algorithm for a branched cable is described in Hines].
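The doubling idea can be illustrated with a small serial emulation. This is a minimal sketch, not the CUDA implementation used in the study: it assumes that after triangularization each node satisfies x[i] = c[i] + m[i] * x[parent[i]], that the root is marked by parent[i] == i, and that parents always have a lower index than their children. The names back_substitute_serial and back_substitute_doubling are hypothetical. Each pass of the loop stands for one parallel step in which every node composes its relation with that of its current ancestor pointer and then jumps the pointer twice as far.

```python
def back_substitute_serial(parent, c, m):
    """Reference solution: visit nodes in index order (parents come first)."""
    x = [0.0] * len(parent)
    for i in range(len(parent)):
        if parent[i] == i:          # root: value is known directly
            x[i] = c[i]
        else:                       # child: depends only on its parent
            x[i] = c[i] + m[i] * x[parent[i]]
    return x


def back_substitute_doubling(parent, c, m):
    """Emulates ceil(log2(N)) synchronous parallel steps of pointer doubling."""
    n = len(parent)
    p = list(parent)
    c = list(c)
    # The root does not depend on anything, so its multiplier is zero.
    m = [0.0 if parent[i] == i else m[i] for i in range(n)]
    for _ in range((n - 1).bit_length()):   # equals ceil(log2(n))
        # All three lists are rebuilt from the old values, as if every
        # node were updated by its own processor at the same time.
        c_new = [c[i] + m[i] * c[p[i]] for i in range(n)]
        m_new = [m[i] * m[p[i]] for i in range(n)]
        p_new = [p[p[i]] for i in range(n)]
        c, m, p = c_new, m_new, p_new
    return c


# Tree shaped like Figure 1 (0-based): nodes 0-3 form the trunk,
# nodes 4-7 and 8-11 are the two branches attached to node 3.
parent = [0, 0, 1, 2, 3, 4, 5, 6, 3, 8, 9, 10]
c = [1.0, 0.5, -0.25, 2.0, 0.1, 0.2, 0.3, 0.4, -1.0, 0.7, 0.6, 0.9]
m = [0.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.95, 0.85]
serial = back_substitute_serial(parent, c, m)
doubled = back_substitute_doubling(parent, c, m)
```

For the 12-node tree the loop runs four times, matching the log N step count, and the non-consecutive parent of node 8 (node 3) needs no special handling because only the parent array changes.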
For N segments, the complexity is therefore:

Both branch-based and segment-based parallelization approaches are tailored to the GPU's architecture. Both approaches maintain controlled use of shared memory and simultaneously compute single instructions on multiple data (SIMD). The actual computation time on the GPU may vary from the theoretical complexity due to the special architecture properties and optimizations done by the compiler. Computation time for various instances of the problem will be analyzed in the following subsections. We used the reconstructed morphology of a pyramidal neuron from a previous study Keren et al.
Figure 2 shows the voltage responses of both simulations (middle), along with error curves (top). For the 1 nA sweep, the maximal and average root-mean-square errors were 0. Figure 2. Bottom: Current-clamp protocols for passive and active responses. The apparent increase in the error (Figure 2, top) during the spike train is due to a small linear shift in time, calculated to be 0.
Both time step values gave negligible disagreement for the passive stimulation (Figure 2, top). At each time step the application solves the membrane mechanism models and adds their conductance and current contributions to the current equation system (Hines and Carnevale). When all mechanisms are solved, the application can update the equation system and solve it. The result from the equation system is the membrane voltage of each segment of the neuron at the current time step.
The main computational effort for advancing a time step in the simulation derives mostly from three operations: calculating the membrane mechanisms, setting up the equation system, and solving it. We analyzed the model workload in two steps. First, we analyzed the runtime for solving the matrix, which represents the equation system, and the model (Figure 3).
Figure 3. Analyzing the runtime of matrix and model solving. Four different simulations were used to analyze the runtime of model solving and of solving the equation system (matrix): 1, full simulation; 2, simulation with a passive model; 3, simulation without solving the matrix (Hodgkin-Huxley model); 4, simulation with a passive model and without solving the matrix. To calculate the runtime of matrix solving, the runtime of the simulation without matrix solving was subtracted from that of the full simulation. To calculate the runtime of model solving, the runtime of the simulation with a passive model was subtracted from that of the full simulation.
The simulations were run on fork morphologies with , , and segments. Figure 4. Simulation of a single sweep of a single neuron of different topologies, models, and scale. (A) Median runtimes of a fork morphology (inset) with increasing number of segments for: NEURON simulation (solid lines); segment-based parallelization (dotted line); and branching-based parallelization (dashed line).
The Hodgkin-Huxley model is shown in black and the passive model in gray. (B) Median runtimes of full binary tree morphologies (inset) using the Hodgkin-Huxley model with increasing tree depth. The passive model is not shown for clarity.

The runtime of solving the matrix was analyzed by eliminating the relevant part of the code and running the simulation without it. The runtime of solving the model was analyzed using the passive model, which requires hardly any computational effort. We conducted another run using the passive model but without solving the matrix to estimate the runtime for the rest of the code (Figure 3).
Next, we subtracted the runtime of the simulation without the code for solving the matrix from the runtime of a full simulation to calculate the runtime for solving the matrix. Similarly, we subtracted the runtime of the simulation with the passive model from that of the simulation with the Hodgkin-Huxley model to calculate the runtime of computing the model. These tests were repeated with fork morphology of , , and segments (the morphology is displayed in Figure 4A). The results are shown in Figure 3. Most of the runtime was used for solving the matrix. With an increasing number of segments (heavier simulations), solving the matrix took a more significant share of the total runtime.
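The subtraction scheme can be written out as a short calculation. The following is a minimal sketch; the function name decompose_runtime and the runtimes passed to it are hypothetical placeholders chosen only to show the arithmetic, not measurements from this study.

```python
def decompose_runtime(full, passive, no_matrix, passive_no_matrix):
    """Split the total runtime into matrix, model, and remaining code.

    full              -- Hodgkin-Huxley model, matrix solved
    passive           -- passive model, matrix solved
    no_matrix         -- Hodgkin-Huxley model, matrix solving removed
    passive_no_matrix -- passive model, matrix solving removed
    """
    return {
        "matrix": full - no_matrix,      # cost of solving the matrix
        "model": full - passive,         # cost of the active model
        "rest": passive_no_matrix,       # everything else in the code
    }


# Hypothetical runtimes in milliseconds (illustrative values only).
parts = decompose_runtime(full=100, passive=90, no_matrix=30,
                          passive_no_matrix=20)
matrix_share = parts["matrix"] / 100
```

With these placeholder values the three parts add up to the full runtime, which is a useful sanity check on real measurements: a large mismatch would suggest that the removed code paths interact.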
This result agrees with the theoretical analysis: whereas the per-segment model calculation and the update of the equation system run in linear time, the complexity of solving the matrix is super-linear. Combinations of the back-substitution method with the different models and topologies are compared in Figure 4. Figure 4A shows the simulation runtime for a single trace as a function of the number of segments in the neuron.
We checked runtimes of both segment-based and branching-based parallelization approaches using both the passive and Hodgkin-Huxley models. For the fork topology, the segment-based parallelization performed best (Figure 4A). Figure 4B shows the runtimes for Hodgkin-Huxley simulation on a binary tree topology. In this topology, with small depths of less than 7, the segment-based parallelization still performed better than the branching-based parallelization. With depths greater than 7 there was enough branching in the tree for branching-based parallelization to outperform the other simulations.
Current-clamp experiments usually use multiple traces to evaluate the reaction of the neuron to different physiological conditions. As described in the Methods section, each thread block was organized such that the CUDA block's y-index described a different trace. The runtime analysis for simulation of multiple traces in a single multiprocessor is presented in Figure 5. We used current-clamp protocols, where each trace included data points, and stimulated the neuron with increasing number of traces from 1 to GPUs' runtimes increase sub-linearly due to optimization of multiprocessor context switching, leading to a performance factor of Figure 5.
Simulation of multiple sweep protocol and multiple neurons.
A simulation protocol using reconstructed pyramidal neuron morphology with branches (C) and a varying number of sweeps, where each sweep included data points. (A) Run times of increasing number of sweeps using three simulations: NEURON simulation (solid line); segment-based parallelization (dotted line); branching-based parallelization (dashed line). The y-axis shows the runtime on a log scale. (B) Run times of the three simulations on multiple neurons, where each neuron consists of 13 traces (65, data points) as described in A.
Finally, we utilized all 14 multiprocessors by simulating several neurons on different multiprocessors using the grid of blocks (NVIDIA, b). Different numbers of neurons were run and the performance of the two back-substitution approaches was compared with NEURON. We simulated 1– neurons. Above 14 neurons the GPU simulation runtimes increased linearly with neuron number, while below 14 neurons GPU runtimes remained virtually equal (Figure 5B).
Above 14 neurons, the runtime depended on the longest running multiprocessor, i. The method achieved a fold performance factor for a realistic configuration of morphology and stimulation protocol with several neurons.
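The stepwise dependence on neuron number can be illustrated with a simple load model. This is a minimal sketch under strong assumptions that are not taken from the measurements: all neurons take the same time, each multiprocessor handles one neuron at a time, and neurons are dealt out to the 14 multiprocessors in turn. The function name estimated_runtime and the unit runtime are hypothetical.

```python
def estimated_runtime(n_neurons, t_neuron, n_multiprocessors=14):
    """Runtime set by the multiprocessor that receives the most neurons."""
    loads = [0] * n_multiprocessors
    for k in range(n_neurons):
        loads[k % n_multiprocessors] += 1   # deal neurons out in turn
    return max(loads) * t_neuron


# Hypothetical unit runtime of 1.0 per neuron, for 1 to 42 neurons.
runtimes = [estimated_runtime(n, 1.0) for n in range(1, 43)]
```

Under these assumptions the estimate stays flat up to 14 neurons and rises by one unit each time the neuron count passes a multiple of 14, because the busiest multiprocessor then has one more neuron to simulate.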
We have developed a parallel algorithm for simulating multi-compartmental models of single neurons on GPU devices. Using GPUs significantly improves the flops-per-dollar ratio (Chien-Ping), thus making intensive neuronal simulations available to a larger audience. To validate the correctness of our model, we compared our results to those calculated in NEURON, resulting in a small error between the platforms (Figure 2), which decreased with smaller step size. Prior to implementing the simulation environment on the GPU we assumed that most of the computation time would be used in solving the current diffusion equation system, that is, the matrix.
We thus focused on solving the matrix more efficiently.
In the theory presented here, we introduced two methods for the back-substitution part of solving the equation system: segment-based (Figure 1) and branching-based parallelization. The branching-based method performed better for highly branched morphologies, while the segment-based method was best for relatively simple morphologies (Figure 4). The branching-based parallelization used less framework shared memory (see Methods) than the segment-based method. This may allow simulation of more sweeps per block, which means running heavier simulations faster. After validating the correctness of our application we explored the domains of applicability, that is, the domains in which our simulations performed better than NEURON.
We also examined for which cases each back-substitution method should be used. The most basic simulation used a fork morphology and a single sweep (Figure 4A) and reached a speedup of 1. This difference in performance was due to the advantages of the GPU with more complex models, since the math operations hide the memory transfer latency [for more details on memory latency on GPU see Nickolls et al.
We next checked more realistic simulations, using morphologies reconstructed from real neurons, with each simulation composed of several traces. When NEURON simulates multiple traces on the same neuron, the runtime increases in constant proportion to the number of data points. With 13 traces CUDA ran The final simulation we ran used multiple sweeps for many neurons. The increase in speed arose from each neuron being simulated on one of the 14 different multiprocessors composing the GPU.
Figure 5B shows that runtime increased in steps at multiples of 14 neurons, since runtime was set by the multiprocessor that finished last. Simulating many instances of the same neuron quickly opens the door to applying Monte Carlo simulations for statistical analysis and optimization algorithms; GPUs outmatch most platforms for these kinds of problems Lee et al. Using GPUs for simulating compartmental models may thus decrease runtime by about two orders of magnitude.
Naive optimization using OpenCL, or running the same algorithm used by the CPU code converted to a GPU, may achieve only minor speedups of about one order of magnitude. Our application, however, was tailored to the GPU and implemented several optimizations to take advantage of its computational power. For example, instruction-level parallelism (ILP; Volkov) dealt with multiple segments, so each thread calculated several segments.
Using ILP allowed us to use a fixed number of threads (32, a single warp) to simulate the most basic simulation unit, one sweep. Using one warp for the whole sweep fixed the CUDA block's x-dimension to 32, while the y-dimension was the number of sweeps in the protocol (see Methods). This layout was optimal in our application and spared the user the need to find the optimal number of threads in the block, as in our previous GPU ion channel simulator Ben-Shalom et al.
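The block layout can be pictured with a small index calculation. This is a minimal sketch: the strided assignment of segments to threads is an assumption made for illustration (the text states only that each thread calculated several segments), and the names segments_for_thread and block_dimensions are hypothetical.

```python
WARP_SIZE = 32


def segments_for_thread(thread_x, n_segments, warp_size=WARP_SIZE):
    """Segments handled by one thread, assuming a stride of one warp."""
    return list(range(thread_x, n_segments, warp_size))


def block_dimensions(n_sweeps, warp_size=WARP_SIZE):
    """Block shape: x is one warp, y is one row of threads per sweep."""
    return (warp_size, n_sweeps)


# Hypothetical neuron with 100 segments and a 13-sweep protocol.
per_thread = [segments_for_thread(t, 100) for t in range(WARP_SIZE)]
dims = block_dimensions(13)
```

With 100 segments, threads 0 to 3 each handle four segments and the remaining threads handle three, so the work per thread grows with neuron size while the thread count stays fixed at one warp per sweep.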
We gained a further increase in performance by using a buffer in shared memory, the size of a warp, that held the output of the recorded site in the simulation. This buffer was filled every 32 time steps by the different threads in the warp, and its contents were then transferred to global memory. This optimization ensured that slow global memory transfers occurred only every 32nd step and were coalesced; uncoalesced memory transfers are major slowdown factors in CUDA applications Ryoo et al.

Instead, we suggest the use of GPUs for compartmental modeling.
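The effect of the warp-sized output buffer described above can be emulated serially. This is a minimal sketch, not the CUDA code: plain Python lists stand in for shared and global memory, the function name record_with_buffer is hypothetical, and the final flush of a partly filled buffer is an assumption about how leftover samples are handled.

```python
WARP_SIZE = 32


def record_with_buffer(samples, warp_size=WARP_SIZE):
    """Collect one value per time step; write out once per full buffer."""
    global_mem = []                 # stands in for slow global memory
    buffer = [0.0] * warp_size      # stands in for the shared-memory buffer
    filled = 0
    flushes = 0
    for step, value in enumerate(samples):
        buffer[step % warp_size] = value
        filled += 1
        if filled == warp_size:     # buffer full: one bulk transfer
            global_mem.extend(buffer)
            flushes += 1
            filled = 0
    if filled:                      # assumed: flush leftover samples at the end
        global_mem.extend(buffer[:filled])
        flushes += 1
    return global_mem, flushes


# Hypothetical trace of 100 recorded voltages.
trace = [float(v) for v in range(100)]
stored, n_transfers = record_with_buffer(trace)
```

For 100 time steps the sketch performs four transfers instead of 100 individual writes, and on the GPU each full transfer would consist of 32 adjacent values written by the 32 threads of the warp.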
Currently, users can extend our application for their needs by adding membrane mechanisms, using our implementation of the Hodgkin-Huxley model as an example. A further natural extension would be to simulate realistic neuronal networks on GPUs. This would result in a supercomputer with the computational abilities of the expensive Blue Gene Gara et al.
Until neuronal network simulation is possible on GPUs, one can use hybrid CPU-GPU supercomputers where the complicated neurons may be simulated using our application, while the CPUs simulate the network organization and communication between the neurons. The aim of this study was to construct software that uses GPUs for simulating compartmental models. GPUs are evolving very quickly. CUDA's scalability permits easy porting of our application to new and future architectures. Easy scalability ensures that the application will perform even better with GPU development.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. This project was supported by the German Israeli Foundation Mara Almog for the help with the reconstructed cells.
Ben-Shalom, R. Optimizing ion channel models using a parallel genetic algorithm on graphical processors. Methods , — Carnevale, N. Chien-Ping, L. De Schutter, E. Computer software for development and simulation of compartmental models of neurons. Pubmed Abstract Pubmed Full Text. Druckmann, S. A novel multiple objective optimization framework for constraining conductance-based neuron models by experimental data. Fidjeland, A. ASAP Gara, A. IBM J. Harada, T. Thalmann and S. Musse Petropolis: SBC , 63— Hegland, M.
On the parallel solution of tridiagonal systems by wrap-around partitioning and incomplete LU factorization. Hines, M.
Efficient computation of branched nerve equations. A program for simulation of nerve equations with branching geometries.