Controlled Magnetic Fusion (by J. N. Leboeuf)

Of the energy sources currently available worldwide--such as coal, oil, gas, nuclear fission, and solar energy--only controlled thermonuclear fusion has the capability of supplying the needed power to bring the population of the Earth to an energy consumption level comparable to that of the United States. Sixty petawatts per hour would be needed annually if the current world population were to consume the 12,000 kW/h that each citizen of the US uses per year on average. Fusion has the potential of delivering this huge amount of power without environmental catastrophe and for an extended, if not infinite, time.

Controlled fusion relies on a plasma or fully ionized hot gas made up of electrons and ions. The initial plasma fuel of choice is deuterium-tritium, whose constitutive atoms can be made to fuse in a controlled and sustained way when the ``Lawson'' criterion is satisfied:

temperature tex2html_wrap_inline80 density tex2html_wrap_inline82 confinement time tex2html_wrap_inline84 sec tex2html_wrap_inline86 keV.

Magnetic fusion further relies on external and/or plasma-generated magnetic fields in toroidal devices such as tokamaks and stellarators to confine a moderate density ( tex2html_wrap_inline88 ), high-temperature (10 keV) plasma on a time scale of a few seconds to satisfy the Lawson criterion.

Computational Issues

Because of the inherent space- and time-scale disparities involved, most of the outstanding physics questions in fusion cannot be answered analytically or experimentally. In particular, large-scale numerical calculations are required to understand and control the anomalous transport caused by plasma fluctuations. Two primary computational approaches are being used to model magnetically confined fusion plasmas.

Both of these first-principles approaches require solving an enormous number of ODEs and PDEs on very different, yet coexisting space ( tex2html_wrap_inline90 m to m) and time ( tex2html_wrap_inline90 sec to sec) scales in the complex geometry dictated by the confining magnetic field.

Architectural Issues

One issue is which of the types of architecture envisioned for a petaflops machine--global shared memory, network of processors, or processor-in-memory (PIM)--is best suited for PIC and MF plasma models? Because of current memory limitations and because particle models are highly parallel, PIC modelers tend to prefer networks of processors (with PIM the second choice, and global shared-memory the last). On the other hand, because of current performance and ease of programming, MF modelers tend to favor global shared-memory (with networks of processors second, and PIM last).

Memory

Related to the choice of architecture is the issue of total memory requirements. Based on present memory occupancy, these requirements have been determined to be as follows:

PIC Calculations[1]:

MF calculations (problem size scaled naturally to 1 petaflops):

Execution

At present, detailed performance models assessing the breakdown between sequential, parallel, and communications-dominated parts of both PIC and MF calculations are not available for distributed-memory platforms such as the Paragon and the T3D. For PIC models, current benchmarks indicate that the total execution time (particle push, charge deposition, field solve, force distribution to particles) for one time step scales as

Total time tex2html_wrap_inline98 tex2html_wrap_inline100 .

This scaling has been obtained for a fixed problem size and for an increasing number of processors from 8 to 128. Such scaling would appear to be quite favorable to a large number of processors, assuming that sufficient memory per processor is available to fit enough particles and grid points in node memory to keep the processors busy computing instead of communicating as the calculation size is scaled up.

I/O

An I/O traffic of 2.7 petabytes to disk in 8.1 hours of execution time is anticipated for MF calculations with 1200 grid points and 36,000 modes on a petaflops machine. File sizes of 1.6 gigabytes will be written to disk every time step, for an aggregate I/O throughput of 93 gigabytes/sec. This I/O capacity needs to be accommodated in the design of a petaflops machine.

Scaling from Current Performance

Two issues are considered here: memory scaling and efficiency.

Memory Scaling. Total memory requirements for the large-scale high-resolution, three-dimensional plasma calculations envisaged on a petaflops machine have been assessed in a slightly different way for PIC and MF models.

For PIC calculations, one possible path involves increasing the problem size by two orders of magnitude above what is feasible on current platforms and increasing the physical time of the calculation. This approach would require 2 terabytes of memory for a calculation with 1000 tex2html_wrap_inline94 1000 tex2html_wrap_inline94 256 grids and 100 particles per cell. On the other hand, both computational time and memory scale linearly with the number of particles, since the six particle arrays dominate over the five field arrays when a large number of particles is used per cell. A rather memory-taxing linear scaling of

Memory size tex2html_wrap_inline98 tex2html_wrap_inline108

is expected, so that a sustained speed of 1 petaflops would result in 1 petabyte from the 10 gigaflops currently achieved on the T3D system for 10 gigabytes of aggregate memory.

For MF calculations, total memory requirements are more difficult to determine because of geometry considerations and the interplay between storage of matrices and mode-redirection arrays for the convolutions. In cylindrical geometry, the total memory scales roughly like the number of modes squared multiplied by the number of grid points, and the mode convolution arrays dominate. In toroidal geometry, the total memory scales approximately like the number of grid points multiplied by the number of modes squared multiplied by the maximum toroidal mode number included in the calculation, and the matrix arrays dominate. Current calculations on the C90 with 420 grid points and 4000 modes take 50 hours of CPU time at 6.5 gigaflops sustained for a physical time of 1 tex2html_wrap_inline90 s. It is anticipated that a petaflops machine will allow for calculations with three times the number of grid points and nine times the number of modes currently possible on the C90 covering 1 ms of physical time, or a factor of 1000 over the currently achievable physical time. This improvement would entail an increase of the memory to 0.5 terabytes for cylindrical calculations, as compared with the current 2 gigabytes on the C90, yielding a scaling of

Memory size tex2html_wrap_inline98 tex2html_wrap_inline114

and an increase to 50 terabytes for toroidal calculations for a scaling of

Memory size tex2html_wrap_inline98 tex2html_wrap_inline118 .

These memory scalings remain to be verified empirically.

Efficiency and Communications Overhead. Because of the small memory per node on distributed-memory platforms such as the iPSC/860 (8 to 15 megabytes), and even the T3D (128 megabytes), the most memory-intensive arrays cannot be stored in memory for the large-scale MF calculations of interest here (the matrix and its inverse needed for the linear implicit part of the calculation). Instead, the inverse is recalculated at every time step in the calculation, resulting in a longer execution time as compared with that on shared-memory platforms. In fact, a penalty factor of 20 in execution time per step is incurred on a single processor of the C90 if this minimum storage implementation is adopted there. This situation, combined with poor compiler optimization and the communication overhead associated with the intricate parallelization of the matrix inversion, helps explain in part the current best performance at 0.3 gigaflops sustained on 128 processors of the Paragon.

Technical Issues

A critical issue from an architecture viewpoint is the availability of ample memory per processor so that the calculation size can be scaled up as the number of accessible processors increases in order to maintain optimal performance. That is, per-processor computations must be maximized while interprocessor communications are minimized.

Care should also be taken so that the operating system does not occupy a significant fraction of the available memory per node, as OSF1 does now on the Paragon. This problem would exacerbate the current scale-up difficulties.

Better compilers and pre-compilers will be needed to take full advantage of the clock cycles. Current experience with PIC and MF models on the T3D indicates that only 5 to 10 percent performance improvement results if no single-node optimization is performed on top of what the compiler delivers on its own. Careful loop rewrites and handling of real to integer truncations have resulted in an increase to 40 percent improvement. Similarly, tools to do runtime optimization ought to be provided. A trivial example is the ordering of the parallel outer loops over the number of modes and the highly vectorized inner loops over grid points on the C90, which could be automatically reversed should the number of grid points dominate over the number of modes.

Experience with both PIC and MF models of plasmas also shows that it is vital to have as many I/O nodes as compute nodes for these data-out intensive calculations. Much fewer I/O nodes, such as 8 I/O nodes for 128 compute nodes on the iPSC/860, created an I/O bottleneck when writing large analysis and restart data sets to disk. A petaflop machine has the potential of creating peta-sized files in number and in size. Automatic concatenation of the large number of large files thus produced at each write before writing to disk or permanent storage is also highly desirable instead of manual intervention. The subsequent ability to analyze and visualize these large data sets also presupposes the existence and support of machine-independent binary formats, like the NetCDF suite of routines, on the computing platform.

Algorithms

The modern PIC models used to simulate magnetically confined plasmas are gyrokinetic or low-noise tex2html_wrap_inline120 ones, in which the electrons are treated as guiding centers and the ion motion has been averaged over the period of gyration around the magnetic field. These PIC models use a 3-D finite difference grid over which the charge of the particles is accumulated and the plasma and external fields are represented. In the electrostatic case, the solution of Poisson's equation needed to calculate the electric force from the plasma charge density is performed by using 3-D fast Fourier transforms. The particles are advanced in time using second-order leapfrog and/or predictor-corrector methods. Domain decomposition of particle and grid quantities among processors is used on parallel machines.

The MF models of interest here use finite differences in the radial directions and Fourier-mode expansion in the poloidal and toroidal directions, or the short way and the long way around the torus in which the plasma is confined. The mode representation is tailored to the magnetic field, which winds its way around the torus. The linear terms in the time evolution equations for the plasma moments and the magnetic field are solved for implicitly and are second order in time, as are the nonlinear terms which are solved for explicitly in time. The second-order derivatives present in the equations result in a block tridiagonal matrix for the linear part, and the nonlinear terms involve mode convolutions which have so far been performed analytically instead of using Fourier transforms. The matrix inversion is performed in parallel over the number of modes, while the convolutions are parallelized over the number of grid points.

In summary, the critical algorithmic building blocks for PIC and MF plasma models are

Progress Estimates

Numerical computations are needed to accelerate the development of a reliable, clean, affordable, international power source using controlled thermonuclear fusion. For this development to be realizable, the required large-scale calculations need to be performed in real time. The success of this enterprise also requires real-time analysis and visualization of the results, an ability to steer the calculations as they proceed, and on-line comparisons between models and with available experimental data. It is projected that the realization of these goals will require at least 1petaflops of computational power, several petabytes of on-line and archival storage, and a non-negligible fraction of this 1 petaflop computational power to retrieve, represent, and assimilate the enormous quantities of data in a meaningful and timely way.

Beyond petaflops, one can envision the possibility of creating a virtual magnetic confinement device on the computer and of interacting with it as it proceeds from startup through flat top and on to termination of a full discharge in an integrated design and physics environment.

References

[1]
Viktor K. Decyk, University of California at Los Angeles, personal communication, 1995.