DRAFT Findings
Petaflops Architecture Workshop
Oxnard, California
April 22-25, 1996
by: Dr. Thomas Sterling, Senior Staff Scientist
Center of Excellence in Space Data and Information Sciences
NASA Goddard Space Flight Center
FINDINGS
The workshop on Petaflops Architecture was rich in understanding and findings related to the advance of very high performance computing towards the goal of achieving practical and useful systems capable of sustaining petaflops level performance. This chapter presents many of these findings in two groups: the seminal findings drawn from all aspects of the workshop deliberations, and a much larger list of supporting detailed findings organized by subdomain of architecture, technology, applications/algorithms, and system software.
MAJOR FINDINGS
ACCELERATED PATH TO PETAFLOPS:
Alternative approaches to petaflops architecture may yield petaflops capable systems significantly sooner than the natural evolution of computing systems limited to COTS technology.
LATENCY:
Latency management is critical to effective computing at petaflops scale and may require aggressive techniques in architecture, software, and algorithms. Low latency is achieved only when enough fast memory is located near the processors.
PARALLELISM:
Parallelism for petaflops computing, both for performance and efficiency, will require between 100,000 and 1,000,000 sustained concurrent active threads, and will require new scalable algorithms to expose sufficient parallelism as well as architecture and system software to manage it.
BANDWIDTH:
The primary obstacle to efficient petaflops execution is providing sufficient bandwidth across the system between processors and memory. Advances in hierarchical structures of processors and memory including PIM, advanced interconnection technologies and topologies, and possibly on-the-fly data compression may all be required.
SUPERCONDUCTING RSFQ TECHNOLOGY:
Superconducting technology employing advanced RSFQ devices provides an important opportunity for implementation of petaflops processors at very high clock rates (> 100 GHz) and extremely low power consumption.
SPD - 1ST PETAFLOPS COMPUTER:
The first petaflops computer is likely to be a special purpose device designed with physical data paths and processing logic configured to match the requirements of kernel computations for a single algorithm. This could occur within 5 years and cost a modest $10 M.
PROCESSOR GRANULARITY:
Many small processors will provide greater performance at lower power consumption than fewer large processors for the same chip area, and if integrated with memory on the same die, such as PIM, can exploit memory bandwidth intrinsic to DRAM row buffers. However, effects of Amdahl's Law favor fewer high speed processors versus many more low speed processors.
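The tension noted above between many small processors and the effects of Amdahl's Law can be illustrated with a short sketch. The serial fractions and processor counts used here are illustrative assumptions, not workshop figures, and the function name `amdahl_speedup` is hypothetical.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's Law for a fixed-size problem.

    serial_fraction: portion of the work that cannot be parallelized (0..1).
    processors: number of identical processors applied to the parallel part.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Illustrative values only (assumptions, not taken from the workshop).
for serial_fraction in (1e-3, 1e-5, 1e-7):
    for processors in (10_000, 1_000_000):
        speedup = amdahl_speedup(serial_fraction, processors)
        print(f"serial={serial_fraction:g} procs={processors:>9,} "
              f"speedup={speedup:,.0f}")
```

Because the speedup can never exceed 1 / serial_fraction, a serial fraction of one part in a thousand caps the speedup below 1,000 no matter how many small processors are added. This is why the serial portion of the work favors a few fast processors.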
MEMORY:
Memory based on DRAM technology will dominate the cost of a petaflops machine for many architecture concepts and the processor-to-memory ratio is a critical parameter of system design. Optical photo-refractive holographic storage technology and out-of-core approaches as well as emphasis on small working data set applications may significantly reduce system cost and make petaflops machines available sooner than would otherwise be possible. Distributed shared memory structures will ease programming and system software design but may have limits in effectiveness of hiding latency and delivering bandwidth.
OPTICAL INTERCONNECT:
Optical networks have wide applicability across classes of architecture and will be essential for high bandwidth and constrained latency. In the case of free-space optics, they also reduce interconnect complexity.
ARCHITECTURE CONCEPT CATEGORIES:
Among the diverse architecture concepts explored, the conceptual advances could be discriminated by categories of 1. advanced processor design, 2. embedded aggressive latency management techniques in hardware and software, and 3. system interconnection structures and methods.
ARCHETYPAL ARCHITECTURES:
Four major architecture types were identified although many variants or even combinations of each are possible.
1. COTS-based multiprocessors to exploit broad market investment,
2. Processor-In-Memory for low power and high memory bandwidth possibly with COMA,
3. Hybrid-technology multithreaded NUMA with superconducting processors and a deep memory hierarchy using semiconductor and optical memory with optical network,
4. Special purpose devices possibly incorporating reconfigurable logic.
ALGORITHMS:
A new class of computational algorithms will be required to enhance application parallelism by orders-of-magnitude and to decouple computation from communication for latency tolerant execution. These algorithms will be complicated by a move to meta-applications combining multiple disjoint kernels with distinct execution profiles.
SYSTEM SOFTWARE:
Although there is a basic level of software required to make a petaflops computer usable, there appear to be no new system software challenges that are unique to petaflops level computing. Nonetheless, the degree of scaling of critical system parameters such as parallelism and latency will significantly aggravate the responsibilities of system software and will require innovative solutions to compile time and runtime management of parallel activities, data sets, and physical resources.
BUSINESS MODEL:
An approach to developing and sustaining a petaflops computer community including joint industrial and academic R&D and hardware and software manufacturing and support on an economically sound basis is crucial to the realizability of useful petaflops computing. A new business model is required redefining the relationship and roles of all participating entities.
SUPPORTING FINDINGS
ARCHITECTURE
1. The workshop was focussed in large part on the NSF Point-Design-Study proposed architectures which identified the following architecture concepts as important directions to petaflops computing research ordered from inside engine out:
- superconducting processor logic
- direct mapping of physical dataflow paths to algorithm kernels
- exposing intrinsic memory structures to logic
- compiler/library visible datapath resident reconfigurable logic
- parallel speculative execution support for runtime address disambiguation
- multithreading
- aggressive prefetching into memory hierarchy
- COMA
- ensemble of non-uniform granularity processors to beat Amdahl's Law
- hierarchical topologies with varying bandwidth to match algorithm communications
- free space full interconnection with optics
2. The major issues identified as defining the design space of a petaflops computer are a. the processing engine, b. the memory structure, c. the implementation technology, d. the programming and execution paradigm, and e. the aggregate operational requirements. The architecture examples were partitioned as emphasizing the interconnection method, the processing engine, or the latency management strategy.
3. The Processor-In-Memory (PIM) architecture concept provides superior performance relative to power consumption and die area. By exploiting direct access to DRAM row buffers and multiple processors per chip, much higher processor-to-memory and processor-to-processor bandwidth can be achieved than with conventional structures. An interesting CMOS design point is merging COMA with PIM with aggressive prefetching. While permitting systems to be implemented with a single chip type, PIM nonetheless fixes the memory to performance ratio and relies on the availability of very large amounts of parallelism.
4. With a possible 4 orders-of-magnitude ratio of register to DRAM access times, aggressive latency management techniques are critical to effective use of computing and memory resources of petaflops computing systems. Enhanced approaches to prefetching, cache coherence, multi-threading, and locality affinity will be essential aspects of petaflops architecture in combination with algorithm and system software techniques. Much of the application parallelism will be employed to hide latency through these techniques. System structures may be necessary that provide enough memory near the processing resources to limit the latency of the majority of accesses, possibly increasing system cost.
5. Memory system structure and usage is a major driving factor of performance, efficiency, and cost and is a critical pacing item to practical and affordable petaflops systems. Essentially all proposed systems assume a virtual memory relationship between physical and abstract storage but demand paging was not required. Memory capacity dominates system cost if CMOS DRAM is assumed. Memory bandwidth requirements imply aggressive memory hierarchy which may be limited in effectiveness. Wherever possible, internal memory bandwidth must be exposed and exploited. Similarly, distributed shared memory using advanced cache coherence protocols may not be feasible using conventional methods. Direct programmer access to memory hierarchy control may enhance algorithmic efficiency as would programmable cache management policies. On-the-fly data compression may reduce requirements for memory bandwidth and capacity in certain cases. PIM technology may benefit from pushing COMA deep into the memory hierarchy for global shared memory.
6. Mixing multiple technology types may provide a better operating point in performance and cost than strict semiconductor system design. An architecture employing superconducting processor logic, optical communications, holographic storage, and CMOS DRAM may achieve a petaflops in an earlier time frame at substantially lower power consumption and cost. The high clock rate of such a system may require intra-processor multi-threaded task scheduling to simplify logic complexity by eliminating pipeline interlocks as well as hiding latency.
7. Special Purpose Devices (SPD), possibly employing reconfigurable logic, match data paths and logic elements directly to an algorithm's data flow and operation requirements. SPDs appear to provide a performance to cost advantage of 2 to 3 orders-of-magnitude compared to more general purpose architectures, making petaflops computing feasible within 5 years for specific algorithms. Reconfigurable logic may provide some flexibility in the algorithm satisfied by the hardware structure. However, standard techniques for devising such organizations have yet to be developed. One possibility for a more general system is a heterogeneous framework of SPDs and reconfigurable logic managed by a general purpose control computer.
8. With the demands for resource usage dependent on algorithm structure, no single set of resource management policies is likely to meet all requirements. Therefore, future architectures out of necessity will expose the low level execution model to the programmer and compiler for direct intervention and control. This requires that even hardware support mechanisms and all levels of the memory hierarchy be represented in the low level system name space.
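Finding 4 above notes that much of the application parallelism will be spent hiding latency. A minimal sketch of that relationship follows. It assumes a simple multithreading model in which each thread computes for a fixed number of cycles and then waits a fixed latency for memory. The model, the function name `threads_to_hide_latency`, and the numeric values are illustrative assumptions, not workshop results.

```python
import math


def threads_to_hide_latency(run_cycles: int, latency_cycles: int,
                            switch_cycles: int = 0) -> int:
    """Threads per processor needed to keep it busy in a simple model.

    Each thread runs for run_cycles, then stalls for latency_cycles on a
    memory access; switching threads costs switch_cycles. The processor
    stays busy once the other threads' work covers one thread's stall:
    (n - 1) * (run_cycles + switch_cycles) >= latency_cycles.
    """
    if run_cycles < 1:
        raise ValueError("run_cycles must be at least 1")
    if latency_cycles < 0 or switch_cycles < 0:
        raise ValueError("cycle counts must be non-negative")
    return 1 + math.ceil(latency_cycles / (run_cycles + switch_cycles))


# Assumed example: a memory latency of 10,000 cycles, i.e. about four
# orders of magnitude slower than a one-cycle register access.
for run_cycles in (10, 100, 1000):
    needed = threads_to_hide_latency(run_cycles, 10_000)
    print(f"run={run_cycles:>5} cycles -> {needed:,} threads per processor")
```

Under these assumptions, the number of threads needed per processor falls as each thread does more work between memory accesses, which is one reason locality and latency-tolerant algorithms reduce the parallelism a system must sustain.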
TECHNOLOGY
1. New advances in superconducting technology, specifically in the area of rapid single flux quantum (RSFQ) gates, are an important development that can provide a two orders-of-magnitude improvement in clock rate over competing semiconductor technologies along with extremely low power consumption. Processor clock rates better than 100 GHz are achievable in the next two to three years, with the total power consumption within the cryo-environment less than 50 watts for a petaflops.
2. Optical communications technology will be crucial to meet bandwidth requirements of a petaflops computer. Latency for a fully loaded million-way network can be under 40 nanoseconds. Free space optics give reduced packaging and high bandwidth for a large number of fully connected components, and provide fair treatment of message passing using readily available technology.
3. Emerging optical 3-D photorefractive holographic storage technology may be crucial to reducing memory cost, size, and power consumption for a petaflops scale computer. This technology offers an order-of-magnitude improvement in storage density, power, and cost over anticipated CMOS DRAM devices. However, while bandwidth is comparable, access latency will be an order of magnitude longer, requiring new architecture solutions to this level of memory hierarchy.
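The figures quoted in finding 1 above (a 100 GHz clock and less than 50 watts within the cryo-environment) imply some simple ratios. The sketch below works them out under two simplifying assumptions that are not in the findings: one floating-point operation per processor per clock cycle, and "a petaflops" taken as 10^15 operations per second. The variable names are illustrative.

```python
PETAFLOPS = 10**15           # operations per second
CLOCK_HZ = 100 * 10**9       # 100 GHz clock rate quoted in finding 1
CRYO_POWER_WATTS = 50        # upper bound on cryo-environment power (finding 1)
OPS_PER_CYCLE = 1            # assumption: one operation per processor per cycle

# Number of processors that must run concurrently to reach a petaflops.
processors_needed = PETAFLOPS // (CLOCK_HZ * OPS_PER_CYCLE)

# Energy dissipated inside the cryo-environment per operation (upper bound).
energy_per_op_joules = CRYO_POWER_WATTS / PETAFLOPS

print(f"processors needed: {processors_needed:,}")
print(f"energy per operation: {energy_per_op_joules * 1e15:.0f} femtojoules")
```

The power figure covers only what is dissipated within the cryo-environment, as stated in the finding. It does not include the power drawn by refrigeration or by the rest of the system.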
APPLICATIONS/ALGORITHMS
1. A wide array of problems of scientific and engineering importance requiring petaflops scale computing has been identified. However, the case has not been made that applications of this scale are so compelling as to warrant the economic commitment to its development. A more complete picture of the potential impact of this projected technology must be formulated.
2. New scalable algorithms will be required that expose a high degree of parallelism on the order of 100 K concurrent threads to keep parallel resources busy and for latency hiding.
3. New algorithmic techniques will be necessary for latency tolerant applications.
4. Application kernels may be appropriate for domain specific architecture solutions, in which case petaflops capability may be feasible in as little as five years.
5. To facilitate analysis of competing approaches to petaflops architecture, a small set of simple but typical calculations has been identified, the Peta-Kernels, including:
a. array transpose
b. strided large vector storage and access
c. irregular gather and scatter
d. nearest neighbor (local) communication
e. representation and processing of tree structured data
6. Need broader applications community participation to quantify merits of various architecture designs.
7. Need abstract model of machine as target of algorithm design.
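The Peta-Kernels listed in finding 5 above can be made concrete with small sequential reference versions. The sketch below is only meant to pin down what each kernel computes. The function names, data layouts, and the choice of a three-point average and a binary tree sum are illustrative assumptions, not definitions from the workshop, and nothing here is parallel or tuned.

```python
from typing import List, Optional, Sequence


def transpose(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """a. Array transpose: rows become columns."""
    return [list(column) for column in zip(*matrix)]


def strided_read(vector: Sequence[float], start: int, stride: int) -> List[float]:
    """b. Strided access: every stride-th element beginning at start."""
    return list(vector[start::stride])


def gather(source: Sequence[float], indices: Sequence[int]) -> List[float]:
    """c. Irregular gather: read source at arbitrary indices."""
    return [source[i] for i in indices]


def scatter(values: Sequence[float], indices: Sequence[int],
            size: int) -> List[float]:
    """c. Irregular scatter: write values to arbitrary slots of a zeroed vector."""
    result = [0.0] * size
    for value, i in zip(values, indices):
        result[i] = value
    return result


def neighbor_average(vector: Sequence[float]) -> List[float]:
    """d. Nearest neighbor: average each interior point with its two neighbors."""
    out = list(vector)
    for i in range(1, len(vector) - 1):
        out[i] = (vector[i - 1] + vector[i] + vector[i + 1]) / 3.0
    return out


class Node:
    """e. Tree-structured data: a binary tree node holding one value."""

    def __init__(self, value: float, left: "Optional[Node]" = None,
                 right: "Optional[Node]" = None) -> None:
        self.value = value
        self.left = left
        self.right = right


def tree_sum(node: Optional[Node]) -> float:
    """e. Tree processing: sum all values by recursive traversal."""
    if node is None:
        return 0.0
    return node.value + tree_sum(node.left) + tree_sum(node.right)
```

Each kernel stresses a different part of a memory system: transpose and strided access defeat unit-stride locality, gather and scatter produce unpredictable addresses, the neighbor average needs only local communication, and the tree traversal follows pointers.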
SYSTEM SOFTWARE
1. Although there is a basic level of software required to make a petaflops computer usable, there appear to be no new system software challenges that are unique to petaflops level computing. Nonetheless, the degree of scaling of critical system parameters such as parallelism and latency will significantly aggravate the responsibilities of system software.
2. A closer integration of architecture with its support system software will be required for petaflops computing. A good architecture will enable the system software to focus on the real problems and will include features that simplify software solutions.
3. System software will be responsible for exposing and managing the many orders-of-magnitude greater parallelism likely to be required for effective operation of a petaflops computer.
4. Latency management, including maximizing locality, will be crucial to petaflops systems exhibiting a register-to-memory access time ratio of close to 6 orders-of-magnitude. The challenge will be increased by the deep and complex bandwidth hierarchy intrinsic to many proposed petaflops architectures. Support will include advanced latency management visualization and performance monitoring tools.
5. Operating system scaling issues will impose serious performance constraints especially related to dynamic task scheduling, virtual memory management (demand paging appears less important), and checkpointing along with other fault tolerance features.
GENERAL
1. While previous assessments for the prospect of petaflops computers using commodity components estimated the date of availability at 2014, using alternative technologies, architectures, and software methods may make such performance, at least to a limited degree, available by 2007 or earlier.
2. Practical considerations that make petaflops computer systems affordable, usable, and reliable are crucial if this emerging technology is to have any substantive value. The key challenges are cost-performance tradeoffs, power consumption, and industry vendor support.
3. The point design process sponsored by the NSF was enormously important in moving the exploration of future petaflops computing from the domain of general discussions to specific details exposing the depth of technical issues. More information will be required from these studies, and other such studies in the areas of applications, algorithms, and device technology may yield equivalent value.
4. An approach to developing and sustaining a petaflops computer community including joint industrial and academic R&D and hardware and software manufacturing and support on an economically sound basis is crucial to the realizability of useful petaflops computing. A new business model is required redefining the relationship and roles of all participating entities.
lpicha:5/30/96