Guang Gao - Konstantin K. Likharev - Paul C. Messina - Thomas L. Sterling
A short-term research study will address the challenge of achieving near PetaOps scale performance within a decade through innovative systems that exhibit wide applicability and ease of use. The objective of the research is to determine the feasibility and detailed structure of a parallel architecture integrating the combined capabilities of semiconductor, superconductor, and optical technologies. This hybrid architecture approach is motivated by the realization that a mix of technologies may yield superior operational attributes to those possible based solely on semiconductor technology in the same time frame. An interdisciplinary team of collaborators has been assembled to conduct the required point-design study in order to devise a complete system structure and evaluate its performance characteristics using a small set of scientific application kernels.
The hybrid technology approach exploits critical opportunities enabled by key emerging devices. Specifically, computational performance can be dramatically improved through recent advances in Superconducting Rapid Single Flux Quantum logic which will make 100 GHz clock rates feasible in the next two years. Memory capacity may be drastically increased through a new memory hierarchy merging advanced semiconductor high density memory and future (admittedly more speculative) optical 3-D holographic storage capable of storing 0.1 Petabits or more. Interconnection bandwidth will be greatly enhanced by means of optical networks with 100 Gbps channels.
Ease of programming in a NUMA environment and simplicity of processor logic design with clock cycle times of less than 10 picoseconds are addressed by employing a multi-threaded processor architecture with rapid context switching between thread activations in a shared memory framework. This will provide user-transparent hardware latency management that can adapt to a broad range of memory access patterns with low overhead. A physical memory hierarchy (probably five levels) will provide balance between access demand rates and storage capacity of memory. All elements of the structure will be fully exposed in the system address space, with program-controlled variable-granularity block moves managed within the memory structure itself. Simple compound atomic operations for barriers will be performed in memory as well.
Applications with a range of characteristics will be employed in the study including the PPM code with uniform static data structure and local communications, an adaptive mesh FEM code with non-uniform and time varying data structure and local communications, a Tree-code for N-body gravitational simulation with non-uniform and time varying data structure and global communications, and a PIC code which mixes local and global communications with regular static and irregular dynamic data structures. Characteristics of these applications at the near petaflops scale will be derived from analysis presented at previous workshops and will be used to estimate computational demands on processor, memory, and communications resources.
While the general concepts, approach, and structure have been identified and analyzed, many details of the architecture have yet to be determined. These are the design parameters that specify the sizes of the principal components and the low level policies that manage the resources in support of parallel application execution. The primary consequence of the study will be to determine realistic and effective values for the design parameters and to evaluate the resulting performance that would be achieved by the resulting system design.
The point design study will be conducted to 1) characterize device properties, 2) develop a balanced structure of the different technology components and hybrid subsystems, 3) estimate demands on distributed resources under the test application kernels to determine critical weaknesses, 4) assess engineering feasibility, and 5) identify important focus areas for future research. The findings of this research will be documented and presented at The Second Petaflops Frontier (TPF-2) workshop and the 2nd Pasadena Workshop on Enabling Technologies for Petaflops Computing.
Recent advances in superconducting devices have motivated a reexamination of the potential of this technology as a basis for very high performance computing. In previous decades, logic devised from Josephson junctions was implemented much as transistor-based circuits were in CMOS. While this did yield high performance, with clock rates of a few GHz, the advantage was insufficient to outweigh the technology's other inconveniences and narrow range of use. In the last few years, a new superconducting technology has been developed that employs innovative mechanisms along with Josephson junctions to achieve binary storage and logical operations. Rapid Single Flux Quantum (RSFQ) logic uses quantized magnetic flux to store and manipulate information in an almost lossless manner and at speeds that exceed the original superconducting methods by more than an order of magnitude. As a consequence, the high performance computing community is offered the opportunity of exploiting a processor technology that will be capable of operating at a 100 GHz clock rate in a couple of years, in comparison to 1 GHz using CMOS ten years from now.
Ironically, the success of this emerging technology is also the source of many of its challenges, and not primarily due to the cooling requirements. Even using generally inefficient contemporary cryogenic equipment, cooling the superconducting processors for a Petaflops scale computer would require only about 10 kW of total power. The real problem is architectural in nature: what structure will permit effective use of processors with clock cycle times of less than 10 picoseconds, where a round trip signal can only cover a small fraction of the chip size in a single cycle? The opportunity and challenge of this technology must be considered now in sufficient detail to determine the feasibility of realistically achieving near-petaflops performance in five to seven years and to identify the class of research that must be undertaken now to ensure that the necessary devices are available at that time. With the apparent aggressiveness of other nations in pursuit of this technology and the limited active research currently sponsored in this country, US leadership in future high performance computing is at risk. This project addresses this important issue and requests sponsorship to undertake a point-design study of the architecture for a hybrid technology computer integrating superconducting processor logic, advanced semiconductor memory, and optical interconnect. A brief examination of 3-D holographic photorefractive technology for data storage will be carried out as well, but at less depth, to examine the possibility of significantly reducing the total amount of DRAM required.
The challenges to architecture resulting from the high clock rate and the anticipated modest device density (approximately 0.5 micron) require that the overlapping fine-grain parallel actions be decoupled from one another within the processor, between processors, and throughout the memory hierarchy. The approach is based on multi-threaded principles that enable independent instruction streams to be interleaved at each level with minimal overhead to eliminate performance degradation due to data hazards and control hazards. Although not yet widely employed in conventional microprocessor architectures, multithreading techniques are well understood and are the basis of at least one large commercial computer project. Multi-threaded organization provides runtime adaptive control of processor operations and memory accesses in response to variability of service times from different elements of the distributed computing system. Multi-threading provides a general latency management mechanism, permits a global memory address space, and enables a simple parallel programming model for end-users.
The primary objective of the point-design study is to evaluate the potential of superconductor technology in combination with semiconductor memory and optical networks for achieving near petaflops scale computation within the next decade. Evaluation of a hybrid technology approach for very high performance computing will quantify the relative benefits of mixed technology versus single technology structures. Information resulting from such an analysis will determine if throughput and response time requirements within the system can be achieved with different and appropriate technologies now or in the near future.
A second objective is to devise an architecture capable of harnessing the potential of superconducting logic based on multi-threaded architecture techniques. This class of architecture has been identified as particularly well suited to the operational characteristics of a distributed shared memory system running at extremely high clock rates with moderate density of component integration. An aspect of this work will be to examine the multi-threaded model in terms of the implied parallel programming paradigm and its suitability for representing the small set of test applications.
The architecture combines semiconductor, optical, and superconductor technologies in a single-system structure to achieve performance superior to that which could be attained in the same time frame and at comparable cost using semiconductor devices alone. The Hybrid Technology Multi-Threaded Architecture (HTMT) is a shared memory NUMA architecture employing superconducting processors and data buffers (liquid helium cooled), cryo-SRAM semiconductor buffers (liquid nitrogen cooled), semiconductor DRAM main memory, and (possibly) optical holographic storage. The system will be integrated by means of optical interconnects, including a sophisticated packet-switched network to the highly interleaved main memory.
The architecture to be studied is anticipated to be feasible within the next ten years. Estimates of semiconductor characteristics for the year 2007 are derived from the Semiconductor Industry Association's National Roadmap. Estimates of operational characteristics for the optical and superconductor devices are more conservative and are directly derived from specific research projects in the related areas, demonstrating what can be achieved with confidence within the next five years. This limitation on extrapolating performance figures for the high-risk technology components is intended to give the findings of this study high credibility and relevance to realistic near-term directions for high performance computing. The investigators consider this warranted because even the conservative performance estimates are dramatic and sufficient to meet the NSF stated objective of 100 TeraOps performance. Only the semiconductor memory density and the optical holographic storage require longer development times. In the first case, the required memory capacity can be obtained earlier at higher cost; and the optical storage, even if it proves infeasible, is not a critical pacing item.
A major challenge is to match the throughput at the interfaces between the various levels within the system. The processors are anticipated to operate at between 100 and 200 GHz with one to four instructions issued per cycle per processor. The system is likely to be sized at a thousand processors, which could yield a sustained performance in excess of 100 TeraOps. The total memory access demand rate would be on the order of $10^{14}$ requests per second or more for the class of applications to be studied. Even assuming high hit rates to local high-speed data buffers, the main memory may have to service a demand rate of some $10^{12}$ accesses per second, which will probably require 10,000 interleaved memory banks and a packet network that can accept and distribute that traffic density. Even at the top of the memory hierarchy, a superconducting level-0 buffer may be 10 cycles or more away from the register and pipeline just due to the physical displacement on the chip and the 10 picosecond or less cycle time. This study will develop a structure that satisfies this set of challenges and will provide scaling properties, at least statistically, with respect to key structural parameters.
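The throughput-matching argument above can be checked with a back-of-envelope calculation. The script below uses the processor count, clock rate, and bank count quoted in the text; the memory-reference fraction and buffer hit rate are illustrative assumptions, not design commitments.

```python
# Back-of-envelope balance check for the HTMT memory hierarchy.
# Parameters marked "assumed" are illustrative, not design commitments.

processors  = 1_000     # system size assumed in the study
clock_hz    = 100e9     # 100 GHz superconducting clock
issue_width = 1         # conservative: one instruction per cycle

peak_ops = processors * clock_hz * issue_width
print(f"peak rate: {peak_ops:.0e} ops/s")        # 1e14 -> 100 TeraOps

# Suppose a fraction of instructions reference memory and most hit
# in the fast local buffers; the residue falls on main memory.
mem_ref_fraction = 0.3      # assumed loads/stores per instruction
buffer_hit_rate  = 0.99     # assumed hit rate in the cryo buffers

main_mem_rate = peak_ops * mem_ref_fraction * (1 - buffer_hit_rate)
banks         = 10_000
per_bank_rate = main_mem_rate / banks
print(f"main-memory demand: {main_mem_rate:.1e} accesses/s")
print(f"per-bank demand:    {per_bank_rate:.1e} accesses/s")
```

Even with a 99% buffer hit rate, each of the 10,000 banks must sustain tens of millions of accesses per second, which is near the limit of projected DRAM cycle times and motivates the heavy interleaving described above.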
The multi-threaded execution model is adopted as the principal scheduling and latency management mechanism. Even so, the high ratio of register access time (one cycle) to main memory access time (10,000 cycles or more) requires that steps be taken to limit latency wherever possible. Memory-to-memory block moves managed within the memory itself, and an explicitly addressed memory hierarchy with small fast memory buffers at the top, are added to improve throughput and enhance effectiveness. There are open questions remaining about the operation and management of the system resources as well as the actual sizing of key components. These will be studied in detail during this work, and a complete system description (one that could be built and operated in the next few years) will be provided.
The remaining elements of this section highlight the key technologies and discuss the important concepts related to multi-threaded resource management. Most of these contributions are derived directly from research being conducted by the investigators.
Several features of superconductor integrated circuits make them uniquely suitable for processing digital information at extremely high speed and with very low power consumption. These features include:
These factors were responsible for the large volume of work in this field during the past three decades, notably the IBM project (1969-83) and the Japanese MITI project (1981-90). These projects did not bring us a practical superconductor digital technology, due to an unfortunate choice of circuitry, which concentrated on various ``latching'' logics. The designers of these circuits tried to mimic semiconductor transistor logics, coding digital bits by the presence or absence of DC voltage across Josephson junctions. Due to several dynamic problems, no prospects have been found for increasing the clock frequencies of latching-logic circuits beyond a few GHz. This speed is higher than, but comparable to, that of the fastest semiconductor digital circuits; this marginal advantage is probably insufficient to justify the helium cooling necessary for the operation of superconductor integrated circuits, and hence insufficient for the practical success of superconductor digital technology.
Recently, however, the field of superconductor digital electronics was revived by the advent of the alternative ``Rapid Single-Flux-Quantum'' (RSFQ) logics. These circuits use the unique, fundamental property of superconductors to quantize the magnetic flux through any closed loop: $\Phi = N\Phi_0$, where $N$ is an integer and $\Phi_0 = h/2e \approx 2.07 \times 10^{-15}$ Wb is the single flux quantum.
Circuits of the RSFQ family consist of ``elementary cells'' using overdamped Josephson junctions and superconducting loops. Physically, the cell stores and processes digital bits in the form of the number N of trapped single flux quanta. Functionally, each elementary cell may be considered as a natural combination of a logic gate with an output latch. The cells can pick up and release the data in the form of return-to-zero signals, namely, picosecond voltage pulses V(t) with quantized ``area'': $\int V(t)\,dt = \Phi_0 \approx 2.07$ mV$\cdot$ps.
The pulse duration $\tau$ is determined by the ratio of the capacitance $C$ of the Josephson junction to its critical current $I_c$ and, ultimately, by the superconductor energy gap $\Delta$. For present-day low-$T_c$ Josephson junction technologies, $\tau$ is of the order of a few picoseconds (Table 1).
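The quantized pulse area above fixes the scale of SFQ signals. The short check below uses the physical constants $h$ and $e$; the 2 ps pulse width is an assumed representative value from the text, used only to infer the corresponding mean amplitude.

```python
# Numerical check of the flux-quantum relations. The constants are
# physical (CODATA); the 2 ps pulse width is an assumed representative
# value, not a measured device parameter.

h = 6.62607015e-34      # Planck constant, J*s
e = 1.602176634e-19     # elementary charge, C

phi0 = h / (2 * e)      # single flux quantum, Wb (= V*s)
print(f"Phi_0 = {phi0:.3e} V*s")    # ~2.068e-15, i.e. ~2.07 mV*ps

# Quantized pulse area: the integral of V(t) over the pulse equals Phi_0,
# so for an assumed ~2 ps pulse the mean amplitude follows directly.
tau = 2e-12             # assumed pulse duration, s
v_mean = phi0 / tau
print(f"mean pulse amplitude ~ {v_mean * 1e3:.2f} mV")   # ~1.03 mV
```

A picosecond-scale pulse of roughly a millivolt is thus a direct consequence of flux quantization, independent of circuit details.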
The SFQ pulses may be passed between the cells ballistically, with a velocity approaching the speed of light, using passive superconducting microstrip lines. This property, combined with the hand-shaking-type distribution of the clock pulses, allows the ultrahigh speed of operation to be retained even in complex integrated circuits (see the bottom line in Table 1). Recent experiments using a 1.5-$\mu$m niobium-trilayer fabrication technology have demonstrated the operation of a simple RSFQ circuit (a frequency divider) at frequencies up to 370 GHz, in accordance with theoretically predicted scaling.
Besides the ultra-high speed, RSFQ technology offers other advantages which were missing in the old superconductor logics. These advantages include DC power supply, very small power consumption (ultimately $\sim 10^{-19}$ J/bit), and a natural self-timing capability which enables one to combine synchronous and asynchronous timing. The major drawback of this new digital technology is the necessity of chip refrigeration to 4-5 K, an obstacle for many potential users. Though simple RSFQ circuits using high-$T_c$ superconductors have been demonstrated at temperatures up to 65 K, the fabrication technology for Josephson junctions in these materials must mature substantially before more complex high-$T_c$ circuits become feasible. Nevertheless, there has recently been dramatic progress in closed-cycle refrigerator technology. There is every hope that reliable and compact refrigerators, if produced in any substantial volume, will cost no more than \$1,000 apiece.
During the past 2 to 3 years, there has been rapid growth in the integration scale of experimentally demonstrated RSFQ circuits, presently up to a few thousand Josephson junctions. Laboratory prototypes of the first practical single-chip RSFQ systems, including digital SQUIDs, A/D converters, and digital signal correlators should be demonstrated later this year. Design of more complex circuits and systems, including dense random access memories, digital switches, and general-purpose microprocessors, has been started. Presently there is little doubt that the RSFQ technology will be able to win at least some niches of the digital device market, mostly due to its unparalleled speed.
In contrast with digital signal processing, analog-to-digital conversion, and digital signal switching, a universal von-Neumann-type computer is probably the worst case for gaining speed performance from the RSFQ (or any other superfast) technology. The reason is that such a system must rely on frequent data exchange between the processor and memory, with the exchange rate limited by at least the time of signal propagation at the speed of light ($\sim$100 ps per 1-cm distance in the usual dielectric environment). Even if we accept that the average distance between the microprocessor and memory is 5 cm, this fact alone limits the processor-memory exchange rate to roughly 1 GHz using conventional architectural techniques. Thus, alternative resource management methods are required.
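The speed-of-light bound above can be made concrete with a two-line calculation. The propagation delay and the 5 cm distance are the illustrative figures from the text; the model assumes one outstanding request at a time, which is what conventional architectural techniques imply.

```python
# Sketch of the speed-of-light latency argument. Distances and delays
# are the illustrative figures from the text, not measured values.

prop_delay_per_cm = 100e-12   # ~100 ps per cm in a typical dielectric
distance_cm       = 5         # assumed processor-to-memory distance

round_trip = 2 * distance_cm * prop_delay_per_cm
max_sync_rate = 1 / round_trip   # one outstanding request at a time
print(f"round trip: {round_trip * 1e9:.1f} ns -> "
      f"{max_sync_rate / 1e9:.1f} G exchanges/s")
```

A 1 ns round trip caps a synchronous processor-memory exchange at about 10^9 operations per second, two orders of magnitude below the 100 GHz clock; this is the mismatch the multithreaded latency management is meant to hide.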
For very large-scale (say, ``PetaFLOPS'') computing, RSFQ circuits may have another decisive advantage over the semiconductor competition. An estimate of the power consumption of a petaflops computer implemented with today's semiconductor technology yields a value of approximately a billion watts. Repeating a similar crude estimate for RSFQ circuits, we may use one of two figures for energy consumption: either 0.5 $\mu$W/gate for the static power of present-day resistively-biased circuits (scaled down to 1.0-$\mu$m technology), or $\sim 10^{-19}$ J/bit for prospective complementary RSFQ circuits with negligible static dissipation. For the petaFLOPS system as a whole, in the first case we arrive at 5 W, while the second number gives five times less power. Both numbers refer to the power dissipation at liquid helium temperature; in order to get the total power consumption, they should be multiplied by a factor of $\sim$300, reflecting the efficiency of present-day helium closed-cycle refrigerators. Thus we arrive at a power of the order of 1 kW (which may be safely pulled out of a standard wall outlet).
This simplistic estimate may be criticized from several points of view, but we believe it gives the right idea about the big advantage of RSFQ technology in power consumption. As a by-product, an RSFQ petaFLOPS computer could be much simpler in package design: with the conservative number of 10 K gates per chip, the chip count might be as low as 1,000 (cf. 1,000,000 for semiconductors). However, in order to realize this advantage, special architectural solutions are necessary. First of all, they should circumvent the gap between the unparalleled intrinsic speed of RSFQ circuits (above 100 GHz for 1.0-$\mu$m technology) and much lower external bandwidths (hardly higher than 30 Gbps per channel for interchip traffic, even if special multichip modules are used, and 1 Gbps per channel for the processor-memory exchange, see above). Secondly, the architectures should take full advantage of the unusual informational structure of the elementary RSFQ cells, which are in effect finite-state machines and thus uniquely suitable for flexible combinations of synchronous and asynchronous computing. We are not aware of any serious work done in this important direction.
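The crude power estimate above can be reproduced numerically. The chip count, gates per chip, per-gate static power, and refrigerator efficiency factor are the figures quoted in the text, taken here as assumptions.

```python
# Rough reconstruction of the RSFQ static-power estimate. All figures
# are the assumed values quoted in the text, not measured data.

chips          = 1_000      # conservative package estimate
gates_per_chip = 10_000     # 10 K gates per chip
static_power   = 0.5e-6     # ~0.5 uW/gate, resistively-biased RSFQ

cryo_power = chips * gates_per_chip * static_power
wall_power = cryo_power * 300   # closed-cycle refrigerator inefficiency
print(f"dissipation at 4 K: {cryo_power:.1f} W")        # ~5 W
print(f"total wall power:   {wall_power / 1e3:.1f} kW") # ~1.5 kW
```

The ~1.5 kW wall-plug figure agrees with the "order of 1 kW" conclusion in the text, and contrasts sharply with the billion-watt estimate for an all-semiconductor petaflops machine.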
The interaction between the fast memory buffers and the main memory will be provided by an optical interconnection network. One important example, which will provide the basis for this study, is a network being developed by a team under the leadership of Dr. Coke Reed of IDA-CCS in cooperation with Princeton University and Interactics Inc. The data path of the Multiple Level Minimal Logic (MLML) network is purely optical and the switching elements are electro-optical. The number of ports and the internal node count are parameterized by two positive integers k and n. The MLML network employs multiple Omega networks in a 3-D structure. This network exhibits favorable routing properties under conditions of moderate to heavy loads and provides flow control for new packet insertions at the entry points using a distributed protocol, which guarantees deadlock-free operation. Its excellent scaling characteristics enable topologies involving large numbers of ports. Analysis of this network has shown that a million-port configuration will produce communication paths with an end-to-end latency of approximately 40 nanoseconds.
Conventional strategies for supporting high-performance parallel computing often face the problem of large software overheads for process switching, synchronization, interprocessor communication, and other operations for coordinating asynchronous actions. These large overheads become an obstacle to mapping applications effectively onto high-performance architectures where the compute power of hundreds or thousands of processors is essential.
Modern multithreaded architecture models have emerged as a promising approach to high-performance computing that supports the simultaneous exploitation of parallelism at all levels (e.g. fine, medium, and coarse grain), and an efficient and smooth integration of interprocessor communication/synchronization with computation. This approach derives its power from supporting multiple threads of control within one node of a multiprocessor machine, combining the advantages of computation models for fine-grain parallelism (including, but not restricted to, models as radical as pure dataflow) with those of the traditional sequential control flow model (a la von Neumann) and architectures with over 50 years of optimization experience.
With a petaflops computer architecture based on a hybrid technology such as the one addressed in this study, the latency problem between the fast processor and the rest of the system becomes even more challenging. As mentioned earlier, the gap between the one-cycle register access time and the main memory access time can be as large as a factor of 10,000.
A multithreaded program execution model will be studied in which code is divided into threads at different levels, with the initial focus at the fine-grain level.
Threads are not ordered sequentially, but are scheduled according to data and control dependences. In general, a thread is enabled (ready for execution) when all data required by the thread is available. Threads at the fine-grain level synchronize and communicate with each other using a set of thread primitives (for instance, see the references on the EARTH model). These primitives form a rich set of operations, including remote loads and stores, synchronization operations, block data transfers, etc., and are initiated by the threads themselves. One type of thread is called ``atomic'' in the sense that once such a thread is started, it runs to completion, and instructions within it are executed in sequential order. Control transfer instructions such as branches or function calls are allowed within threads. That is, our processor architecture can use a conventional instruction sequencing mechanism to execute a thread efficiently. In this research, we are not limited to atomic threads, and plan to investigate other types of threads, e.g. threads which can be suspended and resumed.
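The enabling rule described above can be sketched as a tiny scheduler: each thread carries a sync count of the inputs it still waits for, and becomes ready when the count reaches zero. The class and method names below are illustrative, not the EARTH API.

```python
# Minimal sketch of data-driven thread enabling: a thread is enqueued as
# ready once all of its inputs have been signaled. Names are illustrative,
# not the EARTH primitives.

from collections import deque

class Thread:
    def __init__(self, name, sync_count, body):
        self.name, self.sync_count, self.body = name, sync_count, body

class Scheduler:
    def __init__(self):
        self.ready = deque()
        self.trace = []
    def signal(self, thread):
        # A data item arrived for `thread`; enable it when all have arrived.
        thread.sync_count -= 1
        if thread.sync_count == 0:
            self.ready.append(thread)
    def run(self):
        while self.ready:
            t = self.ready.popleft()    # atomic thread: runs to completion
            self.trace.append(t.name)
            t.body(self)
        return self.trace

sched = Scheduler()
consumer = Thread("consume", 2, lambda s: None)   # waits on two producers
p1 = Thread("produce_a", 0, lambda s: s.signal(consumer))
p2 = Thread("produce_b", 0, lambda s: s.signal(consumer))
sched.ready.extend([p1, p2])            # initial threads have no inputs
order = sched.run()
print(order)    # ['produce_a', 'produce_b', 'consume']
```

The consumer runs only after both producers have signaled it, mirroring the rule that a thread is enabled when all its required data is available.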
Threads can be supported at the function/procedure invocation level. Under a multithreaded model, threads from different invocations of the same function/procedure may be active at the same time. Therefore, procedure frames cannot be placed on a stack as in conventional processors. As in other multithreaded models (including EARTH), the frames in our model are dynamically allocated from a heap. Each frame represents the instantiation of one function, and contains the local state of that instantiation, including local variables and sync slots. A function invocation may itself contain more than one thread. For an in-depth discussion of possible implementations, the reader is referred to the papers on the EARTH project.
There are several design issues and options which should be examined when evaluating the processor design to support a multithreaded program execution model. Below, we list a few of them.
One important issue is: should the processor design support interleaved threads? Interleaved thread execution requires a fundamental change from the conventional processor design and is intended to exploit greater amounts of fine-grain parallelism with less complexity than other processing element designs. Processing elements that support the interleaved execution of threads use instruction execution pipelines designed to achieve high resource utilization by sharing logic resources (floating point functional units, for example) among several threads. Each pipeline stage may perform work for a different thread on each clock tick. In this way, latency of operations in one thread need not lead to low utilization of pipeline stages. On the other hand, if no interleaved thread execution is to be supported, then the processor design can be quite similar to conventional processors for the most part.
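The utilization argument above can be illustrated with a toy cycle model: each thread alternates issuing an operation and waiting on a load. With one thread the pipeline idles during the load; with enough interleaved threads the issue stage stays busy. The latency and thread counts are arbitrary assumptions.

```python
# Toy model of interleaved thread execution. Each thread issues one
# operation and then waits `load_latency` cycles for a load result.
# Parameters are illustrative, not HTMT design values.

def utilization(num_threads, load_latency, cycles=1000):
    ready_at = [0] * num_threads   # cycle at which each thread may issue
    issued = 0
    for cycle in range(cycles):
        for t in range(num_threads):           # round-robin pick
            if ready_at[t] <= cycle:
                issued += 1
                ready_at[t] = cycle + load_latency  # next op waits on a load
                break
    return issued / cycles

print(f"1 thread  : {utilization(1, 10):.2f}")   # pipeline mostly stalled
print(f"10 threads: {utilization(10, 10):.2f}")  # pipeline fully utilized
```

With a 10-cycle load latency, one thread keeps the issue stage only 10% busy, while ten interleaved threads fill every cycle, which is the essence of latency hiding by multithreading.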
Another design issue is support for fast context switching. In multithreaded processors without interleaved instruction execution, the instruction execution mechanism is conventional and the novelty lies in the mechanism used to ``switch context''. The objective is to reduce the time a process must run to amortize the cost of switching control to a new process, thereby permitting a finer grain of process scheduling without incurring overwhelming overhead costs. The cost of context switching has several components. One is the saving and restoring of the register context. Another is more implicit: temporal or spatial locality may be disturbed.
A third design issue involves schemes used to implement synchronization including dataflow sequencing, the fork and join primitives, tags on memory words, and Futures. Each of these methods should be viewed in terms of how well it serves to implement the program execution model used to guide the system design.
A fourth design issue is the support of thread scheduling. One simple way to schedule threads in hardware is first-come, first-served management of the instruction queue. A design option is to introduce additional support for priority scheduling.
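The two scheduling options above can be contrasted with a small sketch: a first-come, first-served queue preserves arrival order, while a priority queue reorders ready threads. The interface is illustrative; real hardware would implement these policies in the ready-thread queue logic.

```python
# Sketch of the two hardware scheduling options: FIFO (first-come,
# first-served) versus priority ordering. Names are illustrative.

import heapq
from collections import deque

class FifoQueue:
    def __init__(self): self.q = deque()
    def enqueue(self, thread, priority=0): self.q.append(thread)
    def dequeue(self): return self.q.popleft()

class PriorityQueue:
    def __init__(self): self.q, self.n = [], 0
    def enqueue(self, thread, priority=0):
        # n breaks ties so equal-priority threads stay FIFO-ordered
        heapq.heappush(self.q, (-priority, self.n, thread)); self.n += 1
    def dequeue(self): return heapq.heappop(self.q)[2]

fifo, prio = FifoQueue(), PriorityQueue()
for name, p in [("t1", 0), ("t2", 5), ("t3", 1)]:
    fifo.enqueue(name, p); prio.enqueue(name, p)
fifo_order = [fifo.dequeue() for _ in range(3)]
prio_order = [prio.dequeue() for _ in range(3)]
print(fifo_order)   # ['t1', 't2', 't3']
print(prio_order)   # ['t2', 't3', 't1']
```

The FIFO policy needs only a simple queue, while priority scheduling requires comparison logic on every insertion, which is the hardware cost the design option must weigh.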
Finally, when processing elements are combined into a parallel computer, new issues arise concerning interprocessor communication and the treatment of distributed memory. The former concerns the smooth integration of external asynchronous events into the computation (e.g. interrupt-driven, polling, or combined; see our reference list). The latter concerns the memory models discussed in the next section.
Options for addressing these design issues in present-generation multithreaded processor architectures are well documented elsewhere (see our reference list). The challenge here is to evaluate these solutions in our point-design study of the hybrid technology multithreaded architecture, and to develop new ones where necessary.
The threaded model supports a globally shared memory on a NUMA architecture. The memory consistency model that has been most commonly used in past work is sequential consistency (SC), which requires the execution of a parallel program to appear as some interleaving of the memory operations on a sequential machine. To reduce the rigid constraints of the SC model, several relaxed consistency models have been proposed, notably weak consistency, release consistency (RC), etc. These models allow performance optimizations to be correctly applied, while guaranteeing that sequential consistency is retained for a specified class of programs. These models are referred to as SC-derived models. For the HTMT architecture, these SC-derived models may still be too restrictive and may generate excess cache-coherence traffic (if caching is employed) in the machine. Therefore, in addition to these models, other more relaxed memory models will be examined. One alternative with which the investigators have experience is the location consistency (LC) model (see Gao and Sarkar's paper in the reference list). A distinguishing feature of the LC model is that it does not rely on the assumption of memory coherence; the memory model only ensures that the order implied by the program semantics will be enforced. The LC model provides a very simple ``contract'' between software and hardware: only the partial order defined by the program needs to be obeyed, and there is no requirement for all processors to observe the same ordering of concurrent write operations.
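The distinction above can be illustrated with the classic two-processor litmus test. Under SC, every outcome must arise from some interleaving of the two programs, so the outcome (r1, r2) = (0, 0) is forbidden; a model like LC, which imposes no single global write order, additionally admits it. The enumeration below is a sketch of the SC side only, not a formal memory-model checker.

```python
# Litmus test: P1 runs {x = 1; r1 = y}, P2 runs {y = 1; r2 = x},
# with x and y initially 0. Under sequential consistency the outcome
# (r1, r2) = (0, 0) is impossible; weaker models such as LC admit it.

from itertools import permutations

p1 = [("w", "x"), ("r", "y", "r1")]
p2 = [("w", "y"), ("r", "x", "r2")]

def sc_outcomes():
    outcomes = set()
    ops = [("p1", i) for i in range(2)] + [("p2", i) for i in range(2)]
    for order in permutations(ops):
        # keep only interleavings that respect each program's own order
        if order.index(("p1", 0)) > order.index(("p1", 1)): continue
        if order.index(("p2", 0)) > order.index(("p2", 1)): continue
        mem, regs = {"x": 0, "y": 0}, {}
        for proc, i in order:
            op = (p1 if proc == "p1" else p2)[i]
            if op[0] == "w": mem[op[1]] = 1
            else: regs[op[2]] = mem[op[1]]
        outcomes.add((regs["r1"], regs["r2"]))
    return outcomes

print(sorted(sc_outcomes()))   # [(0, 1), (1, 0), (1, 1)] -- never (0, 0)
```

That the relaxed models permit the extra (0, 0) outcome is precisely what frees the hardware from enforcing a single global order of writes, reducing coherence traffic.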
Dynamic adaptivity for locality management and load balancing is vital to applications whose resource demands and parallelism evolve dynamically, and a multithreaded architecture may make such adaptivity feasible in practice.
Dynamic locality management is an interesting challenge and opportunity where a multithreaded program execution model may provide additional benefit. The basic idea is that a non-local memory access is detected at runtime, upon which a decision is made as to whether a data or thread migration is more profitable. Experience with EARTH shows that the compiler may be able to judiciously insert such runtime checks. The runtime system and architecture support will work together to make the check and decision efficient. In our study, we will evaluate the advantages and tradeoffs of such support.
Automatic load balancing is another essential support for applications in which a good task distribution cannot be determined statically at compile time. For such applications, our model provides an instruction with which the programmer can simply encapsulate a function invocation as a unit of scheduling--called a token--visible to the runtime load balancer. If a processor has no ready threads and its token queue is empty or below a certain watermark, a token request message is sent to its neighbor. The approach currently being pursued is a load-stealing-based balancing algorithm grounded in recent experience, but in this project other competitive algorithms will also be studied to determine an optimal approach for the HTMT environment.
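The token mechanism above can be sketched in a few lines: each processor keeps a queue of tokens (encapsulated function invocations), and when its queue drops below a watermark it requests a token from a neighbor. The class names, watermark value, and steal-from-the-far-end policy are illustrative assumptions.

```python
# Sketch of token-based load balancing: an under-loaded processor pulls
# a token (an encapsulated function invocation) from a neighbor. Names
# and policy details are illustrative, not the HTMT design.

from collections import deque

class Processor:
    def __init__(self, pid, tokens=()):
        self.pid = pid
        self.tokens = deque(tokens)
    def needs_work(self, watermark=1):
        # request work when the token queue falls below the watermark
        return len(self.tokens) < watermark
    def request_from(self, neighbor):
        # the neighbor surrenders a token from the far end of its queue
        if neighbor.tokens:
            self.tokens.append(neighbor.tokens.pop())

p0 = Processor(0, tokens=["f()", "g()", "h()"])   # busy processor
p1 = Processor(1)                                 # idle: empty queue
if p1.needs_work():
    p1.request_from(p0)
print(list(p0.tokens), list(p1.tokens))   # ['f()', 'g()'] ['h()']
```

Stealing from the far end of the victim's queue is a common heuristic (older tokens tend to represent larger subcomputations), but as the text notes, competing policies remain to be evaluated.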
In the context of petaflops computing architectures, communication management emerges as a key issue. Some parallel programming models proposed today require or encourage programmers to perform such communication management explicitly. This includes both locality management (minimizing communication by providing detailed data decomposition, allocation, and distribution) and latency management (such as prefetching in the face of long latency operations). This is not an easy task. The position of the investigators is that the proposed multithreaded program execution and architecture model should make the programming task much easier. Accordingly, a programming model and paradigm will be evaluated which matches well with the underlying multithreaded architecture.
In particular, a programming model will be studied that is similar (but not restricted) to a model recently developed within the EARTH multithreaded architecture project. The investigators have substantial experience with this model, having run more than 20 benchmark programs (see the reference list). The machine can be programmed at two levels. At the lower level, the language (e.g. EARTH Threaded-C) provides primitives for concurrent threads, e.g. fork-join, explicit synchronization between threads, block moves, threaded function invocation, and thread invocation subject to runtime dynamic load balancing. At the higher level, the language (e.g. EARTH-C) provides a programming model in which programmers can use more familiar constructs to specify parallelism, such as doall loops, par-begin/end, and shared memory primitives. It is hoped that users will be provided convenient constructs to express parallelism at a coarser grain level without being required to specify data partitioning in detail. Support will also be provided to guide users in writing analyzable programs with locality annotations (for advanced users), such that a reasonable parallelizing compiler can be expected to generate efficient threaded code for the machine. The programming environment will be evaluated with the proposed applications.
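The two-level idea above can be sketched briefly: a low level with explicit thread spawning and an explicit join, and a higher level where a doall loop hides those details. Plain Python threads stand in for the EARTH Threaded-C primitives; all names are illustrative.

```python
# Sketch of the two-level programming model. Python threads stand in
# for Threaded-C primitives; names are illustrative, not the EARTH API.

import threading

# --- lower level: explicit fork of threads and an explicit join ---
def fork_join(bodies):
    threads = [threading.Thread(target=b) for b in bodies]
    for t in threads: t.start()
    for t in threads: t.join()      # explicit synchronization barrier

# --- higher level: a doall loop built on the primitive below it ---
def doall(func, items):
    results = [None] * len(items)
    def body(i):
        return lambda: results.__setitem__(i, func(items[i]))
    fork_join([body(i) for i in range(len(items))])
    return results

print(doall(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]
```

The higher-level doall expresses the parallelism of the loop without any visible thread management or data partitioning, which is the ease-of-use goal stated above; the lower level remains available for codes needing explicit control.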