Using semiconductor electronics alone is not the best path to high-performance computing.
So say leading researchers who are designing a supercomputer 1,000 times faster than any available today. Their Hybrid Technology MultiThreaded (HTMT) architecture is a mix of emerging technologies that they believe can shave 10 years off the race to petaflops (one million billion floating-point operations per second) and reach that mark by 2007.
NASA, the Defense Advanced Research Projects Agency and the National Security Agency are funding the exploration to support mission-critical areas ranging from simulating Earth's climate system to breaking communication codes of rogue nations.
While seemingly radical, "hybrid technology, the partnership between architecture and technologies, is nothing new," said HTMT principal investigator Thomas Sterling, senior research scientist at NASA's Jet Propulsion Laboratory (JPL). In the 1940s, "the first computers were made of vacuum tubes for processing, pulse transformers for communication and mercury delay lines for storage."
The HTMT architecture calls for helium-cooled superconducting processors, memory chips with onboard processing capabilities, an optical communications network and holographic storage.
For the two-year period from 1997 to 1999, "we are funded for a combination of detailed design and simulation studies and experiments for characterizing the capabilities of key devices," Sterling said. The research, which is managed by JPL's Larry Bergman, spans nearly 20 laboratories, universities and companies.
Losing resistance in extreme cold
HTMT's processors are superconductor chips manufactured primarily of the metal niobium. "The unique property of niobium is that when cooled to the temperature of liquid helium (minus 452 degrees Fahrenheit), it becomes superconductive, which means it offers no resistance to electric current," said Dmitry Zinoviev, research scientist in physics, State University of New York at Stony Brook.
Within a chip, twin loops of niobium are connected to each side of a layer of aluminum oxide fashioned into a Josephson Junction. "The electrical current can potentially flow forever" through the loops, Zinoviev said. "The Josephson Junction is similar to a turnstile door; when you push on it, it opens and lets you go from one loop to another."
Stony Brook's Konstantin Likharev co-invented a process in which the magnetic properties of the currents flowing through each loop are translated into the ones and zeros a computer needs to operate. "We keep a bit of data as a current circulating in a loop," Zinoviev explained.
These rapid single-flux quantum (RSFQ) circuits are 100 times faster than semiconductors and use far less power because they offer no electrical resistance. The HTMT target for 2007 is RSFQ processors running at a clock rate of at least 100 gigahertz (100 billion cycles per second), each capable of 400 gigaflops. Total power requirements are similar to those of current supercomputers: 500 kilowatts, including helium refrigeration.
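As a rough check on these targets, the short Python sketch below works out how many RSFQ processors a petaflops machine would need. It assumes perfect scaling with no overhead, which the article does not claim, and the variable names are purely illustrative.

```python
# Back-of-envelope sizing from the 2007 targets quoted above.
PETAFLOPS = 10**15                   # one million billion operations per second
FLOPS_PER_PROCESSOR = 400 * 10**9    # 400 gigaflops per RSFQ processor
CLOCK_HZ = 100 * 10**9               # 100 gigahertz clock

# Assumption: performance scales perfectly with processor count.
processors_needed = PETAFLOPS // FLOPS_PER_PROCESSOR

# Operations each processor must complete per clock cycle to hit its target.
flops_per_cycle = FLOPS_PER_PROCESSOR // CLOCK_HZ

print(f"RSFQ processors needed: {processors_needed}")
print(f"Floating-point operations per cycle per processor: {flops_per_cycle}")
```

Under these assumptions, each processor has to complete several floating-point operations every clock cycle, which is one reason the design leans on many simultaneous instruction streams.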
A 100-gigahertz speed also presents a challenge. At that clock rate, an electrical signal travels one millimeter in 10 trillionths of a second. "If your chip is two centimeters by two centimeters, the distance is 20 times more than the distance between two processor stages that can be synchronized with your clock," said Mikhail Dorojevets, assistant professor of electrical and computer engineering at Stony Brook.
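The arithmetic behind Dorojevets's remark can be sketched in a few lines of Python. The one-millimeter-per-cycle figure and the two-centimeter chip edge come from the text; the implied signal speed is derived here and is not a measured value.

```python
# Clock period and signal reach at the HTMT target clock rate.
CLOCK_HZ = 100e9             # 100 gigahertz
REACH_MM_PER_CYCLE = 1.0     # distance a signal travels in one cycle (from the text)
CHIP_EDGE_MM = 20.0          # a chip two centimeters on a side

period_ps = 1 / CLOCK_HZ * 1e12                      # clock period in picoseconds
implied_speed_m_per_s = (REACH_MM_PER_CYCLE / 1000) / (1 / CLOCK_HZ)
cycles_to_cross = CHIP_EDGE_MM / REACH_MM_PER_CYCLE  # clock cycles to span the chip

print(f"Clock period: {period_ps:.1f} picoseconds")
print(f"Implied signal speed: {implied_speed_m_per_s:.2e} meters per second")
print(f"Cycles for a signal to cross the chip: {cycles_to_cross:.0f}")
```

A signal therefore needs about 20 clock cycles to cross the chip, so distant parts of the chip cannot be synchronized within a single cycle.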
The brainchild of Tera Computer Company's Burton Smith, the MultiThreaded technique manages the delay, or latency. "A thread is a set of tasks, a sequence of actions," Sterling said. "We want to keep our resources constantly working on tasks."
Most supercomputer processors keep busy with one thread per clock cycle. "We must have several instruction streams and have all of them running simultaneously," stressed Dorojevets, who developed parallel multithreading for the Soviet Union's MARS-M supercomputer. "We will have 128 instruction streams for each RSFQ processor."
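Why many instruction streams help can be illustrated with a toy model in Python. This is not the HTMT scheduler: it assumes that every stream stalls for the same fixed number of cycles after issuing an instruction and that the processor issues at most one instruction per cycle. The function name and parameters are hypothetical.

```python
from collections import deque


def utilization(streams, latency, cycles=10_000):
    """Fraction of cycles in which the processor issues an instruction.

    Toy model: after a stream issues, it stalls for `latency` cycles.
    Because all stalls are equal, streams become ready in FIFO order.
    """
    ready_at = deque([0] * streams)  # cycle at which each stream may issue again
    busy = 0
    for cycle in range(cycles):
        if ready_at[0] <= cycle:     # the oldest stream is ready
            ready_at.popleft()
            ready_at.append(cycle + 1 + latency)
            busy += 1
    return busy / cycles


# One stream with a 3-cycle stall keeps the processor busy a quarter of the time.
print(utilization(streams=1, latency=3))
# Four streams are enough to cover the same stall completely.
print(utilization(streams=4, latency=3))
# With 128 streams, stalls of up to 127 cycles are fully hidden in this model.
print(utilization(streams=128, latency=127))
```

In this simplified picture the processor stays fully busy as long as the number of streams exceeds the stall length, which is the intuition behind running many streams per processor.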
Intertwining memory with processing
With no high-density superconductor memory on the horizon, "you must use normal semiconductor memory," Dorojevets said. "The problem is, the difference in speed between semiconductor memory and an RSFQ processor is very big," a factor of 10,000.
A bridge across this huge latency is being implemented in system software by University of Delaware professor Guang Gao. The HTMT solution includes two kinds of processor-in-memory (PIM) chips, which can perform operations not normally associated with memory. This second multithreading layer guarantees that "you only have to reach nearby memory that is fast," Sterling said. "The memory system can make sure that information the processor needs is very close to the processor."
Abutting the RSFQ processors, then, is high-speed static random access memory (SRAM). "It takes sets of data and makes them available to the superconducting processors," said the University of Notre Dame's Peter Kogge, McCourtney professor of computer science and engineering. "As soon as the processors are finished with one set, the next is ready to go."
The main workhorse is the slower but more dense dynamic random access memory (DRAM). Its primary function is supporting the SRAM. "It acts as a funnel between the full data set and the pieces entering the superconductors," Kogge said. The DRAM PIMs also execute simple mathematical operations on the entire data set and manage input/output to the HTMT system.
Kogge said that a petaflops computer requires 32 trillion bytes of DRAM. Based on his projections, this translates into a need for 120,000 chips. Determining SRAM requirements is more difficult, as they depend on how much data the RSFQ processors will need at any one time.
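Dividing the two figures Kogge gives yields the implied capacity of each DRAM chip. The Python sketch below does only that division; it assumes the memory is spread evenly across chips and uses decimal units, neither of which the article specifies.

```python
# Implied per-chip DRAM capacity from Kogge's projections.
TOTAL_DRAM_BYTES = 32 * 10**12   # 32 trillion bytes
DRAM_CHIPS = 120_000             # projected chip count

# Assumption: memory is divided evenly among the chips.
bytes_per_chip = TOTAL_DRAM_BYTES / DRAM_CHIPS
megabytes_per_chip = bytes_per_chip / 1e6        # decimal megabytes
gigabits_per_chip = bytes_per_chip * 8 / 1e9     # decimal gigabits

print(f"Per chip: about {megabytes_per_chip:.0f} megabytes "
      f"({gigabits_per_chip:.1f} gigabits)")
```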
Optical communication and storage
Moving HTMT data quickly enough among the PIM chips (only a billionth of a second of delay is tolerable) is delegated to an optical communications network. "You cannot do it electronically," said Keren Bergman, assistant electrical engineering professor at Princeton University. "You would have millions, if not billions, of wires. And the heat would be impossible."
One of the more successful electronic networks is a street-like grid of router nodes folded into a doughnut shape. Because streets have intersections, they need traffic lights, or buffers in network parlance. "When you have a red light, cars start to pile up," Bergman said.
Designed by Coke Reed of the Institute for Defense Analyses, the optical Data Vortex network is a one-way whirlpool that dispenses with buffers. "That is completely unheard of!" Bergman said. "Our network acts like a multilane highway with no traffic lights."
At each network input node, a small laser generates ultrashort optical pulses that can travel at the speed of light. The electronic data are encoded onto the pulses and arranged as packets, which enter and leave the node through one of two ports on either side.
The high-speed optical network keeps packet routing as simple as possible. "The data has header information that tells the node to switch to one of the two output ports," said Princeton researcher Mark Arend.
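The idea of a header steering a packet through a series of two-way choices can be sketched in Python. This is an illustration of header-bit routing in general, not the Data Vortex topology itself: it assumes each stage reads one bit of a destination address and picks output port 0 or 1. The function names are hypothetical.

```python
def route(destination, stages):
    """Return the output port (0 or 1) chosen at each stage.

    Assumption: the header holds the destination address and stage i
    examines bit i, most significant bit first.
    """
    return [(destination >> (stages - 1 - i)) & 1 for i in range(stages)]


def follow(ports):
    """Trace the port choices to the output the packet finally reaches."""
    position = 0
    for port in ports:
        position = position * 2 + port
    return position


# With three stages there are eight possible outputs.
ports = route(destination=5, stages=3)
print(ports)          # port chosen at each of the three stages
print(follow(ports))  # output reached by following those ports
```

Because each node makes only a single two-way decision, the switching logic at every node stays very simple, which matters when the decision must be made at optical data rates.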
Since all-optical switching with low latency is very difficult, Princeton uses electronic switches instead. A detector in each node "is how the optical and electronic components talk to each other," explained Qimin Yang, electrical engineering graduate student. "The data payload stays in the optical domain, so you can have a very high-speed network." Arend said there is nothing to prevent each optical beam from transmitting one trillion bits of data per second.
Optics also plays a role in HTMT's storage system. By using two laser beams and changing the angle of one beam thousands of times, it is possible to record data-encoded light into holograms. "It is a way to store information in three dimensions," said Demetri Psaltis, Thomas G. Myers Professor of electrical engineering, California Institute of Technology. "Other methods record information only on the surface of the media."
Three-dimensional storage means data access occurs in parallel, 100 times faster than conventional means. Such speed facilitates passing partial results to and from the DRAM units, Psaltis said, giving holographic storage a hybrid memory function. Full petabyte capability is targeted for 2007.
The HTMT project's current phase ends in June 1999. If funded, "our next step will be to design and implement a prototype testbed and integrate the technologies," Sterling said. "We expect the first prototype in 2001, a 400-gigaflops, four-processor system."
Building a petaflops system will require industrial partners. Most of the technology components are already being manufactured to some extent. However, Dorojevets is adamant that scaling to petaflops without government support is hopeless.
"No one else is going to invest money in this technology," he said. "We have known about superconducting circuits for 30 years, and we're still working on the first stage of the RSFQ processor design." Arnold Silver, TRW Inc.'s superconductor electronics manager, said it would cost approximately $30 million to make commercial manufacturing of petaflops-scale chips a reality. The other components would need similar amounts of funding.
Sterling estimates the total cost of a petaflops HTMT computer at $200 million in 2007, about double the cost of a 1998 semiconductor system that is only 1/1,000 as fast.
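Sterling's comparison can be restated as cost per unit of performance. The Python sketch below takes "about double" as exactly double and "1/1,000 as fast" as exactly one teraflops; both are simplifying assumptions for illustration.

```python
# Cost per gigaflops implied by Sterling's estimate.
HTMT_COST_USD = 200 * 10**6          # $200 million in 2007
HTMT_GIGAFLOPS = 10**6               # one petaflops = one million gigaflops

# Assumptions: the 1998 system costs exactly half as much
# and runs at exactly 1/1,000 the speed.
SYSTEM_1998_COST_USD = HTMT_COST_USD // 2
SYSTEM_1998_GIGAFLOPS = HTMT_GIGAFLOPS // 1000

htmt_usd_per_gigaflops = HTMT_COST_USD / HTMT_GIGAFLOPS
system_1998_usd_per_gigaflops = SYSTEM_1998_COST_USD / SYSTEM_1998_GIGAFLOPS
improvement = system_1998_usd_per_gigaflops / htmt_usd_per_gigaflops

print(f"HTMT (2007): ${htmt_usd_per_gigaflops:,.0f} per gigaflops")
print(f"1998 system: ${system_1998_usd_per_gigaflops:,.0f} per gigaflops")
print(f"Improvement in cost per gigaflops: {improvement:.0f}x")
```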