2 New NSF-Funded Systems at PSC to Scale HPC for Data Science, AI

November 25, 2020

Nov. 25, 2020 — The Oct. 20, 2020, XSEDE ECSS Symposium featured overviews of two new NSF-funded HPC systems at the Pittsburgh Supercomputing Center (PSC). The new resources, called Bridges-2 and Neocortex, will continue the center’s exploration of scaling HPC for data science and AI on behalf of new communities and research paradigms. Both systems are currently preparing their early user programs. The two systems will be available at no cost for research and education, and at cost-recovery rates for other purposes.

Bridges-2: Scaling Deep Learning and Data Science for Expanding Applications

“One of the motivations for us to build Bridges-2 was rapidly evolving science and engineering,” said Shawn Brown, PSC’s director and PI for that system, in introducing that $20-million, XSEDE-allocated HPC platform, integrated by HPE. “The landscape of high performance computing and computational research has changed drastically over the last decade; we really wanted to build a machine that supported the new ways of doing computational science and not necessarily only traditional computational science,” especially in the areas of artificial intelligence and complex data science.

Bridges-2’s predecessor, Bridges, broke new ground in easing entry to heterogeneous HPC for research communities that never before required computing, let alone supercomputing. Bridges-2 will continue this mission and add expanded capabilities for fields such as scalable HPC-powered AI; data-centric computing, both in fields that require massive datasets and in fields with many small datasets; and research via popular cloud-based applications, containers and user-focused platforms.

“We’re not just going to be supporting the command line, we want to be able to support all sorts of modes of computation to make this as applicable to [new] communities as possible,” Brown said. “We [want to] remove barriers to people using high performance computing for their research rather than us training them to do it the way that we do things—we want to … enable them to do their research in their own particular idiom.”

Like Bridges, Bridges-2 will offer a heterogeneous system designed to allow complex workflows leveraging different computational nodes with speed and efficiency. This will include:

  • 488 256-GB-RAM regular-memory (RM) nodes and 16 512-GB-RAM large-memory (LM) nodes, featuring two AMD EPYC “Rome” 7742 CPUs each
  • Four 4-TB extreme-memory (EM) nodes with four Intel Xeon Platinum 8260M “Cascade Lake” CPUs
  • 24 GPU nodes with eight NVIDIA Tesla V100-32 GB SXM2 GPUs, two Intel Xeon Gold “Cascade Lake” CPUs and 512 GB RAM
  • A Mellanox ConnectX-6 HDR InfiniBand 200Gb/s interconnect
  • An efficient tiered storage system including a flash array with greater than 100 TB usable storage; a Lustre file system with 21 PB raw storage; and an HPE StoreEver MSL6480 Tape Library with 7.2 PB uncompressed, ~8.6 PB compressed space
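The node counts and per-node specs above imply the system’s aggregate compute-tier capacity. A minimal back-of-the-envelope sketch, using only the figures listed (and excluding service/login nodes, storage servers and the Bridges-AI hardware):

```python
# Aggregate capacity of Bridges-2's compute tiers, computed from the
# node counts and per-node specs listed above. Illustrative only.

node_tiers = {
    # tier: (node_count, ram_gb_per_node)
    "RM (regular memory)": (488, 256),
    "LM (large memory)":   (16, 512),
    "EM (extreme memory)": (4, 4096),   # 4 TB per node
    "GPU":                 (24, 512),
}

GPUS_PER_GPU_NODE = 8   # NVIDIA Tesla V100-32GB SXM2
GPU_HBM_GB = 32         # per-GPU memory

total_nodes = sum(count for count, _ in node_tiers.values())
total_ram_gb = sum(count * ram for count, ram in node_tiers.values())
total_gpus = node_tiers["GPU"][0] * GPUS_PER_GPU_NODE
total_hbm_gb = total_gpus * GPU_HBM_GB

print(f"Compute nodes:  {total_nodes}")
print(f"Total CPU RAM:  {total_ram_gb} GB (~{total_ram_gb / 1024:.0f} TB)")
print(f"Total GPUs:     {total_gpus}")
print(f"Total GPU HBM:  {total_hbm_gb} GB")
```

By this count the four tiers together provide 532 compute nodes, roughly 158 TB of CPU memory, and 192 V100 GPUs with 6 TB of aggregate GPU memory.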

“We want Bridges-2 … to work interoperably with all sorts of different [computational resources], including workflows, engines, heterogeneous computing, cloud resources,” Brown said. “We want this thing to be a member of the ecosystem—not just a standalone machine, but really a resource that’s widely available and applicable to a number of different rapidly evolving research paradigms.”

PSC will integrate Bridges-2 with its existing Bridges-AI system, which features an NVIDIA DGX-2 enterprise research AI system tightly coupling 16 NVIDIA Tesla V100 (Volta) GPUs with 32 GB of GPU memory each.

Brown encouraged researchers to take advantage of Bridges-2’s Early User Program, which is now accepting proposals and is scheduled to begin early in 2021. This program will allow users to port, tune and optimize their applications early, and make progress on their research, while providing PSC with feedback on the system and how it can be better tuned to users’ needs. Information on applying as well as program updates can be found at https://psc.edu/bridges-2/eup-apply.

Updates on the system in general can be found at http://www.psc.edu/resources/computing/bridges-2.

Neocortex: Democratizing Access to Game-Changing Compute Power in Deep Learning

The CS-1, a new generation of “wafer-scale” engine, is the largest chip ever built: a 46-square-centimeter processor featuring 1.2 trillion transistors. Its design principle is to accelerate training to shorten this critical and lengthy component of deep learning.

Sergiu Sanielevici, Neocortex’s co-PI and director of user support for scientific applications at PSC, introduced the $5-million, Cerebras Systems/HPE system on behalf of PI Paola Buitrago, director of artificial intelligence & data science at PSC. Neocortex was funded through the NSF’s new Category 2 awards, which support systems intended to explore innovative HPC architectures. Neocortex will feature two Cerebras CS-1 systems and an HPE Superdome Flex HPC server robustly provisioned to drive the CS-1 systems simultaneously at maximum speed and to support the complementary requirements of AI and high performance data analytics workflows.

“Neocortex is specifically designed for AI training—to explore how [the CS-1s] can be used, how that can be integrated into research workflows,” Sanielevici said. “We want to get to this ecosystem that [spans] from what Bridges-2 can do … to the things that really require this specialized hardware that our partners at Cerebras provide.”

“Machine-learning workflows are of course not simple,” Sanielevici said. “Training is not a linear process … it’s a highly iterative process with lots of parameters. The goal here is to vastly shorten the time required for deep learning training and in the larger ecosystem foster integration of deep learning with scientific workflows—to really see what this revolutionary hardware can do.”

The CS-1 fabric connects cluster-scale compute in a single system to eliminate communication bottlenecks and make model-parallel training easy, he added. Without orchestration or synchronization headaches, the system offers a profound advantage for machine learning training with small batches at high utilization, obviating the need for tricky learning schedules and optimizers.

A major design innovation was to connect the two CS-1 servers via an HPE Superdome Flex system. The combination is expected to provide substantial capability for preprocessing and other complementary aspects of AI workflows, enabling training on very large datasets with exceptional ease and supporting both CS-1s independently and together to explore scaling.

Neocortex accepted early user proposals from August through September 2020; 42 applications are currently being assessed. Proposals represent research areas including AI theory, bioinformatics, neurophysiology, materials science, electrical and computer engineering, medical imaging, geophysics, civil engineering, IoT, social science, drug discovery, fluid dynamics, ecology and chemistry. Information about the system and its progress can be found at https://www.cmu.edu/psc/aibd/neocortex/.

You can find a video and slides for both presentations at https://www.xsede.org/for-users/ecss/ecss-symposium.


Source: XSEDE
