Carnegie Mellon University

Google Cloud ORCHARD Cluster

The Google Cloud Cluster, branded ORCHARD, is an innovative, cloud-based computing resource created in collaboration with Google specifically for research projects. It offers high-performance computing tools, direct access to Google engineers, and robust AI and advanced computing capabilities to help us tackle today's challenges. This resource significantly boosts our cloud-based GPU capabilities.

Capabilities

  • Architecture
    There are 37 compute nodes in total. Each node is a reserved-instance VM on the Google Cloud Platform, specifically an a3-megagpu-8g. Each node provides eight Nvidia H100 GPUs with 80 GB of memory apiece, 208 vCPUs, 1.872 TB of system memory, a 6 TB local SSD, and nine network interfaces (the sketch after this list totals these figures across the cluster).
  • Storage
    The cluster is equipped with 48 TB of storage for project and home directories, utilizing Google’s Filestore service, which offers a convenient NFS interface. Additional storage may be purchased.  
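
For a quick sense of the aggregate capacity these per-node figures imply, the minimal Python sketch below simply multiplies them out. The per-node values are taken from the Architecture notes above; the totals are plain arithmetic, not measured or officially quoted figures.

# Aggregate capacity implied by the per-node figures listed above.
# Per-node values come from the Architecture notes; totals are arithmetic only.
NODES = 37
GPUS_PER_NODE = 8            # Nvidia H100, 80 GB each
VCPUS_PER_NODE = 208
MEM_TB_PER_NODE = 1.872
LOCAL_SSD_TB_PER_NODE = 6

print(f"Total GPUs:       {NODES * GPUS_PER_NODE}")                      # 296
print(f"Total GPU memory: {NODES * GPUS_PER_NODE * 80 / 1000:.2f} TB")   # 23.68 TB
print(f"Total vCPUs:      {NODES * VCPUS_PER_NODE}")                     # 7696
print(f"Total system RAM: {NODES * MEM_TB_PER_NODE:.1f} TB")             # 69.3 TB
print(f"Total local SSD:  {NODES * LOCAL_SSD_TB_PER_NODE} TB")           # 222 TB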

Eligibility

Faculty and Staff

Request

Ready to use these expanded GPU capabilities? Email us:

Cost

Faculty will not be directly charged for access to the compute cluster during a two-year pilot period. 


Frequently Asked Questions

Find answers to common questions. If you need more help, reach out to our support team.

CMU has acquired a compute cluster with 296 Nvidia H100 GPUs across 37 compute nodes. Each compute node is a Google Cloud Platform reserved-instance Virtual Machine (VM), an a3-megagpu-8g, with eight H100 GPUs, a 6 TB local SSD, and nine network interfaces.

With input from faculty, Carnegie Mellon University (CMU) will offer two primary partitions and governance models to meet the complementary requirements of our research enterprise.

The first partition will primarily support advanced research on AI foundation and language models, for which single training runs can take thousands of H100 GPU hours and require scaling to several hundred GPUs. This partition will provide research projects with simultaneous access to up to 256 H100 GPUs to meet these intensive compute demands. CMU's Foundation and Language Model (FLAME) Center will govern access to and usage of this 256 H100 GPU partition, called the FLAME Partition. Access to the FLAME Partition for other research requiring the same degree of scale will be granted through a proposal process.
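
To make that scale concrete, here is a hedged back-of-the-envelope sketch in Python. The workload sizes are illustrative assumptions rather than actual FLAME projects, and the calculation simply divides GPU-hours by GPU count, ignoring parallel-scaling overheads.

# Hypothetical back-of-the-envelope conversion of GPU-hours to wall-clock time.
# Workload sizes are illustrative assumptions, not real project figures, and
# the ideal division below ignores parallel-scaling efficiency losses.
def wall_clock_hours(gpu_hours: float, gpus: int) -> float:
    """Ideal wall-clock hours for a job of `gpu_hours` spread across `gpus` GPUs."""
    return gpu_hours / gpus

for gpu_hours in (5_000, 50_000):      # "thousands of H100 GPU hours"
    for gpus in (40, 256):             # Community vs. FLAME partition sizes
        print(f"{gpu_hours:,} GPU-hours on {gpus} GPUs "
              f"~ {wall_clock_hours(gpu_hours, gpus):.1f} hours wall-clock")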

The second partition will provide access to 40 H100 GPUs for other research areas that require high-performance computing resources beyond those available in individual colleges, schools, or labs at CMU. This will be called the Community Partition, and a Community Partition Committee (CPC) composed of college representatives will govern access. To make the best use of all 296 H100 GPUs in the cluster, jobs in one partition can use idle nodes in the other; jobs taking advantage of this will be preempted algorithmically if the other partition needs the nodes back (a simplified sketch of this policy follows below).
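
The borrowing-and-preemption rule described above can be illustrated with a small, hypothetical Python sketch. The scheduler actually used on ORCHARD and its configuration are not specified here; the class and function names below are invented for illustration, and the sketch only captures the stated policy: a job running on nodes borrowed from the other partition is preempted when that partition needs its nodes back.

# Hypothetical illustration of the cross-partition borrowing rule described
# above; the real scheduler and its configuration are not specified here.
from dataclasses import dataclass, field

@dataclass
class Partition:
    name: str
    idle_nodes: int
    # Jobs from the other partition currently running on this partition's nodes.
    borrowed_jobs: list = field(default_factory=list)

def try_borrow(job: str, lender: Partition) -> bool:
    """Run `job` on one of the lender partition's idle nodes, if any are free."""
    if lender.idle_nodes > 0:
        lender.idle_nodes -= 1
        lender.borrowed_jobs.append(job)
        return True
    return False

def reclaim(owner: Partition) -> list:
    """The owner needs its nodes back: preempt every job on borrowed nodes."""
    preempted = list(owner.borrowed_jobs)
    owner.idle_nodes += len(preempted)
    owner.borrowed_jobs.clear()
    return preempted

flame = Partition("FLAME", idle_nodes=2)
print(try_borrow("community-job-1", flame))   # True: a Community job borrows an idle FLAME node
print(reclaim(flame))                         # ['community-job-1'] is preempted when FLAME reclaims the node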

If you are researching AI foundation and language models, we recommend joining FLAME. The 256 GPU partition is set up for foundation and language model research, and FLAME manages that portion of the cluster on behalf of CMU and governs its use.

 

CMU has allocated 40 H100 GPUs for general use on faculty projects. The larger segment of 256 GPUs is primarily intended for GPU-intensive research and development on foundation and language models, but faculty whose projects need more than 40 GPUs can submit a proposal to the CPC describing the project, the number of GPU hours needed, and the likely impact of the research. The CPC will review the proposal and recommend the project to the Provost and Vice President for Research. If approved, the CPC will work with FLAME to ensure that scheduling and resources are available.

Faculty will not be directly charged for access to the compute cluster during a two-year pilot period. FLAME and the CPC will allocate those resources. The schools and CMU central offices will carry the costs of operating the cluster and computing. The pilot project will help the university assess demand and operating costs and determine the best way to support long-term, large-scale computing.

Researchers are responsible for the storage and API call expenses required to use the cluster, which are billed monthly. 50 TB of storage costs approximately $1,000 per month. Please see Public Cloud Computing for more information.
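
As a rough illustration of how those monthly storage charges scale, here is a minimal sketch assuming the rate quoted above of roughly $1,000 per 50 TB per month (about $20 per TB per month). Actual Filestore pricing and API-call charges may differ, so treat these as estimates only.

# Rough monthly storage estimate based on the figure quoted above:
# about $1,000 per 50 TB per month, i.e. roughly $20 per TB per month.
# Actual Filestore pricing and API-call charges may differ.
COST_PER_TB_PER_MONTH = 1000 / 50   # ~$20 per TB per month

def monthly_storage_cost(terabytes: float) -> float:
    """Estimated monthly storage cost in dollars at the assumed rate."""
    return terabytes * COST_PER_TB_PER_MONTH

for tb in (5, 20, 50, 100):
    print(f"{tb} TB -> about ${monthly_storage_cost(tb):,.0f} per month")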

Request a consultation to understand the costs for your project.

The name ORCHARD comes from the Office of Research Computing and highlights the cluster's capabilities in high-performance computing, analytics, research, and data management.