As a veteran of the technology industry, I have experienced the ebbs and flows of the “next big thing”: e-commerce, blockchain, cloud computing, IoT, edge computing, quantum computing, big data, and so on. The current buzz, or “next big thing,” is Artificial Intelligence (AI). I recently had an opportunity to deploy an AI architecture, and I thought I would share some thoughts from my adventure.
My focus, as a solution architect, is on the infrastructure. As use cases are presented, my job is to design an architecture that meets the technical requirements. In this case, the requirement was to architect a solution to support a Large Language Model (LLM). The solution would also leverage Retrieval-Augmented Generation (RAG) to enrich the model with domain data going forward. The architecture would be used for benchmarking, so there was an emphasis on optimal performance.
The number of GPUs and the instance shapes were provided; in this case, H100s from NVIDIA. Next, the question became what we would use to schedule jobs on the GPUs. With more than one H100, we needed a cluster management solution for the design, and our options were narrowed down to the following:
- SLURM – Originally short for Simple Linux Utility for Resource Management and first released in 2002, Slurm has been around for a long time. It has high performance computing (HPC) marketplace images for most cloud service providers and has grown in sophistication as a machine learning scheduler.
- RKE2 with NVIDIA Run:ai – RKE2, also known as RKE Government, is the next iteration of Rancher Kubernetes Engine (RKE). It changes the container runtime from Docker to containerd and hardens security for government workloads. NVIDIA Run:ai is a cloud-native AI orchestration platform that simplifies and accelerates AI and ML through dynamic resource allocation, providing comprehensive AI lifecycle support and strategic resource management.
- OKE with Run:ai – Oracle Cloud Infrastructure Kubernetes Engine (OKE) is a fully managed, scalable, and highly available service you can use to deploy containerized applications to the cloud. It is integrated into the OCI platform: deployment and control are available from the OCI console, CLI, or REST API, and it ties into OCI Identity and Access Management. OKE offers both virtual and self-managed nodes; for AI and ML purposes, self-managed nodes are the only viable option.
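Whichever Kubernetes option we picked, Run:ai ultimately schedules against the `nvidia.com/gpu` resource that the NVIDIA GPU operator's device plugin advertises on each node. As a quick sanity check during evaluation, a minimal sketch like the following shows what the scheduler actually sees (Python Kubernetes client; it assumes a local kubeconfig and that the GPU operator is already installed):

```python
# A minimal sketch, assuming a local kubeconfig and that the NVIDIA GPU
# operator is installed so nodes advertise the "nvidia.com/gpu" resource.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() inside a pod
v1 = client.CoreV1Api()

total = 0
for node in v1.list_node().items:
    gpus = int(node.status.allocatable.get("nvidia.com/gpu", "0"))
    if gpus:
        print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")
        total += gpus
print(f"cluster total: {total} GPU(s)")
```

If the total comes back as zero on a GPU shape, the device plugin is not running, and no scheduler on top of Kubernetes will place work on those H100s.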
The scheduler drove the decision between SLURM and Kubernetes. Run:ai is a Kubernetes-based solution, which meant SLURM was out of the running. We then needed to decide between RKE2 and OKE. Delving into certifications, we noted that the NVIDIA GPU operator is not certified for Oracle Linux, which initially led us to believe OKE was out of the running for cluster management. After further digging, we determined that OKE now supports Ubuntu worker nodes. We followed the steps from the OKE Ubuntu Node Packages General Availability blog, which allowed us to put OKE back into consideration.
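Since the Oracle Linux certification gap was the whole reason OKE was nearly eliminated, it is worth confirming what OS the worker nodes actually report before installing the GPU operator. A minimal sketch, assuming a kubeconfig generated for the OKE cluster (e.g., via `oci ce cluster create-kubeconfig`):

```python
# A minimal sketch: confirm the OKE worker nodes report Ubuntu before
# installing the NVIDIA GPU operator. Assumes a kubeconfig for the cluster.
from kubernetes import client, config

config.load_kube_config()
for node in client.CoreV1Api().list_node().items:
    info = node.status.node_info
    # Expect an Ubuntu release string on the GPU worker nodes.
    print(f"{node.metadata.name}: {info.os_image} (kernel {info.kernel_version})")
```

With OKE confirmed as viable, we evaluated three main categories: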
- Cost – RKE2 would have required an additional management cluster, adding four virtual machines to the design. Compared with OKE, RKE2 was therefore more expensive to run, and the extra nodes would also have driven up operational cost (more nodes to manage).
- Ease of Deployment – OKE is easily deployed through the console, via the CLI, or through Terraform stacks. RKE2 can be automated using an OCI Marketplace Terraform stack. I should note that in both cases the control plane, for RKE2 (Marketplace) and for OKE, defaults to Oracle Linux. OKE also comes integrated with the OCI Block Volume (BV) and File Storage Service (FSS) storage classes (see the sketch after this list), so a slight advantage goes to OKE because of the storage integration.
- Performance – One compelling performance gain for OKE was the ability to use the OCI VCN-Native CNI, which removes the need for an overlay network. Not only does this simplify the network setup; eliminating the overlay also removes encapsulation overhead, which is a performance gain. I will cover the use of the OCI VCN-Native CNI in more detail in a follow-up blog entry.
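To give a concrete picture of the storage integration mentioned above, here is a minimal sketch that requests a Block Volume through OKE's built-in CSI storage class. The class name `oci-bv` (the default on current OKE clusters) and the 500Gi size are assumptions for illustration; confirm the names on your cluster with `kubectl get storageclass`:

```python
# A minimal sketch: provision an OCI Block Volume through OKE's built-in
# CSI storage class. "oci-bv" and "500Gi" are assumptions; verify with
# "kubectl get storageclass" on your cluster.
from kubernetes import client, config

config.load_kube_config()
pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "model-cache"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],  # Block Volume is single-writer
        "storageClassName": "oci-bv",
        "resources": {"requests": {"storage": "500Gi"}},
    },
}
client.CoreV1Api().create_namespaced_persistent_volume_claim(
    namespace="default", body=pvc
)
```

For workloads where many nodes must read the same data (the ReadWriteMany case), the FSS storage class fills the same role through its own CSI driver, which leads into the shared storage discussion below.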
Based on the criteria above, we chose OKE with Run:ai. Choosing our AI scheduler and cluster management platform was the first step. Our next decision point focused on the required shared storage options, which I will cover in the next blog entry on this AI adventure.
