Category: Artificial Intelligence
-
An AI journey continues – configure the scheduler
Now that the infrastructure had been deployed (software-defined network, OKE, H100 GPUs, storage, etc.), it was time to configure the scheduler (run:ai). The first question posed post-installation: “do we need any special configuration for the network operator in order for the scheduler pods to leverage RDMA?” Would we need Single Root I/O Virtualization…
-
An AI journey continues – GPU Deployment!
With our OKE cluster successfully deployed, it was time to start working on the GPU node deployment. Our GPU nodes must run Ubuntu 22.04 because of its support for the NVIDIA GPU Operator, which the run:ai scheduler requires. For optimal performance between the GPU worker node instances, we needed to…
-
An AI journey continues – Network design
I wish I had been brilliant enough to plan out the network deployment without issue, but as the saying goes…we live and we learn. Here are some key networking decisions that will need to be considered. After running through all of the prerequisites for the run:ai cluster installation, we made sure that we had an…
-
An AI journey continues – storage
In the last blog entry, we left off with the scheduler and Kubernetes cluster decisions in place. Our focus quickly turned to storage options. Since we would have two GPU worker nodes, we required a shared storage option. The throughput requirement we were given was 50 Gbps. OCI AI Architecture documentation lists Lustre, BeeGFS,…
-
An AI journey begins – choosing a scheduler
As a veteran of the technology industry, I have experienced the ebbs and flows of the “next big thing”: e-commerce, blockchain, cloud computing, IoT, edge computing, quantum computing, big data, and so on. The current buzz, or “next big thing,” is Artificial Intelligence (AI). I recently had an opportunity to deploy an AI architecture. I thought I…
