I wish I had been brilliant enough to have planned out the network deployment without issue, but as the saying goes… we live and we learn. Here are some key networking decisions that will need to be considered:
- Overlay or Native
- As it turns out, there are quite a few choices for Kubernetes overlay networking: Flannel, Calico, Canal, Cilium and Weave Net, to name a few. Since time was a factor for this deployment, ease of implementation and a low learning curve were top priorities.
- Once we had chosen to go with OKE as the Kubernetes cluster, we were left with two options: OCI offers Flannel as an overlay, or VCN-Native Pod Networking can be used. We chose to go with the native option to remove the complexity of adding an overlay network and for the performance gain.
- Number of pods
- Kubernetes has a default limit of 110 pods per node
- OCI VCN-Native Pod Networking automatically limits the number of pods per VNIC to 31
- That means a node would need a minimum of four VNICs in order to run 110 pods on the node
- The number of pods drives subnet sizing as well as instance shapes. We initially thought that a /24 would be more than sufficient but wound up re-sizing to a /23 (we really wanted to do a /22, but that would have meant terminating and re-deploying the cluster again)
- The number of pods per node and the number of nodes will have a direct impact on the CIDR range required for your VCN and subnets. I cannot overstate the importance of this planning step! (A quick sizing sketch follows after this list.)
- Load balancer
- Personally, I would opt to define a load balancer subnet at cluster deployment by default, just for future scalability (even if a load balancer is not required today)
- The load balancer is not actually deployed at cluster creation; you are just defining the subnet that will be used once a load balancer is created by the cluster
- Public or private
- Thank goodness run:ai created the load balancer during installation; I still have a lot to learn concerning load balancer and network load balancer integration with Kubernetes
- DNS
- Run:ai requires an FQDN
- We opted to use a private DNS view for resolution of the IP address
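To make the pod-count and subnet-sizing math above concrete, here is a minimal back-of-the-envelope sketch in Python. The node count and the candidate CIDRs are made-up placeholders rather than our production values; the 110-pod and 31-pods-per-VNIC figures are the limits discussed above.

```python
import math
import ipaddress

# Assumed planning inputs for illustration; substitute your own node count.
nodes = 4
pods_per_node = 110   # Kubernetes default per-node pod limit
pods_per_vnic = 31    # OCI VCN-Native Pod Networking limit per VNIC

# VNICs needed per node just for pod IPs (the node's primary VNIC in the
# worker subnet is separate).
vnics_per_node = math.ceil(pods_per_node / pods_per_vnic)
print(f"pod VNICs per node: {vnics_per_node}")  # -> 4

# Every pod consumes an IP in the pod subnet, so size for the worst case.
pod_ips_needed = nodes * pods_per_node
for cidr in ("10.0.2.0/24", "10.0.2.0/23", "10.0.4.0/22"):
    subnet = ipaddress.ip_network(cidr)
    usable = subnet.num_addresses - 2  # rough count, ignoring OCI-reserved addresses
    verdict = "fits" if usable >= pod_ips_needed else "too small"
    print(f"{cidr}: ~{usable} usable addresses for {pod_ips_needed} pod IPs -> {verdict}")
```

With those assumed numbers, a /24 comes up short while a /23 fits, which is exactly the re-sizing exercise described above.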
After running through all of the prerequisites for the run:ai cluster installation, we made sure that we had an NGINX ingress controller that would integrate with OCI to create and manage the load balancer as a service when the time came for installation. I started looking into creating a manifest to create and manage the OCI load balancer. There are a ton of options, so I was definitely thankful that the run:ai installer took care of deploying and configuring the backend sets. For the curious, a rough sketch of that kind of Service definition follows below.
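This is a minimal sketch, using the Kubernetes Python client, of a Service of type LoadBalancer fronting an ingress controller; OCI's cloud controller watches for the Service and provisions the load balancer and backend sets behind it. The namespace, selector labels, and annotation values here are assumptions for illustration, and the annotations shown are only a small subset of the OCI options, so check the OCI documentation before relying on them.

```python
from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()

# Assumed namespace and labels; adjust to match your ingress controller deployment.
service = client.V1Service(
    metadata=client.V1ObjectMeta(
        name="ingress-nginx-controller",
        namespace="ingress-nginx",
        annotations={
            # Example OCI load balancer annotations (flexible shape, 10-100 Mbps).
            "service.beta.kubernetes.io/oci-load-balancer-shape": "flexible",
            "service.beta.kubernetes.io/oci-load-balancer-shape-flex-min": "10",
            "service.beta.kubernetes.io/oci-load-balancer-shape-flex-max": "100",
        },
    ),
    spec=client.V1ServiceSpec(
        type="LoadBalancer",
        selector={"app.kubernetes.io/name": "ingress-nginx"},
        ports=[
            client.V1ServicePort(name="http", port=80, target_port=80),
            client.V1ServicePort(name="https", port=443, target_port=443),
        ],
    ),
)
core_v1.create_namespaced_service(namespace="ingress-nginx", body=service)
```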
We did experience a delay with the run:ai installer due to a missing egress rule. Private DNS zones require UDP port 53, and I had only opened TCP 53, which caused the installation script to fail because it could not resolve the FQDN we were applying (an FQDN is a prerequisite for a run:ai installation). The sketch below shows roughly what the missing rule looks like.
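If your worker subnet is governed by a security list, the fix is simply an egress rule allowing UDP 53 alongside TCP 53. Here is a rough sketch using the OCI Python SDK; the security list OCID and destination CIDR are placeholders, and if you use network security groups instead, the calls differ.

```python
import oci

# Assumed placeholders: your security list OCID and the DNS destination.
SECURITY_LIST_ID = "ocid1.securitylist.oc1..example"
DNS_DESTINATION = "10.0.0.0/16"  # e.g. the VCN CIDR; depends on where your resolver endpoint lives

config = oci.config.from_file()
network = oci.core.VirtualNetworkClient(config)

# Keep the existing egress rules and append a stateful UDP 53 rule.
security_list = network.get_security_list(SECURITY_LIST_ID).data
egress_rules = list(security_list.egress_security_rules)
egress_rules.append(
    oci.core.models.EgressSecurityRule(
        description="DNS over UDP for private zone resolution",
        destination=DNS_DESTINATION,
        protocol="17",  # 17 = UDP (6 = TCP)
        udp_options=oci.core.models.UdpOptions(
            destination_port_range=oci.core.models.PortRange(min=53, max=53)
        ),
    )
)
network.update_security_list(
    SECURITY_LIST_ID,
    oci.core.models.UpdateSecurityListDetails(egress_security_rules=egress_rules),
)
```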
My biggest takeaway from the network planning for the AI architecture was that the number of nodes really needs to be defined up front: not just for today, but whether it will need to scale in the future, and how many pods are required per node. Since we were spinning up worker nodes via the CLI, we were able to work around the network limitation by spinning up an additional subnet; we would have had to do some re-work if we were only using the console to spin up additional nodes. Using the CLI is additional work, but it is good to know there is a workaround should we run into a subnet limitation in the future. A quick sanity check for that add-a-subnet workaround is sketched below.
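If you do end up bolting on an additional pod subnet after the fact, it is worth confirming that the new range sits inside the VCN and does not overlap anything already carved out. A minimal sketch, with placeholder CIDRs:

```python
import ipaddress

# Assumed placeholder CIDRs for illustration.
vcn = ipaddress.ip_network("10.0.0.0/16")
existing = [ipaddress.ip_network(c) for c in ("10.0.0.0/24", "10.0.2.0/23")]
candidate = ipaddress.ip_network("10.0.4.0/23")  # proposed additional pod subnet

inside_vcn = candidate.subnet_of(vcn)
overlaps = [str(s) for s in existing if candidate.overlaps(s)]

print(f"{candidate} inside {vcn}: {inside_vcn}")
print(f"overlaps with existing subnets: {overlaps or 'none'}")
```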
After deploying our virtual cloud networks, subnets and other components of the OCI software-defined network, we deployed our OKE cluster (more than once, as I had alluded to in the previous storage blog entry) and we were ready to deploy our GPU worker node and have it join the cluster. I will cover that in the next entry.
