
Supporting GPUs with CoreOS

By Sebastian Wöhrl, @swoehrl on Twitter
September 11, 2019

Machine learning is becoming an ever-bigger factor in Big Data analytics. To work with machine learning efficiently, we need to provide a fast way to train models. The best way to do that is by using GPUs to speed up the training.

Since we have been using D2iQ DC/OS in our projects, we needed a way to integrate GPUs into our clusters. On the hardware side this is a no-brainer, as all the major cloud providers (like Amazon AWS or Microsoft Azure) offer special compute machine types with integrated GPUs. On the software side, however, it presents a challenge. We use CoreOS Container Linux as the basis for our cluster nodes, as it is a minimal and therefore efficient operating system optimized for containers. Because of its slimness, the attack surface is reduced and managing nodes is simple. But Nvidia, the GPU vendor all the major cloud providers use, does not officially support CoreOS. So we had to find a custom way to get the Nvidia drivers installed and running on CoreOS.

In this blog post we would like to describe the solution we came up with. Please take a look at GitHub if you are interested in the actual implementation.

Compiling the kernel modules

The first step is to compile the Nvidia kernel modules for the exact kernel version of the CoreOS release we are running. To do this independently of any CoreOS instance in our cluster, we start a Docker container from an image based on the CoreOS developer image (see here), which contains the kernel sources and additional tooling. Inside the container we download and prepare the kernel sources for module compilation (as documented here) and download the regular Nvidia installer from Nvidia's website.
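
The preparation inside the developer container roughly follows the steps from the CoreOS documentation on building custom kernel modules; this is only a sketch, and the driver version below is an example you would adjust to your needs:

  # Fetch and prepare kernel sources matching the CoreOS release
  # (per the CoreOS custom-kernel-modules documentation).
  emerge-gitclone
  emerge -gKv coreos-sources
  gzip -cd /proc/config.gz > /usr/src/linux/.config
  make -C /usr/src/linux modules_prepare

  # Download the regular Nvidia installer (example version, adjust as needed).
  NVIDIA_DRIVER_VERSION=418.87.00
  curl -fLO https://us.download.nvidia.com/XFree86/Linux-x86_64/${NVIDIA_DRIVER_VERSION}/NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}.run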

Instead of just running the installer, we first have it only extract its files and then run only the part that compiles the kernel modules. Via the appropriate command line arguments we point it at our CoreOS kernel sources, tell it to ignore the running kernel, and instruct it not to install the compiled modules:

  ./NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}.run -x
  cd NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}
  IGNORE_MISSING_MODULE_SYMVERS=1 ./nvidia-installer -s -n --kernel-source-path=/usr/src/linux --no-check-for-alternate-installs --kernel-install-path=$WORKDIR --skip-depmod || echo "install failure is ok"

Extending the DC/OS setup process

We then want to install these modules and all supporting libraries and binaries on our CoreOS instances in the cluster. To make the process as easy as possible, we forgo building custom CoreOS images and instead just extend our already existing DC/OS setup process. For that we first compress the Nvidia kernel modules, libraries and binaries into a single archive file and upload it to a location reachable by all cluster nodes.
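
For illustration, the packaging step could look like the following sketch; the archive layout and the bucket name are hypothetical:

  # Bundle the kernel modules plus the driver's user-space libraries
  # and binaries into one archive (directory layout is illustrative).
  tar -czf nvidia-drivers-${NVIDIA_DRIVER_VERSION}.tar.gz \
      kernel-modules/ lib/ bin/

  # Upload it to a location all cluster nodes can reach, e.g. an S3 bucket.
  aws s3 cp nvidia-drivers-${NVIDIA_DRIVER_VERSION}.tar.gz \
      s3://my-cluster-artifacts/nvidia/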

Our setup process is automated using Terraform, which allows us to easily add, update and remove cluster nodes. With Terraform we create new VM instances using the cloud provider APIs, provision them with our base configuration and then install DC/OS. Our tooling is heavily based on the official DC/OS Terraform installer from D2iQ. We simply extended the installer to download the archive file, extract its contents and load the kernel modules. With these steps the GPU instances are already available to DC/OS.
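
In essence, the extra provisioning step could look like this sketch (the download URL and paths are hypothetical; /opt/nvidia matches the host path used in the job definition below):

  # Fetch and unpack the driver archive on each GPU node.
  curl -fLo /tmp/nvidia-drivers.tar.gz \
      https://my-cluster-artifacts.example.com/nvidia/nvidia-drivers.tar.gz
  mkdir -p /opt/nvidia
  tar -xzf /tmp/nvidia-drivers.tar.gz -C /opt/nvidia

  # The modules are not registered under /lib/modules, so load them
  # directly with insmod, in dependency order (nvidia before nvidia-uvm).
  insmod /opt/nvidia/kernel-modules/nvidia.ko
  insmod /opt/nvidia/kernel-modules/nvidia-uvm.ko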

Since this process is not officially supported, the normal GPU integration in DC/OS does not fully work. GPUs will be detected and can be used in Marathon apps and Metronome jobs (starting with DC/OS 1.13; for 1.12 you need a patch for Metronome to make it work). But normally there is a little helper that collects all Nvidia libraries and binaries and makes them available to GPU containers. This part does not work with CoreOS.

The workaround is to mount the directory in which you installed the drivers as a volume in your containers. You can then use the GPU normally, for example to speed up model training with PyTorch.
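
A quick way to verify that this works inside a container is a short check with PyTorch (the mount path matches the job definition shown below):

  # Point the dynamic linker at the mounted Nvidia libraries,
  # then check that PyTorch can see the GPU.
  export LD_LIBRARY_PATH=/mnt/mesos/sandbox/nvidia/lib
  python -c "import torch; print(torch.cuda.is_available())"   # expect: True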


The process we described so far does not persist across reboots. To make CoreOS load the Nvidia kernel modules on startup, we created a simple systemd unit that runs before DC/OS and loads the kernel modules. This unit must run before DC/OS is started, as otherwise DC/OS would not be able to detect the GPUs.
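
Such a unit could look like the following sketch; the unit name and the exact DC/OS service to order against are assumptions and may differ depending on your setup:

  # /etc/systemd/system/nvidia-modules.service (illustrative name)
  [Unit]
  Description=Load Nvidia kernel modules before DC/OS starts
  # Assumed name of the DC/OS agent service; adjust for your DC/OS version.
  Before=dcos-mesos-slave.service

  [Service]
  Type=oneshot
  RemainAfterExit=true
  ExecStart=/usr/sbin/insmod /opt/nvidia/kernel-modules/nvidia.ko
  ExecStart=/usr/sbin/insmod /opt/nvidia/kernel-modules/nvidia-uvm.ko

  [Install]
  WantedBy=multi-user.target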

We have baked all these steps into some simple scripts and open-sourced them on GitHub. You can use them to add GPU support to your own CoreOS-based DC/OS cluster.

Using this automated process, we can easily add GPU-enabled instances to our cluster on the fly (as our clusters run on AWS, we use p3 instance types) and then run our PyTorch-based model training. We use Metronome jobs to do that. A simplified job definition looks like this:


  {
    "id": "model-training",
    "run": {
      "cpus": 1,
      "mem": 32768,
      "disk": 0,
      "gpus": 1,
      "cmd": "python train.py",
      "env": {
        "LD_LIBRARY_PATH": "/mnt/mesos/sandbox/nvidia/lib"
      },
      "ucr": {
        "image": {
          "id": "docker.analytics.corp/model-training",
          "kind": "docker"
        }
      },
      "volumes": [
        {
          "containerPath": "nvidia",
          "hostPath": "/opt/nvidia/",
          "mode": "RO"
        }
      ],
      "restart": {
        "policy": "NEVER"
      }
    },
    "schedules": []
  }


It is important to note that we mounted the Nvidia folder as a volume and set the LD_LIBRARY_PATH environment variable so that PyTorch picks up the Nvidia libraries.
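
Assuming the DC/OS CLI with the job add-on is installed, a definition like the one above can be registered and started like this (the file name is hypothetical):

  # Register the job definition with Metronome, then trigger a run.
  dcos job add model-training.json
  dcos job run model-training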

And last but not least, please note our disclaimer: The information in this blog post and in the GitHub repository is provided as-is. The procedure is not officially supported by D2iQ or Nvidia. Use it at your own risk.
