In the last article, I discussed the architecture of the HPC. If you have not read that article, I would recommend that you read it before proceeding with this one.
The power of the HPC can be put to its most important application in computer science: Machine Learning. I assume you have already obtained your SSH credentials to log on to the master node. We will be setting up the environment in Python. Let's jump straight to the steps.
How To?
Step 1: Download and Install Anaconda on the Master Node
Note that you are not the root user of the HPC. You are just a regular user, and therefore, administrative commands (such as sudo or su) will not work. Anaconda has made it much easier to install Python packages for non-root users, and we will be using Anaconda for setting up Python 3 and installing the required packages.
- Log in to the master node
> ssh be1005815@172.16.23.1
- Go to https://www.anaconda.com/distribution/ and copy the link for the 64-bit Linux Python 3 distribution. Download the package on the master node using wget. The --no-check-certificate flag is needed because the OS is too old to verify newer TLS certificates.
> wget https://repo.anaconda.com/archive/Anaconda3-2018.12-Linux-x86_64.sh --no-check-certificate
- Change the permission of the downloaded file.
> chmod u+x Anaconda3-2018.12-Linux-x86_64.sh
- Install Anaconda.
> ./Anaconda3-2018.12-Linux-x86_64.sh
- Follow the steps to install Anaconda on the master node. Anaconda will be installed in /home/<username>/anaconda3/.
- Log out and log in again to the master node.
Step 2: Create an Environment and Install Packages
In this step, we will create a Python environment on the master node and then activate it on one of the compute nodes.
- Create a new Python environment using conda. I want to set up a TensorFlow environment, so I will name it 'tensorflow'.
> conda create -n tensorflow python=3.6
- Activate the environment.
> conda activate tensorflow
- Install the required packages. Here, I will be installing TensorFlow CPU and Keras (TensorFlow GPU is currently not supported on CentOS 6.5 due to its older glibc). You can install other packages, such as Pandas or PyTorch, in the same way.
> conda install tensorflow keras
- Test the installation by importing the package from Python.
> python -c "import tensorflow; print(tensorflow.__version__)"
- Deactivate the created environment on the master node. Remember, the master node is only for distribution. Do not use the master node for computation.
> conda deactivate
- Log in to one of the compute nodes from the master node (remember, everything installed on the master node is available on the compute nodes as well). You can use commands like free (to check memory usage) or top (to list running processes) to see which node is free enough for your application. Here, I am using Compute Node 1.
> ssh be1005815@10.10.1.2
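The free and top checks mentioned above can be boiled down to two one-liners. Run them on a candidate node before settling on it (the awk expression below simply picks the "free" column out of free -m):

```shell
# Load averages for the last 1, 5, and 15 minutes -- low numbers mean an idle node.
uptime
# Free memory in MB (the 4th column of the "Mem:" row).
free -m | awk '/^Mem:/ {print $4 " MB free"}'
```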
- Activate the environment on the compute node.
> conda activate tensorflow
Step 3: Start Jupyter Notebook and Set Up an SSH Tunnel
Jupyter Notebook is installed by default along with Anaconda. You can start Jupyter Notebook on one of the compute nodes and set up an SSH tunnel on the master node to reach the compute node from outside.
- On the same compute node, create a Jupyter Notebook configuration file.
> jupyter notebook --generate-config
- Open the configuration in your favourite command line editor and modify two lines to allow remote access from any IP.
> nano .jupyter/jupyter_notebook_config.py
c.NotebookApp.allow_origin = '*' #allow all origins
c.NotebookApp.ip = '0.0.0.0' # listen on all IPs
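If you prefer not to edit the file by hand, the same two changes can be applied non-interactively with sed. This is just a sketch, assuming the config file sits at the default path generated in the previous step:

```shell
# Uncomment-and-set both options in one go.
CFG="$HOME/.jupyter/jupyter_notebook_config.py"
if [ -f "$CFG" ]; then
    sed -i "s|^#* *c\.NotebookApp\.allow_origin.*|c.NotebookApp.allow_origin = '*'|" "$CFG"
    sed -i "s|^#* *c\.NotebookApp\.ip.*|c.NotebookApp.ip = '0.0.0.0'|" "$CFG"
else
    echo "Config not found; run 'jupyter notebook --generate-config' first."
fi
```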
- Start Jupyter Notebook. Once it starts, copy the token printed on the command line.
> jupyter notebook
- Don't close the existing SSH session. Open a new terminal and log on to the master node again.
> ssh be1005815@172.16.23.1
- Create an SSH tunnel on the master node. Here I am mapping port 8888 of Compute Node 1 to port 8000 of the master node (If this port is not available on the master node, then try some other port).
> ssh -g -L 8000:localhost:8888 -f -N be1005815@10.10.1.2
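If the tunnel refuses to start, the port is probably taken. Here is a quick pre-check I am adding (not part of the original steps) to see whether a local port is free on the master node; ss ships with iproute2, and on very old systems netstat -tln does the same job:

```shell
# Check whether the chosen local port is already in use.
PORT=8000
if ss -tln 2>/dev/null | grep -q ":$PORT "; then
    echo "Port $PORT is in use; try another one."
else
    echo "Port $PORT looks free."
fi
```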
- Open a browser and go to http://172.16.23.1:8000/. Enter the copied token and you will be able to access the notebook.
- Done
Words of Wisdom
- Do not consume excessive resources (such as ports) on the master node. Once done, you can kill the SSH tunnel by running the following command on the master node (note that this kills all of your ssh client processes there, including any sessions you opened from the master node to compute nodes).
> killall ssh
- You can use process managers like PM2 or nohup to keep a Jupyter Notebook session running on a compute node after you log out. Don't forget to stop such processes once you are done.
- TensorFlow GPU requires glibc >= 2.14, while the GPU compute node has glibc 2.12 (you can check the installed version with ldd --version). That is why TensorFlow cannot utilize the GPU as of now. If someone finds an alternative way of using the GPU with TensorFlow, please let me know, and I will be more than willing to link to your article.
- Respect computing power. It is shared by everyone in the college. Do not waste the resource by running unnecessary computations.
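The nohup approach mentioned above can be sketched like this (the log and PID file names are my own choices, not part of the original setup):

```shell
# Start the notebook so it survives logout, and remember its PID.
nohup jupyter notebook > jupyter.log 2>&1 &
echo $! > jupyter.pid
# When you are done, stop it and clean up:
# kill "$(cat jupyter.pid)" && rm jupyter.pid jupyter.log
```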
Conclusion
So, that is all you need to set up an ML environment on the HPC. Note that you are still using a single compute node. Sixteen cores should be enough for a dataset of less than 1 GB. I will be posting another article in which we will set up Big Data processing on the HPC.