In the last article, I discussed the architecture of the HPC. If you have not read that article, I would recommend that you read it before proceeding with this one.
The power of the HPC can be put to its most important application in computer science: Machine Learning. I assume you have already obtained your SSH credentials to log on to the master node. We will be setting up the environment in Python. Let's jump straight to the steps.
How To?
Step 1: Download and Install Anaconda on the Master Node
Note that you are not the root user of the HPC. You are just a regular user, and therefore, administrative commands (such as sudo or su) will not work. Anaconda has made it much easier to install Python packages for non-root users, and we will be using Anaconda for setting up Python 3 and installing the required packages.
- Log in to the master node
> ssh be1005815@172.16.23.1
- Go to https://www.anaconda.com/distribution/ and copy the link for the 64-bit Linux Python 3 distribution. Download the package on the master node using wget. The --no-check-certificate flag is needed because the OS is too old to verify newer TLS certificates.
> wget https://repo.anaconda.com/archive/Anaconda3-2018.12-Linux-x86_64.sh --no-check-certificate
- Change the permission of the downloaded file.
> chmod u+x Anaconda3-2018.12-Linux-x86_64.sh
- Install Anaconda.
> ./Anaconda3-2018.12-Linux-x86_64.sh
- Follow the steps to install Anaconda on the master node. Anaconda will be installed in /home/<username>/anaconda3/.
- Log out and log in again to the master node.
Step 2: Create an Environment and Install Packages
In this step, we will create a Python environment on the master node and then activate it on one of the compute nodes.
- Create a new Python environment using conda. I want to set up a TensorFlow environment, so I will name it 'tensorflow'.
> conda create -n tensorflow python=3.6
- Activate the environment.
> conda activate tensorflow
- Install the required packages. Here, I will be installing TensorFlow CPU and Keras (TensorFlow GPU is currently not supported on CentOS 6.5 due to its older glibc). You can install other packages, such as Pandas or PyTorch, in the same way.
> conda install tensorflow keras
- Test the installation by importing the package from Python.
> python -c "import tensorflow; print(tensorflow.__version__)"
- Deactivate the created environment on the master node. Remember, the master node is only for distribution. Do not use the master node for computation.
> conda deactivate
- Log in to one of the compute nodes from the master node (remember, everything installed on the master node is available on the compute nodes as well). You can use commands like free (to check memory usage) or top (to list running processes) to see which node is free enough for your application. Here, I am using Compute Node 1.
> ssh be1005815@10.10.1.2
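The free and top checks mentioned above can be boiled down to two one-liners. Run them on a candidate node before settling on it (the awk expression below simply picks the "free" column out of free -m):

```shell
# Load averages for the last 1, 5, and 15 minutes -- low numbers mean an idle node.
uptime
# Free memory in MB (the 4th column of the "Mem:" row).
free -m | awk '/^Mem:/ {print $4 " MB free"}'
```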
- Activate the environment on the compute node.
> conda activate tensorflow
Step 3: Start Jupyter Notebook and Set Up an SSH Tunnel
Jupyter Notebook is installed by default along with Anaconda. You can start Jupyter Notebook on one of the compute nodes and set up an SSH tunnel on the master node to reach the compute node from outside.
- On the same compute node, create a Jupyter Notebook configuration file.
> jupyter notebook --generate-config
- Open the configuration in your favourite command line editor and modify two lines to allow remote access from any IP.
> nano .jupyter/jupyter_notebook_config.py
c.NotebookApp.allow_origin = '*' #allow all origins
c.NotebookApp.ip = '0.0.0.0' # listen on all IPs
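If you prefer not to edit the file by hand, the same two changes can be applied non-interactively with sed. This is just a sketch, assuming the config file sits at the default path generated in the previous step:

```shell
# Uncomment-and-set both options in one go.
CFG="$HOME/.jupyter/jupyter_notebook_config.py"
if [ -f "$CFG" ]; then
    sed -i "s|^#* *c\.NotebookApp\.allow_origin.*|c.NotebookApp.allow_origin = '*'|" "$CFG"
    sed -i "s|^#* *c\.NotebookApp\.ip.*|c.NotebookApp.ip = '0.0.0.0'|" "$CFG"
else
    echo "Config not found; run 'jupyter notebook --generate-config' first."
fi
```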
- Start Jupyter Notebook. Once it starts, copy the token printed on the command line.
> jupyter notebook
- Don't close the existing SSH session. Open a new terminal and log on to the master node again.
> ssh be1005815@172.16.23.1
- Create an SSH tunnel on the master node. Here I am mapping port 8888 of Compute Node 1 to port 8000 of the master node (If this port is not available on the master node, then try some other port).
> ssh -g -L 8000:localhost:8888 -f -N be1005815@10.10.1.2
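If the tunnel refuses to start, the port is probably taken. Here is a quick pre-check I am adding (not part of the original steps) to see whether a local port is free on the master node; ss ships with iproute2, and on very old systems netstat -tln does the same job:

```shell
# Check whether the chosen local port is already in use.
PORT=8000
if ss -tln 2>/dev/null | grep -q ":$PORT "; then
    echo "Port $PORT is in use; try another one."
else
    echo "Port $PORT looks free."
fi
```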
- Open a browser and go to http://172.16.23.1:8000/. Enter the copied token and you will be able to access the notebook.
- Done
Words of Wisdom
- Do not consume excessive resources (such as ports) on the master node. Once done, you can kill the SSH tunnel by running the following command on the master node (note that this kills all of your ssh client processes there, including any sessions you opened from the master node to compute nodes).
> killall ssh
- You can use process managers like PM2 or nohup to keep a Jupyter Notebook session running on a compute node after you log out. Don't forget to stop such processes once you are done.
- TensorFlow GPU requires glibc >= 2.14, while the GPU compute node has glibc 2.12 (you can check the installed version with ldd --version). That is why TensorFlow cannot utilize the GPU as of now. If someone finds an alternative way of using the GPU with TensorFlow, please let me know, and I will be more than willing to link to your article.
- Respect computing power. It is shared by everyone in the college. Do not waste the resource by running unnecessary computations.
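The nohup approach mentioned above can be sketched like this (the log and PID file names are my own choices, not part of the original setup):

```shell
# Start the notebook so it survives logout, and remember its PID.
nohup jupyter notebook > jupyter.log 2>&1 &
echo $! > jupyter.pid
# When you are done, stop it and clean up:
# kill "$(cat jupyter.pid)" && rm jupyter.pid jupyter.log
```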
Conclusion
So, that is all you need to set up an ML environment on the HPC. Note that you are still using a single compute node. Sixteen cores should be enough for a dataset of less than 1 GB. I will be posting another article in which we will set up Big Data processing on the HPC.