
Elastic GPU Service: Deploy an NGC environment on a GPU-accelerated instance

Last Updated: May 07, 2024

NVIDIA GPU Cloud (NGC) is a deep learning ecosystem that is developed by NVIDIA. NGC provides free access to deep learning software stacks that you can use to build development environments for deep learning. This topic describes how to deploy an NGC environment on a GPU-accelerated instance. In this example, the TensorFlow deep learning framework is used.

Background information

  • To help you use the NGC deep learning ecosystem, Alibaba Cloud provides NGC container images that are optimized for NVIDIA Pascal GPUs in Alibaba Cloud Marketplace. You can use these images to quickly deploy NGC container environments, instantly access optimized deep learning frameworks, develop and deploy services, and pre-install development environments in an efficient manner. The NGC container images are continuously updated and provide optimized algorithm frameworks.

  • The NGC website provides various image versions for mainstream deep learning frameworks, such as Caffe, Caffe2, Microsoft Cognitive Toolkit (CNTK), MXNet, TensorFlow, Theano, and Torch. You can select an image based on your business requirements to deploy an environment.

Procedure

You can deploy an NGC environment on an instance that belongs to one of the following instance families:

  • gn5i, gn6v, gn6i, gn6e, gn7i, gn7e, and gn7s

  • ebmgn6i, ebmgn6v, ebmgn6e, ebmgn7i, and ebmgn7e

Note

Before you deploy an NGC environment on an instance, make sure that an NGC account is created on the NGC website.

This section describes how to create a GPU-accelerated instance and deploy an NGC environment on the instance. In this example, a gn6i instance is created.

  1. Create a gn6i instance.

    For more information about how to create an instance, see Create an instance on the Custom Launch tab. The following list describes the key parameters.

    Parameter: Region
    Description: Select a region. Valid values: China (Qingdao), China (Beijing), China (Hohhot), China (Hangzhou), China (Shanghai), China (Shenzhen), China (Guangzhou), China (Heyuan), China (Chengdu), China (Hong Kong), Singapore, US (Silicon Valley), US (Virginia), Germany (Frankfurt), Japan (Tokyo), and Malaysia (Kuala Lumpur).

    Parameter: Instance
    Description: Select an instance type that belongs to the gn6i instance family.

    Parameter: Image
    Description: Perform the following steps:
    1. On the Marketplace Images tab, click Select Image from Alibaba Cloud Marketplace (with Operating System).
    2. In the Alibaba Cloud Marketplace dialog box, enter NVIDIA GPU Cloud Virtual Machine Image in the search box and click Search.
    3. Find the image that you want to use and click Select.

    Parameter: Public IP Address
    Description: Select Assign Public IPv4 Address.
    Note: If you do not select Assign Public IPv4 Address, you can associate an elastic IP address (EIP) with the instance after the instance is created. For more information, see Associate one or more EIPs with an instance.

    Parameter: Security Group
    Description: Select a security group. You must enable TCP port 22 for the security group. If your instance must support HTTPS or Deep Learning GPU Training System (DIGITS) 6, you must also enable TCP port 443 for HTTPS or TCP port 5000 for DIGITS 6.
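
    If you prefer to manage the security group rules from the command line, the following sketch shows one way to open the required ports by using the Alibaba Cloud CLI (aliyun), assuming that the CLI is installed and configured. The region ID and security group ID are placeholders that you must replace with your own values.

    # Open TCP ports 22, 443, and 5000 in the security group (sample region and security group IDs).
    aliyun ecs AuthorizeSecurityGroup --RegionId cn-hangzhou --SecurityGroupId sg-bp1example --IpProtocol tcp --PortRange 22/22 --SourceCidrIp 0.0.0.0/0
    aliyun ecs AuthorizeSecurityGroup --RegionId cn-hangzhou --SecurityGroupId sg-bp1example --IpProtocol tcp --PortRange 443/443 --SourceCidrIp 0.0.0.0/0
    aliyun ecs AuthorizeSecurityGroup --RegionId cn-hangzhou --SecurityGroupId sg-bp1example --IpProtocol tcp --PortRange 5000/5000 --SourceCidrIp 0.0.0.0/0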

  2. Use one of the following methods to connect to the instance.

    Method: Workbench
    References: Connect to a Linux instance by using a password or key

    Method: Virtual Network Computing (VNC)
    References: Connect to an instance by using VNC
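
    If you use an SSH client instead, a minimal connection command looks like the following. The public IP address is a placeholder.

    # Replace <public-IP-address> with the public IP address or EIP of the instance.
    ssh root@<public-IP-address>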

  3. Run the nvidia-smi command.

    The command returns information about the GPUs of the instance, such as the GPU model and the driver version.

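    If you only want the GPU model and driver version, you can also query them directly. This uses standard nvidia-smi query options.

    # Print the GPU name, driver version, and total GPU memory in CSV format.
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv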

  4. Obtain the path of the TensorFlow image.

    1. Log on to the NGC website. In the left-side navigation pane, choose CATALOG > Containers.

    2. On the Containers page, enter TensorFlow in the search box. Find the TensorFlow card and click TensorFlow.


    3. On the TensorFlow page, click the Tags tab. On this tab, find the TensorFlow image version that you want to use and copy the image path.

      In this example, the TensorFlow image whose version is 20.01-tf1-py3 is downloaded. The nvcr.io/nvidia/tensorflow:20.01-tf1-py3 image path is copied.

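      If the image that you selected requires authentication, you may need to log on to the nvcr.io registry before you pull the image. A minimal sketch, assuming that you generated an API key in your NGC account:

      # Log on to the NGC registry. When prompted, enter $oauthtoken as the username
      # and your NGC API key as the password.
      docker login nvcr.io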

  5. After you log on to the GPU-accelerated instance, run the following command to download the TensorFlow image of the desired version:

    docker pull nvcr.io/nvidia/tensorflow:20.01-tf1-py3
    Important

    The download may take a long time to complete.
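
    If your SSH session might disconnect during a long download, one option is to run the pull in the background and monitor its log. The log path below is an example.

    # Run the pull in the background and write its output to a log file.
    nohup docker pull nvcr.io/nvidia/tensorflow:20.01-tf1-py3 > /root/pull.log 2>&1 &
    # Monitor the download progress.
    tail -f /root/pull.log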

  6. After the TensorFlow image is downloaded, run the following command to check the TensorFlow image:

    docker image ls

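    You can also restrict the output to the TensorFlow repository by using the standard repository filter of docker image ls.

    # List only images from the nvcr.io/nvidia/tensorflow repository.
    # The output contains the REPOSITORY, TAG, IMAGE ID, CREATED, and SIZE columns,
    # and the 20.01-tf1-py3 tag should appear in the list.
    docker image ls nvcr.io/nvidia/tensorflow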

  7. Run the following command to start the container and deploy the TensorFlow development environment:

    docker run --gpus all --rm -it nvcr.io/nvidia/tensorflow:20.01-tf1-py3

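    Data that you write inside the container is removed together with the container because the --rm option is specified. If you want your working files to persist on the instance, you can mount a host directory into the container. The host path and port mapping below are examples.

    # Mount /root/workspace on the instance into the container and expose port 8888 (for example, for Jupyter).
    docker run --gpus all --rm -it -v /root/workspace:/workspace -p 8888:8888 nvcr.io/nvidia/tensorflow:20.01-tf1-py3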

  8. Run the following commands in sequence to run a simple test on TensorFlow. The first command starts the Python interpreter. Enter the remaining lines at the Python prompt.

    python
    import tensorflow as tf
    hello = tf.constant('Hello, TensorFlow!')
    with tf.compat.v1.Session() as sess:
        result = sess.run(hello)
        print(result.decode())

    If TensorFlow loads the GPU device as expected, Hello, TensorFlow! is returned.

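    To additionally confirm that TensorFlow detects the GPU, you can run a one-line check inside the container. tf.test.is_gpu_available() is part of the TensorFlow 1.x API that these images ship.

    # Prints True if TensorFlow can access the GPU.
    python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"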

  9. Save the modified TensorFlow image.

    1. Run the following command to query the ID (CONTAINER_ID) of the running container:

      docker ps

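      If multiple containers are running, you can narrow the output to containers that were started from the TensorFlow image. The --filter and --format options are standard docker ps options.

      # List only containers started from the TensorFlow image and show their ID, image, and status.
      docker ps --filter "ancestor=nvcr.io/nvidia/tensorflow:20.01-tf1-py3" --format "table {{.ID}}\t{{.Image}}\t{{.Status}}"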

    2. Run the following command to save the modified TensorFlow image:

      # Replace CONTAINER_ID with the container ID that you queried by running the docker ps command. Example: 619f7b715da5.
      docker commit -m "commit docker" CONTAINER_ID nvcr.io/nvidia/tensorflow:20.01-tf1-py3
      Important

      Make sure that you commit the modified TensorFlow image. Otherwise, the changes that you made inside the container are lost after the container exits, because the container was started with the --rm option.
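
      Alternatively, you can commit the container to a separate tag so that the original NGC image stays unchanged. The tag below is an example.

      # Commit the container to a custom tag (example tag).
      docker commit -m "commit docker" CONTAINER_ID nvcr.io/nvidia/tensorflow:20.01-tf1-py3-custom
      # Confirm that the committed image exists.
      docker image ls nvcr.io/nvidia/tensorflow
      # Start a new container from the committed image.
      docker run --gpus all --rm -it nvcr.io/nvidia/tensorflow:20.01-tf1-py3-custom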