RDFMG Guide

Cluster Information

The RDFMG cluster is currently a 15-node computing cluster. The head node is equipped with 2 AMD Opteron 6320 processors (16 cores), 8 Seagate 4TB 7200RPM SAS 12Gb/s HDDs, and 32GB of memory. The 14 compute nodes are each equipped with 4 AMD Opteron 6320 processors (64 cores with hyperthreading), an HGST 3.5'' 6TB SAS 6Gb/s HDD, 16x 8GB Kingston 1600MHz DDR3 (128GB of memory), and a 40Gb/s QDR InfiniBand interconnect.

 

1. How to establish a connection to the cluster?

You will be notified by the administrator about your account information. Usually your username is your NCSU Unity ID (the part of your email address before the @) and the initial password is the same as your username.

On a Windows system, you need a terminal emulator to connect to the cluster. Highly recommended options are PuTTY (free) and MobaXterm (free limited edition). MobaXterm comes with useful network tools (such as SSH, X11, RDP, and FTP) and usually requires no further configuration.

A VPN connection is necessary if you access the cluster from off-campus. More information about the client software can be found here.

Once you use SSH to remotely connect to the RDFMG cluster (rdfmg.ne.ncsu.edu), please change your password immediately using the passwd command.
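
For example, a first login from a terminal might look like this (a minimal sketch; unityid is a placeholder for your actual Unity ID, and the passwd prompts are illustrative):

$ ssh unityid@rdfmg.ne.ncsu.edu
$ passwd
Changing password for user unityid.
(current) UNIX password:
New password:
Retype new password: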

 

2. What is the structure of RDFMG cluster?

The RDFMG cluster consists of 1 head node and 14 compute nodes (node001-014), each with 64 processors. You may log in to the cluster, submit jobs, and transfer/move data through the head node; however, do not run programs on the head node itself. For more detailed information please refer to this guide.
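
For example, files can be copied to and from the cluster through the head node with scp, run from your local machine (a sketch; unityid and the file names are placeholders):

$ scp input.dat unityid@rdfmg.ne.ncsu.edu:~/
$ scp unityid@rdfmg.ne.ncsu.edu:~/results.out .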

 

3. What are environment modules? How to load modules?

Typically, users initialize their environment when they log in by setting environment information for every application they will reference during the session. The Environment Modules package is a tool that simplifies shell initialization and lets users easily modify their environment during the session with modulefiles. Modules can be loaded and unloaded dynamically and atomically.

Here is an example of loading a module on the RDFMG cluster:

$ module load gcc/5.2.0
$ which gcc
/cm/local/apps/gcc/5.2.0/bin/gcc

 

To unload a module:

$ module unload gcc/5.2.0
$ which gcc
gcc not found

Here are some module commands you may find useful:

module avail - List the available modules. Note that if there are multiple versions of a single package, one of them will be denoted as (default). If you load the module without a version number, you will get this default version (see the example after this list).
module whatis - List all the available modules along with a short description.
module load MODULE - Load the named module.
module unload MODULE - Unload the named module, reverting back to the OS defaults.
module list - List all the currently loaded modules.
module help - Get general help information about modules.
module help MODULE - Get help information about the named module.
module show MODULE - Show details about the module, including the changes that loading the module will make to your environment.
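
For instance, loading gcc without a version number picks up the default version (a sketch; the version shown is illustrative):

$ module load gcc
$ module list
Currently Loaded Modulefiles:
  1) gcc/5.2.0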

 

4. Where are all the codes that we are ready to play with?

All the computer codes are installed under /cm/shared/apps/ncsu and /cm/shared/codes; which of them are accessible to you depends on the permissions granted to you.
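
You can browse these directories to see what is installed, for example (the listing you see will depend on your permissions):

$ ls /cm/shared/apps/ncsu
$ ls /cm/shared/codes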

5. How to submit, monitor and delete my jobs?

The RDFMG cluster uses the TORQUE resource manager (based on OpenPBS) and the Moab Workload Manager to manage and schedule jobs. TORQUE is a resource management system for submitting and controlling jobs on supercomputers, clusters, and grids. TORQUE manages jobs that users submit to various queues on a computer system, each queue representing a group of resources with attributes necessary for that queue’s jobs.

Commonly used TORQUE commands include:

qsub - Submit a job.
qstat - Monitor the status of a job.
qdel - Terminate a job prior to its completion.


TORQUE includes numerous directives, which are used to specify resource requirements and other attributes for batch and interactive jobs. TORQUE directives can appear as header lines (lines that start with #PBS) in a batch job script or as command-line options to the qsub command.
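
For example, resource requests can be given on the command line instead of in the script, and an interactive session can be requested with the -I option (a sketch; the resource values are illustrative):

$ qsub -l nodes=1:ppn=4,walltime=1:00:00 job.script
$ qsub -I -l nodes=1:ppn=1,walltime=30:00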

Job scripts

To run a job in batch mode on a high-performance computing system using TORQUE, first prepare a job script that specifies the application you want to run and the resources required to run it, and then submit the script to TORQUE using the qsub command. TORQUE passes your job and its requirements to the system’s job scheduler, which then dispatches your job whenever the required resources are available. For example:

A TORQUE job script for an MPI job might look like this:

 #!/bin/bash
 #PBS -k o
 #PBS -l nodes=2:ppn=6,walltime=30:00
 #PBS -M jthutt@tatooine.net
 #PBS -m abe
 #PBS -N JobName
 #PBS -j oe

 mpiexec -np 12 -machinefile $PBS_NODEFILE ~/bin/binaryname

In the above example, the first line indicates the script should be read using the bash command interpreter. Then, several header lines of TORQUE directives are included:

#PBS -k o - Keeps the job output
#PBS -l nodes=2:ppn=6,walltime=30:00 - Indicates the job requires two nodes, six processors per node, and 30 minutes of wall-clock time
#PBS -M jthutt@tatooine.net - Sends job-related email to jthutt@tatooine.net (replace with your own address)
#PBS -m abe - Sends email if the job is (a) aborted, when it (b) begins, and when it (e) ends
#PBS -N JobName - Names the job JobName
#PBS -j oe - Joins standard output and standard error
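
The final mpiexec line then launches the program on the 12 allocated processors, using the node list that TORQUE provides in $PBS_NODEFILE. With -k o and -j oe, the combined output is kept in your home directory in a file named after the job; for example (mpi_job.script and the job ID are illustrative):

$ qsub mpi_job.script
5918.rdfmg.cm.cluster
$ ls ~
JobName.o5918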

 

A TORQUE job script for a serial job might look like this:

 #!/bin/bash
 #PBS -k o
 #PBS -l nodes=1:ppn=1,walltime=30:00
 #PBS -M jthutt@tatooine.net
 #PBS -m abe
 #PBS -N JobName
 #PBS -j oe

 ./a.out

Apart from the final line, which now runs a serial executable directly, the only difference from the previous example is that this job requires one node, one processor per node, and 30 minutes of wall-clock time:

#PBS -l nodes=1:ppn=1,walltime=30:00


Submitting jobs

To submit your job script (e.g., job.script), use the TORQUE qsub command. If the command runs successfully, it will return a job ID to standard output, for example:

$ qsub job.script
5917.rdfmg.cm.cluster
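
The returned job ID identifies your job to the other TORQUE commands, for example (the output shown is illustrative):

$ qstat 5917
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
5917.rdfmg                 JobName          unityid         00:00:05 R batch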


For more, see the qsub manual page (enter man qsub).


Monitoring jobs

To monitor the status of a queued or running job, use the qstat command.

Useful qstat options include:

-u user_list - Displays jobs for users listed in user_list
-a - Displays all jobs
-r - Displays running jobs
-f - Displays the full listing of jobs (returns excessive detail)
-n - Displays nodes allocated to jobs
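
For example, to list your own jobs along with the nodes allocated to them (unityid is a placeholder):

$ qstat -u unityid -n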


For more, see the qstat manual page (enter man qstat).

Deleting jobs

To delete queued or running jobs, use the qdel command:

To delete a specific job (jobid), enter:

$ qdel jobid

To delete all of your jobs, enter:

$ qdel all

 

For more, see the qdel manual page (enter man qdel).