RDFMG Guide

Cluster Information

The RDFMG cluster is a computing cluster with 58 compute nodes organized around one head node. It provides powerful computational resources to the Nuclear Engineering group at NCSU. The cluster currently provides users with 3712 computing cores.

The head node is equipped with 2 AMD Opteron 6320 processors (64 cores), eight Seagate 4TB 7200RPM SAS 12Gb/s HDDs and 32GB of memory.

Of the 58 compute nodes, 14 are equipped with quad AMD Opteron 6320 CPUs (64 cores per node) and 44 are equipped with dual AMD EPYC (Rome) 7452 CPUs (64 cores per node).
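The total core count quoted above follows directly from the node counts. A minimal sketch of the arithmetic, using only the figures given in this section (the variable names are illustrative):

```shell
#!/bin/bash
# Figures taken from the cluster description above.
OPTERON_NODES=14     # nodes with AMD Opteron 6320 CPUs
EPYC_NODES=44        # nodes with AMD EPYC 7452 CPUs
CORES_PER_NODE=64    # both node types provide 64 cores

TOTAL_NODES=$(( OPTERON_NODES + EPYC_NODES ))
TOTAL_CORES=$(( TOTAL_NODES * CORES_PER_NODE ))

echo "$TOTAL_NODES nodes, $TOTAL_CORES cores"   # prints: 58 nodes, 3712 cores
```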

 

Server administrators:

Pascal Rouxelin:     pnrouxel@ncsu.edu
Mario Milev:           mlmilev@ncsu.edu
Jason Hou:             jason.hou@ncsu.edu

 

1. How to establish connections to the cluster?

First-time users (Windows):

1- You need an SSH client such as PuTTY or MobaXterm (both can be downloaded free of charge). The latter is recommended.

https://mobaxterm.mobatek.net/download.html

2- You need to install a file transfer (SFTP/SCP) client, typically WinSCP (open source).

https://winscp.net/eng/download.php

3- In MobaXterm, click on the “Session” tab at the top left. In the window that pops up, click on “SSH”. You will be prompted to enter the host name (rdfmg.ne.ncsu.edu) and your username (<NCSU_ID>). For example, if your name is Christopher Waddle and your NCSU ID is “cwaddl”:

Host name: rdfmg.ne.ncsu.edu
Username: cwaddl

Your temporary password is your NCSU ID (in the example, cwaddl).

4- Change your password as soon as you sign in: type the command passwd in the terminal.

5- (optional) If you are off-campus or use the ncsu_guest Wi-Fi, you will need a VPN to access the cluster. Connect to the VPN before entering your credentials in MobaXterm (or PuTTY) and WinSCP.

https://oit.ncsu.edu/campus-it/campus-data-network/vpn/
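If you work from a terminal with an OpenSSH client (Linux, macOS, or a recent Windows), the same connection can be made without MobaXterm or WinSCP. The sketch below assumes the standard ssh and scp commands are installed and reuses the example ID from step 3. The helper names rdfmg_login and rdfmg_upload are hypothetical and only defined here for illustration.

```shell
#!/bin/bash
# Example values from the steps above; replace cwaddl with your own NCSU ID.
CLUSTER_HOST="rdfmg.ne.ncsu.edu"
NCSU_ID="cwaddl"
LOGIN="${NCSU_ID}@${CLUSTER_HOST}"

# Open a terminal session on the head node (hypothetical helper name).
rdfmg_login() {
    ssh "$LOGIN"
}

# Copy a local file to your home directory on the cluster (hypothetical helper name).
rdfmg_upload() {
    scp "$1" "${LOGIN}:"
}

echo "Login target: $LOGIN"
```

After sourcing this file, rdfmg_login opens the session and rdfmg_upload myInput.inp copies an input file to your home directory. As in step 5, connect to the VPN first when you are off-campus.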

Current users:

Contact the server administrators to retrieve your credentials.

 

2. What is the structure of RDFMG cluster?

The RDFMG cluster consists of one head node called rdfmg and 58 compute nodes called node001 through node058. After logging in, you will be in your home directory on the rdfmg node. Simulations are dispatched from the rdfmg node to the compute nodes via bash scripts (see section 5). The workload manager currently used on the server is SLURM, a job scheduling system widely used on Linux servers. It fulfills three essential functions:

  • Allocates resources on the compute nodes based on users’ requests. Resources include the number of nodes, the number of CPUs, the simulation time, etc.
  • Handles the starting and execution of a “job” on the nodes assigned.
  • Distributes available resources between users based on a queue system.

Basic Linux commands and short programs or scripts (C, Python, Java, etc.) can be run on the head node. Do not run computationally intensive simulations on the head node.

One of the compute nodes (node014) is dedicated to so-called “interactive sessions” (see section 5.3). Interactive sessions are used for code debugging: they let you log in to node014 and run simulation codes without SLURM scripts, which eases debugging and avoids the queue system that the workload manager applies on the other nodes.

 

3. What are environment modules? How to load modules?

Typically, users initialize their environment when they log in by setting environment information for every application they will reference during the session. The Environment Modules package is a tool that simplifies shell initialization and lets users easily modify their environment during the session with modulefiles. Modules can be loaded or unloaded dynamically and atomically.

Here is an example of loading a module on RDFMG cluster:

$ module load gcc/5.2.0

$ which gcc

/cm/local/apps/gcc/5.2.0/bin/gcc

To unload a module:

$ module unload gcc/5.2.0
$ which gcc
gcc not found

Here are some module commands you may find useful:

module avail          List the available modules. If a package has multiple versions, one of them is marked (default); loading the module without a version number gives you this default version.
module whatis         List all the available modules along with a short description.
module load MODULE    Load the named module.
module unload MODULE  Unload the named module, reverting to the OS defaults.
module list           List all the currently loaded modules.
module help           Get general help information about modules.
module help MODULE    Get help information about the named module.
module show MODULE    Show details about the module, including the changes that loading it will make to your environment.
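In scripts, it can help to check that the module command exists before calling it, since it is usually defined only after the shell has been initialized for Environment Modules. The following is a minimal sketch of such a check. The function name load_module_checked is hypothetical, and gcc/5.2.0 is the module from the example above.

```shell
#!/bin/bash
# Load a module only if the "module" command is available in this shell.
# load_module_checked is a hypothetical helper, not part of Environment Modules.
load_module_checked() {
    local name="$1"
    if type module >/dev/null 2>&1; then
        module load "$name"
    else
        echo "module command not available; cannot load $name" >&2
        return 1
    fi
}

# Usage on the cluster:
#   load_module_checked gcc/5.2.0 && which gcc
```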

 

4. Where are all the codes that we are ready to play with?

All the computer codes are installed in two directories:

/cm/shared/apps/ncsu

/cm/shared/codes

Restricted access is enforced. Some codes, such as RAVEN, OpenFOAM or DAKOTA, are open source: contact the server administrators to be granted access. For most codes, however, Export Control regulations restrict access, and a license agreement is required to obtain executable or source privileges. The license agreement process is established on a code-by-code basis.

 

5. How to submit, monitor and delete my jobs?

The RDFMG cluster uses the SLURM Workload Manager to manage and schedule jobs (see section 2). Submitting, monitoring and cancelling jobs are handled by three terminal commands:

sbatch <SlurmScript> Submits a job.

Example: sbatch  run.sh

squeue Displays the status of all jobs of all users. To display the jobs of one user:

squeue  -u cwaddl

scancel <JobID> Terminates a job prior to its completion. <JobID> is the job ID number displayed by squeue.

Example: scancel 215896
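The monitoring step can also be scripted. The sketch below defines a hypothetical helper, wait_for_job, that polls squeue until a job no longer appears in the queue. It relies on the squeue options -h (no header line) and -j (select a job by ID), and the default polling interval of 30 seconds is an arbitrary choice.

```shell
#!/bin/bash
# Block until the given job is no longer listed by squeue.
# wait_for_job is a hypothetical helper; "squeue -h -j <JobID>" prints the
# job's line (without header) while the job is pending or running.
wait_for_job() {
    local job_id="$1"
    local interval="${2:-30}"   # seconds between two checks
    while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "Job $job_id is no longer in the queue"
}

# Usage on the cluster, with the job ID from the example above:
#   wait_for_job 215896
```

Note that the helper only reports that the job has left the queue; it does not tell whether the job succeeded, failed or was cancelled.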

SLURM includes numerous directives, which are used to specify resource requirements. SLURM directives appear as header lines (lines that start with #SBATCH) in a script (i.e. a text file).

5.1 Job scripts

To run a job in batch mode on a high-performance computing system using SLURM, the resource requirements are written in a SLURM script.

Example of a SLURM job script (saved for example in a text file called run.sh):

#!/bin/bash

#SBATCH -J "jobname"       # Name of your job, optional

#SBATCH -N 1               # Number of nodes requested

#SBATCH -n 32              # Number of CPUs (tasks). Max: N × 64

#SBATCH -t 24:00:00        # Maximum simulation time (hh:mm:ss)

#SBATCH -p defq            # Queue (partition) name. Must be gradq or all

/cm/shared/codes/dummyCode/exec myInput.inp
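The "Max: N × 64" limit in the example above can be checked before a script is submitted. The sketch below is one possible way to do it: a hypothetical function, make_job_script, prints a job script for a given number of nodes, number of tasks and queue name, and refuses requests that exceed 64 tasks per node. The function name, the output file name run_generated.sh and the choice of gradq in the usage line are illustrative assumptions.

```shell
#!/bin/bash
# Print a SLURM job script after checking the task count against the
# "N x 64" limit. make_job_script is a hypothetical helper.

CORES_PER_NODE=64   # from the cluster description: 64 cores per compute node

make_job_script() {
    local nodes="$1" tasks="$2" partition="$3"
    local max_tasks=$(( nodes * CORES_PER_NODE ))

    if [ "$tasks" -gt "$max_tasks" ]; then
        echo "error: $tasks tasks requested, but $nodes node(s) provide at most $max_tasks" >&2
        return 1
    fi

    # The directives mirror the example job script above.
    cat <<EOF
#!/bin/bash
#SBATCH -J "jobname"
#SBATCH -N $nodes
#SBATCH -n $tasks
#SBATCH -t 24:00:00
#SBATCH -p $partition

/cm/shared/codes/dummyCode/exec myInput.inp
EOF
}

# Example: 1 node, 32 tasks, queue "gradq". Inspect the file, then submit it.
make_job_script 1 32 gradq > run_generated.sh
```

The generated file can then be submitted with sbatch run_generated.sh. A request such as make_job_script 1 65 gradq prints an error and returns a non-zero status instead of producing a script.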


5.2 Submitting jobs

To submit your job, use the SLURM sbatch command. If the command runs successfully, it will return a job ID to standard output, for example:

$ sbatch run.sh

Submitted batch job 5917
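When a script needs the job ID for later squeue or scancel calls, it can be extracted from the sbatch output. The sketch below assumes the standard SLURM confirmation line of the form "Submitted batch job <JobID>" and uses a canned message so that it runs without a cluster. The helper name job_id_from_message is hypothetical.

```shell
#!/bin/bash
# Extract the numeric job ID from sbatch's confirmation line.
# Assumes the standard message "Submitted batch job <JobID>".
job_id_from_message() {
    # The job ID is the last whitespace-separated field of the line.
    awk '{ print $NF }'
}

# Example with a canned message (no cluster needed):
JOB_ID=$(echo "Submitted batch job 5917" | job_id_from_message)
echo "$JOB_ID"   # prints: 5917

# On the cluster, the equivalent would be:
#   JOB_ID=$(sbatch run.sh | job_id_from_message)
#   squeue -j "$JOB_ID"
```

As an alternative, sbatch accepts the --parsable option, which prints the job ID without the surrounding text.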

5.3 Interactive sessions

To request an interactive session, load the workbench module and call the command isqp:

$ module load workbench

$ isqp

The user will automatically be logged onto node014 with 4 CPUs. The execution of a code will not require the sbatch command or a SLURM script. Executing the code presented in section 5.1 would simply be:

/cm/shared/codes/dummyCode/exec  myInput.inp