The RDFMG cluster is a computing cluster with 58 compute nodes organized around one head node. It provides powerful computational resources to the Nuclear Engineering group at NCSU and currently offers users 3712 computing cores.
The head node is equipped with 2 AMD Opteron 6320 processors (64 cores), eight Seagate 4 TB 7200 RPM SAS 12 Gb/s HDDs, and 32 GB of memory.
14 of the 58 compute nodes are equipped with 4 × quad AMD Opteron 6320 processors (64 cores); the other 44 are equipped with dual AMD EPYC (Rome) 7452 CPUs (64 cores).
Contacts:
Pascal Rouxelin: email@example.com
Mario Milev: firstname.lastname@example.org
1. How to establish a connection to the cluster?
First-time users (Windows):
1- You need an SSH client such as PuTTY or MobaXterm (free software); the latter is recommended.
2- You need to install an SFTP client, typically WinSCP (open source); MobaXterm can also be used.
3- In MobaXterm, click the "Session" tab at the top left. In the window that pops up, click "SSH". You will be prompted for the host name (rdfmg.ne.ncsu.edu) and your username (<NCSU_ID>). For example, if your name is Christopher Waddle with the NCSU ID "cwaddl":
Host name: rdfmg.ne.ncsu.edu
Username: cwaddl
Your temporary password is your NCSU ID (in this example, cwaddl).
4- Change your password as soon as you sign in: type the command passwd in the terminal.
5- (optional) If you are off campus or using the ncsu_guest Wi-Fi, you will need a VPN to access the cluster. Connect to the VPN before entering your credentials in MobaXterm (or PuTTY) and WinSCP.
First-time users (Mac):
1- Open a terminal and type:
ssh <NCSU_ID>@rdfmg.ne.ncsu.edu
For example, if your name is Chris Waddle and your user ID is cwaddl:
ssh cwaddl@rdfmg.ne.ncsu.edu
2- For file transfer, you can use software such as FileZilla, or transfer files directly from your terminal with the sftp command (sftp <NCSU_ID>@rdfmg.ne.ncsu.edu).
Here are useful commands to perform the transfers:
List items that you have on the server: ls
List items on your local machine: lls
Find your location on the server: pwd
Find your location on your local machine: lpwd
Download a file from server to your local machine: get <filename_on_server>
Upload a file from local machine to server: put <filename_on_local_machine>
Change directory locally: lcd
Change directory remotely: cd
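The commands above are entered inside an sftp session. As a sketch, a download can also be scripted non-interactively with a here-document; cwaddl and results.h5 are illustrative placeholders, not actual accounts or files:

```shell
# Hypothetical example: connect with your own NCSU ID and download a file.
# pwd prints your location on the server, lcd moves the local side into
# Downloads, and get copies the file from the server to that directory.
sftp cwaddl@rdfmg.ne.ncsu.edu <<'EOF'
pwd
lcd Downloads
get results.h5
EOF
```

The here-document feeds the three sftp commands in order and closes the session when they finish, which is convenient for repeating the same transfer.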
3- Change your password as soon as you sign in: type the command passwd in the terminal.
4- (optional) If you are off campus or using the ncsu_guest Wi-Fi, you will need a VPN to access the cluster.
Contact the server administrators to retrieve your credentials.
2. What is the structure of the RDFMG cluster?
The RDFMG cluster consists of one head node, called rdfmg, and 58 compute nodes, called node001 to node058. After logging in, you land in your home directory on the rdfmg node. Simulations are dispatched from the rdfmg node to the compute nodes via Slurm scripts (see section 5). Slurm, the workload manager used on the cluster, is a job scheduling system widely used on Linux servers; it fulfills three essential functions:
- Allocates resources on the 58 compute nodes based on users’ requests. Resources include the number of nodes, the number of CPUs, the simulation time, etc.
- Handles the start and execution of a “job” on the assigned nodes.
- Distributes the available resources among users based on a queue system.
Basic Linux commands and small programs or scripts (C, Python, Java, etc.) can be executed on the head node. Do not execute computationally intensive simulations on the head node.
Two of the 58 compute nodes (node014 and node026) are dedicated to “interactive sessions” (see section 5). Interactive sessions are used for code debugging or quick on-the-fly simulations: they let the user log in directly to node014 or node026 and run simulation codes without Slurm scripts, which facilitates debugging and avoids the queue system the workload manager applies on the other nodes.
3. What are environment modules? How to load modules?
Typically, users initialize their environment when they log in by setting environment information for every application they will reference during the session. The Environment Modules package is a tool that simplifies shell initialization and lets users easily modify their environment during the session with modulefiles. Modules can be loaded and unloaded dynamically and atomically.
Here is an example of loading a module on RDFMG cluster:
$ module load gcc/5.2.0
$ which gcc
To unload a module:
$ module unload gcc/5.2.0
$ which gcc
gcc not found
Here are some module commands you may find useful:
| Command | Description |
| --- | --- |
| module avail | List the available modules. If there are multiple versions of a package, one is denoted as (default); loading the module without a version number gives you that default version. |
| module whatis | List all available modules along with a short description. |
| module load MODULE | Load the named module. |
| module unload MODULE | Unload the named module, reverting back to the OS defaults. |
| module list | List all currently loaded modules. |
| module help | Get general help information about modules. |
| module help MODULE | Get help information about the named module. |
| module show MODULE | Show details about the module, including the changes loading it will make to your environment. |
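Module loads are commonly placed at the top of a job script so the environment is also configured on the compute nodes. A minimal sketch using the gcc module from the example above; the CPU request is purely illustrative:

```shell
#!/bin/bash
#SBATCH -n 4              # illustrative resource request (see section 5)
module load gcc/5.2.0     # set up the compiler environment for this job
which gcc                 # verify which gcc binary is now on the PATH
gcc --version             # print the version of the loaded compiler
```

Loading the module inside the script, rather than only in your login shell, guarantees the job sees the same environment no matter which node runs it.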
4. Where are all the codes that we are ready to play with?
All the computer codes are installed in two directories:
Restricted access is enforced. Some codes, such as RAVEN, OpenFOAM or DAKOTA, are open source: contact the server administrators to be granted access. For most of the codes, however, Export Control regulations restrict accessibility: license agreements are required to obtain executable or source privileges, and the license agreement process is handled on a code-by-code basis.
5. How to submit, monitor and delete my jobs?
The RDFMG cluster uses the SLURM Workload Manager to manage and schedule jobs (see section 2). Submitting, monitoring and cancelling jobs is handled by three terminal commands:
| Command | Description |
| --- | --- |
| sbatch <SlurmScript> | Submits a job. Example: sbatch run.sh |
| squeue | Displays the status of all jobs of all users. To display the jobs of one user: squeue -u cwaddl |
| scancel <JobID> | Terminates a job prior to its completion. <JobID> is the job ID number displayed by squeue. Example: scancel 215896 |
SLURM includes numerous directives, which are used to specify resource requirements. SLURM directives appear as header lines (lines that start with #SBATCH) in a script (i.e. a text file).
5.1 Job scripts
To run a job in batch mode on a high-performance computing system using SLURM, the resource requirements are written in a SLURM script.
Example of a SLURM job script (saved, for example, in a text file called run.sh):
#!/bin/bash
#SBATCH -J "jobname"     # Name of your job, optional
#SBATCH -N 1             # Number of nodes requested
#SBATCH -n 32            # Number of CPUs. Max: N × 64
#SBATCH -t 24:00:00      # Maximum simulation time
#SBATCH -p defq          # Queue name (e.g., defq, gradq or all)

# The commands that run your code go below the #SBATCH header lines.
5.2 Submitting jobs
To submit your job, use the SLURM sbatch command. If the command runs successfully, it returns a job ID on standard output, for example:
$ sbatch run.sh
Submitted batch job 215896
5.3 Interactive sessions
To request an interactive session, load the workbench module and call the command isqp:
$ module load workbench
$ isqp
The user will automatically be logged onto node014 with 4 CPUs. Executing a code there does not require the sbatch command or a SLURM script: the commands that would appear in the body of the section 5.1 script are simply typed directly at the prompt.
To choose one of the two interactive nodes, add the -w argument followed by <nodeName> to the isqp command. For example, to log on node026:
isqp -w node026
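Putting the steps together, a hypothetical interactive debugging session could look like this; my_code and input.inp are illustrative names, not actual files on the cluster:

```shell
module load workbench     # make the isqp command available
isqp -w node026           # request an interactive session on node026
# ...you are now logged onto node026 with 4 CPUs...
module load gcc/5.2.0     # load whatever environment your code needs
./my_code input.inp       # run the code directly, without sbatch
```

When the session ends (or you type exit), you return to the rdfmg head node.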