Introduction to NCI
National Computational Infrastructure
Download training materials here:
http://nci.org.au/services-support/training/
Outline
Introduction
Accounting
Connecting
UNIX
Job Scheduling
Filesystems
Troubleshooting
What is the NCI?
- Peak Facility, Raijin, Cloud Service and Data management
- Specialised Support:
  - Climate system science
  - Astronomy
  - Earth Observation
  - Geophysics
  - Cloud Computing
Allocation Schemes
- National Computational Merit Allocation Scheme
  - NCMAS includes NCI (raijin), iVEC (magnus, epic, fornax), VLSCI (avoca), SF in Bioinformatics (barrine) and SF in Imaging and Visualisation (MASSIVE).
- Partner allocations
  - Major Partners: e.g. CSIRO, INTERSECT, GA, QCIF, BoM
  - University Partners: e.g. ANU, Monash, UNSW, UQ, USyd, Uni Adelaide, Deakin
- Flagship Projects
  - Astronomy/Astrophysics, CoE in Climate Systems Science, CoE Optics
- Startup allocation
- Director
Distributions of Allocations: 2014
Approximate distribution of allocations across all compute systems for 2014:
- NCMAS 15%
- CSIRO 21.4%
- BOM 18.9%
- ANU 17.7%
- Flagships 5.0% (including CoECSS, TERN, Astro, CoE Optics)
- INTERSECT 3.8%
- GA 3.4%
- Monash, UNSW, UQ, USyd, Uni Adelaide 1.7% each
- Director's share, QCIF, Deakin, MSI 6.3% in total
NCI HPC System
Integrated Infrastructure and Services
- RAIJIN Fujitsu Primergy information
- Lustre filesystems - raijin (/home and /short) and global (/g/data)
- Cloud - OpenStack cloud (hosting services, specialised virtual labs, new services, special interactive use)
- High-end visualisation services and support (Vizlab)
- Software Packages
Getting Information
- URL: http://nci.org.au/
- Detailed usage information
- Raijin Quick Reference Guide
- Detailed software information
- Raijin FAQs
- /g/data FAQs
- Message of the Day (/etc/motd)
- Emergency and Downtime Notices
- NCI help email: help@nci.org.au
New Petascale System
Fujitsu Primergy - raijin
- 3592 nodes, each with 2x Intel Sandy Bridge E5-2670 (8 cores, 2.6GHz)
- 57472 cores
- Total memory 158TB
- Lustre filesystems: /short, /home, /g/data
- $PBS_JOBFS local to each node
- Infiniband network
- See the system being installed.
Cloud
NCI's Cloud services focus on:
- Computation using the cloud
- Data services using the cloud
- Complementary services to NCI's HPC that are best provided through the cloud
NCI offers a NeCTAR node (National eResearch Collaboration Tools and Resources):
- Designed to optimise computation and floating point (Intel CPUs)
- Designed for high speed data transfer (56 Gigabit network between nodes)
- Designed for high speed IO (all-SSD disk storage in the cloud)
NCI can offer a high speed interconnect between the NCI Lustre based filesystems and NCI Cloud services.
Data Storage
- Global Lustre filesystem /g/data/ - stores persistent data, mounted on raijin and cloud nodes.
- Mass Data storage - HSM storage with dual copies across two NCI data centres. Effective storage for managing data that can be staged in/out as part of batch processing.
- RDSI national data collections - to be stored across the NCI data resources listed above.
How to Apply for a New Project (for CIs)
- Project leaders (Chief Investigators) fill out on-line forms with the required details and are given a project ID.
- Application process:
  - Partner (anytime)
  - Merit scheme (once a year, deadline Nov)
  - Start-up (anytime, max 5000 SU per year)
  - Commercial (anytime)
How to Apply for a New Account (for Users)
- Register as a New User: register first. The registration ID is a number such as 12345; it is not a user ID.
- Connect to Project: a connection form should be submitted.
- Accounts are set up when a CI approves a connection request.
- New users will receive an email with account details.
- NCI usernames are of the form abc123 - abc for your initials and 123 for affiliation.
- Passwords are sent by SMS to the mobile number provided when you registered.
- Passwords can be given over the phone if necessary, but not by email.
- Use the passwd command to change your password when you first log in.
- An automated on-line tool for users to set passwords is being developed, with expected availability mid 2015.
Project accounting
- All use on the compute systems is accounted against projects. Each project has a single grant of time per 3-month quarter.
- If your username is connected to more than one project, you will have to select which project to run jobs under.
- A project may span several stakeholders (e.g. BoM and CSIRO).
- To change or set the default project, edit the .rashrc file in your home directory and change the PROJECT variable as desired. A typical .rashrc file looks like:

setenv PROJECT c25
setenv SHELL /bin/bash

- Log in again after editing .rashrc to see the changes.
Default Project
- The following displays the usage of the project in the current quarter against each of the stakeholders funding the project:

nci_account

- By adding -v you can see who is using the compute time:

nci_account -v

- You can also use -P for another project and -p for a different quarter, e.g.:

nci_account -P c25 -p 2014.q2 -v

- Further information will be presented under nci_account - most notably storage usage.
- If you have a project that is externally funded and requires more resources than provided, please contact us. It is possible to set up special funding and track it under nci_account.
Establish Connection
- Connection under Unix/Mac:
  - ssh: ssh (terminal)
  - scp/sftp: scp/sftp (terminal)
  - X11: ssh -X; make sure to install XQuartz for OSX 10.8 or above (terminal)
- Connection under Windows:
  - ssh: putty, mobaxterm
  - scp/sftp: putty, Filezilla, winscp
  - X11: Cygwin, XMing, mobaxterm, Virtual Network Computing
Caution!
Be sure to log out of xterm sessions, and quit the window manager, before leaving the system.
Connecting to raijin
The hostname of the Fujitsu Primergy Cluster is
raijin.nci.org.au
and can be accessed using the secure shell (ssh) command, for example,
ssh -X abc123@raijin.nci.org.au
Your ssh connection will be to one of six possible login nodes, raijin1 to raijin6.
(If ssh to raijin fails, you should try specifying one of the nodes, e.g.
raijin3.nci.org.au.)
Secure use of ssh
Passphrase-less ssh keys allow ssh to log in without a password.
Caution!
Day-to-day use is strongly discouraged: it considerably weakens both NCI and home institution system security. (Instead, consider a key with a passphrase plus ssh-agent on your workstation.)
Passphrase-less keys can be useful to support copyq batch jobs:
- Generate a new key specifically for such transfers
- Use rrsync to restrict what it can do
More information: Using ssh keys
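As a sketch of the recommended alternative (a passphrase-protected key cached by ssh-agent), the commands below generate and load such a key. The key file name and passphrase are placeholders, not NCI conventions:

```shell
# Create ~/.ssh if needed, then generate an RSA key protected by a
# passphrase (file name and passphrase here are examples only).
mkdir -p "$HOME/.ssh"
ssh-keygen -q -t rsa -b 4096 -f "$HOME/.ssh/id_rsa_nci" -N 'choose-a-passphrase'

# Start an agent and cache the key; you type the passphrase once per
# session instead of at every login. (ssh-add prompts for the
# passphrase when run interactively.)
eval "$(ssh-agent -s)" > /dev/null
ssh-add "$HOME/.ssh/id_rsa_nci" 2>/dev/null || true

# Later logins then reuse the cached key:
#   ssh abc123@raijin.nci.org.au
```

With the key cached, scp and sftp to raijin also stop prompting for the session.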
UNIX environment
The working environment under UNIX is controlled by a shell (command-line interpreter). The shell interprets and executes user commands.
- The default is the bash shell (tcsh is also popular; you may use ksh)
- The shell can be changed by modifying .rashrc
- Shell commands can be grouped together into scripts
- Unix Quick Reference Guide
Note
Unix is case sensitive!
UNIX environment
The shell provides environment variables that can be accessed across all the processes initiated from the original shell, e.g. the login environment.

Shell       | Exec on login and compute nodes | Exec on login nodes only | Modules
csh/tcsh    | .cshrc                          | .login                   | .login
sh/bash/ksh | .bashrc                         | .profile                 | .profile

tcsh syntax:
setenv VARIABLE value
bash syntax:
export VARIABLE=value
For an explanation of environment variables see Canonical user environment variables.
Environment Modules
Modules provide a great way to easily customize your shell environment
for different software packages. The module command syntax is the
same no matter which command shell you are using.
Various modules are loaded into your environment at login to provide a
workable environment.
module list        # see the modules loaded
module avail       # see the list of software for which environments have been set up via modules
module show name   # see the list of commands that are carried out in the module
module load name   # load the environment settings required by a software package
module unload name # remove extras added to the environment for a previously loaded software package;
                   # extremely useful in situations where different package settings clash
Environment Modules
Note
To automate environment customisation at login, module load commands can be added to the .login (tcsh) or .profile (bash) files.
Users should be aware that different applications can have incompatible
environment requirements so loading multiple application modules in
your dot file may lead to problems. We recommend that modules are
loaded in scripts as needed at runtime and likewise discourage the use of
module commands in shell configuration (dot) files.
More advanced information on modules can be found in the Modules
User Guide.
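A minimal sketch of that recommendation: load modules inside the job script itself rather than in dot files. The module names/versions and the program name below are illustrative only:

```shell
# Write a job script that sets up its own environment at runtime
# (module versions and my_program.exe are placeholders).
cat > runjob.sh <<'EOF'
#!/bin/bash
#PBS -l walltime=01:00:00
#PBS -l mem=32GB
#PBS -l ncpus=16
#PBS -l wd

# Load exactly what this job needs; nothing leaks in from dot files,
# and jobs with clashing requirements cannot interfere.
module load intel-fc/12.1.9.293
module load openmpi/1.6.3

mpirun -np 16 ./my_program.exe
EOF

# Submit with: qsub runjob.sh
```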
Editors
Several editors are available:
- vi
- emacs
- nano
If you are not familiar with any of these, you will find that nano has a simple interface. Just type nano.
Caution!
Use dos2unix if your input/job script files were edited on a Windows machine.
Exercise 1: Getting started
Logging on to raijin - use the course account.
ssh -X aaa777@raijin.nci.org.au
Remember to read the Message of the Day (MOTD) as you login.
Commands to try:
hostname        # see the node you are logged into
nci_account     # see the current state of the project
module list     # check which modules are loaded on login
module avail    # see which software packages are installed and accessible in this way
module show pbs # see what environments are set by a module
Note
In .cshrc (tcsh) or .bashrc (bash) the intel-fc, intel-cc and openmpi modules are loaded by default.
Batch Queueing System
- Most jobs require greater resources than are available to interactive processes and must be scheduled by the batch job system (an interactive mode is available).
- The queueing system:
  - distributes work evenly over the system
  - ensures that jobs cannot impact each other (e.g. exhaust memory or other resources)
  - provides equitable access to the system
- Raijin uses a customised version of PBSPro.
- nf_limits displays the limits that are set for your projects.
- Default queue limit
Note
Job charging is based on wall clock time used, number of cpus requested, and queue choice.
Queue Limit
(Queue-limit table not reproduced here; run nf_limits for the current values.)
Batch queue structure
- normal
  - Default queue, designed for production use
  - Charging rate of 1 SU per processor-hour (walltime) on raijin
  - Requests for ncpus greater than a node (16 cores) need to be in multiples of 16
  - If your grant is exhausted -> lower priority (bonus)
- express
  - High priority for testing, debugging etc.
  - Charging rate of 3 SUs per processor-hour (walltime)
  - Smaller limits to discourage production use (ncpus limited to 128, memory per core is 32GB; check nf_limits for project-specific detail)
- copyq
  - Used for file manipulation - e.g. copying files to MDSS
Using the Queueing System
- Read the How to Use PBS guide.
- Use nf_limits to see your user/project queue limits.
- Request resources for your job (using qsub):
  - walltime
  - memory (32GB, 64GB, 128GB per node)
  - disk (jobfs)
  - number of cpus
- PBSPro will then:
  - schedule the job when the resources become available
  - prevent other jobs from infringing on the allocated resources
  - display progress of the jobs (qstat, nqstat or nqstat_anu)
  - terminate the job when it exceeds its requested resources
  - return stdout and stderr in batch output files
Job Script Example
Example
#!/bin/bash
#PBS -l walltime=20:00:00
#PBS -l mem=2GB
#PBS -l jobfs=1GB
#PBS -l ncpus=16
#PBS -l software=xxx (for licenced software)
#PBS -l wd (to start the batch job in the working directory from which it was submitted)

my_program.exe
Job Scheduling
- Job priority is based on resources requested, currently running jobs under the user/project, and grant allocation.
- Jobs start when sufficient resources are available. (Use qstat -s jobid to see a comment on why a job is not running.)
Tips
- Near the end or beginning of a quarter is a busy period.
- For higher priority, use:
  - a shorter walltime request
  - a smaller memory request
  - a larger number of cpus requested (to some extent)
Long-running jobs
- When jobs need to run longer than the queue limits allow, checkpoint/restart functionality is recommended. Long run times expose users to system and/or numerical instabilities.
- Example scripts for self-submitting jobs can be found in the FAQs.
Caution!
Checkpoint/restart is not a filesystem or PBSPro capability - it must be implemented by the user or software vendor.
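One common shape for a self-submitting job is a script that runs one time-limited segment and resubmits itself until a marker file appears. This is only a sketch under assumed names (chainjob.sh, finished.flag, my_program.exe); the real checkpointing of application state must come from the program itself:

```shell
cat > chainjob.sh <<'EOF'
#!/bin/bash
#PBS -l walltime=02:00:00
#PBS -l ncpus=16
#PBS -l wd

# Run one segment; the program is assumed to write its own checkpoint
# and to create finished.flag when all work is done.
./my_program.exe

# Chain the next segment until the marker file appears.
if [ ! -f finished.flag ]; then
    qsub chainjob.sh
fi
EOF

# Start the chain with: qsub chainjob.sh
```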
stdout and stderr files
PBSPro returns the standard output and standard error from each job in
.o***** and .e***** files, respectively.
Example script.o123456
============================================================
Resource Usage on 2013-07-20 12:48:04.355160:
JobId: 123456.r-man2
Project: c25
Exit Status: 0 (Linux Signal 0)
Service Units: 0.01
NCPUs Requested: 1
NCPUs Used: 1
CPU Time Used: 00:00:43
Memory Requested: 50mb
Memory Used: 13mb
Vmem Used: 52mb
Walltime requested: 00:10:00
Walltime Used: 00:00:49
jobfs request: 100mb
jobfs used: 1mb
============================================================
stdout and stderr files
- The .o***** file contains the output arising from the script (if not redirected in the script) and additional information from PBS.
- The .e***** file contains any error output arising from the script (if not redirected in the script) and additional information from PBS. For a successful job it should be empty.
Common errors to look for in the .e***** file:
- Command not found (check module list, path)
- =>> PBS: job terminated: walltime 172818sec exceeded limit 172800sec (increase runtime request)
- =>> PBS: job terminated: per node mem 2227620kb exceeded limit 2097152kb (increase memory per node request)
- Segmentation fault (check your program)
Monitoring the progress of jobs
Useful commands
qstat        # show the status of the PBS queues
nqstat       # enhanced display of the status of the PBS queues
nqstat_anu   # enhanced display of the status of the PBS queues
qstat -s     # display additional comment on the status of the job
qps jobid    # show the processes of a running job
qls jobid    # list the files in a job's jobfs directory
qcat jobid   # show a running job's stdout, stderr or script
qcp jobid    # copy a file from a running job's jobfs directory
qdel jobid   # kill a running job
Caution!
Please use nqstat_anu -a | grep $USER to see the cpu% of your jobs. An efficient parallel job should be close to 100%.
Exercise 2: Submitting jobs to the batch queue
cd /short/$PROJECT/$USER/
tar xvf /short/c25/intro_exercises.tar
cd INTRO_COURSE
cat runjob
qsub runjob
watch qstat -u $USER
... (wait until job finishes, use Ctrl+C to quit)...
runjob
- This job searches for the first n prime numbers. Feel free to change the number n, or the PBS resource requests, to see how the outcome changes.
- View the output in the file runjob.o**** and any error messages in runjob.e**** after the job completes.
Interactive jobs
Users may see the following message when running an interactive process on the login nodes:
RSS exceeded. user=abc123, pid=12345, cmd=exe, rss=4028904, rlim=2097152 Killed
Each interactive process you run on the login nodes has a time limit (30 mins) and a memory limit (2GB) imposed on it. If you want to run a longer or more memory-intensive interactive job, please submit an interactive batch job.
- The -I option for qsub will result in an interactive shell being started on the compute nodes once your job starts.
- A submission script cannot be used in this mode; you must provide all qsub options on the command line.
- To use X windows in an interactive batch job, include the -X option when submitting your job; this will automatically export the DISPLAY environment variable.
Exercise 3: Interactive Batch Jobs
Sometimes the resource requirements (mem, walltime etc) are larger than
allowed. You can run an interactive batch job as follows:
qsub -I -l walltime=00:10:00,mem=500Mb -P c25 -q express -X
qsub: waiting for job 215984.r-man2 to start
qsub: job 215984.r-man2 ready
[aaa777@r73 ]$ xeyes &
[aaa777@r73 ]$ module list
Currently Loaded Modulefiles:
  1) pbs                   2) dot                   3) intel-cc/12.1.9.293
  4) intel-fc/12.1.9.293   5) openmpi/1.6.3
[aaa777@r73 ]$ cd /short/$PROJECT/$USER/INTRO_COURSE
[aaa777@r73 ]$ ./matrix.exe (use Ctrl+C to quit)
[aaa777@r73 ]$ logout
qsub: job 215984.r-man2 completed
Filesystems
Things to consider:
- Transferring large data files to and from raijin: scp, rsync, filezilla
- Use the designated data mover node (r-dm.nci.org.au), not the interactive login nodes.
- How much data do you really need to keep?
- Do you need metadata or a self-describing file format?
- Decide on a structure for archived data before you start.
- Stage in archived data from tape (offline) to disk before starting jobs.
- Archive results automatically at the end of batch jobs.
RAIJIN Filesystems Overview
The Filesystems section of the userguide has this table in greater detail:
Filesystem | Purpose                     | Quota                  | Backup                              | Availability                        | Time limit
/home      | Irreproducible data,        | 2GB (user)             | Yes                                 | raijin                              | None
           | e.g. source code            |                        |                                     |                                     |
/short     | Input/output data files     | 72GB (project)         | No                                  | raijin                              | 365 days
/g/data/   | Processing large data       | project dependent      | No                                  | Global                              | No
$PBS_JOBFS | IO intensive data           | 100MB per node default | No                                  | Local to node                       | Duration of job
MDSS       | Archiving large data files  | 20GB                   | 2 copies in two different locations | External access using mdss commands | No
Note
These limits can be changed on request.
Monitoring disk usage
- lquota gives lustre filesystem usage (/home, /short, /g/data).
- nci_account gives other filesystem usage (/short, /g/data, mdss).
- short_files_report, gdata1_files_report and gdata2_files_report give a breakdown:
  - -G <project> lists files owned by group <project>.
  - -P <project> lists files in /short/<project>.
Caution!
/short and /g/data are not backed up, so it is the user's responsibility to make sure that important files are archived to the MDSS or off-site.
Input/Output Warning
- Lots of small IO to /short (or /home) can be very slow and can severely impact other jobs on the system.
- Avoid dribbly IO, e.g. writing 2 numbers from your inner loop. Writing to /short every second is far too often!
- Avoid frequent opening and closing of files (or other file operations).
- Use jobfs instead of /short for jobs that do lots of file manipulation.
- To achieve good IO performance, try to read or write binary files in large chunks (of around 1MB or greater).
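As a rough shell-level illustration of chunk size (file names here are throwaway examples), dd makes the block size explicit: a 1MB block size issues a handful of write calls where a 512-byte block size would issue thousands:

```shell
# Make a 10 MB test file, then copy it in 1 MB chunks (about 10
# write() calls) rather than 512-byte chunks (about 20000 calls).
dd if=/dev/zero of=testfile.bin bs=1M count=10 2>/dev/null
dd if=testfile.bin of=copy_fast.bin bs=1M 2>/dev/null

# For comparison, the slow pattern would be:
#   dd if=testfile.bin of=copy_slow.bin bs=512

ls -l testfile.bin copy_fast.bin
```

The same principle applies inside programs: buffer output and write it in large blocks instead of element by element.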
Exercise 4: Writing to /short
- Use the lquota and du commands to find out how much disk space you have available in your home, short and gdata directories.
- Use short_files_report or gdata1_files_report to see who uses most of the quota. Look at your project's /short area. Anyone from your project can create their own directories and files here. There will be a directory of your own under your project area.
- Note the different group ownership in the DATA directory:
ls -l /short/c25/DATA
Exercise 4: Writing to /short (cont)
Change the permissions on your files and directories to allow/disallow
others in your group to access them.
man chmod
chmod g+r filename # allow group read of filename
chmod g-r filename # disallow group read of filename
chmod g+w filename # allow group write to filename
chmod g+x filename # allow group execute of filename
Verify with your neighbour that your file permissions are as expected.
Note
- To be able to go into a directory requires execute permission (chmod -R +X folder).
- You may not want to share files by making your /home directory world readable. For members of the same project you can use /short/$PROJECT. Talk to us about alternatives if you need to share source code, data files etc.
ACL Access Control Lists
ACLs are an addition to the standard Unix file permissions (r, w, x, -) for User, Group, and Other, adding read, write, execute and deny permissions. ACLs give users and administrators flexibility and direct fine-grained control over who can read, write, and execute files.
Caution!
We strongly recommend that you consult with NCI before using ACLs.
Using the MDSS
The Mass Data Store was migrated to a new SGI Hierarchical Storage
Management System in January 2012.
- MDSS is used for long term storage of large datasets.
- If you have numerous small files to archive, bundle them into a tarfile FIRST.
- Watch our tape robot at work.
- Every project has a directory on the MDSS.
- All members of the project group have read and write access to the top project directory.
- mdss dmls -l gives information on what is online (in the disk cache) and what is on tape.
Using the MDSS
- The mdss command can be used to get and put data between the login and copyq nodes of raijin and the MDSS, and also to list files and directories on the MDSS.
- netcp and netmv can be used from within batch jobs to:
  - generate a batch script for copying/moving files to the MDSS
  - submit the generated batch script to the special copyq, which runs the copy/move job on an interactive node
- netcp and netmv can also be used interactively to save you the work of creating tarfiles and generating mdss commands:
  - -t creates a tarfile to transfer
  - -z/-Z gzips/compresses the file to be transferred
Caution!
Always use -l other=mdss when using mdss commands in copyq. This is so that jobs only run when the mdss system is available.
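Putting that together, a hand-written copyq job might look like the sketch below (the paths, tarfile name, and MDSS destination are illustrative; netcp/netmv generate scripts of roughly this shape for you). The -l other=mdss request keeps the job queued whenever the MDSS is unavailable:

```shell
cat > archive.sh <<'EOF'
#!/bin/bash
#PBS -q copyq
#PBS -l walltime=00:30:00
#PBS -l other=mdss
#PBS -l wd

# Bundle the many small files first, then store one tarfile on MDSS.
tar cf results.tar results/
mdss put results.tar my_archive/results.tar
EOF

# Submit with: qsub archive.sh
```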
Exercise 5: Using the MDSS
To see these commands in action do
cd /short/$PROJECT/$USER
mdss get Data/data.tar
ls -l
tar xvf data.tar
ls
rm data.tar
mdss mkdir $USER
netmv -t $USER.tar DATA $USER
watch qstat -u $USER
... (wait until job finishes, use Ctrl+C to quit)...
less DATA.o*
mdss ls $USER
mdss rm $USER/$USER.tar
Using /jobfs
- Only available through the queueing system:
  - request like -l jobfs=1GB
  - access via the $PBS_JOBFS environment variable
- All files are deleted at the end of the job. Copy what you need to /short or another global filesystem in the job script.
- Requests larger than 396GB will be automatically redirected to /short (but files will still be deleted at the end of the job).
- You cannot use mdss or netcp commands for files on jobfs.
Exercise 6: Managing Files between /short, /jobfs and MDSS
Submit a batch job with a /jobfs request, where the job:
- Copies an input file from /short to /jobfs
- Runs a code that uses the input file and generates some output
- Saves the output data back to the /short area
- Uses the netcp command to archive the data to the MDSS
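The real runjobfs script is in the course material; the four steps can be sketched like this (all file and program names below are assumptions, and the netcp usage mirrors Exercise 5):

```shell
cat > runjobfs_sketch.sh <<'EOF'
#!/bin/bash
#PBS -l walltime=00:20:00
#PBS -l ncpus=1
#PBS -l jobfs=1GB
#PBS -l wd

# 1. Stage input onto fast node-local disk.
cp input.dat $PBS_JOBFS/

# 2. Run where the IO-heavy work belongs.
cd $PBS_JOBFS
./my_code.exe input.dat > output.dat

# 3. Save results before the job ends ($PBS_JOBFS is wiped afterwards).
cp output.dat /short/$PROJECT/$USER/

# 4. Archive the result to the MDSS (netcp submits a copyq job).
cd /short/$PROJECT/$USER
netcp output.dat $USER
EOF
```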
Exercise 6: Managing Files between /short, /jobfs and MDSS
Read the runjobfs script then submit it to the queueing system,
monitor the job with qstat, and examine the job output files:
cd /short/$PROJECT/$USER/INTRO_COURSE
qsub runjobfs
watch qstat -u $USER
... (wait until job finishes, use Ctrl+C to quit)...
cat runjobfs.e*
cat runjobfs.o*
Check out the output file that this job created on /short and the copy on
the MDSS
cd /short/$PROJECT/$USER
ls -ltr
less save_data.o*
mdss ls $USER
mdss rm -r $USER
Troubleshooting
- .e and .o (stderr and stdout) files - check your input!
- PBS emails, MOTD, and Notices and News
- Read the FAQs:
  - Why are my jobs not running?
  - Why does my job run fine on my local machine, but not work on raijin?
  - My PBS job script generates the error message "module: command not found". What's wrong?
  - How do I access files on NCI systems using a graphical user interface?
  - How do I transfer files between massdata and my local machine?
- Read the /g/data FAQs
Issues with Running Jobs
- CPU over/under-subscription
  - Due to an inconsistent ncpus=X request vs mpirun -np Y, where Y != X.
  - OMP_NUM_THREADS != $PBS_NCPUS
  - Use mpirun --bind-to-socket -npernode 2 <exe> -T 8 <args>
  - Use mpirun --bind-to-none program.exe for ncpus < 16 jobs
  - Software-specific keywords:
    - %nproc in gaussian
    - NPAR in VASP: the recommended value should be somewhere between SQRT(ncpus) and ncpus/2, and be a factor of 16
- Unbalanced %cpu usage
- 0% cpu usage (sleeping, hung, or dead job). If it is a file manipulation job, use copyq instead.