Hoffman2:Submitting Jobs


Back to all things Hoffman2

If you remember from Anatomy of the Computing Cluster, the Sun Grid Engine on Hoffman2 is the scheduler for all computing jobs. It takes your computing job request, considers what resources you are asking for and then puts your job in a line waiting for those resources to become available.

Ask for a modest 1GB of memory, a single computing core, and a short time window, and your job will likely be placed at the front of the line and start running soon, if not immediately. For the vast majority of people, this will be the case.

Ask for a lot of memory or many computing cores, and your job will get put further back in the line because it will have to wait for more things to become available. If your job needs these types of resources, you are probably at a level where reading this tutorial isn't very helpful.

Ask for too little RAM or too little time, and your job will be killed or will end prematurely, leaving you with no results to examine.

Job Submission Types

So how does one submit a computing job request? You've got some options:

  1. job.q
    Use a simple tool that ATS wrote. It has a menu and walks you through submitting a job, but it has been known to omit certain necessary flags.
  2. qsub
    Get under the hood and do it yourself. It can get messy but it can also be faster and you have more flexibility with options.
  3. command files
    You've graduated to a higher level of operations, but we can help you get there with examples of our own command files.
  4. job arrays
    If you've got a lot of repetitive tasks to run, these will be your friend.


job.q

Once you've identified or written a script you'd like to run, SSH into Hoffman2 and enter job.q. Then it is just a matter of following its step-by-step instructions.

From the tool's main menu, you can type Info to read up about how to use it and we highly encourage you to do so.

But we know patience is a virtue that most of us aren't blessed with. So we'll walk you through submitting a basic job so you can hit the ground running.

Example

  1. Once on Hoffman2, you'll need to edit one configuration file, so pull out your favorite text editor and open
    ~/.queuerc
  2. Add the line
    set qqodir = ~/job-output
  3. You've just set the default directory where your job command files will be created. Save the configuration file and close your text editor.
  4. Make that directory using the command
    $ mkdir ~/job-output
  5. Now execute
    $ job.q
  6. Press enter to acknowledge the message about some files that get created (READ IT FIRST THOUGH).
  7. Type Build <ENTER> to begin creating an SGE command file.
  8. The program now asks which script you'd like to run. Enter the following path to use our example script:
    /u/home/FMRI/apps/examples/qsub/gather.sh
  9. The program now asks how much memory the job will need (in Megabytes). This script is really simple, so let's go with the minimum and enter 64.
  10. The program now asks how long the job will take (in hours). Go with the minimum of 1 hour; it will complete in much less time than that.
  11. The program now asks if your job should be limited to only your resource group's cores. Answer n: there is no need to limit yourself here, and the job is not going to run for more than 24 hours.
  12. Soon, the program will tell you that gather.sh.cmd has been built and saved.
  13. When it asks you if you would like to submit your job, say no. Then type Quit <ENTER> to leave the program.
  14. Now you should be able to run
    ls ~/job-output
    and see gather.sh.cmd. This file will stay there until you delete it and can be run over and over again. Making a command file like this is especially useful if there is a task you'll be running repeatedly on Hoffman2. But if this is something you only need to run once, you should delete the file so you don't needlessly approach your quota.
  15. The time has come to actually run the program (thought we'd never get to that, didn't you?). Type
    qsub job-output/gather.sh.cmd
    and after hitting enter, a message similar to this will pop up:
    Your job 1882940 ("gather.sh.cmd") has been submitted
    where the number is your JobID, a unique numerical identifier for the computer job you have submitted to the queue.
  16. Now you can check if the job has finished running by doing
    ls ~/job-output
  17. When two files named gather.sh.output.[JOBID] and gather.sh.joblog.[JOBID] (where JOBID is your job's unique identifier) appear, your job has run.
    gather.sh.output.[JOBID]
    This file has all the standard output generated by your script. In this case it will just have the line
    Standard output would appear here.
    gather.sh.joblog.[JOBID]
    This file has all the details about when, where, and how your job was processed. Useful information if you are going to be running this job over and over and need to fine tune the resources it uses.
  18. Better ways of checking on your job can be found here.
  19. The script you ran is an aggregator. It looks in a list of directories, each assumed to contain a specifically named file, and gathers the contents of each of those files into one central file in your home directory. This file is named gather-[TIMESTAMP].txt where TIMESTAMP is when the script was run and follows ISO 8601 style encoding. You are encouraged to type
    /u/home/FMRI/apps/examples/qsub/gather.sh -h
    or
    /u/home/FMRI/apps/examples/qsub/gather.sh --help
    to see how this script works.
  20. Finally, go check the inbox of the email you used to sign up for your Hoffman2 account. There will be two emails from "root@mail.hoffman2.idre.ucla.edu" that indicate when the job was started and when the job was completed. This is one of the neat features of the queue so that you can be alerted about the progress of your job without having to stay logged into Hoffman2 and checking on it constantly.
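For reference, the command file that job.q just built is simply a shell script whose SGE options appear as #$ directive lines. The exact file job.q generates will have more comments and settings, but a minimal sketch using the values entered above might look like this (an illustration, not the literal file):

#!/bin/bash
# Run the job from the directory it was submitted from (sketch; job.q may set paths differently)
#$ -cwd
# Merge the error stream into the standard output
#$ -j y
# Resources entered in the job.q dialog: 64 MB of RAM, 1 hour of run time
#$ -l h_data=64M,h_rt=1:00:00
# Email the job owner at the beginning and end of the job
#$ -m bea
# The script to run
/u/home/FMRI/apps/examples/qsub/gather.sh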


qsub

Everything that job.q did can be done on the command line. And it can be done better.

Example

Run the command:

$ qsub -cwd -V -N J1 -l h_data=64M,express,h_rt=00:05:00 -M eplau -m bea /u/home/FMRI/apps/examples/qsub/gather.sh

And something like the following will be printed out:

Your job 1875395 ("J1") has been submitted

where the number is your JobID, a unique numerical identifier for your job.

Let's break down the arguments in that command.

-cwd
Change working directory
When your job runs, its working directory is set to the directory you were in when you submitted it.
e.g. If you were in the directory /u/home/mscohen/data/ when you ran the command, the queue will change to that location and then execute the script you gave it. This means the output and error files for that job will be placed there.
-V
Export environment variables
Exports all the environment variables to the context of the job. Useful if you have extra environment variables that are needed in your script.
e.g. If you had defined the variable SUBJECT_ID in your session on Hoffman2 (export SUBJECT_ID=42) before submitting a job and that variable was called on by your script, then you would need to use this flag. Tools like FreeSurfer look for certain environment variables to be set.
-N J1
Name my job
Names your job "J1." When you look at the queue, this will be the text that shows up in the "name" column. This will also be the beginning of the output (J1.o[JOBID]) and error (J1.e[JOBID]) files for your job.
-l h_data=64M,express,h_rt=00:05:00
Resource allocation (that's a lowercase "elle")
This is the resources flag, meaning that the text immediately after it requests things like:
  • a certain amount of memory, in megabytes or gigabytes
    • h_data=64M (64 MB RAM) or h_data=1G (1 GB RAM)
    • "mem" no longer works
In this case, our demands for RAM are really low, so we are requesting only 64MB.
Edit (2013.09) - If your job uses more RAM than it requested, your job WILL be killed in order to avoid it hurting other jobs running on the same node. It is imperative that you set this RAM request properly.
  • a certain length of computing time, in the form HH:MM:SS
    • h_rt=00:05:00 or
    • time=00:05:00
In this case the script will complete its task rapidly, hence we are only asking for 5 minutes of computing time.
  • queue type, only a few options here
    • express
      Time limit of 2 hours, and it tends to be overloaded so it isn't recommended
    • highp
      Job length maximum of 14 days but can only be run on nodes belonging to your resource group (type mygroup to see what type of resources you have available). If you are in the mscohen or sbook usergroups on Hoffman2, you have access to some of these highp nodes.
    • [blank] (nothing, nada, zilch)
      Standard queue, which has a maximum job length of 24 hours
In this case, we are asking to be put on the express queue since this is such a short job, but the standard queue would have worked just as well if not better.
-M eplau
Define mailing list
This defines the list of users that will be mailed if email updates are requested. The default address is that of the job owner, but multiple addresses can be specified as a comma-separated list.
e.g. In this case, the email will be sent to the address on file for the user "eplau"
-m bea
Define mailing rules
This defines when Hoffman2 should email you about your job. There are five options here
  • b - when the job begins
  • e - when the job ends
  • a - when the job is aborted
  • s - when the job is suspended
  • n - never
The first four can be used in any combination, but the last obviously nullifies the others.

There are many other flags that you could use, but these are the basics that will get you through most of your computing. Feel free to explore the others in the qsub man page.
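Once a job is submitted, you can also keep an eye on it directly from the command line with the standard SGE tools (a quick sketch; see their man pages for the full option lists):

$ qstat -u $USER      # list your pending and running jobs
$ qstat -j 1875395    # show detailed information about a specific JobID
$ qdel 1875395        # remove a job from the queue if you change your mind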


Command Files

Typing accurately can be difficult at times, so why put yourself through the trouble of having to retype the same arguments over and over if you will always be using about the same values? Enter command files.

You already have experience making a command file (~/job-output/gather.sh.cmd) from when you used the tool job.q. But did you know that you can edit that command file to make changes to how it runs, or write your own?

The command files generated by job.q are fairly well commented, so if you take a look at them with your favorite text editor you should be able to change their behavior. For instance, open the command file from the job.q example and find the lines that say

#  Notify at beginning and end of job
#$ -m bea

You'll recognize this as the flag that controls when email messages are sent. Go ahead and change it to

# Notify at the end and on abort
#$ -m ae

Now you should receive only one email, when your job finishes (plus one if the job is aborted).
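After saving your changes, resubmit the edited command file exactly as before and the new mail settings take effect:

$ qsub ~/job-output/gather.sh.cmd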


q.sh

You could make a generic command file that contains all the basic flags that you care about. We've even got an example ready and available for you at

/u/home/FMRI/apps/examples/qsub/q.sh

The script contents are shown below:

qsub <<CMD
#!/bin/bash
# Use current working directory
#$ -cwd
# Error stream is merged with the standard output
#$ -j y
# Use the bash shell for job execution
#$ -S /bin/bash
# Use your normal environment variables in the job
#$ -V
# Use 1GB of RAM and the main queue, with a maximum of 2 hours computing time
#$ -l h_data=1024M,h_rt=2:00:00
$@
CMD

To use this command file to submit the gather.sh example script, you would execute the command:

$ q.sh gather.sh

You can do this because, if your Bash profile is set up correctly, the directories containing q.sh and gather.sh are in your Unix PATH variable. You can replace gather.sh with any script you want executed, and it will be submitted as a job on the cluster. We recommend that you make your own copy of q.sh and keep it in your local bin directory (~/bin) so that you can edit it to suit your needs.
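If you haven't set that up yet, one way to do it (a minimal sketch, assuming your login shell is bash and you keep personal scripts in ~/bin) is:

$ mkdir -p ~/bin
$ cp /u/home/FMRI/apps/examples/qsub/q.sh ~/bin/
$ chmod u+x ~/bin/q.sh
$ echo 'export PATH=$HOME/bin:$PATH' >> ~/.bash_profile
$ source ~/.bash_profile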


Job Arrays

There is an SGE qsub argument that lets you submit many parallel copies of the same script at once, as a job array. It is

-t lower-upper:interval

where

lower
is replaced with the starting number
upper
is replaced with the ending number
interval
is replaced with the step interval

So adding the argument

-t 10-100:5

will step through the numbers 10, 15, 20, 25, ..., 100 submitting a job for each one.

In jobs submitted with this flag, an environment variable called SGE_TASK_ID will be set, and its value steps through the range you specified. Each value of SGE_TASK_ID runs as its own task of the array job, so your work is parallelized.


Examples

Why would anyone use this? Here are some examples

Lots of numbers

Let's say you have a script, myFunc.sh, that takes one numerical input and computes a bunch of values based on that input. But you need to run myFunc.sh for input values 1 to 100. One solution would be to write a wrapper script myFuncSlowWrapper.sh as

#!/bin/bash
# myFuncSlowWrapper.sh
for i in {1..100};
do
    myFunc.sh $i;
done

The only drawback is that this will take quite a while since all 100 iterations will be done on a single processor. With job arrays, the computations will be split among many processors and can finish much more quickly. You would instead write a wrapper script called myFuncFastWrapper.sh as

#!/bin/bash
# myFuncFastWrapper.sh
echo $SGE_TASK_ID
myFunc.sh $SGE_TASK_ID

And submit it with

qsub -cwd -V -N PJ -l h_data=1024M,h_rt=01:00:00 -M eplau -m bea -t 1-100:1 myFuncFastWrapper.sh
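Here, myFunc.sh is just a stand-in for whatever program you actually need to run. A trivial, purely hypothetical version you could use to test the pattern end to end might be:

#!/bin/bash
# myFunc.sh (hypothetical stand-in): compute something from one numerical input
# and write the result to its own file
echo "Result for input $1" > result-$1.txt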

Lots of files

Let's say you have a script, myFunc2.sh, that takes the name of a file as input and opens that file and runs a bunch of computations on its contents. But you have 100 such files to process. One solution would be to write a wrapper script myFunc2SlowWrapper.sh as

#!/bin/bash
# myFunc2SlowWrapper.sh
for FILE in `ls dir/of/files`;
do
    myFunc2.sh $FILE
done

But this will take quite a while since all 100 iterations will be done on a single processor. With job arrays, the computations will be split among many processors since they are submitted as their own jobs and can finish much more quickly. You could instead create a file that contains a list of all 100 files that need to be processed and call it filesToProcess. Then write a wrapper script called myFunc2FastWrapper.sh as

#!/bin/bash
# myFunc2FastWrapper.sh
echo $SGE_TASK_ID
myFunc2.sh `sed -n ${SGE_TASK_ID}p /path/to/list/of/files`

where you replace /path/to/list/of/files with the path to filesToProcess. The code

`sed -n ${SGE_TASK_ID}p /path/to/list/of/files`

uses sed to grab the ${SGE_TASK_ID}'th line from the file /path/to/list/of/files and passes it to myFunc2.sh as its argument (the backticks perform command substitution).
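If you haven't built filesToProcess yet, one simple way to do it (a sketch; point it at wherever your 100 files actually live) is:

$ ls /path/to/dir/of/files/* > filesToProcess
$ wc -l filesToProcess    # confirm the list has the 100 entries you expect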

Then you'd submit it with

qsub -cwd -V -N PJ -l h_data=1024M,express,h_rt=01:00:00 -M eplau -m bea -t 1-100:1 myFunc2FastWrapper.sh

If your files were named regularly with a '-number' at the end (e.g. 'file-1', 'file-2', 'file-3', ... 'file-n'), you could just make myFunc2FastWrapperB.sh as

#!/bin/bash
# myFunc2FastWrapperB.sh
echo $SGE_TASK_ID
myFunc2.sh file-${SGE_TASK_ID}

and submit it the same way.
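One last thing to keep in mind with array jobs: SGE normally writes a separate output file for each task, named with both the JobID and the task number, so a 100-task run like the ones above will typically leave you with files such as:

PJ.o[JOBID].1
PJ.o[JOBID].2
...
PJ.o[JOBID].100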