Hoffman2:Monitoring Jobs

From Center for Cognitive Neuroscience
Revision as of 21:32, 14 November 2017 by Dmargolis (talk | contribs) (→‎l33t tip)


You've submitted jobs to the queue, but you have no idea how they are doing. There are some great tools you need to get familiar with:

  • qstat - Status updates about jobs. More interesting than Facebook?
  • qhold - Put that job on hold so others can cut in line.
  • qrls - Release a job from hold, things have to get done.
  • qdel - Delete a job when it starts misbehaving or you realize you made a typo in your script.


qstat

Say the name slowly, and you'll understand immediately what it does. It gives you the status of the queue... the whole queue. We don't recommend running qstat without any flags because it will spit out a firehose of information. Feel free to try it, though.

Instead, it should usually be run as

$ qstat -u USER

where you replace USER with your username. This instructs qstat to only return the jobs that belong to USER. If there are no such jobs, nothing will be returned.

Examples

An example of qstat output is shown below:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
1887245 0.50088 FQt3.cmd   kerr         r     03/17/2012 15:44:26 mscohen_idre.q@n130                1 77
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 00:27:28 mscohen_idre.q@n129                1 90
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 08:08:56 mscohen_idre.q@n132                1 97
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 13:21:11 mscohen_idre.q@n131                1 102
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 17:43:23 mscohen_idre.q@n130                1 107
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 18:09:55 mscohen_idre.q@n127                1 109
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 18:11:50 mscohen_idre.q@n130                1 110
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 18:14:41 mscohen_idre.q@n129                1 111
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 18:30:59 mscohen_idre.q@n132                1 113
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 18:49:04 mscohen_idre.q@n128                1 115
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 19:05:59 mscohen_idre.q@n132                1 116
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 19:58:29 mscohen_idre.q@n129                1 118
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 20:15:08 mscohen_idre.q@n127                1 120
1887245 0.50088 FQt3.cmd   kerr         r     03/18/2012 20:48:27 mscohen_idre.q@n132                1 121

Let's break down each column

job-ID
This is the unique numerical identifier given to each job.
prior
The priority rating given to a job. Values range from 0 to 1. Zero priority means it isn't running. A priority of about 0.5 is typical while your job is running.
name
The name of the job. Remember you can set this to clever text by using the -N flag in qsub.
user
The user to whom the job belongs. If you used the -u flag, only one user should be showing up in this column for the whole output of qstat. But if you ran it without that flag, you will see all sorts of usernames here.
state
Which state the job is in. There are (at least) four state letters to know, and they can be combined (for example, qw means queued and waiting):
  • w
    waiting in line, so execution hasn't started yet
  • r
    running, yay!
  • E
    error...error...your job isn't working and should be deleted soon
  • h
    hold, meaning your job has been held up, either in line or while running, so that no progress is being made. You may have intentionally held some of your jobs to let other ones you have submitted jump ahead in the queue. The qhold section below explains how and why you would do this.
submit/start at
The date and time when the job was submitted (for jobs still waiting) or started running (for running jobs).
queue
Which queue the job is in. Some possible values include:
  • idre.q
    General IDRE queue
  • msa.q
    General IDRE queue in the Math Sciences data center
  • pod.q
    General IDRE queue in the POD data center
  • express.q
    This is the express queue which has a maximum computing time of 2 hours.
  • mscohen_idre.q
    These are the nodes that belong to the mscohen resource group. If you are in that resource group, you can use these computing nodes to run jobs for as long as 14 days.
  • inter_ext.q
  • eeskin_test.q
  • hadoop.q
  • hadoopdev.q
    other generally available queues (as of March 2012)
slots
There are multiple slots available on a given computing node. These represent individual CPUs. You generally only need one, but you can request more. Just remember that requesting more resources generally means it will take longer for your job to begin running.
ja-task-ID
There are things called job arrays, which let you run a bunch of parallel jobs using the same script but changing one number each time. If you submit a job array, that changing number will show up in this column for each of the parallel jobs. See our tutorial about job arrays for details.

The man page (man qstat) is your friend if you want to learn about other option flags, like how to get the output in XML format, or how to get extended information about resource requirements for each job.
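
Once you know the columns, you can have the shell summarize them for you. The following is a minimal sketch, not an official Hoffman2 tool: count_states is a hypothetical helper name, and it assumes the default qstat layout shown above (two header lines, with the state in the fifth column).

```shell
#!/bin/sh
# Hypothetical helper: tally jobs by state from qstat output read on stdin.
# Assumes the default qstat layout: line 1 is the column header, line 2 is
# the row of dashes, and the state is the 5th whitespace-separated field.
count_states() {
    awk 'NR > 2 && NF >= 5 { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Typical use on the cluster (needs qstat, so it is commented out here):
# qstat -u "$USER" | count_states
```

Each output line is a state followed by the number of rows in that state. Note that each running task of a job array is its own row, so it is counted separately.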


myjobs

Use the command

$ myjobs

to achieve the same thing as

$ qstat -u MYUSERNAME

with less typing.
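
If you are curious how a shortcut like this could work, here is a rough sketch. We have not checked how myjobs is actually implemented on Hoffman2, so treat myjobs_sketch as a hypothetical stand-in that you could put in your own shell startup file on a cluster where myjobs is missing.

```shell
#!/bin/sh
# Hypothetical stand-in for myjobs; the real Hoffman2 command may be
# implemented differently.
myjobs_sketch() {
    # id -un prints the current login name, so no username has to be typed.
    # Any extra arguments are passed straight through to qstat.
    qstat -u "$(id -un)" "$@"
}
```

Passing the arguments through means you can still add other qstat flags after the shortcut.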

qhold

Let's say you submitted 1000 jobs that will take a while to run and are waiting in the queue, but then you suddenly need to run a single job immediately. If you submit that single job now, it will have to wait behind all those other jobs.

Alternatively, you could hold the 1000 jobs so that your new single job can cut ahead of them in line. It's like having friends go ahead of you to the movie premiere to save you a spot in line.

Examples

Using the output of qstat, you can find the job-ID of a particular computing job. Let's say you have a job with ID 1234567. To hold it, execute

$ qhold 1234567

To hold all of your jobs, you could execute

$ qhold -u USERNAME

where you replace USERNAME with your username.

The man page (man qhold) is your friend if you want to learn about other option flags.
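
If you only want to hold the jobs that are still waiting, and leave alone anything already held or running, you can pick their IDs out of the qstat output. This is a sketch under assumptions: pending_ids and hold_pending are hypothetical helper names, the default qstat layout is assumed, and "waiting" is taken to mean a state of exactly qw.

```shell
#!/bin/sh
# Hypothetical helper: print the unique job IDs whose state is exactly "qw".
# Reads qstat output on stdin and skips the two header lines.
pending_ids() {
    awk 'NR > 2 && $5 == "qw" { print $1 }' | sort -u
}

# Hypothetical helper: read qstat output on stdin and call qhold once per
# pending job ID.
hold_pending() {
    for id in $(pending_ids); do
        qhold "$id"
    done
}

# Typical use on the cluster (needs qstat and qhold, so it is commented out):
# qstat -u "$USER" | hold_pending
```

Calling qhold once per ID is slower than passing a list, but it avoids relying on any particular list syntax.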

Now that you held all those jobs to let your new one cut in line, how do you release them so they can run? Keep reading...


qrls

This tool lets you release jobs from holds so they may continue running.

Examples

If you had previously held job 1234567, execute

$ qrls 1234567

to release it so it may run.

Or to release all of your jobs, execute

$ qrls -u USERNAME

where you replace USERNAME with your username.

The man page (man qrls) is your friend if you want to learn about other option flags.


qdel

The tool for that moment when you realize the job you submitted will actually delete all your data instead of preprocessing it (don't laugh, it can happen). This is very similar to qhold and qrls.

Examples

To delete job 1234567, execute

$ qdel 1234567

and it will stop in its tracks.

To delete all the jobs belonging to you, execute

$ qdel -u USERNAME

where you replace USERNAME with your username.

The man page (man qdel) is your friend if you want to learn about other option flags.
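
Sometimes you want to delete only the jobs from one bad script rather than everything you own. One way to do that is to look up the IDs by job name first. This is only a sketch: ids_by_name is a hypothetical helper name, it assumes the default qstat layout, and it matches the name exactly as the name column shows it (the default listing may shorten long names).

```shell
#!/bin/sh
# Hypothetical helper: print the unique job IDs whose name column equals $1.
# Reads qstat output on stdin and skips the two header lines.
ids_by_name() {
    awk -v name="$1" 'NR > 2 && $3 == name { print $1 }' | sort -u
}

# Typical use on the cluster (needs qstat and qdel, so it is commented out).
# Look at the IDs first, then delete them one at a time:
# qstat -u "$USER" | ids_by_name FQt3.cmd
# for id in $(qstat -u "$USER" | ids_by_name FQt3.cmd); do qdel "$id"; done
```

Printing the IDs before deleting gives you one last chance to catch a typo in the job name.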

