Hoffman2:Job Array
Latest revision as of 21:43, 27 December 2019
A job array is a type of batch-mode job. It lets you process different subjects with the same script on multiple Hoffman2 compute nodes at the same time.
Here, we use this template code to show how it can be done:
    #!/bin/bash
    #$ -cwd
    # error = Merged with joblog
    #$ -o joblog.$JOB_ID.$TASK_ID
    #$ -j y
    #$ -pe shared 2
    #$ -l h_rt=8:00:00,h_data=4G
    # Email address to notify
    #$ -M $USER@mail
    # Notify when
    #$ -m a
    # Job array indexes
    #$ -t 1-5:1
The only differences from the single-subject version are:
    #$ -o joblog.$JOB_ID.$TASK_ID
    #$ -t 1-5:1
- -o joblog.$JOB_ID.$TASK_ID splits the logs into a separate file for each subject, named joblog.$JOB_ID.$TASK_ID.
- -t 1-5:1 gives the numbers [1 2 3 4 5] to step through.
- The -t option should be followed by a lower number and an upper number, together with the step interval, in the following format:
    -t lower-upper:interval
where
- lower is replaced with the starting number
- upper is replaced with the ending number
- interval is replaced with the step interval
So adding the argument
    -t 10-100:5
will step through the numbers 10, 15, 20, 25, ..., 100, submitting a job for each one.
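To preview which numbers a given range will produce before submitting, the range can be expanded locally. The following is a minimal sketch that assumes the `seq` utility is available; `expand_range` is a hypothetical helper name, not a scheduler command.

```shell
#!/bin/bash
# expand_range is a hypothetical helper, not part of the scheduler.
# It prints the numbers that "-t lower-upper:interval" would step through.
expand_range() {
    local lower=$1 upper=$2 interval=$3
    # seq takes its arguments in the order FIRST INCREMENT LAST
    seq "$lower" "$interval" "$upper"
}

# Preview the numbers for "-t 10-100:5" on a single line
expand_range 10 100 5 | tr '\n' ' '
echo
```

Counting the lines of output (for example with `wc -l`) gives the number of jobs the range would submit.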
Each job gets an environment variable called SGE_TASK_ID, whose value steps through the range you specified. The Hoffman2 job scheduler submits one job for each SGE_TASK_ID value, so your work is parallelized.
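A script can check SGE_TASK_ID to tell whether it is running as part of a job array. The sketch below is illustrative only: `describe_task` is a hypothetical helper, and it assumes that outside an array job the variable is either unset or holds the literal string "undefined".

```shell
#!/bin/bash
# describe_task is a hypothetical helper for illustration.
# Assumption: outside an array job SGE_TASK_ID is unset or "undefined".
describe_task() {
    local id=${SGE_TASK_ID:-undefined}
    if [ "$id" = "undefined" ]; then
        echo "not running as an array task"
    else
        echo "array task $id"
    fi
}

describe_task
```

Run by hand on a login node, this reports that it is not an array task; inside a job submitted with `-t`, it reports the number assigned to that job.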
When to use it?
Let's see how a job array can replace a loop, which can only run on a single compute node.
    #!/bin/bash
    # myFuncSlowWrapper.sh
    for i in {1..100};
    do
        myFunc.sh $i;
    done
With a job array, the workload is split among many processors and can finish much faster. Here's how to rewrite the loop as a job array in myFuncFastWrapper.sh, which would be submitted with the range -t 1-100:1 so that each job handles one value of i:
    #!/bin/bash
    # myFuncFastWrapper.sh
    echo $SGE_TASK_ID
    myFunc.sh $SGE_TASK_ID
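Before submitting, the wrapper logic can be dry-run locally by setting SGE_TASK_ID by hand. This is only a sketch: `my_func` is a stand-in for myFunc.sh (whose contents are not shown here), and `run_task` is a hypothetical name for the wrapper body.

```shell
#!/bin/bash
# my_func is a stand-in for myFunc.sh so the sketch runs anywhere.
my_func() {
    echo "processing item $1"
}

# run_task (hypothetical name) mirrors the wrapper body: it processes
# only the single item selected by SGE_TASK_ID.
run_task() {
    my_func "$SGE_TASK_ID"
}

# Dry run: emulate the values 1 to 3 one after another. On the cluster,
# "-t 1-3:1" would hand each value to a separate job instead.
for id in 1 2 3; do
    SGE_TASK_ID=$id run_task
done
```

On the cluster the range comes from a `#$ -t` line in the script or from the command line, for example `qsub -t 1-100:1 myFuncFastWrapper.sh`.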
Example
In this sample code, each SGE_TASK_ID is an index into the array of subjects, so each job, running on a different node, knows which subject it should process.
    #!/bin/bash
    #$ -cwd
    # error = Merged with joblog
    ...
    ...
    # Set up the subjects list
    declare -a subjects
    subjects[1]="su3v3hkaykw2"
    subjects[2]="wxg5mk5u5xbz"
    subjects[3]="6q2bgkqu5grp"
    subjects[4]="whjue68jmwyh"
    subjects[5]="pfx3ju9wz8rr"
    echo "This is sub-job $SGE_TASK_ID"
    echo "This is subject ${subjects[$SGE_TASK_ID]}"
At the end, call your script to process the subject:
    # Your script content goes here...
    myFunc.sh ${subjects[$SGE_TASK_ID]}
Here's another example, which reads a list of subjects from a file into an array.
This way, there's no need to manually assign indexes to your subjects.
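As a rough illustration of the file-based approach (not the linked example itself), the sketch below reads one subject ID per line with the bash builtin `readarray`, which requires bash 4 or later. The helper name `subject_for_task` and the temporary list are assumptions made for the demo.

```shell
#!/bin/bash
# subject_for_task is a hypothetical helper: print the subject found on
# line number <task id> of <file> (one subject ID per line).
subject_for_task() {
    local file=$1 task_id=$2
    local -a subjects
    # readarray fills the array from index 0; -t drops the trailing newlines
    readarray -t subjects < "$file"
    # Task IDs start at 1 and array indexes at 0, so shift by one
    echo "${subjects[$((task_id - 1))]}"
}

# Demo with a temporary list; in a real job the file would already exist.
list=$(mktemp)
printf '%s\n' su3v3hkaykw2 wxg5mk5u5xbz 6q2bgkqu5grp > "$list"
# Fall back to task 2 when run outside the scheduler
echo "This is subject $(subject_for_task "$list" "${SGE_TASK_ID:-2}")"
rm -f "$list"
```

With this layout, the upper bound of the `-t` range should match the number of lines in the subject file.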