
Clusters

Pipelines

Luigi vs Airflow vs Pinball

Mesos

Kubernetes

Middleware

SAGA

DRMAA

  • https://www.drmaa.org/
  • https://en.wikipedia.org/wiki/DRMAA
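
DRMAA (Distributed Resource Management Application API) is a vendor-neutral API for submitting and controlling jobs on schedulers such as SGE, Slurm, LSF, or PBS without hard-coding their command-line tools. As a rough illustration, here is a minimal sketch using the drmaa-python binding; it assumes a DRMAA-capable scheduler with its libdrmaa installed, and `sleeper.sh` is a hypothetical job script used only for the example:

```python
import os
import drmaa

def main():
    # Open a DRMAA session against whichever scheduler libdrmaa points to
    with drmaa.Session() as s:
        jt = s.createJobTemplate()
        jt.remoteCommand = os.path.join(os.getcwd(), 'sleeper.sh')  # hypothetical payload
        jt.args = ['60']
        jt.joinFiles = True  # merge stdout and stderr

        job_id = s.runJob(jt)
        print('Submitted job %s' % job_id)

        # Block until the job finishes and report its exit status
        info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
        print('Job %s finished with exit status %s' % (info.jobId, info.exitStatus))

        s.deleteJobTemplate(jt)

if __name__ == '__main__':
    main()
```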

Batch processing

HTCondor

- HTCondor: open-source high-throughput computing workload manager developed at the University of Wisconsin–Madison.

LSF family

  • Evil for taking down OpenLava.
  • http://www-03.ibm.com/systems/platformcomputing/products/lsf

Platform LSF

  • Proprietary IBM grid engine.

OpenLava

- Developed by former Platform LSF employees.
- Taken down by IBM, which sued for alleged copyright infringement.

Slurm

- Slurm Workload Manager (formerly SLURM, the Simple Linux Utility for Resource Management): open-source scheduler used on many large HPC clusters.

See also


Comparison

| User commands | PBS/Torque | SGE | Slurm | LSF | LoadLeveler |
| --- | --- | --- | --- | --- | --- |
| Job submission | qsub [script_file] | qsub [script_file] | sbatch [script_file] | bsub [script_file] | llsubmit [script_file] |
| Job deletion | qdel [job_id] | qdel [job_id] | scancel [job_id] | bkill [job_id] | llcancel [job_id] |
| Job status (by job) | qstat [job_id] | qstat -u * [-j job_id] | squeue [job_id] | bjobs [job_id] | llq -u [username] |
| Job status (by user) | qstat -u [user_name] | qstat [-u user_name] | squeue -u [user_name] | bjobs -u [user_name] | llq -u [user_name] |
| Job hold | qhold [job_id] | qhold [job_id] | scontrol hold [job_id] | bstop [job_id] | llhold -r [job_id] |
| Job release | qrls [job_id] | qrls [job_id] | scontrol release [job_id] | bresume [job_id] | llhold -r [job_id] |
| Queue list | qstat -Q | qconf -sql | squeue | bqueues | llclass |
| Node list | pbsnodes -l | qhost | sinfo -N OR scontrol show nodes | bhosts | llstatus -L machine |
| Cluster status | qstat -a | qhost -q | sinfo | bqueues | llstatus -L cluster |
| GUI | xpbsmon | qmon | sview | xlsf OR xlsbatch | xload |
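
As a rough sketch of how the submission column above translates into practice, the snippet below wraps the submit commands from the "Job submission" row with subprocess calls. It assumes the relevant scheduler CLI is on $PATH; the dictionary and function names are made up for illustration:

```python
import subprocess

# Submit command per scheduler, taken from the "Job submission" row above
SUBMIT_CMD = {
    'pbs': 'qsub',
    'sge': 'qsub',
    'slurm': 'sbatch',
    'lsf': 'bsub',          # bsub is often fed the script on stdin so #BSUB directives are parsed
    'loadleveler': 'llsubmit',
}

def submit(scheduler, script_file):
    """Submit script_file to the given scheduler and return the raw CLI output.

    Each scheduler prints the new job ID in its own format, so parsing the
    ID out of the output is left to the caller.
    """
    cmd = [SUBMIT_CMD[scheduler], script_file]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Example: print(submit('slurm', 'job.sh'))
```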

| Environment | PBS/Torque | SGE | Slurm | LSF | LoadLeveler |
| --- | --- | --- | --- | --- | --- |
| Job ID | $PBS_JOBID | $JOB_ID | $SLURM_JOBID | $LSB_JOBID | $LOADL_STEP_ID |
| Submit Directory | $PBS_O_WORKDIR | $SGE_O_WORKDIR | $SLURM_SUBMIT_DIR | $LSB_SUBCWD | $LOADL_STEP_INITDIR |
| Submit Host | $PBS_O_HOST | $SGE_O_HOST | $SLURM_SUBMIT_HOST | $LSB_SUB_HOST | |
| Node List | $PBS_NODEFILE | $PE_HOSTFILE | $SLURM_JOB_NODELIST | $LSB_HOSTS / $LSB_MCPU_HOSTS | $LOADL_PROCESSOR_LIST |
| Job Array Index | $PBS_ARRAYID | $SGE_TASK_ID | $SLURM_ARRAY_TASK_ID | $LSB_JOBINDEX | |
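
When the same script has to run under more than one of these schedulers, it can simply probe the environment variables from the table above. A minimal sketch (variable names come straight from the table; the helper functions are hypothetical):

```python
import os

# Per-scheduler environment variables, taken from the table above
JOB_ID_VARS = ('PBS_JOBID', 'JOB_ID', 'SLURM_JOBID', 'LSB_JOBID', 'LOADL_STEP_ID')
ARRAY_INDEX_VARS = ('PBS_ARRAYID', 'SGE_TASK_ID', 'SLURM_ARRAY_TASK_ID', 'LSB_JOBINDEX')

def current_job_id():
    """Return the job ID set by whichever scheduler launched this process, or None."""
    for var in JOB_ID_VARS:
        if var in os.environ:
            return os.environ[var]
    return None

def current_array_index():
    """Return the job-array index as an int, or None if this is not an array task."""
    for var in ARRAY_INDEX_VARS:
        if var in os.environ:
            return int(os.environ[var])
    return None
```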

| Job specification | PBS/Torque | SGE | Slurm | LSF | LoadLeveler |
| --- | --- | --- | --- | --- | --- |
| Script directive | #PBS | #$ | #SBATCH | #BSUB | #@ |
| Queue | -q [queue] | -q [queue] | -p [queue] | -q [queue] | class=[queue] |
| Node Count | -l nodes=[count] | N/A | -N [min[-max]] | -n [count] | node=[count] |
| CPU Count | -l ppn=[count] OR -l mppwidth=[PE_count] | -pe [PE] [count] | -n [count] | -n [count] | |
| Wall Clock Limit | -l walltime=[hh:mm:ss] | -l h_rt=[seconds] | -t [min] OR -t [days-hh:mm:ss] | -W [hh:mm:ss] | wall_clock_limit=[hh:mm:ss] |
| Standard Output File | -o [file_name] | -o [file_name] | -o [file_name] | -o [file_name] | output=[file_name] |
| Standard Error File | -e [file_name] | -e [file_name] | -e [file_name] | -e [file_name] | error=[file_name] |
| Combine stdout/err | -j oe (both to stdout) OR -j eo (both to stderr) | -j yes | (use -o without -e) | (use -o without -e) | |
| Copy Environment | -V | -V | --export=[ALL \| NONE \| variables] | | |
| Event Notification | -m abe | -m abe | --mail-type=[events] | -B OR -N | notification=start |
| Email Address | -M [address] | -M [address] | --mail-user=[address] | -u [address] | notify_user=[address] |
| Job Name | -N [name] | -N [name] | --job-name=[name] | -J [name] | job_name=[name] |
| Job Restart | -r [y \| n] | -r [yes \| no] | --requeue OR --no-requeue (NOTE: configurable default) | | |
| Working Directory | N/A | -wd [directory] | --workdir=[dir_name] | (submission directory) | initialdir=[directory] |
| Resource Sharing | -l naccesspolicy=singlejob | -l exclusive | --exclusive OR --shared | -x | node_usage=not_shared |
| Memory Size | -l mem=[MB] | -l mem_free=[memory][K \| M \| G] | --mem=[mem][M \| G \| T] | | |
| Account to charge | -W group_list=[account] | -A [account] | --account=[account] | -P [account] | |
| Tasks Per Node | -l mppnppn [PEs_per_node] | (Fixed allocation_rule in PE) | --tasks-per-node=[count] | | tasks_per_node=[count] |
| CPUs Per Task | | | --cpus-per-task=[count] | | |
| Job Dependency | -d [job_id] | -hold_jid [job_id \| job_name] | --depend=[state:job_id] | -w [done \| exit \| finish] | |
| Job Project | | -P [name] | --wckey=[name] | -P [name] | |
| Job host preference | | -q [queue]@[node] OR -q [queue]@@[hostgroup] | --nodelist=[nodes] AND/OR --exclude=[nodes] | -m [nodes] | |
| Quality of Service | -l qos=[name] | | --qos=[name] | | |
| Job Arrays | -t [array_spec] | -t [array_spec] | --array=[array_spec] (Slurm version 2.6+) | -J "name[array_spec]" | |
| Generic Resources | -l other=[resource_spec] | -l [resource]=[value] | --gres=[resource_spec] | | |
| Licenses | | -l [license]=[count] | --licenses=[license_spec] | -R "rusage[license_spec]" | |
| Begin Time | -A "YYYY-MM-DD HH:MM:SS" | -a [YYMMDDhhmm] | --begin=YYYY-MM-DD[THH:MM[:SS]] | -b [[year:][month:]day:]hour:minute | |
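
To make the directive column concrete, the sketch below writes a small Slurm batch script using #SBATCH directives from the table (queue, node count, task count, wall clock limit, stdout/stderr files, job name) and hands it to sbatch. It assumes sbatch is on $PATH; the function name, file name, and resource values are arbitrary examples:

```python
import subprocess
import textwrap

def submit_slurm_job(name, command, queue='normal', nodes=1, ntasks=4, walltime='01:00:00'):
    """Write a batch script with #SBATCH directives from the table and submit it via sbatch."""
    script = textwrap.dedent(f"""\
        #!/bin/bash
        #SBATCH --job-name={name}
        #SBATCH -p {queue}
        #SBATCH -N {nodes}
        #SBATCH -n {ntasks}
        #SBATCH -t {walltime}
        #SBATCH -o {name}.out
        #SBATCH -e {name}.err

        {command}
        """)
    path = f'{name}.sh'
    with open(path, 'w') as fh:
        fh.write(script)
    out = subprocess.run(['sbatch', path], capture_output=True, text=True, check=True)
    # sbatch prints e.g. "Submitted batch job 123456"; return the trailing job ID
    return out.stdout.strip().split()[-1]

# Example: job_id = submit_slurm_job('demo', 'srun hostname')
```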