class: center, middle, inverse, title-slide

.title[
# Hummingbird for statisticians
]
.author[
### Zehang Richard Li
]
.date[
### 11/16/2023
]

---

<style type="text/css">
.small-code .remark-code{
  font-size: 70%
}
</style>

# About Hummingbird

This is a tutorial for UCSC students who want to use the Hummingbird Computational Cluster. It is aimed at students in the statistics program with no prior exposure to high-performance computing clusters. More information is on the [hummingbird website](https://hummingbird.ucsc.edu/getting-started/).

What do you get?

+ 72 CPU cores maximum per user in parallel
+ Disk quota per user (home directory): 1TB
+ No restriction on CPU-hours.

---

# Log in and load required modules

Access from the terminal (replace [CRUZID] with your CruzID, without the brackets) using your UCSC Gold password:

```
ssh [CRUZID]@hb.ucsc.edu
```

For off-campus access, you will need to use the campus VPN. More information is on [hummingbird get-started](https://hummingbird.ucsc.edu/getting-started/).

This tutorial covers two topics:

1. Using an interactive session to run R, much like what you do on your local computer.
2. Using the slurm system to submit and run multiple R jobs in parallel.

---

# Interactive R session: create a task

Let's start with interactive sessions. They are useful for running simple tasks and for testing your scripts before submitting a large number of jobs. The process is similar to what you do on your own computer, but with a few extra steps before and after.

First, before starting an interactive R session, you need to `salloc` the computing resources you need for the session. For example,

```
salloc --partition=128x24 --time=01:00:00 --mem=500M --cpus-per-task=1
```

+ `--partition=128x24 --cpus-per-task=1` requests one CPU from the partition 128x24. There are 4 public partitions (also called queues) on HB: 128x24, 256x44, Instruction, and 96x24gpu4. Additionally, there is one large-memory partition (1024x28) that public users are allowed to use, but the researcher who owns it gets priority access, and jobs may be cancelled if they decide they need to use it.
+ `--time=01:00:00` specifies a time limit (wall clock). After one hour, the interactive session (and any jobs running in it) is automatically killed.
+ `--mem=500M` requests 500MB of memory. The memory required depends on the job you need to run; the job will be killed if it exceeds the specified amount.

---

# Interactive R session: load software

After you run the `salloc` command, you will see messages such as

```
salloc: Granted job allocation 323055
```

which tells you the job ID (323055); this is useful to keep track of in case you need to look up the status of the session later. You can log into the allocated node using the following command.

```
ssh $SLURM_NODELIST
```

Note that if you do not do this, your job will run on the login node, not on the resources you requested. If your interactive session involves reasonably heavy computation, this can slow down everyone on the cluster when they log in. After running the above command, your current directory is no longer on the login node @hb, but on a specific compute node (e.g., @hbcomp-XXX).

Next, run the following commands on your allocated node to load R-4.1.1

```
module load R/R-4.1.1
R
```

You can use the R session interactively to run simple jobs, load and manipulate data/output from other jobs, etc.

The Environment Modules package provides dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let you choose between different versions of the same software or different combinations of related codes. For more details, see the [hummingbird website](https://hummingbird.ucsc.edu/getting-started/), Section "Accessing Software Applications and Python with Modules".
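
None of the following is cluster-specific, but a few base R commands are handy for confirming that the session is set up the way you expect (the library paths reported by `.libPaths()` come up again on the next two slides):

```r
# Quick sanity checks inside the interactive R session
R.version.string   # confirm the version matches the loaded module (R 4.1.1)
.libPaths()        # current library search path (see the slides on local packages)
sessionInfo()      # platform details and attached packages
```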
---

# Interactive R session: installing R packages

You can install packages locally from the interactive session and test that they work. For example, installing `glmnet` and its dependencies:

```
install.packages("glmnet", dep = TRUE)
```

Packages are installed in your home directory and are accessible to jobs on all partitions, so there is no need to start every job script with package installation once the packages have been installed.

---

# Interactive R session: locally installed R packages

By default, R on the cluster first looks for a package in the module's library and, only if it is not found there, in your local library. This is usually not ideal when you have different versions of packages installed locally, so the following code switches the order:

```
myPaths <- .libPaths()                 # get the paths
myPaths <- c(myPaths[2], myPaths[1])   # switch them
.libPaths(myPaths)
```

To avoid running these three lines every time, you can also set this as the default in the `~/.Rprofile` file:

```
vim ~/.Rprofile
```

and add the following line with your CRUZID replaced:

.small-code[
```{}
.libPaths(c("/hb/home/[CRUZID]/R/x86_64-pc-linux-gnu-library/4.1", "/hb/software/apps/R/gnu-4.1.1/lib64/R/library"))
```
]

---

# Interactive R session: exiting

Exit the node that was allocated, i.e., jump back to the login node.

```
exit
```

Release the allocated resources (yes, a second `exit`).

```
exit
```

Alternatively, you can release the allocated resources using `scancel` and the job ID:

```
scancel 323055
```

---

# Using `sbatch` to deploy parallel jobs

The `salloc` command is part of `slurm`, the system that performs resource scheduling on the cluster. When you have multiple jobs to run in parallel (e.g., cross validation with different seeds/parameters), it is more efficient to use `sbatch` to deploy a sequence of jobs, which are added to the queue of jobs on Hummingbird.

The following example assumes a working directory with the following structure:

```
hb-tutorial/
-- scripts/
   -- job1.R
   -- run-job.sbatch
   -- output/
   -- log/
```

The `.R` file is the main script we want to run multiple times (with different inputs). The `.sbatch` file specifies how the jobs are run.

---

# Batch job example: The R script

In `hb-tutorial/scripts/job1.R`, we include a simple example that takes a single number as input (which controls which case to run) and outputs a saved data frame and a plot. We also load the tidyverse package in the script.

```r
# This is job1.R
# Take the input ID from run-job.sbatch
case = as.numeric(commandArgs(trailingOnly = TRUE)[1])
# Output from print() will appear in .out file
print(paste("This is job ", case))
start.time <- Sys.time()
dir.create("output", showWarnings = FALSE)
library(tidyverse)
df <- data.frame(gp = factor(rep(letters[case:(case + 2)], each = 10)),
                 y = rnorm(30))
g <- ggplot(df, aes(gp, y)) +
  geom_boxplot(aes(x = gp, y = y), size = 2)
save(df, file = paste0("output/case", case, ".RData"))
ggsave(g, file = paste0("output/case", case, ".pdf"))
end.time <- Sys.time()
# Output from message() will appear in .err file
message("Job started on ", start.time,
        "; job finished on ", end.time)
# Output from print() will appear in .out file
print(end.time - start.time)
```

---

# Batch job example: The sbatch script

In the run-job.sbatch file, we specify how to run the job: which module to load, how to start the job, how many resources to allocate, and where the messages are saved.

```
#!/bin/bash
#SBATCH --job-name=some_job_name
#SBATCH --mem=250M
#SBATCH --partition=256x44
#SBATCH --output=log/R%j.out
#SBATCH --error=log/R%j.err
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --mail-type=END,FAIL    # notifications for job done & fail
#SBATCH --mail-user=[YOUR_EMAIL]@ucsc.edu

module load R/R-4.1.1
Rscript job1.R $SLURM_ARRAY_TASK_ID > log/log$SLURM_ARRAY_TASK_ID
```

---

# Batch job example: scp, rsync and file storage system

Create a directory on hb called `hb-tutorial`.

```
mkdir hb-tutorial
```

Sync the local folder `scripts/` to the corresponding folder on hb; run this from your local machine.

```
rsync -au -P -v scripts/ [CRUZID]@hb.ucsc.edu:~/hb-tutorial/scripts/
```

Now check on hb that the folder and files have been uploaded.

```
ls hb-tutorial/scripts
```

---

# Batch job example: run the jobs

Go to the working directory

```
cd hb-tutorial/scripts/
```

Run the jobs with IDs 1 to 10

```
sbatch --array=1-10 run-job.sbatch
```

On the local machine, you can pull the server folder `output/` down using `rsync`

```
rsync -au -v -P [CRUZID]@hb.ucsc.edu:~/hb-tutorial/scripts/output scripts/
```

Note: for more details on the `rsync` command, please see [the official documentation](https://linux.die.net/man/1/rsync) or google it.
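
---

# Batch job example: combining the output in R

Once all the array jobs have finished and the `output/` folder has been synced back (previous slide), the per-case results can be combined into a single data frame for downstream analysis. A minimal sketch, assuming cases 1 to 10 all completed and the files follow the naming in `job1.R` (run from `hb-tutorial/scripts/`, or adjust the paths):

```r
# Combine the per-case data frames saved by job1.R
cases <- 1:10
results <- lapply(cases, function(case) {
  load(paste0("output/case", case, ".RData"))  # loads `df` into this function's environment
  df$case <- case                              # record which case each row came from
  df
})
combined <- do.call(rbind, results)
str(combined)
```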
"Job finished on ", Sys.time()) # Output from print() will appear in .out file print(end.time - start.time) ``` --- # Batch job example: The sbatch script In run-job.sbatch file, we specify how we want to run the job: which module needs to be loaded, how to initiate the job, how much resources to allocate, and where the messages are saved. ``` #!/bin/bash #SBATCH --job-name=some_job_name #SBATCH --mem=250M #SBATCH --partition=256x44 #SBATCH --output=log/R%j.out #SBATCH --error=log/R%j.err #SBATCH --time=01:00:00 #SBATCH --ntasks=1 #SBATCH --mail-type=END,FAIL # notifications for job done & fail #SBATCH --mail-user=[YOUR_EMAIL]@ucsc.edu module load R/R-4.1.1 Rscript job1.R $SLURM_ARRAY_TASK_ID > log/log$SLURM_ARRAY_TASK_ID ``` --- # Batch job example: scp, rsync and file storage system Create a directory on hb called `hb-tutorial`. ``` mkdir hb-tutorial ``` Sync the local folder scripts to the folder on hb, implementing from the local machine. ``` rsync -au -P -v scripts/ [CRUZID]@hb.ucsc.edu:~/hb-tutorial/scripts/ ``` Now check on hb that the folder and files are uploaded ``` ls hb-tutorial/scripts ``` --- # Batch job example: run the jobs Go to the working directory ``` cd hb-tutorial/scripts/ ``` Run the job with ID 1 to 10 ``` sbatch --array=1-10 run-job.sbatch ``` On the local machine, you can pull the server folder output/ down using `rsync` ``` rsync -au -v -P [CRUZID]@hb.ucsc.edu:~/hb-tutorial/scripts/output scripts/ ``` Note: for more details on the `rsync` command, please visit [the official document](https://linux.die.net/man/1/rsync) or google it. --- # More tools on the cluster Use `screen` when log in to create multiple windows that are not automatically killed when you disconnect. ``` screen ``` Checking the current queue ``` squeue ``` Checking the current queue by username ``` squeue -u [CRUZID] ``` --- # More topics not covered in this tutorial + Troubleshoot non-standard libraries (e.g., need specific C++ compiler, java environment, openMP, etc.) + Using parallel computation package in R More information about Hummingbird with a lot of details can be found on [this page](https://hummingbird.ucsc.edu/documentation/hpc-series/), especially in [this set of slides](https://bpb-us-e1.wpmucdn.com/sites.ucsc.edu/dist/2/1143/files/2022/10/HB-Use-and-Etiquette-RParsons-09-23-22.pdf). For more specific questions, you can email hummingbird@ucsc.edu to open a ticket, or join the Hummingbird slack channel (see [here](https://hummingbird.ucsc.edu/) for instructions) to ask a question, or use the [drop-in zoom office hour](https://hummingbird.ucsc.edu/documentation/hummingbird-open-office-hour/).