3 How to run Cell Ranger on HPC

Author

Javier Carpinteyro Ponce

Published

November 5, 2024

A short tutorial on getting started with 10x Genomics Cell Ranger version 8.0.1. The examples below show how to run cellranger count for primary analysis of single-cell/single-nucleus RNA sequencing data. This tutorial assumes that Cell Ranger is installed on your system and fully functional.

3.1 Demultiplex sequence data with BCL-convert

Refer to this tutorial for details on how to run BCL-convert on HPC:
- How to Run BCL-convert on HPC

3.2 Run `cellranger count`

Create a wrapper script, e.g. doScRNA.8.0.1.sh:

#!/bin/bash

module load cellranger/8.0.1

TEMPLATE=/data/10x/processing/slurm.template

SAMPLE=$1         # First positional argument: name of the sample to process
FASTQS=$2         # Second positional argument: full path to the FASTQ files generated by BCL-convert
TRANSCRIPTOME=$3  # Third positional argument: reference transcriptome for the species/organism

cellranger count --jobmode="$TEMPLATE" --id "${SAMPLE}_count" --fastqs "$FASTQS" --sample "$SAMPLE" --transcriptome "$TRANSCRIPTOME" --create-bam true

This is an example of the slurm.template you could use for your system (provided by 10x Genomics). You might need to consult your system administrator for specific settings.

#!/usr/bin/env bash
#
# Copyright (c) 2016 10x Genomics, Inc. All rights reserved.
#
# =============================================================================
# Setup Instructions
# =============================================================================
#
# 1. Add any other necessary Slurm arguments such as partition (-p) or account
#    (-A). If your system requires a walltime (-t), 24 hours (24:00:00) is
#    sufficient.  We recommend you do not remove any arguments below or Martian
#    may not run properly.
#
# 2. Change filename of slurm.template.example to slurm.template.
#
# =============================================================================
# Template
# =============================================================================
#
#SBATCH -J __MRO_JOB_NAME__
#SBATCH --export=ALL
#SBATCH --nodes=1 --ntasks-per-node=__MRO_THREADS__
#SBATCH --signal=2
#SBATCH --no-requeue
### Alternatively: --ntasks=1 --cpus-per-task=__MRO_THREADS__
###   Consult with your cluster administrators to find the combination that
###   works best for single-node, multi-threaded applications on your system.
#SBATCH --mem=__MRO_MEM_GB__G
#SBATCH -o __MRO_STDOUT__
#SBATCH -e __MRO_STDERR__

#SBATCH -p priority
#SBATCH -t 72:0:0

__MRO_CMD__
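
The `__MRO_*__` placeholders in the template are filled in at submission time. To preview what a filled-in job script might look like, the sketch below substitutes them with sed. This is only an illustration: `render_template` is a hypothetical helper, and the job name, log file names, and command are made-up stand-ins, not what Cell Ranger actually submits.

```shell
# Preview a Slurm template with the __MRO_*__ placeholders filled in.
# Hypothetical helper for inspection only; all substituted values are stand-ins.
#   $1: path to the template, $2: number of threads, $3: memory in GB
render_template() {
  sed -e "s/__MRO_JOB_NAME__/preview_job/g" \
      -e "s/__MRO_THREADS__/$2/g" \
      -e "s/__MRO_MEM_GB__/$3/g" \
      -e "s/__MRO_STDOUT__/preview.out/g" \
      -e "s/__MRO_STDERR__/preview.err/g" \
      -e "s/__MRO_CMD__/echo placeholder command/g" \
      "$1"
}
```

For example, `render_template /data/10x/processing/slurm.template 8 64` prints the template as it would look for a stage requesting 8 threads and 64 GB, which can help when checking the `#SBATCH` lines with a system administrator.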

Run the cellranger count wrapper:
```
nohup bash doScRNA.8.0.1.sh \
    SAMPLE \
    /path/to/FastQs/ \
    /path/to/reference/transcriptome/ \
    > /path/to/stdout/file.out &
```
Some more details about the positional arguments:
- SAMPLE is the name of the sample being processed.
- /path/to/FastQs/ is the directory containing the raw sequencing data (FASTQ files). Note that this directory might contain files for multiple samples, depending on how the demultiplexing step was run.
- /path/to/reference/transcriptome/ is the path to the reference genome/transcriptome directory created by cellranger mkref. See below for further instructions.
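
Since the FASTQ directory may hold several samples, it can help to confirm that files for the chosen sample are present before submitting the job. The sketch below assumes Illumina-style file names such as `SAMPLE_S1_L001_R1_001.fastq.gz`; `list_sample_fastqs` is a hypothetical helper, and the pattern may need adjusting if your demultiplexing output is named differently.

```shell
# List the R1 FASTQ files that belong to one sample.
# Assumes Illumina-style names (SAMPLE_S<n>[_L<lane>]_R1_001.fastq.gz).
#   $1: sample name, $2: FASTQ directory
list_sample_fastqs() {
  find "$2" -maxdepth 1 -type f -name "${1}_S*_R1_001.fastq.gz" | sort
}
```

An empty result for `list_sample_fastqs SAMPLE /path/to/FastQs/` suggests that the sample name or the directory does not match the demultiplexing output.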

3.3 Inspect the main visual report

If everything went well, cellranger count should have created the web_summary.html file located in the [SAMPLE]_count/outs/ directory.

Main info to look for and report:

- Estimated number of cells
- Number of clusters
- Fraction Reads in Cells
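
Some of these values can also be read from the command line. The sketch below assumes that the outs/ directory contains a metrics_summary.csv file with one header line and one line of values, and that its column names match the report (for example "Estimated Number of Cells"); `print_metrics` is a hypothetical helper, not part of Cell Ranger.

```shell
# Print "column name<TAB>value" pairs from a two-line CSV (header + one row).
# Handles double-quoted fields that contain commas, e.g. "12,345".
#   $1: path to the CSV file
print_metrics() {
  awk '
    # Split one CSV line into out[1..n], honoring double quotes.
    function split_csv(line, out,    n, i, c, inq, field) {
      n = 0; field = ""; inq = 0
      for (i = 1; i <= length(line); i++) {
        c = substr(line, i, 1)
        if (c == "\"") { inq = !inq }
        else if (c == "," && !inq) { out[++n] = field; field = "" }
        else { field = field c }
      }
      out[++n] = field
      return n
    }
    NR == 1 { nh = split_csv($0, header) }
    NR == 2 {
      split_csv($0, value)
      for (i = 1; i <= nh; i++) printf "%s\t%s\n", header[i], value[i]
    }
  ' "$1"
}
```

A possible use is `print_metrics SAMPLE_count/outs/metrics_summary.csv | grep -i cells`, which narrows the listing to the cell-related rows.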

3.4 Create a custom reference for Cell Ranger

To create a custom reference for cellranger count, run cellranger mkref with the following input files:
- Reference annotation in gtf format
- Reference genome assembly in fasta format
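
A common source of trouble with custom references is a mismatch between the sequence names in the annotation and those in the assembly. The sketch below lists sequence names that appear in the GTF but not in the FASTA; `gtf_contigs_missing_from_fasta` is a hypothetical helper, and it assumes uncompressed input files.

```shell
# Print sequence names used in the GTF that have no matching FASTA header.
#   $1: genome FASTA, $2: annotation GTF (both uncompressed)
gtf_contigs_missing_from_fasta() {
  tmp_fa=$(mktemp)
  tmp_gtf=$(mktemp)
  # FASTA sequence names: header text after ">" up to the first whitespace.
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1" | sort -u > "$tmp_fa"
  # GTF sequence names: first tab-separated column of non-comment lines.
  awk -F '\t' '!/^#/ && NF > 0 { print $1 }' "$2" | sort -u > "$tmp_gtf"
  # comm -13 keeps only the lines unique to the second (GTF) list.
  comm -13 "$tmp_fa" "$tmp_gtf"
  rm -f "$tmp_fa" "$tmp_gtf"
}
```

If `gtf_contigs_missing_from_fasta /full/path/genome.fasta /full/path/annotation.gtf` prints nothing, every sequence name in the annotation was found in the assembly.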

Create a cellranger mkref wrapper script, e.g. `doMkref.8.0.1.sh`:

#!/usr/bin/env bash
#
#
# =============================================================================
# Job Script
# Auth: 10x Genomics; Javier Carpinteyro-Ponce
# =============================================================================
#
#SBATCH -J CR_mkref
#SBATCH --export=ALL
#SBATCH --nodes=1 --ntasks-per-node=24
#SBATCH --signal=2
#SBATCH --no-requeue
### Alternatively: --ntasks=1 --cpus-per-task={NUM_THREADS}
###   Consult with your cluster administrators to find the combination that
###   works best for single-node, multi-threaded applications on your system.
#SBATCH --mem=400G
#SBATCH -o mkref_%j.log
#SBATCH -e mkref_%j.err

module load cellranger/8.0.1

# Check if the number of arguments is a multiple of 3
if (( $# % 3 != 0 )); then
  echo -e "Error: You need to provide either 3 or 6 arguments in the following order:\n \
  1) species name (or desired outdir name)\n \
  2) genome assembly (/full/path/genome.fa)\n \
  3) genes in gtf format (/full/path/genes.gtf)\n\n \
  If 2 species/genomes, arguments need to be in the following order:\n \
  1) species 1 name (only species name given that this and species 2 name will be concatenated to create final output dir name) \n \
  2) genome assembly species 1 (/full/path/sp1genome.fa)\n \
  3) genes in gtf format for species 1 (/full/path/sp1genes.gtf)\n \
  4) species 2 name (only species name)\n \
  5) genome assembly species 2 (/full/path/sp2genome.fa)\n \
  6) genes in gtf format for species 2 (/full/path/sp2genes.gtf)"
  exit 1
fi

case $# in
  3)
    # Code to execute when there are 3 arguments
    echo "Running cellranger mkref for a single species genome:"
    echo "Outdir name: $1"
    echo "Genome assembly: $2"
    echo "Genes: $3"
    cellranger mkref --genome="$1" --fasta="$2" --genes="$3" --memgb=40 --localmem=400 --localcores=24
    ;;
  6)
    # Code to execute when there are 6 arguments
    echo "Running cellranger mkref for 2 species genomes:"
    echo "Outdir name: ${1}_and_${4}"
    echo "Genome assembly species 1: $2"
    echo "Genes species 1: $3"
    echo "Genome assembly species 2: $5"
    echo "Genes species 2: $6"
    cellranger mkref --genome="$1" --fasta="$2" --genes="$3" --genome="$4" --fasta="$5" --genes="$6" --memgb=400 --localmem=400 --localcores=8
    ;;
  *)
    echo -e "Error: You need to provide either 3 or 6 arguments in the following order:\n \
    1) species name (or desired outdir name)\n \
    2) genome assembly (/full/path/genome.fa)\n \
    3) genes in gtf format (/full/path/genes.gtf)\n\n \
    If 2 species/genomes, arguments need to be in the following order:\n \
    1) species 1 name (only species name given that this and species 2 name will be concatenated to create final output dir name) \n \
    2) genome assembly species 1 (/full/path/sp1genome.fa)\n \
    3) genes in gtf format for species 1 (/full/path/sp1genes.gtf)\n \
    4) species 2 name (only species name)\n \
    5) genome assembly species 2 (/full/path/sp2genome.fa)\n \
    6) genes in gtf format for species 2 (/full/path/sp2genes.gtf)"
    exit 1
    ;;
esac

The script above is designed to create a reference from either 1 or 2 genomes/transcriptomes. The 2-genome case can be useful when processing samples in which both a host and its symbiont are of interest.
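
Because a mistyped path only surfaces once the job starts running, it may be worth checking the arguments before submitting. The following is a minimal sketch of such a check, not part of the wrapper above; `validate_mkref_args` is a hypothetical helper that accepts the same 3 or 6 arguments in the same order.

```shell
# Check that 3 or 6 arguments were given and that every FASTA/GTF is readable.
# Arguments come in triples: name, genome FASTA, annotation GTF.
validate_mkref_args() {
  if [ "$#" -ne 3 ] && [ "$#" -ne 6 ]; then
    echo "Expected 3 or 6 arguments, got $#" >&2
    return 1
  fi
  while [ "$#" -gt 0 ]; do
    for input in "$2" "$3"; do
      if [ ! -r "$input" ]; then
        echo "Cannot read input file for $1: $input" >&2
        return 1
      fi
    done
    shift 3
  done
}
```

A call such as `validate_mkref_args species_name /full/path/genome.fasta /full/path/annotation.gtf` returns a non-zero status and prints a message when something is missing.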

Run the cellranger mkref wrapper:

sbatch -p partition -t 24:0:0 \
    doMkref.8.0.1.sh \
    species_name \
    /full/path/genome.fasta \
    /full/path/annotation.gtf

species_name is a custom name of your choice; it is used to name the main output directory of the reference.

3.4.1 Run `cellranger mkref` for two species

The same doMkref.8.0.1.sh script also accepts 6 arguments, which describe 2 different species. For example, when creating a reference for Aiptasia and a symbiotic alga:

# Arguments 1-3 describe species 1 (name, genome, annotation);
# arguments 4-6 describe species 2 in the same order.
sbatch -p partition -t 24:0:0 \
    doMkref.8.0.1.sh \
    species1 \
    /full/path/species1_genome.fasta \
    /full/path/species1_annotation.gtf \
    species2 \
    /full/path/species2_genome.fasta \
    /full/path/species2_annotation.gtf

Large and very fragmented genomes!

For large and very fragmented genomes (i.e. with a very high number of contigs/scaffolds), STAR may raise errors while building the reference. This was the case for the combined aipSp1 and Smic reference, where the STAR error log suggested adding the --limitSjdbInsertNsj 1184844 argument. To apply this to the Cell Ranger installation, the file cellranger-x.y.z/lib/python/cellranger/reference_builder.py (starting at line 438) needed to be modified:

args = [
    os.path.join(_LIB_BIN, "STAR"),
    "--runMode",
    "genomeGenerate",
    # Added: raise STAR's limit on inserted splice junctions
    "--limitSjdbInsertNsj",
    "1184844",
    "--genomeDir",
    self.reference_star_path,
    "--runThreadN",
    str(num_threads),
    "--genomeFastaFiles",
    in_fasta_fn,
    "--sjdbGTFfile",
    in_gtf_fn,
]
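
The line number given above may differ between Cell Ranger versions. As a rough aid, the sketch below searches a file for the STAR genomeGenerate call so that the argument list can be found without relying on a fixed line number; `locate_star_args` is a hypothetical helper.

```shell
# Print the line numbers that mention STAR's genomeGenerate run mode.
#   $1: path to reference_builder.py
locate_star_args() {
  grep -n 'genomeGenerate' "$1" | cut -d: -f1
}
```

For example, `locate_star_args cellranger-x.y.z/lib/python/cellranger/reference_builder.py` prints the matching line numbers. Keeping a backup copy of the file before editing makes it easy to revert the change.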