3  How to run Cell Ranger on HPC

Author

Javier Carpinteyro Ponce

Published

November 5, 2024

A short tutorial that shows how to get started with 10x Genomics Cell Ranger version 8.0.1. The examples below show how to run cellranger count for primary analysis of single-cell/nuclei RNA sequencing data. This tutorial assumes that Cell Ranger is already installed on your system and fully functional.

3.1 Demultiplex sequence data with BCL-convert

  1. Refer to this tutorial for details on how to run BCL-convert on HPC:

3.2 Run cellranger count

  1. Create a wrapper script, e.g. doScRNA.8.0.1.sh:

    #!/bin/bash
    
    module load cellranger/8.0.1
    
    TEMPLATE=/data/10x/processing/slurm.template
    
    SAMPLE=$1        # First positional argument: name of the sample to be processed
    FASTQS=$2        # Second positional argument: full path to the FASTQ files generated by BCL Convert
    TRANSCRIPTOME=$3 # Third positional argument: reference transcriptome for a specific species/organism
    
    # Variables are quoted so that paths containing spaces are passed intact
    cellranger count --jobmode="$TEMPLATE" --id "${SAMPLE}_count" --fastqs "$FASTQS" --sample "$SAMPLE" --transcriptome "$TRANSCRIPTOME" --create-bam true

    Below is an example of the slurm.template you could use on your system (based on the one provided by 10x Genomics). The partition (-p) and walltime (-t) lines near the end are site-specific, so you might need to consult your system administrator for the right settings.

    #!/usr/bin/env bash
    #
    # Copyright (c) 2016 10x Genomics, Inc. All rights reserved.
    #
    # =============================================================================
    # Setup Instructions
    # =============================================================================
    #
    # 1. Add any other necessary Slurm arguments such as partition (-p) or account
    #    (-A). If your system requires a walltime (-t), 24 hours (24:00:00) is
    #    sufficient.  We recommend you do not remove any arguments below or Martian
    #    may not run properly.
    #
    # 2. Change filename of slurm.template.example to slurm.template.
    #
    # =============================================================================
    # Template
    # =============================================================================
    #
    #SBATCH -J __MRO_JOB_NAME__
    #SBATCH --export=ALL
    #SBATCH --nodes=1 --ntasks-per-node=__MRO_THREADS__
    #SBATCH --signal=2
    #SBATCH --no-requeue
    ### Alternatively: --ntasks=1 --cpus-per-task=__MRO_THREADS__
    ###   Consult with your cluster administrators to find the combination that
    ###   works best for single-node, multi-threaded applications on your system.
    #SBATCH --mem=__MRO_MEM_GB__G
    #SBATCH -o __MRO_STDOUT__
    #SBATCH -e __MRO_STDERR__
    
    #SBATCH -p priority
    #SBATCH -t 72:0:0
    
    __MRO_CMD__
  2. Run the cellranger count wrapper:

    nohup bash doScRNA.8.0.1.sh \
        SAMPLE \
        /path/to/FastQs/ \
        /path/to/reference/transcriptome/ \
        > /path/to/stdout/file.out &

    Some more details about the positional arguments:

    • SAMPLE is the name of the sample being processed.

    • /path/to/FastQs/ is the directory containing the FASTQ files for that sample. Note that this directory might contain multiple samples, depending on how the demultiplexing step was run.

    • /path/to/reference/transcriptome/ is the path to the reference genome/transcriptome directory created by cellranger mkref. See section 3.4 for further instructions.
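Because one FASTQ directory can hold several samples, it may help to list the sample names before launching the wrapper. The sketch below assumes the files follow the usual BCL Convert naming pattern (SAMPLE_S1_L001_R1_001.fastq.gz, with the lane part optional). list_samples is a hypothetical helper written for this tutorial, not part of Cell Ranger.

```shell
#!/bin/bash
# list_samples DIR: print the unique sample names found in DIR.
# Assumption: FASTQ names look like SAMPLE_S1_L001_R1_001.fastq.gz
# (the _L001 lane part is optional).
list_samples() {
    local dir=$1 f
    for f in "$dir"/*_R1_001.fastq.gz; do
        [ -e "$f" ] || continue            # the glob matched nothing
        basename "$f"
    done |
        sed -E 's/_S[0-9]+(_L[0-9]+)?_R1_001\.fastq\.gz$//' |
        grep -v '^Undetermined$' |         # skip reads not assigned to any sample
        sort -u
}
```

Calling list_samples /path/to/FastQs/ prints one sample name per line. Each name can then be passed as the first argument to doScRNA.8.0.1.sh.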

3.3 Inspect the main visual report

If everything went well, cellranger count should have created the web_summary.html file located in the [SAMPLE]_count/outs/ directory.
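When several samples have been processed, a quick way to see which runs finished is to look for that file in each output directory. The following is a minimal sketch that assumes it is run from the directory where cellranger count was launched. report_summaries is a hypothetical helper name.

```shell
#!/bin/bash
# report_summaries SAMPLE...: for each sample, report whether
# SAMPLE_count/outs/web_summary.html exists in the current directory.
# Returns 1 if at least one summary is missing, 0 otherwise.
report_summaries() {
    local sample status=0
    for sample in "$@"; do
        if [ -f "${sample}_count/outs/web_summary.html" ]; then
            echo "${sample}: found"
        else
            echo "${sample}: missing"
            status=1
        fi
    done
    return "$status"
}
```

For example, report_summaries SAMPLE1 SAMPLE2 prints one line per sample and returns a non-zero status if any report is missing.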

Main info to look for and report:

  • Estimated number of cells

  • Number of clusters

  • Fraction Reads in Cells

3.4 Create a custom reference for Cell Ranger

To create a custom reference for cellranger count, run cellranger mkref as follows:

  1. Files needed:

    • Reference annotation in GTF format

    • Reference genome assembly in FASTA format

  2. Create a cellranger mkref wrapper script, e.g. doMkref.8.0.1.sh:

    #!/usr/bin/env bash
    #
    #
    # =============================================================================
    # Job Script
    # Auth: 10x Genomics; Javier Carpinteyro-Ponce
    # =============================================================================
    #
    #SBATCH -J CR_mkref
    #SBATCH --export=ALL
    #SBATCH --nodes=1 --ntasks-per-node=24
    #SBATCH --signal=2
    #SBATCH --no-requeue
    ### Alternatively: --ntasks=1 --cpus-per-task={NUM_THREADS}
    ###   Consult with your cluster administrators to find the combination that
    ###   works best for single-node, multi-threaded applications on your system.
    #SBATCH --mem=400G
    #SBATCH -o mkref_%j.log
    #SBATCH -e mkref_%j.err
    
    module load cellranger/8.0.1
    
    # Print usage instructions to stderr
    usage() {
      printf '%s\n' \
        "Error: You need to provide exactly 3 arguments (1 species) or 6 arguments (2 species), in the following order:" \
        "  1) species name (or desired outdir name)" \
        "  2) genome assembly (/full/path/genome.fa)" \
        "  3) genes in gtf format (/full/path/genes.gtf)" \
        "" \
        "If 2 species/genomes, arguments need to be in the following order:" \
        "  1) species 1 name (only species name, given that this and species 2 name will be concatenated to create the final output dir name)" \
        "  2) genome assembly species 1 (/full/path/sp1genome.fa)" \
        "  3) genes in gtf format for species 1 (/full/path/sp1genes.gtf)" \
        "  4) species 2 name (only species name)" \
        "  5) genome assembly species 2 (/full/path/sp2genome.fa)" \
        "  6) genes in gtf format for species 2 (/full/path/sp2genes.gtf)" >&2
    }
    
    case $# in
      3)
        # Single species: name, genome FASTA, genes GTF
        echo "Running cellranger mkref for a single species genome:"
        echo "Outdir name: $1"
        echo "Genome assembly: $2"
        echo "Genes: $3"
        # --localcores is kept within the 24 CPUs requested with SBATCH above
        cellranger mkref --genome="$1" --fasta="$2" --genes="$3" --memgb=40 --localmem=400 --localcores=24
        ;;
      6)
        # Two species: name, genome FASTA, genes GTF for each one
        echo "Running cellranger mkref for 2 species genomes:"
        echo "Outdir name: ${1}_and_${4}"
        echo "Genome assembly species 1: $2"
        echo "Genes species 1: $3"
        echo "Genome assembly species 2: $5"
        echo "Genes species 2: $6"
        cellranger mkref --genome="$1" --fasta="$2" --genes="$3" --genome="$4" --fasta="$5" --genes="$6" --memgb=400 --localmem=400 --localcores=8
        ;;
      *)
        # Any other number of arguments is an error
        usage
        exit 1
        ;;
    esac

    The script above is designed to create a reference with either 1 or 2 genomes/transcriptomes. The 2-genome case can be useful when processing samples where both a host and its symbiont are of interest.

  3. Run the cellranger mkref wrapper:

    sbatch -p partition -t 24:0:0 \
        doMkref.8.0.1.sh \
        species_name \
        /full/path/genome.fasta \
        /full/path/annotation.gtf

    species_name is a custom name of your choice. It is used to name the main output directory of the reference.
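Before submitting the job, it can be worth confirming that the two input files agree with each other, since the sequence names in the first column of the GTF need to match the headers of the FASTA. The sketch below is one possible pre-flight check. gtf_contigs_missing_from_fasta is a hypothetical helper, and it assumes uncompressed, tab-separated input files.

```shell
#!/bin/bash
# gtf_contigs_missing_from_fasta FASTA GTF
# Print each sequence name used in column 1 of the GTF that has no matching
# FASTA header (header name = text after '>' up to the first whitespace).
gtf_contigs_missing_from_fasta() {
    awk -F '\t' '
        FNR == NR {                          # first file: the FASTA
            if (substr($0, 1, 1) == ">") {
                name = substr($0, 2)
                sub(/[ \t].*/, "", name)     # drop the header description
                in_fasta[name] = 1
            }
            next
        }
        /^#/ || NF == 0 { next }             # second file: skip GTF comments and blank lines
        !($1 in in_fasta) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}
```

An empty output from gtf_contigs_missing_from_fasta /full/path/genome.fasta /full/path/annotation.gtf means every annotated sequence name was found in the assembly.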

3.4.1 Run cellranger mkref for two species

The same doMkref.8.0.1.sh script also accepts 6 arguments, which correspond to the information for 2 different species. For example, when creating a reference for Aiptasia and a symbiotic alga:

# Arguments, in order: species 1 name, genome, annotation; then species 2 name, genome, annotation
sbatch -p partition -t 24:0:0 \
    doMkref.8.0.1.sh \
    species1 \
    /full/path/species1_genome.fasta \
    /full/path/species1_annotation.gtf \
    species2 \
    /full/path/species2_genome.fasta \
    /full/path/species2_annotation.gtf
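Once the job has finished, the output directory can be given a quick sanity check. The sketch below builds the expected directory name (names joined with _and_, as echoed by the wrapper script) and checks for a reference.json file and a star/ subdirectory. That layout is an assumption about what a finished reference contains, and both function names are hypothetical.

```shell
#!/bin/bash
# ref_dir_name NAME [NAME...]: join genome names with "_and_", which is how
# the wrapper script reports the output directory for a 2-species reference.
ref_dir_name() {
    local name=$1 extra
    shift
    for extra in "$@"; do
        name="${name}_and_${extra}"
    done
    printf '%s\n' "$name"
}

# looks_like_reference DIR: succeed if DIR contains reference.json and star/.
# Assumption: these two entries are present in a completed reference.
looks_like_reference() {
    [ -f "$1/reference.json" ] && [ -d "$1/star" ]
}
```

For example, looks_like_reference "$(ref_dir_name species1 species2)" checks the species1_and_species2 directory in the current working directory.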
Large and very fragmented genomes!

For large and very fragmented genomes (e.g. assemblies with a very high number of contigs/scaffolds), the STAR indexing step run by cellranger mkref may fail. This was the case for the combined aipSp1 and Smic reference, where the STAR error log suggested adding the --limitSjdbInsertNsj 1184844 argument. To apply this in the Cell Ranger installation, the STAR argument list in cellranger-x.y.z/lib/python/cellranger/reference_builder.py (starting at line 438) needed to be modified as follows:

args = [
    os.path.join(_LIB_BIN, "STAR"),
    "--runMode",
    "genomeGenerate",
    # Added: limit suggested by the STAR error log (option and value as separate items)
    "--limitSjdbInsertNsj",
    "1184844",
    "--genomeDir",
    self.reference_star_path,
    "--runThreadN",
    str(num_threads),
    "--genomeFastaFiles",
    in_fasta_fn,
    "--sjdbGTFfile",
    in_gtf_fn,
]
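Two small checks can help with this procedure: counting the sequences in the assembly to gauge how fragmented it is, and confirming that the modified file actually contains the new STAR option. The sketch below uses only grep. Both function names are hypothetical, and the path to reference_builder.py depends on your installation.

```shell
#!/bin/bash
# count_contigs FASTA: print the number of sequences (header lines) in FASTA.
count_contigs() {
    grep -c '^>' "$1"
}

# star_limit_patched FILE: succeed if FILE mentions --limitSjdbInsertNsj.
star_limit_patched() {
    grep -q -e '--limitSjdbInsertNsj' "$1"
}
```

For example, count_contigs /full/path/genome.fasta prints the contig count, and star_limit_patched cellranger-x.y.z/lib/python/cellranger/reference_builder.py returns success only after the edit has been saved.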