Purpose

This pipeline is intended to:

As the lab is beginning to run many of these libraries, a simple Python script was written to automate much of the repetition in this process.

Requirements

This document assumes you ran paired-end RNAseq with UMIs and have access to the FASTQ files.

This document focuses heavily on a small Python script called Windchime, which handles sbatch shell script generation. When run, it generates two sbatch shell scripts. The first script aligns reads using STAR and deduplicates them using UMI-tools. The second script uses bedtools genomecov to calculate read coverage for each genome feature.

Step zero - Set up environment

Step one - Set up configuration file

Windchime uses a tab-delimited configuration file to define the parameters and samples for a run. Configuration files come in two varieties: one for samples with replicates (Table 1) and one for samples without (Table 2).

Each table requires the same kinds of run parameters (noted as ‘meta’ in column 1). Lines can be commented out by using a hash ‘#’ character at the start of the line.

Table 1 - Example of configuration file for samples with replicates

#data_type  variable    value
meta    set_name    Project_Windchime_HGG72DRXY
meta    data_dir    /scratch/cgsb/gencore/out/Gresham/2021-08-09_HGG72DRXY/merged/
meta    work_dir    /scratch/cgsb/gresham/LABSHARE/Data/HGG72DRXY/
meta    output_dir  /scratch/cgsb/gresham/LABSHARE/Data/HGG72DRXY/results/
meta    genome_fa   /scratch/cgsb/gresham/pieter/genome/ensembl_50/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.fa
meta    genome_features /scratch/cgsb/gresham/pieter/genome/Saccharomyces_cerevisiae.R64-1-1.dna.toplevel.gtf
meta    intron_max  100
meta    adapter_seq_R1  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
meta    adapter_seq_R2  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
meta    prefix  HGG72DRXY
meta    fastq_name_template {prefix}_n01_{strain}_{replicate}
meta    qc_locus  XII:451439-468985
sample  1   1
sample  1   2
sample  1   3
sample  1222    1
sample  1222    2
sample  1222    3
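
The fastq_name_template line in Table 1 combines the prefix with the strain and replicate columns of each sample line. As a minimal sketch of how such a template could be filled in, assuming it behaves like a Python format string (the helper name expand_fastq_name is hypothetical and is not taken from windchime.py):

```python
# Hypothetical helper for illustration; not part of windchime.py.
def expand_fastq_name(template, prefix, strain, replicate=None):
    """Fill the placeholders of a fastq_name_template value."""
    fields = {"prefix": prefix, "strain": strain}
    if replicate is not None:
        # Only configuration files with replicates carry a third column.
        fields["replicate"] = replicate
    return template.format(**fields)


# Values taken from Table 1: prefix HGG72DRXY, sample line "1222  2".
name = expand_fastq_name("{prefix}_n01_{strain}_{replicate}", "HGG72DRXY", "1222", "2")
print(name)
```

This only fills the placeholders. How Windchime turns the result into a full FASTQ path (read number, file extension) is not described in this document.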

Table 2 - Example of configuration file for samples without replicates

#data_type  variable    value
meta    set_name    Project_CAlbicans
meta    data_dir    /scratch/cgsb/gencore/out/Gresham/2021-10-25_HK7CCDRXY/merged/
meta    work_dir    /scratch/cgsb/gresham/LABSHARE/Data/HK7CCDRXY/
meta    output_dir  /scratch/cgsb/gresham/LABSHARE/Data/HK7CCDRXY/results/
meta    genome_fa   /scratch/cgsb/gresham/pieter/genome/Candida_albicans_sc5314/C_albicans_SC5314_A22_current_chromosomes.fasta
meta    genome_features /scratch/cgsb/gresham/pieter/genome/Candida_albicans_sc5314/C_albicans_SC5314_A22_current_features.gff
meta    intron_max  500
meta    adapter_seq_R1  AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
meta    adapter_seq_R2  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
meta    prefix  HK7CCDRXY
meta    fastq_name_template {prefix}_n01_{strain}
meta    qc_locus    Ca22chrRA_C_albicans_SC5314:1891108-1896912
sample  CAWT1_C-LIM_2H
sample  CAWT1_C-LIM_5H
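
To check a configuration file before generating any sbatch scripts, it can help to read it back in. The following is a rough sketch of a reader for the layout shown in Tables 1 and 2, assuming tab-separated columns. The function parse_config is a hypothetical name and does not reproduce Windchime's own parser:

```python
import io


def parse_config(handle):
    """Split a Windchime-style config into run parameters and sample rows."""
    meta, samples = {}, []
    for line in handle:
        # Drop empty cells so that runs of consecutive tabs are tolerated.
        cells = [c.strip() for c in line.split("\t") if c.strip()]
        if not cells or cells[0].startswith("#"):
            continue  # blank line or commented-out line
        kind, values = cells[0], cells[1:]
        if kind == "meta" and len(values) == 2:
            meta[values[0]] = values[1]
        elif kind == "sample" and values:
            # (strain, replicate) with replicates, (strain,) without.
            samples.append(tuple(values))
        else:
            raise ValueError(f"Unrecognised config line: {line!r}")
    return meta, samples


example = (
    "#data_type\tvariable\tvalue\n"
    "meta\tprefix\tHGG72DRXY\n"
    "meta\tintron_max\t100\n"
    "sample\t1222\t1\n"
    "sample\t1222\t2\n"
)
meta, samples = parse_config(io.StringIO(example))
print(meta["prefix"], samples)
```

A malformed line raises an error instead of being skipped silently, so a missing tab is caught before a job is submitted.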

Step two - Make STAR alignment sbatch file

python windchime.py -a -i cfg_file.tab -o run_windchime_star.sh
# -a 'alignment' flag
# [optional] -rep flag, include if you are using replicates
# [optional] -mem flag, set sbatch memory allocation in GB. (Default 60GB)
# -i input configuration file name
# -o output STAR sbatch file name

sbatch run_windchime_star.sh
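
Each step in this document calls windchime.py with a different mode flag, so the command lines can be assembled programmatically when many runs are processed. Below is a minimal sketch that only builds and prints the command. The mode flags, -rep, -i and -o come from this document. The helper windchime_cmd and its argument names are hypothetical, and the -mem option is omitted because its exact value format is not specified here:

```python
import shlex

# Mode flags as listed in this document; the dictionary keys are illustrative.
MODES = {"align": "-a", "evaluate": "-e", "rarefaction": "-r", "coverage": "-c", "table": "-t"}


def windchime_cmd(mode, cfg, out=None, replicates=False):
    """Return the argument list for one windchime.py invocation."""
    cmd = ["python", "windchime.py", MODES[mode]]
    if replicates:
        cmd.append("-rep")
    cmd += ["-i", cfg]
    if out is not None:
        cmd += ["-o", out]
    return cmd


cmd = windchime_cmd("align", "cfg_file.tab", "run_windchime_star.sh", replicates=True)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where windchime.py is available.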

Step three - Evaluate summary performance statistics

python windchime.py -e -i cfg_file.tab -o evaluate_run.txt
# -e 'evaluate' flag
# [optional] -rep flag, include if you are using replicates
# -i input configuration file name
# -o output evaluation statistics file name

Step four - Calculate sequence rarefaction

python windchime.py -r -i cfg_file.tab
# -r 'rarefaction' flag
# [optional] -rep flag, include if you are using replicates
# [optional] -tol parameter, allows you to specify the percent tolerance point (default 50)
# -i input configuration file name
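
For readers unfamiliar with rarefaction, the general idea is to subsample the reads at increasing depths and count how many features are still detected at each depth. The sketch below illustrates that generic idea on made-up data. It is not Windchime's implementation, the function rarefaction_curve and the gene labels are hypothetical, and it does not model the -tol parameter:

```python
import random


def rarefaction_curve(read_features, fractions=(0.1, 0.25, 0.5, 0.75, 1.0), seed=0):
    """Return (fraction, n_reads, n_features_detected) for each subsampling depth.

    read_features holds one feature label per aligned read.
    """
    rng = random.Random(seed)
    shuffled = list(read_features)
    rng.shuffle(shuffled)
    curve = []
    for frac in fractions:
        n = int(round(frac * len(shuffled)))
        # A prefix of one shuffle is a random subsample of size n.
        detected = len(set(shuffled[:n]))
        curve.append((frac, n, detected))
    return curve


# Toy data: one abundant feature and two rare ones.
reads = ["geneA"] * 90 + ["geneB"] * 8 + ["geneC"] * 2
for frac, n_reads, n_features in rarefaction_curve(reads):
    print(frac, n_reads, n_features)
```

If the number of detected features has flattened out well before the full depth, additional sequencing is unlikely to reveal many new features.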

Step five - Calculate coverage of aligned reads to genome features

python windchime.py -c -i cfg_file.tab -o coverage_run.sh
# -c 'coverage' flag
# [optional] -rep flag, include if you are using replicates
# [optional] -mem flag, set sbatch memory allocation in GB. (Default 60GB)
# -i input configuration file name
# -o output coverage sbatch file name

sbatch coverage_run.sh

Step six - Combine feature coverage counts into a single table

python windchime.py -t -i cfg_file.tab -o coverage_table.txt
# -t 'table' flag
# [optional] -rep flag, include if you are using replicates
# -i input configuration file name
# -o output combined coverage table file name
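
Conceptually, this step merges one set of per-feature counts per sample into a single table with one row per feature and one column per sample. As a rough illustration of that merge, assuming the per-sample counts are already loaded as dictionaries (the function combine_counts, the sample names and the counts are made up, and this is not Windchime's own code):

```python
def combine_counts(per_sample):
    """Merge {sample: {feature: count}} into rows of a wide table."""
    samples = sorted(per_sample)
    features = sorted({f for counts in per_sample.values() for f in counts})
    rows = [["feature"] + samples]
    for feature in features:
        # A feature missing from a sample is reported as zero coverage.
        rows.append([feature] + [str(per_sample[s].get(feature, 0)) for s in samples])
    return rows


demo = {
    "1_1": {"YAL001C": 12, "YAL002W": 3},
    "1_2": {"YAL001C": 9},
}
for row in combine_counts(demo):
    print("\t".join(row))
```

Filling missing features with zero keeps every row the same width, which makes the table easier to load into downstream analysis tools.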

Conclusion

After a successful run you should have the following files:

