Getting started
This document assumes you’ve already run a sample on an ONT Nanopore device and have generated fast5 files recording the squiggles of that run.
It assumes you’ve previously performed base-calling on the fast5 files and that the resulting fastq files are on the HPC.
It assumes that you have a base conda environment and have activated it. Because of how snakemake handles environment creation at runtime, I have found that the only way to get seaborn to work is to install it using pip:
pip install seaborn
Step 1. Install by git
git clone https://github.com/nanoporetech/pipeline-nanopore-ref-isoforms.git
cd pipeline-nanopore-ref-isoforms
Step 2. Edit the config.yml file.
Note: this pipeline is sparsely documented, so changing these settings can have unexpected results.
nano config.yml
workdir_top
- This will be your output folder
- Despite the inline comment, this works best as a relative path, not an absolute one
# ABSOLUTE path to directory holding the working directory:
workdir_top: "HEK_test"
genome_fasta
# Input genome
genome_fasta: "Homo_sapiens.GRCh38.dna.primary_assembly.fa"
existing_annotation
- Works best if it is a GFF, not a GTF
existing_annotation: "Homo_sapiens.GRCh38.104.gff3"
reads_fastq
- Path to folder <FASTQ/PASS> or file <sample.fastq>
- If this points to a file, change ‘concatenate’ to ‘false’
# cDNA or direct RNA reads in fastq format
reads_fastq: "/scratch/cgsb/gresham/LABSHARE/Data/nanopore/HEK_CPA/Hek_CPA_pilot/FASTQ/pass"
concatenate
- Set to false if the <FASTQ/PASS> directory was already combined into a single file
# The path above is a directory, find and concatenate fastq files:
concatenate: true
run_pychopper
- NB: This must be set to false
- pychopper runs an HMM to identify adapters and cut them out; reads lacking adapters are discarded
- Our sequences no longer have adapters, so running pychopper leads to a massive loss of data
# Process cDNA reads using pychopper, turn off for direct RNA:
run_pychopper: false
plot_gffcmp_stats
- Potentially useful, but set to false by default; here it is switched on
# Plot gffcompare results:
plot_gffcmp_stats: true
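Because a mistyped path only surfaces once the job is running, it can help to check the edited config.yml before submitting. The sketch below is not part of the pipeline: `config_value` and `check_config` are hypothetical helper names, and it assumes every setting sits on a single top-level `key: value` line and that relative paths are resolved from the directory you run it in. It checks that the three input paths exist and that ‘concatenate’ agrees with whether reads_fastq is a folder or a file.

```shell
# Hypothetical helper: print the value of a top-level `key: value` line,
# with any double quotes removed. Assumes one setting per line.
config_value() {
  awk -v key="$1" -F': *' '$1 == key { gsub(/"/, "", $2); print $2; exit }' "${2:-config.yml}"
}

# Hypothetical pre-flight check: returns non-zero if a problem is found.
check_config() {
  local cfg="${1:-config.yml}" status=0 key path reads concat

  # Each input named in the config must exist on disk.
  for key in genome_fasta existing_annotation reads_fastq; do
    path=$(config_value "$key" "$cfg")
    if [ -z "$path" ] || [ ! -e "$path" ]; then
      echo "missing $key: '$path'" >&2
      status=1
    fi
  done

  # A folder of fastq files needs concatenate: true; a single file needs false.
  reads=$(config_value reads_fastq "$cfg")
  concat=$(config_value concatenate "$cfg")
  if [ -d "$reads" ] && [ "$concat" != "true" ]; then
    echo "reads_fastq is a folder: set concatenate to true" >&2
    status=1
  elif [ -f "$reads" ] && [ "$concat" != "false" ]; then
    echo "reads_fastq is a file: set concatenate to false" >&2
    status=1
  fi

  return "$status"
}
```

Running `check_config config.yml` from the pipeline folder prints any problems to stderr and returns non-zero if it finds one.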
Step 3. Edit sbatch
NB Syntax Error - At the time of writing, the syntax used in this pipeline was incompatible with the syntax expected by the latest version of snakemake (see https://github.com/nanoporetech/pipeline-nanopore-ref-isoforms/issues/18). We therefore load a “safe” version, snakemake 5.31.1, but this may break if ONT updates the pipeline syntax.
nano filename.sh
Copy and paste:
#!/bin/bash
#
#SBATCH --verbose
#SBATCH --job-name=cDNA_isoform_calling
#SBATCH --output=isoform_%j.out
#SBATCH --error=isoform_%j.err
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --mem=20GB
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT
# Load the snakemake version known to work with this pipeline
module load snakemake/5.31.1
snakemake --use-conda -j 4 all
Then submit the job using:
sbatch filename.sh
NB - Restarting
- If an error occurs in the pipeline, the previously completed steps are retained, so simply resubmitting the job resumes where you left off. To restart entirely, either delete the output folder <workdir_top> or change the <workdir_top> parameter in the config.yml file.
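The two restart options can be sketched as below. `reset_workdir` is a hypothetical helper, not something the pipeline provides; it assumes the folder name you pass is the value of workdir_top, and it refuses a few obviously dangerous targets before deleting anything. The commented commands assume the same sbatch script and snakemake module as above.

```shell
# Option 1: resume after an error. Completed steps are kept, so resubmit:
#   sbatch filename.sh
# To preview which steps would still run, snakemake's -n flag does a dry run:
#   snakemake --use-conda -j 4 -n all

# Option 2: start over by deleting the output folder.
# Hypothetical helper; pass it the workdir_top value from config.yml.
reset_workdir() {
  local dir="$1"
  # Refuse empty or obviously dangerous targets.
  case "$dir" in
    ""|"/"|"."|"..")
      echo "refusing to remove '$dir'" >&2
      return 1
      ;;
  esac
  rm -rf -- "$dir"
}
```

For the config shown earlier, `reset_workdir HEK_test` followed by `sbatch filename.sh` would rerun everything from the beginning.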
Step 4. Evaluate results
cDNA Isoform identification and comparison to reference:
pipeline-nanopore-ref-isoforms/results/gffcompare/str_merged.annotated.gtf
This file defines which “gene” each isoform is assigned to (e.g. “gene_name”).
This file also describes each transcript with a “class code” such as “=” or “y”. The meaning of each code is described on the gffcompare page (http://ccb.jhu.edu/software/stringtie/gffcompare_codes.png).
NB - Unstranded data and isoform classes
Because this is unstranded cDNA data, the strand-dependent isoform classes ‘s’ and ‘x’ should not be considered “novel”.
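A quick way to see how the transcripts are distributed across class codes is to tally the class_code attribute. This is a minimal sketch, assuming the annotated file is tab-separated GTF with a `class_code "..."` attribute on each transcript line; `count_class_codes` is a hypothetical helper name.

```shell
# Hypothetical helper: tally class codes over the transcript lines of a
# gffcompare-annotated GTF. Prints "<code><TAB><count>", one code per line.
count_class_codes() {
  awk -F'\t' '
    $3 == "transcript" {
      # The attribute looks like: class_code "=";
      if (match($9, /class_code "[^"]+"/)) {
        # Skip the 12 characters of the key, space and opening quote,
        # and drop the closing quote.
        code = substr($9, RSTART + 12, RLENGTH - 13)
        n[code]++
      }
    }
    END { for (c in n) print c "\t" n[c] }
  ' "$1" | LC_ALL=C sort
}
```

For example, `count_class_codes pipeline-nanopore-ref-isoforms/results/gffcompare/str_merged.annotated.gtf` prints one row per code; any ‘s’ and ‘x’ rows should be read with the note above in mind.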
cDNA Isoform coverage data:
pipeline-nanopore-ref-isoforms/results/gffcompare/str_merged.gff
This file contains read depth information for each transcript: total coverage (“cov”), FPKM, and TPM.
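To get these values into a table, the attributes can be pulled out of each transcript line. This is a minimal sketch, assuming the file uses tab-separated GTF-style attributes such as `cov "12.5";` (if your copy uses GFF3-style `key=value` attributes it will need adjusting); `transcript_abundance` is a hypothetical helper name, and a missing attribute is reported as NA.

```shell
# Hypothetical helper: print transcript_id, cov, FPKM and TPM as
# tab-separated columns, one transcript per line.
transcript_abundance() {
  awk -F'\t' '
    # Return the quoted value of attribute `key`, or NA if it is absent.
    function attr(s, key,    re) {
      re = key " \"[^\"]*\""
      if (match(s, re))
        return substr(s, RSTART + length(key) + 2, RLENGTH - length(key) - 3)
      return "NA"
    }
    $3 == "transcript" {
      print attr($9, "transcript_id") "\t" attr($9, "cov") "\t" attr($9, "FPKM") "\t" attr($9, "TPM")
    }
  ' "$1"
}
```

For example, `transcript_abundance pipeline-nanopore-ref-isoforms/results/gffcompare/str_merged.gff > abundance.tsv` writes a table that can be loaded into R or a spreadsheet.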