Getting started

This document assumes you’ve already run a sample on an ONT Nanopore device and have generated fast5 files recording the squiggles of that run.

It assumes you’ve already performed base-calling on the fast5 files and that the resulting fastq files are on the HPC.

It assumes that you have a base conda environment and have activated it. Because of how snakemake handles environment creation at runtime, I have found that the only way to get seaborn to work is to install it using pip.

pip install seaborn

Step 1. Install by git

git clone https://github.com/nanoporetech/pipeline-nanopore-ref-isoforms.git
cd pipeline-nanopore-ref-isoforms

Step 2. Edit the config.yml file.

Note: this pipeline is largely under- or undocumented, so changing these settings can have unexpected results.

nano config.yml
  • workdir_top

    • This will be your output folder
    • Despite the inline comment, this works best if it is not an absolute path
    # ABSOLUTE path to directory holding the working directory:
    workdir_top: "HEK_test"
  • genome_fasta

    # Input genome
    genome_fasta: "Homo_sapiens.GRCh38.dna.primary_assembly.fa"
  • existing_annotation

    • Works best if it is a GFF, not a GTF
    existing_annotation: "Homo_sapiens.GRCh38.104.gff3"
  • reads_fastq

    • Path to folder <FASTQ/PASS> or file <sample.fastq>
    • If this points to a file, change ‘concatenate’ to ‘false’
    # cDNA or direct RNA reads in fastq format
    reads_fastq: "/scratch/cgsb/gresham/LABSHARE/Data/nanopore/HEK_CPA/Hek_CPA_pilot/FASTQ/pass"
  • concatenate

    • Set this to ‘false’ if the <FASTQ/PASS> directory was already combined into a single fastq file; otherwise leave it as ‘true’
    # The path above is a directory, find and concatenate fastq files:
    concatenate: true
  • run_pychopper

    • NB This must be set to false
    • pychopper runs an HMM to identify adapters and cut them out; reads lacking adapters are discarded
    • Our sequences no longer have adapters, so running pychopper leads to a massive loss of data
    # Process cDNA reads using pychopper, turn off for direct RNA:
    run_pychopper: false
  • plot_gffcmp_stats

    • Potentially useful; set to false by default
    # Plot gffcompare results:
    plot_gffcmp_stats: true
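
The link between reads_fastq and concatenate noted above can be checked before editing config.yml. The following is a minimal sketch, not part of the pipeline: suggest_concatenate is a hypothetical helper name, and it assumes only what the notes above state, namely that a folder of fastq files goes with concatenate: true and a single fastq file goes with concatenate: false.

```shell
#!/bin/bash
# Hypothetical helper (not part of the ONT pipeline): print the value that
# 'concatenate' should take for a given reads_fastq path.
suggest_concatenate() {
    local reads_path="$1"
    if [ -d "$reads_path" ]; then
        echo "true"      # folder of fastq files: let the pipeline concatenate them
    elif [ -f "$reads_path" ]; then
        echo "false"     # single fastq file: nothing to concatenate
    else
        echo "missing"   # path does not exist: fix reads_fastq first
        return 1
    fi
}
```

Calling suggest_concatenate with the same path you put in reads_fastq prints the matching setting, or "missing" if the path is wrong.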

Step 3. Edit the sbatch script

NB Syntax Error - At the time of writing, the syntax used in this pipeline was incompatible with the syntax expected by the latest version of snakemake (see https://github.com/nanoporetech/pipeline-nanopore-ref-isoforms/issues/18). Here we use a “safe” version (snakemake/5.31.1), but this may break if ONT updates the pipeline syntax.

nano filename.sh

Copy and paste:

#!/bin/bash
#
#SBATCH --verbose
#SBATCH --job-name=cDNA_isoform_calling
#SBATCH --output=isoform_%j.out
#SBATCH --error=isoform_%j.err
#SBATCH --time=24:00:00
#SBATCH --nodes=1
#SBATCH --mem=20GB
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT

# Load the pinned snakemake version
module load snakemake/5.31.1

snakemake --use-conda -j 4 all

Then submit the job using:

sbatch <filename.sh>

NB - Restarting: If an error occurs in the pipeline, the previously completed steps are retained, so simply resubmitting the run is enough to resume where you left off. To restart entirely, either delete the output folder <workdir_top> or change the <workdir_top> parameter in the config.yml file.
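
Whether the next submission resumes or starts from scratch therefore depends on whether the output folder set by workdir_top already exists. A minimal sketch of that check, not part of the pipeline: run_mode is a hypothetical helper, it assumes workdir_top is double-quoted in config.yml as in the example in Step 2, and it assumes a relative workdir_top is resolved against the folder that holds config.yml.

```shell
#!/bin/bash
# Hypothetical helper: report whether the next submission would resume an
# earlier run ("resume") or start from scratch ("fresh").
run_mode() {
    local config="${1:-config.yml}"
    local workdir
    # Assumes the value is double-quoted, e.g.  workdir_top: "HEK_test"
    workdir=$(awk -F'"' '/^workdir_top:/ { print $2; exit }' "$config")
    if [ -z "$workdir" ]; then
        echo "workdir_top not found in $config" >&2
        return 1
    fi
    # Assumption: a relative workdir_top lives next to config.yml.
    case "$workdir" in
        /*) ;;                                         # absolute path: use as-is
        *)  workdir="$(dirname "$config")/$workdir" ;; # relative path
    esac
    if [ -d "$workdir" ]; then
        echo "resume"   # output folder exists: completed steps are reused
    else
        echo "fresh"    # no output folder yet: everything is run
    fi
}
```

Running run_mode config.yml from the pipeline folder before sbatch shows which case applies.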

Step 4. Evaluate results

  • cDNA Isoform identification and comparison to reference:

    • located in
    pipeline-nanopore-ref-isoforms/results/gffcompare/str_merged.annotated.gtf

    This file defines which “gene” each isoform is assigned to (e.g. “gene_name”).

    This file also describes each transcript with a “class code” such as “=” or “y”. The meaning of each code is described on the gffcompare page: http://ccb.jhu.edu/software/stringtie/gffcompare_codes.png

    NB - Unstranded data and isoform classes: Because this is unstranded cDNA data, the strand-dependent isoform classes ‘s’ and ‘x’ should not be considered “novel”.

  • cDNA Isoform coverage data:

    • located in
    pipeline-nanopore-ref-isoforms/results/gffcompare/str_merged.gff

    This file contains read depth information for each transcript: total coverage (“cov”), FPKM, and TPM.
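
Both result files can be summarised from the command line. The sketch below is illustrative only: count_class_codes and transcript_abundance are hypothetical helper names, and both assume GTF-style records (nine tab-separated columns, a “transcript” feature in column three, and key "value"; attributes in column nine). That is how gffcompare and StringTie normally write these files, but check it against your own output.

```shell
#!/bin/bash
# Tally gffcompare class codes over the transcript records of an annotated GTF.
# Prints "<code><TAB><count>", one line per class code.
count_class_codes() {
    local gtf="$1"
    awk -F'\t' '
        $3 == "transcript" && match($9, /class_code "[^"]*"/) {
            # skip the 12 characters of: class_code "   and drop the closing quote
            code = substr($9, RSTART + 12, RLENGTH - 13)
            n[code]++
        }
        END { for (c in n) printf "%s\t%d\n", c, n[c] }
    ' "$gtf" | LC_ALL=C sort
}

# Print transcript_id, cov, FPKM and TPM for every transcript record as a
# tab-separated table; attributes that are absent are reported as NA.
transcript_abundance() {
    local gff="$1"
    awk -F'\t' '
        # Return the quoted value of attribute key in column 9, or NA.
        function attr(key,    pat) {
            pat = key " \"[^\"]*\""
            if (match($9, pat))
                return substr($9, RSTART + length(key) + 2, RLENGTH - length(key) - 3)
            return "NA"
        }
        BEGIN { print "transcript_id\tcov\tFPKM\tTPM" }
        $3 == "transcript" {
            printf "%s\t%s\t%s\t%s\n", attr("transcript_id"), attr("cov"), attr("FPKM"), attr("TPM")
        }
    ' "$gff"
}
```

For example, count_class_codes results/gffcompare/str_merged.annotated.gtf lists how many transcripts fall into each class, and transcript_abundance results/gffcompare/str_merged.gff gives a table that can be redirected to a .tsv file. Counts for ‘s’ and ‘x’ should be read with the unstranded caveat above in mind.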
