RNA-seq: Transcript-level quantification

usage: HemTools rna_seq [-h] [-j JID] [--debug] [--short]
                        (-f INPUT | --guess_input) [-g GENOME] [-i INDEX_FILE]
                        [-b BLACKLIST] [-s CHROM_SIZE]
                        [-e EFFECTIVEGENOMESIZE]

Named Arguments

-j, --jid

enter a job ID, which is used to make a new directory. Every output will be moved into this folder.

Default: “{{subcmd}}_docs_2024-03-15”

--debug

Not for end-user.

Default: False

--short

Not for end-user.

Default: False

-f, --input

tab delimited 3 columns (tsv file): Read 1 fastq, Read 2 fastq, sample ID

--guess_input

Let the program generate the input files for you.

Default: False

Genome Info

-g, --genome

genome version: hg19, hg38, mm9, mm10

Default: “hg19”

-i, --index_file

STAR index file

Default: “/home/docs/checkouts/readthedocs.org/user_builds/hemtools/checkouts/latest/subcmd/../hg19/hg19_star_253a_index/”

-b, --Blacklist

Blacklist file

Default: “/home/docs/checkouts/readthedocs.org/user_builds/hemtools/checkouts/latest/subcmd/../hg19/hg19.blacklist.bed”

-s, --chrom_size

chrome size

Default: “/home/docs/checkouts/readthedocs.org/user_builds/hemtools/checkouts/latest/subcmd/../hg19/hg19.chrom.sizes”

-e, --effectiveGenomeSize

effectiveGenomeSize for bamCoverage

Default: “2451960000”

Summary

This pipeline quantifies transcript-level abundance using kallisto and generates bam, bw files using STAR. The output file kallisto_files/{jid}_transcript.tpm.csv contains both transcript ID and gene name, which users can look for specific genes. For differential gene/transcript expression analysis, see diff_genes. For volcano plot of differential genes, see volcano. For RNA-seq variant calling, see rna_seq_variant_call

3/14/2024 Update

Isoform quantification using STAR is added. The transcript count table is *.isoforms.results.

To visualize read assigned to a transcript

First, get the transcript ID from the transcript count table.

Then, do

module load samtools/1.7;samtools view Your_label.transcript.bam | grep Your_transcript_ID | cut -f 1 > read.list

# the bam file should be in your bam_files folder, use markdup.uq.bam file
# Please note that transcript.bam is different than markdup.uq.bam

module laod picard/2.9.4;java -jar /hpcf/apps/picard/install/2.9.4/picard.jar FilterSamReads I=YOUR_bam_file.markdup.uq.bam O=subset.bam READ_LIST_FILE=read.list FILTER=includeReadList Validation_Stringency=SILENT;samtools index subset.bam

Usage

Go to your data directory and type the following.

Step 0: Load python version 2.7.13.

$ module load python/2.7.13

Step 1: Prepare input files, generate fastq.tsv.

$ HemTools_dev rna_seq --guess_input

    Input fastq files preparation complete! ALL GOOD!
    Please check if you like the computer-generated labels in : fastq.tsv

Note

If you are preparing fastq.tsv yourself, please make sure no space anywhere in the file. Note that the seperator is tab. Spaces in file name will cause errors.

Step 2: Check the computer-generated input list (manually), make sure they are correct.

$ less fastq.tsv

Note

a random string will be added to the generated files (e.g., fastq.94c049cbff1f.tsv) if they exist before running step 1.

Step 3: Submit your job.

$ HemTools_dev rna_seq -f fastq.tsv

Single-end input

The input is the same, leave the second column empty.

$ HemTools_dev rna_seq -f fastq.tsv --single

Sample input format

fastq.tsv

This is a tab-seperated-value format file. The 3 columns are: Read 1, Read 2, sample ID.

../../_images/fastq.tsv.png

Common downstream analyses

  • differential expression analysis (transcript / gene level)

  • MA plot, volcano plot, heatmap

  • given a set of genes, show a boxplot / violin plot (gene counts, TPM, or other quantifiers)

  • gene set enrichment analysis

Paired-end sequencing allows for various of additional analyses, including

  • isoform specific expression / differential isoform usage / alternative splicing / cryptic exons

  • RNA editing / variant calling / allel-specific expression

  • gene fusion

If a large collection of RNA-seq data is available, one can perform clustering analysis, time-series analysis, infering biological network, and classification analysis.

Integrative analyses with other data, etc ChIP-seq, ATAC-seq.

Output

In the email, you will see a zip file containing transcript expressions (TPM) for all input samples. An analysis report is also attached (same format as other reports generated by HemTools). Since STAR doesn’t include un-mapped reads in the bam file, you will see 100% mapping rate. The mapping statistics generated by STAR are currently not included, you can manually “less” them; they are located in the log_files folder, with file name ending with {{output_name}}_Log.final.out.

Reference

https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/04_alignment_quality.html

Report bug

Once the job is finished, you will be notified by email with some attachments. If no attachment can be found, it might be caused by an error. In such case, please go to the result directory (where the log_files folder is located) and type:

$ HemTools report_bug

Comments

code @ github.