HemTools: a collection of NGS pipelines and bioinformatic analyses
- NGS pipelines
- Standard GATK variant calling for both human and non-human species
- MNase-seq analysis pipeline (not a standard version)
- MicroC data analysis pipeline
- MinION Nanopore for sequence assembly and read mapping
- Create normalized RNA-seq bigwiggle files
- Analysis of single cell Strand-seq data
- Analysis of Sci-L3-seq data
- RNA-seq alternative splicing pipeline
- ATAC-seq
- Footprint analysis for ATAC-seq data
- STARR-seq analysis pipeline
- Running just BWA mem mapping
- CaptureC data analysis pipeline
- CHANGE-seq analysis pipeline
- Paired-end ChIP-seq
- Single-end ChIP-seq
- CITE-seq (scRNA-seq with antibodies) analysis
- Genome-wide CRISPR Screening
- CUT & RUN pipeline
- CUT & RUN calibration pipeline
- CUT & RUN footprinting
- Differential exon analysis
- RNA-seq: differential gene expression analysis
- eCLIP-seq data analysis pipeline
- gCrisprTools: Genome-wide CRISPR Screening
- RNA-seq: Identification of gene fusion events
- Detecting allele-specific effects on ChIP-seq or ATAC-seq
- Analysis of Hi-C or HiChIP data
- MNase-based HiChIP data analysis
- Analysis of Hi-C and capture-C data using HiC-Pro
- Paired-end histone ChIP-seq or CUT&RUN
- Call IDR peaks given bam files from two replicates
- PacBio iso-seq data analysis
- DNA methylation (Bisulfite-Sequencing) analysis pipeline using nf-core
- Call motif footprint from bigwiggle files
- Inspection of multi-mapped reads
- Sequencing-depth and fragment-length normalized bigwiggle track
- Total reads in peaks normalized bigwiggle track
- Variant calling on PacBio HiFi reads
- Summarize R1 R2 read mapping direction and distance
- RNA-seq: Transcript-level quantification
- Variant identification on RNA-seq data
- scJupyter for single cell integration, annotation, modeling and reporting
- SHARE-seq data analysis
- Single-cell RNA-seq analysis
- Single-cell multiomic analysis
- SLAM-seq for time-resolved RNA sequencing
- NCBI download GEO/SRA data
- STARR-seq analysis pipeline
- Target-Seq analysis
- HemTools Tutorial 4-18-2019
- Typical Usage
- Sample input format
- Report bug
- Output
- Reference
- Comments
- Data Visualization
- Allele frequency or MSA plot
- Plot text on bar plot
- Generate base editor score tracks
- Upload colorful bed files to protein paint
- Convert bed files to vcf
- Plot Venn diagram given two bed files
- Boxplot
- Plot bw file correlation
- Data table operations
- Gene expression heatmap
- Heatmap Basic
- HiC data visualization
- Plot motif position density on peaks
- Plot Chromosome Ideogram
- Interactive heatmap
- Simple lineplot
- line_plot to compare editing outcome frequency (two groups)
- Overlapping Barplot
- Pathway visualization
- Visualizing high-dimensional data using PCA or UMAP
- Correlation heatmap
- Re-order Correlation heatmap
- Plot Enrichment dotplot
- Create data table of tracks on protein paint
- Creating facet table for protein paint
- Plot replicate correlation
- Plot correlation scatter plots
- Scatter plot by color, shape, and size
- Average signal and heatmap over a bed file
- Average signal over multiple peaks
- Average signal for multiple bw over one bed
- Statistical test on 2 signals
- Table to heatmap, no clustering
- Volcano plot for logFC and P-value/FDR
- Typical Usage
- Report bug
- Motif Analysis Suite
- Integrative Analysis
- Linux Art
- Building DNAnexus APP
- Start JupyterLab in HPC
- Things about R
- bjobs related commands
- conda cheatsheet
- Docker notes
- Using Git version control
- Log in to compute node
- Convert html file to PDF or PNG
- Linux text file operations
- Python visualization code examples
- Merge Images
- Things about python
- Remove <U+FEFF> character in your file
- How to download folders in sourceforge
- Notes on Sphinx
- SSH to HPC without password
- Installation
- A collection of Jupyter Notebooks
- ChiCMaxima algorithm review
- Load library
- Read count table
- Specify design matrix, control + treatment
- Get count matrix
- RUN DEseq2
- Save result (this is the result without LFC shrinkage)
- apply LFC shrinkage
- save new result
- GSEA dot plot
- How to run GSEA analysis using user defined gene sets
- define gene sets
- define your ranked list
- run GSEA
- save GSEA stat
- plot GSEA figure
- run everything for another ranked gene list
- GWAS plot example
- ggplot correlation with values shown
- ggplot Manhattan plot
- ggplot Scatter plot gRNA counts version 2
- ggpubr violin plot for comparing number of fragments
- adding custom p-value bar to your ggplot
- ggseqlogo for variant motifs
- Scatter plot for pairwise comparison (gRNA counts)
- Step 1, read data
- generate density plot for each column
- merge each density
- Use R and Python in Jupyter Notebook
- FAQ
- Comments
- Bioinformatic Tools
- Predicting in vivo TFBS using Catchitt
- Calling significant interactions from Capture-C or Capture-HiC
- GSEA and pathway/GO enrichment analyses
- Replicate correlation and QC for HiC data
- Gene expression clustering
- Consensus peaks given multiple (>=2) replicates
- NCBI data submission
- Local UCSC cell browser usage for Seurat
- notes on alphafold
- Assigning features to a bed file.
- General bait design
- calculate chrM percent
- Filter bam files and generate bw files
- check sample barcode frequency in index reads
- Barcode frequency in 5’-end
- Download raw data from Illumina Base Space
- Convert BCL basecall files to FASTQ files
- BedGraph to BigWiggle
- bed overlap bedpe
- Query bed overlap with a list of bed files
- Merging bigwiggle files into one bw
- Input
- Usage
- Chromatin interaction calling in captureC data
- Visualize genomic loci (overview)
- Count indel integration pipeline
- Count indel integration pipeline (simplified version)
- Crispresso2 for HDR
- Convert CRISPResso allele frequency table to vcf-like table
- Interactive visualization using Dash Bio
- convert dataframe to html
- CRISPR Screening Demultiplexing
- CRISPR Screening Demultiplexing (hard trim first N random bp)
- Demultiplexing fastq files
- Diff or merge of two bw files
- DNAnexus download and upload
- EGACryptor for EGA submission
- Call interactions from HiC
- Extract inward/outward oriented pairs from BAM file
- Merge fastq I1 I2 R1 R2 reads into R1 and R2
- subsample fastq and visualize in sequence logo
- Run fastQC for a list of fastq files
- Filter out reads mapped to specific sequences
- Annotate vcf file (custom annotation does not work)
- Genomic features annotation given bed file
- Extract user-defined gene promoter from refseq TSS database
- Find allele (e.g., SNPs) specific effects
- Integrating gene expression data and PPI network
- Objective
- Steps
- Cons
- Input
- Usage
- Output
- GTF operations
- Running GUIDE-seq in HPC
- HiC-Pro
- Generate indexed genome, chrom size, and res fragment bed for HicPro analysis
- Homer ChIP-seq analysis
- How to download all files from a website
- ENCODE database query
- Transcript-level abundance quantification
- Kmer count over bed
- Lift Over Bed or bigWiggle files
- LiftOverVCF
- Seurat to Loupe browser
- Merge multiple bedfiles
- Merge fastq files for L001 L002 L003 L004
- Write flowchart using text
- Using nf-core pipelines on HPC
- OnTAD
- Optimal subset finding problem in mutagenesis studies
- Filtering out peaks in narrowPeak files
- Convert rmd to html
- RNA-seq QC
- Across cell type NGS data normalization
- single cell RNA-seq data integration
- FASTQ files operations
- Smoothing a bedgraph file
- Download fastq data from NCBI SRA
- Super-enhancer identification
- Convert a column to bigwiggle file
- Using GPU on HPC
- Test differences in number of interactions
- Identify direct targets and co-binding factors
- Extract Ensembl Gene Name and IDs given IDs or names from any databases
- (TOBIAS) Footprint analysis for ATAC-seq data
- Uditas
- Generate new genome given vcf file
- Accessible Data in HemTools
- Gallery (stand-alone tools)
- Add gene annotations to CHANGE-seq off-targets table
- Manhattan plot in circos
- Upload your bw and bed files to protein paint
- Plot hematopoiesis intensity, blood lineage
- Peak Annotation using GREAT
- A simple solution to submit LSF jobs
- Circos Manhattan plot
- Peak Annotation GREAT
- Add gene annotations to CHANGE-seq off-targets table
- Upload your bw and bed files to protein paint
- Plot hematopoiesis
- Differential Analysis pipelines
- Study notes
- Notes on ATAC-seq, open chromatin, and footprints/motif
- Hosting web server on AWS
- Current observations/claims for CTCF in insulation potency and chromatin interaction
- Duplicated sequences in HBG1/HBG2 promoters
- Duplicated reads, multi-mapped reads
- Notes on python
- R for python people
- Thoughts on HemTools
- bioinformatics
- Chromatin Interaction data analysis
- pegRNA design
- Facts about CRISPR-Cas9
- Docker your applications
- All errors and how they were solved
- Concepts, names, and other small things to remember
- install tensorflow on stjude HPC
- A very good description of histone marks
- Motif notes
- Python Pandas
- Notes on how to use plink
- Quantify mutations
- All questions and how they were solved
- single-cell RNA-seq analysis
- Using Scanpy to replace pandas
- Published statistics about genome biology
- Machine Learning pipelines
- CRISPR tools
- Crispresso2 for Base editor
- Base editor screening for gene functions
- cas9ENG
- Find number of off-targets
- Analysis pipeline for change-seq randomized assay
- Calculate custom defined editing frequency from Crispresso2 output
- Quantify base editing efficiency for crispressoPooled experiments
- Quantify prime editor off-target activity for crispressoPooled experiments
- Crispresso2 for Prime Editor
- convert gRNA bed file to cutsite bed file
- Easy-Prime: pegRNA design
- CRISPR-cas9 energy models for gRNA efficiency prediction
- QC for gRNA quality, gRNA sequencing data
- Convert MaGeCK RRA sgRNA results to bw tracks
- sgRNA design for disrupting TFBS
- Bioinformatics Core Competencies
- Calculate DEG gene expression level in the same TAD
- What is JupyterLab?
- Data analysis using Pandas
- Data visualization using Seaborn and many other libraries
- What is MA plot?
- Common NGS data formats and tools
- Hypothesis testing
- Volcano plot
- Find all genes in given TAD region (Language: Chinese)
- DeepTools tutorial
- Gene expression data analysis
- homer motif result interpretation
- Merge ChIP-seq peaks and Diff Gene Tables (Language: Chinese)
- Video tutorial: pooled gRNA screening
- pysam example: checking softclip reads
- Density plot using python
- Python Heatmap plots
- Introduction
- Approach
- Reference
Ask question here
General principles
A typical HemTools command looks like this:

    module load python/2.7.13
    HemTools cut_run -f fastq.tsv -d peakcall.tsv

To list all available sub-commands, run:

    HemTools -h

    usage: HemTools [-h] [-v]
                    {cut_run,chip_seq_pair,chip_seq_single,atac_seq,report_bug,volcano_plot,crispr_seq}
                    ...

    HemTools: performs NGS pipelines and other common analyses. Contact:
    Yichao.Li@stjude.org or Yong.Cheng@stjude.org

    positional arguments:
      {cut_run,chip_seq_pair,chip_seq_single,atac_seq,report_bug,volcano_plot,crispr_seq}
                            Available APIs in HemTools
        cut_run             CUT & RUN pipeline
        chip_seq_pair       Paired-end ChIP-seq pipeline
        chip_seq_single     Single-end ChIP-seq pipeline
        atac_seq            ATAC-seq pipeline
        report_bug          Email the log files to the developer.
        volcano_plot        Data visualization: Volcano plot
        crispr_seq          Genome-wide CRISPR Screening pipeline

    optional arguments:
      -h, --help            show this help message and exit
      -v, --version         show program's version number and exit
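When HemTools is driven from a script, it can help to check the sub-command before launching anything. The sketch below is only an illustration: `hemtools_command` is a hypothetical helper, not part of HemTools, and the only flags shown in this page are `-f` and `-d` for `cut_run`, so other sub-commands may expect different arguments.

```python
# Sub-commands copied from the `HemTools -h` output above.
KNOWN_SUBCOMMANDS = (
    "cut_run", "chip_seq_pair", "chip_seq_single", "atac_seq",
    "report_bug", "volcano_plot", "crispr_seq",
)


def hemtools_command(subcommand, *args):
    """Build an argv list for HemTools (hypothetical helper, not a HemTools API)."""
    if subcommand not in KNOWN_SUBCOMMANDS:
        raise ValueError("unknown HemTools sub-command: %s" % subcommand)
    # Arguments are passed through untouched; they are not validated here.
    return ["HemTools", subcommand] + [str(a) for a in args]


cmd = hemtools_command("cut_run", "-f", "fastq.tsv", "-d", "peakcall.tsv")
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.check_call(cmd)` on a machine where HemTools is installed and the python module is loaded.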
Duplicated reads and unique reads
Duplicated reads are reads that share the same 5' end position.
Unique reads refer to uniquely mapped reads as compared to multi-mapped reads.
In all NGS pipelines, we provide results from:
raw reads: .markdup.bam, .all.bw, _peak.NarrowPeak
duplicates removed: rmdup.bam, rmdup.bw, rmdup_peak.NarrowPeak
duplicates removed, uniquely mapped reads only: rmdup.uq.bam, rmdup.uq.bw, rmdup.uq_peak.NarrowPeak
For general use, the rmdup.uq results are the commonly accepted choice.
For people focusing on HBG1 and HBG2, the raw or rmdup results can be used instead, since reads from these near-identical genes tend to be multi-mapped and are dropped from rmdup.uq.
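To sort a directory of pipeline outputs into the three result sets listed above, the file suffix is enough. This is a minimal sketch: `read_set` and the sample file names are hypothetical, and only the suffixes themselves come from this page.

```python
# Suffixes copied from the list above, most specific first, so that
# "rmdup.uq_peak.NarrowPeak" is not mistaken for the raw "_peak.NarrowPeak".
SUFFIXES = [
    ("rmdup.uq", ("rmdup.uq.bam", "rmdup.uq.bw", "rmdup.uq_peak.NarrowPeak")),
    ("rmdup", ("rmdup.bam", "rmdup.bw", "rmdup_peak.NarrowPeak")),
    ("raw", (".markdup.bam", ".all.bw", "_peak.NarrowPeak")),
]


def read_set(filename):
    """Return 'raw', 'rmdup', 'rmdup.uq', or None for an unrecognized file."""
    for label, suffixes in SUFFIXES:
        if filename.endswith(suffixes):  # str.endswith accepts a tuple
            return label
    return None


# Hypothetical file names, for illustration only.
for name in ["sample.markdup.bam", "sample.rmdup.bw", "sample.rmdup.uq_peak.NarrowPeak"]:
    print(name, "->", read_set(name))
```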
The mapping rate is usually above 95% for high-quality data.
Duplicated reads can be PCR duplicates or real signal. They typically make up 5% to 20% of reads in high-quality data, although I think I have seen duplicate rates near 40% in some GEO datasets.
Multi-mapped reads are usually rare, roughly 1% to 2%. I have not yet seen any data with more than 10% multi-mapped reads.
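The two definitions above can be made concrete with a toy calculation. The sketch below is not how the HemTools pipelines compute these numbers. The `Read` record and `read_stats` function are hypothetical, a read counts as a duplicate when an earlier read has the same chromosome, strand, and 5' end (including the strand is an assumption), and a read counts as multi-mapped when it aligns to more than one location.

```python
from collections import namedtuple

# Hypothetical minimal alignment record; real pipelines read these from BAM files.
Read = namedtuple("Read", ["chrom", "strand", "five_prime", "n_hits"])


def read_stats(reads):
    """Return the duplicate and multi-mapped percentages for a list of reads."""
    reads = list(reads)
    total = len(reads)
    if total == 0:
        return {"duplicate_pct": 0.0, "multi_mapped_pct": 0.0}
    seen = set()
    duplicates = 0
    for r in reads:
        key = (r.chrom, r.strand, r.five_prime)
        if key in seen:
            duplicates += 1  # an earlier read already starts at this 5' end
        else:
            seen.add(key)
    multi = sum(1 for r in reads if r.n_hits > 1)
    return {
        "duplicate_pct": 100.0 * duplicates / total,
        "multi_mapped_pct": 100.0 * multi / total,
    }


toy = [
    Read("chr1", "+", 100, 1),
    Read("chr1", "+", 100, 1),    # same 5' end as the first read: duplicate
    Read("chr1", "-", 250, 1),
    Read("chr11", "+", 5000, 2),  # aligns to two locations: multi-mapped
]
print(read_stats(toy))
```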
FAQ
How do I list all of my past analyses?
All the locations of the past analyses are logged here: ~/.hemtools_meta/my_dir.csv
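The log can be printed from Python as well. The column layout of `my_dir.csv` is not documented here, so this sketch makes no assumption about it and simply returns every non-empty row. `list_past_analyses` is a hypothetical helper name.

```python
import csv
import os


def list_past_analyses(path="~/.hemtools_meta/my_dir.csv"):
    """Return the rows of the HemTools analysis log, or [] if the file is absent."""
    path = os.path.expanduser(path)
    if not os.path.exists(path):
        return []
    with open(path) as handle:
        # Skip blank lines; keep every column as-is.
        return [row for row in csv.reader(handle) if row]


for row in list_past_analyses():
    print("\t".join(row))
```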
Error loading python
/hpcf/apps/python/install/2.7.13/bin/python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
A: The python module is not loaded. Run module load python to fix it.
ERROR: temporary directory is not writable: ‘normalize_bw_given_peak_yli11_2021-06-20’
This appears to be an HPC error rather than a HemTools bug.
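One way to narrow this down is to test whether the directory you launched the job from accepts new files. This sketch assumes the job directory is created under the current working directory, which this page does not confirm. `can_write_temp` is a hypothetical helper.

```python
import os
import tempfile


def can_write_temp(directory):
    """Try to create (and automatically delete) a scratch file in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers missing directories, permission errors, and full file systems.
        return False


print(can_write_temp(os.getcwd()))
```

If this prints False, the problem lies with the file system or permissions rather than with HemTools.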