Across cell type NGS data normalization¶
usage: S3norm.py [-h] [-j JID] -f INPUT [--smoothLength SMOOTHLENGTH]
[--binSize BINSIZE] [--peak_dir PEAK_DIR]
[--file_name_pattern FILE_NAME_PATTERN] (-bam | -bw | -bdg)
[-g GENOME] [-s CHROM_SIZE] [-b BLACK_LIST]
NGS read counts normalization
optional arguments:
-h, --help show this help message and exit
-j JID, --jid JID enter a job ID, which is used to make a new directory.
Every output will be moved into this folder. (default:
S3norm_yli11_2022-05-23)
-f INPUT, --input INPUT
TSV file, 1 or 2 columns. If the 2nd col is available,
it will be used as control. (default: None)
--smoothLength SMOOTHLENGTH
--binSize BINSIZE bin size for s3norm (default: 20)
--peak_dir PEAK_DIR
--file_name_pattern FILE_NAME_PATTERN
-bam input files are bam, bam need index (default: False)
-bw input files are bigwig (default: False)
-bdg input files are bedgraph (default: False)
Genome Info:
-g GENOME, --genome GENOME
genome version: hg19, hg38, mm9, mm10. By default,
specifying a genome version will automatically update
index file, black list, chrom size and
effectiveGenomeSize, unless a user explicitly sets
those options. (default: hg19)
-s CHROM_SIZE, --chrom_size CHROM_SIZE
chrome size (default: /home/yli11/Data/Human/hg19/anno
tations/hg19.chrom.sizes.sorted)
-b BLACK_LIST, --black_list BLACK_LIST
Blacklist file (default: /home/yli11/Data/Human/hg19/a
nnotations/hg19.blacklist.bed)
Summary¶
S3norm can perform read counts normalization across different cell types. It uses a monotonic nonlinear data transformation method to match signals in both the peak regions and the background regions differently, such that both sequencing depth and signal-to-noise ratio between data sets can be simulatenously normalized.
Ref:
Input¶
A 1 column or 2 columns tsv file containing your signal data with or without control data.
For example
chip_seq_1 control_1
chip_seq_2 control_2
chip_seq_3 control_1
atac_seq_1
Input files can be one of the 3 types, but not mixed types:
Bam (with .bai index file in the same dir)
bigwig
bedgraph
Usage¶
Put all your input data in the same working directory and inside this dir, do the following:
hpcf_interactive
module load python/2.7.13
S3norm.py -f input.list -bam
To make S3norm run faster, we can only normalize signals around the peak region:
S3norm.py -f v15input.list --peak_dir ../cut_run_rfeng_2021-05-20/peak_files -bam
Output¶
The output files are S3norm-normalized bigwig files, located at $jid/S3norm_NBP_bedgraph/*.bw
Common problems¶
1. no outputs.¶
This could be caused by out of memory error if you have many files to be normalized at the same time. It also depends on number of entries (i.e., resolution, binsize) in your bedgraph file. In one example I have, 6 files with hg19 binsize=100bp, took almost 50G memory to run and finish in about 3 hours.
You shouldn’t have this problem if your data looks similar to the example above because by default, we require 80G memory.
2. input bw or bdg¶
--bdg
not implemented. If you inputs are bigwig, use S3norm.py -f input.list -bw
, similarly, if inputs are bedgraph, use -bdg
Comments¶
code @ github.