Common NGS data formats and tools¶

Objectives¶

You should be familiar with the following common examples that we use daily:

Sequence retrieval
Read alignment
SAM/BAM query: index, sort, stats, visualization, filter, subset
SAM format, SAM flags
BED: sort, intersect (remove), merge

Data formats¶

fasta
fastq
bam / bam.bai / sam
bed / bed4 / bed6 / bed6+4 == narrowPeak / bed12
wiggle / bedgraph / bigwiggle
vcf
gtf / gff / gff3
tsv / csv
.py / .R / .sh / .jar / .pl
Please refer to the UCSC format FAQ for “official” definition

Tools¶

fastqc / fastp
seqtk
BWA / bowtie2 / STAR
samtools
bedtools
deeptools
scripting languange: Python

FASTA format¶

An example shown below.

;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
; MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Syntax¶

“;” is for comments
“>” is for the name of a DNA/RNA/protein sequence
The sequence name can only be written in one line, no matter how long it is
The next line following “>” should not be an empty line

; invalid FASTA format
>Sequence_1

ACTTGG

; invalid FASTA format
>Sequence_2
>split line for names new sequence in human
ACGTGG

Other notes¶

IUPAC code for DNA letters:¶

IUPAC nucleotide code	Base
A	Adenine
C	Cytosine
G	Guanine
T (or U)	Thymine (or Uracil)
R	A or G
Y	C or T
S	G or C
W	A or T
K	G or T
M	A or C
B	C or G or T
D	A or G or T
H	A or C or T
V	A or C or G
N	any base
. or -	gap

Filename extension¶

Extension	Meaning	Notes
fasta, fa	generic FASTA	Any generic fasta file. See below for other common FASTA file extensions
fna	FASTA nucleic acid	Used generically to specify nucleic acids.
ffn	FASTA nucleotide of gene regions	Contains coding regions for a genome.
faa	FASTA amino acid	Contains amino acid sequences. A multiple protein fasta file can have the more specific extension mpfa.
frn	FASTA non-coding RNA	Contains non-coding RNA regions for a genome, in DNA alphabet e.g. tRNA, rRNA

Reference¶

See this wiki.

FASTA examples¶

1. Download human genome sequence from UCSC¶

wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz

ls

gunzip hg38.fa.gz

head hg38.fa

# How many chromosomes do we have?
grep ">" hg38.fa | wc -l

2. Sequence Retrieval using online tools¶

This is a common task. For example: given several genes, we want to extract gene promoters, or gene exons.

The question is simple, however, for many people who have never done it before, what we really need is a list of genomic coordinates.

Solution 1: Using ProteinPaint (https://proteinpaint.stjude.org/)¶

Fastest way for a single given genomic coordinate: e.g. chr11:5248160-5248251 or chr11 5248160 5248251

To get the reverse completement, just use some online tool, such as: https://www.bioinformatics.org/sms/rev_comp.html

Solution 2: Using Ensembl BioMart (https://www.ensembl.org/biomart/martview/)¶

Google ensembl biomart, click on the first hit.

Convenient way for getting promoter/UTR/exon/intron sequences given a list of gene names/IDs.

Exercise: get the promoter sequences for these genes

Hbg1
Hbg2
hbb
hba1

3: Sequence Retrieval using bedtools ¶

The ultimate solution.

[6]:

!bedtools getfasta


Tool:    bedtools getfasta (aka fastaFromBed)
Version: v2.30.0
Summary: Extract DNA sequences from a fasta file based on feature coordinates.

Usage:   bedtools getfasta [OPTIONS] -fi <fasta> -bed <bed/gff/vcf>

Options:
        -fi             Input FASTA file
        -fo             Output file (opt., default is STDOUT
        -bed            BED/GFF/VCF file of ranges to extract from -fi
        -name           Use the name field and coordinates for the FASTA header
        -name+          (deprecated) Use the name field and coordinates for the FASTA header
        -nameOnly       Use the name field for the FASTA header
        -split          Given BED12 fmt., extract and concatenate the sequences
                        from the BED "blocks" (e.g., exons)
        -tab            Write output in TAB delimited format.
        -bedOut         Report extract sequences in a tab-delimited BED format instead of in FASTA format.
                        - Default is FASTA format.
        -s              Force strandedness. If the feature occupies the antisense,
                        strand, the sequence will be reverse complemented.
                        - By default, strand information is ignored.
        -fullHeader     Use full fasta header.
                        - By default, only the word before the first space or tab
                        is used.
        -rna    The FASTA is RNA not DNA. Reverse complementation handled accordingly.

BED format¶

We are going to use a bed6 or a standard bed format, which contains 6 columns: chr, start, end, name, value, strand.

[8]:

!head CTCF.hg38.random.bed

chr6    92372038        92372057        JASPA_MA0139.1_CTCF_chr6_92372039       9.80328 -
chr1    113890670       113890689       JASPA_MA0139.1_CTCF_chr1_113890671      8.7541  -
chr16   12677107        12677126        JASPA_MA0139.1_CTCF_chr16_12677108      13.9836 -
chr11   116687055       116687074       JASPA_MA0139.1_CTCF_chr11_116687056     9.04918 -
chr13   32306850        32306869        JASPA_MA0139.1_CTCF_chr13_32306851      8.11475 +
chr4    55038826        55038845        JASPA_MA0139.1_CTCF_chr4_55038827       8.18033 +
chr4    116198084       116198103       JASPA_MA0139.1_CTCF_chr4_116198085      12.0328 -
chr12   24244394        24244413        JASPA_MA0139.1_CTCF_chr12_24244395      8.44262 -
chr8    47186476        47186495        JASPA_MA0139.1_CTCF_chr8_47186477       8.85246 +
chr4    82671394        82671413        JASPA_MA0139.1_CTCF_chr4_82671395       9.55738 -

BED3: A BED file where each feature is described by chrom, start, and end.

For example: chr1 11873 14409

BED4: A BED file where each feature is described by chrom, start, end, and name.

For example: chr1 11873 14409 uc001aaa.3

BED5: A BED file where each feature is described by chrom, start, end, name, and score.

For example: chr1 11873 14409 uc001aaa.3 0

BED6: A BED file where each feature is described by chrom, start, end, name, score, and strand.

For example: chr1 11873 14409 uc001aaa.3 0 +

BED12: A BED file where each feature is described by all twelve columns listed above.

For example: chr1 11873 14409 uc001aaa.3 0 + 11873 11873 0 3 354,109,1189, 0,739,1347

[12]:

!bedtools getfasta -fi hg38.fa -bed CTCF.hg38.random.bed -fo CTCF.fa ; head CTCF.fa

>chr6:92372038-92372057
TAGCTCCCTCTTCTGAAGC
>chr1:113890670-113890689
TCTTGCCACCCTCTGACCA
>chr16:12677107-12677126
GACTGGCACCTGCTGGGCA
>chr11:116687055-116687074
CACCTCCACCTCTAGGTCT
>chr13:32306850-32306869
aatccaggaggtggagatt

[13]:

!bedtools getfasta -fi hg38.fa -bed CTCF.hg38.random.bed -fo CTCF.fa -s; head CTCF.fa

>chr6:92372038-92372057(-)
GCTTCAGAAGAGGGAGCTA
>chr1:113890670-113890689(-)
TGGTCAGAGGGTGGCAAGA
>chr16:12677107-12677126(-)
TGCCCAGCAGGTGCCAGTC
>chr11:116687055-116687074(-)
AGACCTAGAGGTGGAGGTG
>chr13:32306850-32306869(+)
aatccaggaggtggagatt

[15]:

!bedtools getfasta -fi hg38.fa -bed CTCF.hg38.random.bed -fo CTCF.fa -s -name; head CTCF.fa

>JASPA_MA0139.1_CTCF_chr6_92372039::chr6:92372038-92372057(-)
GCTTCAGAAGAGGGAGCTA
>JASPA_MA0139.1_CTCF_chr1_113890671::chr1:113890670-113890689(-)
TGGTCAGAGGGTGGCAAGA
>JASPA_MA0139.1_CTCF_chr16_12677108::chr16:12677107-12677126(-)
TGCCCAGCAGGTGCCAGTC
>JASPA_MA0139.1_CTCF_chr11_116687056::chr11:116687055-116687074(-)
AGACCTAGAGGTGGAGGTG
>JASPA_MA0139.1_CTCF_chr13_32306851::chr13:32306850-32306869(+)
aatccaggaggtggagatt

[17]:

!bedtools getfasta -fi hg38.fa -bed CTCF.hg38.random.bed -fo CTCF.fa -s -nameOnly; head CTCF.fa

>JASPA_MA0139.1_CTCF_chr6_92372039(-)
GCTTCAGAAGAGGGAGCTA
>JASPA_MA0139.1_CTCF_chr1_113890671(-)
TGGTCAGAGGGTGGCAAGA
>JASPA_MA0139.1_CTCF_chr16_12677108(-)
TGCCCAGCAGGTGCCAGTC
>JASPA_MA0139.1_CTCF_chr11_116687056(-)
AGACCTAGAGGTGGAGGTG
>JASPA_MA0139.1_CTCF_chr13_32306851(+)
aatccaggaggtggagatt

[19]:

!bedtools getfasta -fi hg38.fa -bed CTCF.hg38.random.bed -fo CTCF.fa -s -nameOnly -tab; head CTCF.fa

JASPA_MA0139.1_CTCF_chr6_92372039(-)    GCTTCAGAAGAGGGAGCTA
JASPA_MA0139.1_CTCF_chr1_113890671(-)   TGGTCAGAGGGTGGCAAGA
JASPA_MA0139.1_CTCF_chr16_12677108(-)   TGCCCAGCAGGTGCCAGTC
JASPA_MA0139.1_CTCF_chr11_116687056(-)  AGACCTAGAGGTGGAGGTG
JASPA_MA0139.1_CTCF_chr13_32306851(+)   aatccaggaggtggagatt
JASPA_MA0139.1_CTCF_chr4_55038827(+)    aagtcagcaggtgccacaa
JASPA_MA0139.1_CTCF_chr4_116198085(-)   ctgccacaagaggatacaa
JASPA_MA0139.1_CTCF_chr12_24244395(-)   CAGTCAGTAGGAGGGGCTG
JASPA_MA0139.1_CTCF_chr8_47186477(+)    taatcaccagtggtcagta
JASPA_MA0139.1_CTCF_chr4_82671395(-)    tagcctgtagatgtcaatt

4: Sequence Retrieval using Python¶

The ultimate solution for more customized analyses.

PyRanges¶

[56]:

import pyranges as pr
import pandas as pd

[57]:

ctcf = pr.read_bed("CTCF.hg38.random.bed",as_df=False).sort() # sort will increase downstream analysis speed
ctcf.head()

[57]:

	Chromosome	Start	End	Name	Score	Strand
0	chr1	11804507	11804526	JASPA_MA0139.1_CTCF_chr1_11804508	13.27870	+
1	chr1	29824616	29824635	JASPA_MA0139.1_CTCF_chr1_29824617	9.62295	+
2	chr1	23368354	23368373	JASPA_MA0139.1_CTCF_chr1_23368355	15.57380	-
3	chr1	29744978	29744997	JASPA_MA0139.1_CTCF_chr1_29744979	8.57377	-
4	chr1	113890670	113890689	JASPA_MA0139.1_CTCF_chr1_113890671	8.75410	-
5	chr1	116134359	116134378	JASPA_MA0139.1_CTCF_chr1_116134360	11.63930	-
6	chr1	153684245	153684264	JASPA_MA0139.1_CTCF_chr1_153684246	8.45902	-
7	chr1	223429627	223429646	JASPA_MA0139.1_CTCF_chr1_223429628	10.93440	-

[58]:

type(ctcf)

[58]:

pyranges.pyranges.PyRanges

[59]:

ctcf.seq = pr.get_fasta(ctcf.unstrand(), "hg38.fa")

[60]:

ctcf.head()

[60]:

	Chromosome	Start	End	Name	Score	Strand	seq
0	chr1	11804507	11804526	JASPA_MA0139.1_CTCF_chr1_11804508	13.27870	+	AGGCCAACAGAGGGCATAG
1	chr1	29824616	29824635	JASPA_MA0139.1_CTCF_chr1_29824617	9.62295	+	AGATCAGCAGAGGGCCCCT
2	chr1	23368354	23368373	JASPA_MA0139.1_CTCF_chr1_23368355	15.57380	-	CTCTGCCATCTCGAGGCCG
3	chr1	29744978	29744997	JASPA_MA0139.1_CTCF_chr1_29744979	8.57377	-	ccctgacctctgtaggtcc
4	chr1	113890670	113890689	JASPA_MA0139.1_CTCF_chr1_113890671	8.75410	-	TCTTGCCACCCTCTGACCA
5	chr1	116134359	116134378	JASPA_MA0139.1_CTCF_chr1_116134360	11.63930	-	GGATGCCACCTAATGGTGG
6	chr1	153684245	153684264	JASPA_MA0139.1_CTCF_chr1_153684246	8.45902	-	TCCTGCCGTCCAGTGGAGG
7	chr1	223429627	223429646	JASPA_MA0139.1_CTCF_chr1_223429628	10.93440	-	GCTTGCCCTCTGCAGCCCA

Pyrange only works for unstranded dataframe

[61]:

def revcomp(seq):
    tab = str.maketrans("ACTG", "TGAC")
    return seq.translate(tab)[::-1]
ctcf.seq = ctcf.as_df().apply(lambda r: revcomp(r['seq']) if r['Strand']=="-" else r['seq'],axis=1)

[66]:

ctcf.head()

[66]:

	Chromosome	Start	End	Name	Score	Strand	seq
0	chr1	11804507	11804526	JASPA_MA0139.1_CTCF_chr1_11804508	13.27870	+	AGGCCAACAGAGGGCATAG
1	chr1	29824616	29824635	JASPA_MA0139.1_CTCF_chr1_29824617	9.62295	+	AGATCAGCAGAGGGCCCCT
2	chr1	23368354	23368373	JASPA_MA0139.1_CTCF_chr1_23368355	15.57380	-	CGGCCTCGAGATGGCAGAG
3	chr1	29744978	29744997	JASPA_MA0139.1_CTCF_chr1_29744979	8.57377	-	cctggatgtctccagtccc
4	chr1	113890670	113890689	JASPA_MA0139.1_CTCF_chr1_113890671	8.75410	-	TGGTCAGAGGGTGGCAAGA
5	chr1	116134359	116134378	JASPA_MA0139.1_CTCF_chr1_116134360	11.63930	-	CCACCATTAGGTGGCATCC
6	chr1	153684245	153684264	JASPA_MA0139.1_CTCF_chr1_153684246	8.45902	-	CCTCCACTGGACGGCAGGA
7	chr1	223429627	223429646	JASPA_MA0139.1_CTCF_chr1_223429628	10.93440	-	TGGGCTGCAGAGGGCAAGC

Look at sequence logo¶

[77]:

from Bio import motifs
from Bio.Seq import Seq

[82]:

m = motifs.create([Seq(x.upper()) for x in ctcf.seq.tolist()])

[84]:

m.weblogo("CTCF.png")

[86]:

from IPython.display import Image
Image("CTCF.png")

[86]:

Manhattan plot¶

[108]:

from qmplot import manhattanplot
# https://github.com/ShujiaHuang/qmplot
manhattanplot(data=ctcf.as_df(),
              chrom="Chromosome",
              pos="Start",
              pv="Score",
              snp="Name",
              logp=False,
              ylabel="Score",
              xlabel="Chromosome",
               xticklabel_kws={"rotation": "vertical"},
               is_show=True,  # display the plot in screen
               figname="output_manhattan_plot.png")

../../_images/content_Bioinformatics_Core_Competencies_NGS101_39_0.png

[108]:

<AxesSubplot:xlabel='Chromosome', ylabel='Score'>

FASTQ format¶

[110]:

!zcat test.10K.R1.fastq.gz | head

@M04990:66:000000000-G83PY:1:2101:2222:14575 1:N:0:TCTTCCTTTCCTTTTTTTGTTTTTCCTTCTTTCTCTTTCTTTTTTTTTGTTTTTTTTATTTGGTTTTTTTTTTTTTTTTCCTCTTGCATCTGTTGTTCT+TCTTTCCC
GCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATG
+
CCCDDDDCCCCCGGGGGGGGGGHHHHGHHGHHHHHHHGGGGGHHHHHHHH
@M04990:66:000000000-G83PY:1:1102:10573:9495 1:N:0:TCGTCCTCTCCTTGTGCCTGTTTCAATGCGTCCTCGCTCGCACTCACTTTTATCGAAGTGACTCGTTCGCCGTTGTGACACCTCGTCCACGCCAATTAT+TATCCTCT
CAATAGTCGTACGCCGATGCGAAACATCGGCCACGTGTGTCGATCTCGTA
+
CCCCBFFFCBFCGGGGFGGGGGGGGHHHGGGGGGGHHHHHHGHGFHGHGG
@M04990:66:000000000-G83PY:1:1102:15470:5522 1:N:0:CCCCGCCACCGTGGGGCGGGGTGGAAGCATGCGCCGCCCCCCGCCGGGCGGGGCGATGTTTTGTGGTAGCCGCGGCGCAAACACCCCCCCCACTAACCC+AGAGTAGA
CATTGGCACAGCTTGTCTCCAGGACCTTTTATTTTAGAACAAAAAAAAAA

gzip: stdout: Broken pipe

Syntax:¶

Each sequence/read is represented in 4 lines
first line is read name, followed by some descriptions
second line is the sequence
third line is just +
fourth line is the quality scores, same length as the sequence

Base qualities, increasing order:¶

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

FASTQ examples¶

1. take a look at the fastq files¶

Usually I just use less.

2. check sequencing quality¶

Use fastqc

[112]:

!mkdir test_fastqc
!module load fastqc; fastqc test.10K.R1.fastq.gz -o test_fastqc

Started analysis of test.10K.R1.fastq.gz
Approx 10% complete for test.10K.R1.fastq.gz
Approx 20% complete for test.10K.R1.fastq.gz
Approx 30% complete for test.10K.R1.fastq.gz
Approx 40% complete for test.10K.R1.fastq.gz
Approx 50% complete for test.10K.R1.fastq.gz
Approx 60% complete for test.10K.R1.fastq.gz
Approx 70% complete for test.10K.R1.fastq.gz
Approx 80% complete for test.10K.R1.fastq.gz
Approx 90% complete for test.10K.R1.fastq.gz
Approx 100% complete for test.10K.R1.fastq.gz
Analysis complete for test.10K.R1.fastq.gz

3. Random sampling fastq¶

[2]:

# sample N reads
!module load seqtk; seqtk sample test.10K.R1.fastq.gz 100 > test.sample.fastq;wc -l test.sample.fastq

400 test.sample.fastq

Note that seqtk only output uncompressed fastq, if you want to a fastq.gz file, you have to gzip it¶

[3]:

!module load seqtk; seqtk sample test.10K.R1.fastq.gz 100 > test.sample.fastq;gzip test.sample.fastq;zcat test.sample.fastq.gz|head

@M04990:66:000000000-G83PY:1:2104:15259:7127 1:N:0:TCGGACGATCATGGGTAGACGGACAAGTATGCAGCGCGCTCAAGCACGTGGATTACTTCGGAGTCGTACGCCGATGCGAAACATCGGCCACCAGATCTG+TAGATCGC
ATCCACAACTATCTGAAAAGGTTTATATTAAAATACTGCTCCCAGCCAGG
+
CCCCCFFCCFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHGHHHGHHG
@M04990:66:000000000-G83PY:1:2104:7163:24261 1:N:0:TCGGACGATCATGGGTAGACGGACAAGTATGCAGCGCGCTCAAGCACGTGGATTCTACGACAGTCGTACGCCGATGCGAAACATCGGCCACTTCCATTG+AGAGTAGA
GCCCATGCCCCAGCTCACCTCCATCATCGTCAACCAGCTCAAGAAGATCA
+
AACCAFFFFFCCGGGGGGGGGGHHHHHHHHHHGHGGHGHHHHHHHHHHHH
@M04990:66:000000000-G83PY:1:1103:27097:20242 1:N:0:TCGGACGATCATGGGTCTCACGGCAAGTATGCACGCGCTCAAGCACGTGGATTCTCACGGAGTCGTACGCCGATGCGAAACATCGGCCACGCATGGCTA+TAGATCGC
CTCCCAGGTTCACCTCATTTTCTTTGGCTCTTCAGTCCGCTGGTTTGAGT

[4]:

# Sample 10% reads
!module load seqtk; seqtk sample test.10K.R1.fastq.gz 0.1 > test.sample.fastq;wc -l test.sample.fastq

4000 test.sample.fastq

Also see: https://hemtools.readthedocs.io/en/latest/content/Bioinformatics_tools/fastqc.html for batch run

4. read alignment using BWA¶

[119]:

!module load bwa;bwa index hg38.fa

[bwa_index] Pack FASTA... 20.66 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=6418572210, availableWord=463634060
[BWTIncConstructFromPacked] 10 iterations done. 99999986 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999986 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999986 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 399999986 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 499999986 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 599999986 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 699999986 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 799999986 characters processed.
[BWTIncConstructFromPacked] 90 iterations done. 899999986 characters processed.
[BWTIncConstructFromPacked] 100 iterations done. 999999986 characters processed.
[BWTIncConstructFromPacked] 110 iterations done. 1099999986 characters processed.
[BWTIncConstructFromPacked] 120 iterations done. 1199999986 characters processed.
[BWTIncConstructFromPacked] 130 iterations done. 1299999986 characters processed.
[BWTIncConstructFromPacked] 140 iterations done. 1399999986 characters processed.
[BWTIncConstructFromPacked] 150 iterations done. 1499999986 characters processed.
[BWTIncConstructFromPacked] 160 iterations done. 1599999986 characters processed.
[BWTIncConstructFromPacked] 170 iterations done. 1699999986 characters processed.
[BWTIncConstructFromPacked] 180 iterations done. 1799999986 characters processed.
[BWTIncConstructFromPacked] 190 iterations done. 1899999986 characters processed.
[BWTIncConstructFromPacked] 200 iterations done. 1999999986 characters processed.
[BWTIncConstructFromPacked] 210 iterations done. 2099999986 characters processed.
[BWTIncConstructFromPacked] 220 iterations done. 2199999986 characters processed.
[BWTIncConstructFromPacked] 230 iterations done. 2299999986 characters processed.
[BWTIncConstructFromPacked] 240 iterations done. 2399999986 characters processed.
[BWTIncConstructFromPacked] 250 iterations done. 2499999986 characters processed.
[BWTIncConstructFromPacked] 260 iterations done. 2599999986 characters processed.
[BWTIncConstructFromPacked] 270 iterations done. 2699999986 characters processed.
[BWTIncConstructFromPacked] 280 iterations done. 2799999986 characters processed.
[BWTIncConstructFromPacked] 290 iterations done. 2899999986 characters processed.
[BWTIncConstructFromPacked] 300 iterations done. 2999999986 characters processed.
[BWTIncConstructFromPacked] 310 iterations done. 3099999986 characters processed.
[BWTIncConstructFromPacked] 320 iterations done. 3199999986 characters processed.
[BWTIncConstructFromPacked] 330 iterations done. 3299999986 characters processed.
^C

[116]:

!module load bwa;\
bwa_index=/home/yli11/Data/Human/hg38/bwa_16a_index/hg38.fa;\
bwa mem $bwa_index test.10K.R1.fastq.gz test.10K.R2.fastq.gz > test.sam

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 20000 sequences (1000000 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 2289, 1, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (62, 168, 228)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 560)
[M::mem_pestat] mean and std.dev: (162.87, 115.26)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 726)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 20000 reads in 2.793 CPU sec, 2.793 real sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa mem /home/yli11/Data/Human/hg38/bwa_16a_index/hg38.fa test.10K.R1.fastq.gz test.10K.R2.fastq.gz
[main] Real time: 22.389 sec; CPU: 4.152 sec

[187]:

!module avail bwa


------------------------ /hpcf/apps/modules/modulefiles ------------------------
   bwa/0.5.10    bwa/0.7.5a        bwa/0.7.15
   bwa/0.6.1     bwa/0.7.12 (D)    bwa/0.7.16a

---------------- /hpcf/authorized_apps/rhel7_apps/module_files -----------------
   bwa/0.7.17

  Where:
   D:  Default Module

Module defaults are chosen based on Find First Rules due to Name/Version/Version
 modules found in the module tree.
See https://lmod.readthedocs.io/en/latest/060_locating.html for details.

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching
any of the "keys".

IMPORTANT

Notice that the bwa version is 0.7.12-r1039, but my index was created by 0.7.16a, so the aligned results may have some error. Better to use the same bwa version.

SAM Format¶

[122]:

!head -n 30 test.sam

@SQ     SN:chr10        LN:133797422
@SQ     SN:chr11        LN:135086622
@SQ     SN:chr12        LN:133275309
@SQ     SN:chr13        LN:114364328
@SQ     SN:chr14        LN:107043718
@SQ     SN:chr15        LN:101991189
@SQ     SN:chr16        LN:90338345
@SQ     SN:chr17        LN:83257441
@SQ     SN:chr18        LN:80373285
@SQ     SN:chr19        LN:58617616
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr20        LN:64444167
@SQ     SN:chr21        LN:46709983
@SQ     SN:chr22        LN:50818468
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr3 LN:198295559
@SQ     SN:chr4 LN:190214555
@SQ     SN:chr5 LN:181538259
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr7 LN:159345973
@SQ     SN:chr8 LN:145138636
@SQ     SN:chr9 LN:138394717
@SQ     SN:chrM LN:16569
@SQ     SN:chrX LN:156040895
@SQ     SN:chrY LN:57227415
@PG     ID:bwa  PN:bwa  VN:0.7.12-r1039 CL:bwa mem /home/yli11/Data/Human/hg38/bwa_16a_index/hg38.fa test.10K.R1.fastq.gz test.10K.R2.fastq.gz
M04990:66:000000000-G83PY:1:2101:2222:14575     77      *       0       0       *       *       0       0       GCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATG      CCCDDDDCCCCCGGGGGGGGGGHHHHGHHGHHHHHHHGGGGGHHHHHHHH      AS:i:0  XS:i:0
M04990:66:000000000-G83PY:1:2101:2222:14575     141     *       0       0       *       *       0       0       ATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGC      AABCCFFFFFFFGGGGGGGGGGHHHHHGGHGGHHHHHGHHGHGGGGGHHH      AS:i:0  XS:i:0
M04990:66:000000000-G83PY:1:1102:10573:9495     77      *       0       0       *       *       0       0       CAATAGTCGTACGCCGATGCGAAACATCGGCCACGTGTGTCGATCTCGTA      CCCCBFFFCBFCGGGGFGGGGGGGGHHHGGGGGGGHHHHHHGHGFHGHGG      AS:i:0  XS:i:0
M04990:66:000000000-G83PY:1:1102:10573:9495     141     *       0       0       *       *       0       0       AACTACACAATCAACGACGCCCTCGTCTTCCTTCTCTTCCTTCCTGTCTC      1>111@@BD1F1FB11AE0AE00BA0AF021A2BG1GGFGFD1B111B2D   AS:i:0  XS:i:0

For detailed SAM format specification, see the official reference: https://samtools.github.io/hts-specs/SAMv1.pdf ¶

The header section¶

[126]:

!module load samtools/1.7;\
samtools view -H test.sam

@SQ     SN:chr10        LN:133797422
@SQ     SN:chr11        LN:135086622
@SQ     SN:chr12        LN:133275309
@SQ     SN:chr13        LN:114364328
@SQ     SN:chr14        LN:107043718
@SQ     SN:chr15        LN:101991189
@SQ     SN:chr16        LN:90338345
@SQ     SN:chr17        LN:83257441
@SQ     SN:chr18        LN:80373285
@SQ     SN:chr19        LN:58617616
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr20        LN:64444167
@SQ     SN:chr21        LN:46709983
@SQ     SN:chr22        LN:50818468
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr3 LN:198295559
@SQ     SN:chr4 LN:190214555
@SQ     SN:chr5 LN:181538259
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr7 LN:159345973
@SQ     SN:chr8 LN:145138636
@SQ     SN:chr9 LN:138394717
@SQ     SN:chrM LN:16569
@SQ     SN:chrX LN:156040895
@SQ     SN:chrY LN:57227415
@PG     ID:bwa  PN:bwa  VN:0.7.12-r1039 CL:bwa mem /home/yli11/Data/Human/hg38/bwa_16a_index/hg38.fa test.10K.R1.fastq.gz test.10K.R2.fastq.gz

@SQ Reference sequence dictionary. The order of @SQ lines defines the alignment sorting order.
SN: Reference sequence name
LN: Reference sequence length

The alignment section¶

[127]:

!module load samtools/1.7;\
samtools view test.sam | head

M04990:66:000000000-G83PY:1:2101:2222:14575     77      *       0       0       *       *       0       0       GCTCGACGCCATTAATAATGTTTTCCGTAAATTCAGCGCCTTCCATGATG      CCCDDDDCCCCCGGGGGGGGGGHHHHGHHGHHHHHHHGGGGGHHHHHHHH      AS:i:0  XS:i:0
M04990:66:000000000-G83PY:1:2101:2222:14575     141     *       0       0       *       *       0       0       ATCATGAGTCAAGTTACTGAACAATCCGTACGTTTCCAGACCGCTTTGGC      AABCCFFFFFFFGGGGGGGGGGHHHHHGGHGGHHHHHGHHGHGGGGGHHH      AS:i:0  XS:i:0
M04990:66:000000000-G83PY:1:1102:10573:9495     77      *       0       0       *       *       0       0       CAATAGTCGTACGCCGATGCGAAACATCGGCCACGTGTGTCGATCTCGTA      CCCCBFFFCBFCGGGGFGGGGGGGGHHHGGGGGGGHHHHHHGHGFHGHGG      AS:i:0  XS:i:0
M04990:66:000000000-G83PY:1:1102:10573:9495     141     *       0       0       *       *       0       0       AACTACACAATCAACGACGCCCTCGTCTTCCTTCTCTTCCTTCCTGTCTC      1>111@@BD1F1FB11AE0AE00BA0AF021A2BG1GGFGFD1B111B2D   AS:i:0  XS:i:0
M04990:66:000000000-G83PY:1:1102:15470:5522     83      chr7    113829330       0       21M29S  =       113829329       -22     TTTTTTTTTTGTTCTAAAATAAAAGGTCCTGGAGACAAGCTGTGCCAATG      GEAECGA0GFHFAFHGHHGGGHHHGGC11BGGFGGGGGAFFFFC1A@AAA      NM:i:0  MD:Z:21 AS:i:21 XS:i:21 XA:Z:chr12,+60139811,26S21M3S,0;chr2,-139726748,3S20M27S,0;chr7,+88270110,23S19M8S,0;
M04990:66:000000000-G83PY:1:1102:15470:5522     163     chr7    113829329       0       16S22M12S       =       113829330       22      ATGTAAGGGGTTTTTTTTTTTTTTTTTGTTCTAAAATAAAAGGTCCTGGA      AABAAFFFFBBBGFGGGGGGGGGGGGG0BF3@BFFFEGHGB3EBFE3GF2      NM:i:0  MD:Z:22 AS:i:22 XS:i:25
M04990:66:000000000-G83PY:1:1103:16433:22048    65      chrX    24701555        0       50M     chr9    92816867        0       CCTCCCAAGTAGCTGGGATTACAGGTGTCTGCCACCACGCCTGGCTGATT      CCCCCCFFCFFFGGGGGGGGGGHHHHHHHHHHHHHHHHGGGHHHHHGHHH      NM:i:1  MD:Z:46A3       AS:i:46 XS:i:46
M04990:66:000000000-G83PY:1:1103:16433:22048    129     chr9    92816867        0       50M     chrX    24701555        0       CTGTAATCCCAGTACTTTGGGGGGCTGAGATGGGTGGATCATGAGGTCAG      BBBBBFFFFFFFGCGGGGGGGGGGGGGHHGHHHHGEGHHHHHHGHFHGHH      NM:i:2  MD:Z:12C8A28    AS:i:40 XS:i:40
M04990:66:000000000-G83PY:1:2104:26463:8539     99      chr8    113276127       60      50M     =       113276316       239     GCATGAAATTGTAGGCCAAATAATGCCCCTCTACCTCCCCAAAAATGTAT      BBBBBFFFFFFFGGC4GGGGGGHFFHHHGHGHHHHHHHGGHGHHHHHHHH      NM:i:0  MD:Z:50 AS:i:50 XS:i:0
M04990:66:000000000-G83PY:1:2104:26463:8539     147     chr8    113276316       60      50M     =       113276127       -239    AGAAATTTCTTCACCTGGAGTATAAGATGTGTGGAGGGAGAAAGTAGCAG      HHHHHHGHHHHHHHHHHHHHHHHHHHHGGGGGGGGGGGFFFFFFFBBBBB      NM:i:0  MD:Z:50 AS:i:50 XS:i:19
samtools view: writing to standard output failed: Broken pipe
samtools view: error closing standard output: -1

Col	Field	Type	N/A Value	Description
1	QNAME	string	mandatory	The query/read name.
2	FLAG	int	mandatory	The record’s flag.
3	RNAME	string	*	The reference name.
4	POS	32-bit int	0	1-based position on the reference.
5	MAPQ	8-bit int	255	The mapping quality.
6	CIGAR	string	*	The CIGAR string of the alignment.
7	RNEXT	string	*	The reference of the next mate/segment.
8	PNEXT	string	0	The position of the next mate/seqgment.
9	TLEN	string	0	The observed length of the template.
10	SEQ	string	*	The query/read sequence.
11	QUAL	string	*	The ASCII PHRED-encoded base qualities.

Notes:

The SAM standard talks about “queries”. In the context of read mapping, where the format originates, queries are reads.
The SAM standard talks about “templates” and “segments”. In the case of paired-end and mate-pair mapping the template consists of two segments, each is one read. The template length is the insert size.
Paired-end reads are stored as two alignments records with the same QNAME. The first and second mate are discriminated by the FLAG values.
When the FLAG indicates that SEQ is reverse-complemented, then QUAL is reversed.
Positions in the SAM file are 1-based.
The qualities must be stored as ASCII PRED-encoded qualities.
The query and reference names must not contain whitespace. It is common to trim query and reference ids at the first space.
Getting the exact mapping coordinates directly based on the 4th column is not trivial, better to use bedtools for correctness.
MAPQ 0 could mean multi-mapped or repeat regions.

SAM flags¶

I usually use this online tool to make sense of SAM flags: https://broadinstitute.github.io/picard/explain-flags.html

BAM format¶

Binary format for SAM files. An index file (.bai) is usually generated for each position sorted BAM file

Common Examples¶

1. Convert SAM to BAM¶

2. Sort BAM file¶

3. Index BAM file¶

4. Get summary statistics¶

[131]:

!module load samtools/1.7;samtools view -b test.sam > test.bam

[132]:

!module load samtools/1.7;samtools sort -o test.st.bam test.bam

[133]:

!module load samtools/1.7;samtools index test.st.bam

[134]:

!module load samtools/1.7;samtools flagstat test.st.bam

20000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
8455 + 0 mapped (42.27% : N/A)
20000 + 0 paired in sequencing
10000 + 0 read1
10000 + 0 read2
7854 + 0 properly paired (39.27% : N/A)
7994 + 0 with itself and mate mapped
461 + 0 singletons (2.31% : N/A)
62 + 0 with mate mapped to a different chr
23 + 0 with mate mapped to a different chr (mapQ>=5)

[135]:

!module load samtools/1.7;samtools idxstats test.st.bam

chr10   133797422       277     22
chr11   135086622       326     22
chr12   133275309       307     22
chr13   114364328       143     14
chr14   107043718       215     16
chr15   101991189       169     8
chr16   90338345        238     11
chr17   83257441        261     14
chr18   80373285        116     4
chr19   58617616        312     13
chr1    248956422       681     36
chr20   64444167        160     3
chr21   46709983        413     39
chr22   50818468        115     7
chr2    242193529       629     40
chr3    198295559       410     27
chr4    190214555       311     15
chr5    181538259       397     22
chr6    170805979       541     23
chr7    159345973       356     18
chr8    145138636       520     24
chr9    138394717       295     21
chrM    16569   1029    28
chrX    156040895       196     10
chrY    57227415        38      2
*       0       0       11084

[7]:

!module load samtools/1.7;samtools stats test.st.bam | head -n 50

# This file was produced by samtools stats (1.7+htslib-1.7) and can be plotted using plot-bamstats
# This file contains statistics for all reads.
# The command line was:  stats test.st.bam
# CHK, Checksum [2]Read Names   [3]Sequences    [4]Qualities
# CHK, CRC32 of reads which passed filtering followed by addition (32bit overflow)
CHK     fae7cc8c        e08cae0b        f53795c5
# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN      raw total sequences:    20000
SN      filtered sequences:     0
SN      sequences:      20000
SN      is sorted:      1
SN      1st fragments:  10000
SN      last fragments: 10000
SN      reads mapped:   8455
SN      reads mapped and paired:        7994    # paired-end technology bit set + both mates mapped
SN      reads unmapped: 11545
SN      reads properly paired:  7854    # proper-pair bit set
SN      reads paired:   20000   # paired-end technology bit set
SN      reads duplicated:       0       # PCR or optical duplicate bit set
SN      reads MQ0:      1325    # mapped and MQ=0
SN      reads QC failed:        0
SN      non-primary alignments: 0
SN      total length:   1000000 # ignores clipping
SN      bases mapped:   422750  # ignores clipping
SN      bases mapped (cigar):   385042  # more accurate
SN      bases trimmed:  0
SN      bases duplicated:       0
SN      mismatches:     1936    # from NM fields
SN      error rate:     5.028023e-03    # mismatches / bases mapped (cigar)
SN      average length: 50
SN      maximum length: 50
SN      average quality:        33.5
SN      insert size average:    183.9
SN      insert size standard deviation: 140.8
SN      inward oriented pairs:  3161
SN      outward oriented pairs: 168
SN      pairs with other orientation:   2
SN      pairs on different chromosomes: 31
# First Fragment Qualitites. Use `grep ^FFQ | cut -f 2-` to extract this part.
# Columns correspond to qualities and rows to cycles. First column is the cycle number.
FFQ     1       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       49      0       36      0       0       0       0       0       0       0       0       0       0       164     27      6       1898    3254    3603    750     213     0       0       0       0       0
FFQ     2       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       48      0       27      0       0       0       0       0       0       0       0       0       0       118     40      33      1529    2971    3837    1061    336     0       0       0       0       0
FFQ     3       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       41      0       12      0       0       0       0       0       0       0       0       0       0       83      41      28      1311    2898    3895    1263    428     0       0       0       0       0
FFQ     4       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       45      0       26      0       0       0       0       0       0       0       0       0       0       66      29      35      1229    2838    3858    1394    480     0       0       0       0       0
FFQ     5       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       79      0       22      0       0       0       0       0       0       0       0       0       0       72      55      39      1212    2759    3904    1392    466     0       0       0       0       0
FFQ     6       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       52      0       15      6       8       0       0       0       0       0       0       0       0       26      20      43      210     784     1074    571     129     7062    0       0       0       0
FFQ     7       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       59      0       18      10      7       0       0       0       0       0       0       0       0       23      19      49      239     768     1023    597     132     7056    0       0       0       0
FFQ     8       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       56      5       18      9       12      0       0       0       0       0       0       0       0       24      22      52      238     829     1137    632     154     6812    0       0       0       0
FFQ     9       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       68      6       18      14      9       0       0       0       0       0       0       0       0       30      28      47      246     821     1160    711     182     6660    0       0       0       0
FFQ     10      0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       2       60      7       16      19      10      0       0       0       0       0       0       0       0       38      21      63      205     833     1177    788     203     6558    0       0       0       0

For mapping multiple samples, use the bwa_mem pipeline¶

https://hemtools.readthedocs.io/en/latest/content/NGS_pipelines/bwa_mem.html

module load python/2.7.13

run_lsf.py --guess_input

run_lsf.py -f fastq.tsv -p bwa_mem -g mm9

5. visualize bam file in IGV¶

reads are sub-sampled
highligh a specific read
view as pairs
sort reads by mapping start

6. remove PCR duplicates (rmdup or dedup)¶

What is PCR duplicates? Simply put, identical reads. More precisely, the positions on 5’-end are identical. (that means 3’-end can be different)

For single-end sequencing, rmdup is usually not performed.

For paired-end sequencing, rmdup may be performed.

We never perform rmdup for RNA-seq or amplicon sequencing data.

[148]:

!module load samtools/1.7; \
samtools sort -n -o test.name.st.bam test.st.bam; \
samtools fixmate -m test.name.st.bam test.fixmate.bam; \
samtools sort -o test.fixmate.st.bam test.fixmate.bam; \
samtools markdup test.fixmate.st.bam test.markdup.st.bam; \
samtools markdup -r -s test.markdup.st.bam test.rmdup.st.bam

READ 20000 WRITTEN 19985
EXCLUDED 11545 EXAMINED 8455
PAIRED 7994 SINGLE 461
DULPICATE PAIR 10 DUPLICATE SINGLE 5
DUPLICATE TOTAL 15

[152]:

!module load samtools/1.7; samtools view -H test.fixmate.bam

@HD     VN:1.5  SO:queryname
@SQ     SN:chr10        LN:133797422
@SQ     SN:chr11        LN:135086622
@SQ     SN:chr12        LN:133275309
@SQ     SN:chr13        LN:114364328
@SQ     SN:chr14        LN:107043718
@SQ     SN:chr15        LN:101991189
@SQ     SN:chr16        LN:90338345
@SQ     SN:chr17        LN:83257441
@SQ     SN:chr18        LN:80373285
@SQ     SN:chr19        LN:58617616
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr20        LN:64444167
@SQ     SN:chr21        LN:46709983
@SQ     SN:chr22        LN:50818468
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr3 LN:198295559
@SQ     SN:chr4 LN:190214555
@SQ     SN:chr5 LN:181538259
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr7 LN:159345973
@SQ     SN:chr8 LN:145138636
@SQ     SN:chr9 LN:138394717
@SQ     SN:chrM LN:16569
@SQ     SN:chrX LN:156040895
@SQ     SN:chrY LN:57227415
@PG     ID:bwa  PN:bwa  VN:0.7.12-r1039 CL:bwa mem /home/yli11/Data/Human/hg38/bwa_16a_index/hg38.fa test.10K.R1.fastq.gz test.10K.R2.fastq.gz

[150]:

!module load samtools/1.7; samtools view -H test.rmdup.st.bam

@HD     VN:1.5  SO:coordinate
@SQ     SN:chr10        LN:133797422
@SQ     SN:chr11        LN:135086622
@SQ     SN:chr12        LN:133275309
@SQ     SN:chr13        LN:114364328
@SQ     SN:chr14        LN:107043718
@SQ     SN:chr15        LN:101991189
@SQ     SN:chr16        LN:90338345
@SQ     SN:chr17        LN:83257441
@SQ     SN:chr18        LN:80373285
@SQ     SN:chr19        LN:58617616
@SQ     SN:chr1 LN:248956422
@SQ     SN:chr20        LN:64444167
@SQ     SN:chr21        LN:46709983
@SQ     SN:chr22        LN:50818468
@SQ     SN:chr2 LN:242193529
@SQ     SN:chr3 LN:198295559
@SQ     SN:chr4 LN:190214555
@SQ     SN:chr5 LN:181538259
@SQ     SN:chr6 LN:170805979
@SQ     SN:chr7 LN:159345973
@SQ     SN:chr8 LN:145138636
@SQ     SN:chr9 LN:138394717
@SQ     SN:chrM LN:16569
@SQ     SN:chrX LN:156040895
@SQ     SN:chrY LN:57227415
@PG     ID:bwa  PN:bwa  VN:0.7.12-r1039 CL:bwa mem /home/yli11/Data/Human/hg38/bwa_16a_index/hg38.fa test.10K.R1.fastq.gz test.10K.R2.fastq.gz

7. remove multi-mapped reads (uq)¶

[153]:

!module load samtools/1.7; samtools view -q 1 -b test.rmdup.st.bam > test.rmdup.uq.st.bam

8. filter reads by SAM flags¶

-f INT only include reads with all of the FLAGs in INT present [0]
-F INT only include reads with none of the FLAGS in INT present [0]

-f 2 -F 4 -F 8 -F 256 -F 512 -F 2048¶

[155]:

!module load samtools/1.7; samtools view -f 2 -F 4 -F 8 -F 256 -F 512 -F 2048 -b test.rmdup.uq.st.bam > test.valid_pairs.rmdup.uq.st.bam

9.1 subset SAM/BAM files using samtools view based on coordinates¶

[159]:

# subset by a single region
!module load samtools/1.7; samtools index test.valid_pairs.rmdup.uq.st.bam;\
samtools view test.valid_pairs.rmdup.uq.st.bam "chr10:610000-660000"

M04990:66:000000000-G83PY:1:2101:19247:11601    163     chr10   613723  60      50M     =       613849  176     CCTCTCCACACCCTGGTAAACATGGCATTTCGCACACCTGCTTTGGTCCC      >AABAFFFFFDDGGGGGGGGGGHHHHHHHHHHGGGGHGHHHHHHGHHHHH   NM:i:0  MD:Z:50 AS:i:50 XS:i:0  MQ:i:60 MC:Z:50M        ms:i:1872
M04990:66:000000000-G83PY:1:2101:19247:11601    83      chr10   613849  60      50M     =       613723  -176    TTCCCCGGCCACCCTCCTGATCTCGGCCATCTGATGGCATCACTCCCCAC      GGGGGGHHGGGGGGHHHHHGGGGGHHHHGGGGGGGGGGFFFBCCDDCCBC      NM:i:0  MD:Z:50 AS:i:50 XS:i:0  MQ:i:60 MC:Z:50M        ms:i:1879

[160]:

# subset by a bed file
!module load samtools/1.7; \
samtools view test.valid_pairs.rmdup.uq.st.bam -L ctcfpos.bed

M04990:66:000000000-G83PY:1:1101:17665:21940    83      chr6    30617302        60      50M     =       30617169        -183    CGTGCTAGTGAAACGCCCTTGTCGCGAGACATTATACGCGAGGCGTGAAC      GHHHHHHHHFGGGGGGHHGGGGGFGGHHGGGGGGGGGGBCFCFBFBCCCC      NM:i:0  MD:Z:50 AS:i:50 XS:i:0  MQ:i:60 MC:Z:50M        ms:i:1855
M04990:66:000000000-G83PY:1:2104:3420:8655      163     chr6    36842224        60      50M     =       36842603        429     CACATGCTGTGTGCTGGGTGCAGTCCTAGGCTCAGAGAATAAAAGGCAAA      >>A3>BFFFFFFGGGGC2GEA4DFGHHHCBAGHHFFBFFHHFFF5AGAF2     NM:i:0  MD:Z:50 AS:i:50 XS:i:19 MQ:i:40 MC:Z:50M        ms:i:1875

9.2 subset SAM/BAM files using bedtools intersect based on coordinates¶

[162]:

!module load bedtools; \
bedtools intersect -a test.valid_pairs.rmdup.uq.st.bam -b ctcfpos.bed | samtools view

M04990:66:000000000-G83PY:1:1101:17665:21940    83      chr6    30617302        60      50M     =       30617169        -183    CGTGCTAGTGAAACGCCCTTGTCGCGAGACATTATACGCGAGGCGTGAAC      GHHHHHHHHFGGGGGGHHGGGGGFGGHHGGGGGGGGGGBCFCFBFBCCCC      NM:i:0  MD:Z:50 AS:i:50 XS:i:0  MQ:i:60 MC:Z:50M        ms:i:1855
M04990:66:000000000-G83PY:1:2104:3420:8655      163     chr6    36842224        60      50M     =       36842603        429     CACATGCTGTGTGCTGGGTGCAGTCCTAGGCTCAGAGAATAAAAGGCAAA      >>A3>BFFFFFFGGGGC2GEA4DFGHHHCBAGHHFFBFFHHFFF5AGAF2     NM:i:0  MD:Z:50 AS:i:50 XS:i:19 MQ:i:40 MC:Z:50M        ms:i:1875

[169]:

!module load bedtools; \
bedtools intersect -a test.valid_pairs.rmdup.uq.st.bam -b ctcfpos.bed | bedtools bamtofastq -i - -fq test.fq | head test.fq

@M04990:66:000000000-G83PY:1:1101:17665:21940
GTTCACGCCTCGCGTATAATGTCTCGCGACAAGGGCGTTTCACTAGCACG
+
CCCCBFBFCFCBGGGGGGGGGGHHGGFGGGGGHHGGGGGGFHHHHHHHHG
@M04990:66:000000000-G83PY:1:2104:3420:8655
CACATGCTGTGTGCTGGGTGCAGTCCTAGGCTCAGAGAATAAAAGGCAAA
+
>>A3>BFFFFFFGGGGC2GEA4DFGHHHCBAGHHFFBFFHHFFF5AGAF2

9.3 subset SAM/BAM files using picard based on read list¶

[176]:

!echo M04990:66:000000000-G83PY:1:1101:17665:21940 > test.list
!echo M04990:66:000000000-G83PY:1:2104:3420:8655 >> test.list
!head test.list

M04990:66:000000000-G83PY:1:1101:17665:21940
M04990:66:000000000-G83PY:1:2104:3420:8655

[177]:

!module load java/10.0.2;\
java -jar /hpcf/apps/picard/install/2.9.4/picard.jar\
FilterSamReads I=test.valid_pairs.rmdup.uq.st.bam O=test.out.bam READ_LIST_FILE=test.list FILTER=includeReadList;\
module load samtools;samtools view test.out.bam

Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/research/rgs01/scratch_lsf/java -XX:ParallelGCThreads=1
[Tue Aug 17 00:31:03 CDT 2021] picard.sam.FilterSamReads INPUT=test.valid_pairs.rmdup.uq.st.bam FILTER=includeReadList [OUTPUT SAM/BAM will contain reads that are supplied in the READ_LIST_FILE file] READ_LIST_FILE=test.list OUTPUT=test.out.bam    WRITE_READS_FILES=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Tue Aug 17 00:31:03 CDT 2021] Executing as yli11@noderome182 on Linux 3.10.0-1160.15.2.el7.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 10.0.2+13; Picard version: 2.9.4-SNAPSHOT
INFO    2021-08-17 00:31:03     FilterSamReads  Filtering [presorted=true] test.valid_pairs.rmdup.uq.st.bam -> OUTPUT=test.out.bam [sortorder=coordinate]
INFO    2021-08-17 00:31:03     FilterSamReads  4 SAMRecords written to test.out.bam
[Tue Aug 17 00:31:03 CDT 2021] picard.sam.FilterSamReads done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2147483648
M04990:66:000000000-G83PY:1:1101:17665:21940    163     chr6    30617169        60      50M     =       30617302        183     ATCCAACGCTCTCCGCAAAATTTTACCCACTAGACGACAAAGTAGGCAAA      3>AABFFCCCCDGGGGGGGGGGHHHHHHHGHHHHHGGGGGHHHHHHHHHH   NM:i:0  MD:Z:50 AS:i:50 XS:i:0  MQ:i:60 MC:Z:50M        ms:i:1868
M04990:66:000000000-G83PY:1:1101:17665:21940    83      chr6    30617302        60      50M     =       30617169        -183    CGTGCTAGTGAAACGCCCTTGTCGCGAGACATTATACGCGAGGCGTGAAC      GHHHHHHHHFGGGGGGHHGGGGGFGGHHGGGGGGGGGGBCFCFBFBCCCC      NM:i:0  MD:Z:50 AS:i:50 XS:i:0  MQ:i:60 MC:Z:50M        ms:i:1855
M04990:66:000000000-G83PY:1:2104:3420:8655      163     chr6    36842224        60      50M     =       36842603        429     CACATGCTGTGTGCTGGGTGCAGTCCTAGGCTCAGAGAATAAAAGGCAAA      >>A3>BFFFFFFGGGGC2GEA4DFGHHHCBAGHHFFBFFHHFFF5AGAF2     NM:i:0  MD:Z:50 AS:i:50 XS:i:19 MQ:i:40 MC:Z:50M        ms:i:1875
M04990:66:000000000-G83PY:1:2104:3420:8655      83      chr6    36842603        40      50M     =       36842224        -429    TGGGAGGCAGAGGTTGCAGTGAGCCGAGATCGCACCACTGCACTCCAGCC      HGHHHHHHHHHHHHHHBHHFGGGGHHHGGGFGFGGGGGFFFFFFBBBBBA      NM:i:0  MD:Z:50 AS:i:50 XS:i:50 MQ:i:60 MC:Z:50M        ms:i:1708

10. Bam to fastq¶

[178]:

!module load bedtools;bedtools bamtofastq -i test.valid_pairs.rmdup.uq.st.bam -fq test.fq | head test.fq

@M04990:66:000000000-G83PY:1:1101:17665:21940
GTTCACGCCTCGCGTATAATGTCTCGCGACAAGGGCGTTTCACTAGCACG
+
CCCCBFBFCFCBGGGGGGGGGGHHGGFGGGGGHHGGGGGGFHHHHHHHHG
@M04990:66:000000000-G83PY:1:2104:3420:8655
CACATGCTGTGTGCTGGGTGCAGTCCTAGGCTCAGAGAATAAAAGGCAAA
+
>>A3>BFFFFFFGGGGC2GEA4DFGHHHCBAGHHFFBFFHHFFF5AGAF2

get fastq using seqtk¶

[184]:

# Here, we are missing the index reads, because they are not part of the read names
# If we want to take a look at the index reads, we can use seqtk to subset fastq files
# Also, sometimes the input fastq to BWA mem is pre-processed (i.e., trimed), so if
# we want to see the raw reads, we need to subset raw fastq
!module load bedtools; \
bedtools intersect -a test.valid_pairs.rmdup.uq.st.bam -b ctcfpos.bed | samtools view

M04990:66:000000000-G83PY:1:1101:17665:21940    83      chr6    30617302        60      50M     =       30617169        -183    CGTGCTAGTGAAACGCCCTTGTCGCGAGACATTATACGCGAGGCGTGAAC      GHHHHHHHHFGGGGGGHHGGGGGFGGHHGGGGGGGGGGBCFCFBFBCCCC      NM:i:0  MD:Z:50 AS:i:50 XS:i:0  MQ:i:60 MC:Z:50M        ms:i:1855
M04990:66:000000000-G83PY:1:2104:3420:8655      163     chr6    36842224        60      50M     =       36842603        429     CACATGCTGTGTGCTGGGTGCAGTCCTAGGCTCAGAGAATAAAAGGCAAA      >>A3>BFFFFFFGGGGC2GEA4DFGHHHCBAGHHFFBFFHHFFF5AGAF2     NM:i:0  MD:Z:50 AS:i:50 XS:i:19 MQ:i:40 MC:Z:50M        ms:i:1875

[186]:

!module load bedtools seqtk; \
bedtools intersect -a test.valid_pairs.rmdup.uq.st.bam -b ctcfpos.bed | samtools view | cut -f 1 > read.list ;\
seqtk subseq test.10K.R1.fastq.gz read.list

@M04990:66:000000000-G83PY:1:2104:3420:8655 1:N:0:TCGGACGATCATGGGTAGACGGACAAGTATGCAGCGCGCTCAAGCACGTGGATTACCGAGCAGTCGTACGCCGATGCGAAACATCGGCCACTCTACGAC+TAGATCGC
GGCTGGAGTGCAGTGGTGCGATCTCGGCTCACTGCAACCTCTGCCTCCCA
+
ABBBBBFFFFFFGGGGGFGFGGGHHHGGGGFHHBHHHHHHHHHHHHHHGH
@M04990:66:000000000-G83PY:1:1101:17665:21940 1:N:0:TCGGACGATCATGGGTGATACGTCAAGTATGCAGCGCGCTCAAGCACGTGGATTGTACCTTAGTCGTACGCCGATGCGAAACATCGGCCACTTCTGTGT+TAGATCGC
GTTCACGCCTCGCGTATAATGTCTCGCGACAAGGGCGTTTCACTAGCACG
+
CCCCBFBFCFCBGGGGGGGGGGHHGGFGGGGGHHGGGGGGFHHHHHHHHG

BED examples¶

bed overlaps using intervene¶

https://intervene.readthedocs.io/en/latest/index.html

[189]:

!ls bedtools_test/*bed

bedtools_test/B_cell.bed    bedtools_test/HSPC_LOC.bed
bedtools_test/CD34_D0.bed   bedtools_test/mono_cell.bed
bedtools_test/Ery_cell.bed  bedtools_test/T_cell.bed

[203]:

!intervene upset --figtype png -i bedtools_test/*bed -o upset_plot --save-overlaps


Running UpSet module. Please wait...


You are done! Please check your results @ upset_plot.
Thank you for using Intervene!

[204]:

!ls upset_plot/* | head

upset_plot/Intervene_upset_combinations.txt
upset_plot/Intervene_upset.pdf
upset_plot/Intervene_upset.png
upset_plot/Intervene_upset.R

upset_plot/sets:
000001_T_cell.bed
000010_mono_cell.bed
000011_mono_cell_T_cell.bed
001000_Ery_cell.bed

[205]:

from IPython.display import Image
Image("upset_plot/Intervene_upset.png")

[205]:

../../_images/content_Bioinformatics_Core_Competencies_NGS101_104_0.png

[206]:

!intervene pairwise --figtype png -i bedtools_test/*bed -o pairwise_frac_plot --compute frac --htype pie


Performing a pairwise intersection analysis. Please wait...


You are done! Please check your results @ pairwise_frac_plot.
Thank you for using Intervene!

[207]:

!ls pairwise_frac_plot/* | head

pairwise_frac_plot/Intervene_pairwise_frac_matrix.txt
pairwise_frac_plot/Intervene_pairwise_frac.png
pairwise_frac_plot/Intervene_pairwise_frac.R

[222]:

from IPython.display import Image
Image("pairwise_frac_plot/Intervene_pairwise_frac.png", width=500, height=500)

[222]:

../../_images/content_Bioinformatics_Core_Competencies_NGS101_107_0.png

[209]:

!intervene venn --figtype png -i bedtools_test/*bed -o venn_plot --save-overlaps


Generating a 6-way "venn" diagram. Please wait...


Done! Please check your results @ venn_plot.
Thank you for using Intervene!

[210]:

!ls venn_plot/* | head

venn_plot/Intervene_venn.png

venn_plot/sets:
000001_T_cell.bed
000010_mono_cell.bed
000011_mono_cell_T_cell.bed
001000_Ery_cell.bed
001001_Ery_cell_T_cell.bed
001010_Ery_cell_mono_cell.bed
001011_Ery_cell_mono_cell_T_cell.bed

[220]:

from IPython.display import Image
Image("venn_plot/Intervene_venn.png", width=500, height=500)

[220]:

../../_images/content_Bioinformatics_Core_Competencies_NGS101_110_0.png

BEDTools¶

[223]:

!bedtools

bedtools is a powerful toolset for genome arithmetic.

Version:   v2.30.0
About:     developed in the quinlanlab.org and by many contributors worldwide.
Docs:      http://bedtools.readthedocs.io/
Code:      https://github.com/arq5x/bedtools2
Mail:      https://groups.google.com/forum/#!forum/bedtools-discuss

Usage:     bedtools <subcommand> [options]

The bedtools sub-commands include:

[ Genome arithmetic ]
    intersect     Find overlapping intervals in various ways.
    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.
    coverage      Compute the coverage over defined intervals.
    map           Apply a function to a column for each overlapping interval.
    genomecov     Compute the coverage over an entire genome.
    merge         Combine overlapping/nearby intervals into a single interval.
    cluster       Cluster (but don't merge) overlapping/nearby intervals.
    complement    Extract intervals _not_ represented by an interval file.
    shift         Adjust the position of intervals.
    subtract      Remove intervals based on overlaps b/w two files.
    slop          Adjust the size of intervals.
    flank         Create new intervals from the flanks of existing intervals.
    sort          Order the intervals in a file.
    random        Generate random intervals in a genome.
    shuffle       Randomly redistribute intervals in a genome.
    sample        Sample random records from file using reservoir sampling.
    spacing       Report the gap lengths between intervals in a file.
    annotate      Annotate coverage of features from multiple files.

[ Multi-way file comparisons ]
    multiinter    Identifies common intervals among multiple interval files.
    unionbedg     Combines coverage intervals from multiple BEDGRAPH files.

[ Paired-end manipulation ]
    pairtobed     Find pairs that overlap intervals in various ways.
    pairtopair    Find pairs that overlap other pairs in various ways.

[ Format conversion ]
    bamtobed      Convert BAM alignments to BED (& other) formats.
    bedtobam      Convert intervals to BAM records.
    bamtofastq    Convert BAM records to FASTQ records.
    bedpetobam    Convert BEDPE intervals to BAM records.
    bed12tobed6   Breaks BED12 intervals into discrete BED6 intervals.

[ Fasta manipulation ]
    getfasta      Use intervals to extract sequences from a FASTA file.
    maskfasta     Use intervals to mask sequences from a FASTA file.
    nuc           Profile the nucleotide content of intervals in a FASTA file.

[ BAM focused tools ]
    multicov      Counts coverage from multiple BAMs at specific intervals.
    tag           Tag BAM alignments based on overlaps with interval files.

[ Statistical relationships ]
    jaccard       Calculate the Jaccard statistic b/w two sets of intervals.
    reldist       Calculate the distribution of relative distances b/w two files.
    fisher        Calculate Fisher statistic b/w two feature files.

[ Miscellaneous tools ]
    overlap       Computes the amount of overlap from two intervals.
    igv           Create an IGV snapshot batch script.
    links         Create a HTML page of links to UCSC locations.
    makewindows   Make interval "windows" across a genome.
    groupby       Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
    expand        Replicate lines based on lists of values in columns.
    split         Split a file into multiple files with equal records or base pairs.
    summary       Statistical summary of intervals in a file.

[ General Parameters ]
     --cram-ref    Reference used by a CRAM input

[ General help ]
    --help        Print this help menu.
    --version     What version of bedtools are you using?.
    --contact     Feature requests, bugs, mailing lists, etc.

Most common bedtools commands¶

intersect, Find overlapping intervals in various ways.
closest, Find the closest, potentially non-overlapping interval.
merge, Combine overlapping/nearby intervals into a single interval.
slop, Adjust the size of intervals.
sort , Order the intervals in a file.

Other subcommand usage is very similar, you can read invidual documentation for more details. Just google bedtools

bedtools intersect¶

[226]:

!ls hg19-blacklist.v2.bed.gz

hg19-blacklist.v2.bed.gz

[227]:

!ls *bed

bd9835eb2b5b.bed      example.bed                        matches.bed.bed
ca218748b8db.bed      get_promoter_yli11_2019-12-10.bed  out.bed
CTCF.hg38.random.bed  get_promoter_yli11_2019-12-18.bed  output.bed
ctcfneg_1x.bed        get_promoter_yli11_2020-01-06.bed  test23.bed
ctcfpos.bed           input.bed                          test2.bed
ctcfpos.neg.bed       input.bed.assigned_targets.bed     test.bed
d3.bed                loci.bed                           tmp.out.bed
df1.bed               ma.bed
df2.bed               matches.bed

[229]:

!bedtools intersect -a ctcfpos.neg.bed -b hg19-blacklist.v2.bed.gz | head

chr17   19033560        19033878
chr22   16741870        16742161
chr22   15446088        15446374
chr1    144198266       144198544
chr8    12379842        12380140
chr22   16266062        16266356
chr22   21843468        21843748
chr5    69578590        69578870
chrX    114989405       114989716
chr12   495349  495656

[230]:

!bedtools intersect -a ctcfpos.neg.bed -b hg19-blacklist.v2.bed.gz | bedtools sort -i - | head

chr1    121146526       121146821
chr1    142089885       142090188
chr1    144198266       144198544
chr11   1148412 1148715
chr12   495349  495656
chr14   18713048        18713369
chr14   19136813        19137201
chr15   19568430        19568738
chr15   19973033        19973368
chr15   21664313        21664604

[232]:

!bedtools intersect -b ctcfpos.neg.bed -a hg19-blacklist.v2.bed.gz -u | bedtools sort -i - | head

chr1    120926100       121149300       High Signal Region
chr1    121341500       145396500       High Signal Region
chr11   1141100 1214300 High Signal Region
chr12   479900  531700  High Signal Region
chr14   0       20303800        High Signal Region
chr15   0       20166200        High Signal Region
chr15   20200400        22365100        High Signal Region
chr15   30358500        30919300        High Signal Region
chr15   32445900        32915200        High Signal Region
chr15   82582300        83213900        High Signal Region

[233]:

!bedtools intersect -b ctcfpos.neg.bed -a hg19-blacklist.v2.bed.gz | bedtools sort -i - | head

chr1    121146526       121146821       High Signal Region
chr1    142089885       142090188       High Signal Region
chr1    144198266       144198544       High Signal Region
chr11   1148412 1148715 High Signal Region
chr12   495349  495656  High Signal Region
chr14   18713048        18713369        High Signal Region
chr14   19136813        19137201        High Signal Region
chr15   19568430        19568738        High Signal Region
chr15   19973033        19973368        High Signal Region
chr15   21664313        21664604        High Signal Region

remove black list regions¶

[234]:

!bedtools intersect -a ctcfpos.neg.bed -b hg19-blacklist.v2.bed.gz -v -wa | bedtools sort -i - | head

chr1    1261101 1261398
chr1    1729311 1729599
chr1    2245235 2245525
chr1    3153860 3154143
chr1    3244752 3245042
chr1    3418422 3418758
chr1    4443875 4444191
chr1    5243988 5244261
chr1    5641142 5641440
chr1    6010760 6011094

[237]:

!bedtools intersect -a ctcfpos.neg.bed -b hg19-blacklist.v2.bed.gz -wa -wb | bedtools sort -i - | head

chr1    121146526       121146821       chr1    120926100       121149300       High Signal Region
chr1    142089885       142090188       chr1    121341500       145396500       High Signal Region
chr1    144198266       144198544       chr1    121341500       145396500       High Signal Region
chr11   1148412 1148715 chr11   1141100 1214300 High Signal Region
chr12   495349  495656  chr12   479900  531700  High Signal Region
chr14   18713048        18713369        chr14   0       20303800        High Signal Region
chr14   19136813        19137201        chr14   0       20303800        High Signal Region
chr15   19568430        19568738        chr15   0       20166200        High Signal Region
chr15   19973033        19973368        chr15   0       20166200        High Signal Region
chr15   21664313        21664604        chr15   20200400        22365100        High Signal Region

[246]:

!bedtools intersect -b ctcfpos.neg.bed -a hg19-blacklist.v2.bed.gz -wa

chr11   1141100 1214300 High Signal Region
chr12   479900  531700  High Signal Region
chr14   0       20303800        High Signal Region
chr14   0       20303800        High Signal Region
chr15   0       20166200        High Signal Region
chr15   0       20166200        High Signal Region
chr15   20200400        22365100        High Signal Region
chr15   20200400        22365100        High Signal Region
chr15   30358500        30919300        High Signal Region
chr15   32445900        32915200        High Signal Region
chr15   82582300        83213900        High Signal Region
chr15   82582300        83213900        High Signal Region
chr17   4734800 4736700 Low Mappability
chr17   18928600        19140800        High Signal Region
chr17   22207000        25341300        High Signal Region
chr17   22207000        25341300        High Signal Region
chr18   15293900        18552900        High Signal Region
chr1    120926100       121149300       High Signal Region
chr1    121341500       145396500       High Signal Region
chr1    121341500       145396500       High Signal Region
chr21   9594900 10366000        High Signal Region
chr22   0       16962100        High Signal Region
chr22   0       16962100        High Signal Region
chr22   0       16962100        High Signal Region
chr22   20304800        20708400        High Signal Region
chr22   21466100        21916600        High Signal Region
chr22   21466100        21916600        High Signal Region
chr2    87441300        88290400        High Signal Region
chr2    90267000        95326200        High Signal Region
chr2    90267000        95326200        High Signal Region
chr2    97718400        98232300        High Signal Region
chr3    75678100        75917700        High Signal Region
chr3    197798000       198022400       High Signal Region
chr4    190795300       191154200       High Signal Region
chr5    68830000        70669400        High Signal Region
chr5    68830000        70669400        High Signal Region
chr6    62371500        62383900        Low Mappability
chr8    12252000        12466300        High Signal Region
chrX    114959100       115006100       High Signal Region
chrY    7432700 13491000        High Signal Region

[247]:

!bedtools intersect -b ctcfpos.neg.bed -a hg19-blacklist.v2.bed.gz -wa | bedtools sort -i - | bedtools merge -i - -o count -c 4

chr1    120926100       121149300       1
chr1    121341500       145396500       2
chr11   1141100 1214300 1
chr12   479900  531700  1
chr14   0       20303800        2
chr15   0       20166200        2
chr15   20200400        22365100        2
chr15   30358500        30919300        1
chr15   32445900        32915200        1
chr15   82582300        83213900        2
chr17   4734800 4736700 1
chr17   18928600        19140800        1
chr17   22207000        25341300        2
chr18   15293900        18552900        1
chr2    87441300        88290400        1
chr2    90267000        95326200        2
chr2    97718400        98232300        1
chr21   9594900 10366000        1
chr22   0       16962100        3
chr22   20304800        20708400        1
chr22   21466100        21916600        2
chr3    75678100        75917700        1
chr3    197798000       198022400       1
chr4    190795300       191154200       1
chr5    68830000        70669400        2
chr6    62371500        62383900        1
chr8    12252000        12466300        1
chrX    114959100       115006100       1
chrY    7432700 13491000        1

merge peaks¶

[249]:

!ls *bed

bd9835eb2b5b.bed      example.bed                        matches.bed.bed
ca218748b8db.bed      get_promoter_yli11_2019-12-10.bed  out.bed
CTCF.hg38.random.bed  get_promoter_yli11_2019-12-18.bed  output.bed
ctcfneg_1x.bed        get_promoter_yli11_2020-01-06.bed  test23.bed
ctcfpos.bed           input.bed                          test2.bed
ctcfpos.neg.bed       input.bed.assigned_targets.bed     test.bed
d3.bed                loci.bed                           tmp.out.bed
df1.bed               ma.bed
df2.bed               matches.bed

[254]:

!cat ctcfpos.bed ctcfpos.neg.bed | bedtools sort -i - | bedtools merge -i - > merge.bed; head merge.bed

chr1    535747  536177
chr1    1261101 1261398
chr1    1430705 1431030
chr1    1729311 1729599
chr1    2245235 2245525
chr1    2303085 2303376
chr1    3153860 3154143
chr1    3244752 3245042
chr1    3418422 3418758
chr1    4443875 4444191

bedtools closest¶

[256]:

!bedtools sort -i ctcfpos.bed > t1.bed
!bedtools sort -i ctcfpos.neg.bed > t2.bed
!bedtools closest -a t1.bed -b t2.bed -d | head

chr1    535747  536177  chr1    1261101 1261398 724925
chr1    1430705 1431030 chr1    1261101 1261398 169308
chr1    2303085 2303376 chr1    2245235 2245525 57561
chr1    5492536 5492843 chr1    5641142 5641440 148300
chr1    6326110 6326418 chr1    6010760 6011094 315017
chr1    6702688 6702985 chr1    6010760 6011094 691595
chr1    8296980 8297279 chr1    8295680 8295961 1020
chr1    8831580 8831868 chr1    8878641 8878924 46774
chr1    8970722 8971025 chr1    8878641 8878924 91799
chr1    9093414 9093739 chr1    8878641 8878924 214491

extend bed file by certain size¶

[258]:

!bedtools slop -h


Tool:    bedtools slop (aka slopBed)
Version: v2.30.0
Summary: Add requested base pairs of "slop" to each feature.

Usage:   bedtools slop [OPTIONS] -i <bed/gff/vcf> -g <genome> [-b <int> or (-l and -r)]

Options:
        -b      Increase the BED/GFF/VCF entry -b base pairs in each direction.
                - (Integer) or (Float, e.g. 0.1) if used with -pct.

        -l      The number of base pairs to subtract from the start coordinate.
                - (Integer) or (Float, e.g. 0.1) if used with -pct.

        -r      The number of base pairs to add to the end coordinate.
                - (Integer) or (Float, e.g. 0.1) if used with -pct.

        -s      Define -l and -r based on strand.
                E.g. if used, -l 500 for a negative-stranded feature,
                it will add 500 bp downstream.  Default = false.

        -pct    Define -l and -r as a fraction of the feature's length.
                E.g. if used on a 1000bp feature, -l 0.50,
                will add 500 bp "upstream".  Default = false.

        -header Print the header from the input file prior to results.

Notes:
        (1)  Starts will be set to 0 if options would force it below 0.
        (2)  Ends will be set to the chromosome length if  requested slop would
        force it above the max chrom length.
        (3)  The genome file should tab delimited and structured as follows:

        <chromName><TAB><chromSize>

        For example, Human (hg19):
        chr1    249250621
        chr2    243199373
        ...
        chr18_gl000207_random   4262

Tips:
        One can use the UCSC Genome Browser's MySQL database to extract
        chromosome sizes. For example, H. sapiens:

        mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -e \
        "select chrom, size from hg19.chromInfo"  > hg19.genome

[259]:

!chrom_size=/home/yli11/Data/Human/hg19/annotations/hg19.chrom.sizes;\
bedtools slop -i ctcfpos.bed -g $chrom_size -b 10 | head

chr1    36659006        36659360
chr1    67738730        67739065
chr1    15630222        15630586
chr1    9850142 9850470
chr1    43767393        43767722
chr1    26616134        26616456
chr1    151462162       151462521
chr1    27243619        27243952
chr1    47431156        47431494
chr1    44243924        44244249

[260]:

!head ctcfpos.bed

chr1    36659016        36659350
chr1    67738740        67739055
chr1    15630232        15630576
chr1    9850152 9850460
chr1    43767403        43767712
chr1    26616144        26616446
chr1    151462172       151462511
chr1    27243629        27243942
chr1    47431166        47431484
chr1    44243934        44244239

[261]:

!chrom_size=/home/yli11/Data/Human/hg19/annotations/hg19.chrom.sizes;\
bedtools slop -i ctcfpos.bed -g $chrom_size -l 1 -r 10 | head

chr1    36659015        36659360
chr1    67738739        67739065
chr1    15630231        15630586
chr1    9850151 9850470
chr1    43767402        43767722
chr1    26616143        26616456
chr1    151462171       151462521
chr1    27243628        27243952
chr1    47431165        47431494
chr1    44243933        44244249

Bigwiggle Format¶

[ ]:

## Bigwiggle Format