Load library¶

[1]:

library("DESeq2")

Loading required package: S4Vectors

Loading required package: stats4

Loading required package: BiocGenerics

Loading required package: parallel


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min



Attaching package: ‘S4Vectors’


The following object is masked from ‘package:base’:

    expand.grid


Loading required package: IRanges

Loading required package: GenomicRanges

Loading required package: GenomeInfoDb

Loading required package: SummarizedExperiment

Loading required package: MatrixGenerics

Loading required package: matrixStats


Attaching package: ‘MatrixGenerics’


The following objects are masked from ‘package:matrixStats’:

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars


Loading required package: Biobase

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.



Attaching package: ‘Biobase’


The following object is masked from ‘package:MatrixGenerics’:

    rowMedians


The following objects are masked from ‘package:matrixStats’:

    anyMissing, rowMedians

Read count table¶

This count table is generated from the diffpeak pipeline: https://hemtools.readthedocs.io/en/latest/content/Differential_analysis/deseq2_diffpeak.html?highlight=diffpeak

[2]:

infile="DiffPeak_syi_2021-11-16.count_table.bed"
countData = read.csv(infile,sep="\t")
head(countData)

A data.frame: 6 × 10
	Geneid	Chr	Start	End	Strand	Length	X1098561_Hudep2_CTCF.rmdup.uq.bam	X1047954_Hudep2_CTCF_50bp.rmdup.uq.bam	X2268629_KO_29_CTCF_S134_L004.rmdup.uq.bam	X2268630_KO_68_CTCF_S135_L004.rmdup.uq.bam
	<chr>	<chr>	<int>	<int>	<chr>	<int>	<int>	<int>	<int>	<int>
1	region_1	chr1	15918	16556	+	639	52	34	278	32
2	region_2	chr1	54581	54686	+	106	6	9	38	3
3	region_3	chr1	91023	91696	+	674	38	15	381	130
4	region_4	chr1	104705	105259	+	555	40	34	150	34
5	region_5	chr1	115570	115837	+	268	35	19	140	23
6	region_6	chr1	138518	139511	+	994	32	20	268	24

Specify design matrix, control + treatment¶

[3]:

group1=c("X1098561_Hudep2_CTCF.rmdup.uq.bam","X1047954_Hudep2_CTCF_50bp.rmdup.uq.bam") # WT
group2=c("X2268629_KO_29_CTCF_S134_L004.rmdup.uq.bam","X2268630_KO_68_CTCF_S135_L004.rmdup.uq.bam") # KO
group_label <- c(rep("control", length(group1)), rep("treatment", length(group2)))

[4]:

group_label

'control'
'control'
'treatment'
'treatment'

Get count matrix¶

[5]:

mat <- countData[,7:10]
rownames(mat) <- countData[,1]
head(mat)

A data.frame: 6 × 4
	X1098561_Hudep2_CTCF.rmdup.uq.bam	X1047954_Hudep2_CTCF_50bp.rmdup.uq.bam	X2268629_KO_29_CTCF_S134_L004.rmdup.uq.bam	X2268630_KO_68_CTCF_S135_L004.rmdup.uq.bam
	<int>	<int>	<int>	<int>
region_1	52	34	278	32
region_2	6	9	38	3
region_3	38	15	381	130
region_4	40	34	150	34
region_5	35	19	140	23
region_6	32	20	268	24

RUN DEseq2¶

[6]:

sample_info <- data.frame(sampleName=c(group1, group2), Group=group_label)
dds <- DESeqDataSetFromMatrix(countData=mat, colData=sample_info, design=~Group)
dds = DESeq(dds) # main function
res=results(dds)
raw_count = counts(dds)
norm_count = counts(dds,normalized=TRUE)

Warning message in DESeqDataSet(se, design = design, ignoreRank):
“some variables in design formula are characters, converting to factors”
estimating size factors

estimating dispersions

gene-wise dispersion estimates

mean-dispersion relationship

final dispersion estimates

fitting model and testing

Save result (this is the result without LFC shrinkage)¶

[7]:

outfile="deseq2.result"
colnames(norm_count) = paste(colnames(norm_count),".norm",sep="")
d <- data.frame(countData,norm_count, logFC=res[,"log2FoldChange"], AveExpr=res[,"baseMean"], t=res[,"stat"], P.Value=res[,"pvalue"], adj.P.Val=res[,"padj"])
d <- d[order(d[,"P.Value"]),]
write.table(d, file=paste(outfile, ".deseq2_wo_lfcShrink.tsv", sep=""), sep="\t", quote=FALSE, row.names=FALSE)

[8]:

plotMA(res, ylim = c(-4, 4))

../_images/jupyter_notebooks_DEseq2_example_13_0.png

apply LFC shrinkage¶

[10]:

res <- lfcShrink(dds,coef="Group_treatment_vs_control", type="ashr")
plotMA(res, ylim = c(-4, 4))

using 'ashr' for LFC shrinkage. If used in published research, please cite:
    Stephens, M. (2016) False discovery rates: a new deal. Biostatistics, 18:2.
    https://doi.org/10.1093/biostatistics/kxw041

../_images/jupyter_notebooks_DEseq2_example_15_1.png

save new result¶

[12]:

head(res)

log2 fold change (MMSE): Group treatment vs control
Wald test p-value: Group treatment vs control
DataFrame with 6 rows and 5 columns
          baseMean log2FoldChange     lfcSE      pvalue       padj
         <numeric>      <numeric> <numeric>   <numeric>  <numeric>
region_1  63.44804     0.00805928 0.0788364 3.31818e-01 0.99770039
region_2   9.55499    -0.00183120 0.1008589 8.50857e-01         NA
region_3 110.64886     1.97421221 0.5220874 6.11724e-07 0.00147051
region_4  52.58138     0.00282279 0.0643666 6.94088e-01 0.99770039
region_5  38.47367     0.00579136 0.0769100 4.82171e-01 0.99770039
region_6  47.76244     0.03886980 0.1892013 6.02855e-02 0.90094156

[13]:

outfile="deseq2.result"
colnames(norm_count) = paste(colnames(norm_count),".norm",sep="")
d <- data.frame(countData,norm_count, logFC=res[,"log2FoldChange"], AveExpr=res[,"baseMean"], P.Value=res[,"pvalue"], adj.P.Val=res[,"padj"])
d <- d[order(d[,"P.Value"]),]
write.table(d, file=paste(outfile, ".deseq2_with_lfcShrink.tsv", sep=""), sep="\t", quote=FALSE, row.names=FALSE)