Plot correlation scatter plots¶

usage: scatter_density.py [-h] -f F [-s S] -x X -y Y [--index INDEX]
                          [--regression] [--diagnal_line] [--lowess]
                          [-bc BACKGROUND_DOTS_COLOR] [-hc HIGHLIGHT_COLOR]
                          [--highlight HIGHLIGHT] [-o OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  -f F                  data table (default: None)
  -s S                  sep (default: )
  -x X                  column name for x axis (default: None)
  -y Y                  column name for y axis (default: None)
  --index INDEX         index name for index (default: None)
  --regression          by default it is a dignal line (default: False)
  --diagnal_line        force to draw a dignal line (default: False)
  --lowess              fit a curve (default: False)
  -bc BACKGROUND_DOTS_COLOR, --background_dots_color BACKGROUND_DOTS_COLOR
                        background_dots_color (default: #0000ff)
  -hc HIGHLIGHT_COLOR, --highlight_color HIGHLIGHT_COLOR
                        highlight_color (default: #ff1500)
  --highlight HIGHLIGHT
                        column name for y axis, sep by comma (default: None)
  -o OUTPUT, --output OUTPUT
                        output file name (default:
                        yli11_2020-09-22.correlation_scatter_density.png)

Summary¶

This script can be used to calculate sample correlation. Values are log-transformed, e.g., log2(x+1). Addtionally, you can also highlight some points, for example, to show some differences.

Input¶

Input can be tsv or csv. For tsv use -s "\t", for csv use -s ,

Geneid  Chr     Start   End     Strand  Length  Banana  Orange
asd1    chr1    3513707 3514076 +       370     800     22
b       chr1    3538168 3538438 +       271     24      16
a       chr1    3970540 3970785 +       246     16      6
b       chr1    4059120 4059436 +       317     44      12
a       chr1    4388977 4389294 +       318     22      11
b       chr1    4561768 4562101 +       334     31      11
a       chr1    4760133 4760340 +       208     23      9
b       chr1    5073062 5073299 +       238     36      20

The above example is a read count distribution for two chip-seq replicates.

The aim is to see the correlation between Banana and Orange, use -x Banana -y Orange to plot Banana column as the X-axis and Orange column as the Y-axis.

Usage¶

Sample correlation usage¶

hpcf_interactive -q standard -R "rusage[mem=10000]"

module load conda3

source activate /home/yli11/.conda/envs/py2/

    scatter_density.py -f input.tsv -s "\t" -x Banana -y Orange

Point highlight usage¶

hpcf_interactive -q standard -R "rusage[mem=10000]"

module load conda3

source activate /home/yli11/.conda/envs/py2/

    scatter_density.py -f input.tsv -s "\t" -x Banana -y Orange --index Geneid --highlight asd1 --regression

--regression is to add regression line. For sample correlation plot, regression line is on, for differential plot, the default is off.

If you have multiple points to highlight, use --highlight asd1,another_name,another_name2. These names should match the index name, which is defined using --index Geneid

Output¶

sample correlation¶

The value shown on the upper left corner is pearson correlation coefficicient.

../../_images/scatter_density_example1.png

differential analysis highlight¶

The value shown on the upper left corner is pearson correlation coefficicient.

../../_images/scatter_density_example2.png

code @ github.