Data table operations¶
Subset tables or remove rows with all zeros¶
usage: dataframe.py [-h] [--remove_zero] [--row] [--col] [--bed] [--ascending]
[--sort_by SORT_BY]
[--remove_duplicates REMOVE_DUPLICATES] [--index]
[-f INPUT] [--merge MERGE] [--subset SUBSET] [-s SEP]
[-o OUTPUT] [--header]
optional arguments:
-h, --help show this help message and exit
--remove_zero remove all rows or cols that are zero (default: False)
--row results on rows. This is to the opposite of pandas row
and col, which is operations on rows or cols. For
example, operations (e.g., find zeros) on cols will
result in removing rows. (default: False)
--col results on cols (default: False)
--bed input is bed file (default: False)
--ascending input is bed file (default: False)
--sort_by SORT_BY sort input by which columns (default: None)
--remove_duplicates REMOVE_DUPLICATES
remove duplicates on a specific column (default: None)
--index index is false (default: False)
-f INPUT, --input INPUT
data table input (default: None)
--merge MERGE merge with a data frame (default: None)
--subset SUBSET subset a data frame with a list (default: None)
-s SEP, --sep SEP separator (default: )
-o OUTPUT, --output OUTPUT
output table name (default:
yli11_2020-05-01_4434d3c4a5b5.csv)
--header input table has header (default: False)
Example: subset csv given a list
==> df1.txt <==
x,sd1,sd2
a,454,23
b,5.3,sd2
==> list.txt <==
sd2
dataframe.py -f df1.txt -s "," --subset list.txt -o output.csv
If your input is bed file, use -s "\t"
==> output.csv <==
x,sd1,sd2
b,5.3,sd2
Example: remove all zeros
==> df1.txt <==
x,sd1,sd2
1,4,5
0,0,0
dataframe.py -f df1.txt -s "," --remove_zero -o output.csv --header
If your input is bed file, use -s "\t"
==> output.csv <==
x,sd1,sd2
1,4,5
Example: remove duplicated rows
==> df1.txt <==
chr1,4,7,ACG,.,-
chr1,9,14,ACG,.,-
chr1,2,5,ACG,.,+
We only want to have unique sequences, sort the bed file and keep the first one.
dataframe.py -f df1.txt -s , --bed --sort_by 0,1 --remove_duplicates 3 -o output.bed
If your input is bed file, use -s "\t"
==> output.bed <==
chr1 2 5 ACG . +
Merge tables by row names (Any number)¶
usage: dataframe_merge.py [-h] [-s SEP] [--index_col INDEX_COL] [--glob GLOB]
[--header_list HEADER_LIST] [--drop DROP]
[--name_col_with_filename NAME_COL_WITH_FILENAME]
[--rename_col_with_filename] [--intersection]
[-o OUTPUT]
file [file ...]
merge input dataframes using row index. Assume input tables contain both row
names and column names.
positional arguments:
file
optional arguments:
-h, --help show this help message and exit
-s SEP, --sep SEP this program can infer separator automatically, but it
may fail. Use auto if the input tables contain
different separators. (default: auto)
--index_col INDEX_COL
which col to use as index (default: 0)
--glob GLOB glob the current dir with file name match to given
string (default: None)
--header_list HEADER_LIST
sep by , define your own colum names (default: None)
--drop DROP try drop this column(s), seperated by , (default:
None)
--name_col_with_filename NAME_COL_WITH_FILENAME
--rename_col_with_filename
--intersection merge dataframes only on overlapping row names
(default: False)
-o OUTPUT, --output OUTPUT
output table name (default:
yli11_2019-10-08_c88dbe184e44.tsv)
Note
This program assumes input tables contain both row names and column names.
Example: Merge one csv and one tsv
==> df1.txt <==
x,sd1,sd2
a,454,23
b,5.3,4.5
==> df2.txt <==
x ff4 yy6
a 34 77
c 33 23
The program can guess the file format (only for csv or tsv). If you only need the overlapping rows, use --intersection
dataframe_merge.py df1.txt df2.txt --intersection
df1.txt shape: 2 X 2
df2.txt shape: 2 X 2
Merged table shape: 1 X 4
Output to table: yli11_2019-08-05_c060b1dec3db.tsv
==> yli11_2019-08-05_c060b1dec3db.tsv <==
x sd1 sd2 ff4 yy6
a 454.0 23.0 34 77