CUT & RUN calibration pipeline¶

usage: cut_run_calibration.py [-h] [-j JID] -f FASTQ_TSV -d PEAKCALL_TSV
                              [--scale_multiplier SCALE_MULTIPLIER]
                              [--Ecoli_index_file ECOLI_INDEX_FILE]
                              [-g GENOME] [--genome_fasta GENOME_FASTA]
                              [-b BLACK_LIST] [-s CHROM_SIZE] [-i INDEX_FILE]

optional arguments:
  -h, --help            show this help message and exit
  -j JID, --jid JID     enter a job ID, which is used to make a new directory.
                        Every output will be moved into this folder. (default:
                        cut_run_calibration_yli11_2021-06-23)
  -f FASTQ_TSV, --fastq_tsv FASTQ_TSV
                        TSV file, 3 columns, peak, bam, output, need absolute
                        path to file (default: None)
  -d PEAKCALL_TSV, --peakcall_tsv PEAKCALL_TSV
                        TSV file, 3 columns, peak, bam, output, need absolute
                        path to file (default: None)
  --scale_multiplier SCALE_MULTIPLIER
  --Ecoli_index_file ECOLI_INDEX_FILE

Genome Info:
  -g GENOME, --genome GENOME
                        genome version: hg19, mm10, mm9 (default: hg19)
  --genome_fasta GENOME_FASTA
                        genome version: hs, mm (default:
                        /home/yli11/Data/Human/hg19/fasta/hg19.fa)
  -b BLACK_LIST, --black_list BLACK_LIST
                        Blacklist file (default: /home/yli11/Data/Human/hg19/a
                        nnotations/hg19.blacklist.bed)
  -s CHROM_SIZE, --chrom_size CHROM_SIZE
                        chrome size (default: /home/yli11/Data/Human/hg19/anno
                        tations/hg19.chrom.sizes)
  -i INDEX_FILE, --index_file INDEX_FILE
                        chrome size (default: /home/yli11/Data/Human/hg19/inde
                        x/bwa_16a_index/hg19.fa)

Summary¶

CUN-RUN pipeline with calibrated bw files based on E.coli reads, see https://github.com/Henikoff/Cut-and-Run/blob/master/spike_in_calibration.csh.

Duplicated reads were not removed. Multi-mapped reads can be removed, so two mapped bam files were generated, st.bam is all properly paired reads. .uq.st.bam is all properly paired and uniquely mapped reads.

Bw files, peak files, and motif discovery for top200 peaks were generated for both st.bam and .uq.st.bam files.

Usage¶

Step 0: LOGIN to a compute node

hpcf_interactive

Go to your data directory and type the following.

Step 1: Prepare input files, generate fastq.tsv.

module load python/2.7.13

HemTools cut_run --guess_input

    Input fastq files preparation complete! ALL GOOD!
    Please check if you like the computer-generated labels in : fastq.tsv
    Input peakcall file preparation complete! File name: peakcall.tsv

Note

If you are preparing fastq.tsv and peakcall.tsv yourself, please make sure no space anywhere in the file. Note that the seperator is tab. Spaces in file name will cause errors.

Step 2: Check the computer-generated input list (manually), make sure they are correct.

more fastq.tsv

more peakcall.tsv

Note

a random string will be added to the generated files (e.g., fastq.94c049cbff1f.tsv) if they exist before running step 1.

Step 3: Submit your job.

cut_run_calibration.py -f fastq.tsv -d peakcall.tsv

Sample input format¶

fastq.tsv

This is a tab-seperated-value format file. The 3 columns are: Read 1, Read 2, sample ID.

peakcall.tsv

This is also a tab-seperated-value format file. The 3 columns are: treatment sample ID, control/input sample ID, peakcall ID.

Output¶

See each folder in the {jid} folder for results. For example, calibrated bw files are located in bw_files.

Report bug¶

Once the job is finished, you will be notified by email with some attachments. If no attachment can be found, it might be caused by an error. In such case, please go to the result directory (where the log_files folder is located) and type:

$ HemTools report_bug

Tags¶

tags:: cut_run

Comments¶

code @ github.