Processed gene info data in HemTools

Summary

Both the kallisto and STAR indices were built from Ensembl data. The mm10 and hg38 kallisto indices were downloaded directly from the kallisto GitHub repo.

For sleuth, we need a t2g (transcript-to-gene) table containing the transcript ID, gene ID, and gene name.

For converting between gene ID types (Entrez ID, Ensembl ID, HGNC ID, and others), we provide a separate table.

Note that every conversion can lose some genes, because not every ID has a counterpart in the target ID system.
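To make the loss during conversion visible, it can help to report the IDs that have no counterpart instead of dropping them silently. The following is a minimal sketch, not HemTools code: the function name convert_ids and the placeholder IDs are made up for illustration, and the mapping is assumed to be a simple dictionary loaded from a conversion table.

```python
def convert_ids(ids, mapping):
    """Convert IDs with a lookup table and keep track of the ones that are lost.

    Returns (converted, lost): a dict of source ID -> target ID, and a list
    of source IDs that have no entry in the mapping.
    """
    converted = {}
    lost = []
    for gene_id in ids:
        target = mapping.get(gene_id)
        if target is None:
            lost.append(gene_id)  # no counterpart in the target ID system
        else:
            converted[gene_id] = target
    return converted, lost


# Placeholder IDs for illustration only; these are not real gene identifiers.
example_mapping = {"ENSG_A": "1001", "ENSG_B": "1002"}
converted, lost = convert_ids(["ENSG_A", "ENSG_B", "ENSG_C"], example_mapping)
print(f"{len(converted)} converted, {len(lost)} lost: {lost}")
```

Reporting the lost IDs after each conversion step makes it easier to see how many genes drop out when several conversions are chained.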

We use the Ensembl transcript ID as the primary key.

Kallisto/Sleuth and STAR

We provide 3 tables:

1. [genome].ensembl_[version].txt (e.g., mm9.ensembl_v67.txt)

Genomic locations were obtained from the cdna.all FASTA files on the Ensembl FTP site, e.g., ftp://ftp.ensembl.org/pub/release-67/fasta/mus_musculus/cdna/Mus_musculus.NCBIM37.67.cdna.all.fa.gz.

At the time of writing, the current Ensembl version is v96. The highest version available for hg19 is v75, and for mm9 it is v67.

ENSMUST00000176802      ENSMUSG00000067978      Vmn2r-ps113     chr17   18186448        18211184        1
ENSMUST00000097386      ENSMUSG00000067978      Vmn2r-ps113     chr17   18201001        18211184        1
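A minimal sketch of loading this table, keyed by Ensembl transcript ID. It assumes the file is tab-delimited with no header line, and that the seven columns are transcript ID, gene ID, gene name, chromosome, start, end, and strand, as the example rows suggest. The names TranscriptRecord and read_gene_info are hypothetical and not part of HemTools.

```python
import csv
import io
from typing import NamedTuple


class TranscriptRecord(NamedTuple):
    # Column meanings are inferred from the example rows (an assumption).
    transcript_id: str
    gene_id: str
    gene_name: str
    chrom: str
    start: int
    end: int
    strand: str


def read_gene_info(handle):
    """Read a 7-column tab-delimited table into a dict keyed by transcript ID."""
    records = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 7:
            continue  # skip blank or malformed lines
        tid, gid, name, chrom, start, end, strand = row
        records[tid] = TranscriptRecord(
            tid, gid, name, chrom, int(start), int(end), strand
        )
    return records


# The two example rows from above, written with explicit tabs.
sample = (
    "ENSMUST00000176802\tENSMUSG00000067978\tVmn2r-ps113\tchr17\t18186448\t18211184\t1\n"
    "ENSMUST00000097386\tENSMUSG00000067978\tVmn2r-ps113\tchr17\t18201001\t18211184\t1\n"
)
records = read_gene_info(io.StringIO(sample))
print(records["ENSMUST00000097386"].gene_name)
```

For a real file, open it with open(path, newline="") and pass the file object in place of the StringIO sample.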

2. [genome].ensembl_[version].bed (e.g., mm9.ensembl_v67.bed)

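This is a six-column BED (BED6) file. As the example shows, the fourth column is a placeholder (.) and the transcript ID is in the fifth column. Strand is denoted as 1 or -1, except in the human tables, where it is + or -.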

chr17   18186448        18211184        .       ENSMUST00000176802      1
chr17   18201001        18211184        .       ENSMUST00000097386      1
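Because the strand notation differs between tables, tools that expect + or - may need the 1/-1 values converted first. The following is a minimal sketch under the assumption that the file is tab-delimited and that strand is the sixth column; normalize_strand and normalize_bed_line are hypothetical helper names, not HemTools functions.

```python
# Accept both notations so the same helper works on either kind of table.
STRAND_MAP = {"1": "+", "-1": "-", "+": "+", "-": "-"}


def normalize_strand(value):
    """Map 1/-1 (or +/-) to the +/- notation."""
    key = value.strip()
    if key not in STRAND_MAP:
        raise ValueError(f"unrecognized strand value: {value!r}")
    return STRAND_MAP[key]


def normalize_bed_line(line):
    """Return a tab-delimited BED6 line with its strand column as + or -."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 columns, got {len(fields)}")
    fields[5] = normalize_strand(fields[5])
    return "\t".join(fields)


# First example row from above, written with explicit tabs.
line = "chr17\t18186448\t18211184\t.\tENSMUST00000176802\t1"
print(normalize_bed_line(line))
```

Lines that already use + or - pass through unchanged, so the helper can be applied to both kinds of table.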

3. [genome].ensembl_[version].t2g (e.g., hg19.ensembl_v75.t2g)

This is for Sleuth.

target_id          ens_gene           ext_gene
ENST00000387314    ENSG00000210049    MT-TF
ENST00000389680    ENSG00000211459    MT-RNR1
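Outside of sleuth, the same table can serve as a transcript-to-gene lookup. Here is a minimal sketch, assuming the file is tab-delimited and starts with the header line shown above; read_t2g is a hypothetical helper name, not a HemTools or sleuth function.

```python
import csv
import io


def read_t2g(handle):
    """Read a t2g table into a dict: target_id -> (ens_gene, ext_gene)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["target_id"]: (row["ens_gene"], row["ext_gene"])
        for row in reader
    }


# The example rows from above, written with explicit tabs.
sample = (
    "target_id\tens_gene\text_gene\n"
    "ENST00000387314\tENSG00000210049\tMT-TF\n"
    "ENST00000389680\tENSG00000211459\tMT-RNR1\n"
)
t2g = read_t2g(io.StringIO(sample))
gene_id, gene_name = t2g["ENST00000387314"]
print(gene_id, gene_name)
```

Keying the dictionary on target_id follows the choice of the Ensembl transcript ID as the primary key.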

The code is available on GitHub.