Help

This is the help page of LongGeneDB.

1. Which species are included in LongGeneDB?

LongGeneDB contains information on the 15 species listed below.
The scientific classification of each species is given in the order Kingdom, Phylum, Class, Order, Family, Genus, Species.

Homo sapiens: Animalia, Chordata, Mammalia, Primates, Hominidae, Homo, H. sapiens.
Pan troglodytes: Animalia, Chordata, Mammalia, Primates, Hominidae, Pan, P. troglodytes.
Callithrix jacchus: Animalia, Chordata, Mammalia, Primates, Callitrichidae, Callithrix, C. jacchus.
Rattus norvegicus: Animalia, Chordata, Mammalia, Rodentia, Muridae, Rattus, R. norvegicus.
Mus musculus: Animalia, Chordata, Mammalia, Rodentia, Muridae, Mus, M. musculus.
Oryctolagus cuniculus: Animalia, Chordata, Mammalia, Lagomorpha, Leporidae, Oryctolagus, O. cuniculus.
Canis lupus familiaris: Animalia, Chordata, Mammalia, Carnivora, Canidae, Canis, C. lupus.
Felis catus: Animalia, Chordata, Mammalia, Carnivora, Felidae, Felis, F. catus.
Sus scrofa: Animalia, Chordata, Mammalia, Artiodactyla, Suidae, Sus, S. scrofa.
Taeniopygia guttata: Animalia, Chordata, Aves, Passeriformes, Estrildidae, Taeniopygia, T. guttata.
Gallus gallus: Animalia, Chordata, Aves, Galliformes, Phasianidae, Gallus, G. gallus.
Xenopus tropicalis: Animalia, Chordata, Amphibia, Anura, Pipidae, Xenopus, X. tropicalis.
Danio rerio: Animalia, Chordata, Actinopterygii, Cypriniformes, Cyprinidae, Danio, D. rerio.
Drosophila melanogaster: Animalia, Arthropoda, Insecta, Diptera, Drosophilidae, Drosophila, D. melanogaster.
Caenorhabditis elegans: Animalia, Nematoda, Chromadorea, Rhabditida, Rhabditidae, Caenorhabditis, C. elegans.

2. Which long genes are included in the LongGeneDB Mouse omics?

Protein-coding genes longer than 200 kb in the mouse genome, together with their orthologous genes in the 15 species, are included in this database. The gene annotation GTF files were downloaded from Ensembl release 93 for mouse and release 100 for the other species.
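The length filter described above can be sketched as follows. This is a minimal illustration, not the script used to build the database. It assumes that gene length means the genomic span of the "gene" record in the GTF file and that the biotype is stored in the gene_biotype attribute; the function names are hypothetical.

```python
MIN_LENGTH = 200_000  # 200 kb threshold stated in the text


def parse_attributes(field):
    """Turn a GTF attribute string into a dict (simple parser for a sketch;
    it does not handle semicolons inside quoted values)."""
    attrs = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip('"')
    return attrs


def long_protein_coding_genes(gtf_lines, min_length=MIN_LENGTH):
    """Yield (gene_id, gene_name, length) for protein-coding genes
    whose genomic span is longer than min_length."""
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # keep only gene-level records
        attrs = parse_attributes(cols[8])
        if attrs.get("gene_biotype") != "protein_coding":
            continue
        # GTF coordinates are 1-based and inclusive.
        length = int(cols[4]) - int(cols[3]) + 1
        if length > min_length:
            yield attrs.get("gene_id"), attrs.get("gene_name"), length
```

The function accepts any iterable of lines, so an open file handle for a decompressed GTF file can be passed directly.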

3. What is the pipeline to obtain the orthologous genes of mouse long genes?

The longest protein isoform of each mouse long gene was extracted from Ensembl release 93. Blastp was used to align these mouse protein sequences against the protein sequences of the 15 species downloaded from Ensembl release 100. In each species, the genes with the top alignment scores were considered the orthologous genes. We also manually refined the orthologous gene list for every long gene using the BLAST web tool in Ensembl.
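Selecting the top-scoring hit per query can be sketched as below. This is a hedged illustration only: it assumes that blastp was run with tabular output (-outfmt 6) using the default 12 columns, where the bit score is the last column, and that "top alignment score" means the highest bit score. Neither assumption is confirmed by the pipeline description, and the function name is hypothetical.

```python
def best_hits(blast_lines):
    """Return {query_id: (subject_id, bitscore)} keeping the highest-scoring hit.

    Expects blastp tabular output with the default 12 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        cols = line.rstrip("\n").split("\t")
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        # Keep the first hit seen on ties; replace only on a strictly better score.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best
```

The result maps protein identifiers to protein identifiers; converting these to gene identifiers would need a separate protein-to-gene lookup, which is not shown here.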

4. What is the pipeline to analyze mRNA-seq data?

The raw mRNA-seq data, generated from 8-week-old mice by the Thomas Gingeras laboratory of the ENCODE project (Yue et al., 2014), were obtained from the NCBI SRA database. SRA files were converted to FASTQ files with fastq-dump from the SRA Toolkit. The FASTQ files were aligned to the mouse mm10 genome with STAR using the parameters “--runThreadN 13 --outFilterMultimapNmax 1 --outFilterMismatchNmax 3”. The number of read pairs mapped to the exonic regions of each gene was counted with an in-house Perl script. Raw read counts were normalized to exon length and sequencing depth to obtain FPKM values.
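The FPKM normalization can be illustrated with the standard formula (fragments per kilobase of exon per million mapped fragments). This is a sketch, not the in-house Perl script. It assumes that sequencing depth means the total number of mapped read pairs in the sample, and the function name is hypothetical.

```python
def fpkm(fragment_count, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments.

    fragment_count         -- read pairs mapped to the exons of one gene
    exon_length_bp         -- total exonic length of that gene, in bp
    total_mapped_fragments -- mapped read pairs in the whole sample (assumed
                              definition of sequencing depth)
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("exon length and total fragments must be positive")
    # 1e9 = 1e3 (bp -> kb) * 1e6 (fragments -> millions of fragments)
    return fragment_count * 1e9 / (exon_length_bp * total_mapped_fragments)
```

For example, a gene with 500 read pairs, 2,000 bp of exons, and 20 million mapped read pairs in the sample has an FPKM of 12.5.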

5. What is the pipeline to analyze ChIP-seq data?

The raw ChIP-seq data, generated from 8-week-old mice by the Bing Ren laboratory of the ENCODE project (Yue et al., 2014), were obtained from the EBI ENA database. SRA files were converted to FASTQ files with fastq-dump from the SRA Toolkit. The FASTQ files were aligned to the mouse mm10 genome with Bowtie using the parameters “-v 2 -m 1 -p 20”. The SAM files were converted to BAM files with samtools, and the BAM files were then converted to bedGraph files with bamCoverage using the parameters “-of bedgraph --binSize 10 -p 20”.
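To make the bin size of 10 concrete, the sketch below shows how read intervals can be turned into fixed-width bedGraph rows. It is a simplified illustration of binning, not bamCoverage's actual implementation. The function name is hypothetical, and the choice to count every read overlapping a bin and to merge adjacent bins with equal values is an assumption.

```python
def binned_bedgraph(chrom, reads, chrom_length, bin_size=10):
    """Return bedGraph rows (chrom, start, end, value) from read intervals.

    reads -- iterable of (start, end) intervals, 0-based and half-open.
    Each bin's value is the number of reads overlapping that bin.
    """
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for start, end in reads:
        first = start // bin_size
        last = (end - 1) // bin_size
        for b in range(first, min(last, n_bins - 1) + 1):
            counts[b] += 1

    # Merge runs of adjacent bins that share the same value into one row.
    rows = []
    run_start = 0
    for b in range(1, n_bins + 1):
        if b == n_bins or counts[b] != counts[run_start]:
            end_pos = min(b * bin_size, chrom_length)
            rows.append((chrom, run_start * bin_size, end_pos, counts[run_start]))
            run_start = b
    return rows
```

With two reads covering positions 0-15 and 5-25 on a 40 bp sequence, the first two bins each overlap both reads, the third overlaps one, and the fourth overlaps none, giving three rows after merging.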

6. What is the pipeline to analyze Hi-C data?

The raw Hi-C data (SRX150196 and SRX150197), generated from 8-week-old mice by the Bing Ren laboratory of the ENCODE project (Shen et al., 2012), were obtained from the EBI ENA database. SRA files were converted to FASTQ files with fastq-dump from the SRA Toolkit. The FASTQ files were trimmed with homerTools using the parameters “trim -3 AAGCTAGCTT -mis 0 -matchStart 20 -min 20”. Trimmed FASTQ files were aligned to the mm10 genome with Bowtie. Paired-end tag directories were created with HOMER makeTagDirectory using the parameters “-tbp 1 -genome mm10 -restrictionSite AAGCTT -both -removePEbg -removeSelfLigation -removeSpikes 10000 5”. Normalized Hi-C interaction matrices were created with analyzeHiC using the parameters “-res 2000 -window 10000 -balance -cpu 40 -corr”.

7. What is the pipeline to analyze single-cell RNA-seq data?

The raw single-nucleus RNA-seq data (SRR6269025 and SRR6269027), generated from 8-week-old mice by the Hao Wu laboratory (Hu et al., 2017), were obtained from the NCBI SRA database. SRA files were converted to FASTQ files with fastq-dump from the SRA Toolkit. Dropseq-tools V2 was used to process the FASTQ files into Digital Gene Expression files. Seurat 2.3.4 was used to further analyze the Digital Gene Expression data.

8. What are the accession numbers of the raw sequencing data?

ChIP-seq
Bone marrow: H3K4me1 (SRR317253 SRR317254), H3K27ac (SRR566857 SRR566858), H3K4me3 (SRR317247 SRR317248)
Brown adipose tissue: H3K27ac (SRR566783 SRR566784), H3K4me1 (SRR566793 SRR566794), H3K4me3 (SRR566791 SRR566792)
Cerebellum: H3K4me3 (SRR317259 SRR317260), H3K4me1 (SRR317241 SRR317242), H3K27ac (SRR566835 SRR566836)
Cortex: H3K4me1 (SRR317249 SRR317250), H3K27ac (SRR566841 SRR566842), H3K4me3 (SRR317257 SRR317258)
Heart: H3K27ac (SRR566827 SRR566828), H3K4me1 (SRR317255 SRR317256), H3K4me3 (SRR317239 SRR317240)
Kidney: H3K27ac (SRR566825 SRR566826), H3K4me1 (SRR317251 SRR317252), H3K4me3 (SRR317237 SRR317238)
Liver: H3K27ac (SRR566921 SRR566922), H3K4me1 (SRR317235 SRR317236), H3K4me3 (SRR317233 SRR317234)
Lung: H3K4me1 (SRR317231 SRR317232), H3K4me3 (SRR317229 SRR317230), H3K27ac (SRR392337)
Olfactory bulb: H3K4me3 (SRR566897 SRR566898), H3K27ac (SRR566851 SRR566852), H3K4me1 (SRR566849 SRR566850)
Placenta: H3K4me3 (SRR566905 SRR566906), H3K27ac (SRR566909 SRR566910), H3K4me1 (SRR566907 SRR566908)
Small intestine: H3K4me3 (SRR566807 SRR566808), H3K4me1 (SRR566805 SRR566806), H3K27ac (SRR566809 SRR566810)
Spleen: H3K27ac (SRR566917 SRR566918), H3K4me1 (SRR578316 SRR578317), H3K4me3 (SRR578318 SRR578319)
Testis: H3K4me3 (SRR566799 SRR566800), H3K27ac (SRR566803 SRR566804), H3K4me1 (SRR566797 SRR566798)
Thymus: H3K4me3 (SRR566843 SRR566844), H3K4me1 (SRR566845 SRR566846), H3K27ac (SRR566847 SRR566848)

RNA-seq
Frontal cortex (GSE90206)
Cortex (GSE90205)
Urinary bladder (GSE90204)
Placenta (GSE90203)
Subcutaneous adipose tissue (GSE90193)
Stomach (GSE90192)
Small intestine (GSE90191)
Ovary (GSE90190)
Mammary gland (GSE90189)
Large intestine (GSE90188)
Cerebellum (GSE90200)
Kidney (GSE90179)
Liver (GSE90180)
Colon (GSE90177)
Heart (GSE90178)
Thymus (GSE90183)
Testis (GSE90184)
Lung (GSE90181)
Spleen (GSE90182)
Gonadal fat pad (GSE90187)
Adrenal gland (GSE90185)
Duodenum (GSE90186)

Hi-C: GSM938750 GSM938751

Single-cell RNA-seq: SRR6269025 SRR6269026 SRR6269027 SRR6269028