Last updated: 2018-02-01

Code version: 0f846a1


Background

This page lists the gene symbols that correspond to multiple Ensembl gene IDs in the data.

I learned that some regions of the genome show substantial variability in the population and can consequently have multiple representations (sequences) in the reference assembly. The alternative sequences for these regions are known as “alternate loci”.

I got some of this information from this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5155401/. According to the paper, the GRCh38 patch assembly has a total of 178 alternate-locus-containing regions associated with a total of 261 alternate loci.

For example, TAF9 has two alternate loci, which correspond to two different Ensembl IDs. You can see the location of the two on this page: http://useast.ensembl.org/Homo_sapiens/Gene/Alleles?db=core;g=ENSG00000273841;r=5:69364743-69370013 (click on “View alleles of this gene on alternate assemblies”).

Another example is TUBB, which corresponds to 8 Ensembl IDs. It is located in the MHC region, which is represented in the assembly by 8 haplotype sequences (from 8 different cell lines; https://www.ncbi.nlm.nih.gov/grc/human/regions/MHC). So it is no surprise that TUBB has 8 Ensembl IDs.
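
To make the TAF9 and TUBB examples concrete, here is a minimal sketch that counts the Ensembl IDs attached to each symbol. The symbol-to-ID pairs are copied from the table further down this page; the object names `alt_loci` and `ids_per_symbol` are made up for illustration and are not part of the original analysis.

```r
# Symbol-to-ID pairs copied from the table on this page (TAF9 and TUBB only).
# `alt_loci` is an illustrative name, not an object from the original analysis.
alt_loci <- data.frame(
  hgnc = c(rep("TAF9", 2), rep("TUBB", 8)),
  ensembl = c("ENSG00000273841", "ENSG00000276463",
              "ENSG00000232421", "ENSG00000224156", "ENSG00000235067",
              "ENSG00000183311", "ENSG00000229684", "ENSG00000227739",
              "ENSG00000196230", "ENSG00000232575"),
  stringsAsFactors = FALSE
)

# Count the distinct Ensembl IDs attached to each gene symbol.
ids_per_symbol <- tapply(alt_loci$ensembl, alt_loci$hgnc,
                         function(x) length(unique(x)))
ids_per_symbol
```

The same `tapply()` call could be run on the full gene list to see how many IDs each symbol carries.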


Data and packages

Load the cell cycle genes from the Drop-seq paper and list the gene symbols that map to more than one Ensembl ID.

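# Load the cell cycle gene list (columns: hgnc, phase, ensembl)
cellcycle <- readRDS("../data/cellcycle-genes-previous-studies/rds/macosko-2017.rds")
# Drop the ensembl column (column 3) and flag rows that repeat an earlier hgnc/phase pair
dup <- which(duplicated(cellcycle[,-3]))

# Show every row, including the first occurrence, for each duplicated gene symbol
cellcycle[cellcycle$hgnc %in% unique(cellcycle$hgnc[dup]),]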
        hgnc phase         ensembl
81    CCDC84     S ENSG00000280975
82    CCDC84     S ENSG00000186166
114     CDK7  M/G1 ENSG00000134058
115     CDK7  M/G1 ENSG00000277273
132      CFD    G2 ENSG00000197766
133      CFD    G2 ENSG00000274619
219  FAM189B  M/G1 ENSG00000262666
220  FAM189B  M/G1 ENSG00000160767
224     FAN1    G2 ENSG00000198690
225     FAN1    G2 ENSG00000276787
231    FOPNL  M/G1 ENSG00000276914
232    FOPNL  M/G1 ENSG00000133393
289     HRAS  G1/S ENSG00000276536
290     HRAS  G1/S ENSG00000174775
315 KIAA1147  G1/S ENSG00000257093
316 KIAA1147  G1/S ENSG00000262599
335    KIFC1    G2 ENSG00000204197
336    KIFC1    G2 ENSG00000237649
337    KIFC1    G2 ENSG00000056678
338    KIFC1    G2 ENSG00000233450
373     MDC1     M ENSG00000228575
374     MDC1     M ENSG00000137337
375     MDC1     M ENSG00000225589
376     MDC1     M ENSG00000206481
377     MDC1     M ENSG00000224587
378     MDC1     M ENSG00000234012
379     MDC1     M ENSG00000231135
380     MDC1     M ENSG00000237095
397  MRPS18B  M/G1 ENSG00000223775
398  MRPS18B  M/G1 ENSG00000226111
399  MRPS18B  M/G1 ENSG00000229861
400  MRPS18B  M/G1 ENSG00000204568
401  MRPS18B  M/G1 ENSG00000203624
402  MRPS18B  M/G1 ENSG00000233813
403  MRPS18B  M/G1 ENSG00000227420
484  PPP1R10     M ENSG00000238104
485  PPP1R10     M ENSG00000227804
486  PPP1R10     M ENSG00000204569
487  PPP1R10     M ENSG00000230995
488  PPP1R10     M ENSG00000235291
489  PPP1R10     M ENSG00000206489
490  PPP1R10     M ENSG00000231737
558  SMARCB1     M ENSG00000099956
559  SMARCB1     M ENSG00000275837
583    TAF15  G1/S ENSG00000276833
584    TAF15  G1/S ENSG00000270647
585     TAF9  M/G1 ENSG00000273841
586     TAF9  M/G1 ENSG00000276463
624     TUBB    G2 ENSG00000232421
625     TUBB    G2 ENSG00000224156
626     TUBB    G2 ENSG00000235067
627     TUBB    G2 ENSG00000183311
628     TUBB    G2 ENSG00000229684
629     TUBB    G2 ENSG00000227739
630     TUBB    G2 ENSG00000196230
631     TUBB    G2 ENSG00000232575
646     UBR7  G1/S ENSG00000012963
647     UBR7  G1/S ENSG00000278787
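
The lookup above goes through `%in%` rather than indexing with `dup` directly because `duplicated()` flags only the second and later occurrences of a row. The sketch below shows the difference on a small toy table. The CDK7 and HRAS rows are copied from the output above; the `GENE_X` row, including its phase and ID, is a made-up placeholder for a symbol with a single Ensembl ID, and `toy`, `repeats_only` and `all_rows` are illustrative names.

```r
# Toy table: two duplicated symbols from the output above plus one
# hypothetical single-ID symbol (GENE_X and its ID are placeholders).
toy <- data.frame(
  hgnc = c("CDK7", "CDK7", "HRAS", "HRAS", "GENE_X"),
  phase = c("M/G1", "M/G1", "G1/S", "G1/S", "S"),
  ensembl = c("ENSG00000134058", "ENSG00000277273",
              "ENSG00000276536", "ENSG00000174775",
              "ENSG_PLACEHOLDER"),
  stringsAsFactors = FALSE
)

# duplicated() marks a row only if an identical hgnc/phase pair appeared earlier,
# so the first CDK7 row and the first HRAS row are not flagged.
dup <- which(duplicated(toy[, -3]))

# Indexing with dup alone returns only the repeated rows.
repeats_only <- toy[dup, ]

# Matching on the symbol returns every row for each duplicated symbol.
all_rows <- toy[toy$hgnc %in% unique(toy$hgnc[dup]), ]

nrow(repeats_only)
nrow(all_rows)
```

Indexing with `dup` alone would therefore hide one Ensembl ID per symbol, which is why the full listing matches on `hgnc` instead.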

Session information

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS: /home/joycehsiao/miniconda3/envs/fucci-seq/lib/R/lib/libRblas.so
LAPACK: /home/joycehsiao/miniconda3/envs/fucci-seq/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_3.4.1  backports_1.0.5 magrittr_1.5    rprojroot_1.2  
 [5] tools_3.4.1     htmltools_0.3.6 yaml_2.1.16     Rcpp_0.12.14   
 [9] stringi_1.1.2   rmarkdown_1.8   knitr_1.17      git2r_0.19.0   
[13] stringr_1.2.0   digest_0.6.12   evaluate_0.10.1

This R Markdown site was created with workflowr