Description

The UCSC Genes track is a set of gene predictions based on data from RefSeq, GenBank, CCDS, UniProt, Rfam, and the tRNA Genes track. The track includes both protein-coding genes and non-coding RNA genes. Both types of genes can produce non-coding transcripts, but non-coding RNA genes do not produce protein-coding transcripts. This is a moderately conservative set of predictions. Transcripts of protein-coding genes require the support of one RefSeq RNA, or one GenBank RNA sequence plus at least one additional line of evidence. Transcripts of non-coding RNA genes require the support of one Rfam or tRNA prediction. Compared to RefSeq, this gene set has about 10% more protein-coding genes, approximately four times as many putative non-coding genes, and about twice as many splice variants.
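The support requirements above can be read as a simple decision rule. The following is a minimal sketch of that rule only, not the pipeline's actual code; the class `TranscriptEvidence`, its fields, and the function `is_supported` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class TranscriptEvidence:
    """Hypothetical summary of the evidence behind one candidate transcript."""
    refseq_rnas: int = 0        # number of supporting RefSeq RNAs
    genbank_rnas: int = 0       # number of supporting GenBank RNAs
    other_lines: int = 0        # additional independent lines of evidence
    rfam_or_trna: bool = False  # backed by an Rfam or tRNA prediction


def is_supported(ev: TranscriptEvidence, noncoding_rna_gene: bool) -> bool:
    """Apply the support rule described in the text above."""
    if noncoding_rna_gene:
        # Non-coding RNA genes need one Rfam or tRNA prediction.
        return ev.rfam_or_trna
    if ev.refseq_rnas >= 1:
        # One RefSeq RNA is sufficient on its own.
        return True
    # Otherwise: one GenBank RNA plus at least one more line of evidence.
    return ev.genbank_rnas >= 1 and ev.other_lines >= 1
```

Under this reading, a transcript backed by a single GenBank RNA and nothing else would not be included, while the same transcript with one extra line of evidence would be.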

Display Conventions and Configuration

This track in general follows the display conventions for gene prediction tracks. The exons for putative noncoding genes and untranslated regions are represented by relatively thin blocks, while those for coding open reading frames are thicker. The following color key is used:

This track contains an optional codon coloring feature that allows users to quickly validate and compare gene predictions. To display codon colors, select the genomic codons option from the Color track by codons pull-down menu. Click here for more information about this feature.
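To make the thin and thick block convention concrete, the sketch below splits one exon into untranslated (thin) and coding (thick) pieces. It assumes zero-based, half-open coordinates and treats a transcript whose coding start is not less than its coding end as non-coding; the function name `split_exon` and these conventions are assumptions for illustration, not a description of the browser's drawing code.

```python
def split_exon(exon_start, exon_end, cds_start, cds_end):
    """Split one exon into (kind, start, end) pieces, half-open coordinates.

    kind is "thick" for the part inside the coding region and "thin"
    for untranslated or non-coding parts.
    """
    if cds_start >= cds_end:
        # Assumed convention: an empty coding region marks a non-coding transcript.
        return [("thin", exon_start, exon_end)]
    thick_start = max(exon_start, cds_start)
    thick_end = min(exon_end, cds_end)
    if thick_start >= thick_end:
        # The exon lies entirely outside the coding region.
        return [("thin", exon_start, exon_end)]
    pieces = []
    if exon_start < thick_start:
        pieces.append(("thin", exon_start, thick_start))
    pieces.append(("thick", thick_start, thick_end))
    if thick_end < exon_end:
        pieces.append(("thin", thick_end, exon_end))
    return pieces
```

For example, an exon spanning 100 to 200 in a transcript whose coding region starts at 150 would yield a thin piece from 100 to 150 followed by a thick piece from 150 to 200.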

Methods

The UCSC Genes are built using a multi-step pipeline:

  1. RefSeq and GenBank RNAs are aligned to the genome with BLAT, keeping only the best alignments for each RNA and discarding alignments of less than 98% identity.
  2. Alignments are broken up at non-intronic gaps, with small isolated fragments thrown out.
  3. Alignments are merged in from the hg19 tRNA track.
  4. Alignments are also merged in for perfect alignments of sequences predicted as human noncoding genes by Rfam, in regions that are syntenic with the mm9 mouse genome. Perfect alignments to regions that are not syntenic with mm9 are excluded because these regions are enriched for pseudogenes.
  5. A splicing graph is created for each set of overlapping alignments. This graph has an edge for each exon or intron, and a vertex for each splice site, start, and end. Each RNA that contributes to an edge is kept as evidence for that edge. Gene models from the Consensus CDS project (CCDS) are also added to the graph.
  6. A similar splicing graph is created in the mouse, based on mouse RNA and ESTs. If the mouse graph has an edge that is orthologous to an edge in the human graph, that is added to the evidence for the human edge.
  7. If an edge in the splicing graph is supported by two or more human ESTs, the EST support is added as evidence for that edge.
  8. If there is an Exoniphy prediction for an exon, that is added as evidence.
  9. The graph is traversed to generate all unique transcripts. The traversal is guided by the initial RNAs to avoid a combinatorial explosion in alternative splicing. All RefSeq transcripts are output. For other multi-exon transcripts to be output, an edge supported by at least one additional line of evidence beyond the RNA is required. Single-exon genes require either two RNAs or two additional lines of evidence beyond the single RNA.
  10. Protein predictions are generated. For non-RefSeq transcripts we use the txCdsPredict program to determine if the transcript is protein-coding and if so, the locations of the start and stop codons. The program weighs as positive evidence the length of the protein, the presence of a Kozak consensus sequence at the start codon, and the length of the orthologous predicted protein in other species. As negative evidence it considers nonsense-mediated decay and start codons in any frame upstream of the predicted start codon. For RefSeq transcripts the RefSeq protein prediction is used directly instead of this procedure. For CCDS proteins the CCDS protein is used directly.
  11. The corresponding UniProt protein is found, if any.
  12. The transcript is assigned a permanent "uc" accession. If the transcript was not in the previous release of UCSC Genes, the accession ends with the suffix ".1", indicating that this is the first version of this transcript. If the transcript is identical to some transcript in the previous release of UCSC Genes, the accession is re-used with the same version number. If the transcript is not identical to any transcript in the previous release, but overlaps a similar transcript with a compatible structure, the previous accession is re-used with the version number incremented.
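
The splicing graph in step 5 can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline's implementation: each RNA alignment is given as a sorted list of exon blocks in half-open coordinates, every block boundary becomes a vertex, and each exon or intron edge keeps the set of RNAs that support it. The function name `build_splicing_graph` and the input layout are hypothetical.

```python
from collections import defaultdict


def build_splicing_graph(alignments):
    """Build a toy splicing graph from overlapping RNA alignments.

    alignments: dict mapping an RNA accession to a sorted list of
                (start, end) exon blocks.
    Returns (vertices, edges): vertices is a set of positions (splice
    sites, starts, and ends); edges maps (start, end, kind) to the set
    of accessions supporting that edge, where kind is "exon" or "intron".
    """
    vertices = set()
    edges = defaultdict(set)
    for accession, blocks in alignments.items():
        for i, (start, end) in enumerate(blocks):
            vertices.update((start, end))
            edges[(start, end, "exon")].add(accession)
            if i + 1 < len(blocks):
                # The gap to the next block is treated as an intron edge.
                next_start = blocks[i + 1][0]
                edges[(end, next_start, "intron")].add(accession)
    return vertices, dict(edges)


# Two made-up RNAs that share their first exon and intron
# but end their last exon at different positions.
toy_alignments = {
    "rnaA": [(100, 200), (300, 400)],
    "rnaB": [(100, 200), (300, 450)],
}
toy_vertices, toy_edges = build_splicing_graph(toy_alignments)
```

In this toy input the shared exon and intron edges each carry both RNAs as evidence, while the two alternative final exons carry one RNA each. Further evidence sources such as orthologous mouse edges, ESTs, or Exoniphy predictions could be recorded in the same per-edge sets.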

Related Data

The UCSC Genes transcripts are annotated in numerous tables, each of which is also available as a downloadable file. These include tables that link UCSC Genes transcripts to external datasets (such as knownToLocusLink, which maps UCSC Genes transcripts to Entrez identifiers, previously known as Locus Link identifiers), and tables that detail some property of UCSC Genes transcript sequences (such as knownToPfam, which identifies any Pfam domains found in the UCSC Genes protein-coding transcripts). One can see a full list of the associated tables in the Table Browser by selecting UCSC Genes at the track menu; this list is then available at the table menu. Note that some of these tables refer to UCSC Genes by its former name of Known Genes, sometimes abbreviated as known or kg. While the complete set of annotation tables is too long to describe, some of the more important tables are described below.
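
As an illustration of how a linking table might be used once downloaded, the sketch below parses tab-separated text into a lookup from transcript accession to external identifiers. It assumes a two-column layout (transcript accession, then external identifier), which should be checked against the actual table schema; the function name `load_mapping` is hypothetical and the sample rows are made up.

```python
import csv
import io


def load_mapping(tsv_text):
    """Parse two-column tab-separated text into {transcript_id: [values]}."""
    mapping = {}
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        # Skip blank, malformed, or comment lines.
        if len(row) < 2 or row[0].startswith("#"):
            continue
        mapping.setdefault(row[0], []).append(row[1])
    return mapping


# Made-up rows for illustration only; these are not real accessions or identifiers.
SAMPLE = "uc000aaa.1\t1001\nuc000bbb.2\t1002\nuc000bbb.2\t1003\n"
entrez_by_transcript = load_mapping(SAMPLE)
```

A list is kept per transcript so that a transcript mapped to more than one identifier does not silently lose entries.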

Credits

The UCSC Genes track was produced at UCSC using a computational pipeline developed by Jim Kent, Chuck Sugnet, Melissa Cline and Mark Diekhans. It is based on data from NCBI RefSeq, UniProt (including TrEMBL and TrEMBL-NEW), CCDS, and GenBank as well as data from Rfam and the Todd Lowe lab. Our thanks to the people running these databases and to the scientists worldwide who have made contributions to them.

Data Use Restrictions

Copyright information from the UniProt website:

Copyright 2002-2009 UniProt Consortium. We have chosen to apply the Creative Commons Attribution-NoDerivs License to all copyrightable parts of our databases. This means that you are free to copy, distribute, display and make commercial use of these databases, provided you give us credit. However, if you intend to distribute a modified version of one of our databases, you must ask us for permission first. All databases and documents in the UniProt FTP directory may be copied and redistributed freely, without advance permission, provided that this copyright statement is reproduced with each copy.

References

Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL. GenBank: update. Nucleic Acids Res. 2004 Jan 1;32:D23-6.

Chan PP, Lowe TM. GtRNAdb: A database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res. 2009 Jan;37:D93-7.

Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A. Rfam: Wikipedia, clans and the "decimal" release. Nucleic Acids Res. 2011 Jan;39:D141-5.

Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D. The UCSC Known Genes. Bioinformatics. 2006 May 1;22(9):1036-46.

Kent WJ. BLAT - the BLAST-like alignment tool. Genome Res. 2002 Apr;12(4):656-64.

Lowe TM, Eddy SR. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 1997 Mar 1;25(5):955-64.