BioMed Search
Results 0-9 of about 107 for author:durbin in 0.122 sec.
Genome Biology
Carter D, Durbin R (2006 Aug)
Vertebrate gene finding from multiple-species alignments using a two-level strategy
Alignment for a coding splice acceptor site. The figure shows the central part of a typical alignment window used by the classifier component of DOGFISH. Codon boundaries on the exon side of the splice site are indicated with dots. This site has an alignment with all species except frog. Species abbreviations: hs, Homo sapiens; mm, Mus musculus; rn, Rattus norvegicus; cf, Canis familiaris; gg, Gallus gallus; dr, Danio rerio; fr, Fugu rubripes. The AG dinucleotide for the acceptor site itself is shown in bold.
  • As explained in more detail in the Materials and methods section, DOGFISH's classifier consists of two main components, which adopt respectively a 'vertical' and a 'horizontal' view of alignments of multiple species around each feature of interest (see Figure 1 for an example alignment).
  • The inner 78 positions of a classifier window, for a typical phase-zero acceptor site, are shown in Figure 1.
Genome Biology
Carter D, Durbin R (2006 Aug)
Vertebrate gene finding from multiple-species alignments using a two-level strategy
DOGFISH-2E results. (a) Sensitivity and specificity for DOGFISH-2E output. The figure shows plots of sensitivity against specificity on the ENCODE test regions as the acceptance probability threshold is varied for internal exons, external (initial and terminal) exons, and all exons together. 'X' is used to mark the DOGFISH-2 sensitivity and specificity values, and the specificity value of 95% for almost 50% sensitivity is highlighted. (b) Probability of annotation as a function of DOGFISH-2E estimate. The figure shows DOGFISH-2E probability estimates on the x axis and, on the y axis, the probability that a site with a DOGFISH-2E estimate of the given magnitude is annotated in ENCODE and Ensembl, respectively. The Y = X line is shown for comparison.
  • Figure 2a shows the behavior on the ENCODE test regions for internal exons, external exons (initial and terminal individually show similar behavior) and all exons together.
  • For DOGFISH-2E on the ENCODE test data, the corresponding factor was 1.001, though the relationship was less linear (Figure 2b).
  • For evaluating DOGFISH-2E on ENCODE test data (Figure 2a), we trained only on the ENCODE training regions, while for the whole-genome scan we used RVMs trained on all the ENCODE data; the resulting differences appeared to be minimal.
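As a rough illustration of how a curve like the one in Figure 2a can be traced, the sketch below sweeps an acceptance probability threshold over scored exon predictions. It assumes the usual gene-prediction convention, in which sensitivity is the fraction of annotated exons that are predicted and specificity is the fraction of predicted exons that are annotated. The function name, input layout and toy numbers are hypothetical and are not taken from DOGFISH.

```python
def sens_spec_curve(predictions, n_annotated, thresholds):
    """Sweep an acceptance threshold over scored exon predictions.

    predictions: list of (probability, is_annotated) pairs, one per predicted exon.
    n_annotated: total number of annotated exons in the test regions.
    Returns a list of (threshold, sensitivity, specificity) tuples.
    """
    curve = []
    for t in thresholds:
        # Keep only predictions whose probability reaches the threshold.
        accepted = [ok for p, ok in predictions if p >= t]
        true_pos = sum(accepted)
        sensitivity = true_pos / n_annotated if n_annotated else 0.0
        # Assumed convention: specificity = TP / (TP + FP).
        # It is set to 1.0 when nothing is accepted.
        specificity = true_pos / len(accepted) if accepted else 1.0
        curve.append((t, sensitivity, specificity))
    return curve


# Toy data, invented for illustration only.
toy = [(0.95, True), (0.90, True), (0.70, False), (0.60, True), (0.30, False)]
for t, sn, sp in sens_spec_curve(toy, n_annotated=4, thresholds=[0.25, 0.5, 0.8]):
    print(f"threshold={t:.2f} sensitivity={sn:.2f} specificity={sp:.2f}")
```

Raising the threshold trades sensitivity for specificity, which is the behaviour the plot summarises.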
Genome Biology
Carter D, Durbin R (2006 Aug)
Vertebrate gene finding from multiple-species alignments using a two-level strategy
Mean RVM weights for horizontal and vertical component inputs. The figure shows the means, with p = 0.05 two-tail error bars, for weights assigned to inputs by acceptor site-type-pair RVMs in Classifier Two, averaging over all 20 pairings of decoy with true site types. The presence component has a single score. Two-letter abbreviations are used for the species-specific scores output by the horizontal component, while the vertical-component quantities are for eight 25 base-pair subregions (only six of which ever get non-zero scores) with one gap score. Species abbreviations are as in Figure 1.
  • Figure 3 illustrates one stage of the data-reduction process, showing how one presence, eight horizontal and nine vertical scores are weighted.
BMC Genomics
Jiménez JL, Durbin R (2006 Oct)
[X]uniqMAP: unique gene sequence regions in the human and mouse genomes
[X]uniqMAP statistics based on EnsEMBL version 40. Figures (a) to (c) summarise the data for the intra-comparisons within the human (left) and mouse (right) genomes. (a) The distributions of the proportion of exonic regions shared by all transcripts within a gene indicate that, for some genes with a high number of splicing variants, it may be impossible to find a region that targets all their transcripts simultaneously. (b) Distributions of the proportion of unique 19-mers found for genes, excluding pseudogenes, with single (red) and multiple (cyan) transcripts show that most genes present a high degree of uniqueness, although for nearly 25% of human genes the level of uniqueness is poor, i.e. between 0 and 5%. (c) Graphs summarising the lengths of the longest unique fragments found for each gene (red) or individual transcript (blue). (d) Statistics from the inter-species comparisons of the human and mouse unique regions. The histograms correspond to the proportion of unique positions shared between the two organisms with respect to the total number of unique positions within each one of them (left) and the distribution of the longest unique fragments shared between the gene pairs (right).
  • A summary of the information held currently in the database, based on EnsEMBL release 40, can be found in Figure 2.
  • Figure 2a shows that with an increasing number of transcripts per gene it can sometimes be difficult to find unique regions at the gene level, since the proportion of regions shared by all transcripts decreases, in some cases to zero.
  • Nevertheless, when regions shared by all transcripts from a gene exist, 85% and 94% of human and mouse genes, respectively, do present unique regions that, in more than half of the cases, extend to 60% (human) and 72% (mouse) of their lengths (Figure 2b).
  • The distribution of the maximum lengths of the unique regions per gene/transcript shows that it is possible to find unique fragments that are at least 40 bases long in 80% (human) and 86% (mouse) of the cases (Figure 2c).
  • From the comparison of unique regions across the genomes, we observed that 15104 human-mouse gene pairs share identical unique regions, although these fragments only represent a small subset of the total length of the intra-genomic unique regions (Figure 2d, left) and in 76% of the cases the longest fragments are shorter than 40 bases (Figure 2d, right).
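To make the notion of "proportion of unique 19-mers" concrete, here is a minimal sketch that counts k-mers across a set of sequences and reports, for each gene, the fraction of its k-mers that occur exactly once in the whole set. It is an assumption-laden toy: the function name is hypothetical, and it ignores reverse complements, repeat masking and alternative transcripts, all of which a real pipeline would have to handle.

```python
from collections import Counter


def unique_kmer_fraction(sequences, k=19):
    """sequences: dict mapping a gene name to its sequence string.

    Returns a dict mapping each gene name to the fraction of its k-mer
    positions whose k-mer occurs exactly once across the whole set.
    """
    counts = Counter()
    for seq in sequences.values():
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1

    fractions = {}
    for name, seq in sequences.items():
        n = len(seq) - k + 1
        if n <= 0:
            # Sequence shorter than k: no k-mers to evaluate.
            fractions[name] = 0.0
            continue
        unique = sum(1 for i in range(n) if counts[seq[i:i + k]] == 1)
        fractions[name] = unique / n
    return fractions


# Toy example with k=3 so the result can be checked by hand.
toy = {"geneA": "ACGTAC", "geneB": "CGTACC"}
print(unique_kmer_fraction(toy, k=3))
```

In the toy input each gene has four 3-mers, three of which are shared with the other gene, so each gene has a unique fraction of 0.25.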
BMC Genomics
Jiménez JL, Durbin R (2006 Oct)
[X]uniqMAP: unique gene sequence regions in the human and mouse genomes
Graphical representation of unique genomic regions. (a) The graphical display for the unique regions within a genome depicts the full-length sequences, without common introns, as white rectangles with the unique regions in red for genes (top) and blue for individual transcripts (bottom). Repeats or low complexity regions are highlighted in black and those redundant in grey. (b) For the unique regions shared across genomes, the two levels of the display correspond to all the unique regions of the gene in the reference organism and those matched by the target organism. The colour-coded scheme for the reference is the same as in (a). For the target, the shared unique positions are placed relative to those matched with the reference and the colours represent all the possible combinations that can be found between the shared sequences, as explained in the main text.
  • In uniqMAP, the graphical display is split into two levels, namely gene and transcript, corresponding to regions shared by all transcripts or by a single transcript, respectively (Figure 3a).
  • Unique regions are shown in red for genes (Figure 3a, top) and blue for individual transcripts (Figure 3a, bottom).
  • In XuniqMAP, the display is also split into two levels, namely reference and target, corresponding to all the unique regions of the gene in the reference organism and those matched by the target organism, respectively (Figure 3b).
  • The colour-coded scheme for the reference is the same as in uniqMAP, i.e. red for unique regions shared by all transcripts and blue for those specific to individual transcripts (Figure 3b, top).
  • For the target, the shared unique positions are placed relative to those matched with the reference and the colours represent all the possible combinations of matches that can be found between the shared sequences: (i) red when they are present in all transcripts for both genes; (ii) blue if present only in individual transcripts for both genes; and (iii) green when present in all the transcripts of one gene but only in a single transcript of the other (Figure 3b, bottom).
BMC Genomics
Jiménez JL, Durbin R (2006 Oct)
[X]uniqMAP: unique gene sequence regions in the human and mouse genomes
Building the non-redundant sequence set. The schema depicts an example for the establishment of the NR sequence set for a gene with three splicing variants. The different fragments are grouped according to their presence across all transcripts as described in the main text. Notice that these fragments (coloured) comprise only the central positions of all possible 19-mers and therefore transcript ends are not included (blank boxes at the top of the Figure). However, in the final NR sequence set (bottom) the 5' and 3' ends will be added to their corresponding fragments and the ends of the other fragments will be extended until they account for the full-length sequences of all 19-mers they represent. The philosophy behind this procedure is similar to that previously described by others [4].
  • All the central positions that fell within nine bases from the exonic boundaries were considered to be part of the regions that join exons whereas the others were part of the exon body (Figure 1, top).
  • Then, single copies of exonic ends, combined as seen in all transcripts, and exon bodies were extracted from each gene, storing in the database the genomic coordinates for these fragments as well as information about the number of transcripts they came from (Figure 1, bottom).
  • In the final NR sequence set, the fragments were extended by nine nucleotides at both ends to account for the full-length 19-mers they represented (Figure 1, bottom).
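The bookkeeping around central positions can be sketched as follows. The code works in transcript coordinates (0-based) and assumes that "within nine bases of an exonic boundary" means that the 19-mer centred on a position spans an exon-exon junction. That reading, the helper names and the toy transcript are assumptions made for illustration, not details confirmed by the source.

```python
K = 19
HALF = K // 2  # nine bases on each side of a 19-mer's central position


def central_positions(length, k=K):
    """Central positions of all k-mers that fit in a sequence of this length."""
    half = k // 2
    return range(half, length - half)


def spans_junction(center, junctions, half=HALF):
    """True if the k-mer centred here covers an exon-exon junction.

    A junction j is the boundary between transcript positions j-1 and j.
    The k-mer covers positions center-half .. center+half inclusive.
    """
    return any(center - half < j <= center + half for j in junctions)


def extend_fragment(start, end, length, half=HALF):
    """Extend an inclusive run of central positions by `half` bases at both
    ends, so that it covers the full k-mers it represents (clipped to the
    sequence)."""
    return max(0, start - half), min(length - 1, end + half)


# Toy transcript: 60 bases, two exons joined at position 30.
length, junctions = 60, [30]
joining = [c for c in central_positions(length) if spans_junction(c, junctions)]
print(joining[0], joining[-1])                            # run of junction-spanning centres
print(extend_fragment(joining[0], joining[-1], length))   # full span of those 19-mers
```

Note that the transcript ends carry no central positions, which is why the first and last nine bases have to be added back when the final fragments are built.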
Nucleic acids research.
Hajarnavis A, Korf I, Durbin R (2004)
A probabilistic model of 3' end formation in Caenorhabditis elegans.
Figure 1. Properties of cleavage sites and 3'-UTRs. AATAAA motifs are boxed in yellow; the cleavage site is seen where the mRNA diverges from genomic sequence (nucleotides not present in the genome are in lowercase). Where the divergence occurs downstream of genomic A's, the cleavage site is ambiguous and is boxed in green. (a) AC3.5 has one aligned mRNA; cleavage is detected between two G residues. (b) In C07A12.4a, the precise cleavage position cannot be assigned accurately, but it must occur in the region shown in green. (c) F17C11.9a has two distinct cleavage sites, as shown by the two different mRNAs diverging at different points. (d) F26E4.6 has seven aligned mRNAs and four distinct cleavage sites over five bases. (e) Length distribution for the distance between the AATAAA motif and the cleavage site for 106 sequences containing a single, unambiguous cleavage site and a single unique AATAAA within 40 nt upstream as in 1a. (f) Bars: histogram of observed lengths of 3'-UTRs from WS110. Line: expectation from a geometric distribution with a mean of 200.
  • In some cases, alignment of a cDNA identifies an unambiguous cleavage site, as in Figure 1a.
  • However, in many other cases it was difficult to define a unique cleavage site for the following reasons: The cleavage site may exist within a run of A's in the genomic sequence (Figure 1b).
  • These may be located hundreds of nucleotides apart or they may be much closer, even overlapping (Figure 1c).
  • Even when the position of the AATAAA is clearly defined and there are no A's in the genomic DNA, there may be multiple distinct cleavage sites clustered together (Figure 1d).
  • Of the 1156 poly(A) tail alignments found in this way, 111 had a single, non-overlapping AATAAA motif and an unambiguous cleavage site (Figure 1a).
  • 855 sequences were found to have ambiguous cleavage sites, as in Figure 1b.
  • The remaining sequences were of the form shown in Figure 1c and d.
  • Those sequences with well-defined AATAAA motifs (such as in Figure 1a and b) were retained for use in model building.
  • The length distribution of the remaining 106 sequences is shown in Figure 1e.
  • The length probability came from a smoothed version of Figure 1e that allows a range from 5 to 30 nt.
  • From 1156 sequences we were able to assign a unique maximum likelihood AATAAA motif and cleavage site for 961 sequences, with the remainder being of the forms seen in Figure 1c and d.
  • A histogram of 3'-UTR lengths is shown in Figure 1f.
  • Circular states have geometric distributions, rectangular states have fixed lengths, and the diamond state has a distribution similar to Figure 1e but with a range from 5 to 30 nt.
  • SP—T-rich spacer region of restricted length (Figure 1e)
  • The SP state has a distribution similar to Figure 1e with minimum and maximum values of 5 and 30 nt.
  • While collecting our dataset of unique AATAAA and cleavage sites we selected against those genes with high cDNA coverage, as genes containing a larger number of matching transcripts tended to have multiple distinct cleavage sites, such as in Figure 1d. Figure 3a shows the distribution of cleavage sites at each nucleotide for a 3'-UTR with 31 cDNA matches.
  • These 3' ends may therefore fall into the class depicted in Figure 1c with multiple AATAAA motifs.
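For readers who want to reproduce the expectation line of Figure 1f, the following sketch computes a geometric length distribution with a mean of 200 and the expected count per histogram bin. It assumes lengths start at 1 nt, so that P(n) = p(1 - p)^(n - 1) with p = 1/200. The bin width, the total count and the function names are hypothetical choices, not values from the paper.

```python
def geometric_pmf(n, mean=200.0):
    """P(length = n) for n = 1, 2, ... under a geometric distribution
    with the given mean (success probability p = 1 / mean)."""
    if n < 1:
        return 0.0
    p = 1.0 / mean
    return p * (1.0 - p) ** (n - 1)


def expected_bin_counts(total, bin_width, n_bins, mean=200.0):
    """Expected number of sequences per bin [1..w], [w+1..2w], ...
    when `total` lengths are drawn from the geometric distribution."""
    q = 1.0 - 1.0 / mean
    counts = []
    for b in range(n_bins):
        lo, hi = b * bin_width, (b + 1) * bin_width
        # P(lo < length <= hi) = q**lo - q**hi, because P(length > n) = q**n.
        counts.append(total * (q ** lo - q ** hi))
    return counts


# Hypothetical example: 1000 UTRs in 50-nt bins.
print(expected_bin_counts(total=1000, bin_width=50, n_bins=4))
```

Because the geometric distribution is memoryless, the expected counts fall by the same factor from each bin to the next, which gives the line its smooth exponential-like decay.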
Nucleic acids research.
Hajarnavis A, Korf I, Durbin R (2004)
A probabilistic model of 3' end formation in Caenorhabditis elegans.
Figure 2. Nucleotide composition near the cleavage site. (a) Nucleotide frequencies are shown from –80 to +40 with respect to the cleavage site at 0. Frequencies farther up- and downstream are given in the UTR and Gen columns outside the main graph. Gen corresponds to the average nucleotide composition in C. elegans genomic DNA. (b) State transition diagram for a generalized HMM that describes the sequence composition in the vicinity of the cleavage site. Circular states have geometric distributions, rectangular states have fixed lengths, and the diamond state has a distribution similar to Figure 1e but with a range from 5 to 30 nt.
  • To examine sequence features characteristic of 3' end formation, we aligned genomic sequences anchored at the cleavage site and plotted the nucleotide frequencies 80 bp upstream and 40 bp downstream (Figure 2a).
  • Using these six regions we designed an HMM to describe 3' end formation (Figure 2b).
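A position-by-position composition profile of the kind plotted in Figure 2a could be computed along these lines. The sketch assumes each site is given as a genomic sequence plus the 0-based index of its cleavage position; the function name and the toy sequences are hypothetical.

```python
from collections import Counter


def positional_frequencies(sites, upstream=80, downstream=40):
    """sites: list of (sequence, cleavage_index) pairs.

    Returns a dict mapping each offset in -upstream..+downstream to a dict
    of A/C/G/T frequencies across all sites, anchored at the cleavage
    position (offset 0). Positions outside a sequence, and characters
    other than A, C, G and T, are skipped.
    """
    freqs = {}
    for offset in range(-upstream, downstream + 1):
        counts = Counter()
        for seq, cut in sites:
            i = cut + offset
            if 0 <= i < len(seq):
                base = seq[i].upper()
                if base in "ACGT":
                    counts[base] += 1
        total = sum(counts.values())
        freqs[offset] = {b: (counts[b] / total if total else 0.0) for b in "ACGT"}
    return freqs


# Toy input, small enough to verify by eye.
toy_sites = [("AATAAAGT", 6), ("CATAAACT", 6)]
profile = positional_frequencies(toy_sites, upstream=6, downstream=1)
print(profile[-5], profile[0])
```

Plotting the four frequencies against the offset gives one curve per nucleotide, from which compositionally distinct regions can be read off.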
Nucleic acids research.
Hajarnavis A, Korf I, Durbin R (2004)
A probabilistic model of 3' end formation in Caenorhabditis elegans.
Figure 3. Posterior probabilities mirror observed cleavage sites. The posterior probability of the AATAAA motif and cleavage site are shown in red and blue lines, respectively. The observed frequency of cleavage sites is indicated by a green line. When the cleavage site is ambiguous, the frequency is averaged over the ambiguous positions, which gives the green line a flat peak. (a) 31 mRNAs aligned to gene ZK652.4 show that there are multiple, tightly clustered cleavage sites. (b) 38 mRNAs aligned to gene R09B3.3 show a broad cluster of cleavage sites which are the result of three predicted AATAAA motifs.
  • While collecting our dataset of unique AATAAA and cleavage sites we selected against those genes with high cDNA coverage, as genes containing a larger number of matching transcripts tended to have multiple distinct cleavage sites, such as in Figure 1d. Figure 3a shows the distribution of cleavage sites at each nucleotide for a 3'-UTR with 31 cDNA matches.
  • The frequency of observed cleavage sites is very similar to the posterior probability. Figure 3b shows a case where there are multiple AATAAA motifs and cleavage sites.
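The averaging over ambiguous positions described in the Figure 3 caption can be sketched as below. Each aligned mRNA is assumed to contribute one observation, given as an inclusive range of possible cleavage positions (a single position when the site is unambiguous), and its weight is spread evenly over that range. The function name and the toy observations are hypothetical.

```python
def observed_cleavage_frequency(observations, length):
    """observations: list of (first, last) inclusive position ranges,
    one per aligned mRNA; first == last for an unambiguous site.

    Returns a list of per-position frequencies that sums to 1 when there
    is at least one observation.
    """
    freq = [0.0] * length
    for first, last in observations:
        width = last - first + 1
        for pos in range(first, last + 1):
            # An ambiguous site spreads its single count over its range.
            freq[pos] += 1.0 / width
    total = len(observations)
    return [f / total for f in freq] if total else freq


# Toy input: two mRNAs cleaved exactly at position 3, one ambiguous over 5..8.
toy_obs = [(3, 3), (3, 3), (5, 8)]
print(observed_cleavage_frequency(toy_obs, length=10))
```

The ambiguous observation produces equal values at positions 5 to 8, which corresponds to the flat peak mentioned in the caption.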