Intergenomics on QTLs

Topic
Quantitative trait loci (QTL) mapping efforts have been performed independently in several species for the same traits. If the animal models are driven by the same genetic mechanisms as the human diseases they represent, we should expect to find conserved sequences shared by the QTLs and susceptibility regions of all three organisms (rat, mouse and human). Genes present as homologues in the QTLs of all three species are the best candidates for relevance to the onset and/or further development of the disease.

Data sources
A comprehensive QTL database for rodent EAE was created with data collected from the public databases of the NCBI, the Jackson Laboratory (MGI: Mouse Genome Informatics) and the Rat Genome Database (RGD) of the Medical College of Wisconsin. The data were complemented with human MS predisposition loci and genes assembled both from public databases of the NCBI and from recent large-scale genetic association studies on MS (mainly: J Neuroimmunol vol 143/1-2).
The remaining data (alternative QTLs, homologous gene pairs, sequence similarity for syntenic blocks) come from a local installation of EnsEMBL.

Statistical methods
Consensuses may occur by chance with a certain probability. To estimate that probability, we run the intergenomics QTL synteny/homology tool on randomly located QTLs of the same size as the original ones (permutation test). With an increasing number of iterations, the calculated distribution approaches the true random distribution. That distribution yields the probability of finding a given number of genes in consensuses by chance. Whenever the observed number of genes in consensuses lies above the 95th percentile, it can be concluded that this number is reached or surpassed by chance in fewer than five out of a hundred cases (i.e. p<0.05).
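
The comparison against the permutation distribution can be sketched as follows. This is a minimal illustration in Python, not the PHP code of the tool itself; the function names and the permutation counts in the example are assumptions made for the sketch.

```python
def empirical_p_value(observed, permuted_counts):
    """Fraction of permutations whose gene count reaches or surpasses the observed one."""
    if not permuted_counts:
        raise ValueError("at least one permutation result is required")
    hits = sum(1 for count in permuted_counts if count >= observed)
    return hits / len(permuted_counts)


def percentile_95(permuted_counts):
    """Nearest-rank 95th percentile of the permutation results."""
    ordered = sorted(permuted_counts)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


# Hypothetical permutation results: 100 iterations yielding 0..99 consensus genes.
permuted = list(range(100))
observed = 283  # "Genes in consensus (all species)" from the sample output

print(percentile_95(permuted))
print(empirical_p_value(observed, permuted))
```

An observed count above the value returned by `percentile_95` corresponds to the p<0.05 criterion described above.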

WARNING: Depending on the number and size of the loci added to the system, the consensus search may take several hours to complete. In most browsers the progress can be followed while the search is running (this does not apply to some versions of Konqueror).

Input

  • Workflow


    The user enters the search constraints (search flow, search type, etc.) in the web interface (SSI). A provisional data set is generated by querying the local or the EnsEMBL database, depending on the user's choice. If the user chose to supply his/her own data, the provisional data set is empty. The provisional data set may then be modified (reduced, extended, changed) and finally submitted. For a search based on synteny, the source species (only one is required, but a multiple selection is also possible) and the target species are analyzed for syntenic genes shared by their QTLs/SLs. If the search is based on homology, the species are analogously analyzed for homologous genes shared by their QTLs/SLs. The final list refers to the target species and is displayed either in detail or as a summary. The result may then be sent for validation to a second interface (see "Validation" below).

  • Search flow

    Select source species:
    Select target species:

    • source: Start of the synteny/homology chain. Select here the organisms whose QTLs will be checked jointly for common syntenic regions / homologous genes with the target species. (For a multiple selection, hold down the Ctrl key while selecting.)

    • target: Chromosomal regions selected for one source species (two-way analysis) or, where applicable, consensus regions between two source species (three-way analysis) are checked for further overlaps with syntenic regions or homologous genes in the target. For a synteny-based search, the results include all syntenic regions affected (dark orange) as well as those among them located within disease susceptibility regions / QTLs (red). The user is therefore encouraged to select as target the species he/she is most interested in.
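
    The distinction between syntenic regions outside (dark orange) and inside (red) the target loci amounts to an interval overlap check. The sketch below is written in Python for illustration only; the function names are invented for this example, positions are in Kbp, and the borders of the IL12B locus are hypothetical values, not data from the tool.

```python
def overlap(region, locus):
    """Return the shared (chromosome, start, end) of two regions, or None."""
    chrom_r, start_r, end_r = region
    chrom_l, start_l, end_l = locus
    if chrom_r != chrom_l:
        return None
    start, end = max(start_r, start_l), min(end_r, end_l)
    return (chrom_r, start, end) if start <= end else None


def classify(syntenic_regions, target_loci):
    """Pair each syntenic region with its overlaps inside the target loci.

    An empty list of hits means the region lies outside every target locus.
    """
    result = []
    for region in syntenic_regions:
        hits = []
        for name, locus in target_loci.items():
            shared = overlap(region, locus)
            if shared is not None:
                hits.append((name, shared))
        result.append((region, hits))
    return result


# Two human regions (chromosome, start, end in Kbp) and one locus.
# The locus borders are hypothetical and chosen only for illustration.
regions = [("5", 158681, 158708), ("5", 158730, 158731)]
loci = {"IL12B": ("5", 158600, 158700)}

for region, hits in classify(regions, loci):
    print(region, hits if hits else "no match")
```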

  • Search type

    Homology Synteny

    The search for consensus genes may be based on sequence comparisons of syntenic blocks or on pair-wise gene homology. The two search types are complementary. A search based on synteny can detect similarities beyond the coding regions and matches genetic regions regardless of the gene functionality behind them. A search based on gene homology is not limited to syntenic blocks; it may find gene homologues clearly outside such blocks and, in contrast to synteny, is not affected by a threshold degree of similarity. The results of the two search methods are therefore very similar but never quite identical: some genes are detected only by synteny and vice versa. The benefits of this slight discordance have been explained in a practical case by Serrano-Fernandez et al. (Genes and Immunity, in press). Both search types are based on EnsEMBL.

  • Gap tolerance

    Gap tolerance:

    Neighbouring syntenic DNA fragments are merged whenever the DNA gap between them is smaller than the tolerated value selected here. Although increasing the gap tolerance reduces the computation time, it may also lead to the identification of genes that lie entirely inside those gaps and are thus possibly not syntenic at all. We therefore recommend a gap tolerance of 10000 bp (10 Kbp) for a quick overview and 1000 bp (1 Kbp) for a detailed analysis. This option is only valid for a search based on synteny.
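
    The merging rule can be illustrated with a short Python sketch. It is not the tool's own code; the function name and the fragment coordinates are made up for the example, and all fragments are assumed to lie on the same chromosome.

```python
def merge_fragments(fragments, gap_tolerance=1000):
    """Merge (start, end) fragments in bp whose gap is smaller than gap_tolerance."""
    merged = []
    for start, end in sorted(fragments):
        if merged and start - merged[-1][1] < gap_tolerance:
            # Gap is below the tolerance: extend the previous fragment.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical fragments with gaps of 700 bp and 7000 bp between them.
fragments = [(100, 500), (1200, 2000), (9000, 9500)]

print(merge_fragments(fragments, gap_tolerance=1000))   # detailed analysis
print(merge_fragments(fragments, gap_tolerance=10000))  # quick overview
```

    With the larger tolerance all three fragments collapse into one, so anything located in the 7000 bp gap would be reported even though it was never matched.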

  • Average percentage of sequence identity

    Avg. id. Species 1 -> 2:
    Avg. id. Species 1+2 -> 3:

    The thresholds that decide when the average percentage of identity between two DNA segments is high enough for them to be considered syntenic can be set manually here. The default threshold between species 1 (e.g. rat) and 2 (e.g. mouse) is 90%. The threshold towards the target species (e.g. human) can be set separately (60% by default, since the genomic drift between human and rodents is greater than within rodents). This option is only valid for a search based on synteny.

  • Borders

    Fix borders on centroid

    There are two ways of setting the limits of a QTL / susceptibility locus. The first fixes the borders on the corresponding centroid value (default). That value is the most probable physical position of the given genetic marker, as calculated from its genetic position by CARTOGRAPHER (Voigt et al. 2004, Non-linear conversion between genetic and physical chromosomal distances. Bioinformatics, in press). The conversion was made with an outlier tolerance of 20% and a scrolling window size of 20 points. For markers without such a converted base pair position, the physical position is used instead when available in EnsEMBL. The second way extends the borders to the outer confidence interval as provided by CARTOGRAPHER. This option is only valid for a search based on synteny.

  • Table Details

    Show detailed table

    This checkbox switches between the full table display (see OUTPUT) and the compact display showing only the final results (quicker). This option is only valid for a search based on synteny.

  • Metatrait selection

    Metatraits for Rattus_norvegicus   Consider whole genome
        Custom
        Local database for EnsEMBL release 16
        EnsEMBL

    Metatraits for Mus_musculus   Consider whole genome
        Custom
        Local database for EnsEMBL release 16
        EnsEMBL

    Metatraits for Homo_sapiens   Consider whole genome
        Custom
        Local database for EnsEMBL release 16
        EnsEMBL

    A metatrait is a group of traits (e.g. delay of disease development, intensity of the symptoms, etc.) that account for the same superordinate phenotype (e.g. multiple sclerosis). Experimental autoimmune encephalomyelitis (EAE) in animals is a model for human multiple sclerosis (MS), and both are therefore merged here into the metatrait EAE/MS for comparative analysis. Although intense work is being done on other metatraits/diseases, EAE/MS is the only one that is public at the moment. Nevertheless, the user is invited to insert his/her own data (option "Custom"). With this option, the submission text fields (see below), which are reserved for the data on QTLs and susceptibility loci, appear empty and allow the user to paste in data at will (for instance control data for calibration). The option "Whole Genome" is only available for the target species and refers to an alternative usage of this software: it checks for all genes in the target genome that are syntenic or homologous to the ones in the QTLs/SLs of the source species. This prospective approach is intended as a help when searching for new SLs/QTLs in the target species.

  • Submission text fields

    Positions for Rattus norvegicus:
    Positions for Mus musculus:
    Positions for Homo sapiens:

    Once the query is submitted with the input parameters adjusted to the user's requirements, two or three textareas appear below, depending on the number of species involved in the analysis. By default the data on EAE and MS are loaded into these fields. The data in each textarea may be modified, deleted or extended where necessary. If the user chooses to customize the metatraits, the textareas appear empty, and he/she may paste his/her own genetic linkage data in base pairs into the textarea of the corresponding species, always according to the following format (TAB delimited):

    Position in base pairs (Syntax: chromosome / first bp / last bp / name)

    ... or paste the marker IDs directly according to the following formats (TAB delimited); both format types (for flanking markers and for peak markers) may be mixed in the same query:

    Positions for Mus musculus:

    In the example above, the first two QTLs (first two lines) are defined by peak markers (only one marker followed by a "-"), while the last two QTLs are defined by a flanking marker at each QTL border.

    Position in markers
    Span size around peak marker (in Kbp):

        (Syntax: flanking marker / flanking marker / name)
    or (Syntax: peak marker / - / name)

    The selectable span size around the marker describes the window (in base pairs) to be considered around a peak marker. Moreover, in the rare cases where the span between two flanking markers is less than the selected minimum span size, the QTL described by flanking markers is automatically extended according to this span size.

    NOTE: If a marker ID is mistyped or is not included in the EnsEMBL database, the script will ignore it. To confirm which markers have been taken into account by the script, please check the output after submitting the marker data. If necessary, we recommend searching for a neighbouring marker that is included in the database and repeating the submission.
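
    As an illustration of the three TAB-delimited input formats described above, the following Python sketch tells them apart by the number of fields and by the "-" placeholder. The function name, the returned dictionary layout and the marker IDs `MARKER_A` and `MARKER_B` are assumptions made for this example; the tool itself may parse its input differently.

```python
def parse_locus_line(line):
    """Parse one TAB-delimited input line into a small dictionary."""
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    if len(fields) == 4:
        # chromosome / first bp / last bp / name
        chromosome, first_bp, last_bp, name = fields
        return {"kind": "position", "chromosome": chromosome,
                "first_bp": int(first_bp), "last_bp": int(last_bp), "name": name}
    if len(fields) == 3:
        left, right, name = fields
        if right == "-":
            # peak marker / - / name
            return {"kind": "peak", "marker": left, "name": name}
        # flanking marker / flanking marker / name
        return {"kind": "flanking", "markers": (left, right), "name": name}
    raise ValueError("expected 3 or 4 TAB-delimited fields: %r" % line)


# The first line restates the rat QTL EAE3 from the sample output in base pairs;
# the marker IDs and QTL names in the other two lines are placeholders.
lines = [
    "10\t25800000\t65000000\tEAE3",
    "MARKER_A\t-\tQTL_PEAK",
    "MARKER_A\tMARKER_B\tQTL_FLANKED",
]
for line in lines:
    print(parse_locus_line(line))
```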

Output

Rattus_norvegicus QTLs            | Homology in Mus_musculus                        | Homology in Homo_sapiens
Chr.  Kbp Start  Kbp Size  Name   | Chr.  Kbp Start  Kbp Size  Consensus            | Chr.  Kbp Start  Kbp Size  Consensus
10    25800      39200     EAE3   | 11    44794      0.5       44794-44794 (EAE6b)  |                            no match
10    25800      39200     EAE3   | 11    44806      0.5       44806-44806 (EAE6b)  |                            no match
10    25800      39200     EAE3   | 11    44820      11        44820-44831 (EAE6b)  | 5     158824     10        no match
10    25800      39200     EAE3   | 11    44859      25        44859-44884 (EAE6b)  | 5     158780     9         no match
                                  |                                                 | 5     158803     4         no match
                                  |                                                 | 5     158829     0.5       no match
10    25800      39200     EAE3   | 11    44908      31        44908-44940 (EAE6b)  | 5     158681     27        158681-158700 (IL12B)
                                  |                                                 | 5     158730     0.5       no match
10    25800      39200     EAE3   | 11    44952      4         44952-44956 (EAE6b)  |                            no match
10    25800      39200     EAE3   | 11    44986      144       44986-45131 (EAE6b)  | 5     158478     102       no match
                                  |                                                 | 5     158593     43        no match
10    25800      39200     EAE3   | 11    45141      94        45141-45236 (EAE6b)  | 5     158373     92        no match


The output is formatted as a large HTML table. The left column series (yellow) displays the chromosomal regions analyzed for the source species. The middle part (light orange), if present, shows the chromosomal regions inside QTLs of the second source species that are syntenic to those of the left column series (yellow). The right column series displays the chromosomal regions of the target species that are syntenic to the preceding consensus regions between both source species (three-way analysis) or to the single source species (two-way analysis), either outside (dark orange) or inside (red) the QTLs or susceptibility loci of the target species. All loci displayed link to the EnsEMBL contigview to assist in further analyses.

Kbp in consensus (all species): 14169
Genes in consensus (all species): 283
Fragments (not joined) in consensus (sp1+sp2+ sp3_only_synt): 38537
Fragments (joined) in consensus (sp1+sp2+ sp3_only_synt): 1627
Fragments (not joined) in consensus (all species): not queried
Fragments (joined) in consensus (all species): 632
Kbp in consensus (sp1+sp2): 41612
Genes in consensus (sp1+sp2): not queried
Fragments (not joined) (sp1+sp2): 45917
Fragments (joined) (sp1+sp2): 3990
Fragments (not joined) in consensus (sp1+sp2): not queried
Fragments (joined) in consensus (sp1+sp2): 1238


The output statistics are also summarized in a table. It lists the number of base pairs and genes covered by the consensus and, for a search based on synteny, the number of DNA fragments matched and later merged, with their corresponding sizes in base pairs.

Validation


The validation process is based on a permutation test. The original QTLs and SLs submitted are merged where overlapping (shown in the textareas under "Merged QTLs and SLs") and then randomly rearranged over the genome. The size of the QTLs / SLs is respected and overlaps avoided, but their position is set randomly in each iteration.
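
One iteration of such a random rearrangement could look like the Python sketch below. It is a simplified illustration, not the code of check.php: the genome is treated as a single linear stretch instead of separate chromosomes, overlapping draws are simply rejected and redrawn, and the function name, locus sizes and genome length are hypothetical.

```python
import random


def place_randomly(sizes, genome_length, rng, max_tries=10000):
    """Place half-open intervals of the given sizes at random, without overlaps."""
    placed = []
    for size in sizes:
        for _ in range(max_tries):
            start = rng.randrange(0, genome_length - size + 1)
            candidate = (start, start + size)
            # Accept the draw only if it overlaps none of the loci placed so far.
            if all(candidate[1] <= s or candidate[0] >= e for s, e in placed):
                placed.append(candidate)
                break
        else:
            raise RuntimeError("could not place all loci without overlap")
    return placed


# Hypothetical merged loci sizes (bp) on a hypothetical 10 Kbp genome.
rng = random.Random(0)  # fixed seed so that the example is repeatable
print(place_randomly([100, 250, 50], genome_length=10000, rng=rng))
```

Each locus keeps its size while its position changes from iteration to iteration. The larger the loci relative to the genome, the more draws are rejected, which is why large QTLs / SLs slow the validation down.
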
  • Data Normalization

    Normalize QTL sizes (recommended)

    QTLs and SLs may be located in chromosomal regions with unusually high or low gene density. This is particularly clear for QTLs / SLs covering the MHC locus (very high gene density). Without normalization, the size of such a QTL / SL is respected only in terms of base pairs when it is rearranged in each iteration. Normalization of the QTLs / SLs respects the original number of genes in the region instead, to ameliorate the effect of such disproportions in gene density on the permutation test.

  • Iterations

    Iterations for permutation test (consider a local installation for a sufficient nr. of iterations)

    With an increasing number of iterations, the distribution of the permutation results approximates an ideal random distribution. The number of observed consensus genes is then compared against that distribution. If it lies clearly at a marginal percentile (e.g. above the 95th percentile, p<0.05), the result can be considered unlikely to have arisen by chance. Because of limited computing resources, the on-line version of the script allows a maximum of 10 iterations, clearly below the 1000 recommended. We encourage interested users to consider a (free) local installation of our software to overcome this limit (see Availability).

    WARNING: With increasing sizes of QTLs / SLs, the chance of generating a random rearrangement of the QTLs / SLs without overlaps decreases geometrically, which greatly increases the computing time. For whole genome comparisons (see INPUT) the validation option has been intentionally left out.

  • Type of permutation

    Permutations based on: homology

    This box shows the type of permutation currently being done. To validate data generated by means of synteny, the type of permutation should be synteny, and analogously for data generated by means of homology. The web interface allows only permutations based on homology, again because of limited computing resources. A local installation only needs to activate a certain part of the script to use permutations based on synteny (it is already implemented). It is also important to keep in mind that the differences between the two types of permutation are relatively small (about 10% difference in the number of genes detected), whereas the difference in computing time is dramatic (from a few seconds to a few minutes for homology, up to several hours for synteny). It is therefore advisable to always check the type "homology" first and to run the type "synteny" only if the results for "homology" are promising enough.

  • Show results

    Show detailed results list

    This option allows the user to get the detailed list of the results generated for each iteration. This may be particularly useful for exporting to other software programs in order to represent the random distribution curve and the cutoff of the observed value.



    The figure illustrates the distribution of the number of consensus genes calculated for each iteration of the permutation test (blue vertical bars) in comparison to the observed number (red bar). If the observed number of genes falls in the green region of the distribution, the observed number of consensus genes is significantly higher than expected by pure chance. In other words, the QTLs / SLs analyzed are then likely to share a common, at least partial, explanation for the observed phenotype. However, with a maximum of 10 iterations as offered by this web interface, the calculated p-value should be taken with caution (e.g. here p=0). A local installation of the application allows the number of iterations to be increased. The p-value for 100 iterations is a good trend indicator, and it stabilises at the second decimal position before reaching the 1000th iteration.

  • Segregation of consensuses

    Rattus_norvegicus   Mus_musculus   Homo_sapiens
    EAE12 (3)           EAE7# (7)      D17S907# (4)
    EAE3 (7)            EAE23# (3)     D17S796 (1)
    EAE12 (3)           EAE7# (7)      D17S1848# (1)
    EAE9 (1)            EAE6a (1)      D2S1779 (1)
    EAE3 (7)            EAE23# (3)     D17S975 (2)
    EAE12 (3)           EAE7# (7)      D19S585# (1)
    EAE3 (7)            EAE6b (2)      MSSL# (2)
    EAE1 (1)            EAE1# (1)      D6S1615# (2)
    EAE3 (7)            EAE23# (3)     D17S1879# (1)
    EAE3 (7)            EAE7# (7)      D17S975 (2)
    EAE3 (7)            EAE17 (1)      MSSL# (2)
    EAE3 (7)            EAE6b (2)      D5S2056# (1)
    EAE6 (1)            EAE7# (7)      D17S907# (4)
    EAEZ# (2)           EAE7# (7)      D17S907# (4)
    EAE4 (1)            EAE5 (1)       D6S1615# (2)
    EAEZ# (2)           EAE25 (1)      D18S1127 (1)
    EAE13 (1)           EAE7# (7)      D17S907# (4)


    Each consensus region, taken as a whole, may be syntenic or homologous to one or more consensus regions of the other species. In the latter case there is a reasonable hint that a consensus which segregates into two in another species bears at least two genes relevant for the trait. The argument extends analogously to higher numbers of combinations of consensuses. The table above shows all combinations found. The number in parentheses after each merged QTL / SL is the number of different combinations, with consensuses of the other species homologous / syntenic to it, in which that QTL / SL takes part. This number is an estimate of the minimum number of disease-relevant genes in that consensus.
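
    The numbers in parentheses can be reproduced by counting, for each locus, the distinct combinations in which it takes part. The Python sketch below does this for the combinations listed in the table above; the function name is an assumption for the example and the code is not part of the tool.

```python
from collections import Counter


def combination_counts(combinations):
    """Count, per species column, in how many distinct combinations each locus occurs."""
    unique = set(combinations)
    if not unique:
        return []
    counts = [Counter() for _ in range(len(next(iter(unique))))]
    for combination in unique:
        for column, locus in enumerate(combination):
            counts[column][locus] += 1
    return counts


# (rat, mouse, human) combinations as listed in the table above.
combinations = [
    ("EAE12", "EAE7#", "D17S907#"), ("EAE3", "EAE23#", "D17S796"),
    ("EAE12", "EAE7#", "D17S1848#"), ("EAE9", "EAE6a", "D2S1779"),
    ("EAE3", "EAE23#", "D17S975"), ("EAE12", "EAE7#", "D19S585#"),
    ("EAE3", "EAE6b", "MSSL#"), ("EAE1", "EAE1#", "D6S1615#"),
    ("EAE3", "EAE23#", "D17S1879#"), ("EAE3", "EAE7#", "D17S975"),
    ("EAE3", "EAE17", "MSSL#"), ("EAE3", "EAE6b", "D5S2056#"),
    ("EAE6", "EAE7#", "D17S907#"), ("EAEZ#", "EAE7#", "D17S907#"),
    ("EAE4", "EAE5", "D6S1615#"), ("EAEZ#", "EAE25", "D18S1127"),
    ("EAE13", "EAE7#", "D17S907#"),
]

rat, mouse, human = combination_counts(combinations)
print(rat["EAE3"], mouse["EAE7#"], human["D17S907#"])
```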

Availability

SCRIPTS

The web interface is programmed in PHP4 and is divided into three subprograms:
QTLMIX web interface (qtlmix.php)
The main program: gets all search parameters and the QTL and SL data input, calculates the consensuses and generates the output.
QTLMIX validation (check.php)
Merges overlapping QTLs / SLs and runs a permutation test on the same data, randomized in position over the genome. Generates an overview of the frequency distribution of the data and checks for segregation of consensus regions.
QTLMIX frequency distribution (freqdist.php)
Embedded in the former script; generates a PNG image with the distribution curve of the permutation test.



DATABASES

The local databases are available in different formats and always separately for each metatrait: multiple sclerosis (MS) and rheumatoid arthritis (RA) with their respective animal models in mouse and rat, experimental autoimmune encephalomyelitis (EAE) and collagen- or pristane-induced arthritis (CIA/PIA).

Tables on EAE QTLs and susceptibility loci for MS (HTML, CSV or MYSQL-zipped format)
Tables on CIA and PIA QTLs and susceptibility loci for RA (currently only in MYSQL-zipped format)
ENSEMBL DATABASES on the human, mouse and rat genome and their syntenic relationships.


Important considerations:

The scripts and database available here are Open Source. This means they are license-free in use and distribution. However, the authorship should be indicated whenever the scripts/database are publicly used. The links and paths to the database in the PHP scripts and between the PHP scripts refer strictly to our local network architecture and should therefore be readjusted after a local installation (!). Developers are encouraged to feed back corrections and improvements, and any user is of course welcome to collaborate in debugging the software.

Authors

QTL View was created by Pablo Serrano-Fernández, Steffen Möller, Saleh M. Ibrahim, Hans-Jürgen Thiesen (Immunology, University of Rostock), Uwe K. Zettl (Neurology, University of Rostock), René Gödde and Jörg T. Epplen (Human Genetics, University of Bochum).

For technical help, comments or questions please contact Pablo Serrano-Fernández or Steffen Möller.
