Emerging high-throughput technologies such as next-generation sequencing (NGS) have led to a dramatic increase in descriptive and functional genetic information over the past decade, revealing gene properties such as gene family, tissue distribution, gene function and pathway membership. Processing these properties into gene similarities that go beyond sequence homology enables the unbiased exploration of inter-gene relationships. Existing computational tools that exploit such gene relationships include UCSC Gene Sorter and EvoCor. However, these tools apply each similarity independently and do not make use of multidimensional scoring.
Genehopper is a new search engine focused on human genes that allows the exploration of gene-to-gene relationships. It handles two different query types. The typical use case starts with a term-to-gene search (Figure 1), i.e. an optimized full-text search for an anchor gene of interest. The web interface accepts one or more terms, including gene symbols and identifiers from Ensembl, UniProtKB, EntrezGene and RefSeq. Additionally, Genehopper can find genes by publication or SNP variant identifiers, and it also handles unspecific vocabulary.
Once the anchor gene is defined, the user can explore its neighbourhood, which is determined by the weighted sum of the normalized gene similarities listed in Table 1.
| No. | Similarity | Symbol | Data source | Measure |
|---|---|---|---|---|
| 1 | Homology | S_HOM | Ensembl Compara | Sequence identity |
| 2 | Normal Tissue Expression Profile | S_NEX | Human Protein Atlas | Spearman |
| 3 | InterPro Protein Domain | S_IPD | Swiss-Prot | Cosine |
| 4 | Swiss-Prot Protein Feature | S_SPF | Swiss-Prot | Cosine |
| 5 | Variant-related Publications | S_VP | Ensembl Variation | Cosine |
| 6 | GO Cellular Component | S_CC | Ensembl Core | Resnik-BMA |
| 7 | GO Molecular Function | S_MF | Ensembl Core | Resnik-BMA |
| 8 | GO Biological Process | S_BP | Ensembl Core | Resnik-BMA |
| 9 | HUGO Gene Symbol | S_HGS | HGNC | Prefix distance |
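As an illustration of one measure named in Table 1, the following minimal sketch computes a cosine similarity between two genes whose annotations are treated as binary feature sets. The binary encoding, the function name `cosine_similarity` and the placeholder annotation labels are assumptions made for this example; the source does not specify how the feature vectors are built.

```python
import math


def cosine_similarity(features_a, features_b):
    """Cosine similarity between two binary annotation sets.

    For 0/1 vectors the dot product equals the size of the intersection
    and each vector norm equals the square root of the set size.
    """
    a, b = set(features_a), set(features_b)
    if not a or not b:
        return 0.0  # no annotations on one side: define similarity as 0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Placeholder annotation labels (hypothetical, not real database entries).
gene_x = {"DOMAIN_A", "DOMAIN_B", "DOMAIN_C"}
gene_y = {"DOMAIN_B", "DOMAIN_C", "DOMAIN_D", "DOMAIN_E"}

sim_xy = cosine_similarity(gene_x, gene_y)  # 2 shared / sqrt(3 * 4)
```

Because the result lies in [0, 1] for non-negative vectors, a score of this kind needs no further rescaling before it enters a weighted sum.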
All gene-to-gene similarities are pre-calculated to ensure fast retrieval. Users can adjust each weight, which allows the gene search to be customized flexibly for specific use cases. Result genes are ranked in descending order of their overall ranking score, which is the weighted sum of the pairwise similarities between the anchor gene and each other gene (Figure 2).
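The ranking step can be sketched as follows. This is a minimal illustration under stated assumptions, not Genehopper's implementation: the function `rank_neighbours`, the gene names, the dimension keys and all scores are hypothetical, and the similarities are assumed to be already normalized to [0, 1].

```python
def rank_neighbours(similarities, weights):
    """Rank candidate genes by the weighted sum of their similarities.

    similarities: {gene: {dimension: normalized similarity to the anchor gene}}
    weights:      {dimension: user-chosen weight}; missing dimensions count as 0
    Returns a list of (gene, score) pairs in descending score order.
    """
    scores = {
        gene: sum(weights.get(dim, 0.0) * value for dim, value in dims.items())
        for gene, dims in similarities.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical pre-calculated similarities to one anchor gene.
similarities = {
    "GENE_A": {"HOM": 0.9, "NEX": 0.2, "BP": 0.1},
    "GENE_B": {"HOM": 0.1, "NEX": 0.8, "BP": 0.7},
    "GENE_C": {"HOM": 0.5, "NEX": 0.5, "BP": 0.5},
}

# Two hypothetical weight settings: function-oriented vs. homology-oriented.
by_function = rank_neighbours(similarities, {"HOM": 1.0, "NEX": 0.5, "BP": 2.0})
by_homology = rank_neighbours(similarities, {"HOM": 2.0, "NEX": 0.5, "BP": 0.0})
```

With these made-up numbers, raising the weight on biological process puts GENE_B first, whereas emphasizing homology puts GENE_A first. This is the kind of reordering that adjustable weights are meant to allow.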
All implemented similarities show low pairwise correlation (max r² = 0.35), implying low linear dependency, i.e. a change in any single weight affects the ranking. We therefore treated them as separate dimensions of the search space.
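A check of this kind could look like the sketch below, which computes the squared Pearson correlation for every pair of similarity dimensions and reports the largest value. The helper names and the toy score vectors are assumptions for illustration; the source does not state how the correlations were computed, and real input would hold one score per gene pair for each dimension.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def max_pairwise_r2(dimensions):
    """Return (dim_a, dim_b, r2) for the most strongly correlated pair.

    dimensions: {name: scores}, where all score lists cover the same
    gene pairs in the same order.
    """
    return max(
        (
            (a, b, pearson_r(dimensions[a], dimensions[b]) ** 2)
            for a, b in combinations(dimensions, 2)
        ),
        key=lambda triple: triple[2],
    )


# Toy data (made up): X and Y are perfectly linearly related, Z is not.
toy = {
    "X": [1.0, 2.0, 3.0, 4.0],
    "Y": [2.0, 4.0, 6.0, 8.0],
    "Z": [1.0, -1.0, -1.0, 1.0],
}
dim_a, dim_b, r2 = max_pairwise_r2(toy)
```

A maximum r² well below 1 across all pairs indicates that no dimension is a near-linear copy of another, which is the property the text relies on when treating the similarities as separate dimensions.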