W3C Technology and Society Domain

Summary Report - W3C Workshop on Semantic Web for Life Sciences

Summary

On October 27-28, 2004, W3C held a Workshop on Semantic Web for Life Sciences at the Radisson Hotel in Cambridge, MA.

Table of Contents

Attendees

Keynotes

Panels

Lunch Presentations

Next Steps

ATTENDEES

One hundred fifteen people participated from the following organizations:

Jackson Laboratories; Berlex Biosciences; Novartis; Sanofi-Aventis; Woods Hole Oceanographic Institute; Fred Hutchinson Cancer Research Center
Infinity Pharmaceuticals; AstraZeneca R&D; Elsevier; Millennium Pharmaceuticals; Nature Publishing Group; Pacific Northwest National Laboratory
Stanford Medical Informatics; Harvard Partners; Affymetrix; Mayo Clinic; American Chemical Society; European Bioinformatics Institute
National Science Foundation; Hewlett-Packard; Pfizer; Genentech; MacArthur Foundation; National Center for Genome Resources
Oracle; BioGrid; SemantxLS; PRISM Forum; Swiss Institute of Bioinformatics; National Cancer Institute (Center for Bioinformatics)
Children's Hospital; IBM; INRIA; University of Michigan; University of Massachusetts Boston; Harvard Medical School
AGFA Healthcare; MIT / CSBi; KEVRIC; Object Management Group; University of Cambridge (UK); Fujitsu Laboratories of America
Broad Institute / MIT; MITRE; Genstruct; Network Inference; Alzheimer's Research Forum; German Cancer Research Center
Stanford Medical Informatics; Annotea; BioPAX; HydroJoule; University of Manchester; VTT Finland
Matsushita / W3C; SkyPrise; Djinnisys; Siderean; Yale Center for Medical Informatics; MIND (University of Maryland)
DSTC Pty Ltd; Technion – Israel Institute of Technology; Columbia University; Intelligent Solutions; Panther Informatics; Image Bioinformatics Lab, University of Oxford
University of Colorado; Northeastern University; Tucana Technologies; European Network of Excellence REWERSE; University of Georgia; Japan Biological Information Consortium
University of Zurich; University of Michigan; Life Sciences Insights; De Novo Pharmaceuticals; Chevron Texaco

The workshop program included seven panel discussions on specific topics related to the future of Semantic Web for Life Sciences, and a closing discussion about next steps. In the sections below we provide a summary of each discussion and recommendations on how to proceed. We will include links to detailed notes that have been provided by workshop participants. The position papers submitted by the workshop participants also provide further details on these issues, and you can also browse the attendee list.

Please contact John Wilbanks if you have questions or comments about this report.

KEYNOTES

Keynote Address: Tim Berners-Lee (Director, World Wide Web Consortium)

Keynote Address: Ken Buetow (Director, National Cancer Institute Center for Bioinformatics)

PANELS

Industry Perspectives On Semantic Web For Life Sciences

Panelists: Hugh Salamon (Berlex Biosciences), Ted Slater (Pfizer), Otto Ritter (AstraZeneca); Moderator: Eric Neumann (Sanofi-Aventis)

This panel examined the meaning of common data content standards as opposed to data format standards, the impact of Semantic Web on inter-community communication in global pharmaceuticals, and the role of semantics in information management and decision-making. The panelists agreed that the theoretical elements are in place today (e.g., operations research, econometrics, systems theory, dynamical systems, machine learning and model building), but that the practical tools for large-scale semantic unification and semantics-driven integration and interoperability are yet to be constructed.

One specific example was broadly discussed: in the life sciences, semantics are evolving. For instance, when two proteins bind, at what level of context should the interaction be represented: cellular, tissue, or molecular? The challenge is deciding how much detail is necessary, since too much might be overwhelming. Current semantics are hard to manage even within a single corporation; most knowledge transfer occurs inside Word documents, PowerPoint, email and Excel spreadsheets. It is very difficult to automate processes even within domains, and nearly impossible to automate multidomain knowledge processes (such as those relying on chemistry, intellectual property, literature and market research).

Audience discussion covered topics such as representation methods for content (how to define semantics with enough detail to be scientifically accurate without drowning in details) and the importance of accurately capturing how data is acquired and analyzed. Data provenance, and the importance of standard mechanisms to describe provenance, was the major theme. There were also discussions of how Semantic Web might aid in the context of pharmaceutical M&A, data integration and decision making.

Scientific Publishing and Semantic Web

Panelists: Ben Lund (Nature Publishing Group), David Wood (Tucana Technologies), Marc Krellenstein (Elsevier), Steve Chervitz (Affymetrix); Moderator: Alan Aronson (US National Library of Medicine)

This panel examined standard approaches to mapping between vocabularies, RDF stores in RSS engines, the role of Semantic Web for commercial publishers working on text mining, the value of ontologies in text mining, and the intersection of Semantic Web with the Distributed Annotation System. Nature and Elsevier presented different approaches to Semantic Web - Nature focused on RSS and RDF, with Elsevier focusing on using RDF and OWL internally, managing ontologies and storing outputs of text mining. Panelists agreed that Semantic Web offered significant opportunities to both publishers and readers.

The publishers spoke about methodologies to allow authors to self-publish semantics. Elsevier's "authorgateway" was described; it is operational for "classic" metadata (Dublin Core) but does not allow authors to add "assertional" metadata (e.g. "<genex><is upregulated in><tumors>") or create ontologies. The implications of RDF in a copyright context were also discussed, in particular the question of when publishers will come up with a policy regarding authors self-publishing assertional metadata.
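To make the notion of "assertional" metadata concrete, the example assertion above can be written as a single RDF triple (subject, predicate, object). The following is only a minimal sketch: the `http://example.org/` URIs, the `isUpregulatedIn` property name, and the helper names are hypothetical placeholders, not vocabulary used by any publisher at the workshop.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement whose three parts are all URIs."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(triple: Triple) -> str:
    """Serialize a URI-only triple as one N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


# Hypothetical namespace, used for illustration only.
EX = "http://example.org/"

# "<genex><is upregulated in><tumors>" expressed as a triple.
assertion = Triple(
    subject=EX + "gene/genex",
    predicate=EX + "vocab/isUpregulatedIn",
    obj=EX + "tissue/tumors",
)

line = to_ntriples(assertion)
print(line)
```

Unlike a "classic" metadata field that describes the article itself (title, creator, date), a triple of this kind states a scientific claim made by the article, which is why its authorship and copyright status came up in the discussion.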

Audience discussion covered querying mechanisms for literature and the W3C effort SPARQL, author self-publication of semantics, and copyright implications of Semantic Web and science publications.

Triples and Ontologies

Panelists: Eric Jain (Swiss Institute of Bioinformatics / UniProt), Joel Richardson (Jackson Laboratories), Liju Fan (KEVRIC / Microarray Gene Expression Data Society); Moderator: David States (University of Michigan)

This panel examined the efforts required to convert UniProt (one of the most important public biological databases) into RDF, two of the most important biological ontologies (GO and MGED), reasons for and against conversion of ontologies into OWL, and the mechanisms to map equivalences between ontologies. The panel also discussed the implications of evolving an existing ontology standard such as MGED both within the microarray domain and by extension into other types of array-based analysis. There was a split among the panelists as to the value of converting databases to RDF - one in favor, one opposed (unless there was a user demand) and one neutral.

Audience discussion covered topics such as the relative value of maintaining UniProt in a text format versus metadata format, the use of web services in managing the ontologies, the difficulty of managing rapidly changing databases as scientific knowledge expands, the utility of using controlled vocabularies versus experts to ease data integration from multiple sources, and the ease of using tab-delimited text in managing such databases. The possibility of RDF-querying existing systems such as UniProt, Mouse GO, and databases using MGED for microarrays was discussed, and there was strong interest from the audience.

Web Services and Semantic Web

Panelists: Phil Lord (University of Manchester), Olivier Dameron (Stanford Medical Informatics), Gary Schiltz (National Center for Genome Resources); Moderator: Mark Adams (National Cancer Institute's caBIG)

This panel spoke extensively to the broad range of databases and services that underpin modern life sciences research - complex and rapidly changing stores of both data and metadata - and the potential applications of ontologies to both manage web services and be managed by web services. There was discussion of the mechanisms by which applications might use web services to help process semantic descriptions, and similarly, how web services and ontologies might work to facilitate database interconnectivity. The panel also discussed using semantic approaches to discover services and pre-bundled workflow in an enterprise informatics environment.

Audience discussion covered topics such as the usability of OWL and OWL-S, resolution processes for querying graph systems, using LSIDs in a grid context or in an annotation query system, and methodology for ranking searches across semantic annotations.

Life Science Identifiers - Use Cases, Future Directions

Panelists: Robert Robbins (Fred Hutchinson Cancer Research Center), Jim Myers (Pacific Northwest National Laboratory), Sean Martin (IBM); Moderator: Ted Liefeld (Broad Institute / MIT)

This panel spent much time on the difficulties posed by the need to store, as bits in a database, knowledge about entities such as genes and proteins. There was agreement that the Object Management Group's Life Sciences Identifier specification represents a step forward in managing unique identifiers for the life sciences, but also concern over specific identity issues such as referential integrity and the need to separate "the science from the storage". The panel also examined the benefits of an LSID approach to construct collaborative research networks and to manage knowledge through a discovery research process.

Audience discussion covered topics such as the problem of representing provenance, the desirability of URNs as opposed to http URIs, and the presence / absence of semantics in unique identifiers. There was little consensus as to the URN / URI question, with some in the audience desiring to move on, continue working on existing implementations, and make no changes to LSID, and others in the room looking for changes such as checksums to guarantee integrity as well as moving to a URI scheme.
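The URN versus http URI question can be made concrete with a small sketch. It assumes the LSID layout `urn:lsid:authority:namespace:object[:revision]` from the OMG specification. The resolver base URL `http://resolver.example.org/lsid`, the path mapping, and the sample identifier are hypothetical, chosen only to illustrate what an http form of the same identifier might look like; no such mapping was agreed at the workshop.

```python
from typing import NamedTuple, Optional
from urllib.parse import quote


class LSID(NamedTuple):
    """Components of a Life Sciences Identifier."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into parts."""
    parts = text.strip().split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    # The 'urn:lsid' prefix is matched case-insensitively.
    if (parts[0].lower(), parts[1].lower()) != ("urn", "lsid"):
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


def to_http_uri(lsid: LSID,
                resolver: str = "http://resolver.example.org/lsid") -> str:
    """Map an LSID onto a URL under a hypothetical http resolver."""
    segments = (lsid.authority, lsid.namespace, lsid.object_id)
    uri = resolver + "/" + "/".join(quote(s, safe="") for s in segments)
    if lsid.revision is not None:
        uri += "?revision=" + quote(lsid.revision, safe="")
    return uri


# Illustrative identifier only.
example = parse_lsid("urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434")
print(example)
print(to_http_uri(example))
```

The sketch shows why the two camps could talk past each other: the URN form carries no hint of where to resolve it, while the http form ties the identifier to one server name.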

Cheminformatics and the Semantic Web

Panelists: Eric Neumann (Sanofi-Aventis), Richard Scott (De Novo Pharmaceuticals), Peter Murray-Rust (University of Cambridge); Moderator: Susie Stephens (Oracle)

This panel examined the differences between semantic approaches in chemistry as opposed to biology. The panel noted that, although the chemical side of the life sciences industry has a better history of naming - thanks to the field being older and more "settled" - there are still very significant efforts left, including the integration and normalization of chemical assays with different biological activities. The panel also addressed the issue of group annotation of the specific element of compounds (e.g. sidechains), the problems caused by non-standard chemical data formats (with over 40 in regular use and need of normalization to one another) and the applicability of knowledge gained from representing chemical entities to small biological entities.

Audience discussion covered topics such as methodology for annotating specific elements of chemical modules, search mechanisms for chemical molecular annotations, the semantics implicit in the Chemical Markup Language and how to extract those semantics, and the relative desirability of querying in the chemical databases natively versus RDF query (consensus that querying using an RDF to SQL translation system such as SPARQL or Algae appeared to be the best approach to explore in the short term).

Semantic Aggregation, Integration and Inference

Panelists: Andy Seaborne (HP), Nicole Alexander (Oracle), Greg Meredith (Djinnisys); Moderator: Joanne Luciano (Harvard Medical School / BioPAX / Biopathways Consortium)

This panel examined the efforts of HP and Oracle in Semantic Web for Life Sciences, and the potential for using markup languages such as SBML in a Semantic Web framework to enable the creation of interoperable mathematical models of dynamic cellular systems. The discussion covered methodologies for storing RDF triples in a graph network database setting, building distributed queries on RDF, the difficulties of managing large datasets in RDF, and tools available for download or use at the present moment. The panel also covered the use of process algebra to model dynamic systems.

Audience discussion covered topics such as the usability of the Jena framework, the size of the Jena user community (over 20,000 downloads and an active developer list), the relationship between Oracle's Network Data Model and the data resident in other installations of Oracle, and the feasibility of actually using process algebra for modeling dynamics (leading to a spirited discussion, but little consensus either way).

LUNCH PRESENTATIONS

Exploring Semantic Web Infrastructure for Life Science Knowledge-bases, Ian Wilson - University of Colorado

GoPubMed and Beyond: Rules and Reasoning for Ontology-Based Literature Search, Michael Schroeder - European Network of Excellence REWERSE

The Protege Ontology Editor, Natasha Noy - Stanford Medical Informatics

A KnowledgeBase Project for Alzheimer Disease Research, Tim Clark - Massachusetts General Hospital

Semantic Web Technology in Support of Bioinformatics for Glycan Expression, Amit Sheth - University of Georgia and Semagix, Inc.

ExperiBase, C. Forbes Dewey - Professor of Mechanical Engineering / Bioengineering, MIT/CSBi

A Framework for Annotating High-Throughput Genome-wide Screens, Josh Moore - German Cancer Research Center

Semantic Web Technologies for Analysis of Transcriptome, Rose Dieng-Kuntz - INRIA, UR Sophia Antipolis Project ACACIA

NEXT STEPS

Panelists: Danny Weitzner, Eric Miller, Eric Neumann, John Wilbanks (no presentations - group discussion)

Major lesson -- data integration tools (intra-enterprise, cross-community, cross public/private boundaries) are entirely inadequate.

In wrapping up two days of presentations and discussions, workshop participants considered what next steps would be useful for the life sciences community and W3C to pursue together. The closing discussion, led by the chairs of the workshop program committee, concluded that work is needed in the following three areas:

  1. Core vocabularies: In order to stimulate cross-community data integration, collaborative efforts are required to define core vocabularies that can bridge data and ontologies developed by individual communities of practice. The most important vocabularies are:

    Workshop participants felt strongly that this work should focus on generic, bridging vocabularies, not detailed subject-matter ontologies. During the workshop, numerous examples of detailed ontology development, such as the caBIG coordinated efforts, made clear that W3C should be careful not to duplicate effort or balkanize research communities. The focus on bridging vocabularies such as those identified above will leverage the power of the data already defined in community-specific ontologies by enabling better linking with other information.

  2. Close integration of life science identifiers and Web resources: Standardization of the Life Sciences Identifier (LSID) has made considerable progress though implementation is still sparse. Workshop participants agreed that more consideration should be given to the question: how to implement LSIDs in a manner that leverages existing Web resource resolution mechanisms such as http servers.

  3. Implementers interest group: Many participants who are now implementing semantic web technologies to solve life science problems, or who are considering options for doing so, expressed interest in an ongoing forum for sharing implementation experience. The growth of semantic web infrastructure has benefited from W3C's Semantic Web Interest Group. Workshop participants agreed that creation of a Semantic Web for Life Sciences Interest Group would be valuable and a number of individuals expressed a desire to participate if such a group were created.

Recommendations

Based on the wrap-up discussion, there was strong support from workshop participants for launching three work efforts at W3C.

  1. Core Vocabularies Working Group
  2. Investigation of identifier mechanisms and implementation strategies
  3. Implementers Interest Group

The workshop made a significant contribution to W3C's understanding of the needs of the life sciences community and directions for the development of the Semantic Web. W3C staff is now planning for the launch of work efforts as recommended by the wrap up panel and hopes to begin this work as soon as the necessary resources and participants are available.


$Id: swls-workshop-report.html,v 1.39 2004/11/22 04:55:07 djweitzner Exp $