Project Description
A Word add-in that enables the annotation of Word documents based on terms that appear in ontologies.

Read the latest press release here.

Summary
With this project, Microsoft Research Connections aims to help communities that maintain ontologies experiment more easily, and to improve the experience of authors who use Microsoft Word for content creation by letting them incorporate semantic knowledge into their content. The add-in should simplify the development and validation of ontologies by making them accessible to a wide audience of authors. By integrating semantic content into the authoring experience, it captures the author’s intent and knowledge at the source and facilitates downstream discoverability.

The goal of the add-in is to assist scientists in writing a manuscript that is easily integrated with existing and pending electronic resources. The major aims of this project are to add semantic information as XML mark-up to the manuscript using ontologies and controlled vocabularies (from the National Center for Biomedical Ontology) and identifiers from major biological databases, and to integrate manuscript content with existing public data repositories.

As part of the publishing workflow and archiving process, the terms added by the add-in, which provide the semantic information, can be extracted from Word files because they are stored as custom XML tags within the content. The semantic knowledge can then be preserved as the document is converted to other formats, such as HTML or the XML format from the National Library of Medicine, which is commonly used for archiving.
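The extraction step described above can be sketched in Python using only the standard library. This is a minimal illustration, not the project's own tooling: the function names are hypothetical, and the sketch assumes the main document part sits at the conventional location word/document.xml inside the docx package.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the "w:" prefix stands for.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def annotations_from_xml(document_xml):
    """Return one dict per w:customXml element in the given document XML.

    Hypothetical helper: collects the element's uri/element attributes,
    its w:attr name/value pairs, and the text of every w:t it contains.
    """
    root = ET.fromstring(document_xml)
    found = []
    for node in root.iter(W + "customXml"):
        props = {}
        properties = node.find(W + "customXmlPr")
        if properties is not None:
            for attr in properties.findall(W + "attr"):
                props[attr.get(W + "name")] = attr.get(W + "val")
        # Text of nested annotations, if any, is included in the outer one.
        text = "".join(t.text or "" for t in node.iter(W + "t"))
        found.append({
            "uri": node.get(W + "uri"),
            "element": node.get(W + "element"),
            "text": text,
            "attrs": props,
        })
    return found


def annotations_from_docx(path_or_file):
    """Read the main document part of a docx package and extract annotations.

    Assumes the part is stored at word/document.xml (the usual location).
    """
    with zipfile.ZipFile(path_or_file) as package:
        return annotations_from_xml(package.read("word/document.xml"))
```

Calling annotations_from_docx with the path to an annotated document would return a list with one entry per tagged term, which a publishing pipeline could then map to its own target format.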

The full benefit of semantic-rich content will result from an end-to-end approach to the preservation of semantics and metadata through the publishing pipeline, starting with capturing knowledge from the subject experts, the authors, and enabling this knowledge to be preserved when published, as well as made available to search engines and presented to people consuming the content.

This project resulted from an initial and ongoing collaboration between Microsoft External Research and Dr. Phil Bourne and Dr. Lynn Fink, at the University of California San Diego. Additional collaboration with the staff from Science Commons aims to make the add-in relevant to a wider audience and also to preserve semantic data along the publishing pipeline.

Audience
This project is aimed at researchers and software developers in domains that use ontologies, as well as publishers, archivists, and early adopters in the scientific, technical, and scholarly publishing fields.

Specific features
Getting Started
  • Trying out the add-in:
    • You will need Microsoft Word 2007 or Microsoft Word 2010 (32-bit or 64-bit) running on Windows XP, Windows Vista or Windows 7.
    • Open the test document from the Releases tab in this page and enable Term Recognition in the Ontology tab within Word.
    • If you have installed a previous version of the add-in, you may need to follow these instructions to achieve a full uninstall.
  • Examining the source code and contributing to the project:
    • Navigate to the Source Code tab
    • You can use the free version of Visual Studio (Visual C# 2008 Express Edition) to build the project
    • Add comments in the Discussion tab, and report problems under Issue Tracker

Design Documentation
Design Documents

Semantic Tagging

When a word or set of words is tagged by the add-in, it is wrapped in custom XML tags that associate it with the ontology term. The example below shows the word "astrocyte" being tagged with the Cell Line ontology.

<w:customXml w:uri="http://biolit.ucsd.edu/biolitschema1" w:element="named-content">
  <w:customXmlPr>
    <w:attr w:name="content-type" w:val="biolit" />
    <w:attr w:name="id" w:val="ncbo_id=40962;term_id=CL:0000127;term=astrocyte;url=http://purl.org/obo/owl/CL#CL_0000127" />
  </w:customXmlPr>
  <w:proofErr w:type="gramStart" />
  <w:smartTag w:uri="BioLitTags" w:element="Term">
    <w:r w:rsidRPr="00FA60F6">
      <w:rPr>
        <w:highlight w:val="yellow" />
      </w:rPr>
      <w:t>astrocyte</w:t>
    </w:r>
  </w:smartTag>
  <w:proofErr w:type="gramEnd" />
</w:customXml>

If the Word file (docx) is to be transformed to other formats, this set of tags needs to be processed using XSLT or other technologies. Other CodePlex projects implement transformations of docx files to other formats and can serve as a starting point.
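As one illustration of the "other technologies" route, the Python sketch below splits the value of the "id" attribute from the example into its fields and renders a tagged term as an HTML span. Both function names are hypothetical, as are the target markup (the "ontology-term" class and the data-* attributes). The key=value;key=value layout is inferred from the single example above and is not a documented format.

```python
from html import escape


def parse_term_id(value):
    """Split an 'id' attribute value such as
    'ncbo_id=40962;term_id=CL:0000127;...' into a dict.

    Assumption: fields are separated by ';' and each field is 'key=value',
    as in the one example shown in this document.
    """
    fields = {}
    for part in value.split(";"):
        key, separator, field_value = part.partition("=")
        if separator:  # skip fragments that contain no '='
            fields[key.strip()] = field_value.strip()
    return fields


def to_html_span(text, fields):
    """Render a tagged term as an HTML span carrying the ontology fields.

    The markup is a hypothetical target, chosen only for illustration.
    """
    data_attributes = "".join(
        ' data-{}="{}"'.format(key.replace("_", "-"), escape(val, quote=True))
        for key, val in sorted(fields.items())
    )
    return '<span class="ontology-term"{}>{}</span>'.format(
        data_attributes, escape(text)
    )


fields = parse_term_id(
    "ncbo_id=40962;term_id=CL:0000127;term=astrocyte;"
    "url=http://purl.org/obo/owl/CL#CL_0000127"
)
print(to_html_span("astrocyte", fields))
```

Keeping the ontology identifier and URL as attributes on the output element is one way to carry the semantic information through a format conversion. An XSLT stylesheet could express the same mapping declaratively.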

Background
Cyberinfrastructure is integral to all aspects of conducting experimental research and distributing its results. However, it has yet to make a similar impact on the way we communicate that information. Peer-reviewed publications have long been the currency of scientific research, as they are the fundamental unit through which scientists communicate with and evaluate each other. Yet, in striking contrast to the data, publications have not benefited from the opportunities offered by cyberinfrastructure. While the means of distributing publications have vastly improved, publishers have done little else to capitalize on the electronic medium. In particular, semantic information describing the content of these publications is sorely lacking, as is the integration of this information with data in public repositories. This is confounding considering that many basic tools for marking up and integrating publication content in this manner already exist, such as a centralized literature database, relevant ontologies, and machine-readable document standards. We propose to address this delay in the maturation of scholarly communication by developing open source tools to facilitate the semantic mark-up of new manuscripts and the submission of those manuscripts directly to a journal’s electronic publishing system.

Last edited Feb 25, 2013 at 11:00 PM by AlexWade, version 28