Detecting entities such as protein names

Mar 17, 2009 at 4:35 PM
Hi BioLit team,

I installed your plugin and I started experimenting with it, and for the moment it looks nice.

One thing that is not clear to me is whether you intend to do Named Entity Recognition for entities such as the proteins contained in UniProt, or whether you plan to support only the detection of identifiers from those databases.

I am also interested to know what NLP techniques you use to detect terms.

Thank you very much,

Stefano Bocconi
Developer
Mar 17, 2009 at 8:38 PM
Hi Stefano,

Thanks for trying out the add-in! I'm glad you've had a good experience.

Currently, we only support identification of database IDs from the databases listed in the config panel. However, there is no reason not to add more, as long as the IDs are reliably detectable. For example, Ensembl IDs would be very easy, but HUGO gene symbols would be quite difficult. We use regular expressions for ID matching, and there aren't any fancy NLP techniques involved in ontology term recognition. Right now it is just string matching, although it would be desirable to make this more sophisticated in the future. If the community response to this effort is positive, it would be justifiable to develop this aspect further.
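
To illustrate the kind of regular-expression ID matching described above, here is a minimal Python sketch. It is not the add-in's actual code: the pattern and the name find_ensembl_ids are assumptions made for this example. The pattern assumes Ensembl stable IDs of the form "ENS", an optional species prefix, a feature letter, and 11 digits.

```python
import re

# Assumed pattern (not taken from the add-in): "ENS", an optional species
# prefix such as "MUS", a feature letter (G = gene, T = transcript,
# P = protein), then exactly 11 digits.
ENSEMBL_ID = re.compile(r"\bENS[A-Z]{0,4}[GTP]\d{11}\b")


def find_ensembl_ids(text):
    """Return a (start, end, id) tuple for every Ensembl-style ID in text."""
    return [(m.start(), m.end(), m.group()) for m in ENSEMBL_ID.finditer(text)]


sample = "See ENSG00000139618 and ENSMUSG00000041147; the symbol CAT is ambiguous."
for start, end, ident in find_ensembl_ids(sample):
    print(start, end, ident)
```

The rigid shape of these IDs is what makes them easy to detect. A short gene symbol such as CAT has no such shape and looks like an ordinary word, which is why symbols are much harder to match reliably with a pattern alone.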

If you have any suggestions, please feel free to mention them.

Best regards,
Lynn
Mar 18, 2009 at 7:58 PM
Edited Mar 18, 2009 at 8:00 PM
Lynn, how about using WhatIzIt (http://www.ebi.ac.uk/webservices/whatizit/info.jsf) for recognizing the terms and linking them to databases? Adding this to the plug-in should be very straightforward. That way, as the user "tags", links to external resources would be generated on the fly. WhatIzIt can do a lot more as well.
Mar 18, 2009 at 8:57 PM
I think it would be a great idea to plug in to a number of such services, ideally without having to touch any code.

Consider only the case of entity recognition for the moment (not including relationship mining).

Here's a sketch of an idea. Generally, these services are REST-based and take the text as some sort of POST. They reply with some XML format that reports the entities, their positions, and some metadata.

Suppose we can massage the XML that comes back into some relatively simple format, also XML-based, using languages like XSLT or XQuery, or even a scripting language.

The plugin eats that simpler format.

So adding a new one is a matter of a configuration file that specifies the URI of the service, a template for the post, and a script to massage the results into the simpler format.
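
The configuration-driven idea above could be sketched roughly as follows in Python. Everything named here is hypothetical: the EXAMPLE_SERVICE entry, its placeholder URI, the shape of the service reply, and the "annotations" output format are all assumptions for illustration, not an existing service or the plugin's format. The network call itself is left out; the sketch only shows building the POST body from a template and massaging a reply into the simpler format.

```python
import xml.etree.ElementTree as ET
from string import Template
from xml.sax.saxutils import escape

# Hypothetical per-service configuration: the URI, a template for the POST
# body, and the mapping needed to reduce the reply to the simple format.
EXAMPLE_SERVICE = {
    "name": "example-tagger",
    "uri": "https://tagger.example.org/annotate",  # placeholder, not a real endpoint
    "post_template": "<request><text>$text</text></request>",
    "entity_path": ".//entity",
    "attrs": {"start": "offset", "length": "length", "id": "ref"},
}


def build_post_body(config, text):
    """Fill the service's POST template with the XML-escaped document text."""
    return Template(config["post_template"]).substitute(text=escape(text))


def to_simple_format(config, response_xml):
    """Convert a service-specific XML reply into a flat <annotations> element."""
    attrs = config["attrs"]
    root = ET.fromstring(response_xml)
    simple = ET.Element("annotations", source=config["name"])
    for entity in root.findall(config["entity_path"]):
        start = int(entity.get(attrs["start"]))
        end = start + int(entity.get(attrs["length"]))
        ET.SubElement(
            simple,
            "annotation",
            start=str(start),
            end=str(end),
            id=entity.get(attrs["id"]),
        )
    return simple


# A made-up reply in the assumed service format.
reply = '<result><entity offset="4" length="5" ref="DB:0001"/></result>'
simple = to_simple_format(EXAMPLE_SERVICE, reply)
print(ET.tostring(simple, encoding="unicode"))
```

With this shape, supporting another service would mean adding another configuration entry rather than changing code, provided its reply can be described by a path and a few attribute names. Replies that need more massaging would call for the XSLT or scripting step mentioned above.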

I can fill in more details, but that's the idea. 

The issue I see currently is that I don't think the underlying smart tags engine can handle overlapping markup. Given the state of the art of such systems, it is likely that there will be overlapping (incorrect) tagging, and that has to be kept at least long enough to be able to say which annotation is right and which is wrong.

-Alan



Developer
Mar 18, 2009 at 10:14 PM
Hi Alex and Alan,

I like the idea of being able to use web services to query with IDs or terms, but I think it is important that the add-in has full recognition functionality without an internet connection. I can imagine that a lot of people will download their ontologies of interest at the office, get on a plane, and work on a manuscript during the flight (for example).  I also wonder if those queries might be just a little too slow so that the Smart Tags don't appear until the author is several words ahead.

Maybe these could be optional, online-only services on top of the existing functionality, plugged in as Alan mentioned?

Alan, do you think overlapping mark-up is a good idea? It seems to me that there should be only one database for a given ID and one ontology term ID per instance of a term. I was looking through some examples of term collisions between ontologies, and a good example might be “seed maturation.” It appears in the Cereal Plant Trait Ontology and the Gene Ontology (biological process) with subtle, but significant, differences in meaning. I think it would be inappropriate for an author to assign both ontology term IDs to a single instance of this term. Or do you mean overlapping recognition ("seed maturation" and "seed")?

Lynn

Mar 19, 2009 at 10:08 PM
Hi Lynn

On Wed, Mar 18, 2009 at 6:14 PM, [email removed] wrote:
> From: jlfink
>
> Hi Alex and Alan,
>
> I like the idea of being able to use web services to query with IDs or
> terms, but I think it is important that the add-in has full recognition
> functionality without an internet connection. I can imagine that a lot of
> people will download their ontologies of interest at the office, get on a
> plane, and work on a manuscript during the flight (for example).

Agreed.

> I also
> wonder if those queries might be just a little too slow so that the Smart
> Tags don't appear until the author is several words ahead.

I imagined that this was done on demand, rather than as words are
typed. I don't think that it is essential that tagging happen in real
time, however nice that is. It's more important, IMO, to get good
recall and to get identifiers for all the found terms so that they can
be used for linking.

> Maybe these could be optional, online-only services on top of the existing
> functionality, plugged-in as Alan mentioned?

As Alan, I fully agree. I'd like to see this thing flexibly extendable
and configurable.

> Alan, do you think overlapping mark-up is a good idea? It seems to me that
> there should be only one database for a given ID and one ontology term ID
> per instance of term. I was looking through some examples of term collisions
> between ontologies and a good example might be  “seed maturation.” It
> appears in the Cereal Plant Trait Ontology and the Gene Ontology (biological
> process)  with subtle, but significant, differences in meaning. I think it
> would be inappropriate for an author to assign both ontology term IDs to a
> single instance of this term. Or do you mean overlapping recognition ("seed
> maturation" and "seed")?

I mean both, and I don't necessarily mean that both are retained. But
in the case you mention, it would be good to have the interface expose
that there is a potential incompatibility (or if they are the same,
arrange to send a note to the two teams suggesting that they be
merged). The issue you bring up is quite general. Even if there is one
term that matches, does it mean the same thing you do? In order to
encourage checking this it would be good to think about how to make it
very easy to inspect them. For instance, definitions on mouseover, or
a mode in which the side bubbles used for track changes are used to
display the definitions of the terms that are tagged, with an easy
click to accept, reject, mark as "close but not quite right", or to
send a note to the developer of the ontology.

In a situation in which there are multiple ontologies and services
that are looking at your text, there are definitely going to be
overlaps of both the sorts you mention. We need to have some interface
in which they are preserved until the point at which they can be
reviewed.
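
One way to keep both kinds of overlap around until review is to group
annotations by overlapping character span instead of discarding any of
them. The Python sketch below is only an illustration under stated
assumptions: the Annotation fields, the function names, and the term IDs
("A:0001" and so on) are hypothetical and do not come from any real
ontology or from the add-in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int    # offset of the first character
    end: int      # offset one past the last character
    term_id: str  # hypothetical identifier
    source: str   # ontology or service that produced the annotation


def group_overlaps(annotations):
    """Group annotations whose spans overlap, keeping every candidate."""
    groups = []
    group_end = -1
    for ann in sorted(annotations, key=lambda a: (a.start, a.end)):
        if groups and ann.start < group_end:
            # Overlaps the current group: keep it alongside the others.
            groups[-1].append(ann)
            group_end = max(group_end, ann.end)
        else:
            groups.append([ann])
            group_end = ann.end
    return groups


def needs_review(group):
    """A group with more than one candidate should be shown to the author."""
    return len(group) > 1


text = "seed maturation in rice"
candidates = [
    Annotation(0, 15, "A:0001", "ontology-a"),  # "seed maturation"
    Annotation(0, 15, "B:0042", "ontology-b"),  # same span, other ontology
    Annotation(0, 4, "A:0007", "ontology-a"),   # nested term "seed"
    Annotation(19, 23, "B:0099", "ontology-b"), # "rice", no overlap
]
for group in group_overlaps(candidates):
    span = text[group[0].start:max(a.end for a in group)]
    print(repr(span), [a.term_id for a in group], needs_review(group))
```

Each group that needs review could then be presented with the term
definitions side by side, so the author can accept one candidate and
reject the rest.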

-Alan
