Tuesday 20 October 2009

09:00 - 09:30

E-Discovery: A Challenge for Search

Corporations increasingly use and retain information only in the form of electronically held data and documents. As a result, the production and sharing of information in legal proceedings will depend heavily on techniques for accessing, searching, organizing and analyzing electronic data -- the principal focus of E-Discovery. Large corporations may have terabytes of e-mail and other files spanning many years that are potentially relevant to a case. In response to a court order, an E-Discovery team must identify, assemble, individuate and categorize an organization's files, segregate all "privileged" material (which may legally be withheld), and deliver a minimal yet comprehensive and exhaustive set of data to the opposing party -- all in a relatively short time. The techniques needed to accomplish such a task necessarily include search, clustering, classification, filtering, social network analysis, extraction and more -- and no one of these alone is sufficient. Such requirements challenge our traditional models of search; in particular, the appropriate user models match neither the standard "web" nor the standard "enterprise" conditions. This presentation explicates the requirements and the types of solutions that dominate E-Discovery.
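A minimal sketch, not taken from the talk, of the kind of triage pipeline the abstract describes: deduplicate files, segregate potentially privileged material, and keep only documents responsive to the case. The function name and both keyword lists are illustrative assumptions; real E-Discovery systems combine far richer search, clustering and classification stages.

    import hashlib

    # Illustrative marker lists -- assumptions, not a real privilege test.
    PRIVILEGE_MARKERS = {"attorney-client", "work product", "legal counsel"}
    CASE_TERMS = {"contract", "merger", "licence"}

    def triage(documents):
        """documents: iterable of (doc_id, text). Returns three id lists."""
        seen_hashes = set()
        privileged, responsive, irrelevant = [], [], []
        for doc_id, text in documents:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen_hashes:       # exact duplicate: skip (individuation)
                continue
            seen_hashes.add(digest)
            lowered = text.lower()
            if any(marker in lowered for marker in PRIVILEGE_MARKERS):
                privileged.append(doc_id)   # withheld, but logged for review
            elif any(term in lowered for term in CASE_TERMS):
                responsive.append(doc_id)   # candidate for production
            else:
                irrelevant.append(doc_id)
        return privileged, responsive, irrelevant

Even this toy version shows why no single technique suffices: duplicate detection, privilege screening and relevance filtering are separate problems, and each keyword test here would in practice be a trained classifier or search component.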

09:30 - 10:00

Development of an Intranet in a Worldwide Operating Company – a Case Study

The case study shows the development of an Intranet in a global environment over the last nine years, how this worked in practice, and what changes and challenges will have to be faced in the near and medium term. It offers real-life experiences, what was learned from them, and how to adapt to changing circumstances. A report is given on the new project to evolve the existing Intranet into an Enterprise Information Portal; the information management aspect of the project is presented as one of the major objectives alongside the technical change management.

10:30 - 11:00

Too Many Choices: How Information Departments Evaluate and Choose New Information Tools

A modern information department is offered a rich selection of tools for searching, mining, analysing and displaying data. And every year sees more "miracle" tools coming into the marketplace. But how does a user company evaluate and select from all the offerings? The Scientific Information Services Director of a major pharmaceutical company reflects on the challenge and gives some guidance.

11:30 - 12:00

Building a Market for Intellectual Property

The IP market has been standing still for a long time, held back by several bottlenecks. One of the most critical was the lack of powerful yet user-friendly tools for accurately determining the value of a patent or of a portfolio of patents. Another was the lack of powerful tools permitting the easy and efficient analysis of the 33 million existing patents. Things have changed dramatically during the past three or four years: several new smart tools have been developed (and are being evaluated) to speed up the technology transfer process and facilitate the monetisation of IP.

The spread of these new tools will rapidly induce a dramatic change in how public research organisations and companies manage their IP. Their future impact will be discussed and recommendations will be provided to prepare for the forthcoming IP "big bang".

This presentation explains how Thomson Analyst, Patent Cafe, Ocean Tomo and other new tools can be used to analyse intellectual property.

12:00 - 12:30

Progress in Automated Chemical Structure Recognition in Text and Images

Text mining in chemistry and drug discovery relies heavily on the automated extraction of chemical compounds and pharmaceutical substance names from text and images. This presentation describes a hybrid approach combining information science, cheminformatics, computational linguistics and pattern recognition techniques. Various text mining applications developed recently promise researchers comprehensive access to knowledge; in many cases, however, the quality of the extracted chemical content, in terms of precision and recall, is questionable. Bad image quality, ambiguous notation or incorrect names can all be sources of errors and wrong results. Strict chemical validation and verification of the extracted information is therefore of utmost importance for achieving reliable and consistent results. The approach presented here combines specialised software tools for graphical structure recognition, chemical named entity extraction and name-to-structure conversion. Combination with established verification and checking tools for automatic chemical validation ensures high quality in the generated content.
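A minimal sketch of the "strict chemical validation" step, written here with the open-source RDKit toolkit as an assumption for illustration; the talk's actual pipeline uses its own recognition, name-to-structure and verification tools. The idea is simply that a candidate structure which fails to parse, sanitise and round-trip cleanly is rejected.

    from rdkit import Chem

    def validate_extracted(smiles_candidates):
        """Keep only candidates that parse and canonicalise cleanly."""
        valid = []
        for smi in smiles_candidates:
            mol = Chem.MolFromSmiles(smi)   # returns None on parse/valence errors
            if mol is None:
                continue                    # reject: chemically inconsistent
            valid.append(Chem.MolToSmiles(mol))   # canonical form for later deduplication
        return valid

    # validate_extracted(["c1ccccc1", "C1=CC=CC=C1", "not_a_structure"])
    # -> ["c1ccccc1", "c1ccccc1"]  (the invalid string is dropped; both
    #    valid notations of benzene canonicalise to the same form)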

14:30 - 15:00

When Does a Search in Full-Text and Chemical Databases Fail?

Searching has improved tremendously during the last decade -- searching substructures and full text no longer forms a barrier. What is mostly overlooked is that retrieval of image-based information in patents and in the scientific literature remains an unsolved problem. Not only do searchers have to translate roughly sketched ideas from more or less precisely drafted drawings into a search question, they also have to cope with the inadequate capabilities of databases for finding the appropriate information. It is not always clear what information can be retrieved and when searching fails. Looking at different types of image information, the presentation examines the challenges searchers face in finding image information, and what types of databases searchers still demand.

15:30 - 16:00

Visualisation of Statistical and Text Mining Results from Large Document Collections

Text analytics, which combines text mining and text visualisation, is increasingly being applied to large patent and non-patent collections. The visualisation of clustering and classification results for these large document sets gains much when integrated with the visualisation of statistics of the document data -- statistics derived from the text itself, but also from metadata related to the text, giving insight into the patterns behind, for instance, the inventors, company and year of a patent. The visualisation of statistical data can also provide insight into trends in statistical patent valuation and use.

This presentation describes research work on patent and non-patent data using advanced classification and clustering and multiple visualisation techniques. The technical principles and the business case of some applications of text mining and visualisation are presented and discussed.
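A minimal sketch, assumed rather than taken from the speakers' system, of combining text clustering with metadata statistics: patent abstracts are clustered with TF-IDF and k-means, then cluster membership is cross-tabulated against filing year, so a cluster map can be overlaid with per-year trends.

    from collections import Counter
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    def cluster_with_year_stats(abstracts, years, k=5):
        """abstracts: list of str; years: list of int (same length)."""
        vectoriser = TfidfVectorizer(max_features=5000, stop_words="english")
        vectors = vectoriser.fit_transform(abstracts)
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
        # Metadata statistics per cluster: document counts per filing year.
        trend = Counter(zip(labels, years))
        return labels, trend    # trend[(cluster, year)] -> document count

The returned counts are exactly the kind of per-cluster, per-year statistics that can then be fed to any visualisation layer alongside the cluster view itself.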

16:30 - 17:00

Markush Structures: From Molecules Towards Patents

Cheminformatics systems usually focus primarily on handling specific molecules and reactions. However, Markush structures are also indispensable in various areas, such as combinatorial library design and chemical patent applications, for the description of compound classes.

The presentation discusses how an existing molecule drawing tool (Marvin) and chemical database engine (JChem Base/Cartridge) are extended to handle generic features (R-group definitions, atom and bond lists, link nodes and larger repeating units, position and homology variation). Markush structures can be drawn and visualised in the Marvin sketcher and viewer, registered in JChem databases and their library space is searchable without the enumeration of library members. Different enumeration methods allow the analysis of Markush structures and their enumerated libraries. These methods include full, partial and random enumerations as well as calculation of the library size. Furthermore, unique visualisation techniques will be demonstrated on real-life examples that illustrate the relationship between Markush structures and the chemical structures contained in their libraries (involving substructures and enumerated structures).
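A toy sketch of the enumeration ideas mentioned above. JChem operates on real chemical graphs; here a scaffold is just a SMILES template with {R1}, {R2} placeholders, which is enough to illustrate full versus random enumeration and the library-size calculation (the product of the R-group set sizes). The scaffold and substituents below are illustrative assumptions.

    import itertools
    import math
    import random

    scaffold = "c1ccc({R1})cc1{R2}"                       # disubstituted benzene
    r_groups = {"R1": ["F", "Cl", "Br"], "R2": ["C", "CC", "OC"]}

    def library_size(r_groups):
        # One member per combination of R-group choices: 3 * 3 = 9 here.
        return math.prod(len(v) for v in r_groups.values())

    def full_enumeration(scaffold, r_groups):
        keys = list(r_groups)
        for combo in itertools.product(*(r_groups[k] for k in keys)):
            yield scaffold.format(**dict(zip(keys, combo)))

    def random_member(scaffold, r_groups, rng=random):
        # Random enumeration: sample one library member without building all.
        return scaffold.format(**{k: rng.choice(v) for k, v in r_groups.items()})

The size calculation shows why searching the library space without enumeration matters: with realistic R-group lists the product quickly exceeds anything that can be enumerated in full.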

17:00 - 17:30

Chemical Depictions – The Grand Challenge in Patents

In the last few years the field of chemical image mining has been on the rise again. Several projects are under way to develop structure reconstruction tools, after the first approaches from the mid-eighties (e.g. Kekulé, CLiDE and OROCS) came to a halt. Some of them were marketed, but none is widely used today. The most recent developments are chemReader, OSRA, CLiDE Pro and chemoCR. The main question at hand is: what has changed over the years? Can this old problem now be tackled thanks to new techniques in software engineering and new advances in computer-based pattern recognition, or is it simply the sheer compute power to which we now have access?

We give some insights on the results of the patent grand challenge. We have performed a large scale experiment using our in-house reconstruction tool chemoCR on a benchmark set of more than 100,000 European patent documents (scanned PDF) selected from the IPC classes A61 and C07. All documents have been processed automatically to search for and convert chemical depictions back into connection tables making use of the large resources of the Super Computing Center in Jülich. We present the lessons learned: what kinds of images are in patents, what kind of chemical depictions have we identified, is such a project computationally feasible, what problems have we encountered, is substructure search a benefit.