Tuesday 24 October 2009
Knowledge management in specialty chemicals R&D: building the library of the future
Since 2003, the Cytec Information Center has undergone a radical transformation. From hiring new staff to launching a virtual library to integrating two information centres in North America and Europe, the Cytec CIC enables learning, idea exchange and innovation. As its mission, the CIC partners with Cytec Specialty Chemicals R&D to leverage appropriate technology in order to search, archive and disseminate internal and external information in a cost-effective, user-friendly manner. To achieve its mission, the Cytec CIC has designed and implemented a simple web portal for instant "one-stop" global access to technical information. Primary resources for external information include ACS, MicroPatent, Knovel, Elsevier ScienceDirect, Wiley Interscience, Teltech and SRI Consulting, while a web-based document management system is used for retrieving important internal information. In addition, the Cytec CIC has become a hub for cross-functional R&D activity by hosting scientific discussion forums, seminars and weekly poster sessions. This presentation will highlight experiences encountered during a Knowledge Management initiative including identifying system requirements, process design and implementation, organizational changes and lessons learned.
The role of paper – is this the end?
The introduction of an electronic laboratory notebook (ELN) brings to an end a documentation process that has been in place for centuries. It comes with the promise of enormous benefits, but also with a hefty price tag and a certain amount of concern about long-term data preservation and the legal (patent) admissibility of electronic evidence. In addition, it represents a cultural change that may not be welcomed by some scientists. Nevertheless, it is an important step towards engaging in the knowledge economy, although it is only a step. A successful implementation of an ELN will address the limitations of paper when it comes to collaboration, and it will provide a platform for an organisation to build and fully exploit its knowledge base.
Integrating information management with worldwide licensing: Merck's partnering transformation
The presentation will trace the evolution of external licensing activity in the pharmaceutical industry, with a particular focus on Merck. The licensing process at Merck along with case studies of best practices will be presented. The importance of a broad-based information-gathering network will be highlighted.
The virtualisation of pharmaceutical and biotechnology R&D is a reality. Alliances and collaborations, in-licensing and out-licensing of targets, compounds, or technologies generate a large portion of new product revenue for many companies. Most organisations lack a robust business framework to be an attractive alliance partner with a nimble and rigorous decision process for opportunity evaluation and alliance management. The presentation looks at the alliance process in a holistic way and at some of the critical decision gates.
Desktop text mining for life sciences
Hardware manufacturers, intelligence agencies for national security, credit risk assessment firms, pharmaceutical companies, banking institutions or auto-parts suppliers find text mining an excellent aid. Nevertheless, text mining has not yet become a mainstream technology because of two limiting factors:
- Extensive customisation is often required to meet the demands of an organisation. It entails the investigation as to what is actually informative and what words and sentences are significant in order to tune the system accordingly.
- Considerable skills are required to exploit most text mining systems in full. They employ rules to identify patterns in language that convey a certain meaning -- rules that often require a non-negligible previous knowledge of computational linguistics.
bioalma's almaKnowledgeServer (AKS) approach is to bring text mining to biomedical scientists' desktops. Researchers do not need to become text mining experts to benefit from the advantages of text mining. The AKS approach relies on an offline process through which all the fundamental biomedical concepts (genes, proteins, small molecules and atoms, drugs, symptoms and biomedical terms and processes) are detected and tagged. This provides the foundation for a knowledge base that can be analysed later in a “transparent” fashion that does not require any specialist expertise from the user.
Results from real-life case studies will be presented, showing that they are not only knowledge-enriched compared to those obtained with searches in conventional scientific text databases plus manual curation, but also obtained in much shorter time. Most importantly, they can be achieved straightforwardly by anyone dealing with bibliographic databases, without specific training.
Searching large e-mail collections: the next challenge
Archiving and searching large collections of electronic mail is becoming an increasingly important process as the daily flow of e-mail messages is growing and companies are being held accountable for all information that they communicate. These days, auditors, compliance officers, customer service, and knowledge workers all need to search large collections of e-mail or instant messages to be able to do their job.
From a search engine perspective, searching in these large e-mail collections is a challenge:
- The collections can be extremely large. A terabyte is a common start.
- There are also a lot of repetitive data and repetitive wording, potentially confusing relevance ranking and other advanced algorithms.
- The language and wording used are often incomplete, sloppy or digital slang. Misspellings and typos are the rule rather than the exception.
- Proper formatting of the documents is often completely missing.
- Many collections are multi-lingual.
- Attachments can be in any form, from normal PDF or MS-Word files, to potentially unsearchable (encrypted) ZIPs and bitmap files.
- Often duplicate copies of the same e-mail are present in one collection.
Auditors or compliance officers require real-time full-text indexing of terabytes of data, forcing ordinary companies to process as much data as intelligence agencies did at the end of the nineties. Parallel processing seems to be the only solution. On top of all this, a typical auditing or compliance application requires 100% recall. In other words, these professional investigators want to find and at least review every possibly relevant e-mail. However, as recall goes up, the precision of these systems goes down. As a result, analysing and organising all possibly relevant information is a task that can take months and sometimes longer, thus seriously delaying these processes.
In the past, the unique document properties of large paper collections or the internet helped develop and fine-tune search techniques and relevance ranking algorithms, such as fuzzy search to overcome scanning errors, hit-highlighting and hit navigation on the original image or, for instance, Google's page ranking algorithm. The same development philosophy can be applied to this new problem: the searching and analysing of terabytes of e-mail. By understanding the features of the collection and the user requirements, unique search tools can be developed for searching e-mail.
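The tension between recall and precision can be made concrete with a small sketch. The message identifiers and the two result sets below are invented purely for illustration and do not come from any real collection.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection in which four messages are truly relevant.
relevant = {"m01", "m02", "m03", "m04"}

# A narrow query returns few messages, most of them relevant.
narrow = {"m01", "m02", "m09"}
# A broad query returns every relevant message plus many irrelevant ones.
broad = {f"m{i:02d}" for i in range(1, 17)}

print(precision_recall(narrow, relevant))
print(precision_recall(broad, relevant))
```

In this toy case the broad query reaches full recall, but only a quarter of what it returns is relevant, which is the review burden described above.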
This presentation gives practical examples and real-life cases, explaining search-related concepts specifically designed for e-mail in order to address problems such as overcoming repetitive data, ignoring double messages, indexing all types of attachments including graphical ones, relevance feedback, categorisation, classification tools and integrated data visualisation.
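As an illustration of ignoring duplicate messages, the following minimal sketch collapses copies of the same e-mail by hashing a normalised form of its subject and body. The field names, the normalisation rules and the sample mailbox are assumptions made for this example; they are not taken from the presentation.

```python
import hashlib
import re


def message_key(subject, body):
    """Hash a normalised form of a message so that copies share one key."""
    # Assumption: differences in case and whitespace are not significant.
    text = f"{subject}\n{body}".lower()
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drop_duplicates(messages):
    """Keep the first copy of each message, preserving the original order."""
    seen = set()
    unique = []
    for msg in messages:
        key = message_key(msg["subject"], msg["body"])
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique


# Hypothetical mailbox: messages 1 and 2 differ only in case and spacing.
mailbox = [
    {"id": 1, "subject": "Q3 audit", "body": "Please review the attached."},
    {"id": 2, "subject": "Q3 Audit", "body": "Please  review the attached.\n"},
    {"id": 3, "subject": "Lunch", "body": "Noon?"},
]
print([m["id"] for m in drop_duplicates(mailbox)])  # [1, 3]
```

Exact hashing only catches copies that are identical after normalisation; forwarded or quoted variants would need a fuzzier comparison.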
Improving search beyond relevancy
This presentation begins by discussing the challenge of information overload faced by organisations. Search technology for ranking results is now mature. Returning relevant results is no longer the issue, since today's technology delivers millions of relevant results in milliseconds. The more significant challenge in organisations is enabling users to process meaningfully the large amount of relevant information available. The simple, one-dimensional list format of presenting results is inefficient for gaining knowledge and insights from search results. It causes information to be overlooked, as users are not able to navigate the hundreds or thousands of results in an effective manner and either simply ignore the deeper results or go through pages and pages of irrelevant results to find that nugget of information they are looking for.
The presentation then discusses clustering technology as a solution for the information overload problem. Clustering organises search results into clusters or folders, based upon the similarity between them. Going beyond simply listing results, it organises the results so that all results related to the same concept are grouped together. This gives users a quick overview of the main themes in the results and lets them focus on the areas of interest only - without having to go through irrelevant results.
Clustering is done completely on the fly, without requiring ANY pre-processing of the content being searched. There is no investment of time, money and labour in defining, implementing and managing taxonomies; the solution can be up and running in hours. Clustering changes the economics of offering organised search results.
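A minimal sketch of on-the-fly clustering, assuming a very simple similarity measure, might look as follows. It groups result titles by word overlap (Jaccard similarity) in a single greedy pass; the threshold, the tokenisation and the sample titles are illustrative assumptions, and commercial clustering engines are likely to use more sophisticated methods.

```python
def tokens(text):
    """Lower-cased word set, ignoring very short words."""
    return {w for w in text.lower().split() if len(w) > 3}


def jaccard(a, b):
    """Overlap between two token sets, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0


def cluster(results, threshold=0.2):
    """Put each result into the first cluster whose seed it resembles."""
    clusters = []  # list of (seed_tokens, members)
    for text in results:
        toks = tokens(text)
        for seed, members in clusters:
            if jaccard(seed, toks) >= threshold:
                members.append(text)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append((toks, [text]))
    return [members for _, members in clusters]


# Hypothetical search results.
results = [
    "polymer coating adhesion study",
    "adhesion testing of polymer coating",
    "patent filing deadline reminder",
    "patent filing fees reminder",
]
for group in cluster(results):
    print(group)
```

No taxonomy or pre-processing is involved: the groups are derived only from the results returned for the current query.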
Beyond search: Recognition of chemical entities in scientific literature
Until recently, text mining driven information extraction in the Life Sciences has very much focused on mining biological data. The recognition of genes and proteins and their respective protein-protein interactions are prominent examples, having a strong impact on drug discovery research. Furthermore, introducing preferred names and object identifiers for recognised genes and proteins is an important knowledge management element, supporting unified access to heterogeneous data. Such elements can thus be used for cross-referencing scientific journals and public databases.
As a major part of the scientific literature focuses on the analysis and exploration of drug candidates, a natural extension of the current approach is the incorporation of small chemical entities into the information analysis process. Based on the identification of chemical entities, a set of highly relevant data such as SAR (structure-activity relationship) data is becoming accessible from the literature.
Here we present a new entity recognition component for the identification of chemical entities, co-developed by TEMIS and Elsevier MDL. Besides excellent tagging quality, this component dynamically provides chemical structures for the identified names. In addition, a registration string, i.e., a unique fingerprint based on the structure, is calculated, which allows ad-hoc de-duplication based on the structure.
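The idea of de-duplicating names through a structure-based key can be sketched as below. This is not the TEMIS and Elsevier MDL component: the lookup table stands in for real name-to-structure conversion, the structures are written as SMILES strings assumed to be already canonical, and hashing them is merely one hypothetical way to obtain a registration string.

```python
import hashlib

# Hypothetical name-to-structure table (SMILES); a real system would
# generate the structure from the name instead of looking it up.
NAME_TO_STRUCTURE = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "acetylsalicylic acid": "CC(=O)Oc1ccccc1C(=O)O",
    "ethanol": "CCO",
    "ethyl alcohol": "CCO",
}


def registration_string(structure):
    """Derive a short, stable key from a canonical structure string."""
    return hashlib.sha1(structure.encode("ascii")).hexdigest()[:12]


def deduplicate(names):
    """Group recognised names that resolve to the same structure."""
    groups = {}
    for name in names:
        structure = NAME_TO_STRUCTURE.get(name.lower())
        if structure is None:
            continue  # name not present in this toy table
        groups.setdefault(registration_string(structure), []).append(name)
    return groups


found = ["Aspirin", "ethanol", "acetylsalicylic acid", "Ethyl alcohol"]
for key, synonyms in deduplicate(found).items():
    print(key, synonyms)
```

Because the key is derived from the structure rather than the name, synonyms such as a trivial name and a systematic name fall into the same group.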
Federated search is most often seen as a productivity tool, in that it can reduce the amount of time needed for searching disparate resources. However, without considerable refinements, it can be a blunt instrument. Pharmas, and other similar organizations, need to know that a federated search produces information which is of high quality and which has a known provenance. They also need to ensure that the results are as all-inclusive as possible.
While a proper federated search can comb files and retrieve all that seems to be available, quality and provenance are often lacking. A federated search tool needs to include elements of taxonomy, authority control and semantic mapping, as well as allowing post-search processing (removing duplicates, filtering, further sorting or ranking, for example) in order to accomplish the necessary goals. This means suiting the search to the site, and processing the results for both accuracy and quality.
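The post-search processing step can be illustrated with a minimal sketch that merges hits from several sources, removes duplicates, records provenance and ranks trusted sources first. The source names, score fields and title-matching rule are assumptions made for this example only.

```python
def normalise_title(title):
    """Crude matching key: lower-case, alphanumeric characters only."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def merge_results(result_sets, trusted_sources):
    """Merge per-source hits, drop duplicates, rank trusted sources first."""
    merged = {}
    for source, hits in result_sets.items():
        for hit in hits:
            key = normalise_title(hit["title"])
            record = merged.setdefault(
                key, {"title": hit["title"], "score": 0.0, "sources": []}
            )
            record["score"] = max(record["score"], hit["score"])
            record["sources"].append(source)  # keep provenance

    def rank(record):
        trusted = any(s in trusted_sources for s in record["sources"])
        return (not trusted, -record["score"])

    return sorted(merged.values(), key=rank)


# Hypothetical results from two sources.
result_sets = {
    "journal_db": [{"title": "Kinase Inhibitor Review", "score": 0.8}],
    "web": [
        {"title": "Kinase inhibitor review.", "score": 0.9},
        {"title": "Forum thread on kinases", "score": 0.95},
    ],
}
for record in merge_results(result_sets, trusted_sources={"journal_db"}):
    print(record["title"], record["sources"])
```

Here the forum thread has the highest raw score, yet it is ranked below the review because none of its sources is in the trusted set.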
In considering quality of results, various methods can be employed, such as using secondary searching to verify the 'authenticity' of the site (checking D&B for example) and retrieving peer reviews and similar quality documents. Further analyses, such as properly displayed clustering, can also aid in quality assurance, as this process could allow users to see anomalous results and either include or discard them. Another area to be considered is the challenge of federated searching of chemical structures.
This presentation examines the problems and discusses a series of tools, methods and solutions to aid in resolving the issues.
Identification of chemical structures in literature sources using semantic analysis and automatic structure generation
There are many sources of scientific information in common use today, which present a challenge to the information specialist or research chemist, who is often interested in finding data related to a particular chemical structure but finds it difficult to retrieve all related documents. A key barrier is that chemical structure information within those documents may exist only as a chemical name (IUPAC, trivial name, trade name, etc.), rather than in any structure-searchable form. Extracting chemical names from within these document sources is possible by using modern text extraction tools with semantic and contextual analysis of the source documents to identify candidate chemical names. These identified chemical names are used to generate chemical structures automatically, which are themselves used as index terms into the original source documents. Thus an apparently chemically-barren set of information sources can be transformed into a chemically-enriched source of information to drive future discovery.
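The final step, using structures as index terms into the source documents, can be sketched as follows. Plain dictionary lookup replaces the semantic and contextual analysis and the automatic structure generation described above; the lexicon, the SMILES strings and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical lexicon mapping chemical names to SMILES structures.
LEXICON = {
    "acetone": "CC(C)=O",
    "propanone": "CC(C)=O",
    "benzene": "c1ccccc1",
    "acetic acid": "CC(=O)O",
}


def find_names(text):
    """Return lexicon names that occur in the text as whole phrases."""
    lowered = text.lower()
    return [
        name for name in LEXICON
        if re.search(rf"\b{re.escape(name)}\b", lowered)
    ]


def build_structure_index(documents):
    """Map each structure to the ids of the documents that mention it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for name in find_names(text):
            index[LEXICON[name]].add(doc_id)
    return index


documents = {
    "doc1": "The residue was washed with acetone and dried.",
    "doc2": "Propanone was used as the solvent.",
    "doc3": "Benzene was avoided for safety reasons.",
}
index = build_structure_index(documents)
print(sorted(index["CC(C)=O"]))  # ['doc1', 'doc2']
```

A search on the structure retrieves both documents even though they use different names for the same compound.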