The overall goal of this research is to develop technology that reads research papers, extracts pieces of causal mechanisms, assembles these pieces into more complete causal models and reasons over those models to produce explanations of cancer. If the assembly of big mechanisms like these can be automated, it could change how science is done, as the case study on cancer literature below illustrates.
Case Study: Automating the creation of software that extracts and explicitly records in databases the experimental methods and results reported in cancer research articles — without a human needing to read the paper.
Imagine you're a cancer researcher. The literature is vast and growing every day, and though current searchable databases may contain the studies that fit your experimental parameters, you still need to find and read the relevant papers and evaluate their findings to see if they would fit your model. What would help you tremendously, and speed up the rate at which cancer research in general is accomplished, would be a program that could not only scan papers for your search terms but also extract the results of each study and clearly communicate them to you.
It seems like science fiction, but it's happening now at Carnegie Mellon.
A team of researchers from CMU, the University of Southern California and Elsevier Research Data Services, led by Language Technologies Institute Research Professor Eduard Hovy, is developing natural language processing methods to locate, extract and classify relevant individual passages of articles' results sections that describe experiments and findings.
The team's work has two parts. First, they developed models of the experimental methods that cancer researchers employ and the data they use to justify their impressions and findings — a deviation from the standard method of just focusing on the interpretations of the findings. Second, they used the typical structure of biological research articles to identify the textual elements that communicate observational and interpretive information. Their goal? To automate the creation of software that extracts and explicitly records in databases the experimental methods and results reported in cancer research articles — without a human needing to read the paper.
Figure: A typical passage from a primary research article describing experimental results, with added annotations marking discourse segment types, internal links to figures and external links to cited references.
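To make the idea of labeling discourse segments concrete, here is a minimal Python sketch. It is not the team's method: it simply tags a sentence as a conclusion, hypothesis or fact using a few hand-written cue phrases, whereas a real system would presumably learn such signals from annotated text. The label names, the cue lists and the function name classify_statement are all assumptions chosen for illustration, and the example sentences are invented.

```python
import re

# Hypothetical cue phrases, for illustration only. Labels are checked in this
# order, so a sentence with both a conclusion cue and a hedging word such as
# "may" is tagged as a conclusion.
CUES = {
    "conclusion": [
        r"\bthese (results|data|findings) (suggest|indicate|show)\b",
        r"\bwe conclude\b",
        r"\btherefore\b",
    ],
    "hypothesis": [
        r"\bmay\b",
        r"\bmight\b",
        r"\bcould\b",
        r"\bwe hypothesi[sz]e\b",
    ],
    "fact": [
        r"\bis known to\b",
        r"\bhas been shown\b",
        r"\bwas (observed|detected)\b",
    ],
}


def classify_statement(sentence: str) -> str:
    """Return the first label whose cue phrase appears in the sentence."""
    text = sentence.lower()
    for label, patterns in CUES.items():
        if any(re.search(pattern, text) for pattern in patterns):
            return label
    return "unclassified"


if __name__ == "__main__":
    # Invented example sentences, not taken from any paper.
    examples = [
        "These results suggest that Ras activates Raf.",
        "Ras might bind Raf directly.",
        "Ras is known to activate Raf.",
        "Cells were grown overnight.",
    ]
    for sentence in examples:
        print(f"{classify_statement(sentence):>12}  {sentence}")
```

Cue phrases alone are brittle, so this only illustrates the input and output of the task: a sentence goes in and a discourse label comes out.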
"We work with biologists who create very complicated models. Even focusing on some very specific procedures within cells, they can’t read the hundreds of possibly relevant articles generated every day. They want the system to say, 'Oh, you're doing this and it involves X and Y? There are two papers published in the past few hours on that.' And this is what we're doing," Hovy said. "We're reading papers about cancer research, identifying the right information. Automatically classifying some statements as being hypotheses, others as facts, others as conclusions. And then telling the people in the labs, 'This is what you should be checking. This is what other people are checking.' It accelerates their research."
Hovy's project is part of a large research program funded by the U.S. Government, in which research projects around the country aim to build automated text-reading software, automated models that simulate cancer-related processes in cells and methods to combine them. The overall goal is to develop technology to read research papers, extract pieces of causal mechanisms, assemble these pieces into more complete causal models and reason over those models to produce explanations of cancer. If the assembly of big mechanisms like these can be automated, it could forever change how science is done, not only for cancer research but for many other fields as well.
For More Information:
Gully A.P.C. Burns, Pradeep Dasigi, Anita de Waard, Eduard H. Hovy; Automated detection of discourse segment and experimental types from the text of cancer pathway results sections. Database (Oxford) 2016; 2016 baw122. doi: 10.1093/database/baw122