Natural language research

My professional experience with natural language extends back to my undergraduate years at MIT. We all had to do an undergraduate thesis. Based on my interest in music and my readings in books like Colin Cherry's On Human Communication, I wanted to do my thesis on digital music synthesis, to go beyond Ercolino Ferretti's (RIP 2006) work at MIT on analog music synthesis. But I was dissuaded and told that speech synthesis was more sensible, so I worked on that instead, creating what may well have been the first purely digital speech synthesis program ever written. (I am still working on the history of that field.)

Soon afterwards I went back into Physics, in the MIT PhD program. At that point, Noam Chomsky's first students arrived. I hung out with them, because my interest in linguistics had been piqued by my speech work. I roomed with David Perlmutter (now retired from the Linguistics faculty at UCSD). After that I kept in touch with my linguistics friends and followed linguistics as a hobby throughout the Physics and Biology careers that followed.

The breakthrough came in the summer of 1981, as I have related elsewhere, when I decided to focus on understanding and extracting knowledge from the biological literature. By 1982, I had published my first paper on the topic and had developed an entry system for structured text (with a student, R. Reinke). A good deal transpired after that, a high point being the research funded by the large NSF grant that founded our Biological Knowledge Laboratory (BKL). The research is chronicled in our various publications. Highlights then and since have included an MS thesis on parsing, Andrea Grimes' development of a pattern viewer, and, currently, the ongoing development of a radically new computational infrastructure for natural language processing, NLP New Generation (NLP-NG), initially with the assistance of Jeff Satterley, a PhD student in the BKL. The project uses the 300-million-word corpus of the BioMed Central journals (in XML format), Apache Derby, our 12GB Xserve, Java, Eclipse, Subversion, Hadoop, HBase, and more. The linguistic basis of the work is construction grammar. Our first major results and publications began to appear in 2009.

In 2001, in response to discussions at PSB 2000 (the Pacific Symposium on Biocomputing), I launched the site BioNLP.org. It initially served as an information resource, but too much effort was required to keep it up to date. It now serves as an important forum through its mailing list, with over 600 members from around the world. The mailing list archives can be searched via Google, on this page.

NEWS - January 2010

In November 2009, we presented an important paper at the workshop NLP Approaches for Unmet Information Needs in Health Care, held at the IEEE International Conference on Bioinformatics and Biomedicine 2009 in Washington, DC. In early 2010 we are submitting an extended version for journal publication. The abstract is below, and the PDF is here.

NLP-NG - A New NLP System for Biomedical Text Analysis

Robert P. Futrelle, Jeff Satterley, Tim McCormack
Biological Knowledge Laboratory
College of Computer and Information Science
Northeastern University, Boston, MA 02115
{futrelle, jsatt, timmc}@ccs.neu.edu

Abstract

NLP-NG is a new NLP system consisting of three components: NG-CORE (language processing), NG-DB (database management), and NG-SEE (interactive visualization and entry). The ultimate goal of NLP-NG is to produce information retrieval systems in which users can choose full-text schemas, adding specific items to focus their queries. Schemas are created by a normalization process which elides adjunctive constructions as well as replacing items by prototypes. Biomedical text contains domain-specific constructions which are revealed by normalization. NLP-NG is based on Construction Grammar. Computationally, all representations are integer-based, allowing efficient storage, indexing, and retrieval. NG-SEE, an Ajax web browser client, allows developers, linguists, and users to view a corpus and modify its properties. NLP-NG uses a 300 million word BioMed Central corpus. NLP-NG does not focus on specific strategies to extract limited classes of information from papers. Instead, it is a universal approach that can codify a wide variety of text in papers.
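
To give a feel for what "integer-based" representations can look like, here is a minimal Java sketch. It is not NLP-NG code: the class IntegerCodedCorpus and all of its methods are hypothetical names invented for this illustration. It assumes only that each distinct token is assigned a small integer ID, that the text is stored as a sequence of those IDs, and that an index maps each ID to the positions where it occurs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not part of the actual NLP-NG system. */
class IntegerCodedCorpus {
    // Token string -> integer ID, assigned in order of first appearance.
    private final Map<String, Integer> idOf = new HashMap<>();
    // Integer ID -> token string (the reverse mapping).
    private final List<String> tokenOf = new ArrayList<>();
    // The text itself, stored as a sequence of integer IDs.
    private final List<Integer> text = new ArrayList<>();
    // Inverted index: integer ID -> positions in the text where it occurs.
    private final Map<Integer, List<Integer>> positions = new HashMap<>();

    /** Returns the ID for a token, assigning a new one if it is unseen. */
    int intern(String token) {
        Integer id = idOf.get(token);
        if (id == null) {
            id = tokenOf.size();
            idOf.put(token, id);
            tokenOf.add(token);
        }
        return id;
    }

    /** Appends a token to the text and records its position in the index. */
    void append(String token) {
        int id = intern(token);
        positions.computeIfAbsent(id, k -> new ArrayList<>()).add(text.size());
        text.add(id);
    }

    /** Looks up every position of a token; empty if the token is unknown. */
    List<Integer> positionsOf(String token) {
        Integer id = idOf.get(token);
        if (id == null) {
            return Collections.emptyList();
        }
        return positions.getOrDefault(id, Collections.emptyList());
    }

    /** Recovers the original token string at a given text position. */
    String tokenAt(int position) {
        return tokenOf.get(text.get(position));
    }

    public static void main(String[] args) {
        IntegerCodedCorpus corpus = new IntegerCodedCorpus();
        for (String t : "the protein binds the receptor".split(" ")) {
            corpus.append(t);
        }
        System.out.println(corpus.positionsOf("the"));
    }
}
```

Once text is held this way, comparisons and lookups work on integers rather than strings, which is the kind of efficiency in storage, indexing, and retrieval that the abstract refers to. How NLP-NG actually assigns and stores its integers is not described here.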

NEWS - November 2012

I retired from Northeastern in mid-2011. I am currently working on major extensions to the NLP-NG system described above. Without the hectic schedule of my earlier faculty job, I can take the time to carefully design and implement my system. Students have always been great collaborators over the years. Being young, they don't have the programming experience or the natural language expertise that I have. My code has a uniform style and is constantly guided by my understanding of how it needs to work with the Java language and the content, constraints, and strategies needed for the analysis of natural language.

For now, I'm working under "deep cover". I'll only release the code and related data resources after the system is fully working and I have results and papers published.

Natural language research - R. P. Futrelle
