My professional experience with natural language extends back to my undergraduate years at MIT. We all had to do an undergraduate thesis. Based on my interest in music and my reading of books such as Colin Cherry's On Human Communication, I wanted to do my thesis on digital music synthesis, to go beyond Ercolino Ferretti's (RIP 2006) work at MIT on analog music synthesis. But I was dissuaded from that and told that speech synthesis was more sensible. So I did that, creating what may well have been the first purely digital speech synthesis program ever written. (I am still working on the history of that field.)
Soon afterwards I went back into Physics, in the MIT PhD program. At that point, Noam Chomsky's first students arrived. I hung out with them, because my interest in linguistics had been piqued by my speech work. I roomed with David Perlmutter (now retired from the Linguistics faculty at UCSD). After that I kept in touch with my linguistics friends and followed linguistics as a hobby throughout the Physics and Biology careers that followed.
The breakthrough came in the summer of 1981, as I have related elsewhere, when I decided to focus on understanding and extracting knowledge from the biological literature. By 1982, I had published my first paper on the topic and developed an entry system for structured text (with a student, R. Reinke). A good deal transpired after that, a high point being the research funded by the large NSF grant that founded our Biological Knowledge Laboratory (BKL). That research is chronicled in our various publications. Highlights then and since have included an MS thesis on parsing, Andrea Grimes' development of a pattern viewer, and, currently, the ongoing development of a radically new computational infrastructure for natural language processing, NLP New Generation (NLP-NG), initially with the assistance of Jeff Satterley, a PhD student in the BKL. The project uses the 300M-word corpus of the BioMed Central journals (in XML format), Apache Derby, our 12GB Xserve, Java, Eclipse, Subversion, Hadoop, HBase, and more. The linguistic basis of the work is construction grammar. Our first major results and publications began to appear in 2009.
In 2001, in response to discussions at PSB 2000 (the Pacific Symposium on Biocomputing), I launched the site BioNLP.org. It initially served as an information resource, but too much effort was required to keep it up to date. It now serves as an important forum through its mailing list, with over 600 members from around the world. The mailing list archives can be searched via Google, on this page.