UIMA (Unstructured Information Management Architecture) is a general framework for analyzing unstructured content (especially text). The Java implementation of UIMA provides some infrastructure for natural language processing, but you’ll need some other pieces to actually perform analyses.
In this series, I aim to cover some ways to get started using UIMA with Scala, with the ultimate objective of creating a simple server to run analyses on request. This post will walk through some trivial analysis pipelines and show how to iterate over the results.
UIMA Basics
Some preliminaries: UIMA’s basic unit of work is the CAS (“Common Analysis System”), a reusable (mutable) container for text and annotations. A typical UIMA pipeline consists of a Collection Reader, a series of Analysis Engines, and possibly some CAS Consumers (which might, for example, save the analysis results to disk). Initially, UIMA relied on XML configuration files for each of these components, but uimaFIT was developed to configure analyses programmatically.1
UIMA provides the infrastructure, but doesn’t offer any analysis options on its own. DKPro Core is one project that has packaged a lot of libraries like OpenNLP and the Stanford Named Entity Recognizer for use with UIMA. Crucially, DKPro defines a set of UIMA types that can be used to annotate your data in an interoperable way, allowing you to mix and match analysis engines from different sources.
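To make this concrete, here is a hedged sketch of mixing engines from two different libraries, assuming the relevant DKPro Core OpenNLP and Stanford modules are on the classpath:

import org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter
import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizer

// Both engines read and write the shared DKPro types (Token, Sentence,
// NamedEntity), so an OpenNLP segmenter can feed a Stanford NER engine
// even though the underlying libraries know nothing of each other.
val segmenter = createEngineDescription(classOf[OpenNlpSegmenter])
val ner = createEngineDescription(classOf[StanfordNamedEntityRecognizer])

These two descriptions can then be passed to the same pipeline, exactly as we do with the ClearNLP engines later in this post.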
Armed with UIMA and DKPro, we can begin to execute some simple pipelines in Scala.
Developing a Corpus Class
We’ll define a Corpus class to wrap some of UIMA’s functionality, and a corresponding CorpusSpec class to test it. To begin, let’s say that a Corpus contains a CollectionReaderDescription that will tell UIMA how to load its contents.
import org.apache.uima.collection.CollectionReaderDescription
case class Corpus(reader: CollectionReaderDescription)
In order to produce the reader description, we’ll add a fromDir method to the companion object that makes use of DKPro’s TextReader class.
import de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase
import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader
import org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription
object Corpus {
  def fromDir(directory: String, pattern: String = "[+]**/*.txt", lang: String = "en"): Corpus =
    Corpus(createReaderDescription(
      classOf[TextReader],
      ResourceCollectionReaderBase.PARAM_SOURCE_LOCATION, directory,
      ResourceCollectionReaderBase.PARAM_PATTERNS, pattern,
      ResourceCollectionReaderBase.PARAM_LANGUAGE, lang))
}
Now we can create a Corpus from a directory of text files using Corpus.fromDir. However, our Corpus class only contains the description of a reader, which doesn’t do anything yet. Let’s extend the class with two methods, tokenize and lemmatize, each of which will return an Iterator[JCas].
import org.apache.uima.jcas.JCas
case class Corpus(reader: CollectionReaderDescription) {
  def tokenize(): Iterator[JCas] = ???
  def lemmatize(): Iterator[JCas] = ???
}
Before we write the definitions, let’s write some tests in CorpusSpec that will confirm they work as expected. (I’ve cribbed the collection of U.S. presidential inaugural addresses from NLTK for testing purposes, storing them in src/test/resources/inaugural/.)
import org.scalatest._
import org.apache.uima.fit.util.JCasUtil
import de.tudarmstadt.ukp.dkpro.core.api.metadata.`type`.DocumentMetaData
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.`type`.{Lemma, Token}
class CorpusSpec extends FunSpec with Matchers {
  val corpusDir =
    getClass().getClassLoader().getResource("inaugural/").getPath()

  describe("Corpus") {
    describe("tokenize") {
      describe("when passed a reader") {
        it("should tokenize a corpus") {
          val corpus = Corpus.fromDir(corpusDir)
          val tokenMap = (for {
            jcas <- corpus.tokenize()
            metadata = JCasUtil.selectSingle(jcas, classOf[DocumentMetaData])
            title = metadata.getDocumentTitle()
            tokens = JCasUtil.select(jcas, classOf[Token])
          } yield title -> tokens.size()).toMap

          tokenMap should have size 56
          all (tokenMap.values) should be > 0
        }
      }
    }
  }
}
Here we tokenize the corpus, and generate a map from the title of each document (its filename, by default) to the number of tokens found. We then check that we have the number of documents we expect (56 inaugural addresses) and that all documents have a non-zero number of tokens. (The test for lemmatize is identical, except it calls lemmatize in lieu of tokenize and selects for classOf[Lemma]; see the sketch below.)
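For concreteness, here is a minimal sketch of that test, built by the substitutions just described (it assumes the same imports and the same 56-document corpus):

it("should lemmatize a corpus") {
  val corpus = Corpus.fromDir(corpusDir)
  val lemmaMap = (for {
    jcas <- corpus.lemmatize()
    metadata = JCasUtil.selectSingle(jcas, classOf[DocumentMetaData])
    title = metadata.getDocumentTitle()
    lemmas = JCasUtil.select(jcas, classOf[Lemma])
  } yield title -> lemmas.size()).toMap

  lemmaMap should have size 56
  all (lemmaMap.values) should be > 0
}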
Note that we access the annotations using JCasUtil.select and its variant JCasUtil.selectSingle. If desired, these calls could also be wrapped by methods in the Scala Corpus class, but I’ll leave that for a later refinement.
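As a taste of what that refinement might look like, here is a minimal sketch; the lemmas helper is hypothetical, not something defined in this post’s repository:

import scala.collection.JavaConversions._
import org.apache.uima.fit.util.JCasUtil
import org.apache.uima.jcas.JCas
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.`type`.Lemma

// Hypothetical helper: hide JCasUtil behind a Scala-friendly method that
// returns the lemma strings of a single analyzed document.
def lemmas(jcas: JCas): Iterable[String] =
  JCasUtil.select(jcas, classOf[Lemma]).map(_.getValue)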
Returning to tokenize and lemmatize: to implement them, we can use the iteratePipeline method provided by uimaFIT. This method takes a reader description and a variable number of engine descriptions, and outputs a JCasIterable (a UIMA-specific wrapper for an iterator resulting from an analysis pipeline) that we can readily convert into an Iterator[JCas].
Here is the final case class definition:
import org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription
import org.apache.uima.fit.pipeline.SimplePipeline.iteratePipeline
import org.apache.uima.jcas.JCas
import de.tudarmstadt.ukp.dkpro.core.clearnlp.{ClearNlpSegmenter, ClearNlpLemmatizer, ClearNlpPosTagger}
case class Corpus(reader: CollectionReaderDescription) {
  // keep implicit conversions local
  import scala.collection.JavaConversions._

  def tokenize(): Iterator[JCas] =
    iteratePipeline(
      reader,
      createEngineDescription(classOf[ClearNlpSegmenter])
    ).iterator()

  def lemmatize(): Iterator[JCas] =
    iteratePipeline(
      reader,
      createEngineDescription(classOf[ClearNlpSegmenter]),
      createEngineDescription(classOf[ClearNlpPosTagger]),
      createEngineDescription(classOf[ClearNlpLemmatizer])
    ).iterator()
}
That completes our tiny Corpus class. You could try it out by pasting the following code into the REPL:
import scala.collection.JavaConversions._
import org.apache.uima.fit.util.JCasUtil
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.`type`.Lemma
val corpus = Corpus.fromDir("src/test/resources/inaugural")
val jcasIterator = corpus.lemmatize()
val jcas = jcasIterator.next()
val lemmas = JCasUtil.select(jcas, classOf[Lemma]).take(5).map(_.getValue)
It should give the following result:
scala> val lemmas = JCasUtil.select(jcas, classOf[Lemma]).take(5).map(_.getValue)
lemmas: Iterable[String] = List(fellow, -, citizen, of, the)
Wrap-up
That’s it! To try this example yourself, you can clone the repository and run ./activator test. You may note that it runs somewhat slowly; unfortunately, UIMA is single-threaded by default, and can’t readily reuse analyses computed in separate pipelines (for example, the ClearNlpSegmenter is run anew each time).
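One workaround, sketched below under my own assumptions rather than taken from the repository, is to run the full analysis in a single pass and read both tokens and lemmas off the same JCas stream, so the segmenter runs only once per document:

// Hypothetical single-pass method for the Corpus class: segmentation,
// POS tagging, and lemmatization run in one pipeline, so each document is
// segmented once and its JCas carries both Token and Lemma annotations.
def analyze(): Iterator[JCas] =
  iteratePipeline(
    reader,
    createEngineDescription(classOf[ClearNlpSegmenter]),
    createEngineDescription(classOf[ClearNlpPosTagger]),
    createEngineDescription(classOf[ClearNlpLemmatizer])
  ).iterator()

Callers can then use JCasUtil.select with whichever annotation type they need, instead of paying for a separate pipeline per analysis.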
This post covered the bare essentials of using UIMA from Scala. In the next part, I’ll explore how to work with UIMA’s multithreaded extensions in order to take advantage of multiple cores and speed up the process.
See part 2 about UIMA Asynchronous Scaleout.