UIMA (Unstructured Information Management Architecture) is a general framework for analyzing unstructured content (especially text). The Java implementation of UIMA provides some infrastructure for natural language processing, but you’ll need some other pieces to actually perform analyses.
In this series, I aim to cover some ways to get started using UIMA with Scala, with the ultimate objective of creating a simple server to run analyses on request. This post will walk through some trivial analysis pipelines and show how to iterate over the results.
UIMA Basics
Some preliminaries: UIMA’s basic unit of work is the CAS (“Common Analysis System”), a reusable (mutable) container for text and annotations. A typical UIMA pipeline consists of a Collection Reader, a series of Analysis Engines, and possibly some CAS Consumers (which might, for example, save the analysis results to disk). Initially, UIMA relied on XML configuration files for each of these components, but uimaFIT was developed to configure analyses programmatically.1
UIMA provides the infrastructure, but doesn’t offer any analysis options on its own. DKPro Core is one project that has packaged a lot of libraries like OpenNLP and the Stanford Named Entity Recognizer for use with UIMA. Crucially, DKPro defines a set of UIMA types that can be used to annotate your data in an interoperable way, allowing you to mix and match analysis engines from different sources.
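To make this concrete, here is a hedged sketch of mixing engines from two different libraries, assuming the relevant DKPro Core OpenNLP and Stanford modules are on the classpath:

import org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter
import de.tudarmstadt.ukp.dkpro.core.stanfordnlp.StanfordNamedEntityRecognizer

// Both engines read and write the shared DKPro types (Token, Sentence,
// NamedEntity), so an OpenNLP segmenter can feed a Stanford NER engine
// even though the underlying libraries know nothing of each other.
val segmenter = createEngineDescription(classOf[OpenNlpSegmenter])
val ner = createEngineDescription(classOf[StanfordNamedEntityRecognizer])

These two descriptions can then be passed to the same pipeline, exactly as we do with the ClearNLP engines later in this post.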
Armed with UIMA and DKPro, we can begin to execute some simple pipelines in Scala.
Developing a Corpus Class
We’ll define a Corpus class to wrap some of UIMA’s functionality, and a corresponding CorpusSpec class to test it. To begin, let’s say that a Corpus contains a CollectionReaderDescription that will tell UIMA how to load its contents.
import org.apache.uima.collection.CollectionReaderDescription
case class Corpus(reader: CollectionReaderDescription)
In order to produce the reader description, we’ll add a fromDir method to the companion object that makes use of DKPro’s TextReader class.
import de.tudarmstadt.ukp.dkpro.core.api.io.ResourceCollectionReaderBase
import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader
import org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription
object Corpus {
  def fromDir(directory: String, pattern: String = "[+]**/*.txt", lang: String = "en"): Corpus =
    Corpus(createReaderDescription(
      classOf[TextReader],
      ResourceCollectionReaderBase.PARAM_SOURCE_LOCATION, directory,
      ResourceCollectionReaderBase.PARAM_PATTERNS, pattern,
      ResourceCollectionReaderBase.PARAM_LANGUAGE, lang))
}
Now we can create a Corpus from a directory of text files using Corpus.fromDir. However, our Corpus class only contains the description of a reader, which doesn’t do anything yet. Let’s extend the class with two methods, tokenize and lemmatize, each of which will return an Iterator[JCas].
import org.apache.uima.jcas.JCas
case class Corpus(reader: CollectionReaderDescription) {
  def tokenize(): Iterator[JCas] = ???
  def lemmatize(): Iterator[JCas] = ???
}
Before we write the definitions, let’s write some tests in CorpusSpec that will confirm they work as expected. (I’ve cribbed the collection of U.S. presidential inaugural addresses from NLTK for testing purposes, storing them in src/test/resources/inaugural/.)
import org.scalatest._
import org.apache.uima.fit.util.JCasUtil
import de.tudarmstadt.ukp.dkpro.core.api.metadata.`type`.DocumentMetaData
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.`type`.{Lemma, Token}
class CorpusSpec extends FunSpec with Matchers {
  val corpusDir =
    getClass().getClassLoader().getResource("inaugural/").getPath()

  describe("Corpus") {
    describe("tokenize") {
      describe("when passed a reader") {
        it("should tokenize a corpus") {
          val corpus = Corpus.fromDir(corpusDir)
          val tokenMap = (for {
            jcas <- corpus.tokenize()
            metadata = JCasUtil.selectSingle(jcas, classOf[DocumentMetaData])
            title = metadata.getDocumentTitle()
            tokens = JCasUtil.select(jcas, classOf[Token])
          } yield title -> tokens.size()).toMap

          tokenMap should have size 56
          all (tokenMap.values) should be > 0
        }
      }
    }
  }
}
Here we tokenize the corpus, and generate a map from the title of each document (its filename, by default) to the number of tokens found. We then check that we have the number of documents we expect (56 inaugural addresses) and that all documents have a non-zero number of tokens. (The test for lemmatize is identical, except it calls lemmatize in lieu of tokenize and selects for classOf[Lemma]; see the sketch below.)
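For concreteness, here is a minimal sketch of that test, built by the substitutions just described (it assumes the same imports and the same 56-document corpus):

it("should lemmatize a corpus") {
  val corpus = Corpus.fromDir(corpusDir)
  val lemmaMap = (for {
    jcas <- corpus.lemmatize()
    metadata = JCasUtil.selectSingle(jcas, classOf[DocumentMetaData])
    title = metadata.getDocumentTitle()
    lemmas = JCasUtil.select(jcas, classOf[Lemma])
  } yield title -> lemmas.size()).toMap

  lemmaMap should have size 56
  all (lemmaMap.values) should be > 0
}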
Note that we access the annotations using JCasUtil.select and its variant JCasUtil.selectSingle. If desired, these calls could also be wrapped by methods in the Scala Corpus class, but I’ll leave that for a later refinement.
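As a taste of what that refinement might look like, here is a minimal sketch; the lemmas helper is hypothetical, not something defined in this post’s repository:

import scala.collection.JavaConversions._
import org.apache.uima.fit.util.JCasUtil
import org.apache.uima.jcas.JCas
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.`type`.Lemma

// Hypothetical helper: hide JCasUtil behind a Scala-friendly method that
// returns the lemma strings of a single analyzed document.
def lemmas(jcas: JCas): Iterable[String] =
  JCasUtil.select(jcas, classOf[Lemma]).map(_.getValue)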
Returning to tokenize and lemmatize: to implement them, we can use the iteratePipeline method provided by uimaFIT. This method takes a reader description and a variable number of engine descriptions, and outputs a JCasIterable (a UIMA-specific wrapper for an iterator resulting from an analysis pipeline) that we can readily convert into an Iterator[JCas].
Here is the final case class definition:
import org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription
import org.apache.uima.fit.pipeline.SimplePipeline.iteratePipeline
import org.apache.uima.jcas.JCas
import de.tudarmstadt.ukp.dkpro.core.clearnlp.{ClearNlpSegmenter, ClearNlpLemmatizer, ClearNlpPosTagger}
case class Corpus(reader: CollectionReaderDescription) {
  // keep implicit conversions local
  import scala.collection.JavaConversions._

  def tokenize(): Iterator[JCas] =
    iteratePipeline(
      reader,
      createEngineDescription(classOf[ClearNlpSegmenter])
    ).iterator()

  def lemmatize(): Iterator[JCas] =
    iteratePipeline(
      reader,
      createEngineDescription(classOf[ClearNlpSegmenter]),
      createEngineDescription(classOf[ClearNlpPosTagger]),
      createEngineDescription(classOf[ClearNlpLemmatizer])
    ).iterator()
}
That completes our tiny Corpus class. You could try it out by pasting the following code into the REPL:
import scala.collection.JavaConversions._
import org.apache.uima.fit.util.JCasUtil
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.`type`.Lemma
val corpus = Corpus.fromDir("src/test/resources/inaugural")
val jcasIterator = corpus.lemmatize()
val jcas = jcasIterator.next()
val lemmas = JCasUtil.select(jcas, classOf[Lemma]).take(5).map(_.getValue)
It should give the following result:
scala> val lemmas = JCasUtil.select(jcas, classOf[Lemma]).take(5).map(_.getValue)
lemmas: Iterable[String] = List(fellow, -, citizen, of, the)
Wrap-up
That’s it! To try this example yourself, you can clone the repository and run ./activator test. You may note that it runs somewhat slowly; unfortunately, UIMA is single-threaded by default, and can’t readily reuse analyses computed in separate pipelines (for example, the ClearNlpSegmenter is run anew each time).
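One workaround, sketched below under my own assumptions rather than taken from the repository, is to run the full analysis in a single pass and read both tokens and lemmas off the same JCas stream, so the segmenter runs only once per document:

// Hypothetical single-pass method for the Corpus class: segmentation,
// POS tagging, and lemmatization run in one pipeline, so each document is
// segmented once and its JCas carries both Token and Lemma annotations.
def analyze(): Iterator[JCas] =
  iteratePipeline(
    reader,
    createEngineDescription(classOf[ClearNlpSegmenter]),
    createEngineDescription(classOf[ClearNlpPosTagger]),
    createEngineDescription(classOf[ClearNlpLemmatizer])
  ).iterator()

Callers can then use JCasUtil.select with whichever annotation type they need, instead of paying for a separate pipeline per analysis.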
This post covered the bare essentials of using UIMA from Scala. In the next part, I’ll explore how to work with UIMA’s multithreaded extensions in order to take advantage of multiple cores and speed up the process.
See part 2 about UIMA Asynchronous Scaleout.