Pattern Matching using Mensa — A Quick Overview

Mensa | Open Source Java

Mensa | Open Source Java on Github

Since first writing about the new open source Java project, Mensa, I’ve received a number of questions asking who is the target audience, what specific problems Mensa solves, and how it might actually be used. In this post, I address those questions. I also describe potential problems with pattern matching in Java using the standard, built-in functions and how Mensa overcomes those issues. Finally, I offer an example use case where Mensa really shines. Please keep in mind though, this is just one example. Mensa is a software library–a collection of software building blocks. Creative software engineers can surely find other novel ways to leverage the Mensa functionality.

Who Should Use Mensa?

Mensa is a Java class library. As such, the primary Mensa user is a programmer working in Java. However, I am currently involved with a team that is using Mensa from C# (using IKVM.NET), but that’s a subject for another post. (Feel free to contact me if you’re specifically interested in that.)

Mensa in a Nutshell

Mensa provides a robust and efficient solution to a class of pattern matching problems. Specifically, Mensa makes it easy to find any or all occurrences of any keywords within a source text. For example, finding all references to named characters (the keywords) in a novel (the source text).

Mensa is efficient in that it will quickly process source text in a single pass regardless if whether it’s searching for one keyword or a million keywords.

Mensa is generic in the sense that it is not limited to matching textual data. Instead, a source text is, by definition, an arbitrarily long sequence of symbols. The actual data type of these symbols can be anything–characters, bytes, numbers, hieroglyphs, musical notes, etc.

Standard Pattern Matching in Java

The standard Java library includes support for finding specific patterns within character strings. There are two basic mechanisms to choose from:

  • String search finds an instance of a literal string value within another string. For example, you could search for the word “Arbor” in the text “Welcome to Ann Arbor”. The result would indicate that the word was found at position 15 (Java starts counting at zero).
  • Regular expression matching finds an instance of a string pattern within another string. For example, you could search for the pattern “A…r” in “Welcome to Ann Arbor”. The “.” in the pattern means “any character”, so this would also find a match at position 15.

In many situations, these built-in Java capabilities work just fine. But not always. Consider the following problem scenarios:

Each of these scenarios, respectively, is easily handled using Mensa.

  • The standard library functions operate on character strings in memory. They are not well-suited to searching through very large texts that don’t easily fit completely in memory.
  • The built-in regular expressions can recognize word boundaries but require more processing time than simple string literal matching, which cannot. For example, without regard to word boundaries, “boy” would match in “30-day boycott”, which is often not what you’d want.
  • The standard library functions do not scale well to finding very large numbers of keywords. Your choices would be either: a) perform a separate literal (or regular expression) search operation for each keyword, or b) create a regular expression containing alternative clauses for each keyword. These approaches suffer in terms of performance as the number of keywords increases.
  • The standard library functions are limited to finding character keywords in character texts. There is no way to use built-in functions, for example, to find all sequences of “317, 206, 827, 1106” in an array of 10 million integers.

Each of these scenarios, respectively, is easily handled using Mensa:

  • Mensa operates on an abstract text source. Implementations are provided for both in-memory and streaming text sources. Plus, it is easy to create new text source implementations for custom sources. For example, you could create a text source that reads from a data base, a version control system, a REST API, etc.
  • Mensa has an option to enable or disable word-boundary checking. When enabled, keywords are recognized only when delimited by word boundary symbols. What constitutes a word boundary symbol is determined by an abstract symbol classifier. An implementation is provided for character symbols, but you can also create custom symbol classifiers. For example, a gene search application may define certain nucleotides as gene boundaries.
  • Once a Mensa machine is constructed, its performance depends only on the length of the text being searched, not on the number of keywords. And, a given machine instance can be used any number of times, even by concurrent threads, to search multiple text sources.
  • Mensa is implemented using Java generics. Thus, it can be used to match any type of symbol as defined by the Java template type S — e.g., it is possible to create a machine to match bytes, characters, integers, gene sequences, bit sequences, etc.

Mensa Use Case Example

Suppose your company has an internal company web portal containing many thousands of web pages. You’ve been asked to write a program to determine how many times each employee is mentioned anywhere within that portal. You have access to an HR database that contains the full names and, in some cases, the nicknames of every employee in the company, of which there are roughly 25,000.

At a very high level, a Mensa-based solution to this problem looks like this:

  1. Create a Mensa keywords collection.
  2. Read the full names and nicknames of the employees from the HR database, and add each to the keywords collection.
  3. Create a Mensa matching machine.
  4. Initialize it with the keyword collection.
  5. For each page in the web portal, create a Mensa text source based on the page, then run the matching machine agains that source. The result is a list of matching keywords (and thus employee names or nicknames) for that page.

Mensa provides the ability to associate arbitrary data (called user data) with each keyword. This additional information is included with any match result. For this application, it would be useful to use this feature to store the employee’s identity with each keyword. This is especially important for nicknames–there could easily be more than one “Andy” in the company, for example. That way, each time the matching machine reports a match, you’ll know not only the text of the match, but also the specific employee(s) being mentioned.

Learn More

If this has whetted your pattern matching appetite, hop over to the Mensa project site where you can browse the Mensa Wiki and take a look at the Mensa source code examples. Then, download the latest release and build something cool!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.