I have been developing commercial software for more than thirty years. Now, I’m pleased to announce that, for the first time, some of my software has been released as a new Dell/Quest open source Java project, named Mensa.
Mensa is a generic, flexible, enhanced, and efficient open source Java implementation of the pattern matching state machine described in the 1975 paper by Alfred V. Aho and Margaret J. Corasick, "Efficient string matching: An aid to bibliographic search." This implementation is:
- generic in that it can be used to match any type of symbol, as defined by the Java generic type parameter S (e.g., it is possible to create a machine to match bytes, characters, integers, gene sequences, bit sequences, etc.);
- flexible in that the architecture allows for granular extension, customization, or replacement of framework components;
- enhanced in that it supports a number of useful extensions not addressed in the original paper, such as whole-word matching, case-sensitivity controls, fuzzy whitespace matching, fuzzy punctuation matching, incremental matching (i.e., iterators), matching event listeners, etc.; and
- efficient in that it performs well in terms of both time and resource usage on very large (~million term) keyword sets.
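To give a feel for the underlying technique, here is a minimal sketch of a generic Aho-Corasick matcher. It is not Mensa's code or API: the class `AhoCorasickSketch` and its methods `add`, `build`, `search`, and `chars` are hypothetical names chosen for illustration, and the sketch leaves out every extension listed above (whole-word matching, fuzzy matching, iterators, listeners). It only shows the three classic phases: building a keyword trie, computing failure links, and scanning the input once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch only; all names here are hypothetical, not Mensa's API.
public class AhoCorasickSketch<S> {

    private static final class Node<S> {
        final Map<S, Node<S>> next = new HashMap<>();     // goto transitions
        Node<S> fail;                                     // failure link
        final List<List<S>> outputs = new ArrayList<>();  // keywords ending in this state
    }

    private final Node<S> root = new Node<>();

    // Phase 1: insert a keyword into the trie (the "goto" function).
    public void add(List<S> keyword) {
        Node<S> node = root;
        for (S symbol : keyword) {
            node = node.next.computeIfAbsent(symbol, k -> new Node<>());
        }
        node.outputs.add(keyword);
    }

    // Phase 2: breadth-first pass that computes failure links.
    // Call once, after all keywords have been added.
    public void build() {
        Queue<Node<S>> queue = new ArrayDeque<>();
        for (Node<S> child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node<S> node = queue.remove();
            for (Map.Entry<S, Node<S>> entry : node.next.entrySet()) {
                Node<S> child = entry.getValue();
                Node<S> f = node.fail;
                while (f != root && !f.next.containsKey(entry.getKey())) {
                    f = f.fail;
                }
                Node<S> target = f.next.get(entry.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit keywords that end at the failure state (proper suffixes).
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Phase 3: scan the input once; each result is "endIndex:keyword".
    public List<String> search(List<S> text) {
        List<String> matches = new ArrayList<>();
        Node<S> state = root;
        for (int i = 0; i < text.size(); i++) {
            S symbol = text.get(i);
            while (state != root && !state.next.containsKey(symbol)) {
                state = state.fail;
            }
            Node<S> nextState = state.next.get(symbol);
            state = (nextState != null) ? nextState : root;
            for (List<S> keyword : state.outputs) {
                matches.add(i + ":" + keyword);
            }
        }
        return matches;
    }

    // Helper: turn a String into a list of Character symbols.
    public static List<Character> chars(String s) {
        List<Character> out = new ArrayList<>();
        for (char c : s.toCharArray()) {
            out.add(c);
        }
        return out;
    }

    public static void main(String[] args) {
        AhoCorasickSketch<Character> machine = new AhoCorasickSketch<>();
        for (String keyword : new String[] {"he", "she", "his", "hers"}) {
            machine.add(chars(keyword));
        }
        machine.build();
        System.out.println(machine.search(chars("ushers")));
    }
}
```

With the keyword set from the original paper (he, she, his, hers), scanning "ushers" reports she and he ending at index 3 and hers ending at index 5. Because the symbol type is a type parameter, the same class works unchanged for `Integer`, `Byte`, or any other type with sensible `equals` and `hashCode`. A production library would replace the per-node `HashMap` with something far more compact to handle million-term dictionaries.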
Project History
For a number of years, beginning within Quest Software and continuing within Dell Software (after a company acquisition), the core Mensa contributors have been working on platform technologies for discovering, mapping, and connecting otherwise disconnected data across various enterprise information sources. One key ingredient in such solutions is the ability to accurately and efficiently find stuff.
Most recently, we have focused on delivering automatic digital asset classification technologies, such as those used by Dell One Identity Manager Data Governance Edition – Classification Module [update: now Quest One Identity].
We had been using licensed, third-party software for dictionary-based keyword searching, but for a variety of reasons, we knew that component would eventually need to be replaced. About a year ago, we began looking for open source alternatives. However, no available open source solution had all the elements we were looking for: generics, flexibility, fuzzy matching, large dictionary efficiency, etc.
So, in early 2014, I set out to create a new “Java Aho-Corasick Library” that would satisfy all of these requirements.
Why Mensa?
When we decided to release this project as an open source Java library, we thought it should have a short, easily remembered, and easily spelled name–“Java Aho-Corasick Library” was a mouthful! Our colleagues at Dell had recently released Doradus as an open source Java project. We decided to piggyback on their naming approach and chose an astronomical name.
Mensa is a small constellation in the southern sky. Its name is Latin for table. This project is a modestly sized class library, primarily designed to search for sets (or tables) of keywords.
Open Source Java
The following are the primary Mensa resources:
- Source code: Source code can be downloaded from this GitHub project:
https://github.com/dell-oss/Mensa
- Documentation: There are two primary sources of documentation: the Mensa Wiki and the API Javadocs. The wiki contains details on building Mensa, tutorials, examples, contributor information, and more. The API Javadocs are generated as part of the build process and contain detailed API documentation.
- Issues: Please feel free to post bug reports and feature requests in the GitHub Issues area:
https://github.com/dell-oss/Mensa/issues
- Downloads: Binaries may be downloaded from the Releases page. Source, binary, and doc download bundles will be available soon from Maven Central. Stay tuned!
Please also feel free to leave a comment here or contact me directly. This is my first time as an open source Java author; I’d love any feedback, suggestions, criticisms, etc. I’m especially interested in how you are (or might be) using Mensa.