Detecting people's names is part and parcel of PII discovery. Traditional techniques like regexps and keywords don't work, because the set of all names is too varied. How do open source Named Entity Recognition (NER) engines compare, and can we do better?
This Part 1 covers NER results and benchmarks; Part 2 dives into the technical neural network details.
Developing a state-of-the-art named entity recognizer
One could use a prepared list of names and surnames for this task, but such a gazetteer obviously cannot be exhaustive, and it fails on sentences like this:
"Calvin Klein founded Calvin Klein Company in 1968."
Humans easily recognize what is a person's name and what isn't. For machines, the task is harder because they struggle to understand context. This leads to the two infamous types of errors:
- False positives: words detected as personal names that are not, typically because they're capitalized ("Skin Games and Jesus Jones were one of the first western bands."); as well as
- False negatives: names missed by the detection, typically because they're lowercased, foreign, or uncommon. Example: "bill carpenter was playing football."
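Both failure modes are easy to reproduce with a naive capitalization heuristic. The sketch below is purely illustrative (no real NER works like this), but it shows why "capitalized = name" breaks down:

```python
import re

def naive_name_detector(text):
    """Toy baseline: flag runs of two or more capitalized words as person names."""
    return re.findall(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b", text)

# Capitalized band names trigger false positives:
print(naive_name_detector("Skin Games and Jesus Jones were one of the first western bands."))
# → ['Skin Games', 'Jesus Jones']

# Lowercased names are false negatives, missed entirely:
print(naive_name_detector("bill carpenter was playing football."))
# → []
```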
So in order to recognize person names in text, it is necessary to know not only what names look like, but also the contexts in which they're used, plus some general domain knowledge.
Why not just use open source?
NER is a well-studied task in academia. Naturally, we turned to open source NER solutions first. We evaluated the most popular ready-made packages: Stanford NER and Stanza from Stanford University, FLAIR from Zalando Research, and spaCy from Explosion AI.
To cut a long story short, none of these open source tools were precise and fast enough for our purposes. They work great on well-behaved data such as news articles or Wikipedia, but their accuracy implodes when applied to the wild, messy documents of the real world.
For this reason we developed our own NER, with special focus on names as they appear in real company documents. The resulting NER is proprietary (part of PII Tools) but we decided to share some technical design tips and results here, in the hope they help others.
In this article, we'll compare our creation against popular open source options. Additionally, we'll benchmark a simple gazetteer-based NER that uses a predefined list of names, to serve as a baseline.
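To give a concrete idea of how such a gazetteer baseline works (and why it fails on the Calvin Klein sentence above), here is a minimal sketch; the name lists are toy placeholders, not the actual gazetteer we benchmarked:

```python
# Toy name lists; a real gazetteer would hold millions of entries.
FIRST_NAMES = {"calvin", "john", "maria"}
LAST_NAMES = {"klein", "smith", "silva"}

def gazetteer_ner(text):
    """Flag adjacent (first name, last name) token pairs found in the lists."""
    tokens = [t.strip(".,!?") for t in text.split()]
    hits = []
    for first, last in zip(tokens, tokens[1:]):
        if first.lower() in FIRST_NAMES and last.lower() in LAST_NAMES:
            hits.append(f"{first} {last}")
    return hits

# The company name is flagged too: a gazetteer has no notion of context.
print(gazetteer_ner("Calvin Klein founded Calvin Klein Company in 1968."))
# → ['Calvin Klein', 'Calvin Klein']
```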
Our requirements for the new NER were:
- multi-language, with special focus on English, Spanish, German and Portuguese/Brazilian;
- to accept arbitrary document contexts: text coming from PDFs, Word documents, Excel, email, database field, OCR…;
- efficient, to process large amounts of text quickly on a CPU (no specialized GPU hardware needed) and with a low memory footprint;
- flexible, so the model's behaviour can evolve: retraining, adding new languages, correcting detection errors;
- and of course to be accurate, to minimize false positives and negatives.
We also included a manually annotated openweb dataset, to make sure we test on data that no NER system (including ours!) has seen during training. Text for this dataset was randomly sampled from the English OpenWebTextCorpus. We value results on openweb the most, because among these public datasets, OpenWeb most closely reflects the real (messy) data found in real documents.
We measured F1 scores for person names detected by each software. F1 scores range from 0 to 100, the higher the better. A "hit" (true positive) means the entire name was matched exactly, beginning to end. This is the strictest metric; we specifically don't calculate success over the number of correctly predicted tokens, or even individual characters.
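A minimal sketch of this strict exact-match scoring follows. Spans here are simplified to (text, occurrence) pairs for readability; a real evaluation would use character offsets:

```python
def exact_match_f1(predicted, gold):
    """F1 score (0-100) where a hit counts only if the whole name span matches."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives: exact span matches
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 100 * 2 * precision * recall / (precision + recall)

# Predicting "Zulfikar Ali" instead of the full "Nawab Zulfikar Ali Magsi"
# counts as both a miss and a false positive under this strict metric.
gold = {("Nawab Zulfikar Ali Magsi", 0), ("Shama Parveen Magsi", 0)}
pred = {("Zulfikar Ali", 0), ("Shama Parveen Magsi", 0)}
print(exact_match_f1(pred, gold))  # → 50.0
```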
In all tests our NER is among the top contenders – but keep in mind that accuracy was just one of our 5 design goals. For example, we "lost" a few F1 points on purpose by switching from TensorFlow to TensorFlow Lite, trading accuracy for a much smaller and faster model.
For the Portuguese language (Brazil, LGPD) and the openweb dataset, PII Tools is the clear winner. As mentioned above, openweb reflects "real data" the closest, so this is great.
Let's look at some concrete examples:
| Sample sentence | Results per detector (ok / fail) |
|---|---|
| "Calvin Klein founded Calvin Klein Company in 1968." | ok, ok, ok, ok, ok |
| "Nawab Zulfikar Ali Magsi did not see Shama Parveen Magsi coming." | ok, ok, ok, fail, ok |
| "bill carpenter was playing football." | fail, fail, fail, ok, ok |
| "Llanfihangel Talyllyn is beautiful." | fail, fail, fail, fail, ok |
| "Skin Games and Jesus Jones were one of the first western bands." | fail, fail, fail, fail, fail |
These samples are admittedly edge cases, but they give an idea of what detectors have to deal with, and how well they cope.
Now let's look at the other aspects of our requirements. Detection performance was measured on a CPU; our pii-tools NER was artificially restricted to a single process with a single thread, while the others were left at their default settings.
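A throughput measurement along these lines can be sketched as follows. The `measure_throughput` helper and the stand-in detector are illustrative, not our actual benchmark harness; note that for TensorFlow/PyTorch-backed models, thread counts can additionally be limited via settings such as the `OMP_NUM_THREADS` environment variable:

```python
import time

def measure_throughput(ner_fn, text, repeats=20):
    """Rough kB/s throughput of a NER callable on the current thread."""
    size_kb = len(text.encode("utf-8")) / 1024
    start = time.perf_counter()
    for _ in range(repeats):
        ner_fn(text)
    elapsed = time.perf_counter() - start
    return repeats * size_kb / elapsed

# Stand-in detector; any callable taking a string can be plugged in here.
dummy_ner = lambda text: []
kbps = measure_throughput(dummy_ner, "Calvin Klein founded Calvin Klein Company. " * 200)
print(f"{kbps:.0f} kB/s")
```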
(Performance table: detection speed on short texts [kB/s], speed on long texts [kB/s], startup time [s], and RAM usage [MB], per detector.)
Overall, FLAIR and Stanza are definitely out, due to their very slow speed and high RAM usage. A worthy competitor from the performance perspective is spaCy, whose authors put a great deal of effort into optimization. Unfortunately, spaCy's tokenization quirks and opinionated architecture proved too inflexible for our needs.
Likewise, Stanford NER is the most accurate among the open source alternatives, but is quite rigid – it's really hard to update its models or add a new language. Plus its GNU GPL license won't be to everyone's liking.
Our focus on industry use calls for frequent model modifications: adding new languages, new kinds of documents (document contexts in which names may appear), fixing detection errors.
In order to adapt the PII Tools NER model quickly, we built a pipeline that utilizes several "weaker" NERs and automatic translation tools to build a huge training corpus from varied sources. This focus on real-world data, along with a robust automated TensorFlow training pipeline, allows adding new languages and controlling NER outputs more easily than the open source solutions.
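The article doesn't spell out the pipeline internals, but one common way to merge the outputs of several "weaker" NERs into training labels is per-token majority voting. A simplified, purely illustrative sketch:

```python
from collections import Counter

def majority_vote(annotations):
    """Merge per-token labels from several weak NERs; keep a label only
    when a strict majority of the annotators agree on it, else emit 'O'."""
    merged = []
    for token_labels in zip(*annotations):
        label, count = Counter(token_labels).most_common(1)[0]
        merged.append(label if count > len(token_labels) / 2 else "O")
    return merged

# Three hypothetical weak annotators labelling "bill carpenter was playing":
votes = [
    ["PER", "PER", "O", "O"],
    ["O",   "PER", "O", "O"],
    ["PER", "PER", "O", "O"],
]
print(majority_vote(votes))  # → ['PER', 'PER', 'O', 'O']
```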
While developing our pii-tools NER, we implemented most components from scratch, including:
- a large scale annotated dataset (proprietary data)
- a tokenizer (critical; none of the open source variants do this part well)
- token features (input to the neural network)
- convolutional neural network (NN architecture)
- data augmentation and training pipeline (for grounded model updates)
- various tricks to push performance or to squeeze a lot of information into a smaller parameter space
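To give a flavour of what token features feeding a neural network can look like, here is a toy example with simple casing and shape signals; the actual feature set in PII Tools is proprietary and considerably richer:

```python
def token_features(token):
    """Toy per-token feature vector (illustrative only)."""
    return [
        float(token[:1].isupper()),              # starts with a capital letter
        float(token.isupper()),                  # ALL CAPS (acronyms, headers)
        float(token.islower()),                  # fully lowercased
        float(any(c.isdigit() for c in token)),  # contains a digit
        min(len(token), 20) / 20,                # token length, clipped & scaled
    ]

print(token_features("Calvin"))  # → [1.0, 0.0, 0.0, 0.0, 0.3]
print(token_features("bill"))    # → [0.0, 0.0, 1.0, 0.0, 0.2]
```

Features like these let the network see casing patterns explicitly, which matters for sentences such as "bill carpenter was playing football." where capitalization alone is misleading.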
If you are interested in these technical aspects, check out the follow-up article with technical NER details.
Questions? Want to see a live demo? Contact us.