Named entity recognition

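The named entity recognition (NER) module recognizes mention spans of a particular entity type (e.g., abnormal findings) in reports. We provide two options for NER: a rule-based method built on regular expressions (regex) and a terminology-matching method built on spaCy (spacy).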

Usage:
    medtext-ner spacy [--overwrite --spacy-model NAME --radlex FILE] -i FILE -o FILE
    medtext-ner regex [--overwrite] --phrases FILE -i FILE -o FILE
    medtext-ner download [--spacy-model NAME]

Options:
    -i FILE             Input file
    -o FILE             Output file
    --overwrite         Overwrite the existing file
    --phrases FILE      Phrase patterns
    --radlex FILE       The RadLex ontology file [default: .medtext/resources/Radlex4.1.xlsx]
    --spacy-model NAME  spaCy trained model [default: en_core_web_sm]

regex

The rule-based method uses regular expressions that combine information from terminological resources with characteristics of the entities of interest. These expressions are manually constructed by domain experts.

from pathlib import Path
from medtext_ner.models.ner_regex import NerRegExExtractor, BioCNerRegex, load_yml

# argv: the parsed command-line options (see Usage above)
patterns = load_yml(argv['--phrases'])
extractor = NerRegExExtractor(patterns)
processor = BioCNerRegex(extractor, name=Path(argv['--phrases']).stem)

spacy

spaCy’s PhraseMatcher provides another way to match large terminology lists efficiently. medtext uses PhraseMatcher to recognize concepts in the RadLex ontology.

import spacy
from medtext_ner.models.ner_spacy import NerSpacyExtractor, BioCNerSpacy
from medtext_ner.models.radlex import RadLex4

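# argv: the parsed command-line options (see Usage above)
# Load the spaCy model without its ner, parser, and senter components.
nlp = spacy.load(argv['--spacy-model'], exclude=['ner', 'parser', 'senter'])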
radlex = RadLex4(argv['--radlex'])
matchers = radlex.get_spacy_matchers(nlp)
extractor = NerSpacyExtractor(nlp, matchers)
processor = BioCNerSpacy(extractor, 'RadLex')
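
To show what PhraseMatcher does on its own, here is a minimal sketch that is independent of medtext. It uses a blank English pipeline and two illustrative terms rather than the RadLex ontology. The label FINDING, the term list, and the case-insensitive attr="LOWER" setting are assumptions chosen for the example, not necessarily what medtext configures internally.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough: PhraseMatcher only needs tokenized text.
nlp = spacy.blank("en")

# attr="LOWER" compares lowercased tokens, so matching ignores case (assumed setting).
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Illustrative terms only; medtext derives its matchers from the RadLex ontology.
terms = ["pleural effusion", "pneumothorax"]
matcher.add("FINDING", [nlp.make_doc(term) for term in terms])

doc = nlp("Small left pleural effusion. No pneumothorax.")

# Each match is (match_id, start_token, end_token); convert to text and character offsets.
mentions = []
for _, start, end in matcher(doc):
    span = doc[start:end]
    mentions.append((span.text, span.start_char, span.end_char))

print(mentions)
```

Because the terms are tokenized once and stored in the matcher, the same matcher can be reused across many documents, which is what makes this approach practical for large terminology lists.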

Phrase patterns

The pattern file is in YAML format. It contains a list of concepts, where each key serves as the concept's preferred name. Each concept should contain three attributes: concept_id, include, and exclude. include lists the regular expressions that the concept will match. exclude lists the regular expressions that the concept will not match, even if a substring of the excluded text matches an expression in include.

With the following example, medtext recognizes “emphysema” but rejects “subcutaneous emphysema”, even though the latter contains “emphysema”.

Emphysema:
  concept_id: RID4799
  include:
    - emphysema
  exclude:
    - subcutaneous emphysema
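
The include/exclude behavior described above can be sketched with the standard re module. This is a simplified illustration, not medtext's implementation: the PATTERNS dictionary is a hand-written stand-in for the loaded YAML file, find_mentions is a hypothetical helper, and the rule "drop an include match that lies inside an exclude match" and the case-insensitive matching are assumptions about how exclusion is applied.

```python
import re

# Hypothetical in-memory equivalent of the YAML example above.
PATTERNS = {
    "Emphysema": {
        "concept_id": "RID4799",
        "include": ["emphysema"],
        "exclude": ["subcutaneous emphysema"],
    }
}


def find_mentions(text, patterns):
    """Return (concept_id, start, end, matched_text) for each accepted match."""
    mentions = []
    for concept in patterns.values():
        # Collect the character spans covered by any exclude expression.
        excluded = [
            m.span()
            for rx in concept.get("exclude", [])
            for m in re.finditer(rx, text, re.IGNORECASE)
        ]
        for rx in concept.get("include", []):
            for m in re.finditer(rx, text, re.IGNORECASE):
                start, end = m.span()
                # Reject an include match that falls inside an excluded span.
                if any(s <= start and end <= e for s, e in excluded):
                    continue
                mentions.append((concept["concept_id"], start, end, m.group()))
    return mentions


report = "Subcutaneous emphysema is noted. Mild emphysema in the upper lobes."
print(find_mentions(report, PATTERNS))
```

In this sketch only the second “emphysema” is reported, because the first one sits inside the excluded phrase “subcutaneous emphysema”.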