Named entity recognition
The named entity recognition (NER) module recognizes mention spans of a particular entity type (e.g., abnormal findings) from the reports. We provide two options for NER.
Usage:
medtext-ner spacy [--overwrite --spacy-model NAME --radlex FILE] -i FILE -o FILE
medtext-ner regex [--overwrite] --phrases FILE -i FILE -o FILE
medtext-ner download [--spacy-model NAME]
Options:
-i FILE Inpput file
-o FILE Output file
--overwrite Overwrite the existing file
--phrases FILE Phrase patterns
--radlex FILE The RadLex ontology file [default: .medtext/resources/Radlex4.1.xlsx]
--spacy-model NAME spaCy trained model [default: en_core_web_sm]
regex
The rule-based method uses regular expressions that combine information from terminological resources and characteristics of the entities of interest. They are manually constructed by domain experts.
from pathlib import Path
from medtext_ner.models.ner_regex import NerRegExExtractor, BioCNerRegex, load_yml
patterns = load_yml(argv['--phrases'])
extractor = NerRegExExtractor(patterns)
processor = BioCNerRegex(extractor, name=Path(argv['--phrases']).stem)
spacy
SpaCy’s PhraseMatcher provides another way to efficiently match large terminology lists. medtext uses PhraseMatcher to recognize concepts in the RadLex ontology.
import spacy
from medtext_ner.models.ner_spacy import NerSpacyExtractor, BioCNerSpacy
from medtext_ner.models.radlex import RadLex4
nlp = spacy.load(argv['--spacy-model'], exclude=['ner', 'parser', 'senter'])
radlex = RadLex4(argv['--radlex'])
matchers = radlex.get_spacy_matchers(nlp)
extractor = NerSpacyExtractor(nlp, matchers)
processor = BioCNerSpacy(extractor, 'RadLex')
Phrase patterns
The pattern file is in the yaml format. It contains a list of concepts where the key serves as the
preferred name. Each concept should contain three attributes: concept_id, include, and
exclude.
include contains the regular expressions that the concept will match.
exclude contains the regular expressions that the concept will not match, even if its substring will match the regular
expressions in the include
Using the following example, medtext will recognize “emphysema”, but reject “subcutaneous emphysema” though “emphysema” is part of “subcutaneous emphysema”.
Emphysema:
concept_id: RID4799
include:
- emphysema
exclude:
- subcutaneous emphysema