Text preprocessing
This module provides sentence splitting, tokenization, part-of-speech tagging, lemmatization, and dependency parsing. Two backends are available for text preprocessing: spaCy and Stanza.
Usage:
medtext-preprocess stanza [--overwrite] -i FILE -o FILE
medtext-preprocess spacy [--overwrite --spacy-model NAME] -i FILE -o FILE
medtext-preprocess download spacy [--spacy-model=NAME]
medtext-preprocess download stanza
Options:
-i FILE              Input file
-o FILE              Output file
--overwrite          Overwrite the existing file
--spacy-model NAME   spaCy trained model [default: en_core_web_sm]
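The usage patterns above can also be assembled from Python, for example when preprocessing is launched from a batch script. The following is a minimal sketch, not part of the module: the helper name `build_preprocess_command` and the file names `input.xml` and `output.xml` are hypothetical, and only the flags listed under Options are used.

```python
import shlex


def build_preprocess_command(backend, input_path, output_path,
                             overwrite=False, spacy_model=None):
    """Return the argument list for one medtext-preprocess run.

    Hypothetical helper: it only mirrors the usage patterns documented above.
    """
    if backend not in ("spacy", "stanza"):
        raise ValueError(f"unknown backend: {backend!r}")
    if spacy_model is not None and backend != "spacy":
        # --spacy-model appears only in the spacy usage pattern.
        raise ValueError("--spacy-model only applies to the spacy backend")

    cmd = ["medtext-preprocess", backend]
    if overwrite:
        cmd.append("--overwrite")
    if spacy_model is not None:
        cmd += ["--spacy-model", spacy_model]
    cmd += ["-i", str(input_path), "-o", str(output_path)]
    return cmd


# Hypothetical file names, for illustration only.
cmd = build_preprocess_command("spacy", "input.xml", "output.xml",
                               overwrite=True, spacy_model="en_core_web_sm")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the package and the chosen model are installed.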
spacy
spaCy is an open-source Python library for Natural Language Processing.
import spacy
from medtext_preprocess.models.preprocess_spacy import BioCSpacy
# Load the trained model; the CLI passes the --spacy-model value here (default: en_core_web_sm).
nlp = spacy.load('en_core_web_sm')
processor = BioCSpacy(nlp)
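To make the listed preprocessing steps concrete, the sketch below shows which attributes of a spaCy `Doc` carry each annotation (sentence boundaries, tokens, lemmas, part-of-speech tags, dependency relations). The function name `spacy_rows` is hypothetical, and this is not how `BioCSpacy` stores its output, which this section does not describe.

```python
def spacy_rows(doc):
    """Collect one row per token from a spaCy Doc (hypothetical helper).

    Each row is (sentence index, text, lemma, POS tag, dependency label, head text).
    """
    rows = []
    # doc.sents yields one span per sentence; it requires a pipeline that
    # sets sentence boundaries, such as the dependency parser.
    for sent_idx, sent in enumerate(doc.sents):
        for token in sent:
            rows.append((
                sent_idx,
                token.text,       # tokenization
                token.lemma_,     # lemmatization
                token.pos_,       # coarse part-of-speech tag
                token.dep_,       # dependency relation to the head
                token.head.text,  # in spaCy, the root token is its own head
            ))
    return rows
```

With a pipeline loaded through `spacy.load`, `spacy_rows(nlp("some text"))` returns the rows for that text; the exact tags depend on the trained model.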
stanza
Stanza is a collection of efficient tools for Natural Language Processing.
import stanza
from medtext_preprocess.models.preprocess_stanza import BioCStanza
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
processor = BioCStanza(nlp)
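The same annotations are exposed differently by Stanza, so a comparable sketch may help when switching backends. It assumes a Stanza `Document` produced by a pipeline with the `tokenize,pos,lemma,depparse` processors; the function name `stanza_rows` is hypothetical and does not reflect how `BioCStanza` stores its output.

```python
def stanza_rows(doc):
    """Collect one row per word from a Stanza Document (hypothetical helper).

    Each row is (sentence index, text, lemma, UPOS tag, dependency label, head text).
    """
    rows = []
    for sent_idx, sentence in enumerate(doc.sentences):
        for word in sentence.words:
            # word.head is a 1-based index into sentence.words; 0 marks the root.
            if word.head > 0:
                head_text = sentence.words[word.head - 1].text
            else:
                head_text = "ROOT"
            rows.append((
                sent_idx,
                word.text,    # tokenization
                word.lemma,   # lemmatization
                word.upos,    # universal part-of-speech tag
                word.deprel,  # dependency relation to the head
                head_text,
            ))
    return rows
```

Unlike spaCy, where the root token points to itself, Stanza marks the root with a head index of 0, which is why the lookup is guarded.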