Text preprocessing

This module provides sentence splitting, tokenization, part-of-speech tagging, lemmatization, and dependency parsing. Two backends are available for text preprocessing: spaCy and Stanza.

Usage:
    medtext-preprocess stanza [--overwrite] -i FILE -o FILE
    medtext-preprocess spacy [--overwrite --spacy-model NAME] -i FILE -o FILE
    medtext-preprocess download spacy [--spacy-model NAME]
    medtext-preprocess download stanza

Options:
    -i FILE             Input file
    -o FILE             Output file
    --overwrite         Overwrite the output file if it already exists
    --spacy-model NAME  spaCy trained model [default: en_core_web_sm]
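When the tool is driven from a Python script rather than a shell, the usage patterns above can be assembled into an argument list. The following is a minimal sketch: `build_command` and `run_preprocess` are hypothetical helpers, not part of this package, and they use only the flags listed above.

```python
import subprocess


def build_command(backend, input_path, output_path, overwrite=False, spacy_model=None):
    """Assemble a medtext-preprocess argument list from the documented options."""
    if backend not in ("stanza", "spacy"):
        raise ValueError(f"unknown backend: {backend!r}")
    if spacy_model is not None and backend != "spacy":
        raise ValueError("--spacy-model only applies to the spacy backend")

    cmd = ["medtext-preprocess", backend]
    if overwrite:
        cmd.append("--overwrite")
    if spacy_model is not None:
        cmd += ["--spacy-model", spacy_model]
    cmd += ["-i", input_path, "-o", output_path]
    return cmd


def run_preprocess(*args, **kwargs):
    """Run the command; requires medtext-preprocess to be installed and on PATH."""
    return subprocess.run(build_command(*args, **kwargs), check=True)
```

For example, `build_command("spacy", "in.xml", "out.xml", overwrite=True)` returns the argument list for the second usage line; the file names here are placeholders.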

spacy

spaCy is an open-source Python library for Natural Language Processing.

import spacy
from medtext_preprocess.models.preprocess_spacy import BioCSpacy

# argv holds the parsed command-line options; '--spacy-model' defaults to en_core_web_sm
nlp = spacy.load(argv['--spacy-model'])
processor = BioCSpacy(nlp)
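To see which annotations a spaCy pipeline contributes, it can help to inspect a parsed document directly. The sketch below is independent of `BioCSpacy`: `token_rows` and `demo` are hypothetical helpers that read only standard spaCy token attributes, and `demo` assumes the `en_core_web_sm` model has already been downloaded.

```python
def token_rows(doc):
    """Yield (sentence index, text, POS tag, lemma, dependency label, head text) per token."""
    for sent_idx, sent in enumerate(doc.sents):
        for token in sent:
            yield (sent_idx, token.text, token.pos_, token.lemma_, token.dep_, token.head.text)


def demo(text, model_name="en_core_web_sm"):
    """Parse text with a spaCy model and print one tab-separated row per token."""
    import spacy  # imported here so token_rows can be used without loading spaCy

    nlp = spacy.load(model_name)
    for row in token_rows(nlp(text)):
        print(*row, sep="\t")
```

Calling `demo("The patient denies chest pain.")` prints one row per token, grouped by sentence index.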

stanza

Stanza is a collection of efficient tools for Natural Language Processing.

import stanza
from medtext_preprocess.models.preprocess_stanza import BioCStanza

nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
processor = BioCStanza(nlp)
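The same kind of inspection works for Stanza, whose documents expose sentences and words rather than spaCy-style tokens. This sketch is independent of `BioCStanza`: `word_rows` and `demo` are hypothetical helpers, and `demo` assumes the English models have already been downloaded.

```python
def word_rows(doc):
    """Yield (sentence index, text, UPOS tag, lemma, dependency relation, head text) per word."""
    for sent_idx, sentence in enumerate(doc.sentences):
        for word in sentence.words:
            # word.head is the 1-based position of the governor in the sentence; 0 marks the root.
            head_text = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
            yield (sent_idx, word.text, word.upos, word.lemma, word.deprel, head_text)


def demo(text):
    """Parse text with the Stanza pipeline used above and print one row per word."""
    import stanza  # imported here so word_rows can be used without loading Stanza

    nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
    for row in word_rows(nlp(text)):
        print(*row, sep="\t")
```

Unlike spaCy, Stanza stores the head as an index rather than an object reference, which is why `word_rows` looks the head word up by position.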