Data conversion

This section describes how to convert between the BioC data model, plain text, and the OMOP CDM NOTE table.

BioC

MedText uses the BioC format as its unified interface. BioC is a simple format for sharing text data and annotations, and it can represent many kinds of annotation. Its data model covers a collection of documents, their passages and sentences, and annotations on individual tokens together with the relations between them. It can therefore carry information at several levels of granularity, which suits a wide range of common text-processing tasks. A minimal BioC collection with two documents looks like this:

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<collection>
  <source>SOURCE</source>
  <date>DATE</date>
  <key>KEY</key>
  <document>
    <id>0001</id>
    <passage>
      <offset>0</offset>
      <text>FINDINGS:...</text>
    </passage>
    <passage>
      <offset>120</offset>
      <text>IMPRESSION:...</text>
    </passage>
  </document>
  <document>
    <id>0002</id>
    <passage>
      <offset>0</offset>
      <text>FINDINGS:...</text>
    </passage>
    <passage>
      <offset>170</offset>
      <text>IMPRESSION:...</text>
    </passage>
  </document>
</collection>
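
To make the structure above concrete, here is a minimal sketch that reads such a collection with Python's standard library. It is not MedText's own reader; the function name read_passages and the shortened sample string are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the collection above (one document, two passages).
BIOC_XML = """<collection>
  <source>SOURCE</source>
  <date>DATE</date>
  <key>KEY</key>
  <document>
    <id>0001</id>
    <passage>
      <offset>0</offset>
      <text>FINDINGS:...</text>
    </passage>
    <passage>
      <offset>120</offset>
      <text>IMPRESSION:...</text>
    </passage>
  </document>
</collection>"""


def read_passages(xml_string):
    """Return a list of (document id, passage offset, passage text) tuples.

    Hypothetical helper for illustration; not part of MedText.
    """
    root = ET.fromstring(xml_string)
    rows = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for passage in document.iter("passage"):
            offset = int(passage.findtext("offset"))
            text = passage.findtext("text")
            rows.append((doc_id, offset, text))
    return rows


passages = read_passages(BIOC_XML)
for doc_id, offset, text in passages:
    print(doc_id, offset, text)
```

Each passage carries its own offset, so the position of a passage within the original report can be recovered without re-reading the source text.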

Warning

If you have a large number of reports, split them across several BioC files, for example 5000 reports per file.
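
One way to follow this recommendation is to batch the reports before writing them out. The sketch below shows only the batching step; chunk_reports is a hypothetical helper, not a MedText function, and the report strings are placeholders.

```python
def chunk_reports(reports, chunk_size=5000):
    """Yield consecutive slices of at most chunk_size reports.

    Hypothetical helper for illustration; not part of MedText.
    """
    for start in range(0, len(reports), chunk_size):
        yield reports[start:start + chunk_size]


# Placeholder data: 12000 dummy reports.
reports = [f"report {i}" for i in range(12000)]
chunks = list(chunk_reports(reports))
print([len(chunk) for chunk in chunks])  # [5000, 5000, 2000]
```

Each chunk can then be written to its own BioC file.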

OMOP CDM NOTE and NOTE_NLP tables

MedText also provides a tool to convert an OMOP CDM NOTE table (in CSV format) into a BioC collection. By default, the note_id column stores the report IDs and the note_text column stores the report text.

# Convert from CSV to BioC
$ medtext-csv2bioc -i /path/to/csv_file.csv -o /path/to/bioc_file.xml

# Convert from NOTE table to BioC
$ medtext-cdm2bioc -i /path/to/csv_file.csv -o /path/to/bioc_file.xml
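
For readers who want a rough picture of what such a conversion involves, the following sketch maps the note_id and note_text columns onto BioC documents using only the Python standard library. It is not how the MedText commands are implemented; the function csv_to_bioc, the single passage per note at offset 0, and the sample rows are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_bioc(csv_file, id_col="note_id", text_col="note_text"):
    """Build a BioC-style <collection> element from an open CSV file.

    Hypothetical helper for illustration; not part of MedText.
    Assumption: each note becomes one document with one passage at offset 0.
    """
    collection = ET.Element("collection")
    # Empty metadata placeholders, mirroring the layout of a BioC collection.
    for tag in ("source", "date", "key"):
        ET.SubElement(collection, tag)
    for row in csv.DictReader(csv_file):
        document = ET.SubElement(collection, "document")
        ET.SubElement(document, "id").text = row[id_col]
        passage = ET.SubElement(document, "passage")
        ET.SubElement(passage, "offset").text = "0"
        ET.SubElement(passage, "text").text = row[text_col]
    return collection


# Made-up sample rows standing in for a NOTE table export.
sample = io.StringIO(
    "note_id,note_text\n"
    '0001,"FINDINGS: clear lungs."\n'
    '0002,"IMPRESSION: normal."\n'
)
collection = csv_to_bioc(sample)
bioc_xml = ET.tostring(collection, encoding="unicode")
print(bioc_xml)
```

The column names are parameters here so that a CSV file with different headers can be handled by the same sketch.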