Data conversion

This section describes how to seamlessly convert between BioC data model, the plain text, and CDM NOTE table.

BioC

MedText uses the BioC format as the unified interface. BioC is a simple format to share text data and annotations. It allows a large number of different annotations to be represented. The BioC data model can represent a broad range of data elements from a collection of documents through passages, sentences, down to annotations on individual tokens and relations between them. Thus it is suitable for reflecting information at different levels and is appropriate for a wide range of common tasks.

<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<collection>
  <source>SOURCE</source>
  <date>DATE</date>
  <key>KEY</key>
  <document>
    <id>0001</id>
    <passage>
      <offset>0</offset>
      <text>FINDINGS:...</text>
    </passage>
    <passage>
      <offset>120</offset>
      <text>IMPRESSION:...</text>
    </passage>  
  </document>
  <document>
    <id>0002</id>
    <passage>
      <offset>0</offset>
      <text>FINDINGS:...</text>
    </passage>
    <passage>
      <offset>170</offset>
      <text>IMPRESSION:...</text>
    </passage>
  </document>
</collection>

Warning

If you have lots of reports, it is recommended to put them into several BioC files, for example, 5000 reports per BioC file.

OMOP CDM NOTE and NOTE_NLP tables

MedText also offers a tool to convert from OMOP CDM NOTE table (in the CSV format) to the BioC collection. By default, column note_id stores the report ids, and column note_text stores the reports.

# Convert from csv to BioC
$ medtext-csv2bioc -i /path/to/csv_file.csv -o /path/to/bioc_file.xml

# Convert from NOTE table to BioC
$ medtext-cdm2bioc -i /path/to/csv_file.csv -o /path/to/bioc_file.xml