bioc_supplementary

autocorpus.bioc_supplementary

This module converts text and tables extracted from supplementary files (such as Word documents and PDFs) into BioC format.

Classes

BioCTableConverter

Converts tables from nested lists into a BioC table object.

Functions

build_bioc(table_data, input_file) staticmethod

Builds a BioCTableCollection object from the provided table data and input file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| table_data | list[DataFrame] | List of pandas DataFrames representing tables. | required |
| input_file | str | The path to the input file. | required |

Source code in autocorpus/bioc_supplementary.py
@staticmethod
def build_bioc(table_data: list[DataFrame], input_file: str) -> BioCTableCollection:
    """Builds a BioCTableCollection object from the provided table data and input file.

    Args:
        table_data (list[DataFrame]): List of pandas DataFrames representing tables.
        input_file (str): The path to the input file.
    """
    bioc = BioCTableCollection()
    bioc.source = "Auto-CORPus (supplementary)"
    bioc.date = datetime.date.today().strftime("%Y%m%d")
    bioc.key = "autocorpus_supplementary.key"
    bioc.infons = {}
    bioc.documents = BioCTableConverter.__build_tables(table_data, input_file)
    return bioc
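
As a rough illustration of the metadata that build_bioc stamps on the collection, the sketch below reproduces the three fixed fields in a plain dictionary. The helper name collection_metadata and the use of a dict in place of a BioCTableCollection are assumptions made so the snippet runs without autocorpus installed; the date is passed in explicitly to keep the result deterministic.

```python
import datetime


def collection_metadata(today: datetime.date) -> dict[str, str]:
    """Hypothetical helper mirroring the fixed fields set in build_bioc."""
    return {
        "source": "Auto-CORPus (supplementary)",
        # Same format string as the source listing: compact YYYYMMDD.
        "date": today.strftime("%Y%m%d"),
        "key": "autocorpus_supplementary.key",
    }


meta = collection_metadata(datetime.date(2024, 3, 5))
print(meta["date"])  # 20240305
```

The date string has no separators, so every collection built on the same day carries the same date value.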

BioCTextConverter

Converts text content into a BioC format for supplementary material processing.

Functions

build_bioc(text, input_file, file_type) staticmethod

Builds a BioCCollection object from the provided text, input file, and file type.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str \| list[WordText] | The text content to be converted. | required |
| input_file | str | The path to the input file. | required |
| file_type | str | The type of the input file ('word' or 'pdf'). | required |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| BioCCollection | BioCCollection | The constructed BioCCollection object. |

Source code in autocorpus/bioc_supplementary.py
@staticmethod
def build_bioc(
    text: str | list[WordText], input_file: str, file_type: str
) -> BioCCollection:
    """Builds a BioCCollection object from the provided text, input file, and file type.

    Args:
        text: The text content to be converted.
        input_file: The path to the input file.
        file_type: The type of the input file ('word' or 'pdf').

    Returns:
        BioCCollection: The constructed BioCCollection object.
    """
    bioc = BioCCollection()
    bioc.source = "Auto-CORPus (supplementary)"
    bioc.date = datetime.date.today().strftime("%Y%m%d")
    bioc.key = "autocorpus_supplementary.key"
    temp_doc = BioCDocument(id="1")
    if file_type == "word":
        text = cast(list[WordText], text)
        temp_doc.passages = BioCTextConverter.__identify_word_passages(text)
    elif file_type == "pdf":
        text = cast(str, text)
        temp_doc.passages = BioCTextConverter.__identify_passages(text)
    else:
        text = cast(str, text)
        temp_doc.passages = BioCTextConverter.__identify_passages(text)
    temp_doc.inputfile = input_file
    bioc.documents.append(temp_doc)
    return bioc
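
Because typing.cast performs no check at runtime, the listing above relies on the caller to pair file_type with the right kind of text: a list of WordText items for 'word', and a plain string for 'pdf' or any other value. The sketch below shows one way a caller might verify that pairing before calling build_bioc. The WordText stand-in (with assumed field types str and bool) and the helper text_matches_file_type are hypothetical and are not part of autocorpus.

```python
from dataclasses import dataclass


@dataclass
class WordText:
    """Stand-in for autocorpus' WordText dataclass; field types are assumed."""

    text: str
    is_header: bool


def text_matches_file_type(text: object, file_type: str) -> bool:
    """Hypothetical guard: does `text` have the shape build_bioc expects?"""
    if file_type == "word":
        # The 'word' branch treats text as a list of WordText items.
        return isinstance(text, list) and all(isinstance(t, WordText) for t in text)
    # 'pdf' and every other file_type fall through to the plain-string branch.
    return isinstance(text, str)


word_input = [WordText("Methods", True), WordText("Samples were collected.", False)]
print(text_matches_file_type(word_input, "word"))  # True
print(text_matches_file_type("raw text", "word"))  # False
```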

WordText(text, is_header) dataclass

Represents a text element extracted from a Word document.

Functions

extract_table_from_pdf_text(text)

Extracts tables from PDF text and returns the remaining text and parsed tables.

Source code in autocorpus/bioc_supplementary.py
def extract_table_from_pdf_text(text: str) -> tuple[str, list[DataFrame]]:
    """Extracts tables from PDF text and returns the remaining text and parsed tables."""
    main_text_lines, raw_tables = _split_text_and_tables(text)
    tables_output = _parse_tables(raw_tables)
    text_output = "\n\n".join(main_text_lines)
    return text_output, tables_output

replace_unicode(text)

Replaces specific Unicode characters with their corresponding replacements in the given text.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str or list | The input text or list of texts to process. | required |

Returns:

| Type | Description |
| --- | --- |
| T | str or list: The processed text or list of processed texts. |

The following characters are replaced:

- U+00A0 (no-break space): replaced with a space ' '
- U+00AD (soft hyphen): replaced with a hyphen '-'
- U+2010 (hyphen): replaced with a hyphen '-'
- U+00D7 (multiplication sign): replaced with a lowercase 'x'

If the input text is a list, each non-empty element is processed and empty or None elements are dropped, so the returned list can be shorter than the input.

If the input text is a string, the replacements are applied to it directly, and an empty string is returned unchanged. The source below has no branch that returns None; passing None would raise an AttributeError, because None has no replace method.

Source code in autocorpus/bioc_supplementary.py
def replace_unicode(text: T) -> T:
    """Replaces specific Unicode characters with their corresponding replacements in the given text.

    Args:
        text (str or list): The input text or list of texts to process.

    Returns:
        str or list: The processed text or list of processed texts.

    If the input `text` is an empty string, it is returned unchanged. Empty or None elements of a list input are skipped.

    If the input `text` is a list, it iterates over each element of the list and replaces the following Unicode characters:
        - '\u00a0': Replaced with a space ' '
        - '\u00ad': Replaced with a hyphen '-'
        - '\u2010': Replaced with a hyphen '-'
        - '\u00d7': Replaced with a lowercase 'x'

    If the input `text` is not a list, it directly replaces the Unicode characters mentioned above.

    Returns the processed text or list of processed texts.
    """
    if isinstance(text, list):
        clean_texts = []
        for t in text:
            if t:
                clean_texts.append(string_replace_unicode(t))
        return clean_texts
    else:
        clean_text = string_replace_unicode(text)
        return clean_text
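
To make the list handling concrete, the sketch below copies the two functions from this page (with the type variable T omitted) so that it runs without autocorpus installed, then calls replace_unicode on a list that contains an empty string. The sample strings are invented for illustration.

```python
def string_replace_unicode(text: str) -> str:
    """Copy of the replacement chain shown in the source listing."""
    return (
        text.replace("\u00a0", " ")
        .replace("\u00ad", "-")
        .replace("\u2010", "-")
        .replace("\u00d7", "x")
    )


def replace_unicode(text):
    """Copy of replace_unicode with the same branching as the listing."""
    if isinstance(text, list):
        # Falsy elements (empty strings, None) are skipped, not cleaned.
        return [string_replace_unicode(t) for t in text if t]
    return string_replace_unicode(text)


cleaned_list = replace_unicode(["3\u00d74\u00a0grid", "", "non\u2010coding"])
print(cleaned_list)  # ['3x4 grid', 'non-coding']

cleaned_empty = replace_unicode("")
print(repr(cleaned_empty))  # ''
```

The empty element disappears from the result, so callers that rely on positional alignment between input and output lists should account for the shorter list.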

string_replace_unicode(text)

Replaces specific Unicode characters with their corresponding replacements in the given text.

Source code in autocorpus/bioc_supplementary.py
def string_replace_unicode(text: str) -> str:
    """Replaces specific Unicode characters with their corresponding replacements in the given text."""
    return (
        text.replace("\u00a0", " ")
        .replace("\u00ad", "-")
        .replace("\u2010", "-")
        .replace("\u00d7", "x")
    )
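
One possible alternative to the chained replace calls, which is not what the module itself does, is a single str.translate pass over a translation table. The sketch below assumes the same four character mappings as the listing above; the names UNICODE_MAP and translate_unicode are hypothetical.

```python
# Translation table with the same four mappings as string_replace_unicode.
UNICODE_MAP = str.maketrans(
    {
        "\u00a0": " ",  # no-break space -> space
        "\u00ad": "-",  # soft hyphen -> hyphen-minus
        "\u2010": "-",  # hyphen -> hyphen-minus
        "\u00d7": "x",  # multiplication sign -> lowercase x
    }
)


def translate_unicode(text: str) -> str:
    """Hypothetical equivalent that applies all replacements in one pass."""
    return text.translate(UNICODE_MAP)


print(translate_unicode("5\u00d710\u00a0mL, non\u2010linear"))  # 5x10 mL, non-linear
```

Since every mapping here is one character to one character, the two approaches should give identical results; the table form mainly keeps the mappings in one place if more characters are added later.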