bioc_supplementary

`autocorpus.bioc_supplementary`

This module converts text and tables extracted from supplementary files, such as Word documents and PDFs, into BioC format.

Classes

`BioCTableConverter`

Converts tables from nested lists into a BioC table object.

Functions

`build_bioc(table_data, input_file)` `staticmethod`

Builds a BioCTableCollection object from the provided table data and input file.

Parameters:

Name	Type	Description	Default
`table_data`	`list[DataFrame]`	List of pandas DataFrames representing tables.	required
`input_file`	`str`	The path to the input file.	required

Returns:

Name	Type	Description
`BioCTableCollection`	`BioCTableCollection`	The constructed BioCTableCollection object.

Source code in autocorpus/bioc_supplementary.py

@staticmethod
def build_bioc(table_data: list[DataFrame], input_file: str) -> BioCTableCollection:
    """Builds a BioCTableCollection object from the provided table data and input file.

    Args:
        table_data (list[DataFrame]): List of pandas DataFrames representing tables.
        input_file (str): The path to the input file.
    """
    bioc = BioCTableCollection()
    bioc.source = "Auto-CORPus (supplementary)"
    bioc.date = datetime.date.today().strftime("%Y%m%d")
    bioc.key = "autocorpus_supplementary.key"
    bioc.infons = {}
    bioc.documents = BioCTableConverter.__build_tables(table_data, input_file)
    return bioc
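
The collection-level metadata set in the listing above can be reproduced without the package installed. The following is a minimal sketch, not library code: `collection_header` is a hypothetical helper, and a plain dictionary stands in for the `BioCTableCollection` object. It mainly shows the `YYYYMMDD` date stamp produced by `strftime("%Y%m%d")`.

```python
import datetime


def collection_header(today: datetime.date) -> dict:
    """Hypothetical helper mirroring the metadata that build_bioc assigns."""
    return {
        "source": "Auto-CORPus (supplementary)",
        # Compact date stamp with no separators, e.g. 20240305.
        "date": today.strftime("%Y%m%d"),
        "key": "autocorpus_supplementary.key",
        "infons": {},
    }


# A fixed date is used here so the result is reproducible;
# build_bioc itself uses datetime.date.today().
header = collection_header(datetime.date(2024, 3, 5))
print(header["date"])  # 20240305
```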

`BioCTextConverter`

Converts text content into a BioC format for supplementary material processing.

Functions

`build_bioc(text, input_file, file_type)` `staticmethod`

Builds a BioCCollection object from the provided text, input file, and file type.

Parameters:

Name	Type	Description	Default
`text`	`str | list[WordText]`	The text content to be converted.	required
`input_file`	`str`	The path to the input file.	required
`file_type`	`str`	The type of the input file ('word' or 'pdf'). Any value other than 'word' is handled as plain text, the same as 'pdf'.	required

Returns:

Name	Type	Description
`BioCCollection`	`BioCCollection`	The constructed BioCCollection object.

Source code in autocorpus/bioc_supplementary.py

@staticmethod
def build_bioc(
    text: str | list[WordText], input_file: str, file_type: str
) -> BioCCollection:
    """Builds a BioCCollection object from the provided text, input file, and file type.

    Args:
        text: The text content to be converted.
        input_file: The path to the input file.
        file_type: The type of the input file ('word' or 'pdf').

    Returns:
        BioCCollection: The constructed BioCCollection object.
    """
    bioc = BioCCollection()
    bioc.source = "Auto-CORPus (supplementary)"
    bioc.date = datetime.date.today().strftime("%Y%m%d")
    bioc.key = "autocorpus_supplementary.key"
    temp_doc = BioCDocument(id="1")
    if file_type == "word":
        text = cast(list[WordText], text)
        temp_doc.passages = BioCTextConverter.__identify_word_passages(text)
    elif file_type == "pdf":
        text = cast(str, text)
        temp_doc.passages = BioCTextConverter.__identify_passages(text)
    else:
        text = cast(str, text)
        temp_doc.passages = BioCTextConverter.__identify_passages(text)
    temp_doc.inputfile = input_file
    bioc.documents.append(temp_doc)
    return bioc
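
The dispatch on `file_type` in the listing above can be sketched in isolation. Everything in this example is a hypothetical stand-in: `ConverterSketch` is not the library class, `(text, is_header)` tuples replace `WordText`, and the two private helpers use made-up passage rules, because the real passage-identification logic is not shown on this page. The sketch also illustrates that double-underscore static methods can be called by their short name only from inside the class body, where Python applies name mangling.

```python
class ConverterSketch:
    """Hypothetical stand-in for BioCTextConverter; not the library class."""

    @staticmethod
    def __word_passages(items: list[tuple[str, bool]]) -> list[str]:
        # Assumed rule: tag each (text, is_header) pair as header or paragraph.
        return [("H: " if is_header else "P: ") + text for text, is_header in items]

    @staticmethod
    def __plain_passages(text: str) -> list[str]:
        # Assumed rule: one passage per block separated by a blank line.
        return [block.strip() for block in text.split("\n\n") if block.strip()]

    @staticmethod
    def build(text, file_type: str) -> list[str]:
        if file_type == "word":
            return ConverterSketch.__word_passages(text)
        # "pdf" and any other value share the plain-text path.
        return ConverterSketch.__plain_passages(text)


word_result = ConverterSketch.build([("Methods", True), ("We did X.", False)], "word")
pdf_result = ConverterSketch.build("First block\n\nSecond block", "pdf")
other_result = ConverterSketch.build("First block\n\nSecond block", "txt")
```

Outside the class, the helpers are reachable only under their mangled names, for example `ConverterSketch._ConverterSketch__word_passages`.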

`WordText(text, is_header)` `dataclass`

Represents a text element extracted from a Word document.
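
As a rough illustration of the shape of this dataclass, the sketch below redefines it locally. The field names come from the signature above; the field types (`str` and `bool`) are assumptions, since the page does not list them.

```python
from dataclasses import dataclass


@dataclass
class WordText:
    """Local stand-in for the library dataclass; field types are assumed."""

    text: str        # assumed: the extracted paragraph text
    is_header: bool  # assumed: True when the paragraph is a heading


items = [
    WordText("Supplementary Methods", True),
    WordText("Samples were processed in batches.", False),
]

# Pick out the heading texts from a list of extracted elements.
headers = [item.text for item in items if item.is_header]
print(headers)  # ['Supplementary Methods']
```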

Functions

`extract_table_from_pdf_text(text)`

Extracts tables from PDF text and returns a tuple of the remaining text, with its lines joined by blank lines, and the parsed tables as a list of DataFrames.

Source code in autocorpus/bioc_supplementary.py

def extract_table_from_pdf_text(text: str) -> tuple[str, list[DataFrame]]:
    """Extracts tables from PDF text and returns the remaining text and parsed tables."""
    main_text_lines, raw_tables = _split_text_and_tables(text)
    tables_output = _parse_tables(raw_tables)
    text_output = "\n\n".join(main_text_lines)
    return text_output, tables_output
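
The split, parse, and join flow of the listing above can be sketched as follows. The helpers `_split_text_and_tables` and `_parse_tables` are not shown on this page, so the versions here are hypothetical stand-ins with an invented rule (consecutive lines containing `|` form one table), and tables are returned as nested lists rather than DataFrames to keep the example free of dependencies.

```python
def split_text_and_tables(text: str) -> tuple[list[str], list[list[str]]]:
    """Hypothetical splitter: consecutive lines containing '|' form one table."""
    main_text_lines: list[str] = []
    raw_tables: list[list[str]] = []
    current: list[str] = []
    for line in text.splitlines():
        if "|" in line:
            current.append(line)
            continue
        if current:  # a non-table line closes the table being collected
            raw_tables.append(current)
            current = []
        if line.strip():
            main_text_lines.append(line.strip())
    if current:
        raw_tables.append(current)
    return main_text_lines, raw_tables


def parse_tables(raw_tables: list[list[str]]) -> list[list[list[str]]]:
    """Hypothetical parser: split each row on '|' and trim the cells."""
    return [
        [[cell.strip() for cell in row.strip().strip("|").split("|")] for row in table]
        for table in raw_tables
    ]


def extract_sketch(text: str) -> tuple[str, list[list[list[str]]]]:
    main_text_lines, raw_tables = split_text_and_tables(text)
    # As in the listing above, remaining text lines are joined with blank lines.
    return "\n\n".join(main_text_lines), parse_tables(raw_tables)


sample = "Intro line\n| gene | score |\n| TP53 | 0.9 |\nClosing line"
text_out, tables_out = extract_sketch(sample)
```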

`replace_unicode(text)`

Replaces specific Unicode characters with ASCII equivalents in a string or in every string of a list.

Parameters:

Name	Type	Description	Default
`text`	`str or list`	The input text or list of texts to process.	required

Returns:

Type	Description
`T`	The processed text or list of processed texts, matching the form of the input.

If the input is a list, each element is processed in turn and falsy elements (empty strings or None) are skipped, so the returned list can be shorter than the input. If the input is not a list, it is passed directly to `string_replace_unicode` and must therefore be a string; an empty string is returned unchanged.

The following characters are replaced:

- U+00A0 (no-break space): replaced with a space ' '
- U+00AD (soft hyphen): replaced with a hyphen '-'
- U+2010 (hyphen): replaced with a hyphen '-'
- U+00D7 (multiplication sign): replaced with a lowercase 'x'

Source code in autocorpus/bioc_supplementary.py

def replace_unicode(text: T) -> T:
    """Replaces specific Unicode characters with their corresponding replacements in the given text.

    Args:
        text (str or list): The input text or list of texts to process.

    Returns:
        str or list: The processed text or list of processed texts.

    If the input `text` is empty or None, the function returns None.

    If the input `text` is a list, it iterates over each element of the list and replaces the following Unicode characters:
        - '\u00a0': Replaced with a space ' '
        - '\u00ad': Replaced with a hyphen '-'
        - '\u2010': Replaced with a hyphen '-'
        - '\u00d7': Replaced with a lowercase 'x'

    If the input `text` is not a list, it directly replaces the Unicode characters mentioned above.

    Returns the processed text or list of processed texts.
    """
    if isinstance(text, list):
        clean_texts = []
        for t in text:
            if t:
                clean_texts.append(string_replace_unicode(t))
        return clean_texts
    else:
        clean_text = string_replace_unicode(text)
        return clean_text

`string_replace_unicode(text)`

Replaces the no-break space (U+00A0) with a space, the soft hyphen (U+00AD) and the Unicode hyphen (U+2010) with an ASCII hyphen '-', and the multiplication sign (U+00D7) with a lowercase 'x'.

Source code in autocorpus/bioc_supplementary.py

def string_replace_unicode(text: str) -> str:
    """Replaces specific Unicode characters with their corresponding replacements in the given text."""
    return (
        text.replace("\u00a0", " ")
        .replace("\u00ad", "-")
        .replace("\u2010", "-")
        .replace("\u00d7", "x")
    )
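
To try the two functions without installing the package, the sketch below copies their logic from the listings above (type variable and docstrings omitted, list branch written as a comprehension) and runs them on a few sample strings. The sample inputs are made up for illustration.

```python
def string_replace_unicode(text: str) -> str:
    # Same four replacements as the listing above.
    return (
        text.replace("\u00a0", " ")
        .replace("\u00ad", "-")
        .replace("\u2010", "-")
        .replace("\u00d7", "x")
    )


def replace_unicode(text):
    if isinstance(text, list):
        # Falsy elements (empty strings, None) are dropped, as in the original loop.
        return [string_replace_unicode(t) for t in text if t]
    return string_replace_unicode(text)


single = replace_unicode("5\u00a0mg \u00d7 2")
many = replace_unicode(["co\u2010author", "", "a\u00adb"])
print(single)  # 5 mg x 2
print(many)    # ['co-author', 'a-b']
```

Note that the list result has two elements for three inputs, because the empty string is skipped.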
