`autocorpus.html`

The Auto-CORPus HTML processing module.

Functions

`load_html_file(fpath)`

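Convert the input file into a BeautifulSoup object. Elements whose inline `style` attribute is exactly `display:none` or `visibility:hidden` are removed from the returned tree.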

Parameters:

Name	Type	Description	Default
`fpath`	`Path`	Path to the input file.	required

Returns:

Type	Description
`BeautifulSoup`	BeautifulSoup object of the input file.

Source code in autocorpus/html.py

def load_html_file(fpath: Path) -> BeautifulSoup:
    """Convert the input file into a BeautifulSoup object.

    Args:
        fpath: Path to the input file.

    Returns:
        BeautifulSoup object of the input file.
    """
    with fpath.open(encoding="utf-8") as fp:
        soup = BeautifulSoup(fp.read(), "html.parser")
        for e in soup.find_all(attrs={"style": ["display:none", "visibility:hidden"]}):
            e.extract()
        return soup
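
As a minimal sketch of the hidden-element filtering in the listing above, the following example parses a small temporary file and checks which paragraphs survive. The helper name `load_and_strip_hidden` and the sample markup are illustrative assumptions; only the `find_all` filter and the `extract()` call are taken from the source. It assumes the `beautifulsoup4` package is installed.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def load_and_strip_hidden(fpath: Path) -> BeautifulSoup:
    """Parse an HTML file and drop elements hidden via an inline style."""
    with fpath.open(encoding="utf-8") as fp:
        soup = BeautifulSoup(fp.read(), "html.parser")
    # A list value matches elements whose style attribute equals any entry exactly.
    hidden_styles = ["display:none", "visibility:hidden"]
    for element in soup.find_all(attrs={"style": hidden_styles}):
        element.extract()
    return soup


# Hypothetical sample document: one visible paragraph and two hidden variants.
SAMPLE_HTML = (
    "<html><body>"
    '<p id="shown">Visible text</p>'
    '<p id="hidden" style="display:none">Hidden text</p>'
    '<p id="spaced" style="display: none">Hidden, but written with a space</p>'
    "</body></html>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.html"
    sample_path.write_text(SAMPLE_HTML, encoding="utf-8")
    soup = load_and_strip_hidden(sample_path)

# Collect the ids of the paragraphs that are still in the tree.
remaining_ids = [p["id"] for p in soup.find_all("p")]
print(remaining_ids)
```

Because the filter compares the whole `style` string, a variant such as `display: none` (with a space) is left in place, which may matter for pages that format inline styles differently.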

`process_html_article(config, file_path, linked_tables=[])`

Create valid BioC versions of input HTML journal articles based on the supplied configuration.

Processes the main text file and tables specified in the configuration.

This function performs the following steps:

1. Checks whether a valid configuration is loaded. If not, raises a RuntimeError.
2. Handles the main text file:
    - Parses the HTML content of the file.
    - Extracts the main text from the parsed HTML.
    - Attempts to extract abbreviations from the main text and HTML content. If an error occurs during this process, it logs the error.
3. Processes linked tables, if any:
    - Parses the HTML content of each linked table file.
4. Merges table data.
5. Checks if there are any documents in the tables and sets the `has_tables` attribute accordingly.

Parameters:

Name	Type	Description	Default
`config`	`dict[str, Any]`	Configuration dictionary for the input journal articles.	required
`file_path`	`Path`	Path to the article file to be processed.	required
`linked_tables`	`list[Path]`	List of linked table file paths to include in this run (HTML files only).	`[]`

Returns:

Type	Description
`tuple[dict[str, Any], dict[str, Any], dict[str, Any]]`	A tuple `(main_text, abbreviations, tables)`: the extracted main text, the extracted abbreviations, and the extracted tables (possibly empty), each as a dictionary.

Raises:

Type	Description
`RuntimeError`	If no valid configuration is loaded.
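
The control flow described above can be sketched with stand-in callables, which may help when reasoning about which parts of the result are best-effort. Everything here is hypothetical: `process_article_sketch` and its callable parameters are not part of Auto-CORPus, and the empty-dictionary fallback for abbreviations is an assumption rather than documented behaviour.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)

Document = dict[str, Any]


def process_article_sketch(
    config: Document,
    extract_text: Callable[[], Document],
    extract_abbreviations: Callable[[Document], Document],
    extract_tables: Callable[[], Document],
) -> tuple[Document, Document, Document]:
    """Mirror the documented control flow using caller-supplied stand-ins."""
    # Step 1: an empty configuration is rejected up front.
    if config == {}:
        raise RuntimeError("A valid config file must be loaded.")

    # Step 2: extract the main text, then abbreviations on a best-effort basis.
    main_text = extract_text()
    abbreviations: Document = {}  # assumed fallback, not confirmed by the source
    try:
        abbreviations = extract_abbreviations(main_text)
    except Exception as error:
        logger.error(error)

    # Steps 3-4: tables are only handled when the configuration asks for them.
    if "tables" not in config:
        return main_text, abbreviations, {}
    return main_text, abbreviations, extract_tables()


def failing_abbreviations(_: Document) -> Document:
    """Stand-in extractor that always fails, to exercise the error path."""
    raise ValueError("no abbreviation section found")


result = process_article_sketch(
    config={"title": {}},  # non-empty, but without a "tables" entry
    extract_text=lambda: {"documents": []},
    extract_abbreviations=failing_abbreviations,
    extract_tables=lambda: {"documents": [{"id": "T1"}]},
)
print(result)
```

In this sketch a failed abbreviation extraction is logged and the call still returns the main text, while a configuration without a `"tables"` entry yields an empty tables dictionary.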

Source code in autocorpus/html.py

def process_html_article(
    config: dict[str, Any], file_path: Path, linked_tables: list[Path] = []
) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
    """Create valid BioC versions of input HTML journal articles based off config.

    Processes the main text file and tables specified in the configuration.

    This method performs the following steps:
    1. Checks if a valid configuration is loaded. If not, raises a RuntimeError.
    2. Handles the main text file:
        - Parses the HTML content of the file.
        - Extracts the main text from the parsed HTML.
        - Attempts to extract abbreviations from the main text and HTML content.
          If an error occurs during this process, it logs the error.
    3. Processes linked tables, if any:
        - Parses the HTML content of each linked table file.
    4. Merges table data.
    5. Checks if there are any documents in the tables and sets the `has_tables`
        attribute accordingly.

    Args:
        config: Configuration dictionary for the input journal articles
        file_path: Path to the article file to be processed
        linked_tables: list of linked table file paths to be included in this run
            (HTML files only)

    Returns:
        A tuple containing:
            - main_text: Extracted main text as a dictionary.
            - abbreviations: Extracted abbreviations as a dictionary.
            - tables: Extracted tables as a dictionary (possibly empty).

    Raises:
        RuntimeError: If no valid configuration is loaded.
    """
    if config == {}:
        raise RuntimeError("A valid config file must be loaded.")

    soup = load_html_file(file_path)
    main_text = _extract_text(soup, config)
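    # Default to an empty dict so a failed extraction cannot leave the name unbound.
    abbreviations: dict[str, Any] = {}
    try:
        abbreviations = get_abbreviations(main_text, soup, file_path)
    except Exception as e:
        logger.error(e)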

    if "tables" not in config:
        return main_text, abbreviations, dict()

    tables, empty_tables = get_table_json(soup, config, file_path)

    new_documents = []
    for table_file in linked_tables:
        soup = load_html_file(table_file)
        new_tables, new_empty_tables = get_table_json(soup, config, table_file)
        new_documents.extend(new_tables.get("documents", []))
        empty_tables.extend(new_empty_tables)
    tables["documents"] = _extend_tables_documents(
        tables.get("documents", []), new_documents
    )
    if empty_tables:
        tables["documents"] = _merge_tables_with_empty_tables(
            tables["documents"], empty_tables
        )

    return main_text, abbreviations, tables
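
The signature above uses a list literal as the default for `linked_tables`. Python evaluates default values once, when the function is defined, so a default list is shared between calls. The listing only iterates over `linked_tables`, so the shared default appears harmless there. The sketch below, using hypothetical helpers `collect_shared` and `collect_fresh` that are not part of Auto-CORPus, shows what would happen if such a default were mutated, alongside the common `None`-sentinel alternative.

```python
from pathlib import Path
from typing import Optional


def collect_shared(path: Path, seen: list[Path] = []) -> list[Path]:
    """Hypothetical helper that mutates its default list (the pitfall)."""
    seen.append(path)
    return seen


def collect_fresh(path: Path, seen: Optional[list[Path]] = None) -> list[Path]:
    """Hypothetical helper using a None sentinel, so each call starts clean."""
    seen = [] if seen is None else seen
    seen.append(path)
    return seen


# The shared default accumulates entries across calls.
first_shared = list(collect_shared(Path("a.html")))
second_shared = list(collect_shared(Path("b.html")))

# The sentinel version builds a new list on every call.
first_fresh = collect_fresh(Path("a.html"))
second_fresh = collect_fresh(Path("b.html"))

print(len(second_shared), len(second_fresh))
```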