html

autocorpus.html

The Auto-CORPus HTML processing module.

Attributes

Classes

Functions

load_html_file(fpath)

Convert the input file into a BeautifulSoup object.

Parameters:

Name    Type    Description               Default
fpath   Path    Path to the input file.   required

Returns:

Type            Description
BeautifulSoup   BeautifulSoup object of the input file.

Source code in autocorpus/html.py
def load_html_file(fpath: Path) -> BeautifulSoup:
    """Convert the input file into a BeautifulSoup object.

    Args:
        fpath: Path to the input file.

    Returns:
        BeautifulSoup object of the input file.
    """
    with fpath.open(encoding="utf-8") as fp:
        soup = BeautifulSoup(fp.read(), "html.parser")
        for e in soup.find_all(attrs={"style": ["display:none", "visibility:hidden"]}):
            e.extract()
        return soup
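
The hidden-element filter above can be tried on an in-memory string. This is a minimal sketch, assuming the third-party `bs4` package is installed; the sample markup is invented for illustration. It suggests that the filter compares the whole `style` value against the listed strings, so a variant such as `display: none` (with a space) is left in place.

```python
from bs4 import BeautifulSoup

# Invented sample markup: two hidden elements and one near miss.
html = (
    "<p>Visible</p>"
    '<p style="display:none">Hidden A</p>'
    '<span style="visibility:hidden">Hidden B</span>'
    '<p style="display: none">Still here</p>'
)

soup = BeautifulSoup(html, "html.parser")

# Same filter as load_html_file: the style value must equal one of
# the listed strings for the element to be removed.
for e in soup.find_all(attrs={"style": ["display:none", "visibility:hidden"]}):
    e.extract()

text = soup.get_text(" ", strip=True)
print(text)
```

Callers whose input uses other spellings of these style rules may want to normalise the markup before relying on this filter.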

process_html_article(config, file_path, linked_tables=[])

Create valid BioC versions of input HTML journal articles based off config.

Processes the main text file and tables specified in the configuration.

This method performs the following steps:

1. Checks if a valid configuration is loaded. If not, raises a RuntimeError.
2. Handles the main text file:
    - Parses the HTML content of the file.
    - Extracts the main text from the parsed HTML.
    - Attempts to extract abbreviations from the main text and HTML content. If an error occurs during this process, it logs the error.
3. Processes linked tables, if any:
    - Parses the HTML content of each linked table file.
4. Merges table data.
5. Checks if there are any documents in the tables and sets the has_tables attribute accordingly.

Parameters:

Name            Type             Description                                                                     Default
config          dict[str, Any]   Configuration dictionary for the input journal articles.                        required
file_path       Path             Path to the article file to be processed.                                       required
linked_tables   list[Path]       List of linked table file paths to be included in this run (HTML files only).   []

Returns:

Type: tuple[dict[str, Any], dict[str, Any], dict[str, Any]]

A tuple containing:
- main_text: Extracted main text as a dictionary.
- abbreviations: Extracted abbreviations as a dictionary.
- tables: Extracted tables as a dictionary (possibly empty).

Raises:

Type           Description
RuntimeError   If no valid configuration is loaded.

Source code in autocorpus/html.py
def process_html_article(
    config: dict[str, Any], file_path: Path, linked_tables: list[Path] = []
) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
    """Create valid BioC versions of input HTML journal articles based off config.

    Processes the main text file and tables specified in the configuration.

    This method performs the following steps:
    1. Checks if a valid configuration is loaded. If not, raises a RuntimeError.
    2. Handles the main text file:
        - Parses the HTML content of the file.
        - Extracts the main text from the parsed HTML.
        - Attempts to extract abbreviations from the main text and HTML content.
          If an error occurs during this process, it logs the error.
    3. Processes linked tables, if any:
        - Parses the HTML content of each linked table file.
    4. Merges table data.
    5. Checks if there are any documents in the tables and sets the `has_tables`
        attribute accordingly.

    Args:
        config: Configuration dictionary for the input journal articles
        file_path: Path to the article file to be processed
        linked_tables: list of linked table file paths to be included in this run
            (HTML files only)

    Returns:
        A tuple containing:
            - main_text: Extracted main text as a dictionary.
            - abbreviations: Extracted abbreviations as a dictionary.
            - tables: Extracted tables as a dictionary (possibly empty).

    Raises:
        RuntimeError: If no valid configuration is loaded.
    """
    if config == {}:
        raise RuntimeError("A valid config file must be loaded.")

    soup = load_html_file(file_path)
    main_text = _extract_text(soup, config)
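    try:
        abbreviations = get_abbreviations(main_text, soup, file_path)
    except Exception as e:
        # Fall back to an empty dictionary so the return statements below
        # never reference an unbound name.
        logger.error(e)
        abbreviations = {}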

    if "tables" not in config:
        return main_text, abbreviations, dict()

    tables, empty_tables = get_table_json(soup, config, file_path)

    new_documents = []
    for table_file in linked_tables:
        soup = load_html_file(table_file)
        new_tables, new_empty_tables = get_table_json(soup, config, table_file)
        new_documents.extend(new_tables.get("documents", []))
        empty_tables.extend(new_empty_tables)
    tables["documents"] = _extend_tables_documents(
        tables.get("documents", []), new_documents
    )
    if empty_tables:
        tables["documents"] = _merge_tables_with_empty_tables(
            tables["documents"], empty_tables
        )

    return main_text, abbreviations, tables
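
The control flow described in the docstring can be sketched with standard-library code only. Everything below is a hypothetical stand-in rather than the library's implementation: `extract_abbreviations_stub`, `table_json_stub` and `process_sketch` are invented names, the empty-dictionary fallback after a failed extraction is an assumption of this sketch, and a plain `extend` stands in for the merge helpers, whose behaviour is not shown on this page.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


def extract_abbreviations_stub(text: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for the real abbreviation extractor."""
    if not text.get("passages"):
        raise ValueError("no passages to scan")
    return {"documents": [{"abbreviations": ["BMI"]}]}


def table_json_stub(name: str) -> tuple[dict[str, Any], list[str]]:
    """Hypothetical stand-in returning (tables, empty_tables) for one file."""
    return {"documents": [{"id": name}]}, []


def process_sketch(
    config: dict[str, Any], main_text: dict[str, Any], linked: list[str]
) -> tuple[dict[str, Any], dict[str, Any], dict[str, Any]]:
    # Step 1: refuse to run without a configuration.
    if config == {}:
        raise RuntimeError("A valid config file must be loaded.")

    # Step 2: abbreviation extraction is best-effort (assumed fallback: {}).
    try:
        abbreviations = extract_abbreviations_stub(main_text)
    except Exception as e:
        logger.error(e)
        abbreviations = {}

    # No table rules in the config: return an empty tables dictionary.
    if "tables" not in config:
        return main_text, abbreviations, {}

    # Steps 3-4: gather documents from the main file and each linked file.
    tables, _empty = table_json_stub("main")
    for name in linked:
        new_tables, _new_empty = table_json_stub(name)
        tables["documents"].extend(new_tables.get("documents", []))
    return main_text, abbreviations, tables
```

In this sketch a configuration without a "tables" key returns early with an empty tables dictionary, mirroring the early return in the listed source, and linked files are only read when table rules are present.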