bioc_supplementary
autocorpus.bioc_supplementary
This module converts text extracted from various file types into BioC format.
Classes
BioCTableConverter
Converts tables from nested lists into a BioC table object.
Functions
build_bioc(table_data, input_file)
staticmethod
Builds a BioCTableCollection object from the provided table data and input file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_data | list[DataFrame] | List of pandas DataFrames representing tables. | required |
input_file | str | The path to the input file. | required |
Source code in autocorpus/bioc_supplementary.py, lines 143-157
BioCTextConverter
Converts text content into a BioC format for supplementary material processing.
Functions
build_bioc(text, input_file, file_type)
staticmethod
Builds a BioCCollection object from the provided text, input file, and file type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str \| list[WordText] | The text content to be converted. | required |
input_file | str | The path to the input file. | required |
file_type | str | The type of the input file ('word' or 'pdf'). | required |
Returns:
Name | Type | Description |
---|---|---|
BioCCollection | BioCCollection | The constructed BioCCollection object. |
Source code in autocorpus/bioc_supplementary.py, lines 212-242
WordText(text, is_header)
dataclass
Represents a text element extracted from a Word document.
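As a rough illustration of how such elements might be used, the sketch below defines a stand-in dataclass with the two fields named in the signature. The class name `WordTextSketch` and the field types (`str` and `bool`) are assumptions made for this example; they are not taken from the library source.

```python
from dataclasses import dataclass


@dataclass
class WordTextSketch:
    """Hypothetical stand-in for WordText; field types are assumed."""

    text: str        # the extracted text of one element
    is_header: bool  # whether the element is a heading


# A document represented as a list of elements, the shape that
# BioCTextConverter.build_bioc accepts alongside a plain string.
elements = [
    WordTextSketch("Supplementary Methods", True),
    WordTextSketch("Samples were collected at two sites.", False),
]

# Separate headings from body text using the is_header flag.
headers = [e.text for e in elements if e.is_header]
body = [e.text for e in elements if not e.is_header]
```

Keeping the header flag next to the text lets a converter decide how to label each passage without re-parsing the original document.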
Functions
extract_table_from_pdf_text(text)
Extracts tables from PDF text and returns the remaining text and parsed tables.
Source code in autocorpus/bioc_supplementary.py, lines 87-92
replace_unicode(text)
Replaces specific Unicode characters with their corresponding replacements in the given text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str or list | The input text or list of texts to process. | required |
Returns:
Type | Description |
---|---|
T | str or list: The processed text or list of processed texts. |
If the input `text` is empty or None, the function returns None.
If the input `text` is a list, the function iterates over each element and replaces the following Unicode characters:
- ' ': Replaced with a space ' '
- '': Replaced with a hyphen '-'
- '‐': Replaced with a hyphen '-'
- '×': Replaced with a lowercase 'x'
If the input `text` is not a list, the same replacements are applied to it directly.
In both cases the function returns the processed text or list of processed texts.
Source code in autocorpus/bioc_supplementary.py, lines 108-137
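The behaviour described for replace_unicode can be sketched as follows. This is a minimal re-implementation for illustration, not the library's code: the names `REPLACEMENTS`, `_replace_chars`, and `replace_unicode_sketch` are hypothetical, and the table includes only the hyphen (U+2010) and multiplication sign (U+00D7) mappings from the list above.

```python
from typing import Optional, Union

# Illustrative subset of the documented replacement table (assumed layout).
REPLACEMENTS = {
    "\u2010": "-",  # HYPHEN -> ASCII hyphen-minus
    "\u00d7": "x",  # MULTIPLICATION SIGN -> lowercase x
}


def _replace_chars(s: str) -> str:
    """Apply every mapping in REPLACEMENTS to a single string."""
    for src, dst in REPLACEMENTS.items():
        s = s.replace(src, dst)
    return s


def replace_unicode_sketch(
    text: Union[str, list, None],
) -> Optional[Union[str, list]]:
    """Mirror the documented behaviour: None for empty input,
    element-wise replacement for lists, direct replacement otherwise."""
    if not text:
        return None
    if isinstance(text, list):
        return [_replace_chars(item) for item in text]
    return _replace_chars(text)


single = replace_unicode_sketch("3\u00d74 grid")
many = replace_unicode_sketch(["pre\u2010treated", "plain"])
```

Routing both branches through one string-level helper keeps the list and non-list cases consistent, which is presumably why the module also exposes string_replace_unicode.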
string_replace_unicode(text)
Replaces specific Unicode characters with their corresponding replacements in the given text.
Source code in autocorpus/bioc_supplementary.py, lines 95-102