file_type
autocorpus.file_type
Contains utilities for identifying file types based on content and extension.
Classes
FileType
Bases: Enum
Enum for different file types.
Access the members as FileType.HTML, FileType.XML, and so on.
Attributes:
Name | Type | Description |
---|---|---|
HTML | | Represents an HTML file. |
XML | | Represents an XML file. |
PDF | | Represents a PDF file. |
WORD | | Represents a Word document (DOCX or DOC). |
UNKNOWN | | Represents any other file type that is not recognized. |
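The sketch below illustrates how the members listed above might be used. It defines a stand-in FileType enum so the example runs without autocorpus installed. The member values (generated with `auto()`) and the `describe` helper are assumptions made for illustration, not part of the documented API.

```python
from enum import Enum, auto


class FileType(Enum):
    """Stand-in mirroring the documented members; the values are assumed."""

    HTML = auto()
    XML = auto()
    PDF = auto()
    WORD = auto()
    UNKNOWN = auto()


def describe(file_type: FileType) -> str:
    """Return a short label for a member (hypothetical helper)."""
    # Enum members are singletons, so identity comparison is safe.
    if file_type is FileType.UNKNOWN:
        return "unrecognized file"
    return f"{file_type.name} file"


print(describe(FileType.HTML))
print([member.name for member in FileType])
```

Comparing members with `is` works because each Enum member exists exactly once per class.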
Functions
check_file_type(file_path)
Determines the type of a file based on its content and extension.
This function checks the file extension and then attempts to parse the file with the appropriate parser. If the file cannot be parsed or the file extension is not recognised, it is classified as FileType.UNKNOWN.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Path | The path to the file to be checked. | required |
Returns:
Type | Description |
---|---|
FileType | A FileType Enum value indicating the type of the file. |
Raises:
Type | Description |
---|---|
FileNotFoundError | If the provided path does not point to a file. |
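As a rough illustration of the contract described above (a Path in, a FileType out, FileNotFoundError for a non-file path), here is a simplified stand-in. It is not the autocorpus implementation: the name `sketch_check_file_type`, the extension mapping, and the enum values are all assumptions, and the parsing step that the documented function performs is omitted.

```python
import tempfile
from enum import Enum, auto
from pathlib import Path


class FileType(Enum):
    """Stand-in mirroring the documented members; the values are assumed."""

    HTML = auto()
    XML = auto()
    PDF = auto()
    WORD = auto()
    UNKNOWN = auto()


# Assumed extension mapping; the real module may recognise other suffixes.
_EXTENSIONS = {
    ".html": FileType.HTML,
    ".htm": FileType.HTML,
    ".xml": FileType.XML,
    ".pdf": FileType.PDF,
    ".docx": FileType.WORD,
    ".doc": FileType.WORD,
}


def sketch_check_file_type(file_path: Path) -> FileType:
    """Classify a file by extension only (hypothetical, simplified)."""
    if not file_path.is_file():
        raise FileNotFoundError(f"{file_path} is not a file")
    # Unrecognised extensions fall back to UNKNOWN, as in the documented contract.
    return _EXTENSIONS.get(file_path.suffix.lower(), FileType.UNKNOWN)


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "article.html"
    sample.write_text("<html></html>", encoding="utf-8")
    print(sketch_check_file_type(sample))

    try:
        sketch_check_file_type(Path(tmp) / "missing.xml")
    except FileNotFoundError as err:
        print(f"caught: {err}")
```

Callers that process mixed directories may want to catch FileNotFoundError and treat FileType.UNKNOWN as a signal to skip the file.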
Source code in autocorpus/file_type.py