mirror of https://github.com/qurator-spk/modstool.git synced 2025-06-08 19:29:57 +02:00

Extract the MODS/ALTO metadata of a bunch of METS/ALTO files into pandas DataFrames.


mods4pandas converts the MODS metadata from METS files into a pandas DataFrame.

Column names are derived from the corresponding MODS elements. Some domain knowledge is used to convert elements into useful columns, e.g. topics are stored as sets rather than ordered lists. Parts of the tool are specific to our environment and needs at the State Library Berlin and may need to be changed for your library.

Per-page information (e.g. structure information from the METS structMap) can be converted as well (--output-page-info).
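To illustrate the idea of deriving columns from MODS elements, here is a minimal sketch using only the Python standard library. It is not the tool's actual implementation: the function name flatten_mods, the sample record and the column names titleInfo_title and subject_topic are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# MODS v3 namespace in Clark notation, as used by ElementTree
MODS_NS = "{http://www.loc.gov/mods/v3}"

# A tiny hand-written MODS record (illustrative only)
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example title</title></titleInfo>
  <subject><topic>Maps</topic></subject>
  <subject><topic>Berlin</topic></subject>
  <subject><topic>Maps</topic></subject>
</mods>"""


def flatten_mods(xml_text):
    """Flatten one MODS record into a dict of column name -> value.

    Hypothetical helper: the column names are assumptions, not
    necessarily those produced by mods4pandas.
    """
    root = ET.fromstring(xml_text)
    row = {}

    # Single-valued element: the column name follows the element path
    title = root.find(f"{MODS_NS}titleInfo/{MODS_NS}title")
    if title is not None:
        row["titleInfo_title"] = title.text

    # Repeated element: collect into a set, so order and duplicates are dropped
    row["subject_topic"] = {t.text for t in root.iter(f"{MODS_NS}topic")}
    return row


row = flatten_mods(SAMPLE)
print(row["titleInfo_title"])
print(sorted(row["subject_topic"]))
```

Storing topics as a set reflects that their order in the record carries no meaning, so duplicates and ordering differences between records do not matter.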

alto4pandas converts the metadata from ALTO files into a pandas DataFrame.

Column names are derived from the corresponding ALTO elements. Some columns contain descriptive statistics (e.g. counts or mean) of the corresponding ALTO elements or attributes.
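The following sketch shows what such descriptive statistics could look like for the ALTO String element and its WC (word confidence) attribute. It is an illustration under assumptions, not the tool's code: the function summarize_strings, the sample page and the column names String_count and String_WC_mean are hypothetical, and the example only handles the ALTO v3 namespace.

```python
import statistics
import xml.etree.ElementTree as ET

# ALTO v3 namespace in Clark notation; real files may use other versions
ALTO_NS = "{http://www.loc.gov/standards/alto/ns-v3#}"

# A tiny hand-written ALTO page (illustrative only)
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hello" WC="0.9"/>
    <String CONTENT="world" WC="0.7"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def summarize_strings(xml_text):
    """Return a count of String elements and the mean of their WC values.

    Hypothetical helper: the column names are assumptions, not
    necessarily those produced by alto4pandas.
    """
    root = ET.fromstring(xml_text)
    strings = list(root.iter(f"{ALTO_NS}String"))

    # WC is optional in ALTO, so skip elements that lack it
    confidences = [float(s.get("WC")) for s in strings if s.get("WC") is not None]

    return {
        "String_count": len(strings),
        "String_WC_mean": statistics.mean(confidences) if confidences else None,
    }


stats = summarize_strings(SAMPLE)
print(stats)
```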

Usage

mods4pandas /path/to/a/directory/containing/mets_files
alto4pandas /path/to/a/directory/full/of/alto_files

Example

In this example we convert the MODS metadata contained in the METS files in /srv/data/digisam_mets-sample-300 to a pandas DataFrame, which is written to mods_info_df.parquet. This file can then be read by your data scientist using pd.read_parquet().

% mods4pandas /srv/data/digisam_mets-sample-300
INFO:root:Scanning directory /srv/data/digisam_mets-sample-300
301it [00:00, 19579.19it/s]
INFO:root:Processing METS files
100%|████████████████████████████████████████| 301/301 [00:01<00:00, 162.59it/s]
INFO:root:Writing DataFrame to mods_info_df.parquet

In the next example we convert the metadata from the ALTO files in the test data directory:

% alto4pandas qurator/mods4pandas/tests/data/alto
Scanning directory qurator/mods4pandas/tests/data/alto
Scanning directory qurator/mods4pandas/tests/data/alto/PPN636777308
Scanning directory qurator/mods4pandas/tests/data/alto/734008031
Scanning directory qurator/mods4pandas/tests/data/alto/PPN895016346
Scanning directory qurator/mods4pandas/tests/data/alto/PPN640992293
Scanning directory qurator/mods4pandas/tests/data/alto/alto-ner
Scanning directory qurator/mods4pandas/tests/data/alto/PPN767883624
Scanning directory qurator/mods4pandas/tests/data/alto/PPN715049151
Scanning directory qurator/mods4pandas/tests/data/alto/749782137
Scanning directory qurator/mods4pandas/tests/data/alto/weird-ns
INFO:alto4pandas:Processing ALTO files
INFO:alto4pandas:Writing DataFrame to alto_info_df.parquet