1
0
Fork 0
mirror of https://github.com/qurator-spk/modstool.git synced 2025-06-09 11:49:55 +02:00
No description
Find a file
2022-06-21 13:12:44 +02:00
.circleci ✔️ CircleCI: - Python 3.5 + Python 3.10 2022-03-31 21:10:13 +02:00
.vscode ⚙️ Add VSCode settings 2022-04-05 14:24:08 +02:00
qurator ✔️ Add test data 2022-06-21 12:54:26 +02:00
.editorconfig ⚙️ Add .editorconfig 2022-04-01 16:04:47 +02:00
.gitignore 🧹 .gitignore 2022-04-01 16:04:21 +02:00
LICENSE 📝 modstool: Add LICENSE 2019-10-11 13:41:33 +02:00
README.md 📝 README: Add some documentation for alto4pandas 2022-06-21 13:12:44 +02:00
requirements-test.txt 🚧 Add support for ALTO Description 2022-05-04 20:02:27 +02:00
requirements.txt 🐛 Fix install on Python 3.10 2022-04-07 15:17:19 +02:00
setup.py Rename altotool to alto4pandas 2022-06-16 19:27:59 +02:00

Extract the MODS/ALTO metadata of a bunch of METS/ALTO files into pandas DataFrames.

Build Status

modstool converts the MODS metadata from METS files into a pandas DataFrame.

Column names are derived from the corresponding MODS elements. Some domain knowledge is used to convert elements to a useful column, e.g. produce sets instead of ordered lists for topics, etc. Parts of the tool are specific to our environment/needs at the State Library Berlin and may need to be changed for your library.

alto4pandas convets the metadata from ALTO files into a pandas DataFrame.

Column names are derived from the corresponding ALTO elements. Some columns contain descriptive statistics (e.g. counts or mean) of the corresponding ALTO elements or attributes.

Usage

modstool /path/to/a/directory/containing/mets_files

~~ alto4pandas /path/to/a/directory/full/of/alto_files


## Example
In this example we convert the MODS metadata contained in the METS files in
`/srv/data/digisam_mets-sample-300` to a pandas DataFrame under
`mods_info_df.pkl`. This file can then be read by your data scientist using
`pd.read_pickle()`.

% modstool /srv/data/digisam_mets-sample-300 INFO:root:Scanning directory /srv/data/digisam_mets-sample-300 301it [00:00, 19579.19it/s] INFO:root:Processing METS files 100%|████████████████████████████████████████| 301/301 [00:01<00:00, 162.59it/s] INFO:root:Writing DataFrame to mods_info_df.pkl


In the next example we convert the metadata from the ALTO files in the test data
directory:

% alto4pandas qurator/modstool/tests/data/alto Scanning directory qurator/modstool/tests/data/alto Scanning directory qurator/modstool/tests/data/alto/PPN636777308 Scanning directory qurator/modstool/tests/data/alto/734008031 Scanning directory qurator/modstool/tests/data/alto/PPN895016346 Scanning directory qurator/modstool/tests/data/alto/PPN640992293 Scanning directory qurator/modstool/tests/data/alto/alto-ner Scanning directory qurator/modstool/tests/data/alto/PPN767883624 Scanning directory qurator/modstool/tests/data/alto/PPN715049151 Scanning directory qurator/modstool/tests/data/alto/749782137 Scanning directory qurator/modstool/tests/data/alto/weird-ns INFO:alto4pandas:Processing ALTO files INFO:alto4pandas:Writing DataFrame to alto_info_df.pkl