From f507370729c80502f4b677be95deb46fcc071692 Mon Sep 17 00:00:00 2001
From: "Gerber, Mike"
Date: Tue, 21 Jun 2022 13:12:44 +0200
Subject: [PATCH] =?UTF-8?q?=F0=9F=93=9D=20README:=20Add=20some=20documenta?=
 =?UTF-8?q?tion=20for=20alto4pandas?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 README.md | 31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6cef259..b89782d 100644
--- a/README.md
+++ b/README.md
@@ -1,19 +1,29 @@
-Extract the MODS metadata of a bunch of METS files into a pandas DataFrame.
+Extract the MODS/ALTO metadata of a bunch of METS/ALTO files into pandas DataFrames.
 
 [![Build Status](https://circleci.com/gh/qurator-spk/modstool.svg?style=svg)](https://circleci.com/gh/qurator-spk/modstool)
 
+**modstool** converts the MODS metadata from METS files into a pandas DataFrame.
+
 Column names are derived from the corresponding MODS elements. Some domain
 knowledge is used to convert elements to a useful column, e.g. produce sets
 instead of ordered lists for topics, etc. Parts of the tool are specific to
 our environment/needs at the State Library Berlin and may need to be changed for
 your library.
 
+**alto4pandas** converts the metadata from ALTO files into a pandas DataFrame.
+
+Column names are derived from the corresponding ALTO elements. Some columns
+contain descriptive statistics (e.g. counts or mean) of the corresponding ALTO
+elements or attributes.
 
 ## Usage
 ~~~
 modstool /path/to/a/directory/containing/mets_files
 ~~~
 
+~~~
+alto4pandas /path/to/a/directory/full/of/alto_files
+~~~
 
 ## Example
 In this example we convert the MODS metadata contained in the METS files in
@@ -29,3 +39,22 @@ INFO:root:Processing METS files
 100%|████████████████████████████████████████| 301/301 [00:01<00:00, 162.59it/s]
 INFO:root:Writing DataFrame to mods_info_df.pkl
 ~~~
+
+In the next example we convert the metadata from the ALTO files in the test data
+directory:
+
+~~~
+% alto4pandas qurator/modstool/tests/data/alto
+Scanning directory qurator/modstool/tests/data/alto
+Scanning directory qurator/modstool/tests/data/alto/PPN636777308
+Scanning directory qurator/modstool/tests/data/alto/734008031
+Scanning directory qurator/modstool/tests/data/alto/PPN895016346
+Scanning directory qurator/modstool/tests/data/alto/PPN640992293
+Scanning directory qurator/modstool/tests/data/alto/alto-ner
+Scanning directory qurator/modstool/tests/data/alto/PPN767883624
+Scanning directory qurator/modstool/tests/data/alto/PPN715049151
+Scanning directory qurator/modstool/tests/data/alto/749782137
+Scanning directory qurator/modstool/tests/data/alto/weird-ns
+INFO:alto4pandas:Processing ALTO files
+INFO:alto4pandas:Writing DataFrame to alto_info_df.pkl
+~~~
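
The examples in the README end with a DataFrame being written to `mods_info_df.pkl` or `alto_info_df.pkl`, so a reader may want to load that file back into pandas. The following is a minimal sketch, assuming the `.pkl` file is an ordinary pandas pickle (the log output suggests this, but the patch does not confirm it). The `load_info_df` helper and the column names `alto_file` and `string_count` are hypothetical stand-ins, not the tool's actual API or columns; the stand-in data is only there so that the sketch runs without real ALTO files.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_info_df(pickle_path):
    """Load a DataFrame that was previously written with DataFrame.to_pickle()."""
    return pd.read_pickle(pickle_path)


# Stand-in data so the sketch runs without real ALTO files.
# These column names are hypothetical, not the columns alto4pandas produces.
stand_in = pd.DataFrame(
    {
        "alto_file": ["00000001.xml", "00000002.xml"],
        "string_count": [120, 95],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Assumption: the tool's output is a regular pandas pickle like this one.
    pickle_path = Path(tmp_dir) / "alto_info_df.pkl"
    stand_in.to_pickle(pickle_path)

    alto_info_df = load_info_df(pickle_path)

# Inspect the size and the summary statistics of the numeric columns.
print(alto_info_df.shape)
print(alto_info_df.describe())
```

With real output, pass the path of the generated `alto_info_df.pkl` to `pd.read_pickle` instead of the temporary file. Pickle files can execute arbitrary code when loaded, so only load files from a trusted source.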