From 9057148d8de71a2a5b1c561ea97a3a6767a1497f Mon Sep 17 00:00:00 2001
From: Kai
Date: Mon, 21 Feb 2022 16:40:16 +0100
Subject: [PATCH] fix README

---
 README.md | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/README.md b/README.md
index 78ea8fb..db38cfe 100644
--- a/README.md
+++ b/README.md
@@ -237,6 +237,37 @@ Perform BERT for NER supervised training and test/cross-validation.
 bert-ner --help
 ```
 
+## BERT-Pre-training:
+
+### collectcorpus
+
+```
+collectcorpus --help
+
+Usage: collectcorpus [OPTIONS] FULLTEXT_FILE SELECTION_FILE CORPUS_FILE
+
+  Reads the fulltext from a CSV or SQLITE3 file (see also altotool) and
+  write it to one big text file.
+
+  FULLTEXT_FILE: The CSV or SQLITE3 file to read from.
+
+  SELECTION_FILE: Consider only a subset of all pages that is defined by the
+  DataFrame that is stored in .
+
+  CORPUS_FILE: The output file that can be used by bert-pregenerate-trainingdata.
+
+Options:
+  --chunksize INTEGER     Process the corpus in chunks of .
+                          default:10**4
+
+  --processes INTEGER     Number of parallel processes. default: 6
+  --min-line-len INTEGER  Lower bound of line length in output file.
+                          default:80
+
+  --help                  Show this message and exit.
+
+```
+
 ### bert-pregenerate-trainingdata
 
 Generate data for BERT pre-training from a corpus text file where