![sbb-ner-demo example](.screenshots/sbb_ner_demo.png?raw=true)
How the models have been obtained is described in our [paper](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_4.pdf).
***
# Installation:
The recommended Python version is 3.11.
Consider using [pyenv](https://github.com/pyenv/pyenv) if that Python version is not available on your system.
Activate a virtual environment (virtualenv):
```
source venv/bin/activate
```
or (pyenv):
```
pyenv activate my-python-3.11-virtualenv
```
Update pip:
```
pip install -U pip
```
Install sbb_ner:
```
pip install git+https://github.com/qurator-spk/sbb_ner.git
```
Download required models: https://zenodo.org/records/18634575.
Extract model archive:
```
tar -xzf models.tar.gz
```
Copy the [config file](qurator/sbb_ner/webapp/config.json) into the working directory.
Set the USE_CUDA environment variable to True or False depending on GPU availability.
Run the webapp directly:
```
env CONFIG=config.json env FLASK_APP=qurator/sbb_ner/webapp/app.py env FLASK_ENV=development env USE_CUDA=True/False flask run --host=0.0.0.0
```
For production use, run the app with gunicorn instead:
```
env CONFIG=config.json env USE_CUDA=True/False gunicorn --bind 0.0.0.0:5000 qurator.sbb_ner.webapp.wsgi:app
```
# Docker
## CPU-only:
```
docker build --build-arg http_proxy=$http_proxy -t qurator/webapp-ner-cpu -f Dockerfile.cpu .
```
```
docker run -ti --rm=true --mount type=bind,source=data/konvens2019,target=/usr/src/qurator-sbb-ner/data/konvens2019 -p 5000:5000 qurator/webapp-ner-cpu
```
## GPU:
Make sure that your GPU is correctly set up and that nvidia-docker has been installed.
```
docker build --build-arg http_proxy=$http_proxy -t qurator/webapp-ner-gpu -f Dockerfile .
```
```
docker run -ti --rm=true --mount type=bind,source=data/konvens2019,target=/usr/src/qurator-sbb-ner/data/konvens2019 -p 5000:5000 qurator/webapp-ner-gpu
```
The NER web interface is available at http://localhost:5000.
# REST Interface
Get available models:
```
curl http://localhost:5000/models
```
Output:
```
[
{
"default": true,
"id": 1,
"model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-de-finetuned",
"name": "DC-SBB + CONLL + GERMEVAL"
},
{
"default": false,
"id": 2,
"model_dir": "data/konvens2019/build-on-all-german-de-finetuned/bert-sbb-de-finetuned",
"name": "DC-SBB + CONLL + GERMEVAL + SBB"
},
{
"default": false,
"id": 3,
"model_dir": "data/konvens2019/build-wd_0.03/bert-sbb-de-finetuned",
"name": "DC-SBB + SBB"
},
{
"default": false,
"id": 4,
"model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-baseline",
"name": "CONLL + GERMEVAL"
}
]
```
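For illustration, a client could select the default model from this response and derive the matching `/ner/<id>` route. A minimal sketch (the `models` list merely reproduces the JSON above; the helper code itself is not part of sbb_ner):

```python
# Response of GET /models, abbreviated to the fields used here.
models = [
    {"default": True, "id": 1, "name": "DC-SBB + CONLL + GERMEVAL"},
    {"default": False, "id": 2, "name": "DC-SBB + CONLL + GERMEVAL + SBB"},
    {"default": False, "id": 3, "name": "DC-SBB + SBB"},
    {"default": False, "id": 4, "name": "CONLL + GERMEVAL"},
]

# Pick the model flagged as default and build the NER endpoint from its id.
default_model = next(m for m in models if m["default"])
ner_url = f"http://localhost:5000/ner/{default_model['id']}"
print(ner_url)  # http://localhost:5000/ner/1
```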
Perform NER using model 1:
```
curl -d '{ "text": "Paris Hilton wohnt im Hilton Paris in Paris." }' -H "Content-Type: application/json" http://localhost:5000/ner/1
```
Output:
```
[
[
{
"prediction": "B-PER",
"word": "Paris"
},
{
"prediction": "I-PER",
"word": "Hilton"
},
{
"prediction": "O",
"word": "wohnt"
},
{
"prediction": "O",
"word": "im"
},
{
"prediction": "B-ORG",
"word": "Hilton"
},
{
"prediction": "I-ORG",
"word": "Paris"
},
{
"prediction": "O",
"word": "in"
},
{
"prediction": "B-LOC",
"word": "Paris"
},
{
"prediction": "O",
"word": "."
}
]
]
```
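Downstream code often needs entity spans rather than token-level tags. The BIO output above can be grouped client-side, e.g. with a small helper like the following sketch (`bio_to_entities` is illustrative, not part of sbb_ner):

```python
def bio_to_entities(tokens):
    """Group token-level BIO predictions into (entity text, entity type) spans."""
    entities = []
    current_words, current_type = [], None
    for tok in tokens:
        tag = tok["prediction"]
        if tag.startswith("B-"):
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [tok["word"]], tag[2:]
        elif tag.startswith("I-") and current_words:
            current_words.append(tok["word"])
        else:  # "O" (or a stray I- tag) closes the current entity
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities

# Token predictions for the first (and only) sentence of the response above.
sentence = [
    {"prediction": "B-PER", "word": "Paris"},
    {"prediction": "I-PER", "word": "Hilton"},
    {"prediction": "O", "word": "wohnt"},
    {"prediction": "O", "word": "im"},
    {"prediction": "B-ORG", "word": "Hilton"},
    {"prediction": "I-ORG", "word": "Paris"},
    {"prediction": "O", "word": "in"},
    {"prediction": "B-LOC", "word": "Paris"},
    {"prediction": "O", "word": "."},
]

print(bio_to_entities(sentence))
# [('Paris Hilton', 'PER'), ('Hilton Paris', 'ORG'), ('Paris', 'LOC')]
```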
The JSON above is the expected input format of the
[SBB named entity linking and disambiguation system](https://github.com/qurator-spk/sbb_ned).
# Model-Training
***
## Preprocessing of NER ground-truth:
### compile_conll
Reads CoNLL 2003 NER ground truth files from a directory and writes
the parsed data to a pandas DataFrame that is stored as a pickle.
#### Usage
```
compile_conll --help
```
### compile_germ_eval
Reads GermEval .tsv files from a directory and writes the parsed
data to a pandas DataFrame that is stored as a pickle.
#### Usage
```
compile_germ_eval --help
```
### compile_europeana_historic
Reads Europeana Historic NER ground truth .bio files from a directory
and writes the parsed data to a pandas DataFrame that is stored as a pickle.
#### Usage
```
compile_europeana_historic --help
```
### compile_wikiner
Reads WikiNER files from a directory and writes the parsed data
to a pandas DataFrame that is stored as a pickle.
#### Usage
```
compile_wikiner --help
```
***
## Train BERT NER model:
### bert-ner
Performs supervised BERT training for NER as well as testing/cross-validation.
#### Usage
```
bert-ner --help
```
## BERT-Pre-training:
### collectcorpus
```
collectcorpus --help

Usage: collectcorpus [OPTIONS] FULLTEXT_FILE SELECTION_FILE CORPUS_FILE

  Reads the fulltext from a CSV or SQLITE3 file (see also altotool) and
  writes it to one big text file.

  FULLTEXT_FILE: The CSV or SQLITE3 file to read from.

  SELECTION_FILE: Consider only a subset of all pages that is defined by the
  DataFrame that is stored in <selection_file>.

  CORPUS_FILE: The output file that can be used by bert-pregenerate-trainingdata.

Options:
  --chunksize INTEGER     Process the corpus in chunks of <chunksize>.
                          default: 10**4
  --processes INTEGER     Number of parallel processes. default: 6
  --min-line-len INTEGER  Lower bound of line length in output file.
                          default: 80
  --help                  Show this message and exit.
```
### bert-pregenerate-trainingdata
Generates data for BERT pre-training from a corpus text file in which
the documents are separated by an empty line (the output of collectcorpus).
#### Usage
```
bert-pregenerate-trainingdata --help
```
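The corpus layout expected here (documents separated by empty lines, as described above) can be illustrated with a short sketch; `split_documents` is purely for illustration and is not part of sbb_ner:

```python
def split_documents(corpus_text):
    """Split corpus text into documents at empty lines (the collectcorpus layout)."""
    blocks = corpus_text.split("\n\n")
    return [b.strip() for b in blocks if b.strip()]

# Two documents separated by one empty line.
corpus = "Erste Zeile des ersten Dokuments.\nZweite Zeile.\n\nZweites Dokument.\n"
print(len(split_documents(corpus)))  # 2
```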
### bert-finetune
Perform BERT pre-training on pre-generated data.
#### Usage
```
bert-finetune --help
```