mirror of https://github.com/qurator-spk/sbb_ner.git synced 2026-07-27 12:39:13 +02:00

Clemens Neudecker 3185d7f7a1 Change model download link to Zenodo Updated the model download link in the README.		2026-02-13 18:48:06 +01:00
.screenshots	add screenshot	2019-08-22 14:43:42 +02:00
doc	Update sbb-ner-model-card.md	2023-02-01 09:00:31 +01:00
qurator	force static web files to be installed	2025-04-04 14:49:08 +02:00
.dockerignore	ner + textline work	2019-08-21 15:38:56 +02:00
__init__.py	refator	2022-06-10 10:06:16 +02:00
Dockerfile	fix docker file	2019-08-22 14:09:30 +02:00
Dockerfile.cpu	ner + textline work	2019-08-21 15:38:56 +02:00
LICENSE	re-structure repo	2019-08-16 15:22:13 +02:00
Makefile	update requirements; add make target for models from git annex	2025-04-04 14:29:30 +02:00
README.md	Change model download link to Zenodo	2026-02-13 18:48:06 +01:00
requirements.txt	update requirements; add make target for models from git annex	2025-04-04 14:30:38 +02:00
setup.py	force static web files to be installed	2025-04-04 14:54:03 +02:00

README.md

How the models have been obtained is described in our paper.

Installation:

The recommended Python version is 3.11. Consider using pyenv if that version is not available on your system.

Activate virtual environment (virtualenv):

source venv/bin/activate

or (pyenv):

pyenv activate my-python-3.11-virtualenv

Update pip:

pip install -U pip

Install sbb_ner:

pip install git+https://github.com/qurator-spk/sbb_ner.git

Download required models: https://zenodo.org/records/18634575.

Extract model archive:

tar -xzf models.tar.gz

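Copy the config file into the working directory. Set the USE_CUDA environment variable to True or False, depending on GPU availability (in the commands below, replace the True/False placeholder with one of the two values).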

Run webapp directly:

env CONFIG=config.json env FLASK_APP=qurator/sbb_ner/webapp/app.py env FLASK_ENV=development env USE_CUDA=True/False flask run --host=0.0.0.0

For production use, run the webapp with gunicorn instead:

env CONFIG=config.json env USE_CUDA=True/False gunicorn --bind 0.0.0.0:5000 qurator.sbb_ner.webapp.wsgi:app

Docker

CPU-only:

docker build --build-arg http_proxy=$http_proxy  -t qurator/webapp-ner-cpu -f Dockerfile.cpu .

docker run -ti --rm=true --mount type=bind,source=data/konvens2019,target=/usr/src/qurator-sbb-ner/data/konvens2019 -p 5000:5000 qurator/webapp-ner-cpu

GPU:

Make sure that your GPU is correctly set up and that nvidia-docker has been installed.

docker build --build-arg http_proxy=$http_proxy  -t qurator/webapp-ner-gpu -f Dockerfile .

docker run -ti --rm=true --mount type=bind,source=data/konvens2019,target=/usr/src/qurator-sbb-ner/data/konvens2019 -p 5000:5000 qurator/webapp-ner-gpu

The NER web interface is available at http://localhost:5000.

REST - Interface

Get available models:

curl http://localhost:5000/models

Output:

[
  {
    "default": true, 
    "id": 1, 
    "model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-de-finetuned", 
    "name": "DC-SBB + CONLL + GERMEVAL"
  }, 
  {
    "default": false, 
    "id": 2, 
    "model_dir": "data/konvens2019/build-on-all-german-de-finetuned/bert-sbb-de-finetuned", 
    "name": "DC-SBB + CONLL + GERMEVAL + SBB"
  }, 
  {
    "default": false, 
    "id": 3, 
    "model_dir": "data/konvens2019/build-wd_0.03/bert-sbb-de-finetuned", 
    "name": "DC-SBB + SBB"
  }, 
  {
    "default": false, 
    "id": 4, 
    "model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-baseline", 
    "name": "CONLL + GERMEVAL"
  }
]
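Client code usually needs the id of the default model before it sends NER requests. The following is a minimal Python sketch, assuming the `/models` response has the shape shown above; `default_model_id` is a hypothetical helper name and not part of sbb_ner.

```python
import json


def default_model_id(models):
    """Return the id of the first entry flagged as default, or None."""
    for model in models:
        if model.get("default"):
            return model["id"]
    return None


# Abridged copy of the /models output shown above.
response_body = """
[
  {"default": true, "id": 1, "name": "DC-SBB + CONLL + GERMEVAL"},
  {"default": false, "id": 2, "name": "DC-SBB + CONLL + GERMEVAL + SBB"}
]
"""

models = json.loads(response_body)
print(default_model_id(models))
```

The returned id can be used as the last path segment of the NER endpoint.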

Perform NER using model 1:

curl -d '{ "text": "Paris Hilton wohnt im Hilton Paris in Paris." }' -H "Content-Type: application/json" http://localhost:5000/ner/1

Output:

[
  [
    {
      "prediction": "B-PER", 
      "word": "Paris"
    }, 
    {
      "prediction": "I-PER", 
      "word": "Hilton"
    }, 
    {
      "prediction": "O", 
      "word": "wohnt"
    }, 
    {
      "prediction": "O", 
      "word": "im"
    }, 
    {
      "prediction": "B-ORG", 
      "word": "Hilton"
    }, 
    {
      "prediction": "I-ORG", 
      "word": "Paris"
    }, 
    {
      "prediction": "O", 
      "word": "in"
    }, 
    {
      "prediction": "B-LOC", 
      "word": "Paris"
    }, 
    {
      "prediction": "O", 
      "word": "."
    }
  ]
]

The JSON above is the expected input format of the SBB named entity linking and disambiguation system.
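The same request can be issued from Python, and the token-level B-/I-/O tags can be grouped into whole entities. The sketch below uses only the standard library and assumes the request and response shapes shown above; `build_ner_request` and `group_entities` are hypothetical helper names, not part of sbb_ner.

```python
import json
import urllib.request


def build_ner_request(text, model_id=1, base_url="http://localhost:5000"):
    """Build (but do not send) the POST request for /ner/<model_id>."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ner/{model_id}",
        data=payload,
        headers={"Content-Type": "application/json"},
    )


def group_entities(sentences):
    """Collapse B-/I- tagged tokens into (label, text) pairs per sentence."""
    result = []
    for sentence in sentences:
        entities = []
        label, words = None, []
        for token in sentence:
            tag = token["prediction"]
            # An I- tag whose label differs from the open entity also starts a new one.
            starts_new = tag.startswith("B-") or (
                tag.startswith("I-") and tag[2:] != label
            )
            if tag == "O" or starts_new:
                if words:
                    entities.append((label, " ".join(words)))
                label, words = None, []
            if tag != "O":
                label = tag[2:]
                words.append(token["word"])
        if words:
            entities.append((label, " ".join(words)))
        result.append(entities)
    return result


# The response shown above, written compactly.
sample = [[
    {"prediction": "B-PER", "word": "Paris"},
    {"prediction": "I-PER", "word": "Hilton"},
    {"prediction": "O", "word": "wohnt"},
    {"prediction": "O", "word": "im"},
    {"prediction": "B-ORG", "word": "Hilton"},
    {"prediction": "I-ORG", "word": "Paris"},
    {"prediction": "O", "word": "in"},
    {"prediction": "B-LOC", "word": "Paris"},
    {"prediction": "O", "word": "."},
]]

print(group_entities(sample))
# [[('PER', 'Paris Hilton'), ('ORG', 'Hilton Paris'), ('LOC', 'Paris')]]
```

With a running server, the request could be sent via `urllib.request.urlopen(build_ner_request(text))` and the body decoded with `json.load`.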

Model-Training

Preprocessing of NER ground-truth:

compile_conll

Read CONLL 2003 NER ground-truth files from a directory, parse them, and write the result to a pandas DataFrame that is stored as a pickle file.

Usage

compile_conll --help
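For orientation: CONLL 2003 files hold one token per line in whitespace-separated columns, with the token first and the NER tag last. Blank lines separate sentences and `-DOCSTART-` lines separate documents. The sketch below parses that layout with the standard library only. It is not the implementation behind `compile_conll`; `parse_conll` is a hypothetical name and the sample lines are made up for illustration.

```python
def parse_conll(lines):
    """Yield sentences as lists of (word, ner_tag) tuples."""
    sentence = []
    for line in lines:
        line = line.strip()
        # Blank lines and document markers both end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        sentence.append((columns[0], columns[-1]))
    if sentence:
        yield sentence


sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

sentences = list(parse_conll(sample.splitlines()))
print(sentences)
```

Each (word, tag) pair could then become one row of a DataFrame, together with a sentence index.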

compile_germ_eval

Read GermEval .tsv files from a directory, parse them, and write the result to a pandas DataFrame that is stored as a pickle file.

Usage

compile_germ_eval --help

compile_europeana_historic

Read Europeana historic NER ground-truth .bio files from a directory, parse them, and write the result to a pandas DataFrame that is stored as a pickle file.

Usage

compile_europeana_historic --help

compile_wikiner

Read WikiNER files from a directory, parse them, and write the result to a pandas DataFrame that is stored as a pickle file.

Usage

compile_wikiner --help

Train BERT - NER model:

bert-ner

Perform supervised training of a BERT model for NER, including test/cross-validation.

Usage

bert-ner --help

BERT-Pre-training:

collectcorpus

collectcorpus --help

Usage: collectcorpus [OPTIONS] FULLTEXT_FILE SELECTION_FILE CORPUS_FILE

  Reads the fulltext from a CSV or SQLITE3 file (see also altotool) and
  write it to one big text file.

  FULLTEXT_FILE: The CSV or SQLITE3 file to read from.

  SELECTION_FILE: Consider only a subset of all pages that is defined by the
  DataFrame that is stored in <selection_file>.

  CORPUS_FILE: The output file that can be used by bert-pregenerate-trainingdata.

Options:
  --chunksize INTEGER     Process the corpus in chunks of <chunksize>.
                          default:10**4

  --processes INTEGER     Number of parallel processes. default: 6
  --min-line-len INTEGER  Lower bound of line length in output file.
                          default:80

  --help                  Show this message and exit.

bert-pregenerate-trainingdata

Generate data for BERT pre-training from a corpus text file in which documents are separated by an empty line (the output of collectcorpus).

Usage

bert-pregenerate-trainingdata --help
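The input format described above (documents separated by an empty line) can be illustrated with a short Python sketch. It only shows how such a corpus file splits into documents and is not the code used by `bert-pregenerate-trainingdata`; `split_documents` is a hypothetical name.

```python
def split_documents(corpus_text):
    """Split a corpus into documents; each document is a list of non-empty lines."""
    documents, current = [], []
    for line in corpus_text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # An empty line closes the document collected so far.
            documents.append(current)
            current = []
    if current:
        documents.append(current)
    return documents


corpus = "Erste Zeile.\nZweite Zeile.\n\nNeues Dokument.\n"
print(split_documents(corpus))
# [['Erste Zeile.', 'Zweite Zeile.'], ['Neues Dokument.']]
```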

bert-finetune

Perform BERT pre-training on pre-generated data.

Usage

bert-finetune --help