![sbb-ner-demo example](.screenshots/sbb_ner_demo.png?raw=true)

How the models have been obtained is described in our [paper](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_4.pdf).

***

# Installation:

Set up a virtual environment:
```
virtualenv --python=python3.6 venv
```

Activate the virtual environment:
```
source venv/bin/activate
```

Upgrade pip:
```
pip install -U pip
```

Install the package together with its dependencies in development mode:
```
pip install -e ./
```

Download the required models: https://qurator-data.de/sbb_ner/models.tar.gz

Extract the model archive:
```
tar -xzf models.tar.gz
```
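
The download and extraction steps above can also be scripted. The following is a minimal sketch using only the Python standard library; the function names `download_models` and `extract_models` are hypothetical helpers and not part of this package, and the URL is the one given above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the installation instructions above.
MODELS_URL = "https://qurator-data.de/sbb_ner/models.tar.gz"


def download_models(url=MODELS_URL, archive_path="models.tar.gz"):
    """Download the model archive to archive_path and return that path."""
    urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract_models(archive_path, target_dir="."):
    """Extract a .tar.gz archive into target_dir.

    Returns the sorted top-level names found in the archive.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        members = archive.getnames()
        archive.extractall(target)
    return sorted({name.split("/")[0] for name in members})


if __name__ == "__main__":
    print(extract_models(download_models()))
```

Only extract archives from a source you trust, since `extractall` writes whatever paths the archive contains.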

Run the webapp directly:

```
env FLASK_APP=qurator/sbb_ner/webapp/app.py FLASK_ENV=development USE_CUDA=True flask run --host=0.0.0.0
```

Set `USE_CUDA=False` if you do not have a GPU available.

For production use, run the webapp with gunicorn instead (set `USE_CUDA` to either `True` or `False`):
```
env USE_CUDA=True/False gunicorn --bind 0.0.0.0:5000 qurator.sbb_ner.webapp.wsgi:app
```

To use a different model configuration file, point `CONFIG` at it:

```
env USE_CUDA=True/False CONFIG="$(realpath ./my-config.json)" gunicorn --bind 0.0.0.0:5000 qurator.sbb_ner.webapp.wsgi:app
```

# Docker

## CPU-only:

```
docker build --build-arg http_proxy=$http_proxy -t qurator/webapp-ner-cpu -f Dockerfile.cpu .
```

```
docker run -ti --rm=true --mount type=bind,source="$(pwd)"/data/konvens2019,target=/usr/src/qurator-sbb-ner/data/konvens2019 -p 5000:5000 qurator/webapp-ner-cpu
```

## GPU:

Make sure that your GPU is correctly set up and that nvidia-docker has been installed.

```
docker build --build-arg http_proxy=$http_proxy -t qurator/webapp-ner-gpu -f Dockerfile .
```

```
docker run -ti --rm=true --mount type=bind,source="$(pwd)"/data/konvens2019,target=/usr/src/qurator-sbb-ner/data/konvens2019 -p 5000:5000 qurator/webapp-ner-gpu
```

The NER web interface is available at http://localhost:5000.

# REST - Interface

Get available models:
```
curl http://localhost:5000/models
```

Output:

```
[
  {
    "default": true,
    "id": 1,
    "model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-de-finetuned",
    "name": "DC-SBB + CONLL + GERMEVAL"
  },
  {
    "default": false,
    "id": 2,
    "model_dir": "data/konvens2019/build-on-all-german-de-finetuned/bert-sbb-de-finetuned",
    "name": "DC-SBB + CONLL + GERMEVAL + SBB"
  },
  {
    "default": false,
    "id": 3,
    "model_dir": "data/konvens2019/build-wd_0.03/bert-sbb-de-finetuned",
    "name": "DC-SBB + SBB"
  },
  {
    "default": false,
    "id": 4,
    "model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-baseline",
    "name": "CONLL + GERMEVAL"
  }
]
```
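
A client may want to look up which model is marked as the default before sending text. The sketch below assumes the response has exactly the shape shown above (a list of objects with `default` and `id` keys); `fetch_models` and `default_model` are hypothetical helper names, not part of this package.

```python
import json
import urllib.request


def fetch_models(base_url="http://localhost:5000"):
    """Query the /models endpoint and return the decoded JSON list."""
    with urllib.request.urlopen(f"{base_url}/models") as response:
        return json.load(response)


def default_model(models):
    """Return the first entry flagged as default, or None if there is none."""
    for model in models:
        if model.get("default"):
            return model
    return None


if __name__ == "__main__":
    # Requires a running webapp on localhost:5000.
    print(default_model(fetch_models()))
```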

Perform NER using model 1:

```
curl -d '{ "text": "Paris Hilton wohnt im Hilton Paris in Paris." }' -H "Content-Type: application/json" http://localhost:5000/ner/1
```
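
The same request can be sent from Python. This is a sketch with the standard library only, mirroring the `curl` call above (JSON body with a `text` field, POSTed to `/ner/<model id>`); the helper names `build_ner_request` and `run_ner` are assumptions made for illustration.

```python
import json
import urllib.request


def build_ner_request(text, model_id=1, base_url="http://localhost:5000"):
    """Build a POST request equivalent to the curl command above."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/ner/{model_id}",
        data=payload,  # supplying data makes urllib issue a POST
        headers={"Content-Type": "application/json"},
    )


def run_ner(text, model_id=1, base_url="http://localhost:5000"):
    """Send the request to a running webapp and return the decoded JSON."""
    request = build_ner_request(text, model_id, base_url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    print(run_ner("Paris Hilton wohnt im Hilton Paris in Paris."))
```

Building the request separately from sending it makes it possible to inspect the URL and payload without a running server.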

Output:

```
[
  [
    {
      "prediction": "B-PER",
      "word": "Paris"
    },
    {
      "prediction": "I-PER",
      "word": "Hilton"
    },
    {
      "prediction": "O",
      "word": "wohnt"
    },
    {
      "prediction": "O",
      "word": "im"
    },
    {
      "prediction": "B-ORG",
      "word": "Hilton"
    },
    {
      "prediction": "I-ORG",
      "word": "Paris"
    },
    {
      "prediction": "O",
      "word": "in"
    },
    {
      "prediction": "B-LOC",
      "word": "Paris"
    },
    {
      "prediction": "O",
      "word": "."
    }
  ]
]
```

The JSON above is the expected input format of the
[SBB named entity linking and disambiguation system](https://github.com/qurator-spk/sbb_ned).
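
The output is token-level: each word carries a `B-` (begin), `I-` (inside) or `O` (outside) tag. To obtain whole entities, consecutive tokens need to be merged. The following sketch shows one way to do this, assuming the response shape shown above (a list of sentences, each a list of `prediction`/`word` objects); `group_entities` is a hypothetical helper and not part of this package.

```python
def group_entities(sentence):
    """Merge B-/I- tagged tokens of one sentence into (text, type) pairs.

    An I- tag that does not continue an open entity of the same type
    is treated like O in this sketch.
    """
    entities = []
    current = None  # entity currently being built, or None
    for token in sentence:
        tag, word = token["prediction"], token["word"]
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": tag[2:], "words": [word]}
        elif tag.startswith("I-") and current and current["type"] == tag[2:]:
            current["words"].append(word)
        else:
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return [(" ".join(e["words"]), e["type"]) for e in entities]


if __name__ == "__main__":
    sentence = [
        {"prediction": "B-PER", "word": "Paris"},
        {"prediction": "I-PER", "word": "Hilton"},
        {"prediction": "O", "word": "wohnt"},
        {"prediction": "B-LOC", "word": "Paris"},
    ]
    print(group_entities(sentence))
```

Applied to the example sentence above, this yields "Paris Hilton" as `PER`, "Hilton Paris" as `ORG` and "Paris" as `LOC`.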

# Model-Training

***
## Preprocessing of NER ground-truth:

### compile_conll

Reads CoNLL 2003 NER ground-truth files from a directory and
writes the parsed data to a pandas DataFrame that is
stored as a pickle.

#### Usage

```
compile_conll --help
```
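
To illustrate the kind of parsing involved, here is a minimal sketch that turns CoNLL-style lines (one token per line, whitespace-separated columns, NER tag in the last column, blank lines between sentences) into one row per token. It is not the code of `compile_conll`: the function names and the row keys `sentence`, `word` and `tag` are assumptions, and the sketch uses plain Python lists instead of pandas to stay dependency-free.

```python
import pickle


def parse_conll(lines):
    """Turn CoNLL-style lines into a list of one dict per token."""
    rows = []
    sentence_id = 0
    in_sentence = False
    for line in lines:
        line = line.strip()
        # Blank lines end a sentence; -DOCSTART- lines mark document boundaries.
        if not line or line.startswith("-DOCSTART-"):
            if in_sentence:
                sentence_id += 1
                in_sentence = False
            continue
        columns = line.split()
        # First column is the word, last column is the NER tag.
        rows.append({"sentence": sentence_id, "word": columns[0], "tag": columns[-1]})
        in_sentence = True
    return rows


def save_rows(rows, path):
    """Store the parsed rows as a pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(rows, handle)
```

With pandas installed, `pandas.DataFrame(rows).to_pickle(path)` would store the same rows as a pickled DataFrame.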

### compile_germ_eval

Reads GermEval .tsv files from a directory and writes the
parsed data to a pandas DataFrame that is stored as a
pickle.

#### Usage

```
compile_germ_eval --help
```

### compile_europeana_historic

Reads Europeana historic NER ground-truth .bio files from a directory
and writes the parsed data to a pandas
DataFrame that is stored as a pickle.

#### Usage

```
compile_europeana_historic --help
```

### compile_wikiner

Reads WikiNER files from a directory and writes the parsed data
to a pandas DataFrame that is stored as a pickle.

#### Usage

```
compile_wikiner --help
```

***
## Train BERT - NER model:

### bert-ner

Performs supervised training of BERT for NER, along with testing/cross-validation.

#### Usage

```
bert-ner --help
```

### bert-pregenerate-trainingdata

Generates data for BERT pre-training from a corpus text file in which
the documents are separated by an empty line (output of corpuscollect).

#### Usage

```
bert-pregenerate-trainingdata --help
```
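
The expected corpus layout (documents separated by an empty line) can be read as sketched below. This only illustrates the input format and is not the tool's own code; `iter_documents` is a hypothetical name.

```python
def iter_documents(lines):
    """Yield each document as a list of its non-empty lines.

    Documents are separated by one or more empty lines.
    """
    document = []
    for line in lines:
        line = line.strip()
        if line:
            document.append(line)
        elif document:
            # An empty line closes the current document.
            yield document
            document = []
    if document:
        yield document


if __name__ == "__main__":
    corpus = ["Erster Satz.", "Zweiter Satz.", "", "Neues Dokument."]
    for doc in iter_documents(corpus):
        print(doc)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, so large corpora do not have to be loaded into memory at once.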

### bert-finetune

Perform BERT pre-training on pre-generated data.

#### Usage

```
bert-finetune --help
```