![sbb-ner-demo example](.screenshots/sbb_ner_demo.png?raw=true)
How the models have been obtained is described in our [paper](https://corpora.linguistik.uni-erlangen.de/data/konvens/proceedings/papers/KONVENS2019_paper_4.pdf).
***
# Installation:
Setup virtual environment:
```
virtualenv --python=python3.6 venv
```
Activate virtual environment:
```
source venv/bin/activate
```
Upgrade pip:
```
pip install -U pip
```
Install package together with its dependencies in development mode:
```
pip install -e ./
```
Download the required models: https://qurator-data.de/sbb_ner/models.tar.gz

Extract the model archive:
```
tar -xzf models.tar.gz
```
Run the webapp directly:
```
env FLASK_APP=qurator/sbb_ner/webapp/app.py env FLASK_ENV=development env USE_CUDA=True flask run --host=0.0.0.0
```
Set USE_CUDA=False if you do not have a GPU available.
For production purposes, use gunicorn instead:
```
env USE_CUDA=True/False gunicorn --bind 0.0.0.0:5000 qurator.sbb_ner.webapp.wsgi:app
```
If you want to use a different model configuration file:
```
env USE_CUDA=True/False env CONFIG=`realpath ./my-config.json` gunicorn --bind 0.0.0.0:5000 qurator.sbb_ner.webapp.wsgi:app
```
# Docker
## CPU-only:
```
docker build --build-arg http_proxy=$http_proxy -t qurator/webapp-ner-cpu -f Dockerfile.cpu .
```
```
docker run -ti --rm=true --mount type=bind,source=data/konvens2019,target=/usr/src/qurator-sbb-ner/data/konvens2019 -p 5000:5000 qurator/webapp-ner-cpu
```
## GPU:
Make sure that your GPU is correctly set up and that nvidia-docker has been installed.
```
docker build --build-arg http_proxy=$http_proxy -t qurator/webapp-ner-gpu -f Dockerfile .
```
```
docker run -ti --rm=true --mount type=bind,source=data/konvens2019,target=/usr/src/qurator-sbb-ner/data/konvens2019 -p 5000:5000 qurator/webapp-ner-gpu
```
The NER web interface is available at http://localhost:5000 .
# REST - Interface
Get available models:
```
curl http://localhost:5000/models
```
Output:
```
[
  {
    "default": true,
    "id": 1,
    "model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-de-finetuned",
    "name": "DC-SBB + CONLL + GERMEVAL"
  },
  {
    "default": false,
    "id": 2,
    "model_dir": "data/konvens2019/build-on-all-german-de-finetuned/bert-sbb-de-finetuned",
    "name": "DC-SBB + CONLL + GERMEVAL + SBB"
  },
  {
    "default": false,
    "id": 3,
    "model_dir": "data/konvens2019/build-wd_0.03/bert-sbb-de-finetuned",
    "name": "DC-SBB + SBB"
  },
  {
    "default": false,
    "id": 4,
    "model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-baseline",
    "name": "CONLL + GERMEVAL"
  }
]
```
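For scripted use, the default model can be selected from this listing. A minimal sketch; the `models` literal below mirrors the response above (trimmed to the fields used):

```python
# Pick the default model id from the /models listing.
models = [
    {"default": True, "id": 1, "name": "DC-SBB + CONLL + GERMEVAL"},
    {"default": False, "id": 2, "name": "DC-SBB + CONLL + GERMEVAL + SBB"},
    {"default": False, "id": 3, "name": "DC-SBB + SBB"},
    {"default": False, "id": 4, "name": "CONLL + GERMEVAL"},
]

default_model = next(m for m in models if m["default"])
print(default_model["id"])  # 1
```

The returned id is what goes into the `/ner/<id>` URL below.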
Perform NER using model 1:
```
curl -d '{ "text": "Paris Hilton wohnt im Hilton Paris in Paris." }' -H "Content-Type: application/json" http://localhost:5000/ner/1
```
Output:
```
[
  [
    {
      "prediction": "B-PER",
      "word": "Paris"
    },
    {
      "prediction": "I-PER",
      "word": "Hilton"
    },
    {
      "prediction": "O",
      "word": "wohnt"
    },
    {
      "prediction": "O",
      "word": "im"
    },
    {
      "prediction": "B-ORG",
      "word": "Hilton"
    },
    {
      "prediction": "I-ORG",
      "word": "Paris"
    },
    {
      "prediction": "O",
      "word": "in"
    },
    {
      "prediction": "B-LOC",
      "word": "Paris"
    },
    {
      "prediction": "O",
      "word": "."
    }
  ]
]
```
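The BIO tags in this output can be merged into entity spans on the client side. A minimal sketch over the sentence from the response above; the helper name is ours, not part of the API:

```python
def bio_to_entities(tokens):
    """Merge BIO-tagged tokens into (entity_text, label) spans."""
    entities = []
    for tok in tokens:
        tag = tok["prediction"]
        if tag.startswith("B-"):
            # "B-" opens a new entity of the given type.
            entities.append([tok["word"], tag[2:]])
        elif tag.startswith("I-") and entities:
            # "I-" continues the most recently opened entity.
            entities[-1][0] += " " + tok["word"]
    return [tuple(e) for e in entities]

sentence = [
    {"prediction": "B-PER", "word": "Paris"},
    {"prediction": "I-PER", "word": "Hilton"},
    {"prediction": "O", "word": "wohnt"},
    {"prediction": "O", "word": "im"},
    {"prediction": "B-ORG", "word": "Hilton"},
    {"prediction": "I-ORG", "word": "Paris"},
    {"prediction": "O", "word": "in"},
    {"prediction": "B-LOC", "word": "Paris"},
    {"prediction": "O", "word": "."},
]

print(bio_to_entities(sentence))
# [('Paris Hilton', 'PER'), ('Hilton Paris', 'ORG'), ('Paris', 'LOC')]
```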
The JSON above is the expected input format of the
[SBB named entity linking and disambiguation system](https://github.com/qurator-spk/sbb_ned).
# Model-Training

***

## Preprocessing of NER ground-truth:

### compile_conll
Reads CoNLL 2003 NER ground-truth files from a directory and
writes the parsed data to a pandas DataFrame that is stored as a pickle.
#### Usage
```
compile_conll --help
```
### compile_germ_eval
Reads GermEval .tsv files from a directory and writes the
parsed data to a pandas DataFrame that is stored as a pickle.
#### Usage
```
compile_germ_eval --help
```
### compile_europeana_historic
Reads Europeana Historic NER ground-truth .bio files from a directory
and writes the parsed data to a pandas DataFrame that is stored as a pickle.
#### Usage
```
compile_europeana_historic --help
```
### compile_wikiner
Reads WikiNER files from a directory and writes the parsed
data to a pandas DataFrame that is stored as a pickle.
#### Usage
```
compile_wikiner --help
```
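Each compile_* tool ends up with the same kind of artifact: a pickled pandas DataFrame that downstream training code can load with `pandas.read_pickle`. A minimal round-trip sketch; the column names and file name here are illustrative, not the tools' actual schema:

```python
import pandas as pd

# Build a tiny BIO-style ground-truth table, pickle it, and load it
# back the way the compiled DataFrames are consumed.
df = pd.DataFrame({"TOKEN": ["Paris", "Hilton", "wohnt"],
                   "TAG": ["B-PER", "I-PER", "O"]})
df.to_pickle("ground-truth.pkl")

restored = pd.read_pickle("ground-truth.pkl")
print(restored.equals(df))  # True
```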
***
## Train BERT - NER model:

### bert-ner
Performs supervised training of BERT for NER, plus test/cross-validation.
#### Usage
```
bert-ner --help
```