Extended entity recognition#
In some cases the default four-type NER does not suffice and a more fine-grained entity type system is needed. Attila Novák developed a corpus and an entity recognition system covering more than 30 entity types. We provide easy integration with his tool.
Load#
The model is loaded by adding the nerpp component to the pipeline.
The component requires transformers, torch and spacy-alignments to be installed. Installing HuSpaCy with the trf extras installs all of these dependencies:
pip install huspacy[trf]
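A minimal sketch of the loading step described above, wrapped in a helper so that nothing is downloaded or loaded until it is called. The helper name load_nerpp_pipeline is hypothetical. It assumes that a HuSpaCy model is available to huspacy.load() and that the "nerpp" factory is registered once huspacy is imported; if it is not, an additional import may be needed.

```python
def load_nerpp_pipeline():
    """Load the default HuSpaCy model and append the nerpp component.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so that defining the helper does not require
    # the optional dependencies (transformers, torch, spacy-alignments).
    import huspacy

    # Loads the default HuSpaCy model.
    nlp = huspacy.load()

    # Assumption: the "nerpp" factory is registered when huspacy is imported.
    nlp.add_pipe("nerpp")
    return nlp
```

Calling nlp = load_nerpp_pipeline() should then give a pipeline that can be used as in the next section.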
Get entity annotations#
The nerpp component stores entities as spans on the document under the "ents" key:
doc = nlp("A Ford Focus egy alsó-középkategóriás családi autó")
print(doc.spans["ents"])
print(doc.spans["ents"][0].label_)
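The first call prints the span group of recognized entities, and the second prints the label of the first span.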
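Because the entities live in a regular spaCy span group, they can be turned into plain records with a few lines of code. The sketch below does not run the nerpp model: it builds a stand-in document by hand and stores one span under "ents" to mimic what the component writes. The helper spans_to_records and the label "CAR" are placeholders chosen for illustration, not names taken from the tool.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab


def spans_to_records(doc, key="ents"):
    """Return (text, label, start_char, end_char) for each span stored under `key`."""
    return [(s.text, s.label_, s.start_char, s.end_char) for s in doc.spans[key]]


# Stand-in document built from pre-tokenized words (no model is loaded here).
words = ["A", "Ford", "Focus", "egy", "alsó-középkategóriás", "családi", "autó"]
doc = Doc(Vocab(), words=words)

# Mimic the component's output: one span over tokens 1..2 ("Ford Focus").
# "CAR" is a placeholder label, not necessarily one produced by nerpp.
doc.spans["ents"] = [Span(doc, 1, 3, label="CAR")]

records = spans_to_records(doc)
for record in records:
    print(record)
```

With a real pipeline, the same helper can be applied to the doc returned by nlp(...). Character offsets are included so that the entities can be mapped back onto the original text.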
Citing#
If you use this component, please cite:
@InProceedings{novak-novak:2022:LREC,
author = {Nov{\'{a}}k, Attila and Nov{\'{a}}k, Borb{\'{a}}la},
title = {NerKor+Cars-OntoNotes++},
booktitle = {Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022)},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {1907--1916},
url = {http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.203.pdf}
}