Frequently asked questions

HuSpaCy is slow, what can I do?

No, it's not. :) Still, you have several options to speed up your processing pipeline.

  1. If accuracy is not crucial, use a smaller model: md < lg < trf (a combined sketch follows this list).
  2. Utilize the GPU: use the following directive before loading the model (and make sure all GPU-related dependencies are installed). This simple notebook might help you get started.
    import spacy
    spacy.prefer_gpu()
    
  3. Batch processing of multiple documents is always faster. Use the Language.pipe() method and increase the batch_size if needed. Additionally, the n_process parameter can be used to enable multiprocessing when running models on the CPU.
    texts = ["first doc", "second doc"]
    # nlp.pipe() returns a generator, so materialize it if you need a list
    docs = list(nlp.pipe(texts, batch_size=1024, n_process=2))
    
  4. Disable components you don't need. When mining documents for named entities, the default model unnecessarily computes lemmata, PoS tags and dependency trees. You can easily disable them during model loading (cf. spacy.load() or huspacy.load()) or via Language.disable_pipe()
    import huspacy
    nlp = huspacy.load("hu_core_news_lg", disable=["tagger"])
    
    or
    nlp.disable_pipe("tagger")
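
Putting these tips together, a minimal sketch might look like this (the model name, batch size and process count are illustrative, not recommendations; check nlp.pipe_names for the component names your model actually provides):

    import huspacy

    # Smaller model, with an unneeded component disabled at load time
    nlp = huspacy.load("hu_core_news_md", disable=["tagger"])

    texts = ["first doc", "second doc"]
    # Stream documents in batches on the CPU; tune batch_size and n_process
    # to your hardware
    for doc in nlp.pipe(texts, batch_size=256, n_process=2):
        print([(ent.text, ent.label_) for ent in doc.ents])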
    

Models require too much RAM, how can I reduce their memory footprint?

HuSpaCy models use a distinct language model for almost every component. This architectural decision enables higher accuracy at the cost of increased memory usage. However, if you only need certain components, the others can be disabled as shown above.
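
Note that in spaCy v3, components listed in disable are still loaded into memory and merely skipped at runtime; the exclude argument of spacy.load() prevents them from being loaded at all, which is what actually shrinks the footprint. A minimal sketch, with illustrative component names:

    import spacy

    # `exclude` never loads the listed components, so their weights do not
    # occupy RAM (unlike `disable`, which loads them but skips execution)
    nlp = spacy.load("hu_core_news_lg", exclude=["parser", "tagger"])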

The NER model usually confuses ORG and LOC entities, why is that?

The underlying model has been trained on corpora following the "tag-for-meaning" guideline, which yields context-dependent labels. For example, referring to "Budapest" in the context of the Hungarian government should yield the ORG label, while in other contexts it should be tagged as LOC.
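
A quick way to observe this behaviour is to run the model on the same name in different contexts and inspect the labels it assigns (the sentences below are illustrative, and the actual output depends on the model and its version):

    import huspacy

    nlp = huspacy.load("hu_core_news_lg")

    # The same surface form may receive different labels in different contexts
    for text in [
        "Budapest a Duna partján fekszik.",     # Budapest as a place
        "Budapest új rendeletet fogadott el.",  # Budapest as an institution
    ]:
        doc = nlp(text)
        print([(ent.text, ent.label_) for ent in doc.ents])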

Can I use HuSpaCy for my commercial software?

Yes. The tool is licensed under the Apache 2.0 license, while all the models are released under CC BY-SA 4.0.

