Skip to content

Extracting noun phrases#

Even though, spaCy's rule-based noun chunking facilities are not yet supported, you can still extract noun phrases using the Berkeley Neural Parser (benepar).

Install dependencies#

First you'll need to install the tool by issuing pip install benepar or pip install huspacy[np]

Usage#

Then, benepar models should be downloaded and added to a HuSpaCy pipeline:

import benepar
import os

# Workaround for incompatible protobuf versions
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"]="python"

benepar.download('benepar_hu2')

nlp.add_pipe("benepar", config={"model": "benepar_hu2"})

One can use this simple method to extract maximal noun phrase:

from spacy.tokens import Span
from typing import *

def extract_max_np(span: Span) -> Iterable[Span]:
  if "NP" in span._.labels:
    yield span
  else:
    for child in span._.children:
      yield from extract_max_np(child)

Then

doc = nlp("Ők korábban népszavazási kérdéseket jelentettek be, és azt ígérik, folytatják.")

for sent in doc.sents:
  for np in extract_max_np(sent):
    print(np)

prints:

Ők
népszavazási kérdéseket
azt