Processing pipelines
ADVANCED NLP WITH SPACY
Ines Montani
spaCy core developer
What happens when you call nlp?
doc = nlp("This is a sentence.")
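Under the hood, this call first runs the tokenizer to create a Doc, then applies each pipeline component to the Doc in order. A rough pure-Python sketch of that flow (make_nlp, the dict-based "doc" and the counter component are toy stand-ins for illustration, not spaCy's actual API):

```python
# Toy sketch of what calling nlp(text) does: tokenize first,
# then pass the doc through each component in pipeline order.
def make_nlp(tokenizer, components):
    def nlp(text):
        doc = tokenizer(text)              # text -> Doc
        for name, component in components:
            doc = component(doc)           # each component modifies and returns the doc
        return doc
    return nlp

# Toy stand-ins: a whitespace "tokenizer" and a component that counts tokens
toy_nlp = make_nlp(
    tokenizer=lambda text: {'tokens': text.split(), 'n_tokens': None},
    components=[('counter', lambda doc: {**doc, 'n_tokens': len(doc['tokens'])})],
)

doc = toy_nlp("This is a sentence.")
print(doc['n_tokens'])  # 4
```

The real `Language.__call__` follows this same shape: tokenizer output is handed from component to component, each of which returns the (possibly modified) Doc.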
Built-in pipeline components
Name     Description              Creates
tagger   Part-of-speech tagger    Token.tag
parser   Dependency parser        Token.dep, Token.head, Doc.sents, Doc.noun_chunks
ner      Named entity recognizer  Doc.ents, Token.ent_iob, Token.ent_type
textcat  Text classifier          Doc.cats
Under the hood
Pipeline defined in the model's meta.json, in order
Built-in components need binary data to make predictions
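The meta.json of a model package lists the pipeline components by name, roughly like this (fields abbreviated; exact contents vary by model):

```json
{
  "lang": "en",
  "name": "core_web_sm",
  "pipeline": ["tagger", "parser", "ner"]
}
```

When you load the model, spaCy reads this list, creates each named component, and loads its binary weights from the package.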
Pipeline attributes
nlp.pipe_names : list of pipeline component names
print(nlp.pipe_names)
['tagger', 'parser', 'ner']
nlp.pipeline : list of (name, component) tuples
print(nlp.pipeline)
[('tagger', <spacy.pipeline.Tagger>),
('parser', <spacy.pipeline.DependencyParser>),
('ner', <spacy.pipeline.EntityRecognizer>)]
Let's practice!
Custom pipeline components
Why custom components?
Make a function execute automatically when you call nlp
Add your own metadata to documents and tokens
Update built-in attributes like doc.ents
Anatomy of a component (1)
Function that takes a doc, modifies it and returns it
Can be added using the nlp.add_pipe method
def custom_component(doc):
    # Do something to the doc here
    return doc
nlp.add_pipe(custom_component)
Anatomy of a component (2)
def custom_component(doc):
    # Do something to the doc here
    return doc
nlp.add_pipe(custom_component)
Argument  Description            Example
last      If True, add last      nlp.add_pipe(component, last=True)
first     If True, add first     nlp.add_pipe(component, first=True)
before    Add before component   nlp.add_pipe(component, before='ner')
after     Add after component    nlp.add_pipe(component, after='tagger')
Example: a simple component (1)
# Create the nlp object
nlp = spacy.load('en_core_web_sm')
# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc
# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)
# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)
Pipeline: ['custom_component', 'tagger', 'parser', 'ner']
Example: a simple component (2)
# Create the nlp object
nlp = spacy.load('en_core_web_sm')
# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc
# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)
# Process a text
doc = nlp("Hello world!")
Doc length: 3
Let's practice!
Extension attributes
Setting custom attributes
Add custom metadata to documents, tokens and spans
Accessible via the ._ property
doc._.title = 'My document'
token._.is_color = True
span._.has_color = False
Registered on the global Doc, Token or Span using the set_extension method
# Import global classes
from spacy.tokens import Doc, Token, Span
# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)
Extension attribute types
1. Attribute extensions
2. Property extensions
3. Method extensions
Attribute extensions
Set a default value that can be overwritten
from spacy.tokens import Token
# Set extension on the Token with default value
Token.set_extension('is_color', default=False)
doc = nlp("The sky is blue.")
# Overwrite extension attribute value
doc[3]._.is_color = True
Property extensions (1)
Define a getter and an optional setter function
Getter only called when you retrieve the attribute value
from spacy.tokens import Token
# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors
# Set extension on the Token with getter
Token.set_extension('is_color', getter=get_is_color)
doc = nlp("The sky is blue.")
print(doc[3]._.is_color, '-', doc[3].text)
True - blue
Property extensions (2)
Span extensions should almost always use a getter
from spacy.tokens import Span
# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)
# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color)
doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)
True - sky is blue
False - The sky
Method extensions
Assign a function that becomes available as an object method
Lets you pass arguments to the extension function
from spacy.tokens import Doc
# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc
# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token)
doc = nlp("The sky is blue.")
print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')
True - blue
False - cloud
Let's practice!
Scaling and performance
Processing large volumes of text
Use nlp.pipe method
Processes texts as a stream, yields Doc objects
Much faster than calling nlp on each text
BAD:
docs = [nlp(text) for text in LOTS_OF_TEXTS]
GOOD:
docs = list(nlp.pipe(LOTS_OF_TEXTS))
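The speedup comes from batching and streaming: nlp.pipe buffers texts, processes them in batches, and yields Docs lazily instead of building one Doc per call. A toy pure-Python sketch of that pattern (pipe_sketch, the batch size and the uppercase "model" are illustrative, not spaCy's implementation):

```python
def pipe_sketch(process_batch, texts, batch_size=2):
    """Yield results lazily, handing the model whole batches of texts
    at once instead of calling it once per text."""
    batch = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield from process_batch(batch)  # one model call per batch
            batch = []
    if batch:
        yield from process_batch(batch)      # flush the final partial batch

# Toy "model": uppercases a whole batch in one call
results = list(pipe_sketch(lambda b: [t.upper() for t in b],
                           ["one", "two", "three"]))
print(results)  # ['ONE', 'TWO', 'THREE']
```

Because the real nlp.pipe is a generator, you can also iterate over it directly (`for doc in nlp.pipe(texts): ...`) without materializing the whole list.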
Passing in context (1)
Setting as_tuples=True on nlp.pipe lets you pass in (text, context) tuples
Yields (doc, context) tuples
Useful for associating metadata with the doc
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])
This is a text 15
And another text 16
Passing in context (2)
from spacy.tokens import Doc
Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']
Using only the tokenizer
Don't run the whole pipeline!
Using only the tokenizer (2)
Use nlp.make_doc to turn a text into a Doc object
BAD:
doc = nlp("Hello world!")
GOOD:
doc = nlp.make_doc("Hello world!")
Disabling pipeline components
Use nlp.disable_pipes to temporarily disable one or more pipes
# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)
Restores them after the with block
Only runs the remaining components
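The restore-after-the-block behavior is ordinary context-manager semantics: the disabled components are removed on enter and put back on exit. A toy sketch of that mechanic (the DisablePipes class and the tuple-list pipeline are illustrative, not spaCy's internals):

```python
class DisablePipes:
    """Toy context manager: removes the named components on enter
    and restores them on exit, like nlp.disable_pipes."""
    def __init__(self, pipeline, *names):
        self.pipeline = pipeline
        self.names = set(names)

    def __enter__(self):
        # Stash the disabled components, keep only the rest
        self.disabled = [(n, c) for n, c in self.pipeline if n in self.names]
        self.pipeline[:] = [(n, c) for n, c in self.pipeline if n not in self.names]
        return self

    def __exit__(self, *exc):
        # Put the stashed components back
        self.pipeline.extend(self.disabled)

pipeline = [('tagger', None), ('parser', None), ('ner', None)]
with DisablePipes(pipeline, 'tagger', 'parser'):
    print([n for n, _ in pipeline])  # ['ner']
print(len(pipeline))  # 3
```

Inside the block only 'ner' remains; once the block exits, all three components are back in the pipeline.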
Let's practice!