Processing pipelines
ADVANCED NLP WITH SPACY
Ines Montani
spaCy core developer
What happens when you call nlp?
doc = nlp("This is a sentence.")
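Under the hood, this call first runs the tokenizer to create a Doc, then applies each pipeline component to the Doc in order. A rough pure-Python sketch of that flow (make_nlp, the dict-based "doc" and the counter component are toy stand-ins for illustration, not spaCy's actual API):

```python
# Toy sketch of what calling nlp(text) does: tokenize first,
# then pass the doc through each component in pipeline order.
def make_nlp(tokenizer, components):
    def nlp(text):
        doc = tokenizer(text)              # text -> Doc
        for name, component in components:
            doc = component(doc)           # each component modifies and returns the doc
        return doc
    return nlp

# Toy stand-ins: a whitespace "tokenizer" and a component that counts tokens
toy_nlp = make_nlp(
    tokenizer=lambda text: {'tokens': text.split(), 'n_tokens': None},
    components=[('counter', lambda doc: {**doc, 'n_tokens': len(doc['tokens'])})],
)

doc = toy_nlp("This is a sentence.")
print(doc['n_tokens'])  # 4
```

The real `Language.__call__` follows this same shape: tokenizer output is handed from component to component, each of which returns the (possibly modified) Doc.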
Built-in pipeline components
Name     Description              Creates
tagger   Part-of-speech tagger    Token.tag
parser   Dependency parser        Token.dep, Token.head, Doc.sents, Doc.noun_chunks
ner      Named entity recognizer  Doc.ents, Token.ent_iob, Token.ent_type
textcat  Text classifier          Doc.cats
Under the hood
Pipeline defined in the model's meta.json, in order
Built-in components need binary data to make predictions
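The meta.json of a model package lists the pipeline components by name, roughly like this (fields abbreviated; exact contents vary by model):

```json
{
  "lang": "en",
  "name": "core_web_sm",
  "pipeline": ["tagger", "parser", "ner"]
}
```

When you load the model, spaCy reads this list, creates each named component, and loads its binary weights from the package.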
Pipeline attributes
nlp.pipe_names : list of pipeline component names
print(nlp.pipe_names)
['tagger', 'parser', 'ner']
nlp.pipeline : list of (name, component) tuples
print(nlp.pipeline)
[('tagger', <spacy.pipeline.Tagger>),
('parser', <spacy.pipeline.DependencyParser>),
('ner', <spacy.pipeline.EntityRecognizer>)]
Let's practice!
Custom pipeline components
Why custom components?
Make a function execute automatically when you call nlp
Add your own metadata to documents and tokens
Update built-in attributes like doc.ents
Anatomy of a component (1)
Function that takes a doc, modifies it and returns it
Can be added using the nlp.add_pipe method
def custom_component(doc):
    # Do something to the doc here
    return doc
nlp.add_pipe(custom_component)
Anatomy of a component (2)
def custom_component(doc):
    # Do something to the doc here
    return doc
nlp.add_pipe(custom_component)
Argument  Description            Example
last      If True, add last      nlp.add_pipe(component, last=True)
first     If True, add first     nlp.add_pipe(component, first=True)
before    Add before component   nlp.add_pipe(component, before='ner')
after     Add after component    nlp.add_pipe(component, after='tagger')
Example: a simple component (1)
# Create the nlp object
nlp = spacy.load('en_core_web_sm')
# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc
# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)
# Print the pipeline component names
print('Pipeline:', nlp.pipe_names)
Pipeline: ['custom_component', 'tagger', 'parser', 'ner']
Example: a simple component (2)
# Create the nlp object
nlp = spacy.load('en_core_web_sm')
# Define a custom component
def custom_component(doc):
    # Print the doc's length
    print('Doc length:', len(doc))
    # Return the doc object
    return doc
# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)
# Process a text
doc = nlp("Hello world!")
Doc length: 3
Let's practice!
Extension attributes
Setting custom attributes
Add custom metadata to documents, tokens and spans
Accessible via the ._ property
doc._.title = 'My document'
token._.is_color = True
span._.has_color = False
Registered on the global Doc, Token or Span using the set_extension method
# Import global classes
from spacy.tokens import Doc, Token, Span
# Set extensions on the Doc, Token and Span
Doc.set_extension('title', default=None)
Token.set_extension('is_color', default=False)
Span.set_extension('has_color', default=False)
Extension attribute types
1. Attribute extensions
2. Property extensions
3. Method extensions
Attribute extensions
Set a default value that can be overwritten
from spacy.tokens import Token
# Set extension on the Token with default value
Token.set_extension('is_color', default=False)
doc = nlp("The sky is blue.")
# Overwrite extension attribute value
doc[3]._.is_color = True
Property extensions (1)
Define a getter and an optional setter function
Getter only called when you retrieve the attribute value
from spacy.tokens import Token
# Define getter function
def get_is_color(token):
    colors = ['red', 'yellow', 'blue']
    return token.text in colors
# Set extension on the Token with getter
Token.set_extension('is_color', getter=get_is_color)
doc = nlp("The sky is blue.")
print(doc[3]._.is_color, '-', doc[3].text)
True - blue
Property extensions (2)
Span extensions should almost always use a getter
from spacy.tokens import Span
# Define getter function
def get_has_color(span):
    colors = ['red', 'yellow', 'blue']
    return any(token.text in colors for token in span)
# Set extension on the Span with getter
Span.set_extension('has_color', getter=get_has_color)
doc = nlp("The sky is blue.")
print(doc[1:4]._.has_color, '-', doc[1:4].text)
print(doc[0:2]._.has_color, '-', doc[0:2].text)
True - sky is blue
False - The sky
Method extensions
Assign a function that becomes available as an object method
Lets you pass arguments to the extension function
from spacy.tokens import Doc
# Define method with arguments
def has_token(doc, token_text):
    in_doc = token_text in [token.text for token in doc]
    return in_doc
# Set extension on the Doc with method
Doc.set_extension('has_token', method=has_token)
doc = nlp("The sky is blue.")
print(doc._.has_token('blue'), '- blue')
print(doc._.has_token('cloud'), '- cloud')
True - blue
False - cloud
Let's practice!
Scaling and performance
Processing large volumes of text
Use nlp.pipe method
Processes texts as a stream, yields Doc objects
Much faster than calling nlp on each text
BAD:
docs = [nlp(text) for text in LOTS_OF_TEXTS]
GOOD:
docs = list(nlp.pipe(LOTS_OF_TEXTS))
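The speedup comes from batching and streaming: nlp.pipe buffers texts, processes them in batches, and yields Docs lazily instead of building one Doc per call. A toy pure-Python sketch of that pattern (pipe_sketch, the batch size and the uppercase "model" are illustrative, not spaCy's implementation):

```python
def pipe_sketch(process_batch, texts, batch_size=2):
    """Yield results lazily, handing the model whole batches of texts
    at once instead of calling it once per text."""
    batch = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield from process_batch(batch)  # one model call per batch
            batch = []
    if batch:
        yield from process_batch(batch)      # flush the final partial batch

# Toy "model": uppercases a whole batch in one call
results = list(pipe_sketch(lambda b: [t.upper() for t in b],
                           ["one", "two", "three"]))
print(results)  # ['ONE', 'TWO', 'THREE']
```

Because the real nlp.pipe is a generator, you can also iterate over it directly (`for doc in nlp.pipe(texts): ...`) without materializing the whole list.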
Passing in context (1)
Setting as_tuples=True on nlp.pipe lets you pass in (text, context) tuples
Yields (doc, context) tuples
Useful for associating metadata with the doc
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context['page_number'])
This is a text 15
And another text 16
Passing in context (2)
from spacy.tokens import Doc
Doc.set_extension('id', default=None)
Doc.set_extension('page_number', default=None)
data = [
    ('This is a text', {'id': 1, 'page_number': 15}),
    ('And another text', {'id': 2, 'page_number': 16}),
]

for doc, context in nlp.pipe(data, as_tuples=True):
    doc._.id = context['id']
    doc._.page_number = context['page_number']
Using only the tokenizer
Don't run the whole pipeline!
Using only the tokenizer (2)
Use nlp.make_doc to turn a text into a Doc object
BAD:
doc = nlp("Hello world!")
GOOD:
doc = nlp.make_doc("Hello world!")
Disabling pipeline components
Use nlp.disable_pipes to temporarily disable one or more pipes
# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    # Process the text and print the entities
    doc = nlp(text)
    print(doc.ents)
Restores them after the with block
Only runs the remaining components
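The restore-after-the-block behavior is ordinary context-manager semantics: the disabled components are removed on enter and put back on exit. A toy sketch of that mechanic (the DisablePipes class and the tuple-list pipeline are illustrative, not spaCy's internals):

```python
class DisablePipes:
    """Toy context manager: removes the named components on enter
    and restores them on exit, like nlp.disable_pipes."""
    def __init__(self, pipeline, *names):
        self.pipeline = pipeline
        self.names = set(names)

    def __enter__(self):
        # Stash the disabled components, keep only the rest
        self.disabled = [(n, c) for n, c in self.pipeline if n in self.names]
        self.pipeline[:] = [(n, c) for n, c in self.pipeline if n not in self.names]
        return self

    def __exit__(self, *exc):
        # Put the stashed components back
        self.pipeline.extend(self.disabled)

pipeline = [('tagger', None), ('parser', None), ('ner', None)]
with DisablePipes(pipeline, 'tagger', 'parser'):
    print([n for n, _ in pipeline])  # ['ner']
print(len(pipeline))  # 3
```

Inside the block only 'ner' remains; once the block exits, all three components are back in the pipeline.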
Let's practice!