The hard numbers for spaCy and how it compares to other tools
spaCy is a free, open-source library for advanced Natural Language
Processing (NLP) in Python. It’s designed specifically for production use
and helps you build applications that process and “understand” large volumes of
text. It can be used to build information extraction or natural language
understanding systems.
Multi-task learning with pretrained transformers like BERT
Pretrained word vectors
State-of-the-art speed
Production-ready training system
Linguistically-motivated tokenization
Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
Easily extensible with custom components and attributes
Support for custom models in PyTorch, TensorFlow and other frameworks
Built-in visualizers for syntax and NER
Easy model packaging, deployment and workflow management
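A few of the features listed above can be illustrated without downloading a trained pipeline. The following is a minimal sketch, assuming spaCy v3 is installed: it uses a blank English pipeline (tokenizer only), adds the built-in rule-based `sentencizer`, and registers a small custom component that writes to a custom attribute. The component name `word_counter` and the attribute `n_words` are hypothetical names chosen for this illustration; they are not part of spaCy.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute, registered for this illustration only.
if not Doc.has_extension("n_words"):
    Doc.set_extension("n_words", default=0)


@Language.component("word_counter")  # hypothetical component name
def word_counter(doc):
    # Count tokens that are neither punctuation nor whitespace.
    doc._.n_words = sum(1 for t in doc if not (t.is_punct or t.is_space))
    return doc


# A blank English pipeline contains only the rule-based tokenizer,
# so no trained components or model downloads are involved.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")   # built-in rule-based sentence segmentation
nlp.add_pipe("word_counter")  # the custom component defined above

doc = nlp("spaCy isn't a research library. It's built for production use.")
tokens = [t.text for t in doc]           # contractions are split into tokens
sentences = [s.text for s in doc.sents]  # sentence spans from the sentencizer

print(tokens)
print(sentences)
print(doc._.n_words)
```

Components such as named entity recognition or dependency parsing need a trained pipeline and are not shown here.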
I’m a beginner and just getting started with NLP. – spaCy makes it easy
to get started and comes with extensive documentation, including a
beginner-friendly 101 guide, a free interactive
online course and a range of
video tutorials.
I want to build an end-to-end production application. – spaCy is
specifically designed for production use and lets you build and train powerful
NLP pipelines and package them for easy deployment.
I want my application to be efficient on GPU and CPU. – While spaCy
lets you train modern NLP models that are best run on GPU, it also offers
CPU-optimized pipelines, which are less accurate but much cheaper to run.
I want to try out different neural network architectures for NLP. –
spaCy lets you customize and swap out the model architectures powering its
components, and implement your own using a framework like PyTorch or
TensorFlow. The declarative configuration system makes it easy to mix and
match functions and keep track of your hyperparameters to make sure your
experiments are reproducible.
I want to build a language generation application. – spaCy’s focus is
natural language processing and extracting information from large volumes of
text. While you can use it to help you re-write existing text, it doesn’t
include any specific functionality for language generation tasks.
I want to research machine learning algorithms. – spaCy is built on the
latest research, but it’s not a research library. If your goal is to write
papers and run benchmarks, spaCy is probably not a good choice. However, you
can use it to make the results of your research easily available for others to
use, e.g. via a custom spaCy component.
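To give a rough idea of the declarative configuration system mentioned above, the sketch below inspects the config that spaCy maintains for a pipeline. It assumes spaCy v3 and uses a blank pipeline with one built-in component. A real training config contains many more sections (model architectures, optimizers, hyperparameters) than this small example shows.

```python
import spacy

# Build a tiny pipeline; the config is kept in sync with its components.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

config = nlp.config

# The [nlp] section lists the pipeline components in order.
pipeline_names = config["nlp"]["pipeline"]

# Each component has its own block recording the factory that created it
# and the settings it was created with.
sentencizer_settings = config["components"]["sentencizer"]

# The whole config can be serialized to text, which makes it easy to
# store alongside an experiment.
config_text = config.to_str()

print(pipeline_names)
print(sentencizer_settings["factory"])
```

Because every setting lives in one serializable structure, two experiments can be compared by diffing their config files.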
spaCy v3.0 introduces transformer-based pipelines that bring spaCy’s accuracy
right up to the current state of the art. You can also use a CPU-optimized
pipeline, which is less accurate but much cheaper to run.
We compare the speed of different NLP libraries, measured in words per second
(WPS); higher is better. The evaluation was performed on 10,000 Reddit
comments.
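For readers who want to see what a words-per-second figure means in practice, here is a minimal sketch of how such a number could be computed. It is not the methodology behind the comparison above. The helper `words_per_second`, the stand-in `whitespace_tokenize` function, and the placeholder texts are all assumptions made for illustration, and counting words from the tool's own output is one possible convention among several.

```python
import time


def words_per_second(texts, process):
    """Return how many words per second `process` handles over `texts`.

    `process` is any callable that takes a string and returns a sequence
    of tokens; the length of that sequence is counted as the word count.
    """
    start = time.perf_counter()
    n_words = 0
    for text in texts:
        n_words += len(process(text))
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return n_words / elapsed if elapsed > 0 else float("inf")


def whitespace_tokenize(text):
    # Hypothetical stand-in for a real NLP pipeline.
    return text.split()


# Placeholder data, not the Reddit comments used in the evaluation.
sample_texts = ["This is a short placeholder comment about NLP."] * 10_000

wps = words_per_second(sample_texts, whitespace_tokenize)
print(f"{wps:,.0f} words per second")
```

To time an actual library, pass a function that runs its pipeline and returns the resulting tokens in place of `whitespace_tokenize`.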