

Facts & Figures

The hard numbers for spaCy and how it compares to other tools

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It’s designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems.
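
To give a flavor of the API, here is a minimal usage sketch; it assumes the small English pipeline en_core_web_sm has already been downloaded (for example via python -m spacy download en_core_web_sm):

    import spacy

    # Load a trained pipeline (assumes en_core_web_sm has been downloaded first).
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

    # Named entities predicted by the pipeline
    for ent in doc.ents:
        print(ent.text, ent.label_)

    # Part-of-speech tags and syntactic dependencies for each token
    for token in doc:
        print(token.text, token.pos_, token.dep_, token.head.text)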

Feature overview

  • Support for 75+ languages
  • 84 trained pipelines for 25 languages
  • Multi-task learning with pretrained transformers like BERT
  • Pretrained word vectors
  • State-of-the-art speed
  • Production-ready training system
  • Linguistically-motivated tokenization
  • Components for named entity recognition, part-of-speech tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
  • Easily extensible with custom components and attributes (see the sketch after this list)
  • Support for custom models in PyTorch, TensorFlow and other frameworks
  • Built-in visualizers for syntax and NER
  • Easy model packaging, deployment and workflow management
  • Robust, rigorously evaluated accuracy
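
As a small illustration of the custom components and attributes mentioned above, the sketch below registers a hypothetical component that flags documents mentioning monetary entities; the component name money_flagger and the attribute has_money are invented for this example:

    import spacy
    from spacy.language import Language
    from spacy.tokens import Doc

    # Register a custom document-level attribute (the name "has_money" is
    # made up for this sketch).
    Doc.set_extension("has_money", default=False)

    @Language.component("money_flagger")
    def money_flagger(doc):
        # Flag documents that mention a monetary entity.
        doc._.has_money = any(ent.label_ == "MONEY" for ent in doc.ents)
        return doc

    nlp = spacy.load("en_core_web_sm")  # assumes the pipeline is installed
    nlp.add_pipe("money_flagger", after="ner")

    doc = nlp("The company raised $5 million.")
    print(doc._.has_money)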

When should I use spaCy?

  • I’m a beginner and just getting started with NLP. – spaCy makes it easy to get started and comes with extensive documentation, including a beginner-friendly 101 guide, a free interactive online course and a range of video tutorials.
  • I want to build an end-to-end production application. – spaCy is specifically designed for production use and lets you build and train powerful NLP pipelines and package them for easy deployment.
  • I want my application to be efficient on GPU and CPU. – While spaCy lets you train modern NLP models that are best run on GPU, it also offers CPU-optimized pipelines, which are less accurate but much cheaper to run (see the sketch after this list).
  • I want to try out different neural network architectures for NLP. – spaCy lets you customize and swap out the model architectures powering its components, and implement your own using a framework like PyTorch or TensorFlow. The declarative configuration system makes it easy to mix and match functions and keep track of your hyperparameters to make sure your experiments are reproducible.
  • I want to build a language generation application. – spaCy’s focus is natural language processing and extracting information from large volumes of text. While you can use it to help you re-write existing text, it doesn’t include any specific functionality for language generation tasks.
  • I want to research machine learning algorithms. – spaCy is built on the latest research, but it’s not a research library. If your goal is to write papers and run benchmarks, spaCy is probably not a good choice. However, you can use it to make the results of your research easily available for others to use, e.g. via a custom spaCy component.
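
As a rough sketch of the GPU/CPU trade-off described above, the snippet below picks the transformer pipeline when a GPU is available and otherwise falls back to a CPU-optimized pipeline; it assumes both en_core_web_trf and en_core_web_sm are installed:

    import spacy

    # Use the GPU if one is available; otherwise stay on CPU.
    gpu_available = spacy.prefer_gpu()

    if gpu_available:
        nlp = spacy.load("en_core_web_trf")  # transformer-based, most accurate
    else:
        nlp = spacy.load("en_core_web_sm")   # CPU-optimized, cheaper to run

    for doc in nlp.pipe(["First text.", "Second text."]):
        print([(ent.text, ent.label_) for ent in doc.ents])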

Benchmarks

spaCy v3.0 introduces transformer-based pipelines that bring spaCy’s accuracy right up to current state-of-the-art. You can also use a CPU-optimized pipeline, which is less accurate but much cheaper to run.

Pipeline                     Parser   Tagger   NER
en_core_web_trf (spaCy v3)   95.1     97.8     89.8
en_core_web_lg (spaCy v3)    92.0     97.4     85.5
en_core_web_lg (spaCy v2)    91.9     97.2     85.5

Full pipeline accuracy on the OntoNotes 5.0 corpus (reported on the development set).

Named Entity Recognition System   OntoNotes   CoNLL ’03
spaCy RoBERTa (2020)              89.8        91.6
Stanza (StanfordNLP) [1]          88.8        92.1
Flair [2]                         89.7        93.1

Named entity recognition accuracy on the OntoNotes 5.0 and CoNLL-2003 corpora. See NLP-progress for more results. Project template: benchmarks/ner_conll03. [1] Qi et al. (2020). [2] Akbik et al. (2018).

Dependency Parsing System   UAS    LAS
spaCy RoBERTa (2020)        95.1   93.7
Mrini et al. (2019)         97.4   96.3
Zhou and Zhao (2019)        97.2   95.7

Dependency parsing accuracy on the Penn Treebank. See NLP-progress for more results. Project template: benchmarks/parsing_penn_treebank.

Speed comparison

We compare the speed of different NLP libraries, measured in words per second (WPS); higher is better. The evaluation was performed on 10,000 Reddit comments. A rough sketch of how such a measurement can be made follows the table.

Library   Pipeline                  WPS CPU   WPS GPU
spaCy     en_core_web_lg            10,014    14,954
spaCy     en_core_web_trf           684       3,768
Stanza    en_ewt                    878       2,180
Flair     pos(-fast) & ner(-fast)   323       1,184
UDPipe    english-ewt-ud-2.5        1,101     n/a

End-to-end processing speed on raw unannotated text. Project template: benchmarks/speed.
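
For reference, a simple way to estimate words-per-second throughput for a spaCy pipeline could look like the sketch below; the pipeline name and the sample texts are placeholders, not the exact setup used for the table above:

    import time
    import spacy

    nlp = spacy.load("en_core_web_lg")  # assumes the pipeline is installed
    texts = ["This is a sample comment."] * 1000  # substitute your own raw texts

    start = time.perf_counter()
    n_words = 0
    for doc in nlp.pipe(texts, batch_size=256):
        n_words += len(doc)
    elapsed = time.perf_counter() - start

    print(f"{n_words / elapsed:,.0f} words per second")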