diff --git a/doc/datasets/kddcup99.rst b/doc/datasets/kddcup99.rst
index fadc41c85c3be..407b2d8e2c0bf 100644
--- a/doc/datasets/kddcup99.rst
+++ b/doc/datasets/kddcup99.rst
@@ -12,11 +12,11 @@ generated using a closed network and hand-injected attacks to produce a
large number of different types of attack with normal activity in the
background. As the initial goal was to produce a large training set for
supervised learning algorithms, there is a large proportion (80.1%) of
-abnormal data which is unrealistic in real world, and inapropriate for
+abnormal data which is unrealistic in the real world, and inappropriate for
unsupervised anomaly detection which aims at detecting 'abnormal' data, ie
1) qualitatively different from normal data
2) in large minority among the observations.
-We thus transform the KDD Data set into two differents data set: SA and SF.
+We thus transform the KDD Data set into two different data sets: SA and SF.
-SA is obtained by simply selecting all the normal data, and a small
-proportion of abnormal data to gives an anomaly proportion of 1%.
+proportion of abnormal data to give an anomaly proportion of 1%.
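For reference, a minimal sketch of loading the SA subset described above, assuming the standard ``sklearn.datasets.fetch_kddcup99`` loader::

    from sklearn.datasets import fetch_kddcup99

    # SA keeps all normal traffic plus enough abnormal records for ~1% anomalies
    sa = fetch_kddcup99(subset='SA', percent10=True)
    print(sa.data.shape, sa.target.shape)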
diff --git a/doc/datasets/labeled_faces.rst b/doc/datasets/labeled_faces.rst
index 5d79f89e81c04..0e70aca8aa705 100644
--- a/doc/datasets/labeled_faces.rst
+++ b/doc/datasets/labeled_faces.rst
@@ -29,11 +29,11 @@ Usage
``scikit-learn`` provides two loaders that will automatically download,
cache, parse the metadata files, decode the jpeg and convert the
-interesting slices into memmaped numpy arrays. This dataset size is more
+interesting slices into memmapped numpy arrays. This dataset size is more
than 200 MB. The first load typically takes more than a couple of minutes
to fully decode the relevant part of the JPEG files into numpy arrays. If
-the dataset has been loaded once, the following times the loading times
-less than 200ms by using a memmaped version memoized on the disk in the
+the dataset has been loaded once, the following times the loading takes
+less than 200ms by using a memmapped version memoized on the disk in the
``~/scikit_learn_data/lfw_home/`` folder using ``joblib``.
The first loader is used for the Face Identification task: a multi-class
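A minimal sketch of that first loader, ``fetch_lfw_people`` (the parameter values below are illustrative)::

    from sklearn.datasets import fetch_lfw_people

    # the first call downloads ~200 MB and decodes the JPEGs; later calls
    # reuse the memmapped arrays cached under ~/scikit_learn_data/lfw_home/
    lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
    print(lfw_people.images.shape, lfw_people.target_names)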
diff --git a/doc/modules/calibration.rst b/doc/modules/calibration.rst
index 9762414ac8cc0..18c3cfdd8366f 100644
--- a/doc/modules/calibration.rst
+++ b/doc/modules/calibration.rst
@@ -56,7 +56,7 @@ with different biases per method:
than 0 for this case, thus moving the average prediction of the bagged
ensemble away from 0. We observe this effect most strongly with random
forests because the base-level trees trained with random forests have
- relatively high variance due to feature subseting." As a result, the
+ relatively high variance due to feature subsetting." As a result, the
calibration curve also referred to as the reliability diagram (Wilks 1995 [5]_) shows a
characteristic sigmoid shape, indicating that the classifier could trust its
"intuition" more and return probabilties closer to 0 or 1 typically.
@@ -78,7 +78,7 @@ The class :class:`CalibratedClassifierCV` uses a cross-validation generator and
estimates for each split the model parameter on the train samples and the
calibration of the test samples. The probabilities predicted for the
folds are then averaged. Already fitted classifiers can be calibrated by
-:class:`CalibratedClassifierCV` via the paramter cv="prefit". In this case,
+:class:`CalibratedClassifierCV` via the parameter cv="prefit". In this case,
the user has to take care manually that data for model fitting and calibration
are disjoint.
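A minimal sketch of the ``cv="prefit"`` usage described above, keeping the fitting and calibration data disjoint (the synthetic data and split are illustrative only)::

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=1000, random_state=0)
    # data used to fit the classifier and data used to calibrate it are disjoint
    X_fit, X_calib, y_fit, y_calib = train_test_split(X, y, random_state=0)

    clf = GaussianNB().fit(X_fit, y_fit)
    calibrated = CalibratedClassifierCV(clf, cv="prefit").fit(X_calib, y_calib)
    probas = calibrated.predict_proba(X_calib[:5])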
diff --git a/doc/modules/gaussian_process.rst b/doc/modules/gaussian_process.rst
index 94cca8999e489..1937e3897444a 100644
--- a/doc/modules/gaussian_process.rst
+++ b/doc/modules/gaussian_process.rst
@@ -280,7 +280,7 @@ of the dataset, this might be considerably faster. However, note that
"one_vs_one" does not support predicting probability estimates but only plain
predictions. Moreover, note that :class:`GaussianProcessClassifier` does not
(yet) implement a true multi-class Laplace approximation internally, but
-as discussed aboved is based on solving several binary classification tasks
+as discussed above is based on solving several binary classification tasks
internally, which are combined using one-versus-rest or one-versus-one.
GPC examples
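A rough sketch of the ``multi_class`` choice discussed above, on a small built-in dataset::

    from sklearn.datasets import load_iris
    from sklearn.gaussian_process import GaussianProcessClassifier

    X, y = load_iris(return_X_y=True)

    # one-vs-one fits one binary latent GP per pair of classes; it supports
    # predict() but, as noted above, not probability estimates
    gpc = GaussianProcessClassifier(multi_class="one_vs_one").fit(X, y)
    labels = gpc.predict(X[:5])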
diff --git a/doc/modules/manifold.rst b/doc/modules/manifold.rst
index c8c5910136db8..2586daffa2e27 100644
--- a/doc/modules/manifold.rst
+++ b/doc/modules/manifold.rst
@@ -558,7 +558,7 @@ descent will get stuck in a bad local minimum. If it is too high the KL
divergence will increase during optimization. More tips can be found in
Laurens van der Maaten's FAQ (see references). The last parameter, angle,
is a tradeoff between performance and accuracy. Larger angles imply that we
-can approximate larger regions by a single point,leading to better speed
+can approximate larger regions by a single point, leading to better speed
but less accurate results.
`"How to Use t-SNE Effectively" `_
diff --git a/doc/modules/multiclass.rst b/doc/modules/multiclass.rst
index 2eec94f76b1c2..93e4c1a6c36c1 100644
--- a/doc/modules/multiclass.rst
+++ b/doc/modules/multiclass.rst
@@ -367,7 +367,7 @@ classifier per target. This allows multiple target variable
classifications. The purpose of this class is to extend estimators
to be able to estimate a series of target functions (f1,f2,f3...,fn)
that are trained on a single X predictor matrix to predict a series
-of reponses (y1,y2,y3...,yn).
+of responses (y1,y2,y3...,yn).
Below is an example of multioutput classification:
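That example is not part of this hunk; a comparable sketch, where the second target is just a shuffled copy of the first purely for illustration, could look like::

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.utils import shuffle

    X, y1 = make_classification(n_samples=100, n_informative=10, random_state=0)
    y2 = shuffle(y1, random_state=1)
    Y = np.vstack((y1, y2)).T  # one column per target variable

    # one classifier is fitted per column of Y
    clf = MultiOutputClassifier(RandomForestClassifier(random_state=0)).fit(X, Y)
    predictions = clf.predict(X[:3])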
diff --git a/doc/modules/neighbors.rst b/doc/modules/neighbors.rst
index 41e628594c6b3..12d7aab7f5a46 100644
--- a/doc/modules/neighbors.rst
+++ b/doc/modules/neighbors.rst
@@ -294,7 +294,7 @@ the *KD tree* data structure (short for *K-dimensional tree*), which
generalizes two-dimensional *Quad-trees* and 3-dimensional *Oct-trees*
to an arbitrary number of dimensions. The KD tree is a binary tree
structure which recursively partitions the parameter space along the data
-axes, dividing it into nested orthotopic regions into which data points
+axes, dividing it into nested axis-aligned (orthotopic) regions into which data points
are filed. The construction of a KD tree is very fast: because partitioning
is performed only along the data axes, no :math:`D`-dimensional distances
need to be computed. Once constructed, the nearest neighbor of a query
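A minimal sketch of building and querying such a tree with :class:`sklearn.neighbors.KDTree` (the data is random, for illustration only)::

    import numpy as np
    from sklearn.neighbors import KDTree

    rng = np.random.RandomState(0)
    X = rng.random_sample((100, 3))

    # construction only splits along the data axes, so no D-dimensional
    # distances are computed while the tree is built
    tree = KDTree(X, leaf_size=30)
    dist, ind = tree.query(X[:1], k=3)  # 3 nearest neighbours of the first point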
diff --git a/doc/modules/neural_networks_unsupervised.rst b/doc/modules/neural_networks_unsupervised.rst
index 08cbf7f7f6292..262eba614c4e5 100644
--- a/doc/modules/neural_networks_unsupervised.rst
+++ b/doc/modules/neural_networks_unsupervised.rst
@@ -135,7 +135,7 @@ negative gradient, however, is intractable. Its goal is to lower the energy of
joint states that the model prefers, therefore making it stay true to the data.
It can be approximated by Markov chain Monte Carlo using block Gibbs sampling by
iteratively sampling each of :math:`v` and :math:`h` given the other, until the
-chain mixes. Samples generated in this way are sometimes refered as fantasy
+chain mixes. Samples generated in this way are sometimes referred to as fantasy
particles. This is inefficient and it is difficult to determine whether the
Markov chain mixes.
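As a rough illustration of the block Gibbs step described above, :class:`BernoulliRBM` exposes a single ``gibbs`` step (the binary data below is synthetic)::

    import numpy as np
    from sklearn.neural_network import BernoulliRBM

    rng = np.random.RandomState(0)
    X = rng.randint(2, size=(200, 16))  # binary visible units

    rbm = BernoulliRBM(n_components=8, learning_rate=0.05, n_iter=20,
                       random_state=0).fit(X)

    # one v -> h -> v block Gibbs step, as used by the persistent ("fantasy")
    # chains during training
    v_new = rbm.gibbs(X[:5])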
diff --git a/doc/modules/pipeline.rst b/doc/modules/pipeline.rst
index 232b3ed72bbda..24cef941a027d 100644
--- a/doc/modules/pipeline.rst
+++ b/doc/modules/pipeline.rst
@@ -164,7 +164,7 @@ object::
>>> # Clear the cache directory when you don't need it anymore
>>> rmtree(cachedir)
-.. warning:: **Side effect of caching transfomers**
+.. warning:: **Side effect of caching transformers**
Using a :class:`Pipeline` without cache enabled, it is possible to
inspect the original instance such as::
diff --git a/doc/modules/preprocessing.rst b/doc/modules/preprocessing.rst
index 92920553ea216..5825409f0f112 100644
--- a/doc/modules/preprocessing.rst
+++ b/doc/modules/preprocessing.rst
@@ -482,7 +482,7 @@ Then we fit the estimator, and transform a data point.
In the result, the first two numbers encode the gender, the next set of three
numbers the continent and the last four the web browser.
-Note that, if there is a possibilty that the training data might have missing categorical
+Note that, if there is a possibility that the training data might have missing categorical
features, one has to explicitly set ``n_values``. For example,
>>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
@@ -588,7 +588,7 @@ In some cases, only interaction terms among features are required, and it can be
The features of X have been transformed from :math:`(X_1, X_2, X_3)` to :math:`(1, X_1, X_2, X_3, X_1X_2, X_1X_3, X_2X_3, X_1X_2X_3)`.
-Note that polynomial features are used implicitily in `kernel methods `_ (e.g., :class:`sklearn.svm.SVC`, :class:`sklearn.decomposition.KernelPCA`) when using polynomial :ref:`svm_kernels`.
+Note that polynomial features are used implicitly in `kernel methods `_ (e.g., :class:`sklearn.svm.SVC`, :class:`sklearn.decomposition.KernelPCA`) when using polynomial :ref:`svm_kernels`.
See :ref:`sphx_glr_auto_examples_linear_model_plot_polynomial_interpolation.py` for Ridge regression using created polynomial features.
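A sketch of the implicit/explicit correspondence mentioned above; up to feature scaling, the two models below work in the same degree-2 feature space (synthetic data for illustration)::

    from sklearn.datasets import make_classification
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=100, n_features=4, random_state=0)

    # explicit expansion, then a linear kernel ...
    X_poly = PolynomialFeatures(degree=2).fit_transform(X)
    clf_explicit = SVC(kernel='linear').fit(X_poly, y)

    # ... versus the expansion done implicitly by the polynomial kernel
    clf_implicit = SVC(kernel='poly', degree=2, gamma=1, coef0=1).fit(X, y)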
diff --git a/doc/modules/scaling_strategies.rst b/doc/modules/scaling_strategies.rst
index cf105d2dd2ef0..d034ae3e11cda 100644
--- a/doc/modules/scaling_strategies.rst
+++ b/doc/modules/scaling_strategies.rst
@@ -34,7 +34,7 @@ different :ref:`feature extraction ` methods supported by
scikit-learn. However, when working with data that needs vectorization and
where the set of features or values is not known in advance one should take
explicit care. A good example is text classification where unknown terms are
-likely to be found during training. It is possible to use a statefull
+likely to be found during training. It is possible to use a stateful
vectorizer if making multiple passes over the data is reasonable from an
application point of view. Otherwise, one can turn up the difficulty by using
a stateless feature extractor. Currently the preferred way to do this is to
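A hashed, stateless vectorizer such as :class:`HashingVectorizer` is one such extractor; a minimal sketch, with toy documents and labels purely for illustration::

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    # the vectorizer needs no fit(), so unseen terms at prediction time are
    # not a problem and a single streaming pass over the data suffices
    vectorizer = HashingVectorizer(n_features=2 ** 18)
    clf = SGDClassifier()

    X_batch = vectorizer.transform(["spam spam spam", "ham and eggs"])
    clf.partial_fit(X_batch, [1, 0], classes=[0, 1])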
diff --git a/doc/modules/svm.rst b/doc/modules/svm.rst
index 62d566fe150ba..8f253437690c3 100644
--- a/doc/modules/svm.rst
+++ b/doc/modules/svm.rst
@@ -653,7 +653,7 @@ support vectors and training errors. The parameter :math:`\nu \in (0,
1]` is an upper bound on the fraction of training errors and a lower
bound of the fraction of support vectors.
-It can be shown that the :math:`\nu`-SVC formulation is a reparametrization
+It can be shown that the :math:`\nu`-SVC formulation is a reparameterization
of the :math:`C`-SVC and therefore mathematically equivalent.
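A small sketch contrasting the two parametrizations on the same synthetic data::

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC, NuSVC

    X, y = make_classification(n_samples=200, random_state=0)

    # nu upper-bounds the fraction of margin errors and lower-bounds the
    # fraction of support vectors
    clf_nu = NuSVC(nu=0.1).fit(X, y)

    # the C parametrization of the mathematically equivalent problem
    clf_c = SVC(C=1.0).fit(X, y)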
diff --git a/doc/themes/scikit-learn/static/ML_MAPS_README.rst b/doc/themes/scikit-learn/static/ML_MAPS_README.rst
index 679419bb96c38..069cc6be4de22 100644
--- a/doc/themes/scikit-learn/static/ML_MAPS_README.rst
+++ b/doc/themes/scikit-learn/static/ML_MAPS_README.rst
@@ -19,7 +19,7 @@ so I'll try to make it as simple as possible.
Use a Graphics editor like Inkscape Vector Graphics Editor
to open the ml_map.svg file, in this folder. From there
-you can move objects around, ect. as you need.
+you can move objects around, etc. as you need.
Save when done, and make sure to export a .PNG file
to replace the old-outdated ml_map.png, as that file
diff --git a/doc/tutorial/statistical_inference/unsupervised_learning.rst b/doc/tutorial/statistical_inference/unsupervised_learning.rst
index afe51320414c6..0ad16c180385c 100644
--- a/doc/tutorial/statistical_inference/unsupervised_learning.rst
+++ b/doc/tutorial/statistical_inference/unsupervised_learning.rst
@@ -155,7 +155,7 @@ that aims to build a hierarchy of clusters. In general, the various approaches
of this technique are either:
* **Agglomerative** - bottom-up approaches: each observation starts in its
- own cluster, and clusters are iterativelly merged in such a way to
+ own cluster, and clusters are iteratively merged in such a way as to
minimize a *linkage* criterion. This approach is particularly interesting
when the clusters of interest are made of only a few observations. When
the number of clusters is large, it is much more computationally efficient
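A minimal sketch of the agglomerative, bottom-up approach described in the bullet above (synthetic blobs for illustration)::

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

    # each point starts as its own cluster; clusters are then merged so as to
    # minimize the "ward" linkage criterion
    labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)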
diff --git a/doc/tutorial/text_analytics/working_with_text_data.rst b/doc/tutorial/text_analytics/working_with_text_data.rst
index d7a74d5304258..4ec53801eaea9 100644
--- a/doc/tutorial/text_analytics/working_with_text_data.rst
+++ b/doc/tutorial/text_analytics/working_with_text_data.rst
@@ -495,7 +495,7 @@ Refine the implementation and iterate until the exercise is solved.
**For each exercise, the skeleton file provides all the necessary import
statements, boilerplate code to load the data and sample code to evaluate
-the predictive accurracy of the model.**
+the predictive accuracy of the model.**
Exercise 1: Language identification