@@ -225,27 +225,30 @@ provides a starting point.

 For example,::

-    >>> np.array([1.0, 2.0, np.NA, 7.0], namasked=True)
-    array([1., 2., NA, 7.], namasked=True)
-    >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]')
+    >>> np.array([1.0, 2.0, np.NA, 7.0], maskna=True)
+    array([1., 2., NA, 7.], maskna=True)
+    >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA')
     array([1., 2., NA, 7.], dtype='NA[<f8]')
+    >>> np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f4]')
+    array([1., 2., NA, 7.], dtype='NA[<f4]')

 produce arrays with values [1.0, 2.0, <inaccessible>, 7.0] /
-mask [Unmasked, Unmasked, Masked, Unmasked], and
-values [1.0, 2.0, <NA bitpattern>, 7.0] respectively.
+mask [Exposed, Exposed, Hidden, Exposed], and
+values [1.0, 2.0, <NA bitpattern>, 7.0] for the masked and
+NA dtype versions respectively.

 It may be worth overloading the np.NA __call__ method to accept a dtype,
 returning a zero-dimensional array with a missing value of that dtype.
 Without doing this, NA printouts would look like::

-    >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], namasked=True))
-    array(NA, dtype='float64', namasked=True)
+    >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], maskna=True))
+    array(NA, dtype='float64', maskna=True)
     >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]'))
     array(NA, dtype='NA[<f8]')

 but with this, they could be printed as::

-    >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], namasked=True))
+    >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], maskna=True))
     NA('float64')
     >>> np.sum(np.array([1.0, 2.0, np.NA, 7.0], dtype='NA[f8]'))
     NA('NA[<f8]')
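Reviewer sketch (not part of the commit): the idea of a callable NA that accepts a dtype and prints compactly can be illustrated in plain Python. The class name `NAType` is hypothetical, and the sketch returns a lightweight marker object rather than the zero-dimensional array the NEP text describes; it only demonstrates the `__call__` and printing behaviour.

```python
class NAType:
    """Hypothetical stand-in for the proposed np.NA singleton."""

    def __init__(self, dtype=None):
        # dtype is kept as a plain string; None means "no dtype attached".
        self.dtype = dtype

    def __call__(self, dtype):
        # Calling NA with a dtype yields a typed NA marker.
        return NAType(dtype)

    def __repr__(self):
        return "NA" if self.dtype is None else "NA(%r)" % (self.dtype,)


NA = NAType()
print(repr(NA))             # NA
print(repr(NA("float64")))  # NA('float64')
print(repr(NA("NA[<f8]")))  # NA('NA[<f8]')
```

With a callable NA of this kind, a reduction result can be shown as a short `NA('float64')` instead of a full zero-dimensional array repr.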
@@ -274,12 +277,12 @@ from another view which doesn't have them masked. For example::

     >>> a = np.array([1,2])
     >>> b = a.view()
-    >>> b.flags.hasnamask = True
+    >>> b.flags.hasmaskna = True
     >>> b
-    array([1,2], namasked=True)
+    array([1,2], maskna=True)
     >>> b[0] = np.NA
     >>> b
-    array([NA,2], namasked=True)
+    array([NA,2], maskna=True)
     >>> a
     array([1,2])
     >>> # The underlying number 1 value in 'a[0]' was untouched
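Reviewer sketch (not part of the commit): `np.NA` and the `hasmaskna` flag are proposals in this NEP and are not assumed to exist in a released NumPy. A rough analogue of the "mask-only assignment" behaviour can be tried with the existing `numpy.ma` module, whose masked arrays can share a data buffer with a plain array. This illustrates the idea only and is not the NEP's mechanism.

```python
import numpy as np

a = np.array([1, 2])
# copy=False makes the masked array share a's data buffer.
b = np.ma.masked_array(a, mask=[False, False], copy=False)

# Hiding element 0 only flips the mask entry; the data buffer is not written.
b[0] = np.ma.masked

print(b.mask.tolist())  # [True, False]
print(a.tolist())       # [1, 2]
print(int(b.data[0]))   # 1
```

As in the NEP example, the value 1 stored at index 0 survives the assignment and stays visible through the unmasked array `a`.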
@@ -351,10 +354,10 @@ Creating Masked Arrays
 There are two flags which indicate and control the nature of the mask
 used in masked arrays.

-First is 'arr.flags.hasnamask', which is True for all masked arrays and
+First is 'arr.flags.hasmaskna', which is True for all masked arrays and
 may be set to True to add a mask to an array which does not have one.

-Second is 'arr.flags.ownnamask', which is True if the array owns the
+Second is 'arr.flags.ownmaskna', which is True if the array owns the
 memory to the mask, and False if the array has no mask, or has a view
 into the mask of another array. If this is set to False in a masked
 array, the array will create a copy of the mask so that further modifications
@@ -402,8 +405,16 @@ New functions added to the ndarray are::
         array is unmasked and has the 'NA' part stripped from the
         parameterized type ('NA[f8]' becomes just 'f8').

-    arr.view(namasked=True)
-        This is a shortcut for 'a = arr.view(); a.flags.hasnamask=True'.
+    arr.view(maskna=True)
+        This is a shortcut for
+        >>> a = arr.view()
+        >>> a.flags.hasmaskna = True
+
+    arr.view(ownmaskna=True)
+        This is a shortcut for
+        >>> a = arr.view()
+        >>> a.flags.hasmaskna = True
+        >>> a.flags.ownmaskna = True

 Element-wise UFuncs With Missing Values
 =======================================
@@ -461,21 +472,21 @@ will also use the unmasked value counts for their calculations if

 Some examples::

-    >>> a = np.array([1., 3., np.NA, 7.], namasked=True)
+    >>> a = np.array([1., 3., np.NA, 7.], maskna=True)
     >>> np.sum(a)
-    array(NA, dtype='<f8', masked=True)
+    array(NA, dtype='<f8', maskna=True)
     >>> np.sum(a, skipna=True)
     11.0
     >>> np.mean(a)
     NA('<f8')
     >>> np.mean(a, skipna=True)
     3.6666666666666665

-    >>> a = np.array([np.NA, np.NA], dtype='f8', namasked=True)
+    >>> a = np.array([np.NA, np.NA], dtype='f8', maskna=True)
     >>> np.sum(a, skipna=True)
     0.0
     >>> np.max(a, skipna=True)
-    array(NA, dtype='<f8', namasked=True)
+    array(NA, dtype='<f8', maskna=True)
     >>> np.mean(a)
     NA('<f8')
     >>> np.mean(a, skipna=True)
@@ -487,20 +498,24 @@ The functions 'np.any' and 'np.all' require some special consideration,
 just as logical_and and logical_or do. Maybe the best way to describe
 their behavior is through a series of examples::

-    >>> np.any(np.array([False, False, False], namasked=True))
+    >>> np.any(np.array([False, False, False], maskna=True))
     False
-    >>> np.any(np.array([False, NA, False], namasked=True))
+    >>> np.any(np.array([False, np.NA, False], maskna=True))
     NA
-    >>> np.any(np.array([False, NA, True], namasked=True))
+    >>> np.any(np.array([False, np.NA, True], maskna=True))
     True

-    >>> np.all(np.array([True, True, True], namasked=True))
+    >>> np.all(np.array([True, True, True], maskna=True))
     True
-    >>> np.all(np.array([True, NA, True], namasked=True))
+    >>> np.all(np.array([True, np.NA, True], maskna=True))
     NA
-    >>> np.all(np.array([False, NA, True], namasked=True))
+    >>> np.all(np.array([False, np.NA, True], maskna=True))
     False

+Since 'np.any' is the reduction for 'np.logical_or', and 'np.all'
+is the reduction for 'np.logical_and', it makes sense for them to
+have a 'skipna=' parameter like the other similar reduction functions.
+
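Reviewer sketch (not part of the commit): the six 'np.any' / 'np.all' results above follow three-valued logic, where a definite True (for any) or a definite False (for all) decides the result even when NAs are present. The plain-Python functions below reproduce those results. The names `NA`, `na_any` and `na_all` are hypothetical, and the `skipna` behaviour shown (NA elements are ignored) is an assumption based on how 'skipna=' is described for the other reductions.

```python
NA = object()  # hypothetical sentinel standing in for the proposed np.NA


def na_any(values, skipna=False):
    """Logical-or reduction where NA means 'unknown'."""
    saw_na = False
    for v in values:
        if v is NA:
            saw_na = True
        elif v:
            return True  # one definite True decides the result
    # No True was found: the answer is unknown if an NA could have been True.
    return NA if (saw_na and not skipna) else False


def na_all(values, skipna=False):
    """Logical-and reduction where NA means 'unknown'."""
    saw_na = False
    for v in values:
        if v is NA:
            saw_na = True
        elif not v:
            return False  # one definite False decides the result
    # No False was found: the answer is unknown if an NA could have been False.
    return NA if (saw_na and not skipna) else True


print(na_any([False, NA, True]))                # True
print(na_all([False, NA, True]))                # False
print(na_any([False, NA, False]) is NA)         # True
print(na_all([True, NA, True], skipna=True))    # True
```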
 Parameterized NA Data Types
 ===========================

@@ -609,14 +624,124 @@ The important part of future-proofing the design is making sure
 the C ABI-level choices and the Python API-level choices have a natural
 transition to multi-NA support. Here is one way multi-NA support could look::

-    >>> a = np.array([np.NA(1), 3, np.NA(2)], namasked='multi')
+    >>> a = np.array([np.NA(1), 3, np.NA(2)], maskna='multi')
     >>> np.sum(a)
-    NA(1)
+    NA(1, dtype='<i4')
     >>> np.sum(a[1:])
-    NA(2)
-    >>> b = np.array([np.NA, 2, 5], namasked=True)
+    NA(2, dtype='<i4')
+    >>> b = np.array([np.NA, 2, 5], maskna=True)
     >>> a + b
-    array([NA(0), 5, NA(2)], namasked='multi')
+    array([NA(0), 5, NA(2)], maskna='multi')
+
+The design of this NEP does not distinguish between NAs that come
+from an NA mask or NAs that come from an NA dtype. Both of these get
+treated equivalently in computations, with masks dominating over NA
+dtypes.::
+
+    >>> a = np.array([np.NA, 2, 5], maskna=True)
+    >>> b = np.array([1, np.NA, 7], dtype='NA')
+    >>> a + b
+    array([NA, NA, 12], maskna=True)
+
+The multi-NA approach allows one to distinguish between these NAs,
+through assigning different payloads to the different types. If we
+extend the 'skipna=' parameter to accept a list of payloads in addition
+to True/False, one could do this::
+
+    >>> a = np.array([np.NA(1), 2, 5], maskna='multi')
+    >>> b = np.array([1, np.NA(0), 7], dtype='NA[f4,multi]')
+    >>> a + b
+    array([NA(1), NA(0), 12], maskna='multi')
+    >>> np.sum(a, skipna=0)
+    NA(1, dtype='<i4')
+    >>> np.sum(a, skipna=1)
+    7
+    >>> np.sum(b, skipna=0)
+    8
+    >>> np.sum(b, skipna=1)
+    NA(0, dtype='<f4')
+    >>> np.sum(a+b, skipna=(0,1))
+    12
+
+Differences with numpy.ma
+=========================
+
+The computational model that numpy.ma uses does not strictly adhere to
+either the NA or the IGNORE model. This section exhibits some examples
+of how these differences affect simple computations. This information
+will be very important for helping users navigate between the systems,
+so a summary probably should be put in a table in the documentation.::
+
+    >>> a = np.random.random((3, 2))
+    >>> mask = [[False, True], [True, True], [False, False]]
+    >>> b1 = np.ma.masked_array(a, mask=mask)
+    >>> b2 = a.view(maskna=True)
+    >>> b2[mask] = np.NA
+
+    >>> b1
+    masked_array(data =
+    [[0.110804969841 --]
+    [-- --]
+    [0.955128477746 0.440430735546]],
+    mask =
+    [[False True]
+    [ True True]
+    [False False]],
+    fill_value = 1e+20)
+    >>> b2
+    array([[0.110804969841, NA],
+    [NA, NA],
+    [0.955128477746, 0.440430735546]],
+    maskna=True)
+
+    >>> b1.mean(axis=0)
+    masked_array(data = [0.532966723794 0.440430735546],
+    mask = [False False],
+    fill_value = 1e+20)
+
+    >>> b2.mean(axis=0)
+    array([NA, NA], dtype='<f8', maskna=True)
+    >>> b2.mean(axis=0, skipna=True)
+    array([0.532966723794 0.440430735546], maskna=True)
+
+For functions like np.mean, when 'skipna=True', the behavior
+for all NAs is consistent with an empty array::
+
+    >>> b1.mean(axis=1)
+    masked_array(data = [0.110804969841 -- 0.697779606646],
+    mask = [False True False],
+    fill_value = 1e+20)
+
+    >>> b2.mean(axis=1)
+    array([NA, NA, 0.697779606646], maskna=True)
+    >>> b2.mean(axis=1, skipna=True)
+    RuntimeWarning: invalid value encountered in double_scalars
+    array([0.110804969841, nan, 0.697779606646], maskna=True)
+
+    >>> np.mean([])
+    RuntimeWarning: invalid value encountered in double_scalars
+    nan
+
+In particular, note that numpy.ma generally skips masked values,
+except returns masked when all the values are masked, while
+the 'skipna=' parameter returns zero when all the values are NA,
+to be consistent with the result of np.sum([])::
+
+    >>> b1[1]
+    masked_array(data = [-- --],
+    mask = [ True True],
+    fill_value = 1e+20)
+    >>> b2[1]
+    array([NA, NA], dtype='<f8', maskna=True)
+    >>> b1[1].sum()
+    masked
+    >>> b2[1].sum()
+    NA(dtype='<f8')
+    >>> b2[1].sum(skipna=True)
+    0.0
+
+    >>> np.sum([])
+    0.0
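Reviewer sketch (not part of the commit): the numpy.ma half of the comparison above can be reproduced with the existing `numpy.ma` module. The sketch uses fixed values instead of `np.random.random` so the results are deterministic; the `maskna=` / `skipna=` half is proposed API and is deliberately left out.

```python
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mask = [[False, True], [True, True], [False, False]]
b1 = np.ma.masked_array(a, mask=mask)

# numpy.ma skips masked values when reducing along an axis.
col_means = b1.mean(axis=0)
# A row whose values are all masked yields a masked result.
row_means = b1.mean(axis=1)
# Summing an all-masked row returns the np.ma.masked singleton.
total = b1[1].sum()

print(col_means.tolist())     # [3.0, 6.0]
print(row_means.tolist())     # [1.0, None, 5.5]
print(total is np.ma.masked)  # True
print(np.sum([]))             # 0.0
```

The last two lines show the contrast the text draws: numpy.ma reports "masked" for the all-masked row, while the sum of an empty list, which the proposed 'skipna=True' result is meant to match, is 0.0.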

 PEP 3118
 ========
@@ -696,28 +821,28 @@ This gives us the following additions to the PyArrayObject::
     /*
      * Descriptor for the mask dtype.
      * If no mask: NULL
-     * If mask : bool/structured dtype of bools
+     * If mask : bool/uint8/structured dtype of mask dtypes
      */
-    PyArray_Descr *maskdescr;
+    PyArray_Descr *maskna_descr;
     /*
      * Raw data buffer for mask. If the array has the flag
-     * NPY_ARRAY_OWNNAMASK enabled, it owns this memory and
+     * NPY_ARRAY_OWNMASKNA enabled, it owns this memory and
      * must call PyArray_free on it when destroyed.
      */
-    npy_uint8 *maskdata;
+    npy_mask *maskna_data;
     /*
      * Just like dimensions and strides point into the same memory
      * buffer, we now just make the buffer 3x the nd instead of 2x
      * and use the same buffer.
      */
-    npy_intp *maskstrides;
+    npy_intp *maskna_strides;

 There are 2 (or 3) flags which must be added to the array flags::

-    NPY_ARRAY_HASNAMASK
-    NPY_ARRAY_OWNNAMASK
+    NPY_ARRAY_HASMASKNA
+    NPY_ARRAY_OWNMASKNA
     /* To possibly add in a later revision */
-    NPY_ARRAY_HARDNAMASK
+    NPY_ARRAY_HARDMASKNA

 To allow the easy detection of NA support, and whether an array
 has any missing values, we add the following functions:
@@ -807,7 +932,7 @@ NPY_ITER_ARRAYMASK
     can be only one such mask, and there cannot also be a virtual
     mask.

-    As a special case, if the flag NPY_ITER_USE_NAMASK is specified
+    As a special case, if the flag NPY_ITER_USE_MASKNA is specified
     at the same time, the mask for the operand is used instead
     of the operand itself. If the operand has no mask but is
     based on an NA dtype, that mask exposed by the iterator converts
@@ -827,14 +952,14 @@ Iterator NA-array Features

 We add several new per-operand flags:

-NPY_ITER_USE_NAMASK
+NPY_ITER_USE_MASKNA
     If the operand has an NA dtype, an NA mask, or both, this adds a new
     virtual operand to the end of the operand list which iterates
     over the mask of the particular operand.

-NPY_ITER_IGNORE_NAMASK
+NPY_ITER_IGNORE_MASKNA
     If an operand has an NA mask, by default the iterator will raise
-    an exception unless NPY_ITER_USE_NAMASK is specified. This flag
+    an exception unless NPY_ITER_USE_MASKNA is specified. This flag
     disables that check, and is intended for cases where one has first
     checked that all the elements in the array are not NA using the
     PyArray_ContainsNA function.