BUG: `np.loadtxt` return F_CONTIGUOUS ndarray if row size is too big · Issue #26900 · numpy/numpy · GitHub

Closed
determ1ne opened this issue Jul 10, 2024 · 0 comments · Fixed by #26901

@determ1ne
Contributor

Describe the issue:

If the row size in a text file is too large, np.loadtxt will return an np.ndarray with the F_CONTIGUOUS flag set.

Reproduce the code example:

import numpy as np

a = np.arange(25).reshape((-1, 5))
b = np.arange((1 << 14) + 2).reshape((-1, (1 << 13) + 1))

np.savetxt('a.txt', a)
np.savetxt('b.txt', b)

a = np.loadtxt('a.txt')
b = np.loadtxt('b.txt')
print(a.flags)
print(b.flags)

assert(b.tobytes('F') == b.copy().tobytes('F')) # error

Error message:

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[25], line 14
     11 print(a.flags)
     12 print(b.flags)
---> 14 assert(b.tobytes('F') == b.copy().tobytes('F'))

AssertionError:
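Until a fix lands, one possible workaround is to copy the loaded array: `.copy()` allocates a fresh buffer and recomputes the contiguity flags. A minimal sketch, using an in-memory buffer instead of files:

```python
import io
import numpy as np

# Round-trip a wide (2 x 8193) array through savetxt/loadtxt in memory.
buf = io.StringIO()
np.savetxt(buf, np.arange((1 << 14) + 2).reshape((-1, (1 << 13) + 1)))
buf.seek(0)

# .copy() allocates a new array and recomputes its flags, so the result
# is C-contiguous only, as expected for a 2-D array with multiple rows.
b = np.loadtxt(buf).copy()
assert b.flags['C_CONTIGUOUS'] and not b.flags['F_CONTIGUOUS']
```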

Python and NumPy Versions:

2.1.0.dev0+git20240708.735a477
3.12.4 (main, Jul 9 2024, 15:36:49) [GCC 11.4.0]

Runtime Environment:

[{'numpy_version': '2.1.0.dev0+git20240708.735a477',
'python': '3.12.4 (main, Jul 9 2024, 15:36:49) [GCC 11.4.0]',
'uname': uname_result(system='Linux', node='hn00', release='5.15.0-105-generic', version='#115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024', machine='x86_64')},
{'simd_extensions': {'baseline': [], 'found': [], 'not_found': []}},
{'filepath': '/opt/intel/oneapi/mkl/2022.1.0/lib/intel64/libmkl_rt.so.2',
'internal_api': 'mkl',
'num_threads': 48,
'prefix': 'libmkl_rt',
'threading_layer': 'intel',
'user_api': 'blas',
'version': '2022.1-Product'},
{'filepath': '/opt/intel/oneapi/compiler/2022.1.0/linux/compiler/lib/intel64_lin/libiomp5.so',
'internal_api': 'openmp',
'num_threads': 96,
'prefix': 'libiomp',
'user_api': 'openmp',
'version': None}]

Context for the issue:

np.loadtxt creates the PyArrayObject in read_rows, which reads a specific number of rows of the text file based on the allocated buffer size. If the row size is too large, read_rows may decide to allocate space for only one row (var min_rows below), in which case PyArray_SimpleNewFromDescr sets both F_CONTIGUOUS and C_CONTIGUOUS. This is inconsistent with the flags loadtxt produces for small files.
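The flag discrepancy can be seen directly from Python: a single-row array is trivially contiguous in both orders, while a C-order allocation with two or more rows is C-contiguous only:

```python
import numpy as np

# A (1, N) array is contiguous in both C and Fortran order, so a fresh
# allocation of that shape gets both flags set.
one_row = np.empty((1, 8193))
assert one_row.flags['C_CONTIGUOUS'] and one_row.flags['F_CONTIGUOUS']

# With two or more rows, a C-order allocation is no longer F-contiguous.
two_rows = np.empty((2, 8193))
assert two_rows.flags['C_CONTIGUOUS'] and not two_rows.flags['F_CONTIGUOUS']
```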

code for context:

    if (data_array == NULL) {
        if (max_rows < 0) {
            /*
             * Negative max_rows denotes to read the whole file, we
             * approach this by allocating ever larger blocks.
             * Adds a number of rows based on `MIN_BLOCK_SIZE`.
             * Note: later code grows assuming this is a power of two.
             */
            if (row_size == 0) {
                /* actual rows_per_block should not matter here */
                rows_per_block = 512;
            }
            else {
                /* safe on overflow since min_rows will be 0 or 1 */
                size_t min_rows = (
                        (MIN_BLOCK_SIZE + row_size - 1) / row_size);
                while (rows_per_block < min_rows) {
                    rows_per_block *= 2;
                }
            }
            data_allocated_rows = rows_per_block;
        }
        else {
            data_allocated_rows = max_rows;
        }
        result_shape[0] = data_allocated_rows;
        Py_INCREF(out_descr);
        /*
         * We do not use Empty, as it would fill with None
         * and requiring decref'ing if we shrink again.
         */
        data_array = (PyArrayObject *)PyArray_SimpleNewFromDescr(
                ndim, result_shape, out_descr);
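The block-size arithmetic in the branch above can be sketched in Python. Note that MIN_BLOCK_SIZE and the initial rows_per_block here are assumed placeholder values, not the actual constants in the NumPy source:

```python
# Placeholder constant for illustration only.
MIN_BLOCK_SIZE = 1 << 13

def first_block_rows(row_size, rows_per_block=1):
    """Rows allocated for the first block, mirroring the C snippet."""
    if row_size == 0:
        return 512  # actual rows_per_block should not matter here
    # Ceiling division, as in the C code.
    min_rows = (MIN_BLOCK_SIZE + row_size - 1) // row_size
    # Grow by powers of two until at least min_rows fit.
    while rows_per_block < min_rows:
        rows_per_block *= 2
    return rows_per_block
```

Once row_size reaches MIN_BLOCK_SIZE, min_rows is 1 and the very first allocation has shape (1, N), which is where the spurious F_CONTIGUOUS flag comes from.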
