8000 BUG: Xerbla doesn't get linked in 1.10-devel. by charris · Pull Request #5383 · numpy/numpy · GitHub
[go: up one dir, main page]

Skip to content

BUG: Xerbla doesn't get linked in 1.10-devel. #5383

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 24, 2014
Merged

Conversation

charris
Copy link
Member
@charris charris commented Dec 20, 2014

Add our python_xerbla to the multiarray sources. That function is
needed for all modules that link to the ATLAS 3.10 libraries, which
are now all located in two files, libsatlas and libtatlas.

Also make the test for xerbla linkage work better. If xerbla is not
linked the test will be skipped with a message.

Closes #5362.

@charris
Copy link
Member Author
charris commented 10000 Dec 20, 2014

This fixes #5362:


In [1]: np.linalg.lstsq([[np.nan]]*2, [1]*2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1-b5c3177f1e3f> in <module>()
----> 1 np.linalg.lstsq([[np.nan]]*2, [1]*2)

/home/charris/.local/lib/python2.7/site-packages/numpy/linalg/linalg.pyc in lstsq(a, b, rcond)
   1865         work = zeros((lwork,), t)
   1866         results = lapack_routine(m, n, n_rhs, a, m, bstar, ldb, s, rcond,
-> 1867                                  0, work, lwork, iwork, 0)
   1868     if results['info'] > 0:
   1869         raise LinAlgError('SVD did not converge in Linear Least Squares')

ValueError: On entry to DLASCL parameter number 4 had an illegal value

A different fix is needed for earlier versions of numpy. Probably adding python_xerbla.c to the _dotblas sources will work.

python_xerba.o should probably be located somewhere where all modules can find it, Maybe a library with just the one file. Thoughts?.

@charris
Copy link
Member Author
charris commented Dec 20, 2014

@opoplawski Heads up. Note that current Fedora 21 numpy is also broken in this regard.

@@ -431,6 +431,7 @@ def pre_build(context):
bld(target="multiarray_templates", source=multiarray_templates)
if ENABLE_SEPARATE_COMPILATION:
sources = [
pjoin('src', 'multiarray', 'python_xerbla.c),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo, needs a ' at the end of python_xerbla.c

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks.

@@ -1152,6 +1152,8 @@ def test_xerbla_override():
# and may, or may not, abort the process depending on the LAPACK package.
from nose import SkipTest

XERBLA_OK = 255
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a valid return code?
aren't those limited to 127 on some systems?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, it's valid on Linux, and windows returns 16 bit values. It can be anything that is not common.

@juliantaylor
Copy link
Contributor

strange this isn't seen in debian which also has atlas 3.10 and automatic tests of numpy against netlib, openblas and atlas (http://ci.debian.net/packages/p/python-numpy/)
Probably related to distribution specific blas/lapack splitting?

@charris
Copy link
Member Author
charris commented Dec 24, 2014

@juliantaylor Are you sure ATLAS 3.10 is getting linked? It isn't found by default and I didn't see the problem until I set up site.cfg to work correctly for that version. #5364 allows automatic detection.

If someone would explain to me how a dynamic library can link to various different external functions of the same name in a safe way you will make me very happy :)

@juliantaylor
Copy link
Contributor

yes but only the blas part, debian uses runtime interchangeable blas so numpy is linked against libblas.so.3 and which is just a symlink to the real library which can be any compatible one.

@charris
Copy link
Member Author
charris commented Dec 24, 2014

I've been using a site config with

[atlas]
library_dirs = /usr/lib64/atlas
include_dirs = /usr/include
atlas_libs = tatlas
lapack_libs = tatlas

ATLAS also provides the needed lapack routines. Where does lapack come from on debian?

@juliantaylor
Copy link
Contributor

normal netlib lapack, but atlas and its xerbla is loaded first

@charris
Copy link
Member Author
charris commented Dec 24, 2014

Note that the xerbla test never fails, it just gets skipped. The verbose output needs to be checked for xerbla to see what happened. Either that or use np.show_config().

@juliantaylor
Copy link
Contributor

also your example works fine in debian/ubuntu, even when manually linking against atlas

@juliantaylor
Copy link
Contributor

the output of

LD_DEBUG=all  python -c "import numpy as np; np.linalg.lstsq([[np.nan]]*2, [1]*2)"

on fedora might be interesting
on debian e.g. ATL_xerbla is loaded from libatlas as expected, but xerbla_ from numpys lapack_lite

@charris
Copy link
Member Author
charris commented Dec 24, 2014

Fedora Numpy (1.8.2) shows

     10417: symbol=cblas_xerbla;  lookup in file=python [0]
     10417: symbol=cblas_xerbla;  lookup in file=/lib64/libpython2.7.so.1.0 [0]
     10417: symbol=cblas_xerbla;  lookup in file=/lib64/libpthread.so.0 [0]
     10417: symbol=cblas_xerbla;  lookup in file=/lib64/libdl.so.2 [0]
     10417: symbol=cblas_xerbla;  lookup in file=/lib64/libutil.so.1 [0]
     10417: symbol=cblas_xerbla;  lookup in file=/lib64/libm.so.6 [0]
     10417: symbol=cblas_xerbla;  lookup in file=/lib64/libc.so.6 [0]
     10417: symbol=cblas_xerbla;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     10417: symbol=cblas_xerbla;  lookup in file=/usr/lib64/python2.7/site-packages/numpy/core/_dotblas.so [0]
     10417: symbol=cblas_xerbla;  lookup in file=/usr/lib64/atlas/libsatlas.so.3 [0]
     10417: binding file /usr/lib64/atlas/libsatlas.so.3 [0] to /usr/lib64/atlas/libsatlas.so.3 [0]: normal symbol `cblas_xerbla'

Which is a problem, the lapack_lite functions are linked against our xerbla. Numpy devel without this patch shows

     12228: symbol=xerbla_;  lookup in file=python [0]
     12228: symbol=xerbla_;  lookup in file=/lib64/libpython2.7.so.1.0 [0]
     12228: symbol=xerbla_;  lookup in file=/lib64/libpthread.so.0 [0]
     12228: symbol=xerbla_;  lookup in file=/lib64/libdl.so.2 [0]
     12228: symbol=xerbla_;  lookup in file=/lib64/libutil.so.1 [0]
     12228: symbol=xerbla_;  lookup in file=/lib64/libm.so.6 [0]
     12228: symbol=xerbla_;  lookup in file=/lib64/libc.so.6 [0]
     12228: symbol=xerbla_;  lookup in file=/lib64/ld-linux-x86-64.so.2 [0]
     12228: symbol=xerbla_;  lookup in file=/home/charris/.local/lib/python2.7/site-packages/numpy/core/multiarray.so [0]
     12228: symbol=xerbla_;  lookup in file=/usr/lib64/atlas/libtatlas.so.3 [0]
     12228: binding file /usr/lib64/atlas/libtatlas.so.3 [0] to /usr/lib64/atlas/libtatlas.so.3 [0]: normal symbol `xerbla_'

@juliantaylor
Copy link
Contributor

why is it looking up the xerbla_ symbol when loading multiarray?
is it built with -Wl,bind=now?
our xerbla overloading relies on lazy binding so its only looked up when loading the lapack_lite module

@charris
Copy link
Member Author

I expect because _dotblas has been moved into multiarray.

@juliantaylor
Copy link
Contributor

but dotblas is not calling xerbla_ so it should not be binding it.
I think it is some fedora security build option either in numpy or atlas that is causing the problem, so the same thing that broke it in ubuntu but was fixed long ago.

@charris
Copy link
Member Author
charris commented Dec 24, 2014

Could it also be related to the compiler version? Using gcc version 4.9.2 here. Maybe I should fire up fedora 20 and see if this is a long standing problem.

@juliantaylor
Copy link
Contributor

no the compiler should not directly be related to it, but the flags you are using to compile are, though only the linker flags prefixed with -Wl,
I guess I can spin up a fedora in docker myself and check, just install the system atlas then build numpy?

@charris
Copy link
Member Author
charris commented Dec 24, 2014

I only see -Wl by itself. Are there some flags that may have default values?

Yes, that should do it. Numpy won't link to system atlas by default, so site.cfg needs to be set up.

@juliantaylor
Copy link
Contributor

why won't it pick up system atlas? didn't you add support for 3.10 recently?

@charris
Copy link
Member Author
charris commented Dec 24, 2014

It hasn't been merged yet... Was thinking of doing that today.

@juliantaylor
Copy link
Contributor

ok it seems the debian atlas is split into a fortran library that provides xerbla (libf77blas) and the regular library
multiarray is not linked against the fortran library so it isn't loading the atlas xerbla, possibly fedora can do something similar.

@juliantaylor
Copy link
Contributor

or this change, should work just as well.

@@ -0,0 +1,51 @@
#include "Python.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't you use the same file as from lapack?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant linalg, this file is now twice in numpy, once in multiarray and once in numpy/linalg

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean the same python_xerbla source, it is inconveniently located in numpy/linalg/lapack_lite/python_xerbla.c`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that a problem for reusing it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't use f2c.h

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's because I removed the include for this version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird I guess numpy.distutils automatically adds the include path then

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your version is cleaner, so if it works it is preferable. We can give it a try and see if there are any problems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think onefile compilation will not work

@charris
Copy link
Member Author
charris commented Dec 24, 2014

If the debian atlas 3.10 is split, that may cause problems for the automatic detection of 3.10, it doesn't look for split files as 3.10 doesn't normally do that. Does debian make it look like earlier versions of atlas?

@juliantaylor
Copy link
Contributor

debian currently doesn't seem to offer a threaded atlas, only libatlas, that is covered by the old atlas detection logic
the splitting are just wrapper libraries ontop of libatlas providing cblas or f77 interfaces,
debian installs:
libatlas.so (contains ATL_xerbla)
libcblas.so (contains cblas_xerbla)
libf77blas.so (contains xerbla_)
liblapack_atlas.so

multiarray is linked against libcblas, so no xerbla_ is pulled in, lapack_lite is linked against libf77blas so it picks up the xerbla_ but overrides it correctly.

@charris
Cop 10000 y link
Member Author
charris commented Dec 24, 2014

Oh, that's not a bad solution. Fedora seems a bit quicker to make incompatible changes. One improvement here might be to make the xerbla inclusion depend on HAVE_CBLAS.

@charris
Copy link
Member Author
charris commented Dec 24, 2014

Do you know if there is a problem with threaded atlas? Fedora Numpy is linked with the single threaded version.

@juliantaylor
Copy link
Contributor

not that I know of, though in the past atlas had the problem that you had to define the threads at compile time and its not a runtime option like in openblas, so distributions had to pick the safe single threaded option.

@charris
Copy link
Member Author
charris commented Dec 24, 2014

This is still broken for single file builds, I forgot to include it :( Will fix.

Add our python_xerbla to the multiarray sources. That function is
needed for all modules that link to the ATLAS 3.10 libraries, which
are now all located in two files, libsatlas and libtatlas.

Also make the test for xerbla linkage work better. If xerbla is not
linked the test will be skipped with a message.

Closes numpy#5362.
@charris
Copy link
Member Author
charris commented Dec 24, 2014

Fixed single file builds, made compilation of python_xerbla.c depend on HAVE_CBLAS macro.

@juliantaylor
Copy link
Contributor

looks good, should make this fragile mechanism a bit more robust.

@charris
Copy link
Member Author
charris commented Dec 24, 2014

I'll push a differenct fix for 1.9.x.

@charris
Copy link
Member Author
charris commented Dec 24, 2014

OK, putting this in. Thanks for the review, Julian, I learned things :)

charris added a commit that referenced this pull request Dec 24, 2014
BUG: Xerbla doesn't get linked in 1.10-devel.
@charris charris merged commit bcd7169 into numpy:master Dec 24, 2014
@charris charris deleted the xerbla branch December 24, 2014 18:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

numpy exits if lhs in lstsq(lhs, rhs) contains nan
3 participants
0