In the process of creating a new deployment environment for the 2.0 PCIC Data Portal, I came across some strange behaviour running the (new!) test suite.

For background, the netcdf4/hdf5 file handler for Pydap that I wrote uses the h5py Python package to open and access all of our raster data files. It typically works pretty darn well. The web application pulls data out of those files, creates new data files (subsets, typically), and writes them out in whatever format the user has requested. Right now we just support netcdf3 and ASCII/CSV: formats that can be multidimensional and can also be streamed, because the responses have the potential to be huge.

To test the netcdf responses, I didn’t necessarily want to use h5py. I want the files to be openable by any library or piece of software, so it made sense to check their validity with something else. I chose python-netcdf4. Unfortunately, these two libraries seem to have trouble playing nicely together, and can’t both have access to the system HDF5 libs at the same time. I’m not exactly sure why, but the issue has cropped up for at least one other person and is documented here and here.
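
As a cheap first check that needs neither library, a test can look at the first few bytes of a response. This is only a sketch, not what the portal's test suite does: the `sniff_format` helper and its return labels are made up for illustration. It relies on classic netCDF files starting with `CDF` followed by a version byte, and on HDF5 files (which netCDF-4 files are underneath) starting with the 8-byte HDF5 signature.

```python
import os
import tempfile

# Leading bytes of the on-disk formats involved here.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"
# Version byte that follows b"CDF" in a classic netCDF header.
CLASSIC_VERSIONS = {1: "netcdf3-classic", 2: "netcdf3-64bit-offset"}


def sniff_format(path):
    """Guess the container format of a file from its first eight bytes.

    Hypothetical helper: the returned labels are arbitrary strings.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if head == HDF5_SIGNATURE:
        return "hdf5"  # netCDF-4 files are HDF5 underneath
    if len(head) >= 4 and head[:3] == b"CDF":
        return CLASSIC_VERSIONS.get(ord(head[3:4]), "unknown")
    return "unknown"


if __name__ == "__main__":
    # Write a fake eight-byte classic header to a temp file and sniff it.
    fd, tmp_path = tempfile.mkstemp(suffix=".nc")
    with os.fdopen(fd, "wb") as f:
        f.write(b"CDF\x01" + b"\x00" * 4)
    print(sniff_format(tmp_path))
    os.remove(tmp_path)
```

This only says what the file claims to be. It says nothing about whether the rest of the header or the data is valid, so it complements opening the file with a real reader rather than replacing it.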

james@basalt ~ $ equery list netcdf hdf5
 * Searching for netcdf ...
[IP-] [  ] sci-libs/netcdf-4.1.1-r4:0

 * Searching for hdf5 ...
[IP-] [  ] sci-libs/hdf5-1.8.10:0

(Pdb) netCDF4.__version__
'1.0.7'
(Pdb) h5py.__version__
'2.2.1'

Apparently the ‘identifier-recycling issue’ started at hdf5 version 1.8.5, but there are various combinations of the above software that won’t cause this problem. At this point the best option seems to be to rewrite the tests to use scipy.io.netcdf or something similar, rather than play version Russian roulette.
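
Since the outcome depends on the exact combination of versions, it helps to record that combination next to any test failure. The following is a rough sketch with hypothetical helper names (`library_versions`, `hdf5_versions`). It assumes that `h5py.version.hdf5_version` and `netCDF4.__hdf5libversion__` are available in the installed releases, and falls back to `None` when they are not.

```python
import importlib


def library_versions(names=("h5py", "netCDF4")):
    """Return {module name: version string, or None if it cannot be imported}."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


def hdf5_versions():
    """HDF5 C library version that each Python binding reports, if present."""
    found = {}
    try:
        import h5py
        # Assumed attribute; None if this h5py release lacks it.
        found["h5py"] = getattr(h5py.version, "hdf5_version", None)
    except ImportError:
        pass
    try:
        import netCDF4
        # Assumed attribute; None if this netCDF4 release lacks it.
        found["netCDF4"] = getattr(netCDF4, "__hdf5libversion__", None)
    except ImportError:
        pass
    return found


if __name__ == "__main__":
    print(library_versions())
    print(hdf5_versions())
```

If the two bindings report different HDF5 versions, that mismatch is a reasonable first suspect, though this sketch does nothing to confirm it.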

Update

Turns out that nearly nothing else can actually read netcdf3 files in Python anymore. scipy.io.netcdf hasn’t seen any love for a while, and it totally failed to open the output files, raising obscure errors:

>>> import scipy.io.netcdf
>>> f = netcdf.netcdf_file('/tmp/tmpy9rTN6.nc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'netcdf' is not defined
>>> f = scipy.ionetcdf.netcdf_file('/tmp/tmpy9rTN6.nc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'ionetcdf'
>>> f = scipy.io.netcdf.netcdf_file('/tmp/tmpy9rTN6.nc')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/scipy/io/netcdf.py", line 211, in __init__
    self._read()
  File "/usr/lib64/python2.7/site-packages/scipy/io/netcdf.py", line 494, in _read
    self._read_var_array()
  File "/usr/lib64/python2.7/site-packages/scipy/io/netcdf.py", line 598, in _read_var_array
    mm = mmap(self.fp.fileno(), begin+self._recs*self._recsize, access=ACCESS_READ)
OverflowError: memory mapped size must be positive
>>> f = scipy.io.netcdf.netcdf_file('/tmp/tmpy9rTN6.nc', 'r', False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/scipy/io/netcdf.py", line 211, in __init__
    self._read()
  File "/usr/lib64/python2.7/site-packages/scipy/io/netcdf.py", line 494, in _read
    self._read_var_array()
  File "/usr/lib64/python2.7/site-packages/scipy/io/netcdf.py", line 604, in _read_var_array
    rec_array = fromstring(self.fp.read(self._recs*self._recsize), dtype=dtypes)
ValueError: string size must be a multiple of element size
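
Both tracebacks fail on a size computed from the record count (`self._recs*self._recsize`), so one thing worth inspecting is the `numrecs` field in the file header. This is an unverified guess, not a diagnosis. The classic netCDF format stores `numrecs` as a 4-byte big-endian integer right after the 4-byte magic, and reserves `0xFFFFFFFF` to mean "streaming" (record count not known), which reads as -1 if treated as a signed integer. The `read_numrecs` helper below is hypothetical and only handles format versions 1 and 2.

```python
import struct

# numrecs value the classic format reserves for "indeterminate" (streaming).
STREAMING = 0xFFFFFFFF


def read_numrecs(path):
    """Return (format version byte, numrecs) from a classic netCDF header.

    numrecs is None when the header holds the STREAMING marker.
    Assumes format version 1 or 2, where numrecs is 4 bytes wide.
    """
    with open(path, "rb") as f:
        head = f.read(8)
    if len(head) < 8 or head[:3] != b"CDF":
        raise ValueError("not a classic netCDF file")
    # One unsigned version byte, then an unsigned 32-bit big-endian count.
    version, numrecs = struct.unpack(">BI", head[3:8])
    if version not in (1, 2):
        raise ValueError("unsupported classic format version: %d" % version)
    return version, (None if numrecs == STREAMING else numrecs)
```

Calling `read_numrecs` on one of the generated files would show whether the header carries a real record count or the streaming marker.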


So I decided to play version roulette after all, and it turned out not to be too bad. After upgrading my system netcdf and hdf5 libs to the highest available versions, everything was smooth again.

basalt james # equery list netcdf hdf5
 * Searching for netcdf ...
[IP-] [  ] sci-libs/netcdf-4.3.0:0

 * Searching for hdf5 ...
[IP-] [  ] sci-libs/hdf5-1.8.12:0/1.8.12

(Pdb) netCDF4.__version__
'1.0.7'
(Pdb) h5py.__version__
'2.2.1'

-----



Published

23 January 2014

Category

work
