Performance improvements for many banded PostGIS rasters
Yesterday was a pretty big day for progress on our work with using PostGIS rasters for model output storage. I finally put together some benchmarks which very clearly showed that the run time of doing a ST_Clip and ST_Intersects operation between a raster and vector layer would increase super-linearly (perhaps even exponentially!) with an increasing number of bands (timesteps).
SELECT rid, st_height(st_clip(rast, 1, the_geom)), ST_width(ST_clip(rast, 1, the_geom)) FROM basins INNER JOIN bcsd ON ST_Intersects(the_geom, rast, 1) WHERE rid = %s
This was pretty unexpected, since the raster properties like cell size, projection, are the same for each band, so it wouldn’t make any sense to run the operation once for each band. It really should be a linear operation, so I called for reinforcements (PostGIS-users list). I received a pretty quick reply from Bborie Park at UC Davis who has apparently be working really hard on the raster types and has implemented a lot of the new features. Very helpful guy.
Apparently the cause of the super-linear performance is the fact that any data type that is stored in PostgreSQL is TOASTed, so it get’s deserialized for every operation which obviously gets more expensive the larger the data blog. Bborie also pointed out the fact that I had two ST_clips in the query, so the deTOASTing happes twice. Badness. I removed the 2nd clip operation from the query and bam! An increase in performance by a factor of hundreds.
At this point, the fact that the plot tails up at the end lead me to believe that it’s some sort of desktop scaling issue… maybe running out of cache when serializing the data? I’m not sure. But again Bborie came to the rescue and suggested out of db storage. In the past, many of the nice features that you want with rasters was not available for out of db storage, but Bborie has been working on this and it’s actually getting pretty close to the core functionality. Running the query on out of db storage, the plot finally becomes O(1).
Not sure what’s up with the 8 seconds at the beginning, but everything after that was < 300ms which is fine and dandy. Now it turns out that I have another problem:
raster_test=> SELECT ST_summarystats(ST_clip(rast, 1, the_geom), 1) FROM basins INNER JOIN bcsd ON ST_intersects(the_geom, rast, 1) WHERE rid = 61; st_summarystats -- (3727,0,0,0,0,0) (1 row) Time: 131.227 ms
It kind of looks like out of db storage doesn’t actually support retrieving the values of the raster, which would be problematic. Next step is to install the svn trunk version of PostGIS and see if it’s implemented in the dev version (or deterimine whether there is some other problem).
blog comments powered by Disqus