My attempts at writing some simple raster geoprocessing algorithms in PostGIS have met with some minor frustration. I shouldn’t be frustrated… raster types are practically alpha, but there is still a lot to be desired.

What are some common operations that we want to perform? And what would we want to do this in the database anyways? Shouldn’t we keep the logic out of the database?!? In this case, the advice of Andrew Dunstan (PostgreSQL major contributor) is in order:

Let the database be good at what it's good at, including smart processing of bulk data...Avoid shipping data to an external client program just to process and ship stuff back to the database. Only fetch data from the database if you need it to display or send to another system.

This is particularly true with raster data because often we’re only interested in a subset of the data (e.g. data from the 1971-2000 normals period, or only data from a certain watershed), which is much more inexpensive to fetch. Other times we’re only interested in broad summaries of the data. For example our RCI program often wants plots of timeseries across a whole watershed. I.e. they want a spatial mean at each point in time. If the raster data is even 100 pixels square (they are usually bigger), calculating the mean in the database before returning the query results will reduces the I/O load by a factor of 10,000 which is hugely significant!

What about the consideration that putting logic in the database can lead to maintenance issues? Calculating a mean (or quantiles or other simple arithmetic operations) barely constitutes “logic”, so I’m going to say that the maintenance hazards are pretty nominal and we can live with them.

So we have convinced ourselves that this is worth doing. Let’s do it. What hooks to we have?

Well first of all, it turns out that when I used ST_SetBandNoDataValue to set the no data value (because gdal2gdal didn’t do it), it only sets one band’s nodata value. While this should be surprising from the function’s name, it surprises me because it’s not very useful. Why would one band’s nodata value be different from another? And why wouldn’t you set all by default instead of the first band. Very silly.

I took the time to write a procedure to do it, but wasn’t entirely happy by the fact that it takes ten lines of boiler plate code to do something that should be do-able in one.

CREATE OR REPLACE FUNCTION SetAllNoDataValue(nodatavalue int) RETURNS Void AS
$$
DECLARE
        band int;
BEGIN
        FOR band IN (SELECT generate_series(1, ST_NumBands(rast)) FROM bcsd) LOOP
          UPDATE bcsd SET rast = ST_SetBandNoDataValue(rast, band, nodatavalue);
        END LOOP;
END;
$$
LANGUAGE plpgsql;

Aside from being overly verbose, it’s also not general at all. It will update every raster in the bcsd table. I don’t actually know how (or if you can) write general procedures in PostgreSQL, which is going to stop by later in this post.

OK, now that we have our nodata value set for all bands, let’s take a mean. How to we do that? How to we access pixel values? ST_Value seems to be the obvious and only choice. But it seems a little cumbersome, the API shows only single pixel access by specifying an x, y pair, and then you have to cross join that with a pair of generate_series() calls. None of this is conceptually hard, but it increases size of the code, and it turns out to also be slow as hell. If I run the query:

SELECT st_value(rast, 1, x, y) FROM
Generate_series(1, (SELECT ST_Width(rast) FROM bcsd WHERE rid=2)) AS x CROSS JOIN
Generate_series(1, (SELECT ST_Height(rast) FROM bcsd WHERE rid=2)) AS y, bcsd
WHERE rid=2;

It took 42 minutes for this query to finish which is not a reasonable amount of time. Why? I don’t know. What I do know is that generating the pixel space is about 82k rows, and then you go to CROSS JOIN that with the raster table? Something smells fishy to me about that, but I don’t know another way. All of the examples on the ST_Value help page do it that way, but I have a feeling that they’re all using 4-band satellite images instead of kilo-band timeseries model output. But yeah, it seems like that CROSS JOIN would duplicate the raster table 82k times, which is clearly the wrong thing to do.

The last example on the help page has a comment:

--- Checking all the pixels of a large raster tile can take a long time. --- You can dramatically improve speed at some lose of precision by orders of magnitude -- by sampling pixels using the step optional parameter of generate_series. -- This next example does the same as previous but by checking 1 for every 4 (2x2) pixels and putting in the last checked -- putting in the checked pixel as the value for subsequent 4

This is obviously not an option for us where the values in question are more than simply image colors.

[Several hours later after signing up to the postgis-users list serve]

So I read the the first message that came across the postgis-users list serve since my subscription and it contained some code from someone using ST_PixelAsPoints. “That doesn’t sound familiar,” I think to myself, “but really, really useful. How have I missed this?” Answer: It’s documented in the 2.1 release documentation which is still in development. Bah! We need it now! But better to know that it will be available at some point and that (apparently) other people have the same needs that we do.



blog comments powered by Disqus

Published

13 September 2012

Category

work

Tags