Satellite mapping for everyone

with Chris Waigl

PhD Candidate at UAF

Satellite mapping for Python programmers

Most of the prior art focuses on vector data rather than raster data.

Raster data is completely ignorant of any object-level meaning: a raster is just a grid of values.

Both are “just data” and we can deal with that.

Working with raster data is not really harder than working with vector data, but fewer people do it. And there’s lots of data freely available, e.g. Landsat:

  • 30m spatial resolution
  • since the 1980s, with earlier versions going back to the early 1970s
  • any point on the globe (except the poles)
  • public domain

Agenda:

  • Why would you want to do this?
  • Where do you get data?
  • How can you make maps efficiently?

How

There are polar-orbiting satellites that scan a swath. Swath width is limited by the height of the satellite and the number of sensors. The revisit time for Landsat is 16-18 days. Sometimes you get more data, but usually not.

Basics:

  • Swath width

Examples:

  • Ice jams and floods on the Yukon River at Galena
  • Urban sprawl in Las Vegas
  • Louisiana’s disappearing coastline (ProPublica)
  • Mount Polley mine disaster

Higher-resolution images are also available:

  • SPOT 5 (2.5 m resolution)

Which tools do we use?

  • Low level
    • GEOS
    • PROJ4
    • Numpy
  • GDAL/OGR
  • High level
    • shapely
    • fiona
    • rasterio
  • Visualization and data formats
    • pygaarst
    • matplotlib
    • hdf5/h5py
    • python-hdf4
from osgeo import gdal

gdal.UseExceptions()   # raise exceptions instead of returning None on errors
ds = gdal.Open(fpath)  # fpath: path to any GDAL-readable raster

ds.RasterCount      # number of bands
ds.RasterXSize      # width in pixels
ds.RasterYSize      # height in pixels
ds.GetProjection()  # the coordinate reference system as a WKT string
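
To actually get at the pixels, you read a band into a NumPy array. A minimal sketch continuing from ds above (the band number is arbitrary):

band = ds.GetRasterBand(1)  # GDAL bands are 1-indexed
arr = band.ReadAsArray()    # the pixel values as a NumPy array
print(arr.shape)            # (RasterYSize, RasterXSize)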

rasterio

import rasterio

with rasterio.open(fname) as f:
    # do stuff
    f.bounds  # bounding box in the dataset's coordinates
    f.crs     # the coordinate reference system
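
Reading pixels is similarly compact; a sketch, again with fname as a placeholder:

import rasterio

with rasterio.open(fname) as f:
    arr = f.read(1)  # band 1 as a NumPy array (bands are 1-indexed)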

Command line tools

  • gdalinfo
  • rio info
  • gdal_translate
  • gdal_merge.py
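
Example invocations (file names hypothetical):

gdalinfo scene_B4.TIF                            # inspect metadata, CRS, band info
rio info scene_B4.TIF                            # same idea via rasterio, JSON output
gdal_translate -of GTiff input.img output.tif    # convert between raster formats
gdal_merge.py -o mosaic.tif tile1.tif tile2.tif  # mosaic several rasters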

Real Landsat data

From http://earthexplorer.usgs.gov/ you get a 980 MB tarball containing a bunch of TIFFs.

Inside: metadata, a QA layer, and 11 bands.

  • Red/green/blue
  • Infrared bands
  • radiometric resolution == bit depth

You may also run into HDF4/HDF5 formats (yup).

How much ice is there in the winter?

  • The National Snow and Ice Data Center has passive microwave data
  • Lots of scientific info, but don’t be afraid. Read the docs.
  • 25x25 km pixel size
  • 8 bit data
  • polar stereographic projection
  • 0-250 maps to 0-100% sea ice cover
  • Also coastline, land percentage layers.

What do I do?

  • Open the files, grab the data, plot the data in matplotlib (see the sketch below)
  • geog2stereo()
  • Reproject it: change the projection to an Albers equal-area
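
A minimal sketch of the scale-and-plot step, assuming the grid has already been read into a NumPy array named ice (names are hypothetical; the real notebook is linked below):

import numpy as np
import matplotlib.pyplot as plt

# ice: 8-bit grid in polar stereographic projection;
# values 0-250 map linearly to 0-100% sea ice concentration,
# larger values are flags (coastline, land mask, missing data)
conc = np.where(ice <= 250, ice * 100.0 / 250.0, np.nan)

plt.imshow(conc, cmap="Blues_r", vmin=0, vmax=100)
plt.colorbar(label="sea ice concentration (%)")
plt.show()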

http://nbviewer.ipython.org/gist/chryss/…

Random ideas

  • Combine this with image processing to detect heavy logging activity
  • Detecting snow-free areas for hiking in the Rockies
  • Mapping suburban sprawl around Invermere

Advanced Git

with David Baumgold

The basics

Branches are labels; note that you can change where a label points. Each commit has pointers backwards to its parents (even if history is usually displayed forwards).

Preface: status

  • git status
  • git show (shows you information about the commit you are currently sitting on)
    • the hash
      • calculated from the parent hash, the diff, and other things
    • the diff
    • etc.

Chapter 1: blame

  • git blame
    • last commit that touched that line
    • gives you the hash
    • once you have the hash, you can run git show

Chapter 2: cherry-pick

  • used to move a commit from one branch to another
  • on master, run git show and get the hash
  • check out the feature branch
  • cherry-pick creates a new commit; it does not delete the old one (see the sketch after this list)
  • How do I remove it from master? (see reset, in the next chapter)
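
A sketch of that workflow (the hash is hypothetical):

git checkout master
git show                 # note the hash of the commit, e.g. abc1234
git checkout feature
git cherry-pick abc1234  # new commit on feature; the one on master remains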

Chapter 3: reset

  • git reset --hard HEAD^ (HEAD is whatever commit I’m currently sitting on)
  • HEAD^ parent
  • HEAD^^ grandparent
  • reset reassigns the branch pointer
  • the old commit J is now just hanging out in the ether
  • J will get cleaned up by git’s garbage collector eventually
  • garbage collection runs pretty infrequently

Chapter 4: rebase

  • Changing history
  • Awesome power, and responsibility
  • Never change history when other people might be using your branch, unless they know you’re doing so.
  • Never change history on master
  • Best practice: only change history for commits that have not yet been pushed

When do you want to use rebase?

  • You could use a merge, but that makes code review confusing, because the merge commits are in there

  • git checkout feature
  • git rebase master
  • You have changed history, and you may get some warnings
    • the branches have diverged
  • git push… “you want me to do what?”
  • git push -f (force): it will let you do it, but it notes that it’s doing so under duress

When you rebase, sometimes you get conflicts. They are resolved just like merge conflicts.

Do what git status tells you, then git rebase --continue.

If you want to start over, you can always use git rebase --abort
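
Putting the rebase workflow together, a typical session might look like this:

git checkout feature
git rebase master        # replay feature's commits on top of master
# CONFLICT: edit the conflicting files, then
git add path/to/file.py
git rebase --continue    # or: git rebase --abort to start over
git push -f              # history changed, so a plain push is refused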

You can also get conflicts when you cherry-pick. git status is your friend.

Chapter 5: reflog

git reflog shows commits in order of when you last referenced them

“Oh no, I screwed up and I want to get back to the way things were before, but I didn’t write down the commit hash”

reflog will show you what you’ve been doing and the commit hashes.

Then reset the branch pointer back to that commit: git reset --hard <hash>

Chapter 6: squashing commits

git add missing-file.txt; git commit --amend

You never have to have a bunch of “crap, I forgot that file” commits.

git rebase --interactive

Change pick to squash, which melds the squashed commit into the previous commit. You can amend history to make it look like you’re a perfect coder who never makes mistakes.

“My commit is too big, can I split it into smaller ones?”. Yes.

git rebase --interactive

Change pick to edit. The rebase will stop at that commit so you can change it.

git reset HEAD^ resets the branch pointer but leaves your filesystem in place; then you can do a git add -p to pick and choose.
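
Putting the splitting recipe together (commit counts and messages hypothetical):

git rebase --interactive HEAD~3   # mark the big commit as "edit"
git reset HEAD^                   # undo the commit, keep the files on disk
git add -p                        # stage the first logical chunk
git commit -m "First logical change"
git add -p
git commit -m "Second logical change"
git rebase --continue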

Chapter 7: git bisect

“The feature’s broken? But it was working just fine two months ago… what changed?”

You need:

  • A test to determine if things are broken
  • A commit where things were working
  • A commit where things are broken

See here: work/2014/05/30/git-bisect-FTW/

If you have an automated test, it’s even faster:

git bisect run my_test.sh
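
Spelled out, a manual session looks something like this (refs hypothetical):

git bisect start
git bisect bad           # the current commit is broken
git bisect good v1.0     # this tag was known to work
# git checks out a commit halfway between; test it, then:
git bisect good          # or: git bisect bad
# repeat until git announces the first bad commit, then:
git bisect reset         # return to where you started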

Questions

Q: What do you do if you have to rebase and there are merge conflicts, and you have to resolve them repeatedly for every commit?

A: That’s a situation where you may want to squash all of your commits first, then do the resolution, so at least you only have to do it once. There’s also a git feature named rerere, which stands for “reuse recorded resolution” and can replay conflict resolutions for you. However, I have not used it, personally.

Python Concurrency From the Ground Up: LIVE!

with David Beazley

This talk is a live demonstration of writing a network service from the ground up using threads or coroutines. It’s a lot of low-level code, and there are a lot of things that could go wrong.

Fibonacci

Let’s start with a basic service that does something: Fibonacci, using a terrible recursive version of fib().

The larger the number, the longer it takes… this is actually a demonstration “feature”, since it gives us a wide range of computation times.
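
In the spirit of the demo, a sketch:

def fib(n):
    # deliberately terrible: exponential-time recursion
    if n <= 2:
        return 1
    return fib(n - 1) + fib(n - 2)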

Let’s make a microservice out of this.

Creating a network socket, binding it to an address, and accepting connections on it.

This is the world’s worst microservice.
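
A sketch of that server, along the lines of what was live-coded:

from socket import socket, AF_INET, SOCK_STREAM, SOL_SOCKET, SO_REUSEADDR

def fib_server(address):
    sock = socket(AF_INET, SOCK_STREAM)
    sock.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
    sock.bind(address)
    sock.listen(5)
    while True:
        client, addr = sock.accept()  # blocks until a client connects
        fib_handler(client)           # handles one client at a time

def fib_handler(client):
    while True:
        req = client.recv(100)        # blocks until data arrives
        if not req:
            break
        result = fib(int(req))
        client.send(str(result).encode('ascii') + b'\n')
    client.close()

fib_server(('', 25000))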

The problem with this is that it will only handle one client at a time which is a very basic concurrency problem.

From here, many people will turn to thread programming: hand each client off to a thread. It mostly works. Threads are one of those topics where many people say “never use threads”… Python actually uses real POSIX system threads, but there are problems with the GIL and other such things.
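
The change is essentially one line, continuing the sketch above:

from threading import Thread

while True:
    client, addr = sock.accept()
    # one thread per client, so slow clients no longer block new ones
    Thread(target=fib_handler, args=(client,), daemon=True).start()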

So, let’s measure. Write something that measures the response time when we hit the server with large requests.

One of the things that people know about the GIL is that it pins you to one CPU: if I hit the server with another session, it adds another 1/3 second to the run time. There’s another facet of the GIL that is not as widely known: it also affects the timing of short requests. So let’s measure in requests per second: keep a counter, sleep for a second, print and reset the counter.
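
A sketch of such a measurement client, run against the hypothetical server above:

from socket import socket, AF_INET, SOCK_STREAM
from threading import Thread
import time

sock = socket(AF_INET, SOCK_STREAM)
sock.connect(('localhost', 25000))

n = 0

def monitor():
    # print and reset the counter once a second
    global n
    while True:
        time.sleep(1)
        print(n, 'reqs/sec')
        n = 0

Thread(target=monitor, daemon=True).start()

while True:
    sock.send(b'1')         # a tiny, fast request
    resp = sock.recv(100)
    n += 1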

So hammer the server: we do about 25k requests per second. If another client asks for something that requires some CPU, requests per second plummet, down to about 90. The GIL gives priority to things that want the CPU. But… that’s not how operating systems prioritize. If I run the CPU-bound work in a separate process instead, you don’t even see a dent in the reqs/second.

Maybe you are actually writing a web service, like some computationally intensive thing (like computing NetCDF files!).

A solution to this is to farm the work out to a process pool, using concurrent.futures…

Now instead of computing the work yourself, you submit it to the pool. There’s way more overhead with the pool: reqs/sec drop down to 2.5k. The upside is that there’s no contention… you’re not fighting the GIL.
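
The handler, reworked to use a pool (a sketch; the pool size is arbitrary):

from concurrent.futures import ProcessPoolExecutor

pool = ProcessPoolExecutor(4)  # 4 worker processes

def fib_handler(client):
    while True:
        req = client.recv(100)
        if not req:
            break
        future = pool.submit(fib, int(req))  # run in a worker process
        result = future.result()             # blocks until the worker is done
        client.send(str(result).encode('ascii') + b'\n')
    client.close()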

One of the things that you may have heard: “threads suck, let’s not use them.”

What did threads give me? Essentially threads were helping me with all of these operations that might block, and the blocking is what is preventing concurrency.

Well, Python has a feature that blocks like that: generators. yield makes the code stop there.

So, let’s make a queue of tasks, put a bunch of generator functions on there, then write a little task scheduler. While there are tasks to run, I’ll run it to the yield statement, and then put it back in the tasks queue. Round robin scheduler.
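
A sketch of that scheduler, with a toy generator to drive it:

from collections import deque

tasks = deque()

def countdown(n):
    while n > 0:
        print('T-minus', n)
        yield            # give control back to the scheduler
        n -= 1

def run():
    while tasks:
        task = tasks.popleft()  # task is a generator
        try:
            next(task)          # run it to the next yield
            tasks.append(task)  # then put it back in line
        except StopIteration:
            pass                # task finished

tasks.append(countdown(3))
tasks.append(countdown(2))
run()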

So let’s put in yield statements any place where there is potential blocking, and then flesh it out so the generators drive it. I’m going to use the yield statement to communicate the program’s intention. I’m yielding… but why? Because I’m receiving, or because I’m sending.

OK, what does it mean to send and receive? For receiving, we have to go wait somewhere. So we’ll make waiting areas for things that are receiving and for things that are sending: two dictionaries mapping sockets to tasks (which are generators).

We need to get tasks back out of the waiting areas. Do a poll/select, and when sockets become ready, push their tasks back onto the task queue.
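
Sketched out, with yield ('recv', sock) / ('send', sock) as the convention:

from collections import deque
from select import select

tasks = deque()
recv_wait = {}  # socket -> task waiting to receive on it
send_wait = {}  # socket -> task waiting to send on it

def run():
    while any([tasks, recv_wait, send_wait]):
        while not tasks:
            # nothing runnable: wait until some socket is ready
            can_recv, can_send, _ = select(recv_wait, send_wait, [])
            for s in can_recv:
                tasks.append(recv_wait.pop(s))
            for s in can_send:
                tasks.append(send_wait.pop(s))
        task = tasks.popleft()
        try:
            why, what = next(task)      # run to the next yield
            if why == 'recv':
                recv_wait[what] = task  # park it until readable
            elif why == 'send':
                send_wait[what] = task  # park it until writable
        except StopIteration:
            pass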

So… now we have concurrency with no threads, with just a poll/select event loop. But this doesn’t avoid the GIL problem: you’re still limited to a single CPU. It also doesn’t solve the blocking problem.

In order to fix that, you have to go back to the original solution. Create a pool of worker threads.

If you’re programming with coroutines, you have to go all in, because future.result() blocks. You have to yield on the future, and the modifications for that are non-trivial. You have to make another waiting area for futures, and have a strategy for getting tasks out of there. Then you have to add a future_notify / future_event pair.

A lot of people think that just because you’re using coroutines, it means that you can ignore the GIL. That is not true. You still need to use a threadpool or something.

Programming model

One can write a class that takes the coroutine stuff and hides it behind an interface; then you can just do a yield from later in the code. From a logical-flow perspective, the coroutine implementation looks a lot like the threading implementation. One benefit is that coroutines are a great way to handle a large number of socket connections. This poll/select loop is the basis of all the event-driven frameworks like Twisted.

Conclusions

  • What is the impact of the GIL?
  • What is the impact of blocking?
  • You really have to think about what work you do where.

Fire your supervisord: running Python apps on CoreOS


with Dan Callahan

Your system is atomic: when an update is applied, either everything gets applied or nothing does. The server reboots itself. So you might ask: if the server is rebooting itself, how do I keep my app up?

Consensus with etcd

Three servers, one with metadata and two regular servers

  • etcd provides a lock service
  • so that an update doesn’t break your whole cluster, a server takes out a lock before applying the update and rebooting
  • etcd itself runs on CoreOS, so it’s also rebooting itself
  • etcd is designed to be distributed… you can build a cluster of nodes
  • As long as the majority of them are up, it will keep going.
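
For flavor, etcd’s key space reads and writes like a tiny filesystem. These etcdctl commands are from the etcd 2.x era, with made-up key names:

etcdctl set /services/web/host 10.0.0.2   # write a key
etcdctl get /services/web/host            # read it back from any node
etcdctl mk /locks/update "machine-1"      # create-if-absent, i.e. a crude lock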

Containerization

CoreOS itself is about 140 MB… no Python, Perl, Ruby, apt, anything. How do you run anything?

Use containers: Docker / rkt

A container is like a really lightweight whole computer, or a very heavyweight chroot.

It’s more like launching a local process. Docker popularized this recently, after Linux gained cgroups, but the idea has been around since 1998.

CoreOS ships Docker and rkt (pronounced “rok-et”)

Demo

  • Three-node cluster with the app and etcd running on the same nodes. You could do this, but it’s not recommended
  • etcd is reusable

Docker lets you build a hierarchy of build images. Inside the container, it looks like I’m on a Debian system.
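
A hypothetical minimal Dockerfile in that spirit (image and file names made up):

FROM debian:jessie
RUN apt-get update && apt-get install -y python python-pip
COPY app.py /srv/app.py
RUN pip install flask
CMD ["python", "/srv/app.py"]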

Scheduling

Fleet / Kubernetes (both of these are built on etcd)

Cluster-level init system

“Always keep these running, but not on the same machine”

Fleet is the “clustered interface for systemd”.

Lets you write normal systemd unit files, with a few Fleet extensions that describe which machines a unit should run on.
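
A sketch of such a unit file (names hypothetical; the X-Fleet keys vary between fleet versions):

[Unit]
Description=My web app

[Service]
ExecStart=/usr/bin/docker run --rm --name web myapp

[X-Fleet]
# never schedule two of these on the same machine
Conflicts=web@*.service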

He shows a bunch of containers running, shells into them, and shows them getting their config from the cluster. “I promised you high availability, so let’s go kill a machine.” He kills a machine and etcd is still running.

Design considerations

Minimize state; build “Twelve-Factor Apps” (Heroku’s methodology).

What about databases? This doesn’t get rid of things that actually need state. You still need something in front of all this doing load balancing.

If you’re on Amazon, then whatever, just pay for RDS. Things like PostgreSQL are not cluster-aware, so at some point you just need to deal with that.

We used

  • An OS with automatic, atomic, whole system updates
  • Portable, isolated containers for our applications.
  • Multiple servers in a coordinated cluster.
  • A scheduler to distribute jobs across machines.

CoreOS is supported on a ton of platforms

  • Local VMs (Vagrant)
  • Azure, EC2, GCE, RackSpace
  • Digital Ocean ($40 credit SAMMYLOVESPYCON)

Try it out and play with it.

Q: How does this relate to Chef/Puppet? A: This kind of obviates the need for Chef/Puppet and configuration management, because everything’s so lightweight that you can just serialize containers and send them somewhere else.

Using Supervisor For Fun And Profit

with Chris McDonough

Python OSS developer. Original developer of Supervisor and Pyramid; works for Agendaless Consulting.

Who I am not: the primary maintainer. That’s Mike Naberezny.

What is Supervisor?

Python application that lets you control and monitor processes on Unix-like systems

It’s a program that runs other programs. The programs it runs are not required to be Python programs.

Configuration

Config-file driven. Kinda like systemd or upstart, except it’s meant to be per-project instead of “PID 1”.

supervisord starts the programs. All communication with supervisord runs over HTTP. You can have it listen on a Unix domain socket instead of a TCP socket.

It has an RPC interface, but you can extend it by adding other interfaces.

That’s bookkeeping, but let’s run other programs.

[program:clock]

When we start supervisord, it will run one instance of this program. So… let’s do that. You can run supervisord in the foreground with -e debug to print out debugging information. We can see that clock is indeed running, and printing out its information.
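
A slightly fuller, hypothetical version of that config section:

[program:clock]
; the command and paths are made up
command=/usr/bin/python /path/to/clock.py
autostart=true
autorestart=true
stdout_logfile=/var/log/clock.log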

When we stop Supervisor, the programs that it started get stopped as well.

Allows you to control the process:

  • via a web interface (a bit weak, but it’s handy and it works)
  • via a command-line tool
  • programmatically via XML-RPC

I usually control it with supervisorctl. With it we can: start, stop, status, tail the log, maintail, reload, shutdown

More advanced things:

  • Change the config file and add another section
  • bring a process to the foreground (fg)

With supervisor settings you can:

  • run more than one instance of a program
  • restarts programs when they crash (unexpectedly)
    • If it can’t restart it, it backs off for a while
    • If it still can’t start it, it gives up after a while
    • Useful for weird corner cases that you don’t have time to debug or file a bug for

Finally, you can use the XML-RPC interface to control everything; you can get full status information with a couple of lines of Python.
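
For example, with the standard library’s XML-RPC client (assuming the default inet_http_server on port 9001):

from xmlrpc.client import ServerProxy

server = ServerProxy('http://localhost:9001/RPC2')
for proc in server.supervisor.getAllProcessInfo():
    print(proc['name'], proc['statename'])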

FAQ: Programs run under supervisor must not daemonize themselves, in contrast to god or monit. Supervisor does the daemonization

Event Listeners

programs that take tokens from supervisor on stdin and …

e.g. a package called superlance (httpok, crashmail, memmon). httpok continually pokes a URL that you supply, and if it does not get a 200 OK it will restart something that is running under Supervisor. crashmail will send you an e-mail when a process crashes. memmon restarts a process if it goes over a memory threshold.

Supervisor in real-world app deployments: Ansible/SaltStack/Puppet. Sometimes packages will put little snippets of supervisord config in /etc/supervisord.d, which is really useful. It’s sometimes used in the Docker community to start multiple programs inside a container.

History. Has been around forever. Since 2004. Sometime this year, Supervisor 4.0 will work on Py3 and signal. I want it to be replaced by systemd or something else. That’d be nice.

Questions?

Q: I always have trouble with gunicorn workers not being killed

A: You can set the option killasgroup. It should be the default, but it’s not right now. Also, you need to ensure that the processes you start are not daemonized.



