Interactive data for the web: Bokeh for web developers

with Sarah Bird

Follow along

What is it?

  • data visualization library in Python
  • it makes your visualizations for the web
  • interactive
  • dynamic and data driven
  • roots in data science

I like writing in Python more than JavaScript, so this is cool because you can stay in that world

Why?

  • mid-data (and big-data)
  • It renders with HTML5 Canvas, so you can throw 10k points at it (as opposed to d3.js, where you'd have to create that many DOM elements)
  • real-time data updates
  • server-side processing
  • my new “default”… maybe you have to move on and try something else if you are really specialized

Rendering pipeline: bokeh/python -> JSON -> bokehjs/javascript -> [ canvas, html ]

So this is cool: lots of other languages can send JSON too, so you can run those and feed the data straight to the bokehjs layer.

Also, there’s a compatibility module which allows you to use all of your old matplotlib and ggplot plots.

Agenda

We built the Water Aid Africa site

At the time I built it in d3.js, but we could do the same thing in Bokeh, so I’ll show you how in this talk

High-level Charts interface

from bokeh.charts import Line
from bokeh.plotting import show

chart = Line(data)
show(chart)

It gives you zoom, save, reset, pan… all this stuff already built in. What was the data that we just saw? It was a pandas data frame.

import pandas as pd

pd.read_csv("my_beautiful_csv.csv")
pd.read_json("my_deliciously_nested.json")

You could do the same thing with a Django Queryset

stats = stats.values(...)
stats_df = pd.DataFrame(...)

Or you could do the same thing with raw SQL
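Not from the talk, but a minimal sketch of the raw-SQL route, assuming a hypothetical SQLite file and table name:

import sqlite3

import pandas as pd
from bokeh.charts import Line
from bokeh.plotting import show

# Hypothetical database and query; any DB-API connection works with read_sql.
conn = sqlite3.connect("washmap.db")
stats_df = pd.read_sql("SELECT year, pct_with_water FROM water_stats", conn)

chart = Line(stats_df)
show(chart)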

Style Guide

From a data visualization perspective, this is awesome to start with. But your designer hates it. Make it shiny. Take the reductive approach.

Add labels and palettes. We have a Python object, so we can specify an x range, y range, line weight, etc. I use the IPython notebook when I'm doing this sort of iterative styling because I can see my changes immediately.
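My notes don't have the exact styling calls, so here is a minimal sketch of that kind of iterative styling using the bokeh.plotting interface; the ranges, labels, colours, and data are made up:

from bokeh.models import Range1d
from bokeh.plotting import figure, show

years = [1990, 2000, 2012]        # made-up data
pct_with_water = [45, 58, 67]

p = figure(title="Access to water", plot_width=800, plot_height=400)
p.x_range = Range1d(1990, 2012)
p.y_range = Range1d(0, 100)
p.xaxis.axis_label = "Year"
p.yaxis.axis_label = "% of population with access"
p.line(years, pct_with_water, line_width=3, line_color="#22A7F0")

show(p)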

OK restart.

from bokeh.models import Plot

plot = Plot()       # empty space
plot.add_glyph(…)
plot.add_layers(…)  # one other

Embedding

We can load bokeh.js asynchronously so that the page loads quickly. Or if you want to make your plots responsive, there are other options for tweaking the templates.
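The embed code itself wasn't in my notes; a minimal sketch of the usual path via bokeh.embed.components (the exact call signature has shifted a little between Bokeh versions, and the template wiring is up to you):

from bokeh.embed import components

# components() returns a <script> block and a target <div> to drop into your
# page template; BokehJS itself is loaded separately (and can be async).
script, div = components(plot)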

Glyph

(shapes).

from bokeh.models import Circle, Line, Triangle, Rect, Text, .....

We have to style them ourselves, because we're drawing to canvas, not SVG.

Map: Glyphs + Data

Not tricky. Really easy.

from bokeh.models import ColumnDataSource
source = ColumnDataSource(data)
plot.add_glyph(source, glyph)
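My notes don't include the real map code; a sketch of the same idea with a Patches glyph, using made-up country outlines and colours:

from bokeh.models import ColumnDataSource, Patches, Plot, Range1d

# Made-up outlines: one list of xs/ys per "country", plus a fill colour each.
source = ColumnDataSource(data=dict(
    xs=[[0, 1, 1, 0], [1, 2, 2, 1]],
    ys=[[0, 0, 1, 1], [0, 0, 1, 1]],
    color=["#8dd3c7", "#fb8072"],
))

plot = Plot(x_range=Range1d(0, 2), y_range=Range1d(0, 1))
plot.add_glyph(source, Patches(xs="xs", ys="ys", fill_color="color"))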

So we used a glyph to give each of the countries a specific color; we have the key and the map. Now let's add some interaction.

Tap

plot.add_tools(TapTool())

Hover

plot.add_tools(HoverTool(tooltips="@activeyear"))

Now when I click, I get the responsiveness and the selection.

Linked selection

Everything’s a Python object, so we can start sharing them with each other

source = ColumnDataSource(data)

Now we can create a new control with the same source, and both tools will share the logic of that selection.
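A minimal sketch of linked selection with a shared source, using the bokeh.plotting interface and made-up data (the talk used the lower-level models):

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, gridplot, show   # gridplot moved to bokeh.layouts in later versions

# One shared source: select points in either plot and the other follows.
source = ColumnDataSource(data=dict(x=[1, 2, 3], y=[4, 5, 6], z=[7, 2, 9]))

left = figure(tools="tap,box_select")
left.circle("x", "y", source=source, size=10)

right = figure(tools="tap,box_select")
right.circle("x", "z", source=source, size=10)

show(gridplot([[left, right]]))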

Widgets

  • Make a set of tabs (a minimal sketch follows this list)
    • Panel with water
    • Panel with sanitation
    • All data is shared and works together
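Roughly what that looks like (a sketch that assumes two already-built plots named water_plot and sanitation_plot):

from bokeh.models.widgets import Panel, Tabs

tabs = Tabs(tabs=[
    Panel(child=water_plot, title="Water"),
    Panel(child=sanitation_plot, title="Sanitation"),
])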

Slider

We need the server for this, so we create a server app

Define the instance attributes that are updated or changed

class WashmapApp(VBox):
    year = Instance(Slider)
    source = Instance(ColumnDataSource)

Serve plot (bokeh server)

Once we do that, the server pushes updates to the plot as the slider moves.

Do I really need to set up a server to get a slider on my visualization? Yes, but…

  • JavaScript actions framework
    • full framework “next few months”
    • See here bokeh/bokeh/models/actions.py
  • The server is awesome!
    • Complex data manipulation
    • Downsampling
    • Stream your data
  • There’s always bokehjs
    • See the repo for how to use bokehjs to make your plots responsive

Conclusions

Python Performance Profiling: The Guts and the Glory

with A. Jesse Jiryu Davis

Staff engineer at MongoDB

At MongoDB we write drivers for our database in 10 different languages. The driver that I mainly work on is PyMongo, the standard Python driver for MongoDB.

A blog post came out discussing how they could get over 80,000 inserts/second on commodity hardware.

It was written by a stranger to me, and on the whole it was good news that he had managed to get good performance. But the problem was:

  • MongoDB Node.js driver: 88,000 per second
  • PyMongo: 29,000 per second

So I get a one line e-mail from my boss, CC’ing our CEO asking: “Why Is PyMongo Slower?”

Eliot, our CEO, is one of the most frightening people that I know. I am now on high alert, both for the sake of my job and for knowing what the hell is going on.

import pymongo

client = pymongo.MongoClient('localhost')
db = client.random
collection = db.randomData
collection.remove()

n_documents = 80000
batch_size = 5000
batch = []

import time
starttime = time.time()

from datetime import datetime

# Assumed example values; the talk only said the timestamps span about a year.
min_date = datetime(2012, 1, 1)
max_date = datetime(2013, 1, 1)
delta = (max_date - min_date).total_seconds()

# More code... he makes a random number distributed over a year of seconds
# Converts a datetime to time tuple, converts it to a timestamp
# Another random value
# and he appends it to the batch

That’s a good pattern to insert documents 5000 at a time. You can insert up to 16MB of documents at a time.
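The loop body wasn't in my notes; a sketch of that batch pattern, with a hypothetical make_document() standing in for the random-document code described above:

import random

def make_document():
    # Hypothetical stand-in for the random timestamp + value described above.
    offset = random.random() * delta
    return {'ts': time.mktime(min_date.timetuple()) + offset,
            'value': random.random()}

for i in range(n_documents):
    batch.append(make_document())
    if len(batch) == batch_size:
        collection.insert(batch)   # PyMongo 2.x inserts a list as one batch
        batch = []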

duration = time.time() - starttime
print 'inserted %d documents per second' % (n_documents / duration)

inserted 30,000 documents per second

That’s about what he got, so we’re working from a common baseline.

The Node.js Code

[ not shown because we are at PyCon ]

The Question

Why is the Python script 3x slower than the equivalent Node script? Is it my fault?

Why Profile?

  • Optimization is like debugging
  • Hypothesis: “The following change will yield a worthwhile improvement”
  • Experiment
  • Repeat until it meets your needs

Generally, optimizing your code makes it worse. It will make it less clear and more complicated. But typically, your original intention is represented in the first iteration.

Profiling is a way to generate your hypothesis. Profiling isn’t the experiment. It makes your code slower in unpredictable ways. So only an unprofiled run of the benchmark gives you your experiment.

Which Profiler?

  • cProfile (I did not reach for this one… it’s… fine)
  • Yappi (Yet Another Python Profiler)
    • as fast as cProfile
    • written in C
    • can measure functions
    • CPU time, not just wall-clock time
    • can measure all threads
    • can export to callgrind

Yappi

import yappi

yappi.set_clock_type('cpu')
yappi.start(builtins=True)

for i in range(n_documents):
    pass  # Vlad's insert script goes here

# Export stuff here
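# Not in my notes -- a guess at the export step: yappi can save stats in
# callgrind format, which KCacheGrind can open.
yappi.stop()
yappi.get_func_stats().save('callgrind.out', type='callgrind')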

Use KCacheGrind to explore

Spends 2/3 of its time in collection.insert

The hypothesis is that even if PyMongo were infinitely fast and took zero time, the script would still be slower than the Node one.

  • Before: 30k inserts per second
  • After: 50k inserts per second

PyMongo 3.0:

  • Before: 38k inserts per second
  • After: 59k inserts per second

On PyPy (since it has a JIT compiler, like Node.js)

  • Before: 51k inserts per second
  • After: 73k inserts per second

Conclusions

  • Generate hypotheses
  • Estimate possible improvement
  • Don’t start doing things that require a large time investment
    • Don’t start writing a caching layer
    • Don’t start doing other complicated things
    • Just stub out the thing that you think takes time (a sketch of this follows)
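Not from the talk's code, but a sketch of what "stub it out" could look like for the insert benchmark above, to estimate the best case if the driver cost nothing (FakeCollection is hypothetical):

class FakeCollection(object):
    """Pretend the driver is infinitely fast: swallow the batch, do nothing."""
    def insert(self, docs):
        pass

# Swap it in, re-run the unprofiled benchmark, and you have a ceiling on what
# speeding up collection.insert could ever buy you.
collection = FakeCollection()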

Performance by the Numbers: analyzing the performance of web applications

with Geoff Gerrietts

Engineering Manager at AppNeta

15 years of Python

Performance Matters

It's been hard to say how much it matters. Features matter, but it's really hard to specify what the performance characteristics are. Sometimes that's because we haven't known the value. Recently, Amazon and Walmart have both released studies that map monetary value to milliseconds.

Premature optimization is the root of all evil. -Donald Knuth

It’s really hard to deliver your code if you don’t know what the metrics are.

Slow websites erode your sanity and lose you money.

Ignoring latency does not make it better. But there are still plenty of mistakes to make.

Lots of people pass the buck. The developers blame the database, the database manager blames the systems guy and the systems guy points back to the code.

Ask not what your database can do for you, but what you can do for your database.

Just don't take "The Drunken Man" approach: don't just come up with a random idea, do a bunch of work, uglify your code, and then… I don't know… hope that things got better.

We spent $100k doing nothing, but it was cool.

Don’t use The Hammer. “Everything looks like a thumb”. If you just use one tool, you can’t see into your blind spots.

Root Cause Analysis. All of the above antipatterns stop short of finding the root problem.

But it’s hard to do right

Sometimes we go straight to profilers. Be careful: a profiler can become A Hammer. E.g. we stuck a profiler between Django and WSGI and got killed by the overhead of the extra function calls and of writing out the stats. That much overhead can distort the profile picture.

More stories. Where we stuck that profiler in the middleware, we were trying to address the fact that the Django app was pretty slow, so we used the Apache logs to try and simulate the traffic. We couldn't figure out why one URL would take a few milliseconds while another would take way longer; and since we didn't have the POST data, it wasn't a meaningful exercise.

Statistical Profiling

  • produces a statistical model
  • takes samples at intervals
  • lacks context

Operating System Tools

  • Maybe top or strace
  • Maybe you're secretly a sysadmin
  • Look at this graph!

OS Tools are great for observing resource depletion

It often presents as an acute system problem

Real insight into the code requires instrumentation

Ad-hoc

  • I wrote a stats service that would report mean latency, etc.
  • We wrote a Flask app… asynchronous fanout aggregator
    • 3 collectors, flask app, 3 reporters?
  • Etsy's statsd (a minimal usage sketch follows this list)
    • with statsd.timer('foo')
    • Tracks various metrics over time
  • works great for tracking and trending discrete events
  • can be labour intensive
    • everything is hand-tooled
    • can be exhausting to interpret
    • one dude I know just keeps a screen full of munin graphs all day
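A minimal sketch of that statsd pattern (see the bullet above), assuming the Python statsd client and a daemon on localhost; the metric names are made up:

import time
import statsd

client = statsd.StatsClient('localhost', 8125)

# Time a block of code; statsd aggregates it for graphing over time.
with client.timer('views.dashboard.render'):
    time.sleep(0.05)   # stand-in for the real work

# Count discrete events.
client.incr('views.dashboard.hits')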

Tracing

It’s hard to see performance without the context of the full request

Lots of tools exist to help

  • E.g. Twitter’s Zipkin
  • With these traces, you can add them and aggregate them and visualize them
  • This is a great place to start, when you’re looking at latency data
  • The architecture diagram is horrible
  • It is not super easy to do in a generic way for every application
  • Free versions
    • Google’s Dapper paper
    • Yammer’s Telemetry
    • Twitter’s Zipkin
  • Services
      • Examples
      • AppNeta
      • New Relic
      • AppDynamics
    • One-Stop setup

Conclusions

  • Build a toolbox
  • Don’t pick a hammer

@ggerrietts




Published: 16 April 2015
Category: work