PyCon Day 4: Afternoon Sessions
Interactive data for the web: Bokeh for web developers
with Sarah Bird
What is it
- data visualization library in Python
- it builds your visualizations for the web
- interactive
- dynamic and data driven
- roots in data science
I like writing in Python more than JavaScript, so this is cool because you can stay in that world
Why?
- mid-data (and big-data)
- It uses HTML5 Canvas to render, so you can throw 10k points at it (as opposed to d3.js, where you have to create too many DOM elements)
- real-time data updates
- server-side processing
- my new “default”… maybe you have to move on and try something else if you are really specialized
Rendering pipeline: bokeh/python -> JSON -> bokehjs/javascript -> [ canvas, html ]
So this is cool because there are lots of other languages that can send JSON, so you can run those and send the data to the bokehjs layer.
Also, there's a compatibility module which allows you to use all of your old matplotlib and ggplot plots.
Agenda
We built the Water Aid Africa site. At the time I built it in d3.js, but we could do the same thing in Bokeh, so I'll show you how in this talk.
High-level Charts interface
from bokeh.charts import Line
from bokeh.plotting import show
chart = Line(data)
show(chart)
It gives you zoom, save, reset, pan… all this stuff already built in. What was the data that we just saw? It was a pandas DataFrame.
pd.read_csv("my_beautiful_csv.csv")
pd.read_json("my_deliciously_nested.json")
You could do the same thing with a Django QuerySet
stats = stats.values(...)
stats_df = pd.DataFrame(...)
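To make that concrete, here's a hedged sketch with a hypothetical Stat model and field names; the pattern is just "pull dicts out of the QuerySet, hand them to pandas":

import pandas as pd

stats = Stat.objects.filter(year__gte=1990)          # hypothetical model and filter
records = stats.values("country", "year", "value")   # rows as dicts
stats_df = pd.DataFrame(list(records))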
Or you could do the same thing with raw SQL
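For the raw SQL route, a minimal sketch using pandas' read_sql_query against a hypothetical SQLite file:

import sqlite3
import pandas as pd

conn = sqlite3.connect("washmap.db")   # hypothetical database
stats_df = pd.read_sql_query("SELECT country, year, value FROM stats", conn)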
Style Guide
From a data visualization perspective, this is awesome to start with. But your designer hates it. Make it shiny. Take the reductive approach.
Add labels and palettes. We have a Python object, so we can specify an x range, y range, line weight, etc. I use the IPython notebook when I'm doing this sort of iterative styling because I can see my changes immediately.
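For example, here's a minimal styling sketch using the bokeh.plotting interface (the ranges, labels, and colors are made up, not the talk's actual values):

from bokeh.plotting import figure, show

p = figure(title="Access to water", x_range=(1990, 2015), y_range=(0, 100),
           x_axis_label="year", y_axis_label="% of population")
p.line([1990, 2000, 2010, 2015], [45, 58, 67, 72], line_width=3, line_color="#2166ac")
show(p)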
OK restart.
from bokeh.models import Plot
plot = Plot()         # empty space
plot.add_glyph(...)
plot.add_layout(...)  # one other
Embedding
We can load bokeh.js asynchronously so that the page loads quickly. Or if you want to make your plots responsive, there are other options for tweaking the templates.
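One way to do the embedding is bokeh.embed.components, which hands you a script and a div to drop into your own template (a sketch; the template wiring is up to you):

from bokeh.embed import components

script, div = components(plot)
# put `script` and `div` in your page template and load BokehJS separately,
# optionally with an async script tag so the rest of the page renders first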
Glyphs (shapes)
from bokeh.models import Circle, Line, Triangle, Rect, Text, .....
We have to style them ourselves, because we're using Canvas, not SVG.
Map: Glyphs + Data
Not tricky. Really easy.
from bokeh.models import ColumnDataSource
source = ColumnDataSource(data)
plot.add_glyph(source, glyph)
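Put together, that looks roughly like this (made-up column names and coordinates, not the actual Washmap data):

from bokeh.models import Plot, ColumnDataSource, Circle

source = ColumnDataSource(data=dict(lon=[0.0, 10.0, 20.0], lat=[5.0, 15.0, 25.0]))
glyph = Circle(x="lon", y="lat", size=10, fill_color="#2166ac")

plot = Plot()
plot.add_glyph(source, glyph)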
So, we used a glyph to make each of the countries a specific color; we have the key and the map. Now let's add some interaction.
Tap
from bokeh.models import TapTool
plot.add_tools(TapTool())
Hover
from bokeh.models import HoverTool
plot.add_tools(HoverTool(tooltips="@activeyear"))
Now when I click, I get that responsiveness and selection.
Linked selection
Everything’s a Python object, so we can start sharing them with each other
source = ColumnDataSource(data)
Now we can create a new control with the same source, and both tools will share the logic of that selection.
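A small sketch of that idea using the bokeh.plotting interface and made-up data — two plots driven by one ColumnDataSource, so selections carry over:

from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

source = ColumnDataSource(data=dict(x=[1, 2, 3], water=[60, 70, 80], sanitation=[40, 55, 65]))

left = figure(tools="tap,box_select")
left.circle("x", "water", source=source)

right = figure(tools="tap,box_select")
right.circle("x", "sanitation", source=source)
# selecting points in one plot highlights the same rows in the other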
Widgets
- Make a set of tabs
- Panel with water
- Panel with sanitation
- All data is shared and works together
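A sketch of the tabs widget with stand-in plots (Panel and Tabs come from bokeh.models.widgets):

from bokeh.plotting import figure
from bokeh.models.widgets import Panel, Tabs

water_plot = figure(title="Water")            # stand-ins for the real Washmap plots
sanitation_plot = figure(title="Sanitation")

tabs = Tabs(tabs=[
    Panel(child=water_plot, title="Water"),
    Panel(child=sanitation_plot, title="Sanitation"),
])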
Slider
We need the server for this, so we create a server app
Define the instance attributes that are updated or changed
class WashmapApp(VBox):
    year = Instance(Slider)
    source = Instance(ColumnDataSource)
Serve plot (bokeh server)
Now we serve that with the Bokeh server.
Do I really need to set up a server to get a slider on my visualization? Yes, but…
- JavaScript actions framework
- full framework “next few months”
- See here bokeh/bokeh/models/actions.py
- The server is awesome!
- Complex data manipulation
- Downsampling
- Stream your data
- There's always bokehjs
- See the repo for how to use bokehjs to make your plots responsive
Conclusions
- Run ALL THE EXAMPLES
- bokeh/examples/*/*.py
- 1.0 by end of the year
- Get a server up and running
Python Performance Profiling: The Guts and the Glory
with A. Jesse Jiryu Davis
Staff engineer at MongoDB
At MongoDB we write drivers for our database in 10 different languages. The driver that I mainly work on is called pymongo, the standard Python driver for MongoDB.
A blog post came out claiming they could get Over 80,000 Inserts/Second on Commodity Hardware
This was written by a stranger to me and on the whole it was good news that he had managed to get good performance. But the problem was that:
- MongoDB Node.js driver: 88,000 per second
- PyMongo: 29,000 per second
So I get a one line e-mail from my boss, CC’ing our CEO asking: “Why Is PyMongo Slower?”
Eliot, our CEO, is one of the most frightening people that I know. I am now on high alert, both for the sake of my job and for knowing what the hell is going on.
import pymongo
client = pymongo.MongoClient('localhost')
db = client.random
collection = db.randomData
collection.remove()
n_documents = 80000
batch_size = 5000
batch = []
import time
start = time.time()

from datetime import datetime

min_date = datetime(2012, 1, 1)   # illustrative values; the script uses a one-year range
max_date = datetime(2013, 1, 1)
delta = (max_date - min_date).total_seconds()

# More code... he makes a random number distributed over that year of seconds,
# converts a datetime to a time tuple, converts that to a timestamp,
# adds another random value, and appends the document to the batch
That’s a good pattern to insert documents 5000 at a time. You can insert up to 16MB of documents at a time.
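The insert call itself isn't shown above, so here's a hedged sketch of the batching pattern, with make_document() standing in for the blog post's random-document code:

import random

def make_document():
    # stand-in for the elided random-document code above
    return {"x": random.random()}

for i in range(n_documents):
    batch.append(make_document())
    if len(batch) == batch_size:
        collection.insert(batch)   # one round trip per 5,000 documents
        batch = []

if batch:
    collection.insert(batch)       # insert any remainder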
duration = time.time() - start
print 'inserted %d documents per second' % (n_documents / duration)
inserted 30,000 documents per second
That’s about what he got, so we’re working from a common baseline.
The Node.js Code
[ not shown because we are at PyCon ]
The Question
Why is the Python script 3x slower than the equivalent Node script? Is it my fault?
Why Profile?
- Optimization is like debugging
- Hypothesis: “The following change will yield a worthwhile improvement”
- Experiment
- Repeat until it meets your needs
Generally, optimizing your code makes your code worse. It will make it less clear and more complicated. But typically, your original intention is represented in the first iteration.
Profiling is a way to generate your hypothesis. Profiling isn’t the experiment. It makes your code slower in unpredictable ways. So only an unprofiled run of the benchmark gives you your experiment.
Which Profiler?
- cProfile (I did not reach for this one… it’s… fine)
- Yappi (Yet Another Python Profiler)
- as fast as cProfile
- written in C
- can measure functions
- CPU time, not just wall
- can measure all threads
- can export to callgrind
Yappi
import yappi
yappi.set_clock_type('cpu')
yappi.start(builtins=True)
for i in range(n_documents):
#Vlads script
# Export stuff here
Use KCacheGrind to explore
Spends 2/3 of its time in Collection.insert
The hypothesis is that even if pymongo were infinitely fast and took zero time, the script would still be slower than the Node one.
- Before: 30k inserts per second
- After: 50k inserts per second
PyMongo 3.0:
- Before: 38k inserts per second
- After: 59k inserts per second
On PyPy (since it has a JIT compiler, like Node.js)
- Before: 51k inserts per second
- After: 73k inserts per second
Conclusions
- Generate hypotheses
- Estimate possible improvement
- Don’t start doing things that require a large time investment
- Don’t start writing a caching layer
- Don’t start doing other complicated things
- Just stub out the thing that you think takes time
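One way to do that estimate (a sketch, not necessarily how Jesse did it): swap the suspected hot spot for a no-op and rerun the plain, unprofiled benchmark — the rate you get is the ceiling for optimizing that piece.

class FakeCollection(object):
    """Pretends the driver and the server take zero time."""
    def insert(self, docs):
        pass

collection = FakeCollection()   # swap in for the real pymongo collection, rerun the timing loop
# if docs/second barely moves, PyMongo wasn't the bottleneck;
# if it jumps, that's roughly the best any driver optimization could ever buy you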
Performance by the Numbers: analyzing the performance of web applications
with Geoff Gerrietts
Engineering Manager at AppNeta
15 years of Python
Performance Matters
It's been hard to say how much it matters. Features matter, but it's really hard to specify what the performance characteristics are. Sometimes that's because we haven't known the value. Recently, Amazon and Walmart have both released studies that map monetary value to milliseconds.
Premature optimization is the root of all evil. -Donald Knuth
It’s really hard to deliver your code if you don’t know what the metrics are.
Slow websites erode your sanity and lose you money.
Ignoring latency does not make it better. But there are still plenty of mistakes to make.
Lots of people pass the buck. The developers blame the database, the database manager blames the systems guy and the systems guy points back to the code.
Ask not what your database can do for you, but what you can do for your database.
Just don’t take “The Drunken Man” approach, don’t just come up with a random idea, do a bunch of work, uglify your code, and then… I don’t know… hope that things got better.
We spent $100k doing nothing, but it was cool.
Don’t use The Hammer. “Everything looks like a thumb”. If you just use one tool, you can’t see into your blind spots.
Root Cause Analysis. All of the above antipatterns stop short of finding the root problem.
But it’s hard to do right
Sometimes we go straight to profilers. Be careful. A profiler can become A Hammer. E.g. stick a profiler between Django and WSGI. We got killed by function overhead from making extra calls and writing out the stats. The amount of overhead involved in writing the stats can distort the profile picture.
More stories. Where we stuck that profiler in the middleware, we were trying to address the fact that the Django app was pretty slow. We used the Apache logs to try and simulate the traffic, but we couldn't figure out why one URL would take a few milliseconds while another would take way longer. We didn't have the POST data, so it wasn't a meaningful exercise.
Statistical Profiling
- produces a statistical model
- take samples sometimes
- lacks context
Operating System Tools
- Maybe top or strace
- Maybe you're secretly a sysadmin
- Look at this graph!
OS Tools are great for observing resource depletion
It often presents as an acute system problem
Real insight into the code requires instrumentation
Ad-hoc
- I wrote a stats service that would report mean latency, etc.
- We wrote a Flask app… asynchronous fanout aggregator
- 3 collectors, flask app, 3 reporters?
- Etsy's statsd (see the sketch after this list)
with statsd.timer('foo'):
- Tracks various metrics over time
- works great for tracking and trending discrete events
- can be labour intensive
- everything is hand-tooled
- can be exhausting to interpret
- one dude I know just keeps a screen full of munin graphs all day
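A minimal sketch of the statsd style, assuming the statsd PyPI client and a statsd daemon on localhost:8125; the metric names and the function being timed are made up:

import statsd

def process_checkout():
    pass   # hypothetical work being measured

client = statsd.StatsClient('localhost', 8125)

with client.timer('checkout.total'):   # records the elapsed time as a timer metric
    process_checkout()

client.incr('checkout.completed')      # counts a discrete event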
Tracing
It’s hard to see performance without the context of the full request
Lots of tools exist to help
- E.g. Twitter’s Zipkin
- With these traces, you can add them and aggregate them and visualize them
- This is a great place to start, when you’re looking at latency data
- The architecture diagram is horrible
- It is not super easy to do in a generic way for every application
- Free versions
  - Google's Dapper paper
  - Yammer's Telemetry
  - Twitter's Zipkin
- Services
  - Examples
    - AppNeta
    - New Relic
    - AppDynamics
- One-Stop setup
  - Examples
Conclusions
- Build a toolbox
- Don’t pick a hammer
@ggerrietts