PyCon: Day 2
Jessica McKellar Keynote
The current (sad) state of affairs
Started with a "depressing" introduction discussing the sorry state of Computer Science (CS) education in K-12. Very few students take the CS Advanced Placement (AP) exam; one of the lowest counts of any discipline. The field also has a big lack of diversity, both gender and racial. In fact, the CS AP exam had the worst gender ratio of all AP exams, and there was a large handful of states in which zero students of color or women took the exam.
What can we do about it
The rest of the talk was a call to action matching the realization that there's only room for improvement from the current state of affairs. Jessica gave a range of actions. Some were as simple as calling state legislatures and school boards and requesting (nay, demanding!) that CS count as a science credit and core curriculum (if it doesn't count for anything, then no one will take it and no one will want to teach it). Other actions were more involved, such as calling up the local CS teacher and volunteering to lend them an ear if they had questions or needed curriculum advice, or volunteering to come in and speak to the class one day about what having CS in your background looks like on the job.
Some people asked questions about whether mailing lists existed and such, but she commented that this problem will likely only be solved at the local/distributed level and with personal relationships.
Jessica noted that the Python Software Foundation runs a number of programs which focus specifically on outreach. If you can implement a good idea (e.g. a summer program teaching kids programming, or training for teachers on how to teach Python), the PSF is very willing to fund it.
Fernando Pérez Keynote
Much like other talks by Fernando, this one was a whirlwind of accomplishments by the IPython community. It started with a brief history of IPython and Scipy/Numpy and then went into present ways in which IPython is being used to facilitate open and reproducible science.
Highlights
- Fernando's collaboration with biologists to produce results for an ecology paper in one day on AWS, results which would have taken a month to run serially as written. They published a corresponding article just about the process of how they did the reproducible research, which included an executable AWS instance setup and the code. Someone else need only plug in their own AWS id (and pay the bill, obviously). This is a great, great model.
- A detailed IPython notebook/blog post by a social scientist critiquing analysis by Nate Silver/538. It actually inspired 538 to change a practice and post their code within two days.
- How IPython is pluggable, such that you can actually implement processing kernels for other languages and hook them up to the IPython Notebook. This has apparently been done for 11-ish other languages.
A Scenic Drive Through the Django Request/Response Cycle
Which interface?
- mod_python?
- FastCGI?
- CGI?
Back in the early days, things were strongly coupled
PEP 333 / December 2003 / WSGI was invented
It's simple and easy to implement. No state (obviously).
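For reference, a minimal sketch of a WSGI application (my own, not from the talk): just a callable that takes the environ dict and a start_response callback.

def application(environ, start_response):
    # environ is the CGI-style dict; start_response sends status + headers
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'Hello from WSGI\n']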
Django speaks WSGI
Django test client and WSGIHandler are both subclasses of base.BaseHandler
BaseHandler
- Request middleware
- Resolve request.path_info
- View Middleware
- View
- Response Middleware
- ExceptionMiddleware (maybe)
Middleware?
Enterprisey? Big, expensive, a PITA to work with. In Django? A light, low-level "plugin" system for globally altering Django's input and output, run on every request that comes in and every response that goes out.
Installing middleware is just adding a line to the settings file:
MIDDLEWARE_CLASSES = (
...
)
class PyConRequestMiddleware(object):
    def process_request(self, request):
        return None  # or an HttpResponse
Break Time
Client -> Web server -> WSGI -> Request Middleware -> URL Resolution -> View Middleware -> View -> (Maybe Exception Middleware) -> Response Middleware -> WSGI -> Web Server -> Client
django.core.urlresolvers
This stuff is mostly obvious
View middleware
The speaker does not use them much. They're useful if you want your middleware to know something about the view itself.
View
Where the logic lives. Where you do the work itself.
- HttpResponse
- HttpResponseRedirect
- JsonResponse
- StreamingHttpResponse (yay!)
Cool, and it's there, but it will break some middlewares. Could we run the work offline and give a push notification instead? See here: https://docs.djangoproject.com/en/1.5/ref/request-response/#django.http.StreamingHttpResponse
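A small sketch of what a streaming view can look like (the view name and data here are made up): the body is generated lazily instead of being built all in memory.

from django.http import StreamingHttpResponse

def big_csv(request):
    # yield rows one at a time; nothing is buffered up front
    def rows():
        for i in range(1000000):
            yield "%d,%d\n" % (i, i * i)
    return StreamingHttpResponse(rows(), content_type="text/csv")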
- Class-based views
- Function-based views
Templates - You can put HTML directly in your views, but when you scale up and work with designers, you'll really want to use a template and render it.
Context processor:
Another Pause
Go to Schwartz's Chez
django.core.handlers.base (BaseHandler)
- ingress uses request middleware
- egress uses response middleware
class PyConResponseMiddleware(object):
    def process_response(self, request, response):
        return response  # must return an HttpResponse (the one passed in, or a new one)
useful for cookie management
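For example, a hypothetical response middleware (mine, not from the talk) that stamps a cookie on every outgoing response:

class VisitedPyConMiddleware(object):
    def process_response(self, request, response):
        # mutate the outgoing response, then hand it back down the stack
        response.set_cookie('visited_pycon', '1')
        return response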
So, one could package all of your middleware into the same class.
But... middleware is ordered in settings: MIDDLEWARE_CLASSES
So, if you package them up in one class you have to put them in the same spot in the ordering
Exception Handling
Designed to run if you have done something wrong and raised an exception that you have not handled.
class PyConMiddleware(object):
    def process_exception(self, request, exception):
        return None  # or an HttpResponse
Hey, we're back at the web server
Garbage Collection in Python
How GC Works in both CPython and PyPy
Only picking those two, because those are the only ones that the speaker knows
What is GC?
- Unused objects are finalized and deallocated
- When?
- "eventually"... or never!
- not running out of memory is good
CPython: reference counting
- Every reachable object has a count of how many other objects want to keep it alive.
- New objects have ref count of 1
- When there is a ref count of 0, it can be safely deallocated
- This requires some graph traversal
- When an object dies, you have to decrement the ref count of everything it refers to
Py_INCREF(v) / Py_DECREF(old_value)
- Reference counting's major flaw is that there can be reference cycles
- E.g. objects C and D both reference each other, so their ref counts never drop to zero
- CPython's cyclic GC
- detects cycles unreachable from the program and deletes them
- very simple. Looks for all cycles and decrements the internal ref count of the cycles
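A tiny illustration (my own, not from the talk) of a cycle that reference counting alone can never free, but that the cyclic collector cleans up:

import gc

class Node(object):
    pass

a, b = Node(), Node()
a.partner, b.partner = b, a   # a and b reference each other: a cycle

del a, b          # their ref counts never reach zero on their own
gc.collect()      # CPython's cyclic collector finds and frees the unreachable cycle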
PyPy Basics
- Interpreter written in RPython
- RPython is translated to a low-level language (C)
PyPy has pluggable GC
- GC is simply another low-level transform during translation
- GC algorithm itself is written in RPython
- GC implementation can be selected at translation time
- Current default GC: "minimark"
Mark and Sweep
- Mark: Start with known live objects, recursively traverse objects, mark them as reachable
- Sweep: Walk all allocated objects, deleting the ones that are not marked
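In pseudocode (my sketch; references_of and deallocate are hypothetical helpers), mark and sweep looks roughly like this:

def mark(roots):
    reachable = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if id(obj) not in reachable:
            reachable.add(id(obj))
            stack.extend(references_of(obj))   # hypothetical: objects obj points to
    return reachable

def sweep(all_objects, reachable):
    for obj in all_objects:
        if id(obj) not in reachable:
            deallocate(obj)                    # hypothetical: free the object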
Fancy PyPy GC Optimizations
- Use observations of usage patterns to optimize
average = (a + b + c) / 3
[ x for x in seq ]
my_object.method() # has to bind
data.replace('&', '&amp;')
- e.g. "High Infant Mortality", lots of young objects that do not last long
How to cope?
"The nursery"
- Store newly allocated objects in a "nursery"
- GC "nursery" often and move surviving objects out
- GC old objects less frequently
GC Pauses
- When the GC is running, the program is not
- Inconvenient for many long running programs like servers
- A deal-breaker for real-time applications like video processing
- Of course no one tried to do this before PyPy made it possible
PyPy incremental GC
- In PyPy 2.2
- Major collection split into multiple passes, each lasting only a few milliseconds
- "All in all it was relatively painless work" according to their blog
- Means that you can actually use this in real-time processing applications
Summary of GC in PyPy
- Pluggable
- Generational
- Incremental
- Integrated with the JIT
GC semantics subtleties
__del__
(i.e. finalizers) kills kittens
Cycles with finalizers... which finalizer do you actually run first?
What to do about it?
- CPython < 3.4: give up; finalizers that have cycles get moved to a specific list (gc.garbage) or something
- CPython >= 3.4: PEP 442
- PyPy: sort finalizers into a "reasonable" order and run them
  - do a topological sort on the object graph and run the ones with the fewest dependencies
  - in the end you just have to choose one, though
PEP 442
- Run finalizers on unreachable cycles (arbitrary order). Resurrect any cycles that become reachable again
- Break references in remaining cycles
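A small sketch (mine, not the speaker's) of the problem case: a cycle in which both objects define __del__.

import gc

class Noisy(object):
    def __del__(self):
        print("finalizing")

a, b = Noisy(), Noisy()
a.other, b.other = b, a   # a cycle whose members both have finalizers
del a, b
gc.collect()
# CPython < 3.4 punted on this (the cycle ended up uncollectable, in gc.garbage);
# with PEP 442 (3.4+) both finalizers run, in an arbitrary order, and the cycle
# is then freed.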
Summary
- GC is hard.
- (PyPy's) GCs are awesome!
Django: The Good Parts
This was very, very editorialized.
"I know that you're all eager to get to lunch. This won't take long, because there aren't that many good parts."
Release manager for Django.
History... going back to July 2005... 9 years ago.
1.7 beta 1. 161k lines of Python.
"Full stack framework"
"Now the 800 lb gorilla"
"Aren't other frameworks smaller/lighter and thus better?"
The good parts
- HTTP abstraction + request/response cycle
- Conservative (yes!) approach to features
- The secret weapon: apps
HTTP sucks
- H(ate)TTP
- Nine different request methods
- Unicode-unaware
- Persistence, auth, streaming, security are all tacked on
CGI isn't much better
- Write a script
- Web server will set up the environment
standard input/standard output
Actual CGI
- Web server might not invoke your script
- chmod 777 ALL the things
- Environment is full of lies and sadness
- fork()-bomb the server
Drink. A lot.
WSGI is CGI in Python (according to the speaker)
- CGI-style environ: have to pass it around, parse it, re-parse it, etc., etc.
- Was never that good of a programming environment anyway
- Signaling up and down the stack requires inventing ad-hoc protocols
- Return/response API sucks
- Inherits HTTP's approach to character encoding (this is not a good thing)
Can we do better?
- HttpRequest class
- Already parsed and normalized for you
- Dict-based attribute access
- sane character encoding
- HttpResponse
- easy to construct and manipulate
- Convenient subclasses for common specialized responses (404s, 304s, etc.)
Do it
- Callable objects that take an HttpRequest and return an HttpResponse
- regex parsing for URLs
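A sketch of what that looks like in practice (the view and parameter names are invented): a plain callable gets a parsed HttpRequest and returns an HttpResponse, and a regex in urls.py maps a path to it.

from django.http import HttpResponse, HttpResponseNotFound

def hello(request):
    # the query string is already parsed into a dict-like object
    name = request.GET.get("name", "world")
    if name == "nobody":
        return HttpResponseNotFound("no such person")
    return HttpResponse("Hello, %s!" % name)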
But wait, there's more
Life cycle
- Middleware system
- Signals! A way for Django to listen and respond to events
- Decorators! Can decorate any callable
"We read PEP 333 so you don't have to"
- With a Django project you have a WSGI application
"Sane HTTP/WSGI is worth importing some stuff"
Conservatism
Not "minimal", but conservative.
The truncatechars filter: that only took four years.
Migrations (coming in 1.7). That only took six years.
Things get into Django very slowly and some things are even coming out.
Unbundling/removing:
- contrib.comments
- contrib.databrowse
- contrib.localflavor
- contrib.markup ...
Things are difficult to add and easy to remove.
Note deprecation warnings... Django is not shy about this.
This is usually described as a bad thing. People who have great new things, have a lot of trouble getting them in. But... the speaker disagrees.
Conservatism
- Strong preference for community solutions
- E.g. there have been many ideas about migrations and the best one won
- Core framework more about providing API support
- More "swappable" components than expected
- E.g. at MDN we don't use the Django template system, we use Jinja2
- More stability over time
- People come back after a few years and find that it's still the same thing
- Features land when ready
- Competition often produces better solutions
- E.g. "South" for migrations again
- E.g. SoC work with Django/setuptools was great, but nothing came of it.
Enc-app-sulation
A Django project is a WSGI application. What is a Django application? Encapsulated, pluggable functionality. One doesn't want black boxes and monolithic-ness.
Django app can be a whole lot of small things.
Django gives you safe APIs that you can assume will be present. This makes integration of lots of components a lot easier.
Philosophy of Unix Tools: Do one thing and do it well
E.g. Personal site. 12 apps. MDN: about 50 apps (about 12 are specific to MDN).
Django Packages lists 2206 available apps. "There's an app for that" (booooo to gratuitous use of cliché).
People complain about there being a full stack, but you really don't have to accept all of them.
Sane Schema Management with Alembic and SQLAlchemy
http://bit.ly/1iGL...
Work for Socorro
Thanks to Mike Bayer
http://github.com/mozilla/socorro http://crash-stats.mozilla.com
Use "alembic" for all migrations.
We use PostgreSQL
Database systems resist change
Evolution of schema change process
- edit stored procedures on prod!
- SQL files + bash
- Revision control
- Code review
- Generated SQL using schema comparison
- Rollback plan. Test on generated data. Test on prod-like data. Test in a complete stage environment.
What's sane schema management?
Executing schema change in a controlled repeatable way while working on...
Migration tools are really configuration management tools; once that clicks, it can make the conversation easier.
Migrations are for:
- communicating change
- communicating process
- controls
Assumptions?
- Schema migrations are frequent
- Automated schema migration is a goal
- Stage environment is enough like prod for testing
- Writing a small amount of code is OK.
No tool is perfect. DBAs should drive migration tool choice. Choose a tool that DBAs and developers like, or at least don't hate.
Organization:
- dbaproblems
- Picking a migration tool
- Using alembic
- Lessons learned
- What alembic could learn
And to drop and replace partition constraints that lack TIMESTAMP WITH TIME ZONE
Migrations are hard. And messy. And necessary.
Changing a CHECK constraint on 1000+ partitions.
This sucked a lot. But it wasn't the first time (2012 bugs). A change snuck into the partition UDF from January to April 2013. No useful audit trail. Some partitions affected. The original error went back to 2010.
A big messy process from dev to staging to prod; much easier with Alembic.
Used Alembic to manage the change. Tested in stage. Experimentation revealed which partitions could be modified without...
Process
- make changes to model.py or raw_sql files
- Run:
alembic revision --autogenerate
- edit revision file
- commit changes
- have Jenkins run downgrade/upgrade as part of the test suite.
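For flavor, a sketch of what an edited revision file might look like (the revision ids, table, and column names here are invented, not Socorro's):

"""add a build_id column"""
from alembic import op
import sqlalchemy as sa

# revision identifiers, used by Alembic.
revision = 'abc2134'
down_revision = '9f1b2c3d4e5f'

def upgrade():
    op.add_column('crash_reports', sa.Column('build_id', sa.Text()))

def downgrade():
    op.drop_column('crash_reports', 'build_id')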
Problems solved
- easy to deploy migrations including UDFs for dev and stage
- enables database change discipline
- enables code review discipline
- revisions are decoupled from releases
- 100k LOC removed
- no more post-deploy schema checkins
- enabling a tested, automated stage deployment
- separated schema definition from version specific config
Picking a data migration tool
Choose your own adventure
Questions:
- How often does your schema change? (if never, may not be worth your time)
- Can migrations be run w/o you?
- Can developers create a new schema w/o your help?
- How hard is it to get from an old schema to a new one using the tool?
- Are change rollbacks a standard use of the tool?
DBAs are not usually very excited about this, but: use an ORM with the migration tool.
We had three different ways of defining the schema in our code and tests, and they were all used. Having a central source of truth would be very useful for a system... then you end up with reusable components and ...
Fits with existing tooling and developer workflows; enables partnership with developers; integrates with testing frameworks (big win!)
Using Alembic
http://tinyurl.com/po4mal6
Vocab is awkward
Defining a schema?
vi env.py
import myproj.model
Super nice
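That import is what lets autogenerate compare your model against the live database. A sketch of the relevant env.py lines (assuming myproj.model holds a SQLAlchemy declarative Base, as in the note above):

# in alembic's env.py
import myproj.model

# point autogenerate at the application's schema definition
target_metadata = myproj.model.Base.metadata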
Helper functions?
put your helper functions in a custom library and add this to env.py
Ignore certain schemas or partitions?
Manage User Defined Functions?
Stamping database revision?
command called stamp
Offline mode!
Change your model, create your revision file, then upgrade: alembic upgrade abc2134 --sql
and it will emit raw SQL. Works great.
Lessons Learned
- Always roll forward
- Put migrations in a separate commit from schema changes
- Revert commits for schema change, leave migration commit in-place for downgrade support.
- Store schema objects in the smallest reasonable, composable units
- ORM for core schema
- types, UDFs, and views in separate files
- Write tests. Run them every time.
- Write a simple tool to create a new schema from scratch
- ... generate fake data
- write tests for these tools
Would be helpful
- Understand partitions
- Never apply a DEFAULT to a new column
- Help us manage UDFs better
- INDEX CONCURRENTLY
- Pretty syntax for multi-commit sequences
- Support multiple branches in history (currently your history just needs to be in a single order)
Epilogue
- No tool is perfect
- DBAs should drive migration tool choice.
- Choose a tool that your developers like. Or, at least don't hate.
See Also
- Sqitch
- Erwin
- South (Django specific)
Questions
Q: Migrating between migration systems? A: Speaker recommended "Just ignore it. Don't move backwards," though she was a little hesitant about that response.
Remember, migrations are only sequential.
Q: You also run migrations on your continuous integration server, and that takes a long time; how do you do that? A: I use every possible trick in the book to take out as few locks as possible and make it run faster. And I teach the developers how to do that.
Multi-factor Authentication: Possession Factors
Ying Li works at Rackspace
Talk about Alice and Bob (a gamer)
World of Bobcraft; stealing accounts became profitable
Alice's stuff is stolen.
Bob spends all of his time restoring accounts.
Bob adds two passwords. Account thefts continue.
Multifactor
- something you know
- something you have
- something you are
Two of the same things do not count
A key logger can log two passwords as well as it can steal one.
Important to know what an attacker will do, rather than just throwing up defenses willy nilly.
E.g. Man in the middle or phishing, keylogger, replay.
Possession factors are vulnerable to theft and man-in-the-middle
Biometrics can be stolen; the credentials don't change and can never be revoked.
password and token
Vulnerability is the intersection of the methods of attack against each factor.
Just because two factors are used, doesn't mean the intersection is as small as it should be
Rooting the phone and man-in-the-middle are both attacks against password + token.
Most common tokens are OATH (time-based, expiring one-time passwords)
TOTP: time-based one-time password
TOTP(K, T), where T = (current Unix time - T0) / time step
pip install cryptography, otp...
SMS-based passwords. This works even if Alice doesn't have a smartphone.
pip install twilio
A YubiKey is a token...
yubicon-...
add 2FA validate 2FA remove 2FA
TOTP
Bob generates a shared secret and communicates it to Alice, with label=issuer..
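A sketch of that enrollment using the cryptography package mentioned above (recent versions of the library; the account and issuer labels here are made up):

import os, time
from cryptography.hazmat.primitives.hashes import SHA1
from cryptography.hazmat.primitives.twofactor.totp import TOTP

key = os.urandom(20)                        # Bob's shared secret for Alice
totp = TOTP(key, 6, SHA1(), 30)             # 6 digits, 30-second time step

# what Bob gives Alice to load into her authenticator app
uri = totp.get_provisioning_uri("alice@example.com", "Bobcraft")

token = totp.generate(int(time.time()))     # what Alice's app shows
totp.verify(token, int(time.time()))        # raises InvalidToken on mismatch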
SMS factor: Bob uses Twilio to generate and send the message
Yubikey
every time the button is pressed, the YubiKey sends yubikey_id + a randomly generated OTP to Bob
validate 2FA on login
Check password, ask for 2nd password
remove 2FA from account
just delete it from the collection (and send Alice a login)
packages for django
- django-otp
- django-otp-twilio
- django-otp-yubikey
- django-two-factor-auth
Just add the apps to settings.py and your Twilio account information
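A sketch of that wiring (the app and middleware paths are the packages' documented names as I recall them; Twilio credential settings are left as a comment, check the otp_twilio docs):

INSTALLED_APPS = (
    # ...
    'django_otp',
    'django_otp.plugins.otp_totp',
    'otp_twilio',
    'otp_yubikey',
)

MIDDLEWARE_CLASSES = (
    # ...
    'django_otp.middleware.OTPMiddleware',
)

# plus your Twilio account credentials, per the otp_twilio settings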
Happily Ever After?
Login is protected... but... not everything requires authentication. Session keys... are protected by TLS and the TLS certificate.
goto fail: but you must respond to security patches
API keys should be:
- revokable and auditable
- ...
- ...
Password/factor changes require full credentials, and the user should be warned
Recovery
Just log in again. Should be auditable, rare, difficult
Standard methods: backup devices, email reset
Alice:
- 2FA
- backup factors
- passwords
- protect email and phone
Bob:
- 2FA
- account reset
- TLS
- limit privilege
- system updates
Fast Python, Slow Python
- Performance
- Python
- and the intersection of the two
What is performance and why?
How fast something runs.
Lies, damned lies, and benchmarks. Benchmarks are nonsense. Performance is about specialization. So... benchmarks need to look like the system that they're trying to replicate.
Systems performance, micro/macro benchmarks. We don't often talk about them the way we do unit/integration testing.
What are the differences between micro and macro benchmarks?
Java benchmarks that beat Python are often micro benchmarks with small tight loops. E.g. a Java benchmark was slower than Ruby on Rails because it was reparsing XML over and over again.
Python. What is it? Usually CPython. But also a lot of other things: Cython, C, Numba, RPython (PyPy).
"Everyone knows Python is slow and dynamic languages are slow". Wrong. Optimizing dynamic typing is completely different from optimizing static typing, and it's just not true that it's not possible to not optimiize it.
You can monkey patch anything. But we can make assumptions about what your code is doing and then make cheap checks that those assumptions are true.
Slow vs. harder to optimize: slow is what happens when you run it. "Harder to optimize" is an existential problem...
PyPy: an implementation of Python that usually runs your Python code faster. It's about specialization: it takes your code and tries to adapt to what's going on in your program. PyPy observes and specializes things dynamically.
Let's make a deal! You choose good algorithms and I'll make them run fast
C: it's essentially a DSL for assembly. Bare metal, and it's going to be fast.
Compare a C struct to a Python class with three attributes x, y, z.
"I don't care that it looks hard to optimize, but we've done it. We've known since 1988 how to optimize this. I'll hold up my end of the deal if you hold up yours."
Dicts. An object is a fixed set of behaviors, while a dict is an arbitrary mapping. You could add arbitrary properties to an object, but people don't do that because it makes your code hard to maintain. Dictionaries are not specialized: one dict type has to represent any shape imaginable, and it gets used for many different purposes. We would like our Python to be able to take advantage of specialization.
The C++ equivalent of a dict-as-object? No one would write that code as an unordered_map: it's really slow and looks ridiculous. Why do we accept it in Python? Because for the last 20 years no one noticed the performance difference, so we accepted dicts as being more lightweight.
if [hex, bytes, bytes_le, fields, int].count(None) != 4:
    raise TypeError()
Not specialized
if (
    (hex is None) + (bytes is None) + ... != 4
):
Specialized.
PyPy figured it out and made it much faster (example compliments of the Python standard library).
Python makes everything easier
Unfortunately, that makes it so sometimes it's easier to use a general tool than a specialized tool, and that ends up being a problem.
Compared reading from a file in C and Python.
Python code is much more precise and elegant, but it's quite a bit slower because it allocates and copies on every line.
So, they wrote the zero_buffer package.

from zero_buffer import Buffer

b = Buffer.allocate(8192)
with open(path, "rb") as f:
f.
More myths
- function calls are really expensive
- using only builtin data types will make your code fast
- don't write Python in the style of Java or C
May be true, but only if you are only willing to run on CPython
One Python
The conventions of fast Python code vs. the conventions of slow Python code: we must be careful about how we write it. We take pride in how fast we can write and ship code, and that's true because we don't have to worry about a lot of details.
The deal: you can be casual if performance doesn't matter. But if you're going to write a library, care, because otherwise having a slow library will lead people to believe that Python is slow.
I hate heuristics. I have to look for patterns, and I have to guess about what your code is supposed to do. But that makes it inconsistent.
Let's take a lesson from Java. Java developers prance around not caring about performance. But when they want performance, they hunker down and write it in Java. I want us to write high performance code in Python.
https://speakerdeck.com/alex/
Questions
Q: Is a namedtuple also as fast as an object? A: Yes.
Q: Common thread: allocations and copies make things slow. But what about closures and functional languages? Is there a conflict between the immutable-data style of programming and allocation minimization? A: There's a difference between allocations and copies that are part of the algorithm and those that "just happen". The ones that "just happen" are the ones the tools should handle for us. But there are examples where we write sloppy code and it causes them to happen.
Q: You highlighted places where our performance intuition is outdated or outright wrong. Any advice so that people don't just do things because "I read on a blog post that we should do that"? A: Tools. PyPy's JITViewer... line_profiler on PyPI, and something else for CPython.
Q: In the Haskell community, many library authors annotate functions with space/time complexity to cue the user. Is that analysis useful? A: "Absolutely." Of course, before any of these optimizations, optimizing your algorithm is the first thing to do. I would love for that to take off within the Python community. We could use some more specialized data structures: skip-lists, trees, heaps, etc.
Q: Confused about the performance of dicts when an object has __dict__. A: Why would that be more performant? "Why do you believe everything you see?" It's unfortunate that the way things look is sometimes not how they are. But I'm not sure how to reconcile 20 years of Python code.