PyCon Day 3: Afternoon Session
Cutting Off the Internet: Testing Applications that Use Requests
with Ian Cordasco
Developer at Rackspace and core contributor to Requests
I’ve collaborated with a lot of you, but no one really knows what I look like. I look like this (grumpy face): the guy who says “No” a lot. “I like that feature, but it doesn’t really belong in this library.”
I’ve been using Requests since long before I became a maintainer. I’ve seen lots of testing code that doesn’t test the Requests part at all.
Review
A Unit of code
- 1 functional module or a method of a class
- something that isn’t very complex
- something that doesn’t hold all of the logic of your application
A unit test should describe exactly one path through that unit of code. A unit of code will have many unit tests.
Collaborators
- When you want tests, you want isolation.
- When you test class A, you should not be testing any collaboration with class B.
For example:
def bas(self, *args):
    a = self.foo.fiz()  # We should mock this out
Integration Test
If you have a 3rd-party dependency, you can test the interaction with it.
Example
def get_resource(session, resource, params=None, headers=None):
I wanted to take advantage of sessions. (Shows the function body.)
Too simple to test? It’s 5 lines of code, but it’s important; we need it to work. So, how do we do that?
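The body went by quickly on the slide; a minimal sketch of what such a function could look like (my reconstruction, not the speaker’s exact code; the URL building is an assumption):

def get_resource(session, resource, params=None, headers=None):
    # Build the GitHub API URL and let the session make the request.
    url = 'https://api.github.com/' + resource
    response = session.get(url, params=params, headers=headers)
    if not response.ok:
        response.raise_for_status()
    return response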
Do it live! (but don’t really)
We’ll just run the test against the API directly. GitHub is a big company, they can handle it.
assert response.ok
or test that you get back what you expect. This works once. But… we do this all the time.
Now do that 400 times. But… it took like 15 minutes to run. How many people think that your tests should take longer than 1 minute? No one.
Rate limits! It used to be that on GitHub, without authenticating, you could make 5,000 requests per hour without getting rate limited. But shortly after my tests were working, they lowered that to 60 per hour. Don’t test live; it’s bad for too many reasons.
Write a Flask app to test against
Don’t do this. It’s too hard to copy the whole service you’re testing against, and if that service changes, you have to keep your copy in sync.
Stub out part of requests
3 libraries do this well and the 4th is questionable:
- responses (not sure if it’s maintained)
- httpretty
- requests-mock
- mock (I’m worried about when people use this)
session = mock.Mock()
get_resource(session, 'user')
...
session.get.assert_called_once_with('api.github.com/users', headers=None, params=None)
That looked like a unit test. Nothing is talking to the network.
What happens if the service returns a 404?
sess = mock.Mock()
resp = mock.Mock()
resp.ok = False
...
resp.raise_for_status()
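Filled out, a test along those lines might look like this (my reconstruction; the exact assertions were not on the slide):

import mock
import pytest
import requests

sess = mock.Mock()
resp = mock.Mock()
resp.ok = False
# Mimic what Requests does when raise_for_status() is called on a 404.
resp.raise_for_status.side_effect = requests.exceptions.HTTPError
sess.get.return_value = resp

with pytest.raises(requests.exceptions.HTTPError):
    get_resource(sess, 'user')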
Then if you go on a bike camping trip like me and my friend do, then you can hack on this when you’re in your tent at night.
Mock: verbose
People don’t use Mock because it can be too verbose. But… get_resource() isn’t that complicated.
What if you are testing for content and other things?
Pretend to be the NSA
Record the traffic and then play it back during your tests. Two libraries do this well, both inspired by Ruby’s VCR:
- vcr.py
- betamax
vcr.py
(Shows example)
All the data is recorded in ‘vcrexample.yml’ and loaded as a “cassette”. The next time you run the test, it will be near instantaneous.
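A minimal vcr.py sketch of that workflow (my example; the test name and URL are assumptions):

import requests
import vcr

@vcr.use_cassette('vcrexample.yml')
def test_get_user():
    # The first run records the HTTP interaction into the cassette;
    # later runs replay it without touching the network.
    response = requests.get('https://api.github.com/users/sigmavirus24')
    assert response.ok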
betamax
Similar, but different benefits
betamax forces you to use the session… and sessions give you a lot of benefits, more than I could tell you in a separate talk.
cassette_library_dir
… a whole directory of cassettes, without having to explicitly name the data file.
Tests should be concise and very, very clear. Creating a recorder object, and a session inside it, isn’t always clear or all in one place.
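A minimal Betamax setup looks roughly like this (a sketch with my own names, not the speaker’s slide):

import requests
from betamax import Betamax

with Betamax.configure() as config:
    config.cassette_library_dir = 'tests/cassettes'

session = requests.Session()

with Betamax(session).use_cassette('github-user'):
    # Recorded on the first run, replayed from the cassette afterwards.
    response = session.get('https://api.github.com/users/sigmavirus24')
    assert response.ok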
If we’re using py.test, we can use its fixtures to do this.
What if the service changes? Instead of editing the code, you just delete the cassettes, run the tests again, and they re-record and go. Sounds awesome.
The reality
You probably want to use both unit tests and integration tests. Some things make more sense in one and not the other. So, testing with Requests is a two-part solution. This combination has really helped me increase the code quality of this library.
Mock et al. are great for regression testing: if there’s some weird corner case that’s crashing your application, mock out the request, put in the bad data, and then you have a test for it.
Questions
If you want to test for intermediate socket errors, you can raise exceptions as a side effect of the mocked call inside get_resource().
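For example (my illustration, not from the talk), mock’s side_effect can simulate a dropped connection:

import mock
import pytest
import requests

sess = mock.Mock()
# The mocked session raises instead of returning a response.
sess.get.side_effect = requests.exceptions.ConnectionError

with pytest.raises(requests.exceptions.ConnectionError):
    get_resource(sess, 'user')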
Sessions give you a lot more control over your request. The Session will also store cookies for you across requests. Connection pooling makes it faster. You can set which SSL protocol you use. There are tons of things that you can do with a Session that you can’t do with the functional API. With the functional API, every time you call requests.get(), we start a new session and tear it down afterwards.
To ORM or not to ORM
with Christine Spang
Cofounder, Nylas
In this talk I’ll build your intuition about whether it’s good to use them or not.
We built tech for the Linux kernel which allows you to update the kernel without restarting it.
Nylas is building a platform that integrates e-mail, calendar, etc.
Nylas sync engine is 25k lines of Python developed over 2 years.
What is an ORM?
Two-way sync between an object in a database and an object in memory. This is a hard problem to solve, and systems that solve it generically are necessarily complex.
Interpreter function
I’ll use examples from SQLAlchemy from the Nylas sync engine.
ORMs centralize the data model and then generate SQL to send to the database.
Why should you use one? (Or not)
Let’s start with a story from Ksplice. I inherited complex billing code that was built using SQLite. By the time I arrived it was running against a hosted MySQL server, which was really slow. Raw SQL was scattered all through the application. The data model was not clear, which made it very hard to reason about a good strategy for managing the data.
So instead of having proper data structures, we just used dicts and tuples and lists which worked fine against a local sqlite instance, but when we deployed, the performance penalties killed us and it all fell apart.
In this example, all of the code didn’t fit on one screen. The data models were shared across applications. There was no single source for describing what the data model was.
Advantages of an ORM
There is a clear, centralized place where the data model is defined.
Create abstractions to manage complexity.
One way to manage complexity is to map multiple backend columns into a single object property with the @property decorator.
This helps us to apply powerful abstractions like Polymorphism in the data model. (Shows an example of how they used polymorphism to have separate implementations for different mail accounts).
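As a rough illustration of the @property idea, in SQLAlchemy style (my own minimal sketch, not Nylas code):

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Contact(Base):
    __tablename__ = 'contact'

    id = Column(Integer, primary_key=True)
    first_name = Column(String)
    last_name = Column(String)

    @property
    def full_name(self):
        # Two backend columns exposed as a single object property.
        return '{} {}'.format(self.first_name, self.last_name)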
Abstracting many-to-many relationships
ORMs also allow use of the with statement to do cleanup of the session.
ORMs provide a mechanism for batch flushing changes for performance.
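A hedged SQLAlchemy-style sketch of both ideas, session cleanup via with and batched flushing at commit time (my example, reusing the Contact model sketched above):

from contextlib import contextmanager
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///app.db')
Base.metadata.create_all(engine)  # assumes the Contact/Base sketch above
Session = sessionmaker(bind=engine)

@contextmanager
def session_scope():
    # Changes accumulate in the session and are flushed in a batch at commit.
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

with session_scope() as session:
    session.add(Contact(first_name='Ada', last_name='Lovelace'))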
Stop!
ORMs are not appropriate in all situations. When should you not use them?
- Overhead (not a ton, but it’s some)
- They don’t really abstract the database
- Shows a 200 line patch of getting their sync engine to run on PostgreSQL
- If you want to be truly db agnostic
- You have to limit yourself to the lowest common denominator of database features
- You have to write custom code to handle specific data types and performance optimizations
- ORMs hide queries
- It’s less clear what SQL statements are being run
- Unexpectedly bad performance
You can deal with this, but it involves work on your part and learning to customize.
Use echo and watch the SQL… it will help your intuition. Once you’ve done that, you may be able to convince your ORM to run different queries.
To do so is very complex. Remember that two-way sync between the application and your database is a hard problem.
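In SQLAlchemy that’s a single flag on the engine (a minimal sketch):

from sqlalchemy import create_engine

# echo=True logs every SQL statement the ORM emits; great for tuning your intuition.
engine = create_engine('sqlite:///app.db', echo=True)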
What about building your own?
I’ve never done it. But if you’ve already built your own with a team of engineers…
When not to use it
- In simple scripts
- one off tasks
- don’t need to maintain
- i.e. don’t need to manage complexity
- In migrations
- These are places where you need to be very specific about the SQL that is run.
- These are a different context than an application
- You just don’t need it
- Trying to do so will make your life more difficult
- Performance can be a big deal here.
- For the most part, don’t do it.
How to decide which to use?
- For Django, use the built-in Django ORM
- For everything else, use SQLAlchemy
- Complaints against it cite that it is too “magicky”
- But, it’s very reliable
- The magic works
Conclusions
- ORMs solve a hard problem
- Abstractions manage data model complexity
- Learn to use SQL echo! Tune your intuition
https://github.com/nylas/sync-engine
Brasseurs 6pm
Describing Descriptors
with Laura Rupprecht
Outline?
- What is a descriptor
- Custom Descriptor Example
- Kinds of descriptors
- Attribute lookup order
- @property/@classmethod
- Usage in ORM
- Problems
What is it?
A certain type of attribute:
class Foo():
    x = SomeDescriptor(some_args)
The descriptor has __get__, __set__, and __delete__. The interesting thing about __get__ is that the instance is passed in; we’ll see an example. The owner is also passed in, which is the class the descriptor belongs to. __set__ and __delete__, by contrast, only care about the instance (and, for __set__, the value).
Example
- Dealing with JSON YouTube API responses
- Dictionary access is everywhere
video['snippet']['thumbnails']['maxres']['url']
No autocomplete in the IDE.
- Hat tip to Jonathan: “Use Descriptors”
“Uh… what’s a descriptor?” It turns out that the official documentation kind of sucks and doesn’t really describe it if you’re coming in without any prior knowledge. It talks about “binding behaviour”, à la __get__, __set__, __delete__.
How can we make the massive JSON object prettier?
1st try: put it into an object and set all of the attributes in the init function. But, I have to add everything explicitly.
Use dict_digger?
Also, if the JSON response is updated, it doesn’t go out of sync with the object that we just built up.
class ResponseDescriptor():
    def __get__(self, instance, objtype):
        return dict_digger.dig(instance.json_response, *self.path)
This way, we’re abstracting away the behaviour of how to retrieve the data. Totally not as repetitive as before. Then just add some error checking and raise an AttributeError if dict_digger can’t find anything.
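Filled out with the pieces the slide elided, the pattern might look like this (my reconstruction; the constructor, the error check, and the Video class are assumptions, and I’m assuming dict_digger.dig() returns None for missing keys):

import dict_digger

class ResponseDescriptor(object):
    def __init__(self, *path):
        self.path = path  # the chain of dictionary keys to dig through

    def __get__(self, instance, objtype):
        value = dict_digger.dig(instance.json_response, *self.path)
        if value is None:
            raise AttributeError('nothing found at {!r}'.format(self.path))
        return value

class Video(object):
    thumbnail_url = ResponseDescriptor('snippet', 'thumbnails', 'maxres', 'url')

    def __init__(self, json_response):
        self.json_response = json_response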
But… there are two types of descriptors
- Data descriptor
  - __get__, __set__, __delete__
- Non-data descriptor
  - Just a __get__
Attribute lookup
1. Check the class dict for a data descriptor: type(foo).__dict__['x']
   - For this one you are explicitly looking for something that has __set__ and/or __delete__
2. Check the instance dict: foo.__dict__['x']
3. Check the class dict for a non-data descriptor or attribute: type(foo).__dict__['x']
   - In this step you are searching for something that may only have a __get__
   - Method calls are usually found in this 3rd step
   - But you can override the call on the instance, if you put something into the instance dictionary
4. Throw an AttributeError
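A tiny demo of that ordering (mine, not from the talk): a data descriptor wins over the instance dict, while an instance attribute shadows a non-data descriptor.

class DataDesc(object):
    def __get__(self, instance, owner):
        return 'data descriptor'
    def __set__(self, instance, value):
        pass  # defining __set__ is what makes this a data descriptor

class NonDataDesc(object):
    def __get__(self, instance, owner):
        return 'non-data descriptor'

class Foo(object):
    x = DataDesc()
    y = NonDataDesc()

foo = Foo()
foo.__dict__['x'] = 'instance value'
foo.__dict__['y'] = 'instance value'
print(foo.x)  # 'data descriptor': step 1 beats the instance dict
print(foo.y)  # 'instance value': the instance dict beats a non-data descriptor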
Why is it this way? To have function attribute access fit in a little better with attribute access in general.
@property / @classmethod
People often don’t know how they work and it’s magic and it’s great. They’re a good way to get descriptor-like behaviour without having to worry about the complexity of the method resolution order (MRO).
@property lets you treat something like an attribute. But remember that @property is always a data descriptor and as such will take precedence in the attribute lookup.
However, @classmethod can be called on an instance or the class itself.
Usage in ORMs
In the Django ORM, all of the class attributes are descriptors. You’re defining properties in the class.
Why use them?
- To check data has the proper format (converting units)
- To automatically perform verification of a field
- To make it easier to manipulate existing data fields
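A hedged sketch of a field-verifying descriptor (my example, not from the talk):

class PositiveInteger(object):
    """A data descriptor that verifies a field on assignment."""

    def __init__(self, name):
        self.name = name

    def __get__(self, instance, owner):
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        if not isinstance(value, int) or value <= 0:
            raise ValueError('{} must be a positive integer'.format(self.name))
        # Store the validated value in the instance dictionary.
        instance.__dict__[self.name] = value

class Playlist(object):
    video_count = PositiveInteger('video_count')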
Why not?
- Because it looks awesome
- Job security
To think about
Throw the appropriate exceptions:
- ValueError
- ValidationError
- AttributeError
- NotImplementedError
Common Pitfalls
- Confusing class and instance variables
- Infinite recursion
  - Don’t use your __get__ function to do a getattr
- Store stuff in the instance dictionary (.__dict__)
  - Make sure to define __set__ in this case
Resources
- Demystifying Descriptors in 5 Minutes, talk by Chris Beaumont
- David Beazley Python 3 MetaProgramming tutorial
- Python Descriptors, a talk by Simeon Franklin
- Raymond Hettinger’s Descriptor HowTo Guide
- Read this first, read some other stuff, come back to it and allow it to make sense.
- Guido van Rossum’s The Inside Story on New-Style Classes
- Why methods work the way they do
Oh, Come On. Who Needs Bytearrays?
with Brandon Rhodes
In Python, normal string objects are immutable.
Python 2 → Python 3:
- str (a.k.a. bytes) → bytes
- unicode → str
The b prefix is optional in Python 2, but mandatory in Python 3.
Immutability
Immutable objects: any time you call a method on them, it returns a new object, and the original is unchanged.
- Advantages
  - Simple
  - Functional
- Disadvantages
  - Allocation
  - Copying
  - Every time you want to make a tweak to the string, it gets copied.
Therefore, Python 3 introduced a builtin bytearray:
- A mutable string
- Based on Python 3 bytes
Python 3 is designed so that bytes are awkward to use as text: you will want to run decode() before treating them as characters. So, Python programmers are prepared for good internationalization.
Python 2/3, you can slice a string and bytearray
Python 3 owes me 18000 characters per year! Aka, “the parens tax”.
The elements of Python 3 bytes are actually integers, and you can’t necessarily pull them apart per element and put them back together. So bytes objects are not really a natural string.
bytearray is a mutable version of Python’s most underpowered string type.
Potential applications?
The bit vector
I wrote my first Bloom filter. Given a dictionary of words that you want to look up, it knocks out a large class of words before you go to an expensive lookup.
(Shows example)
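A minimal sketch of the bit vector underneath such a filter, backed by a bytearray (my illustration, not the speaker’s code):

class BitVector(object):
    def __init__(self, num_bits):
        # Eight bits per byte, rounded up.
        self.data = bytearray((num_bits + 7) // 8)

    def set(self, n):
        self.data[n // 8] |= 1 << (n % 8)

    def get(self, n):
        return bool(self.data[n // 8] & (1 << (n % 8)))

bits = BitVector(1000)
bits.set(42)
print(bits.get(42), bits.get(43))  # True False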
bytearray is more than 7% faster than the old general-purpose array.array object (and the code runs on both natively). So… you might think that it is immediately useful.
But do you know what’s faster? A list of ints! Why? A bytearray stores raw bytes that must be translated into int objects (and back) every time you touch them. A list simply stores int objects to begin with; no translation!
So we have this new special-case container that is no faster than the regular general-purpose list of ints.
Except on PyPy, where they are all the same, and they all run much much faster.
Conclusion? Verdict: it’s space efficient. Slightly slower, but 8x less space. And the point of a bloom filter is to save space. That’s why you use a bytearray.
The reusable buffer
When you read a string in, you can’t do anything to it because it is immutable. But a bytearray can be modified and reused.
dd takes 6x longer because of its default block size of 512 bytes. cat is about the same as Python and uses a reasonable block size.
So as we look at Python I/O, we will need to keep block size in mind. The size you read determines how often you have to go to the OS, which determines the running time.
Tried readinto()
data = bytearray(blocksize)
while True:
    length = i.readinto(data)
    if not length:
        break
    o.write(data)
But this was writing the entire block regardless of the length actually read. The fix:
o.write(data[:length])
What if we didn’t want to do that expensive slicing operation, since it causes copying? If we want zero copies, there’s also memoryview:
s = bytearray(b'_________')
m = memoryview(s)
v = m[3:6]
v[0] = 65
v[1] = 66
v[2] = 67
s    # bytearray(b'___ABC___')
data = bytearray(blocksize)
view = memoryview(data)
while True:
    length = i.readinto(data)
    if not length:
        break
    o.write(view[:length])
m has no memory of its own; it’s just a sliceable object.
So memory views are often imperative to getting good performance. Timings:
- dd: 0.112 s
- cat: 0.113 s
- memoryview: 0.117 s
Creating a view object takes some time when the block size is small; memoryview is a loss for a small block size (a 20% slowdown).
I thought of something
What if we don’t always slice?
# The normal case is when the length == the blocksize
data = bytearray(blocksize)
view = memoryview(data)
while True:
    length = i.readinto(data)
    if not length:
        break
    elif length == blocksize:
        o.write(data)
    else:
        o.write(view[:length])
Lesson: It’s hard to beat old-fashioned strings.
Verdict: Dangerous, but offers a great memory profile
The accumulator
Q: How many bytes will recv(1024) return? A: One (or more), if it feels like it.
This is completely different from file I/O. In file I/O, the OS will wait for the disk to spin up, find it, and leave you waiting until it’s ready.
So on the network, you’re almost always given the case where you get small pieces of data.
But it worked when I ran against localhost
How does recv() perform with bytearrays? (Shows a code example of a Python anti-pattern: initialize a small string and += each block onto it.)
How long does that take? Infinity time.
Instead, store the blocks in a list and join them together at the end.
recv(): 1.08 s
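A sketch of that accumulate-then-join pattern (hypothetical socket code, not the slide):

blocks = []
# sock is assumed to be an already-connected socket object.
while True:
    block = sock.recv(4096)
    if not block:
        break
    blocks.append(block)
data = b''.join(blocks)  # one big copy at the end, instead of one per block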
There is also recv_into(), but it runs into a problem: recv_into() reads into the front of the bytearray and overwrites the old data. So now I have to build a memoryview and do the slicing on every read.
data = bytearray(content_length)
view = memoryview(data)
n = content_length()
…
This takes 0.80s vs. 1.08s.
- still copies the data twice
- but replaces the .join() with data.extend(); you might expect that to cost 40 bytes of RAM bandwidth just to add to the end of your bytearray, but it doesn’t actually do that
Q: Does bytearray have an append operation that’s any good? A: Yes!
Use the +=! The one operator that we’ve been telling people for 20 years not to use.
In the accumulator, this is the real win for the bytearray, and the code is cleaner.
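A sketch of that accumulator (again with a hypothetical, already-connected socket):

data = bytearray()
while len(data) < content_length:
    block = sock.recv(4096)
    if not block:
        break
    data += block  # appends in place; over-allocation keeps this cheap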
Admit it, you’ve always wanted to use +=.
The Freestyle Mutable String
I just want a string, that I can do stuff with and modify
- There’s really not a good use for this.
- Say you have to change part of the payload before sending it. The result is curious: a “mutable string” that doesn’t gain you anything. To upper-case your bytearray, you have to make two extra copies.
It’s Awkward.
Conclusions
- Memory-efficient (but not faster)
- Help control memory fragmentation
- Great way to accumulate data
- Awkward for string operations (and underpowered)
Lessons learned with asyncio (“Look ma, I wrote a distributed hash table!”)
with Nicholas Tollervey
Freelance Python developer from the UK. This is an introduction to asyncio; it is not an exhaustive discussion, and I’ve simplified in plenty of places.
What does asyncio do?
The Python docs do not give me a practical feel for how I would use it.
It lets you write code that concurrently handles asynchronous, network-based I/O. Messages arrive and depart via the network at unpredictable times; asyncio lets you deal with such interactions simultaneously.
Distributed Hash Table
- Essentially a dictionary that is distributed
- It is decentralized
- Peer-to-peer key/value data store
Why do you want one?
- No single point of failure or control
- It scales well
- Have to concurrently handle lots of asynchronous I/O
Core Concepts
- The event loop
  - a poll/select loop
  - polling takes place once during each pass through the event loop
  - all callbacks take place one after the other
  - so the loop cannot proceed if anything blocks
Concurrency 101
- Program never waits for a reply from network calls before continuing
- Programmers define callbacks
Analogy: a washing machine… I can hang the clothes after the laundry is done.
I can squeeze the orange juice while waiting for the toast and the eggs
We don’t stand there watching the toast.
Questions
- How are async concurrent tasks created?
Coroutines
- Something that eventually completes
- Generators
- Suspended (with yield)
- can yield from other objects
(Shows an example of an asyncio.coroutine that handles a request.)
Coroutines represent activity that eventually completes.
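A minimal coroutine sketch in the pre-async/await style the talk used (my example, not from the slides):

import asyncio

@asyncio.coroutine
def handle_request(reader, writer):
    # Suspends here until a line arrives, letting the event loop run other work.
    data = yield from reader.readline()
    writer.write(data)  # echo it back
    yield from writer.drain()
    writer.close()

loop = asyncio.get_event_loop()
server = loop.run_until_complete(
    asyncio.start_server(handle_request, '127.0.0.1', 8888))
loop.run_forever()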
But what about callbacks?
Futures and Tasks
- A result that may not be available yet
- A TODO list of sorts
- These are first-class function calls
Represents results that may not yet be available.
A DHT Example
Hashing, distance, …
Imagine a clockface, …
Each peer keeps its own routing table, and peers exchange state information. Peers store fixed-size buckets. A local node knows more about closer nodes. get() and set() require a lookup. All interactions are asynchronous, and lookups run in parallel/concurrently.
Recursive lookup
- six degrees of separation
Ask the closest peers, they ask their closest peers, etc.
How does asyncio handle this?
A lookup is a Future
class Lookup(asyncio.Future):
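A very rough sketch of the idea (mine; the real drogulus class is more involved):

import asyncio

class Lookup(asyncio.Future):
    """A pending DHT lookup, resolved when a peer returns the value."""

    def __init__(self, key, loop=None):
        super().__init__(loop=loop)
        self.key = key

    def handle_response(self, value):
        # Called by the networking layer when an answer arrives.
        if not self.done():
            self.set_result(value)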
What about networking? How does asyncio handle different networking protocols?
Transports and Protocols
The DHT is Network agnostic, so it can work over HTTP, Netstring, anything that I choose to implement.
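For flavour, a minimal asyncio Protocol (my example): the network-agnostic DHT logic would sit behind something like this, whatever the wire format.

import asyncio

class DHTProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        # Hand the raw bytes to the DHT layer; here we just echo them back.
        self.transport.write(data)

loop = asyncio.get_event_loop()
server = loop.run_until_complete(
    loop.create_server(DHTProtocol, '0.0.0.0', 7001))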
Final thoughts:
asyncio feels a lot like Twisted.
- My code has 100% unit test coverage (sort of)
- DHT < 1k LOC (because asyncio makes it easy to think about concurrent problems)
- I/O bound vs. CPU bound
- Don’t use async I/O if you’re going to be CPU bound, because if your event loop blocks, everything will hang.
http://github.com/ntoll/drogulus