Cutting Off the Internet: Testing Applications that Use Requests

Developer at Rackspace and core contributor to Requests

with Ian Cordasco

Some of you, I’ve collaborated with a lot, but no one really knows what I look like. I look like this (grumpy face); the guy who says “No” a lot. I like that feature, but it doesn’t really belong in this library.

I’ve been using Requests since long before I became a maintainer.

I’ve seen lots of testing code that doesn’t test part of Requests at all.

Review

A Unit of code

  • 1 functional module or a method of a class
  • something that isn’t very complex
  • something that doesn’t hold all of the logic of your application

A unit test should describe exactly one path through that unit of code. A unit of code will have many unit tests.

Collaborators

  • When you write unit tests, you want isolation.
  • When you test class A, you should not be testing any collaboration with class B.

For example:

def bas(self, *args):
    a = self.foo.fiz()  # a collaborator call; we should mock this out
    return a

Integration Test

If you have a 3rd party dependency, you can test the interaction of those.

Example

def get_resource(session, resource, params=None, headers=None):

I wanted to take advantage of sessions. (Shows a function body)
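
A minimal sketch of what that body might look like; the base URL constant and the JSON return value are assumptions for illustration, not the slide's exact code:

BASE_URL = 'https://api.github.com/'  # assumed base URL

def get_resource(session, resource, params=None, headers=None):
    response = session.get(BASE_URL + resource, params=params, headers=headers)
    if not response.ok:
        response.raise_for_status()
    return response.json()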

Too simple to test? It’s only five lines of code. But it’s important; we need it to work. So how do we test it?

Do it live! (but don’t really)

We’ll just run the test against the API directly. GitHub is a big company, they can handle it.

Assert response.ok, or test that you get back what you expect. This works once. But… we do this all the time.

Now do that 400 times. But… it took like 15 minutes to run. How many people think that your tests should take longer than 1 minute? No one.

Rate limits! It used to be that on GitHub, without authenticating, you could make 5,000 requests per hour without getting rate limited. But shortly after my tests were working, they lowered that to 60 per hour. Don’t test live; it’s bad for too many reasons.

Write a Flask app to test against

Don’t do this. It’s too hard to replicate the whole service that you’re testing against, and if that service changes, you have to keep your copy in sync.

Stub out part of requests

Three libraries do this well, and the fourth is questionable:

  • responses (not sure if it’s maintained)
  • httpretty
  • requests-mock
  • mock (I’m worried about when people use this)
session = mock.Mock()
get_resource(session, 'user')
...
session.get.assert_called_once_with('api.github.com/users', headers=None, params=None)

That looked like a unit test. Nothing is talking to the network.

What happens if the service returns a 404?

sess = mock.Mock()
resp = mock.Mock()
resp.ok = False
...
resp.raise_for_status()
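
A fuller, hedged sketch of that 404 test; the test name, the use of pytest.raises, and HTTPError as the raised exception are assumptions about how get_resource handles failures:

import mock
import pytest
import requests

def test_get_resource_404():
    session = mock.Mock()
    response = mock.Mock(ok=False)
    response.raise_for_status.side_effect = requests.exceptions.HTTPError
    session.get.return_value = response

    with pytest.raises(requests.exceptions.HTTPError):
        get_resource(session, 'user')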

Then if you go on a bike camping trip like my friend and I do, you can hack on this in your tent at night.

Mock / Verbose

People don’t use Mock because it can be too verbose.

But… get_resource() isn’t that complicated.

What if you are testing for content and other things?

Pretend to be the NSA

Record the real interaction once and then play it back during your tests. Two libraries do this well, both of them inspired by the original Ruby VCR:

  • vcr.py
  • betamax

vcr.py

(Shows example)

All of the data gets recorded in ‘vcrexample.yml’ and loaded as a “cassette”. The next time you use it, it will be near-instantaneous.
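
A minimal sketch of that usage; only the cassette file name comes from the talk, and the test body and URL are assumptions:

import requests
import vcr

@vcr.use_cassette('vcrexample.yml')
def test_get_user():
    # First run hits the network and records; later runs replay the cassette.
    response = requests.get('https://api.github.com/users/octocat')
    assert response.ok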

betamax

Similar, but different benefits

betamax forces you to use a Session… and Sessions give you a lot of benefits, more than I could cover even in a separate talk.

cassette_library_dir… point it at a full directory of cassettes without having to explicitly name the data file each time.

Tests should be concise and very, very clear. Creating a recorder object, and a session inside it, in every test isn’t always clear or concise.

If we’re using py.test, we can use its fixtures to do this.
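
A hedged sketch of what such a fixture might look like; the cassette directory, fixture name, and sample test are assumptions, built on betamax's documented recorder API:

import pytest
import requests
from betamax import Betamax

with Betamax.configure() as config:
    config.cassette_library_dir = 'tests/cassettes'  # assumed location

@pytest.fixture
def betamax_session(request):
    session = requests.Session()
    recorder = Betamax(session)
    recorder.use_cassette(request.function.__name__)  # one cassette per test
    recorder.start()
    request.addfinalizer(recorder.stop)
    return session

def test_get_user(betamax_session):
    # Records on the first run, replays from the cassette afterwards.
    assert get_resource(betamax_session, 'user')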

What if the service changes? Instead of editing the code, you just delete the cassettes, run the tests again, and they will re-record and go. Sounds awesome.

The reality

You probably want to use both unit tests and integration tests. Some things make more sense in one and not the other. So testing with Requests is a two-part solution. This combination has really helped me increase the code quality of this library.

Mock et al. are great for regression testing: if there’s some weird corner case that’s crashing your application, mock out the request, put in the bad data, and then you have it.

Questions

If you want to test for intermittent socket errors, you can raise exceptions as a side_effect on the mock that get_resource() calls.
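
A sketch of that idea; ConnectionError as the chosen exception is an assumption, not the speaker's exact example:

import mock
import requests

session = mock.Mock()
# Any call to session.get() inside get_resource() will now raise.
session.get.side_effect = requests.exceptions.ConnectionError

# get_resource(session, 'user') will see the ConnectionError mid-call,
# letting you test how your code handles intermittent network failures.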

Sessions give you a lot more control over your requests. A Session will also store cookies for you across requests. Connection pooling makes it faster. You can set which SSL protocol you use. There are tons of things that you can do with a Session that you can’t do with the functional API. With the functional API, every time you call requests.get(), we start a new Session and tear it down afterwards.

To ORM or not to ORM

with Christine Spang

Cofounder, Nylas

Code is here

In this talk I’ll build your intuition about whether it’s good to use them or not.

We built tech for the Linux kernel which allows you to update the kernel without restarting it.

Nylas is building a platform that integrates e-mail, calendar, etc.

Nylas sync engine is 25k lines of Python developed over 2 years.

What is an ORM?

Two-way sync between an object in a database and an object in memory. This is a hard problem to solve, and systems that solve it generically are necessarily complex.

Interpreter function

I’ll use examples from SQLAlchemy from the Nylas sync engine.

ORMs centralize the data model and then generate SQL to send to the database.

Why should you use one? (Or not)

Let’s start with a story from Ksplice. I inherited complex billing code that was built on SQLite. By the time I arrived it was running against a hosted MySQL server, which was really slow. Raw SQL was scattered all through the application. The data model was not clear, which made it very hard to reason about a good strategy for managing the data.

So instead of having proper data structures, we just used dicts and tuples and lists, which worked fine against a local SQLite instance; but when we deployed, the performance penalties killed us and it all fell apart.

In this example, the code didn’t all fit on one screen. The data models were shared across applications, and there was no single source describing what the data model was.

Advantages of an ORM

There is a clear, centralized place where the data model is defined.

Create abstractions to manage complexity.

One way to manage complexity is to map multiple backend columns into a single object property with the @property decorator.
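
For instance, a rough SQLAlchemy sketch; the table and column names are invented stand-ins, not the Nylas schema:

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Contact(Base):
    __tablename__ = 'contact'
    id = Column(Integer, primary_key=True)
    first_name = Column(String(64))
    last_name = Column(String(64))

    @property
    def full_name(self):
        # Two backend columns exposed as a single object property.
        return '{} {}'.format(self.first_name, self.last_name)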

This helps us to apply powerful abstractions like Polymorphism in the data model. (Shows an example of how they used polymorphism to have separate implementations for different mail accounts).
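
A hedged sketch of single-table polymorphism in SQLAlchemy; the account classes and columns here are invented illustrations, not the Nylas models:

from sqlalchemy import Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class Account(Base):
    __tablename__ = 'account'
    id = Column(Integer, primary_key=True)
    email_address = Column(String(255))
    type = Column(String(32))

    # Rows are dispatched to subclasses based on the 'type' column.
    __mapper_args__ = {'polymorphic_on': type,
                       'polymorphic_identity': 'account'}

    def sync(self):
        raise NotImplementedError

class GmailAccount(Account):
    __mapper_args__ = {'polymorphic_identity': 'gmailaccount'}

    def sync(self):
        pass  # provider-specific sync implementation would go here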

Abstracting many-to-many relationships
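
A rough sketch of how an ORM hides the association table behind relationship(); table and column names are invented:

from sqlalchemy import Column, ForeignKey, Integer, String, Table
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()

tagging = Table('tagging', Base.metadata,
                Column('message_id', ForeignKey('message.id'), primary_key=True),
                Column('tag_id', ForeignKey('tag.id'), primary_key=True))

class Tag(Base):
    __tablename__ = 'tag'
    id = Column(Integer, primary_key=True)
    name = Column(String(64))

class Message(Base):
    __tablename__ = 'message'
    id = Column(Integer, primary_key=True)
    # Callers just see message.tags; the association table is an
    # implementation detail they never touch directly.
    tags = relationship('Tag', secondary=tagging, backref='messages')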

ORMs also allow use of the with statement to do cleanup of the session.
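
For example, the session-per-transaction context manager pattern from the SQLAlchemy docs; the engine URL is an assumption:

from contextlib import contextmanager
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///app.db')  # assumed URL
Session = sessionmaker(bind=engine)

@contextmanager
def session_scope():
    """Commit on success, roll back on error, always close."""
    session = Session()
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
    finally:
        session.close()

# with session_scope() as db:
#     db.add(some_object)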

ORMs provide a mechanism for batch flushing changes for performance.

Stop!

ORMs are not appropriate in all situations. When should you not use them?

  • Overhead (not a ton, but it’s some)
  • They don’t really abstract the database
    • Shows a 200 line patch of getting their sync engine to run on PostgreSQL
  • If you want to be truly DB-agnostic
    • You have to limit yourself to the lowest common denominator of SQL features
    • You have to write custom code to handle specific data types and performance optimizations
  • ORMs hide queries
    • It’s less clear what SQL statements are being run
    • Unexpectedly bad performance

You can deal with this, but it involves work on your part and learning to customize.

Use echo and watch the SQL… it will help your intuition. Once you’ve done that, you may be able to convince your ORM to run different queries.
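
In SQLAlchemy that is just a flag on the engine; the URL here is an assumption:

from sqlalchemy import create_engine

# echo=True logs every SQL statement the ORM emits, which is a cheap way
# to tune your intuition about what your code is really asking the database.
engine = create_engine('sqlite:///app.db', echo=True)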

To do so is very complex. Remember that two-way sync between the application and your database is a hard problem.

What about building your own?

I’ve never done it. But if you’ve already built your own, with a team of engineers to maintain it, that’s a different situation.

When not to use it

  • In simple scripts
    • one off tasks
    • don’t need to maintain
    • i.e. don’t need to manage complexity
  • In migrations
    • These are places where you need to be very specific about the SQL that is run.
    • These are a different context than an application
    • You just don’t need it
    • Trying to do so will make your life more difficult
    • Performance can be a big deal here.
    • For the most part, don’t do it.

How to decide which to use?

  • For django, use the built-in Django ORM
  • For everything else, use SQLAlchemy
    • Complaints against it cite that it is too “magicky”
    • But, it’s very reliable
    • The magic works

Conclusions

  • ORMs solve a hard problem
  • Abstractions manage data model complexity
  • Learn to use SQL echo! Tune your intuition

https://github.com/nylas/sync-engine

Brasseurs 6pm

Describing Descriptors

with Laura Rupprecht

Follow along

Outline?

  • What is a descriptor
  • Custom Descriptor Example
  • Kinds of descriptors
  • Attribute lookup order
  • @property/@classmethod
  • Usage in ORM
  • Problems

What is it?

A certain type of attribute

class Foo():
    x = SomeDescriptor(some_args)

The descriptor protocol has __get__, __set__, and __delete__.

The interesting thing about __get__ is that it gets passed the instance; we’ll see an example. It also gets passed the owner, which is the class the instance belongs to. __set__ and __delete__, on the other hand, only care about the instance (and, for __set__, the value).

Example

  • Dealing with JSON YouTube API responses
  • Dictionary access is everywhere

video['snippet']['thumbnails']['maxres']['url']

No autocomplete in the IDE.

  • Hat tip to Jonathan: “Use Descriptors”

“Uh… what’s a descriptor?” It turns out that the official documentation kind of sucks and doesn’t really explain it if you’re coming in without any prior knowledge. It talks about “binding behaviour”, à la __get__, __set__, __delete__.

How can we make the massive JSON object prettier?

1st try: put it into an object and set all of the attributes in the init function. But, I have to add everything explicitly.

Use dict_digger?

Also, if I update the JSON response, it’s now out of sync with the object that we just built up.

class ResponseDescriptor():
  def __get__(self, instance, objtype):
    return dict_digger.dig(instance.json_response, *self.path)

This way, we’re abstracting away the behaviour of how to retrieve the data. It’s not nearly as repetitive as before. Then just add some error checking and raise an AttributeError if dict_digger can’t find anything.
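
A hedged sketch of how the pieces might fit together; the Video class, the __init__ signature, and the error message are assumptions, not the talk's exact code:

import dict_digger  # the helper used above for walking nested dicts

class ResponseDescriptor(object):
    def __init__(self, *path):
        self.path = path

    def __get__(self, instance, objtype):
        value = dict_digger.dig(instance.json_response, *self.path)
        if value is None:
            raise AttributeError('nothing found at {!r}'.format(self.path))
        return value

class Video(object):
    # Attribute access digs into the raw JSON on demand, so it can never
    # drift out of sync with the response.
    thumbnail_url = ResponseDescriptor('snippet', 'thumbnails', 'maxres', 'url')

    def __init__(self, json_response):
        self.json_response = json_response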

But… there are two types of descriptors

  • Data Descriptor
    • __get__
    • __set__
    • __delete__
  • Non-data descriptor
    • Just a __get__

Attribute lookup

  • Check the class dict for a data descriptor
    • type(foo).__dict__['x']
    • For this one you are explicitly looking for something that has __set__ and __delete__
  • Check instance dict
    • foo.__dict__['x']
  • Check class dict for non-data descriptor or attribute
    • type(foo).__dict__['x']
    • In this step you are searching for something that may only have a __get__
    • Method calls are usually found in this 3rd step
    • But you can override the call on the instance, if you put something into the instance dictionary
  • Throw an attribute error

Why is it this way? To have function attribute access fit in a little better with attribute access in general.

@property / @classmethod

People often don’t know how they work and it’s magic and it’s great. They’re a good way to get descriptor-like behaviour without having to worry about the complexity of the method resolution order (MRO).

@property lets you treat something like an attribute. But remember that @property is always a data descriptor, and as such it takes precedence in the attribute lookup order.

However, @classmethod can be called on an instance or the class itself.
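
A tiny demonstration of that precedence; the class is invented, not from the talk:

class Temperature(object):
    def __init__(self, celsius):
        self._celsius = celsius

    @property
    def fahrenheit(self):
        return self._celsius * 9 / 5.0 + 32

t = Temperature(100)
print(t.fahrenheit)             # 212.0; reads like a plain attribute
t.__dict__['fahrenheit'] = 0
print(t.fahrenheit)             # still 212.0: the data descriptor wins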

Usage in ORMs

In the Django ORM, all of the class attributes are descriptors. You’re defining properties in the class.

Why use them?

  • To check data has the proper format (converting units)
  • To automatically perform verification of a field
  • To make it easier to manipulate existing data fields

Why not?

  • Because it looks awesome
  • Job security

To think about

Raise the appropriate exceptions:

  • ValueError
  • ValidationError
  • AttributeError
  • NotImplementedError

Common Pitfalls

  • Confusing class and instance variables
  • Infinite recursion (see the sketch after this list)
    • Don’t use your __get__ to do a getattr on the same attribute
    • Store stuff in the instance dictionary, .__dict__
    • Make sure to define __set__ in this case
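
A small sketch of both sides of that pitfall; the class names are invented:

class Broken(object):
    def __get__(self, instance, owner):
        # Infinite recursion: attribute access on the instance re-enters
        # this descriptor over and over.
        return getattr(instance, 'x')

class Safe(object):
    def __get__(self, instance, owner):
        return instance.__dict__.get('x')   # go straight to the instance dict

    def __set__(self, instance, value):
        # Defining __set__ keeps this a data descriptor, so assignments to
        # instance.x land in __dict__ without shadowing the descriptor.
        instance.__dict__['x'] = value

class Thing(object):
    x = Safe()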

Resources

Oh, Come On. Who Needs Bytearrays?

with Brandon Rhodes

In Python, normal string objects are immutable.

String types in Python 2 vs. Python 3:

  • Byte strings: str (a.k.a. bytes) in Python 2; bytes in Python 3
  • Text: unicode in Python 2; str in Python 3

The b prefix is optional in Python 2, but mandatory in Python 3 for bytes literals.

Immutability

Immutable objects: any time you call a method on them, it returns a new object, and the original is unchanged.

  • Advantages
    • Simple
    • Functional
  • Disadvantages
    • Allocation
      • Every time you want to make a tweak to the string, it gets copied.
    • Copying

Therefore, Python 3 introduced the builtin bytearray

  • A mutable string
  • Based on Python 3 bytes

Python 3 bytes is designed to be awkward to use, so that you will want to run decode() before treating it as characters.

So, Python programmers are prepared for good internationalization.

In both Python 2 and Python 3, you can slice a string and a bytearray.

Python 3 owes me 18000 characters per year! Aka, “the parens tax”.

Indexing Python 3 bytes gives you integers, so you can’t necessarily pull a bytes object apart element by element and put it back together the same way. So bytes is not really a natural string object.

bytearray is a mutable version of Python’s most underpowered string type
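
A quick interpreter session illustrating the point (Python 3):

>>> b'abc'[0]        # indexing bytes gives you an int, not a 1-byte string
97
>>> b'abc'[0:1]      # you have to slice to get bytes back
b'a'
>>> ba = bytearray(b'abc')
>>> ba[0] = 120      # a bytearray, though, can be changed in place
>>> ba
bytearray(b'xbc')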

Potential applications?

The bit vector

I wrote my first bloom filter. You have a dictionary of words that you want to look up against, and the filter knocks out a large class of words cheaply before you go to the expensive lookup.

(Shows example)
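
A rough sketch of the kind of bit vector involved; the two-hash scheme and sizes here are invented for illustration, not the talk's code:

import hashlib

SIZE = 2 ** 20                      # number of bits in the filter
bits = bytearray(SIZE // 8)         # 128 KiB of zeroed bytes

def _positions(word):
    digest = hashlib.md5(word.encode('utf-8')).digest()
    for offset in (0, 4):           # carve two 32-bit hashes out of the digest
        yield int.from_bytes(digest[offset:offset + 4], 'big') % SIZE

def add(word):
    for pos in _positions(word):
        bits[pos // 8] |= 1 << (pos % 8)

def might_contain(word):
    return all(bits[pos // 8] & (1 << (pos % 8)) for pos in _positions(word))

add('hello')
print(might_contain('hello'))    # True
print(might_contain('goodbye'))  # almost certainly False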

bytearray is more than 7% faster than the old general-purpose array.array object (and the code runs on both natively). So… you might think that it is immediately useful.

But do you know what’s faster? A list of ints! Why? A bytearray stores raw bytes that must be translated into int objects (and back) every time you touch them. A list simply stores int objects to begin with; no translation!

So we have this new special-case container that is no better than a regular general-purpose list of ints.

Except on PyPy, where they are all the same, and they all run much much faster.

Conclusion? Verdict: it’s space efficient. Slightly slower, but 8x less space. And the point of a bloom filter is to save space. That’s why you use a bytearray.

The reusable buffer

When you read a string in, you can’t do anything to it because it is immutable. But a bytearray can be reused as a buffer.

dd takes 6x longer because it defaults to 512-byte blocks. cat performs the same as the Python version because it uses a reasonable block size.

So as we look at Python I/O, we will need to keep block size in mind. The size you read determines how often you have to go to the OS, which determines the run time.

Tried readinto()

data = bytearray(blocksize)
while True:
  length = i.readinto(data)
  if not length:
    break
  o.write(data)

But this was writing the entire block regardless of how many bytes were actually read on the final, short read. The fix is to slice:

o.write(data[:length])

What if we didn’t want to do that expensive slicing operation, since it causes copying? If we want zero copies, there’s also memoryview.

>>> s = bytearray(b'_________')
>>> m = memoryview(s)
>>> v = m[3:6]
>>> v[0] = 65
>>> v[1] = 66
>>> v[2] = 67
>>> s
bytearray(b'___ABC___')
data = bytearray(blocksize)
view = memoryview(data)
while True:
  length = i.readinto(data)
  if not length:
    break
  o.write(view[:length])

A memoryview has no memory of its own; it’s just a sliceable view onto the underlying buffer.

So memory views are often imperative to getting good performance out of reusable buffers.

  • dd .112 s
  • cat .113 s
  • memoryview .117 s

Creating a view object takes some time, so when the block size is small, memoryview is a loss: about a 20% slowdown.

I thought of something

What if we don’t always slice?

# The normal case is when the length == the blocksize
data = bytearray(blocksize)
view = memoryview(data)
while True:
  length = i.readinto(data)
  if not length:
    break
  elif length == blocksize:
    o.write(data)
  else:
    o.write(view[:length])

Lesson: It’s hard to beat old-fashioned strings.

Verdict: Dangerous, but offers a great memory profile

The accumulator

Q: How many bytes will recv(1024) return?
A: One (or more), if it feels like it.

This is completely different from file I/O. In file I/O, the OS will wait for the disk to spin up, find it, and leave you waiting until it’s ready.

So on the network, you’re almost always given the case where you get small pieces of data.

But it worked when I ran against localhost

How does recv() perform with bytearrays?

(Shows code example of a Python anti-pattern) (small string initialize +=)
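
Roughly, the anti-pattern looks like this; sock and content_length are assumed names, not the slide's code:

data = b''
while len(data) < content_length:
    block = sock.recv(1024)
    if not block:
        break
    data += block   # bytes are immutable: every += copies everything so far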

How long does that take? Infinity time.

Instead, store the blocks in a list and join them together at the end.
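
A quick sketch of that approach, with the same assumed names:

blocks = []
received = 0
while received < content_length:
    block = sock.recv(1024)
    if not block:
        break
    blocks.append(block)
    received += len(block)
data = b''.join(blocks)   # one final copy instead of one per iteration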

recv() 1.08s

There is recv_into(), but it runs into a problem: recv_into() reads into the front of the bytearray and overwrites the old data. So now I have to build a memoryview and do the slicing on every read.

data = bytearray(content_length)
view = memoryview(data)
n = content_length
while n:  # sock (an assumed name) fills the next unfilled slice in place
    n -= sock.recv_into(view[content_length - n:])

This takes 0.80s vs. 1.08s.

  • still copies data twice
  • but replaces .join()

You might expect data.extend() to burn RAM bandwidth, re-copying the buffer just to add 40 bytes to the end of your bytearray, but it doesn’t actually do that.

Q: Does bytearray have an append operation that’s any good?
A: Yes!

Use +=! The one operator that we’ve been telling people for 20 years not to use.
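
With a bytearray, += extends the buffer in place rather than copying; a sketch with the same assumed names as above:

data = bytearray()
while len(data) < content_length:
    block = sock.recv(1024)
    if not block:
        break
    data += block   # in-place extend: amortized O(1), no quadratic copying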

In the accumulator, this is the real win for the bytearray, and the code is cleaner.

Admit it, you’ve always wanted to use +=

The Freestyle Mutable String

I just want a string, that I can do stuff with and modify

  • There’s really not a good use for this.
  • Say you have to change part of the payload before sending it. The result is curious: a “mutable string” that doesn’t gain you anything. To upper-case your bytearray, you still have to make two extra copies.

It’s Awkward.

Conclusions

  • Memory-efficient (but not faster)
  • Help control memory fragmentation
  • Great way to accumulate data
  • Awkward for string operations (and underpowered)

Lessons learned with asyncio (“Look ma, I wrote a distributed hash table!”)

with Nicholas Tollervey

Freelance Python developer from the UK. This is an introduction to asyncio

This is not an exhaustive discussion of asyncio, and I’ve simplified plenty of times.

What does asyncio do?

The Python docs do not give me a practical feel for how I would use it.

It lets you write code that concurrently handles asynchronous network based I/O…

Messages arrive and depart via the network at unpredictable times; asyncio lets you deal with such interactions simultaneously.

Distributed Hash Table

  • Essentially a dictionary that is distributed
  • It is decentralized
  • Peer-to-peer key/value data store

Why do you want one?

  • No single point of failure or control
  • It scales well
  • Have to concurrently handle lots of asynchronous I/O

Core Concepts

  • The event loop
    • poll/select loop
    • polling takes place once during the event loop
    • all callbacks take place one after the other
    • so the loop cannot proceed if any callback blocks
  • Concurrency 101

  • Program never waits for a reply from network calls before continuing
  • Programmers define callbacks

Analogy: a washing machine… I can hang the clothes after the laundry is done.

I can squeeze the orange juice while waiting for the toast and the eggs

We don’t stand there watching the toast.

Questions

  • How are async concurrent tasks created?

Coroutines

  • Something that eventually completes
  • Generators
  • Suspended (with yield)
  • can yield from other objects

(shows example of asyncio.coroutine that handles a request)
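
A minimal sketch of that kind of coroutine, using the pre-3.5 generator syntax the talk is based on; the handler name, port, and echo behaviour are assumptions, not the speaker's code:

import asyncio

@asyncio.coroutine
def handle_request(reader, writer):
    data = yield from reader.read(1024)   # suspend until data arrives
    writer.write(b'echo: ' + data)
    yield from writer.drain()             # suspend until the buffer flushes
    writer.close()

loop = asyncio.get_event_loop()
server = loop.run_until_complete(
    asyncio.start_server(handle_request, '127.0.0.1', 8888))
loop.run_forever()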

They represent activity that will eventually complete.

But what about callbacks?

Futures and Tasks

  • A result that may not be available yet
  • A TODO list of sorts
  • These are first-class function calls

Represents results that may not yet be available.

A DHT Example

Hashing, distance, …

Imagine a clockface, …

Each peer keeps its own routing table, and peers exchange state information. Peers store fixed-size buckets. A local node knows more about closer nodes. get() and set() require a lookup. All interactions are asynchronous, and lookups run in parallel/concurrently.

Recursive lookup

  • six degrees of separation

Ask the closest peers, they ask their closest peers, etc.

How does asyncio handle this?

A lookup is a Future

class Lookup(asyncio.Future):
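
A hedged sketch of the idea, a Future that resolves once the peers being queried have all replied; the attributes and methods here are invented, not the drogulus implementation:

import asyncio

class Lookup(asyncio.Future):
    def __init__(self, target, peers):
        super().__init__()
        self.target = target
        self.pending = set(peers)
        self.results = []

    def add_response(self, peer, contacts):
        self.pending.discard(peer)
        self.results.extend(contacts)
        if not self.pending:
            # All peers have replied; resolving the future resumes anyone
            # who was waiting (yield from) on this lookup.
            self.set_result(self.results)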

What about networking? How does asyncio handle different networking protocols?

Transports and Protocols

The DHT is network-agnostic, so it can work over HTTP, netstrings, anything that I choose to implement.
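
A minimal sketch of an asyncio Protocol, the layer that makes that separation possible; the class name, port, and echo-style behaviour are invented illustrations, not the drogulus code:

import asyncio

class DHTProtocol(asyncio.Protocol):
    def connection_made(self, transport):
        self.transport = transport

    def data_received(self, data):
        # A real implementation would parse the bytes and hand them to the
        # network-agnostic DHT core; here we just acknowledge them.
        self.transport.write(b'ack')

loop = asyncio.get_event_loop()
server = loop.run_until_complete(
    loop.create_server(DHTProtocol, '0.0.0.0', 9999))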

Final thoughts:

  • Twisted? asyncio feels a lot like Twisted
  • My code has 100% unit test coverage (sort of)
  • DHT < 1k LOC (because asyncio makes it easy to think about concurrent problems)
  • I/O bound vs. CPU bound
    • Don’t use async I/O if you’re going to be CPU bound, because if your event loop blocks, everything will hang.

http://github.com/ntoll/drogulus


