PyCon Day 2: Morning Sessions
Machine Learning 101
with Kyle Kastner, LISA, University of Montreal
Overall the talk was helpful as the “101” that it was advertised to be. The classification of machine learning techniques was informative, but I found the speaker saying many times: “treat this as a black box, you don’t need to know what it does.” In my experience, this has not been the case: not knowing how the parts work, and not knowing the underlying math and statistics, makes it virtually impossible to choose the correct technique to use. But maybe (hopefully!) the software has improved since the last time I took a shot at it.
Two main things
- Automation: take complicated things that usually require humans and do them really quickly.
- Data Analysis: scientists, engineers, process analysis, etc.
Applications
The future is very bright, because the hardware has finally caught up with the ideas from the ’80s.
- Speech Processing
- Image Processing
- Natural Language Processing
- Advertising
- Recommendations
Types of automation
Handcrafted rules
if elif elif elif elif elif else
DON’T TOUCH code
Lots of magic constants
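A toy illustration of that style (my own example, not from the talk): a chain of if/elif tests full of magic constants that nobody dares to touch once it “works”.

```python
# Handcrafted-rules "classifier": if/elif chains plus magic constants.
# Every threshold below is an arbitrary, hand-picked number.
def classify_email(text):
    if "free money" in text:
        return "spam"
    elif "limited offer" in text:
        return "spam"
    elif text.count("!") > 3:        # magic constant
        return "spam"
    elif len(text) > 10000:          # another magic constant
        return "spam"
    else:
        return "ham"
```

The point of the later sections is that the boundary these rules draw by hand is exactly what machine learning techniques learn from data instead.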
Statistics
- linear models
- p values
- bayesian stats
Machine Learning
- k-means
- svm
- random forests
Deep learning
- neural networks
Manifold Hypothesis:
Out of all possible images, there is only a very small subset that we care about: most possible images are random static, and the meaningful ones (cats, dogs) lie on a small manifold.
Classification. The basic task in machine learning. This is a cat, this is a dog; this is spam, this is not spam. Think of it as drawing a boundary in some space; pretty much, that’s what classification boils down to.
Regression. Instead of predicting one of a few discrete states, you predict a continuous value or some behaviour. Examples: given half of the Mona Lisa, fill in the rest; shown a person without a head, you will imagine a head.
The core of machine learning is about Learning Functions. Imagine magic functions. You don’t know what is in the middle. You have inputs and outputs. Learn or discover what is the middle part.
y = f(x; θ)
If we have data x, we can learn theta. And if we do it right, we can use theta for data that we’ve never seen before.
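This idea can be sketched in a few lines of scikit-learn (my example, not the speaker’s): given (x, y) pairs from a hidden function, fit a model to learn θ, then apply it to data the model has never seen.

```python
# Minimal sketch of "learning theta": recover the parameters of a
# hidden linear function from (x, y) examples. The data is made up.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 1))
y = 3.0 * X[:, 0] + 0.5               # hidden f(x; theta), theta = (3.0, 0.5)

model = LinearRegression().fit(X, y)  # learn theta from the data
theta = (model.coef_[0], model.intercept_)

# Use the learned theta on an x we've never seen before.
prediction = model.predict([[2.0]])[0]
```

Because the training data here is noiseless, least squares recovers θ exactly; with real data you would only get an estimate.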
Train/Valid/Test
- Split current data
- Evaluate
Typical split
- 80% training
- 20% validation
- Testing data answers unknown
- Want systems to work on new data!
- This approach simulates new data
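The 80/20 split above can be sketched as follows; the talk didn’t name a specific API, so scikit-learn’s `train_test_split` is my choice here.

```python
# Split the available labeled data 80/20 into training and validation
# sets, so the validation set simulates "new data" the model never saw.
from sklearn.model_selection import train_test_split

X = list(range(100))
y = [i % 2 for i in X]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)
```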
The speaker recommends Anaconda from continuum.io or Canopy from Enthought.
Examples
Recommender System
Recommender systems, e.g. recommending recipes based on what you have cooked before. A recommender engine starts from the fact that people rate some things and not others; the goal is to predict the missing ratings.
The data set is a set of jokes. The rating, who rated it, and an index are good enough.
Use pandas to read an excel spreadsheet directly.
Use scipy sparse matrices.
V is our whole ratings database. We believe that people are not entirely unique, so the goal is to go from a high-dimensional space to a lower-dimensional space that people can reason about. Jokes were rated between -10 and 10, and the mean absolute error (MAE) is about 3, so we have some idea about what jokes you might like even without looking at the content of the joke.
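A hypothetical sketch of the dimensionality-reduction idea: factor a sparse user-by-joke ratings matrix V into a low-rank approximation. Random data stands in for the real joke dataset, and truncated SVD stands in for whatever factorization the talk used.

```python
# Compress each user from 100 joke-rating dimensions down to 5, then
# reconstruct a dense estimate of all ratings from that low-rank form.
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

V = sparse.random(50, 100, density=0.1, random_state=0)  # users x jokes

svd = TruncatedSVD(n_components=5, random_state=0)
U = svd.fit_transform(V)        # each user as a 5-dimensional vector
V_hat = U @ svd.components_     # reconstructed ratings estimate
```

The reconstruction V_hat has values even where V had no rating, which is exactly the “predict the missing ratings” trick.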
Spam Classification
Download data, run it through scikit-learn. Fit/predict. TF-IDF vectorizer. 98% on the training set, 97% on the validation data. We’re doing a good job with data that we’ve never seen before.
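A toy version of that fit/predict workflow: TF-IDF features feeding a Naive Bayes classifier. The four-document corpus here is invented; the talk used a real spam dataset.

```python
# TF-IDF vectorizer + Naive Bayes in a single scikit-learn pipeline:
# fit on labeled texts, then predict labels for unseen texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win free money now", "cheap meds free offer",
         "meeting at noon tomorrow", "lunch with the team today"]
labels = ["spam", "spam", "ham", "ham"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)

prediction = clf.predict(["free money offer"])[0]
```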
Digits
Image recognition; a manifold structure develops.
Deep Neural Networks
Goal is to use a neural network on your MacBook Air in 5 minutes. We are looking for sloths: three-toed sloths, not two-toed sloths.
There is a hierarchy of things. WordNet gives us a tree structure that shows how living things are related to each other.
PyPy.js: What? How? Why?
with Ryan Kelly from Mozilla
At its core, Mozilla is a web company: the web as a shared global technology platform.
The web
Technology (HTTP, HTML, CSS, JavaScript) vs. Culture (Open, Ubiquitous, Secure, Trustworthy). Or… Cons vs. Pros.
The web is open for both producers and consumers.
The problem is that you can’t just throw away the tech stack of the web and keep all of the great cultural byproducts of it.
For Mozilla, anything that the Web can’t do, or anything that the Web is not faster and better at than native technologies, is a bug.
A Bug in the Web
“I want to code in Python!”
Enter PyPy.js
What?
But this is not the first project to put Python on the web.
Must tick three boxes:
- Compatibility
- Performance
- Should be able to run at comparable speed (turns out that this demo runs faster)
- Web-ish-ness
- interact with JavaScript
- Raise alerts
- interact with jquery and elements
If an engineer identifies three critical features, s/he is about to ask you to pick two of them.
How?
PyPy.js = PyPy + Emscripten
PyPy:
- Python interpreter written in RPython
- toolchain for translating RPython into C
- A JIT-Compiler-Generator
- Not just a JIT, but a JIT generator
An interesting thing about PyPy is that it’s targeted towards experimentation.
Emscripten:
- LLVM backend that generates JavaScript
- A simulated POSIX environment in JavaScript
- Originally for porting games to the web
The speaker spoke with the original author of Emscripten and asked him why he had written it: “I was writing this game, it was written in C++, but I wanted to run it on the web.”
asm.js: Mix of very high-level and very low-level stuff.
Start with mycode.c -> emcc -> mycode.js -> js jit -> mycode.x86
So you end up in the same place as you would have if you had just used GCC.
PyPy.js = PyPy + Emscripten
PyPy.py -> RPython -> PyPy.c -> emcc -> PyPy.js -> js jit -> PyPy.x86 (feed mycode.py here)
This is kind of awful. But… “it’s awful all the way down the stack”, and it has been for the last 30 years. So… this is nothing new. Accept the awfulness and move on.
Worth it?
Mozilla has a history of public benchmark-tracking websites.
Upsides:
- Slightly faster than native CPython
Downsides:
- Slightly slower on Chrome than native CPython
- The download size of the JS is much, much bigger
- Load time
Why?
Consider Alternatives
We have py2exe… can we make py2web?
py2web + webapp manifest + APK factory (turn your Firefox OS app into an Android app)
IPython + NumPyPy + fortran libs
Free Stuff!
http://bitbucket.org/pypy/pypy # + Python 3 (for free)
A prolog interpreter (for free)
File bugs against the Web: We shouldn’t accept the web as a 2nd class computing platform.
Hyperactive: HTTP/2 and Python
with Cory Benfield
Who am I?
Core contributor to the requests library and urllib3, maintainer of 3 others, PSF Contributing Member.
I have a ton of material and I talk quickly anyways.
Works at Metaswitch Networks, on Project Calico.
Member of the HTTPbis Working Group; implemented an HTTP/2 stack (hyper).
Why are we here?
HTTP/1.1 is Old! It’s really old. And it’s aged quite poorly.
The W3C page was 600 bytes and one HTML page in 1996. At the protocol level, HTTP/1.1 is very inefficient. TCP is about adapting to the state of the network, but HTTP doesn’t reuse connections and doesn’t keep connections open. Browsers open a ton of concurrent connections, which all get in the way of each other. Setting up a connection requires one RTT. A web app has a 200 ms RTT, but a native app can respond in 1 or 2 milliseconds.
Webapp developers keep coming up with ways to hack around the deficiencies of HTTP.
Example:
- minified css/js
- spriting: putting images together and only displaying one
- inlining images
This messes up caching. So, the web needs an update.
It’s not so much an interesting technical discussion as a very interesting political discussion.
Binary
- Text protocols are easy to debug but complex to parse
- Binary is simpler
- Simpler (usually) means faster
You cannot read it anymore. This has caused outrage.
There are plenty of people in this room who think that they could write an HTTP/1.1 parser, and they would be wrong. Look through the bug trackers: there are so many bugs that boil down to parsing.
The big advantage of this is that you no longer need to touch every byte of the message to see if it is a newline character.
Efficient
- Multiplexing (with priority/flow control)
- A single request/response pair is now called a “stream”
- Each stream tells you how important it is and what it depends on
- Flow control means that there’s no point sending data if the other side can’t do anything with it.
- Header compression (HPACK)
- Headers are too big
- Could do gzip, but it’s not that great, and the better way is to take advantage of their structure
- So… we invented a new compression type
- Early stream termination
- You can abort request/responses without terminating your connection
- Server Push
- The server can send responses that it thinks that you’re going to need.
- This is only for priming caches
- If you don’t want it, you can abort it
Not universally well received
HTTP/2.0 is not a technical masterpiece… I would flunk students in my (hypothetical) protocol design class if they submitted it.
Why is there so much negativity?
It’s imperfect. We have traded some downsides for other downsides.
- Difficult to reason about
- It has inherent state
- You need cooperation from your tools
- This means tools need to drop their connection logs into your debug logs
- Challenging to debug
- Awkward edge cases
- These arose because we had to maintain backward compatibility with HTTP/1.1
- E.g. large headers > 16kb long
- Inherently concurrent
- Data and control frames are all on the same connection
- You need to process these quickly
- This is very difficult in synchronous code
- This is going to be really hard
I would not be surprised if HTTP/2.0 turns out to be the biggest push to adopt asyncio.
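To make the concurrency point concrete, here is a toy asyncio sketch (my own, not real HTTP/2): control frames and data frames share one connection, and an event loop lets all of them be serviced promptly.

```python
# Toy model: interleaved control and data "frames" handled cooperatively
# on one event loop, so a data frame can't block a PING indefinitely.
import asyncio

async def handle_frame(kind, payload, log):
    await asyncio.sleep(0)       # yield so other frames get serviced too
    log.append((kind, payload))

async def main():
    log = []
    frames = [("control", "SETTINGS"), ("data", "chunk-1"),
              ("control", "PING"), ("data", "chunk-2")]
    await asyncio.gather(*(handle_frame(k, p, log) for k, p in frames))
    return log

log = asyncio.run(main())
```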
But
There are a lot of ways in which 2.0 is really good.
Demo of loading two tiled images.
Play
There are 34 different implementations of it.
nghttp2: the reference implementation; it has all the things.
Wireshark can decode HTTP/2.0 frames, which would be useful except that HTTP/2.0 is almost always served over TLS.
Firefox supports it; Python is actually underserved.
Hyper
It’s only a client library. It fits into the stack at the httplib level, so that we can put urllib3 on top of that and requests on top of that.
Runscope’s httpbin is running behind H2O, which is an HTTP/1.1 to HTTP/2.0 reverse proxy. So you can put H2O in front of any Python app you have now and add HTTP/2.0 support to it.