Machine Learning 101

with Kyle Kastner, LISA, University of Montreal

Follow along

Overall the talk was helpful as the “101” that it was advertised to be. The classification of machine learning techniques was informative, but I found the speaker saying many times: “treat this as a black box, you don’t need to know what it does.” In my experience, this has not been the case. Not knowing how the parts work, and not knowing the underlying math and statistics, makes it virtually impossible to choose the correct technique to use. But maybe (hopefully!) the software has improved since the last time I took a shot at it.

Two main things

  • Automation: You want to do complicated things that usually require humans, and do them really quickly.
  • Data Analysis: Scientists, engineers, process analysis, etc.

Applications

The future is very bright, because the hardware has finally caught up with the ideas from the 80’s.

  • Speech Processing
  • Image Processing
  • Natural Language Processing
  • Advertising
  • Recommendations

Types of automation

Handcrafted rules

if elif elif elif elif elif else

DON’T TOUCH code

Lots of magic constants
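
A tiny sketch (mine, not the speaker’s) of what those handcrafted rules tend to look like; the function and thresholds are hypothetical:

```python
# Hypothetical hand-written spam filter: a wall of if/elif branches
# held together by magic constants that nobody dares to touch.
def looks_like_spam(subject, num_links, sender_domain):
    if "free money" in subject.lower():
        return True
    elif num_links > 7:                               # magic constant
        return True
    elif sender_domain.endswith("example-spam.biz"):  # magic constant
        return True
    elif len(subject) > 120:                          # magic constant
        return True
    else:
        return False

print(looks_like_spam("FREE MONEY inside", 2, "mail.example.com"))  # True
```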

Statistics

  • linear models
  • p values
  • bayesian stats

Machine Learning

  • k-means
  • svm
  • random forests

Deep learning

  • neural networks

Manifold Hypothesis:

Out of all possible images, there is a very small subset that we actually care about. Example: random static versus cats and dogs.

Classification. The basic thing in machine learning. This is a cat, this is a dog. This is spam, this is not spam. Think of it as drawing a boundary in some space. Pretty much, that’s what classification boils down to.
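
For example, a minimal classification sketch with scikit-learn; the toy points and labels below are made up purely for illustration:

```python
from sklearn.svm import SVC

# Toy 2-D points: two clusters, labelled 0 ("cat") and 1 ("dog").
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],   # class 0
     [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]]   # class 1
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")  # learns a boundary between the two groups
clf.fit(X, y)
print(clf.predict([[0.15, 0.1], [1.05, 1.0]]))  # -> [0 1]
```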

Regression. Instead of predicting one of a few discrete states, you predict a continuous value or some behaviour. Example: complete the other half of the Mona Lisa, or a person without a head (you will imagine a head).

The core of machine learning is learning functions. Imagine a magic function: you don’t know what is in the middle, you only have inputs and outputs. Learn or discover the middle part.

y = f(x; θ)

If we have data x (and the corresponding outputs y), we can learn θ. And if we do it right, we can use θ on data we’ve never seen before.
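
A minimal sketch of learning θ, assuming f is a straight line: fit the slope and intercept from observed (x, y) pairs with NumPy, then apply them to x values we have never seen. The data here is invented:

```python
import numpy as np

# Pretend nature generates data as y = 2x + 1 plus noise; we only see (x, y).
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(scale=0.5, size=x.shape)

# Learn theta = (slope, intercept) by least squares.
theta = np.polyfit(x, y, deg=1)

# Apply the learned theta to x values we have never seen before.
print(np.polyval(theta, [12.0, 15.0]))  # roughly [25, 31]
```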

Train/Valid/Test

  • Split current data
  • Evaluate
  • Typical split

    • 80% training
    • 20% validation
  • Testing data answers unknown
  • Want systems to work on new data!
  • This approach simulates new data (see the split sketch below)
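
Roughly what the 80/20 split looks like with scikit-learn’s train_test_split (current API; the placeholder X and y are mine):

```python
from sklearn.model_selection import train_test_split

# Placeholder features and labels; substitute your own data.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# 80% for training, 20% held out for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(len(X_train), len(X_valid))  # 80 20
```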

Speaker recommends Anaconda from continuum.io or Canopy from Enthought.

Examples

Recommender System

Recommender systems, e.g. recommending recipes based on what you have cooked before. The premise of a recommender engine is that people rate some things and not others; the goal is to predict the ratings they never gave.

Data set of jokes. The rating, who rated it, and an index for the joke are good enough.

Use pandas to read an excel spreadsheet directly.

Use scipy sparse matrices

V is our whole ratings database. We believe that people are not necessarily unique in their tastes. The goal is to go from a high-dimensional space to a lower-dimensional space that people can reason about. Jokes were rated between -10 and 10, and the mean absolute error (MAE) is 3.x. So we have some idea of which jokes you might like, even without ever looking at the content of the joke.
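
A rough sketch of how I imagine this fits together, not the speaker’s code: the file and column names are made up, the ratings go into a scipy sparse matrix, a truncated SVD gives the low-dimensional space, and the MAE is measured over the observed ratings:

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical spreadsheet with integer user_id / joke_id columns
# and a rating column in [-10, 10].
ratings = pd.read_excel("jester_ratings.xlsx")

# V: (users x jokes) sparse matrix built from the rating triples.
V = sparse.csr_matrix(
    (ratings["rating"].to_numpy(),
     (ratings["user_id"].to_numpy(), ratings["joke_id"].to_numpy())))

# Project people into a low-dimensional taste space and reconstruct.
svd = TruncatedSVD(n_components=10, random_state=0)
low_dim = svd.fit_transform(V)        # each person as a 10-d vector
V_hat = low_dim.dot(svd.components_)  # approximate ratings for every joke

# Mean absolute error over the ratings we actually observed.
rows, cols = V.nonzero()
mae = np.abs(V[rows, cols].A1 - V_hat[rows, cols]).mean()
print(mae)
```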

Spam Classification

Download data, run it through scikit-learn. Fit/predict with a TF-IDF vectorizer. 98% on the training set, 97% on the validation data. We’re doing a good job with data that we’ve never seen before.
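
Roughly what that fit/predict pipeline looks like in scikit-learn; the texts, labels, and choice of classifier here are placeholders, not the talk’s dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder corpus; the talk used a real spam/ham dataset.
texts = ["win a free prize now", "meeting at 3pm tomorrow",
         "cheap pills online", "lunch on friday?"] * 50
labels = [1, 0, 1, 0] * 50  # 1 = spam, 0 = ham

X_train, X_valid, y_train, y_valid = train_test_split(
    texts, labels, test_size=0.2, random_state=0)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("validation accuracy:", model.score(X_valid, y_valid))
```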

Digits

Image recognition: a manifold structure emerges.
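
One way to see that manifold structure yourself is to embed scikit-learn’s 8x8 digit images into two dimensions; this sketch uses t-SNE, which may or may not be what the speaker showed:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()  # 1797 images, 8x8 pixels = 64 features each

# Embed the 64-d images in 2-D; nearby points are similar-looking digits.
embedding = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

plt.scatter(embedding[:, 0], embedding[:, 1], c=digits.target, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.show()
```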

Deep Neural Networks

Goal is to use a neural network on your MacBook Air in 5 minutes. We are looking for sloths: three-toed sloths, not two-toed sloths.

There is a hierarchy of things. WordNet gives us a tree structure that shows how living things are related to each other.

PyPy.js: What? How? Why?

with Ryan Kelly from Mozilla

At its core, Mozilla is a web company: the web as a shared global technology platform.

The web

Technology (HTTP, HTML, CSS, JavaScript) vs. Culture (Open, Ubiquitous, Secure, Trustworthy). Or… cons vs. pros.

The web is open for both producers and consumers.

The problem is that you can’t just throw away the tech stack of the web and keep all of the great cultural byproducts of it.

For Mozilla, anything that the Web can’t do, or anything that the Web is not faster and better at than native technologies, is a bug.

A Bug in the Web

“I want to code it in Python!”

Enter PyPy.js

What?

But, this is not the first project to put Python on the web.

Must tick three boxes:

  • Compatibility
  • Performance
    • Should be able to run at comparable speed (turns out that this demo runs faster)
  • Web-ish-ness
    • interact with JavaScript
    • Raise alerts
    • interact with jquery and elements

If an engineer identifies three critical features, s/he is about to ask you to pick two of them.

How?

PyPy.js = PyPy + Emscripten

PyPy:

  • Python interpreter written in RPython
  • toolchain for translating RPython into C
  • A JIT-Compiler-Generator
    • Not just a JIT, but a JIT generator

An interesting thing about PyPy is that it’s targeted towards experimentation.

Emscripten:

  • LLVM backend that generates JavaScript
  • A simulated POSIX environment in JavaScript
  • Originally for porting games to the web

The speaker spoke with the original author of Emscripten and asked him why he had written it: “I was writing this game, it was written in C++, but I wanted to run it on the web.”

asm.js: Mix of very high-level and very low-level stuff.

Start with mycode.c -> emcc -> mycode.js -> js jit -> mycode.x86

So you end up in the same place as you would have if you had just used GCC.

PyPy.js = PyPy + Emscripten

PyPy.py -> RPython -> PyPy.c -> emcc -> PyPy.js -> js jit -> PyPy.x86 (feed mycode.py here)

This is kind of awful. But… “it’s awful all the way down the stack”, and it has been for the last 30 years. So… this is nothing new. Accept the awfulness and move on.

Worth it?

Mozilla has a history of public benchmark tracking websites:

Upsides:

  • Slightly faster than native CPython

Downsides:

  • Slightly slower on Chrome than native CPython
  • The download size of the JS is much, much bigger.
  • Load time

Why?

Because it’s awesome

Consider Alternatives

We have py2exe… can we make py2web?

py2web + webapp manifest + APK factory (turn your Firefox OS app into an Android app)

IPython + NumPyPy + fortran libs

Free Stuff!

http://bitbucket.org/pypy/pypy # + Python 3 (for free)

A prolog interpreter (for free)

File bugs against the Web: we shouldn’t accept the web as a second-class computing platform.

Hyperactive: HTTP/2 and Python

with Cory Benfield

Who am I?

Core contributor to the requests library, urllib3, maintainer of 3 others, PSF Contributing Member.

I have a ton of material and I talk quickly anyways.

Works at Metaswitch Networks, on Project Calico.

Member of the HTTPbis Working Group; implemented an HTTP/2 stack (hyper).

Why are we here?

HTTP/1.1 is Old! It’s really old. And it’s aged quite poorly.

The W3C page was 600 bytes and one HTML page in 1996. At the protocol level, HTTP/1.1 is very inefficient. TCP is all about adapting to the state of the network, but HTTP doesn’t reuse connections and doesn’t keep connections open. Instead it opens a ton of concurrent connections, which all get in the way of each other. Setting up a connection requires one RTT. A web app has a 200 ms RTT, but a native app can respond in 1 or 2 milliseconds.

Webapp developers keep coming up with ways to hack around the deficiencies of HTTP.

Example:

  • minified css/js
  • putting images together and only displaying one
  • inlining images

This messes up caching. So, the web needs an update.

It’s not so much an interesting technical discussion as it is a very interesting political discussion.

Binary

  • Text protocols are easy to debug but complex to parse
  • Binary is simpler
  • Simpler (usually) means faster

You cannot read it anymore. This has caused outrage.

There are plenty of people in this room who think they could write an HTTP/1.1 parser, and they would be wrong. Look through the bug tracker: there are so many bugs that boil down to parsing.

The big advantage of this is that you no longer need to touch every byte of the message to see if it is a newline character.

Efficient

  • Multiplexing (with priority/flow control)
    • A single request/response pair is now called a “stream”
    • Each stream tells you how important it is and what it depends on
    • Flow control means that there’s no point sending data if the other side can’t do anything with it.
  • Header compression (HPACK)
    • Headers are too big
    • Could do gzip, but it’s not that great, and the better way is to take advantage of their structure
    • So… we invented a new compression type
  • Early stream termination
    • You can abort request/responses without terminating your connection
  • Server Push
    • The server can send responses that it thinks that you’re going to need.
    • This is only for priming caches
    • If you don’t want it, you can abort it

Not universally well received

HTTP/2.0 is not a technical masterpiece… I would flunk students in my (hypothetical) protocol design class if they submitted it.

Why is there so much negativity?

It’s imperfect. We have traded some downsides for other downsides.

  • Difficult to reason about
    • It has inherent state
    • You need cooperation from your tools
    • This means tools need to drop their connection logs into your debug logs
  • Challenging to debug
  • Awkward edge cases
    • These arose because we had to maintain backward compatibility with HTTP/1.1
    • E.g. large headers more than 16 KB long
  • Inherently concurrent
    • Data and control frames are all on the same connection
    • You need to process these quickly
    • This is very difficult with synchronous programming
    • This is going to be really hard

I would not be surprised if HTTP/2.0 is going to be the biggest push to adopt asyncio.

But

There are a lot of ways in which 2.0 is really good.

Demo of loading two tiled images.

Play

34 different implementations of it.

nghttp2 is the reference implementation, and it has all the things. Wireshark can decode HTTP/2.0 frames, which would be useful except that HTTP/2.0 is almost always served over TLS.

Firefox supports it; Python is actually underserved.

Hyper

It’s only a client library. It fits into the stack at the httplib level, so that we can put urllib3 on top of that and requests on top of that.

http://http2bin.org

Runscope’s httpbin running behind H2O, which is an HTTP/1.1-to-HTTP/2.0 reverse proxy.

So you can put H2O in front of any Python app you have now and add HTTP/2.0 support to it.
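
For reference, here is roughly what talking to http2bin.org with hyper looks like, based on its documented client API; treat it as a sketch, since details may have changed:

```python
from hyper import HTTP20Connection

# One TCP connection; each request becomes its own multiplexed stream.
conn = HTTP20Connection("http2bin.org", port=443)

# Fire off several requests before reading any responses.
stream_ids = [conn.request("GET", "/get") for _ in range(3)]

for stream_id in stream_ids:
    response = conn.get_response(stream_id)
    print(response.status, len(response.read()))
```

All three requests share the single connection, which is exactly the multiplexing described earlier.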



