PyCon 2019: Day 1
Morning Keynotes
No notes per se, aside from the fact that these two speakers told a powerful story from two perspectives.
Just check out the videos when they become available on pyvideo (they are not, as of this writing).
Releasing the World’s Largest Python Site Every 7 Minutes
Shuhong Wong, Production Engineering Manager at Instagram(IG)/FB
State of server release
Release server code 70-100 times daily, every 7 minutes at peak
Inspiration for your continuous deployment system
Build package -> (Run Test) -> (Run Canary) -> (Take lock) -> (Notify Authors) -> (Track Deploy Start) -> (Parallel Deploy) -> (Track Deploy End) -> (Release Lock)
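The stage chain above can be sketched as a sequential pipeline with safe abort-on-failure defaults. This is a toy sketch, not IG's code: the stage names come from the slide, everything else (the lock, the callables) is invented.

```python
import threading

deploy_lock = threading.Lock()  # models "Take lock" ... "Release lock"

def run_pipeline(build, stages):
    """Run each deploy stage in order; abort on the first failure.

    `stages` is a list of (name, callable) pairs; each callable
    returns True on success. No human prompts: the safe default
    on any failure is ABORT.
    """
    with deploy_lock:
        results = []
        for name, stage in stages:
            ok = stage(build)
            results.append((name, ok))
            if not ok:
                break  # stop the pipeline, never ask "continue? (y/N)"
        return results

# Toy stages standing in for "Run Test", "Run Canary", "Parallel Deploy":
stages = [
    ("run_test", lambda b: True),
    ("run_canary", lambda b: b != "bad-build"),
    ("parallel_deploy", lambda b: True),
]
```

Running it with a good build walks all stages; a build the canary rejects stops the pipeline right there.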
- Deployment script matured over time
- Lots of features - we kept building
- Deploying IG for many years
- Laid the foundation to how we do deployment at IG
Push script
- Improvements to the script came organically
- Anyone can deploy code
- encourage authors to take ownership
- Human is the weak link
- Can push doesn’t mean will push
- inconsistent human response to options
- “Tests failed, do you want to continue? (y/N) y”
Deploy automation
- Do you want to run tests before pushing? (Y)
- “test fails, do you wish to continue pushing?” (N)
- No human input needed when everything is working with safe defaults (ABORT/ABORT)
- Post commit hook… deploy script (auto defaults) = Continuous Deployment!
- Deploy every single commit consistently and as soon as it lands
- CD became a service
- fewer people know how to deploy and revert a change to the site!
- back to deployment team to unblock when the build fails
- DO NOT BREAK TRUNK
- The people who notice the breakage are not the people who caused it
- Land blocking
- Immediately deploy after
Land blocking
- Ensure commit is production worthy before allowing to land
- Commit author owning the change
- fewer incidents of broken trunk
- everyone moves faster
- Commit pushed to prod within 1 hour of the commit landing
- engineers can be expected to be around to support the change
- Before landing, commit? lock
- After landing, deploy lock
- What happens when an error slips past test and canary
- Tests and canary are not bulletproof
- Need volume
Deploy in phases
- Canary was our last line of defense, it has its limits
- Add the c1 tier
- C1, C2, C3, running in parallel, pipelined.
- Deploy script is very complex
- Massive coordination needed
- Still have a post commit hook
- pushes to db
- controller, makes a decision and promote a version to a tier
- c1, c2 runners
- broadcast the instructions to roll() and deploy to all the machines
- code knows how to pull the package, build it, and deploy
- Now have pipelined deployment system
- c1 was second defense
Deploy as fast as we can
- Why?
- Engineers are around to support their changes
- Our engineers don’t slow down for us
- Better productivity and safety
- 1 commit in queue
- deploy
- 2
- still can happen in an hour
- 20 commits?
- Can deploy 8 / hour, need to deploy 3 at once
- Can only deploy at the speed of allowed capacity loss
- Also deploy at peak traffic periods
- can only use idle compute power to reload the web server
- uWSGI
- Fork a new master
- Shutdown idle worker
- Spawn worker on new master
- eventually move all workers to new master
- Shutdown old master
- servers serve traffic all the time during reload
- We can’t limit the volume of commits
- how fast we deploy affects… how many deployments we can do in 1 hour…
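The capacity arithmetic above (8 deploys/hour, 20 queued commits, so about 3 commits must ride each deploy) is just ceiling division. A hypothetical helper, not IG's code:

```python
import math

def commits_per_deploy(queued_commits, deploys_per_hour):
    """How many commits must ride in each deploy to drain the
    queue within one hour, given limited deploy capacity."""
    return math.ceil(queued_commits / deploys_per_hour)

# 20 commits queued, 8 deploys possible per hour -> 3 commits per deploy
```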
Fully autonomous fleet
- Used to be 100% homogeneous
- Sometimes py2/py3
- C1/C2
- Different config at parts of the fleet
- Different runner for each type of config
- replace c1/c2 with cache?
The “North Stars”
- Do simple things first
- Do what is enough for us to scale
- Push as fast as you can
- smaller batches, turns out to be more stable for production
- Build a culture around testing
- There is no CD without QC and testing
- Type checking
- 100% test coverage
- Continuously asking why something wasn’t tested
Questions
- Does every problem show up within 6 minutes? Aren’t there some things that take say 6 hours?
- At IG scale, everything shows up within a minute. The Canary is the safeguard. Catch 1-2 bad commits per day. If we missed it in the pipeline, we iterate and improve.
- Alpha, beta, production staging?
- No. The best signal comes from actual code hitting production. When it hits every user, the sooner we get that signal back, the better. I wouldn’t reject that thought, but this model works very well with us.
- How many services?
- This is instagram.com, which is a monolith. Everything. There are other microservices, but they have their own release cycles.
- How much do engineers depend on local tests, vs. canary and pushing to commit?
- No data. Have a hunch…
- Do engineers also use feature flagging?
- Not introduced… want to expand our current system to
- How do you deploy your deployment system?
- Continuously. It breaks, but we like to know as soon as it breaks. It’s under control and we can roll it back fast. 20 minutes downtime, we can move on.
- How the uWSGI master process works.
- Watch the talk from PyCon Australia
Dependency hell: a library author’s guide
Yanhui (Angela) Li, Brian Quinlan
Package, distribution, PyPI, Internal
Introduction/Motivation
Welcome to hell.
These are all real examples
$ pip install apache_beam tensorflow
tensorboard 1.11.2 has requirements protobuf>=3.4.0, but you'll have protobuf 3.3.0 which is incompatible.
As a user, what do I do? I didn’t ask for this and I don’t care about it. So I can assume that someone at apache is a liar. Or I can just quit and not use tensorflow.
Incompatibilities in the top PyPI packages: 7 of 100
Most of these packages are foundational and don’t have any dependencies! This is untenable!
The big problem is this is one of the situations where a user encounters an error and can’t do anything about it.
Technical Details
Diamond dependency
My App -> Apache Beam (protobuf<=3.3.0) -> protobuf
My App -> TensorFlow (protobuf>=3.6.1) -> protobuf
The dependencies come from the setup.py in the dependencies
Dependency Issues
Common things that lead to diamond dependencies.
Never release a breaking change in a minor release. That’s the most common source of diamond dependency problems.
E.g. oauth2client
removed API between v4.1.1 -> v4.1.2
- Version number isn’t communicating changes accurately.
- Hard for the users to depend on the latest version of oauth2client
- Hard for users to specify dependency ranges
Breaking changes in a minor release. User has to use version pinning in the requirements. However, later, using version pinning, their package could be incompatible with any other package that requires something else.
- Version range too narrow (e.g. > 1.3.0, < 1.4.0)
- Use a broad range, too many versions to support (> 1.2.0, < 1.0.9)
- Not supporting the latest version
Use outdated dependencies
- If your package can’t work with the latest, then it will be widely incompatible
- Missing bug/security fixes
Best Practices
- Everything isn’t hopeless
- What to do?!
- Use semantic versioning
- Best thing about this? You don’t have to coordinate with anyone. You just choose a version and do it.
- Advertise to your users that you use semver
- Avoid API churn
- Use tensorflow -> tensorflow makes breaking change -> code breaks -> refactor code -> use tensorflow
- Make reasonable constraints
- E.g. ‘six>=1.10’ (they basically never have released a breaking change)
- If six had used semver you could say ‘six>=1.10,<2’
- Once you stop pinning versions, you have more testing to do
@nox.session(python=['3.4', '3.5', '3.6', '3.7'])
@nox.parametrize(
    'min_version', ['Jinja2==2.9.0',
                    'Pillow==5.0.0',
                    ...
                    ])
def compatibility_test(session, min_version):
    session.install(min_version)
- Support new dependency versions
- new packages will use the latest version
- important to pick up security fixes
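The range constraints above (e.g. ‘six>=1.10,<2’) can be checked mechanically. A minimal sketch that compares release-number tuples only — real tools use the `packaging` library, which also handles pre-releases and other PEP 440 forms:

```python
def parse(version):
    """Split a release version like '1.10.2' into an int tuple."""
    return tuple(int(p) for p in version.split("."))

def in_range(version, lower, upper):
    """True if lower <= version < upper (release segments only)."""
    return parse(lower) <= parse(version) < parse(upper)

# The 'six>=1.10,<2' constraint from the slide:
in_range("1.12.0", "1.10", "2")   # True
in_range("2.0.0", "1.10", "2")    # False
```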
Make users happy!
- Use semver
- Avoid API churn
- Support as large a range as possible
- Support the latest versions of your dependencies
Questions?
- How do automated dependency tools fit into this
$ pip check
will check your virtualenv
- Is the impossible conflict common b/c of how Python does dependency management
- sdists. In general, you don’t know a package’s transitive dependencies until you download it and run the setup script
- A Python only supported wheel, the PyPI API will tell you what the dependencies are. And then you could have more sophisticated tools for resolution. I was an advocate of that until I talked to other people who hate wheels.
Advanced asyncio: Solving real-world production problems
Lynn Root
SRE at Spotify @roguelynn
Build infrastructure for people who write machine learning models for signal processing FOSS advocate at Spotify PyLadies
Agenda
- Initial setup of Mayhem Mandrill
- Dev best practices
- Testing, debugging, profiling
Intro
Simple illustrations are not very helpful. Basically souped up hello world examples
Some help you get up and running, but then you realize that you’re doing it wrong
I’m not building “web crawlers” at spotify
I’m building services that make a lot of HTTP requests that have to be non-blocking
Pub/sub, handle errors, service level metrics. Need non-asyncio compatible dependencies
Example
Service that does periodic hard restarts
Chaos monkey -> Mayhem Mandrill
Listen for pub/sub message and restart the host based on that message
Initial setup
Not using await
… creating a task. Returns the task, but we’re using it as fire-and-forget.
Consumer
Concurrent work
Store message in db for later playing
Restart and save don’t depend on one another, but maybe you do want it to happen serially. E.g. restart hosts that have an uptime > 7 days. Serial code with dependencies doesn’t mean that it can’t be async.
Block when needed; put logic in a separate coroutine
Finalization tasks
cleanup, ack message. handle message
unblocking the finalization tasks
- async != concurrent
- serial != blocking
It is a mental paradigm shift. Think about what you can farm out and what you cannot.
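A toy sketch of that split, with hypothetical `restart_host`/`save` coroutines standing in for the talk’s pub/sub consumer: the two steps for one message run serially inside the handler, but messages are still handled concurrently.

```python
import asyncio

async def restart_host(msg):
    await asyncio.sleep(0.01)   # stand-in for non-blocking work
    return f"restarted:{msg}"

async def save(msg):
    await asyncio.sleep(0.01)   # stand-in for a DB write
    return f"saved:{msg}"

async def handle(msg, results):
    # Serial *within* one message: save only after restart finishes.
    await restart_host(msg)
    await save(msg)
    results.append(msg)

async def main(msgs):
    results = []
    # Concurrent *across* messages: fan the handlers out.
    await asyncio.gather(*(handle(m, results) for m in msgs))
    return results

# asyncio.run(main(["host1", "host2", "host3"]))
```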
Graceful shutdowns
Clean up database connections, finish current request, while not accepting new. Respond to signals
Responding to signals
Attach a signal handler to the loop.
Which signals to care about?
Mmmmmm, no standard. All of them.
not-so-graceful asyncio.shield
- try/except/finally isn’t enough
- define desired shutdown behaviour
- use signal handlers
- listen for appropriate signals
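A minimal sketch of that checklist — define the shutdown behaviour, attach handlers to the loop, listen for the signals. The shutdown behaviour here (cancel outstanding tasks, stop the loop) is one reasonable choice, not the talk’s exact code:

```python
import asyncio
import signal

async def shutdown(loop):
    # Desired shutdown behaviour: cancel outstanding tasks, then stop.
    tasks = [t for t in asyncio.all_tasks(loop)
             if t is not asyncio.current_task(loop)]
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    loop.stop()

def install_handlers(loop):
    # Listen for the signals you care about ("all of them", per the talk).
    for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
        loop.add_signal_handler(
            sig, lambda: asyncio.ensure_future(shutdown(loop)))
```

`loop.add_signal_handler` is Unix-only; on Windows you would fall back to `signal.signal`.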
Exception handling
We haven’t done that yet.
We can use a global exception handler and attach it to the loop
Specific handlers?
- return_exceptions=True is super important
- asyncio.gather with return_exceptions=True has deterministic ordering
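Shown directly with toy coroutines: results come back in the order the awaitables were passed in, with exceptions delivered as values rather than raised.

```python
import asyncio

async def ok(n):
    return n

async def boom():
    raise ValueError("bad message")

async def main():
    # Exceptions become list entries; ordering matches the call order,
    # so results[1] is always the ValueError from boom().
    return await asyncio.gather(
        ok(1), boom(), ok(3), return_exceptions=True)

# asyncio.run(main()) -> [1, ValueError('bad message'), 3]
```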
Threads
Sometimes you have to work with them and I’m sorry if you do
- running coroutines from other threads
- ThreadPoolExecutor: calling threaded code from the main event loop
- asyncio.run_coroutine_threadsafe: running a coroutine on the main event loop from another thread
- I was deadlocking myself in production before I realized this
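A sketch of the second pattern above — a plain worker thread submitting a coroutine to the main event loop with `asyncio.run_coroutine_threadsafe` and blocking (in that thread only) on the returned `concurrent.futures.Future`. All names here are invented:

```python
import asyncio
import threading

async def fetch(n):
    await asyncio.sleep(0.01)   # stand-in for non-blocking I/O
    return n * 2

def worker(loop, out):
    # Called from a plain thread; schedules work on the main loop.
    future = asyncio.run_coroutine_threadsafe(fetch(21), loop)
    out.append(future.result(timeout=5))  # blocks this thread only

async def main():
    loop = asyncio.get_running_loop()
    out = []
    t = threading.Thread(target=worker, args=(loop, out))
    t.start()
    # Keep the loop alive while the thread waits on its future.
    while not out:
        await asyncio.sleep(0.01)
    t.join()
    return out[0]

# asyncio.run(main()) -> 42
```

Calling `future.result()` from the event loop thread itself is how you deadlock; it must be called from the other thread.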
Testing
asyncio.run in py37
pytest.mark.asyncio will do the hard work for you in <py37
mocking coroutines
E.g. save() calls another coroutine or it might call a database. You don’t want to wait for that when you’re running tests, right?
unittest.mock and pytest-mock don’t support async mocks.
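Since Python 3.8, `unittest.mock.AsyncMock` fills this gap; on older versions you can wrap a plain `Mock` in a coroutine yourself. A sketch — `save` here is a hypothetical coroutine like the one from the talk:

```python
import asyncio
from unittest import mock

async def handle(msg, save):
    await save(msg)   # we don't want the real DB call during tests
    return "handled"

def test_handle():
    try:
        save = mock.AsyncMock()          # Python 3.8+
    except AttributeError:
        # Pre-3.8 fallback: a coroutine wrapper around a Mock.
        inner = mock.Mock()
        async def save(msg):
            return inner(msg)
    result = asyncio.run(handle("m1", save))
    assert result == "handled"
```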
testing create_task()
Create some mock queue and use the mock_queue fixture
testing the event loop
100% test coverage… how do we get there with main
pytest-asyncio
+ mocked coroutines
Debugging
One small thing. Use print_stack(), you’ll see the stack for every running task, and you can increase the number of frames that are printed.
PYTHONASYNCIODEBUG=1
It’s able to tell you if you’re threadsafe!
Acts as a tiny profiler that flags calls that are slower than 100ms (configurable). Highlighting any unnecessarily blocking tasks.
- In production: aiodebug will log callbacks for you
- Can report delayed calls to statsd
- Super lightweight
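The same debug mode can be enabled per-loop rather than via the environment variable, and the 100 ms slow-callback threshold is a loop attribute you can tune (a small sketch):

```python
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    loop.set_debug(True)                  # same effect as PYTHONASYNCIODEBUG=1
    loop.slow_callback_duration = 0.05    # flag callbacks slower than 50 ms
    return loop.get_debug()

# asyncio.run(main()) -> True
```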
Profiling
Event loop can already track coroutines that take too long. Hard to differentiate a pattern from abnormal behaviour.
cProfile. Nothing stands out except main event loop.
kcachegrind can be used with python
pyprof2calltree --kcachegrind -i mayhem.prof
Visualization groups modules together by color.
line_profiler package where we can hone in on pieces of our code that are suspicious.
aiologger: allows for non-blocking logging
Live profiling (don’t want to have to stop the service to look at the results). Can’t attach to a running process, but when you launch with it, you get a text based UI, and you can save performance data and view it at a later time. Server that you can connect to from else where.
Not much difference between profiling async code vs. regular
Does remote work really work?
Lauren Schaefer
Why I work remotely
2008
My boyfriend proposed. I said yes. Life was great.
I was a computer scientist, and he was a nuclear engineer. We did get along. We had our internships.
I interviewed with IBM. Three offers at IBM. Fantastic problem. Nuclear recruiting didn’t start until the next spring. You have to live where the nuclear power plants are.
Awkward situation. I want to live with my husband.
I called these hiring managers up. I can work with you in the office for a year and then I don’t know.
Will you let me work from home. Yes, maybe, no. Easy decision.
My husband went to Maryland. We split physically (not romantically).
Went to cubeland for a year. THE WORST.
After a year of that I was excited to work remotely.
Working remotely can be THE WORST. So many life adjustments can be very challenging.
I went from having lots of local friends to one single set of couple friends. My life changed in a major way and I wasn’t happy about that.
I’d talk to my team once per week for an hour. It was a struggle.
I was in charge of build verification tests. Get on at 6:30 to make sure that the build was ready for 9:00. I had to keep checking in all day long when the test didn’t pass. Found myself very unpassionate about the work.
Working remotely can be THE BEST. Learned how to switch teams. Learned how to speak at conferences. Started a BOF session for remote employees and then started a remote support group for 11k IBM employees.
Worked for SugarCRM for a year and then have been at MongoDB. Don’t ever want to go back to an office.
How to go from the worst to the best.
Why do employees want it
- Unable to relocate due to spouse’s job or kid’s school
- Lengthy commutes
- Availability for children and aging parents
- Distracting office environments
- Travel the world
Why do employers want it
- Attract and retain top talent–no matter where they live
- Increase employee morale
- Save $$
- Smaller/no office space
- No relocation costs for new employees
- Increase employee productivity
- Fewer sick days
- Shorter breaks
- Fewer distractions
Research
2019 Stack Overflow Developer survey
40% of devs would prefer to work outside the office
How often do you work remotely? 43% never.
60% more coding experience for those who work remotely full time.
Greatest challenge to productivity
- distracting work environment (42%)
- meetings (37%)
- time spent commuting
2014 scientific study
16k employees were asked who would like to work remotely. The yeses were split into two groups. 13% performance increase in work-from-home employees.
- more minutes per shift
- remote employees took fewer breaks and fewer sick days
- more calls per minute
- quieter work environment that was more conducive to getting their work done
- statistically significant work satisfaction
- 50% reduction in attrition among work-from-home employees
- huge!
- Downside: 50% reduction in promotion rate conditional on performance
- out of sight, out of mind. You’re not going to be considered for promotion
- work from home, couldn’t develop their interpersonal skills
- work from home employees didn’t want to go back to the office and wouldn’t put in for it
- 22% performance increase after allowing employees to chose where to work
2009 Cisco Teleworker Survey
- productivity, work-life flexibility increased
- $277 million savings in productivity
- 47k metric tons of GHGs not released in 1 year due to teleworking
Lit review, meta study (46 over 20 years)
- Clear upside
- “…no straightforward damaging effects on quality of workplace relationships or perceived career prospects”
Downsides
- “Professional isolation negatively impacts job performance”
- Having a remote manager may negatively impact you
- Remote employees fear stalling careers, isolation, distractions, and blurred lines between work and home life
- Remote work encourages employees to “overwork and to allow their work to infringe on their family role”
- remote employees are often paranoid about appearing to have been slacking off
How to convince your boss
“Remote: Office Not Required”
Propose an experiment. Sick mother-in-law? Want to travel the world? Be honest. What are we doing and how to we evaluate it. Propose the experiment for all team members.
“But what about collaboration and water-cooler conversations?”
“Water cooler” channel in Slack.
Some managers just have a “gut feeling” that it’s not going to work.
How many wildly innovative ideas can you implement at once. One? Two?
When all else fails, talk about the bottom line. Talk about the reduced attrition rates (not in a threatening way).
Steps
In reverse priority order.
- Join the right team (fully distributed)
- Know everyone’s communication style, and be comfortable.
- Schedule 1-on-1s with each team member
- Share a bit about yourself. Be personal.
- Be productive.
- Do your job.
- Set daily goals.
- I don’t have anyone else holding me accountable; I have to do that myself
- Create a workspace you love. Don’t work on your couch.
- Communicate with your team
- Be present. Be present when you say you are.
- Be a great PR agent for yourself. Be careful of the words you use to describe yourself.
- Travel
- You have done this to get here today so good job.
- If you have a chance to meet your teammates, do it.
- Hack the system (arrange a client visit with your teammates, conference presentation).
- Actively prevent burnout
- Take a lunch break (I ignored my husband who wanted me to skip lunch to finish 30 minutes earlier)
- Stretch before meeting (other people don’t show up on time)
- Turn off your computer after work (and notifications on your phone). You have to take a step away
Slides are on twitter @Lauren_Schaefer
At the end of the deck there is an appendix of references.
What I wish people had told me about Python’s multiprocessing
Complex multiprocessing example
- Status subprocesses
- Observation subprocess
- Send queue -> send subprocess -> Logging servers
- Reply queue -> listen subprocesses -> event queue
IoT HVAC Process
- Main process
- System status sub-process
- HVAC Observation Sub-process
- Send sub-process
- listen sub-process
Tips
- Don’t share data, pass messages
- when you share data, you have to manage the locks, use messages
- use multiprocessing queues
- Great thing… it ships! It’s part of the stdlib
- If you need to scale, you can swap it out
- Down side: uses pipes, every message is pickled
- each queue handles one type of message (with one exception)
- each process should read from at most one queue
- refactor later to use other queueing systems
send_q = multiprocessing.Queue()
event_q = multiprocessing.Queue()
event_q.put("FOO")
# in another subprocess
event = event_q.get(block=True, timeout=1.0)
- Always clean up after yourself
- notify processes to shutdown using “END” messages and a single shutdown_event
- every one of these, when you’re reading out of a queue, is a loop: get the next job, get the next job
- All processes: notice shutdown and then clean up after themselves
- Main process: cleans up subprocesses and queues
- notify processes to shutdown using “END” messages and a single shutdown_event
while not shutdown_event.is_set():
    try:
        item = work_queue.get(block=True, ...)
    except queue.Empty:
        continue
def stop_procs(self):
    self.shutdown_event.set()
    end_time = time.time() + self.STOP_WAIT_SECS
    for proc in self.procs:
        join_secs = max(
            0.0,
            min(end_time - time.time(), self.STOP_WAIT_SECS))
        proc.join(join_secs)
        if proc.is_alive():
            # If I'm still alive, I have problems
            proc.terminate()
            # You want to log that... I'll talk about that later
Always join your threads. Removes a spurious error if you are killing your master and the
- Obey all signals
- Every process needs to handle both TERM (kill) and INT (Ctrl-C) (and other) signals
- Set the shutdown_event the first two times
- That way during debugging, I can test failure modes
- If you can shutdown cleanly, you’ve probably thought about and handled all of the race conditions that are possible
- Third time
raise
- Maybe change system settings
- Don’t ever wait forever
- No process should get stuck
- Loops must terminate
- Blocking calls need timeout
- Timeouts based on how long you can wait
- With queues, you can tell it to block and give it a timeout
- Sockets, you can set the timeout when you startup
- If there’s no timeout, write it yourself
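“If there’s no timeout, write it yourself” can look like a deadline loop around a non-blocking poll. `poll_fn` here is a hypothetical API that returns a result or None:

```python
import time

def get_with_deadline(poll_fn, timeout_secs, interval=0.01):
    """Poll until poll_fn() returns a non-None result or the
    deadline passes; never wait forever."""
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        result = poll_fn()
        if result is not None:
            return result
        time.sleep(interval)   # don't spin the CPU between polls
    raise TimeoutError(f"no result within {timeout_secs}s")
```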
- Report, and log all things
- Use the Python logging module
- Use single time relative to application start
- Include Name of process
- Must log: Errors, Exception Tracebacks, and Warnings
- Should log: Start, Stop, Events
- In DEBUG mode: log a lot, Yeah, log even more.
- How do I log?
start_time = time.monotonic()
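A sketch of that logging recipe — one monotonic start time for the whole application, process name on every record. The formatting details here are my own, not the speaker’s:

```python
import logging
import multiprocessing
import time

START = time.monotonic()   # single time reference for the whole app

class RelativeTime(logging.Filter):
    """Stamp each record with seconds since application start
    and the current process name."""
    def filter(self, record):
        record.rel = time.monotonic() - START
        record.proc = multiprocessing.current_process().name
        return True

def make_logger():
    logger = logging.getLogger("hvac")
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(rel)8.3f %(proc)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.addFilter(RelativeTime())
    logger.setLevel(logging.DEBUG)
    return logger
```

Because the filter lives on the logger, every handler (console, file, whatever) sees the same `rel` and `proc` fields.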
Conclusion
- Don’t share, pass messages
- Always clean up after yourself
- Handle TERM and INT signals
- Don’t ever wait forever
- Report, and log all the things
As part of this talk, I wrote a blog post for my company.
pamela@cloudcity.io
@pmcanulty01
CUDA in your Python: Effective Parallel Programming on the GPU
William Horton
@hortonhearsafoo
Moore’s Law is dead
Number of transistors on an integrated circuit will double every two years
Based on data from the 50s - 70s. Maintained until 2016.
Then physics happened. As you get down to smaller levels, nanometer scale, there are problems with power draw and heat dissipation.
Gordon Moore himself said, in 2015, that Moore’s Law is dying.
GPUs
Developed for gaming. Designed to be good at matrix operations.
Typical workloads…
NVidia, 4352 CUDA cores across 68 streaming multiprocessors. 1.35 GHz Base clock
GPU Devotes more transistors to data processing and less to Cache and control
GPU is mostly arithmetic units (ALU).
GPUs are truly general purpose parallel processors.
Different models: CUDA (Nvidia), APP (AMD), OpenCL (open standard maintained by Khronos Group)
About Me
Senior Software Engineer on the Data team at Compass
Real Estate platform.
We work with: PySpark, Kafka, Airflow.
Hobbies include deep learning, fast.ai, Kaggle, PyTorch
“Horton’s Law: Your AWS bill will double every month with your interest in deep learning”
Data Pipelines: Uber uses AresDB for real-time analytics.
How do I start?
import numpy as np
x = np.random.randn(1000000000000000000).astype(np.float32)
y = np.random.randn(1000000000000000000).astype(np.float32)
z = x + y
In CUDA:
import cupy as cp
x = cp.random.randn(1000000000000000000).astype(cp.float32)
y = cp.random.randn(1000000000000000000).astype(cp.float32)
z = x + y
Different approaches
- Drop-in replacement
- Compiling CUDA strings in Python
- C/C++ extension
Increasing complexity, but greater control.
CuPy
A drop in replacement for NumPy
Developed for the deep learning framework Chainer
API differences:
- Data types: strings and objects
- numpy.array([some, list]) doesn’t work in cupy
- reduction methods return arrays and not scalars
More drop ins
- cuDF
- cuML
Compiling CUDA strings into your program
CUDA lets you look at your data/threads in multiple dimensions as well. Threads are in a block, so that you can map your processing to your data
Blocks and Grids
Blocks are groups of threads. Grids are groups of blocks.
Each thread gets assigned to an ALU on the processor
Host and Device: the CPU will run most general-purpose logic and the GPU will run the big parallel operations. Need to specify what code runs on the device vs. the host
PyCUDA
Built by researcher Andreas Klöckner at UIUC?
Benefits?
- Auto memory management
- objects get cleaned up when their lifetimes are over
- Data transfer: in, out, and inout
- wrappers around your numpy array to transfer data to and from the GPU
- Error checking
- Metaprogramming
- PyCUDA will benchmark at runtime and optimize at runtime
https://github.com/rmcgibbo/npcuda-example
Uses Cython to generate C++
Manual Memory Management
How to start?
Access a GPU
Google Colab
Kaggle Kernels
Cloud GPU Instances (but remember Horton’s law!)
Where to go next?
Applying CUDA to your workflow
Parallel programming algorithms
Other kinds of devices (xPUs like TPUs, FPGAs through PYNQ)