Zero Infrastructure: Build Serverless Realtime Data Pipelines with Python and AWS Lambda

Mercedes Coyle: “The crazy chicken lady”

Realtime/Streaming Systems Architecture

  • Use case
    • Online video syndication platform
    • 2-3 million video streams per day
  • What does real-time mean?
    • Event-based
    • Near realtime - up to several seconds between request/response
  • Started with a legacy data system
    • hosted 100 million data events per day
    • impressions… pause, play, etc.
    • Upload logs to S3
    • Load logs into Hadoop and compute metrics
    • What did we learn from it?
      • Need for faster data analysis
      • real-time continuous reporting
      • Hadoop queries weren’t easy to translate to ELK
      • scheduled jobs are not intelligent
        • they might run, but the data might not be there
      • mangled data
        • logging to disk is prone to failure
        • thrashing disk for every event
        • had to clean up nginx’s logging substitutions

AWS Serverless Components

  • Going Serverless
    • What do we need for a new system?
      • Streaming analytics
      • Reduce system complexity
      • Data source and storage agnostic
      • Flexibility
      • Route data/store in S3
    • API Gateway handles the request payload
      • Quick and easy to set up
      • Public HTTML interface or use API keys
      • Can trigger lambda or go directly to Kinesis stream
    • Kinesis Streams
      • Queueing service
      • HTTP PUT single or batched records
      • 7 day data retention
      • Multiple subscribers (shard iterator)
      • Horizontally scalable
    • S3
      • Stores file objects (not a file system)
      • Categorize file objects by buckets
      • ///filename.bz2
    • EMR (Elastic Map Reduce)
      • Managed Hadoop cluster
      • Spin up, process, destroy
    • AWS Lambda
      • Event driven push/pull
      • scales up/down automatically
      • Supports Python!
      • Heart of every lambda function is the event_handler
        • two params: event, context
        • The context object holds metadata about the running function
      • context.get_remaining_time_in_millis()
      • Key design feature is statelessness
      • Lambda functions don’t know anything about previous events
      • 50 lines of code replace 5 servers, a cron job, and a bunch of other things
      • Any print or logging statement is logged to CloudWatch
      • Metrics dashboard displays high level performance data
      • Limit of 100 running Lambda functions on one account
      • Testing?
        • mock out event and context object
        • can run the code on its own
        • Invoke lambda functions manually from AWS CLI
      • Packaging
        • you have to build a zipped deployment package file
        • $ pip install <module> -t /project-dir/
        • Just put all the stuff you’re using into the file
      • Deployment
        • AWS CLI from Travis CI job
        • CloudFormation template
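
The handler and testing notes above can be sketched roughly as follows. This is a minimal illustration, not the speaker’s code: the handler name `lambda_handler`, the `FakeContext` class, the `make_kinesis_event` helper, the one-second safety margin, and the `{"type": "play"}` event fields are all assumptions made for the example. What it relies on from Lambda itself is the `(event, context)` signature, `context.get_remaining_time_in_millis()`, and the fact that Kinesis-triggered events carry base64-encoded payloads under `Records[].kinesis.data`.

```python
import base64
import json


def lambda_handler(event, context):
    """Stateless entry point: everything it needs arrives in `event`."""
    records = event.get("Records", [])
    processed = []
    for record in records:
        # Stop early if the invocation is close to timing out
        # (the 1000 ms margin is an arbitrary choice for this sketch).
        if context.get_remaining_time_in_millis() < 1000:
            break
        # Kinesis delivers each record's payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        processed.append(json.loads(payload.decode("utf-8")))
    # On Lambda, anything printed here is captured by CloudWatch Logs.
    print("processed %d of %d records" % (len(processed), len(records)))
    return processed


class FakeContext(object):
    """Hypothetical stand-in for the real context object in local tests."""

    def __init__(self, remaining_ms=30000):
        self.remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self.remaining_ms


def make_kinesis_event(items):
    """Build a mock event shaped like a Kinesis trigger payload."""
    return {
        "Records": [
            {
                "kinesis": {
                    "data": base64.b64encode(
                        json.dumps(item).encode("utf-8")
                    ).decode("ascii")
                }
            }
            for item in items
        ]
    }
```

Because the handler only touches its two arguments, it can be called locally with `lambda_handler(make_kinesis_event([{"type": "play"}]), FakeContext())` without any AWS account involved.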

Summary

  • Python 2.7 only
  • Faster development cycles (we can develop functions faster than we can develop libraries)
  • Code more for business goals, and less for infrastructure
  • Factor in maintenance and operational costs when pricing out
  • Consider how “whole hog” you want to go into AWS infrastructure

When is it good to be bad? Web scraping and data analysis of NHL penalties

Wendy Grus - data analysis - organizer of Seattle PyLadies

  • Use one of your own personal interests to drive your web scraping project.
  • Zac Rinaldo slams Kris Letang into the boards (booooooooooooooooooooooooooo)
  • Says, “Yeah, I changed the whole game, man. Fuck, who knows what the game would have been like if I hadn’t done that”.
  • Claimed that the Flyers won, b/c of his penalty.
  • He got suspended for 8 games (yaaaaaay!).

Can taking a penalty ever help you?

  • Two types of players
    • Skilled players
    • Enforcers (keep the skilled players from scoring)
  • Zac needs for it to be good for him to take penalties. Job security.

Analysis

  • Let’s look at NHL penalty data
  • However… there’s no penalty data set available
  • Let’s make our own; web scraping!
  • Web Scraping
    • What data do I need? For this, you need your brain.
    • What features should I look for?
    • How do I collect, combine, and analyze data? requests, BeautifulSoup, pandas, statsmodels
  • Experiment
    • What question do you want to answer?
    • What data do you need to answer it?
    • What data is available?
  • Website features
    • Easy to automate moving from page to page?
      • I was able to find play-by-play for every game since 2012 and the URL was sensibly designed
      • All of a sudden, I downloaded data for 10,000 hockey games
    • Easy to parse data from the source code?
      • Tables in tables in tables
      • Used BeautifulSoup
      • Data is added in real-time, so there was a lot of terrible formatting
      • Half of the games weren’t valid HTML
      • lxml parser was way faster than BeautifulSoup
  • How do I combine the data?
    • Multiple data sources may not name the fields the same way (e.g. N.J. vs NJD (New Jersey Devils))
    • Had to build a dictionary of related abbreviations
    • Players were described differently (# / lastname vs. firstname / lastname)
    • Teams change names over time
    • Teams move
    • I reduced 113 distinct penalty names into 8 penalty types (we’re most interested in “Physical foul”)
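
The dictionary-based cleanup described above might look something like this. It is only a sketch under stated assumptions: the talk names just the `N.J.` vs `NJD` mismatch and the “Physical foul” type, so every other alias, penalty name, and category label below (including `PHX` to `ARI`, `"Other"`, and `"Unclassified"`) is an illustrative placeholder, and the function names are hypothetical.

```python
# Map every spelling seen in any source to one canonical abbreviation.
# Only "N.J." vs "NJD" comes from the talk; the rest is illustrative.
TEAM_ALIASES = {
    "N.J.": "NJD",
    "NJD": "NJD",
    "PHX": "ARI",  # example of a team that was renamed over time
}

# Collapse many raw penalty names into a handful of penalty types.
# Only "Physical foul" is named in the talk; the other label is a placeholder.
PENALTY_TYPES = {
    "roughing": "Physical foul",
    "boarding": "Physical foul",
    "charging": "Physical foul",
    "delay of game": "Other",
}


def normalize_team(raw):
    """Return the canonical abbreviation for a team label."""
    key = raw.strip().upper()
    # Unknown labels pass through unchanged so they can be reviewed later.
    return TEAM_ALIASES.get(key, key)


def penalty_type(name):
    """Return the reduced penalty type for a raw penalty name."""
    return PENALTY_TYPES.get(name.strip().lower(), "Unclassified")
```

Letting unknown values fall through to a visible default (the label itself, or `"Unclassified"`) makes it easy to list what still needs a mapping while the dictionaries are being built up.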

Evaluating the outcome

  • Next goal; immediate effect
  • Positive change; is the final state better than the state at the time of the penalty
  • Score differential; what’s the difference in score at the time of penalty vs. the end of the game

  • Used logistic regression for next goal and positive change
  • Used linear regression for score differential
  • Covariates
    • Physical foul indicator
    • Hometeam indicator
    • Team strength: win percentage
    • Opponent strength: opponent’s win percentage
    • Penalized player strength: time on ice and penalty minutes per game
    • Drawing player strength: time on ice and penalty minutes per game
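
As a rough illustration of the three outcome measures, here is one way they could be computed for a single penalty. The notes do not define them precisely, so this sketch makes assumptions: differentials are measured from the penalized team’s point of view, “state” is read as losing/tied/winning, and all function names are hypothetical.

```python
def sign(x):
    """Return -1, 0 or 1 for losing, tied or winning."""
    return (x > 0) - (x < 0)


def score_differential_change(diff_at_penalty, diff_final):
    """Change in goal differential between the penalty and the final whistle."""
    return diff_final - diff_at_penalty


def positive_change(diff_at_penalty, diff_final):
    """True if the final state (losing/tied/winning) beats the state at the penalty.

    Assumption: "state" means the sign of the goal differential.
    """
    return sign(diff_final) > sign(diff_at_penalty)


def next_goal(scoring_teams_after_penalty, penalized_team):
    """True if the penalized team scores the next goal, None if nobody scores."""
    for team in scoring_teams_after_penalty:
        # Only the first goal after the penalty matters.
        return team == penalized_team
    return None
```

For example, a team that takes a penalty while down by one and ends the game tied gets `positive_change(-1, 0) == True` and a differential change of `1`.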

Analysis

  • 91476 total penalties
  • 83k with drawn-by information
  • 77k after last filter

  • Import data into pandas
  • Logistic regression
    • Used the statsmodels logit function
    • Physical foul was not significantly related to the outcome
    • Home team, win percentage, and opponent’s win percentage were significant
  • Linear Regression
    • Physical foul was not a significant indicator
    • Home team, win percentage, and opponent’s win percentage were significant
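
Written out, the two models would take roughly the following form. This is a sketch built from the covariate list in the notes, not the speaker’s exact specification: the symbol names are assumptions, and the player-strength covariates (time on ice and penalty minutes per game for both players) are collected into a single vector s_i.

```latex
% Logistic regression for the binary outcomes (next goal, positive change)
\operatorname{logit} P(Y_i = 1)
  = \beta_0
  + \beta_1\,\mathrm{PhysicalFoul}_i
  + \beta_2\,\mathrm{Home}_i
  + \beta_3\,\mathrm{WinPct}_i
  + \beta_4\,\mathrm{OppWinPct}_i
  + \boldsymbol{\gamma}^{\top}\mathbf{s}_i

% Linear regression for the score differential D_i
D_i
  = \alpha_0
  + \alpha_1\,\mathrm{PhysicalFoul}_i
  + \alpha_2\,\mathrm{Home}_i
  + \alpha_3\,\mathrm{WinPct}_i
  + \alpha_4\,\mathrm{OppWinPct}_i
  + \boldsymbol{\delta}^{\top}\mathbf{s}_i
  + \varepsilon_i
```

In this notation, the finding that the physical foul indicator was not significant means the data gave no grounds to reject β_1 = 0 or α_1 = 0, while the home team and win percentage coefficients were distinguishable from zero.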

Conclusion: when is it good to be bad?

  • When you are the home team
  • When you have a better record than your opponent

  • When I did the analysis for 5 years prior, there was a correlation. The NHL landscape has changed.

Epilogue

  • Rinaldo was traded, suspended, demoted to the minors, and then suspended indefinitely (yay!)

From Developer to Manager

Sean O’Connor - Director of Application Engineering at bitly

bit.ly

  • How do we have income to support 90 people’s salaries? We built tools that help marketers.
  • 20 people in engineering
  • Growing
  • At this size, you need some organization

Context

  • When I started at bitly 3.5 years ago, I was just coding
  • It was great
  • I had some experience, started doing some mentorship
  • A year in, there was a musical chairs shuffle
  • All of a sudden, everyone that was my boss or higher was gone
  • I was a more senior person on the team, and was given the choice as to whether to become a manager or not

Becoming a manager is a choice

  • Think
  • You have to deal w/ fuzzy humans
  • Computers may be frustrating, but they are deterministic
  • You have to be much more deliberate in your actions and your speech
  • If you say something stupid, people will notice
  • At the end of the day, you are making a career change
  • You have to understand people and communication
  • What’s important will change
  • Your goals and priorities will change
  • What you accomplish and how you are measured will change
  • Your next job prospect will change
  • The interview process will change
  • How you evaluate your new employer will change a lot
  • Intrinsically none of that is good/bad, it just is
  • What is it that makes you happy?

OK Let’s become a manager

  • It can be kind of terrifying
  • Management is a learnable skill
  • You can have intrinsic traits that make you better at it, but it’s still a learnable skill
  • Collect experiences, not things
  • You don’t necessarily have to have had direct experience to know something
  • You’ve had a boss before and you can draw from evaluating them
  • There’s a lot of existing knowledge about management
  • You can read it and learn from it
    • http://randsinrepose.com/dont-skip-this/ Managing Humans
  • Find mentors and get support
  • Just like coding, people are happy to give guidance

Management is about relationships

  • This is the core thing that you’re trying to build and trying to achieve
  • At the end of the day, if the relationships on your team are not good, then nothing else is going to work right
  • Take that premise and what can we do to improve relationships
  • To begin, you have to have trust/build trust
  • Communicate! If you can’t communicate with someone, you can’t build trust
  • Weekly 1-on-1s are one of my best tools
    • low friction tools
    • easier to check in and ask questions that may not fly in an open concept office
  • Have empathy!
    • If you can’t empathize w/ your reports, it’s going to be hard to communicate with them
    • If you know someone on your team is bored with their project, you need to act on it
  • What happens when folks are out on vacation and you start doing the programming again?
    • Every time I spend a week to sit down and work on code it feels good!
    • But at the end, you wake up with a hangover
    • The rest of my team is sitting around twiddling their thumbs, because I haven’t given them what they need

Management is about not doing the work

  • When you’re managing, you have to be mindful of how you present yourself
  • I’m a terrible actor, and am terrible at displaying an emotional state that’s different than my own
  • If I’m rough and curt, that’s not going to set anyone up to have a good day at work
  • Be mindful of what emotional load you’re putting on your team

Management is about setting the tone

  • I had a really easy team when I transitioned from engineering to management
  • But sometimes things go downhill
  • Starts off when you notice someone on your team who is usually really engaged, but then they kind of detach
  • I don’t know! There’s no debugger for humans!
  • Gut reaction; meh, maybe if I ignore it, it will just get better on its own

Issues cannot be left unresolved

  • Maybe something about this person’s work environment has become abusive or toxic
  • Maybe something that another team member is doing
  • Could lead to lawsuit
  • Could be something going on outside of work and you ignore the side effects of that
    • They may not realize how much it’s impacting their work
    • They may feel trapped or might not know what options/support are available
    • If you don’t have that conversation, it puts them in a trapped spot
  • If they’re dissatisfied, if they’re bored, not happy with the company’s direction
    • Ignoring that will not make it get better
  • You have to address it.
  • It can be really uncomfortable to bring it up.
  • Similar to sales, at some point you have to ask someone to buy something.
  • “Hey, uhhh, you’re not doing awesome”
  • Maybe you don’t walk out of the room with a resolution
  • But you have to at least start out and build from there
  • This is important for you, the individual, and the team

  • By definition, as a manager, everyone’s problems are your problems

Learn about and be inclusive

  • There are entire categories of issues and problems that people have to go through, that I didn’t even know existed
  • If you don’t go out of your way to learn about what experiences people have had, you won’t be able to empathize with them
  • http://geekfeminism.wikia.com
  • http://opensourcebridge.org Technical conference, but has a non-technical track
  • http://joinfundclub.com Helps fund programs that do inclusivity work

You’re on the front line of mental health

  • There is still a big stigma regarding mental health
  • If someone lost their glasses and couldn’t see, you wouldn’t tell them to squint harder
  • You should be able to guide people to the correct support
  • http://osmihelp.org Challenges for mental health in the tech industry
  • http://mentalhealthfirstaid.org Program started in Australia in the late 2000s, but has spread worldwide. Often run by the local health department.
  • JIT delivery is not good enough. You need to be proactive or you will be too late.

Conclusions/Resources

  • Management can be scary and tricky
  • But you can get back up and do better
  • http://manager-tools.com
  • http://bit.ly/rands-slack (Really high volume)

Q/A

  • Q: We’re doubling the size of our team soon. What are the benchmark levels of team size as to when you need managers? A: The initial trigger for management is when you can’t keep up with what everyone on the team is doing. If you’re doing 1-on-1s, having more than 5 direct reports is tough.
  • Q: Is your technical skillset starting to atrophy? If so, how do you guide a team of engineers? A: Like anything, it’s not strictly binary. I’m not coding, but I’m still doing architecture, code review, etc. And side projects. For my career, the main value that I was adding was seeing how things fit together. It’s OK for the people that you manage to be smarter than you and to know more about the tech than you do. As for getting rusty, maybe I need to look at more docs, but that comes back with practice.
  • Q: Have you ever had team members keep different working hours, and have you had to set a standard? A: Yeah, I have people in New York who work “San Francisco hours”. It gets tough, because there are questions about coordination and everything, but you can’t answer that by yourself, because it’s also a question of company culture. The main thing that I focus on is whether people are collaborating effectively. As long as they own the collaboration and that stays efficient, then I’m OK with it.
  • Q: Top tips for dealing with other managers (especially non-technical)? A: I come from the perspective that people are fulfilling their job until there is evidence to the contrary. If you are having trouble working with other groups or other managers, it’s best to have specific things to talk about. Getting concrete is helpful. Other than that, maybe find mediation.
  • Q: I’m not a manager, but I have an intern and like it. How do I know I’d like to manage a whole team? A: What do you like about your day? And what would it be like if you have 10x that? If you’re bugging out about missing out on coding, then more may not be good.
  • Q: Difference between a mentor and a manager, a manager can actually do something about your problems. A: I try to manage expectations. Make sure that my manager knows about what’s happening on my team and makes sure that they know what you need.
  • Q: Do you have suggestions for creating a balance of responsibilities, where you don’t have to be the technical expert? A: I had a baby and have had less time for taking the lead. Other folks on the team have stepped up, but I needed to first find out who was interested in taking a technical lead and who was not. Needed to make sure that people who do take the technical lead know that they have the trust and support from me.
  • Q: Full time vs. part time? A: When I’m 50/50, it doesn’t work very well. Context switching sucks and the different things require different time and attention. 50/50 doesn’t last very long… at some point you may have to choose.



Published

31 May 2016

Category

work
