Source: https://www.reddit.com/r/sysadmin/comments/r6zfv/we_are_sysadmins_reddit_ask_us_anything/
Also available at: https://github.com/yanhan/notes/blob/master/reddit-sysadmins-ama.md
I came across this Reddit AMA a while ago and wanted to jot down some notes on the more interesting stuff I read there. Finally got around to doing it today.
## Stats
- Peak bandwidth: 924.21 Mbit/s. They use Akamai heavily
- Aggregate size of databases: 2.4TB. Seems to be growing a few GB per week
- On the load balancer: ~8K established connections, ~250K in TIME_WAIT (with a very short TIME_WAIT timeout); one way to sample these counts is sketched below
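For reference, here is a minimal sketch (mine, not theirs) of how you could sample those two counts on a Linux load balancer by reading `/proc/net/tcp` directly:

```python
#!/usr/bin/env python3
"""Count sockets per TCP state by reading /proc/net/tcp (a rough sketch,
not Reddit's actual tooling)."""
from collections import Counter

# TCP state codes from the kernel (include/net/tcp_states.h)
STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT"}

def count_states(paths=("/proc/net/tcp", "/proc/net/tcp6")):
    counts = Counter()
    for path in paths:
        try:
            with open(path) as f:
                next(f)  # skip the header line
                for line in f:
                    state = line.split()[3]  # 4th column is the socket state
                    counts[STATES.get(state, state)] += 1
        except FileNotFoundError:
            pass  # e.g. no IPv6 table on this host
    return counts

if __name__ == "__main__":
    for state, n in count_states().most_common():
        print(f"{state}: {n}")
```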
## What they use
- Akamai
- AWS (284 running instances, 161 of which are app servers)
- Puppet
- Ganglia
- Zenoss
- RabbitMQ
- MCollective
- Central memcached servers (with pylibmc). Each app server also has a small local memcached instance for very local caching that cannot suffer network latency (see the sketch after this list)
- rsyslog
- Log consolidation: rsyslog with RELP module
- Hadoop (for in-house data warehouse)
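Since pylibmc is mentioned, here is a rough sketch of that two-tier caching idea: a central memcached pool plus a tiny local instance on each app server. The hostnames, TTLs and read-through helper are my own illustration, not their actual code:

```python
import pylibmc

# Hypothetical hosts: the central pool lives on dedicated cache servers,
# the local instance runs on the app server itself (no network hop).
central = pylibmc.Client(["cache01:11211", "cache02:11211"], binary=True)
local = pylibmc.Client(["127.0.0.1:11211"], binary=True)

def get_cached(key, compute, local_ttl=5, central_ttl=300):
    """Read-through cache: local memcached first, then the central pool,
    then recompute. TTLs are made-up illustrative values."""
    value = local.get(key)
    if value is not None:
        return value
    value = central.get(key)
    if value is None:
        value = compute()
        central.set(key, value, time=central_ttl)
    # Keep a short-lived copy on this app server to avoid network latency.
    local.set(key, value, time=local_ttl)
    return value
```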
## Interesting stuff
- They use HAProxy on EC2 instances instead of ELB. Total 8 instances
- ELB is essentially HAProxy with an API. There is limited control over the instance size backing an ELB, and it is initially set to a very small instance
- ELB load balancing is done via round-robin DNS. When one of the backing instances crashes, any DNS responses cached around the Internet keep pointing at the dead instance. A lot of devices/software/ISPs still cache DNS incorrectly (see the resolution sketch after this list)
- ELB would be useful if it had:
- Static VIP support; round-robin DNS alone is not acceptable
- Granular control over instance size that backs ELB
- More rule functionality in load balancing. Very limited compared to HAProxy
- At one point, Postgres replication issues were taking down the site very often.
- These were due to EBS failures. They had to log in and start addressing replication immediately to prevent really bad breakages
- Upgrading to Postgres 9 and moving away from EBS took care of it
- When they took Reddit down during SOPA protest, they had to prepare for severe amount of immediate load because everyone knew the site was coming back online
- So they could not do anything that would cause the caching layers to clear; otherwise the site would have fallen flat on its face when it came back online
- Load testing: real users
- They do not have a load testing infra that can replicate user traffic
- At every place one of them has worked, one of the most difficult problems has been simulating load properly. With dynamic services like reddit, it takes a lot of work to develop a suitable load simulator
- Non-logged-in traffic hits Akamai's cache
- Security focus: ensuring evildoers cannot get into the app and do evil things. Since they are only hosting web traffic, the infra has a very small number of attack vectors, which are under decent security controls
- Most common attack: people trying to 'DDOS' them by scraping one URL over and over again
- For async stuff, RabbitMQ is used (see the publish sketch after this list). For instance:
- Votes
- Comment tree recomputing
- New comments
- Thumbnailer
- Search engine updates
- IPv6: Akamai supports it and takes most of the burden off them
- They keep a close eye on the request rate hitting the infrastructure and on real-time stats from Google Analytics
- Worst downtime: https://redditblog.com/2011/03/17/why-reddit-was-down-for-6-of-the-last-24-hours/
- Silliest downtime: running `iptables -t nat -L` to check rules on the primary load balancer. This loads all the iptables modules, including conntrack. The conntrack table immediately filled up and took the site down for a few seconds
- Servers are patched as necessary. They subscribe to all security alert notification lists
- Backup strategies: encrypt and send to S3 (see the backup sketch after this list). There is also one backup Postgres server to which everything from every database cluster is written (for more real-time backup needs)
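On the round-robin DNS point: a quick way to see what an ELB-style hostname actually hands out is to resolve it repeatedly and compare the A records. The hostname below is a made-up example, not a real ELB:

```python
import socket

def a_records(hostname):
    """Return the set of IPv4 addresses a hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})

# Run this a few times: the addresses rotate and change over time, which is
# exactly what clients with stale DNS caches fail to notice.
print(a_records("my-elb-123456.us-east-1.elb.amazonaws.com"))
```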
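On the RabbitMQ usage: a minimal sketch of publishing an async job (e.g. a vote) with pika. The queue name, message shape and broker host are assumptions for illustration, not reddit's actual schema:

```python
import json
import pika

# Hypothetical broker host and queue name.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
channel.queue_declare(queue="vote_q", durable=True)

def enqueue_vote(user_id, thing_id, direction):
    """Publish a vote event; a worker consumes it asynchronously so the
    web request can return immediately."""
    body = json.dumps({"user": user_id, "thing": thing_id, "dir": direction})
    channel.basic_publish(
        exchange="",
        routing_key="vote_q",
        body=body,
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message
    )

enqueue_vote(42, "t3_abc123", 1)
connection.close()
```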
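On the backup strategy: a sketch of "encrypt and send to S3" using pg_dump, gpg and boto3. The bucket, GPG recipient and paths are placeholders, not their real setup:

```python
import datetime
import subprocess

import boto3

BUCKET = "example-db-backups"          # hypothetical bucket name
GPG_RECIPIENT = "backups@example.com"  # hypothetical GPG key

def backup_database(dbname):
    """Dump a Postgres database, encrypt the dump with GPG, upload it to S3."""
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M%S")
    dump_path = f"/tmp/{dbname}-{stamp}.dump"
    enc_path = dump_path + ".gpg"

    # pg_dump custom format (-Fc) is compressed by default
    subprocess.run(["pg_dump", "-Fc", "-f", dump_path, dbname], check=True)

    # Encrypt before the data leaves the box
    subprocess.run(["gpg", "--encrypt", "--recipient", GPG_RECIPIENT,
                    "--output", enc_path, dump_path], check=True)

    boto3.client("s3").upload_file(enc_path, BUCKET,
                                   f"postgres/{dbname}/{stamp}.dump.gpg")

if __name__ == "__main__":
    backup_database("reddit")
```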
## Challenges
- Starting from scratch on a lot of stuff
- Bottlenecks constantly popping up. Fix one bottleneck and the increased throughput introduces multiple new bottlenecks
- Cannot touch memcached boxes. Reheating them will be very painful
- At their scale, they must make heavy use of caching whenever possible. Hence shutting everything down and starting everything back up is a painful process
- Need to engineer a clean way to reheat caches without having users hit the site
- One idea is to replay access logs against the front-end hosts (see the sketch after this list)
- Another idea is to send increasing amounts of real traffic. Say every 1 in 4 requests gets to somewhere other than the maintenance page
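A minimal sketch of the log-replay idea: walk an access log and re-issue the GET requests against a front-end host before letting users back in. The host, log path and combined log format are assumptions:

```python
import re
import requests

# Hypothetical front-end host to warm up; log format assumed to be the
# standard combined/common log format.
FRONTEND = "http://frontend01.internal"
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+"')

def reheat_from_log(log_path, limit=10000):
    """Replay GET requests from an access log so caches are warm before
    real users hit the site again."""
    session = requests.Session()
    sent = 0
    with open(log_path) as log:
        for line in log:
            match = LOG_LINE.search(line)
            if not match:
                continue
            try:
                session.get(FRONTEND + match.group("path"), timeout=5)
            except requests.RequestException:
                pass  # a failed warm-up request is not fatal
            sent += 1
            if sent >= limit:
                break
    return sent

print(reheat_from_log("/var/log/nginx/access.log"))
```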
## Advice
- Spend a lot of time working on your own stuff. E.g., set up a web / database server just for the hell of it.
- Break stuff, rebuild it, repeat
- Find every interesting thing you can do on your home server and try it. Even if you are never going to use it personally.
- If anything breaks or doesn't make sense, don't drop it until you truly understand what is going on
- Avoid adopting any cargo cult mentality at all costs
- If that sounds like an extreme bore, reconsider sysadmin aspirations
- Certs may help you get an interview at some companies and give you leverage for promotions at your current workplace
- But they demonstrate at most a shallow understanding of a system
- If you already know a system inside out, doesn't hurt to spend a small amount of time getting a cert
## Bare metal vs. cloud
- Bare metal:
- Load balancers and database servers will benefit from bare metal
- Plus point: can experiment with new hardware
- Cloud:
- App servers will benefit from cloud
- Plus points: nice to not have to worry about things like networking infra, installing new hardware, ordering new hardware, rack power, etc
## Mistakes they made
- Everything used to be in one security group
## What they were working on
- Automating most infrastructure tasks, such as building out new servers
- Getting the site to run in more than one region. Huge project that will require a lot of work throughout entire stack
## Top comments (1)
This is fascinating. I'd love to see an updated version to see how they're dealing with 2018 traffic and demands five years later. Their platform has gotten way bigger, but infrastructure has also improved dramatically.