Shopify is a commerce platform with the mission to make commerce better for everyone. We power over one million merchants around the world, and we’re just getting started. At Shopify, we ship on quality instead of time. Our teams deploy new code many times a day, and our production scale is massive. Hundreds of thousands of merchants will see your work within minutes – a tough but incredibly rewarding responsibility.
Shopify has many critical components, and sometimes they fail. Members of our Resiliency Team are the ones ensuring we can get back to green as fast as possible when that happens. This team will be setting the foundation for building and running resilient systems at Shopify. This is a team of engineers with in-depth operational knowledge of the entire Shopify stack, who will act as first responders and leaders during an incident.
Our job is to get to a resolution as quickly as possible and guide teams to build a more resilient Shopify. We will build the tools and systems used to quickly resolve incidents, and will look to automate away the manual toil.
Commerce happens 24/7, and we need to build a team that can respond whenever necessary. We are hiring for a distributed team to provide availability in Singapore (UTC+8).
What You'll Do
- Respond to automated alerts and execute playbooks.
- Manage ongoing incidents, using your understanding of Shopify to involve the right teams and resolve as quickly as possible.
- Clean up the noise in our signals, ensuring we can get an understanding of the system and debug a problem easily.
- Set the standards with teams for building resilient, debuggable systems.
- Ensure we never fail for the same reason twice.
- Follow up each incident to ensure the appropriate action items are in place and prioritized.
- Help teams build tools to automate the toil of on-call duties.
Nice To Have But Not Necessary
- You have experience handling on-call shifts for mission-critical systems.
- You have been responsible for the tools and processes used to debug and correct failures in those systems.
- You reject the idea that on call has to be a terrible, disruptive experience.
- You are a generalist developer who is comfortable with multiple languages such as C, Rust, Ruby, and Go.
- You have done hands-on development with cloud infrastructure (AWS, GCE, Azure, Kubernetes, Docker).
- You have handled multiple on-call shifts and have navigated more than one incident through to the retrospective process.
- You have experience working with a variety of open-source software, including nginx, Redis, Memcached, and MySQL.
- You have familiarity with network and web protocols, from IP to HTTP.