Our client, a large Fortune company, is ranked among the best companies in the world to work for. The client fosters a progressive culture, creativity, and a flexible work environment. They use cutting-edge technologies to stay ahead of the curve. Diversity in all its aspects is respected. Integrity, experience, honesty, people, humanity, and a passion for excellence are among the qualities that define this global technology leader.
As a Site Reliability Engineer, you will work in close collaboration with the infrastructure team and take part in the teams' daily work, so that you build a thorough understanding of their systems and strong working relationships. Through automation, process improvement, teaching, coaching, and proactive monitoring, you will build and maintain highly fault-tolerant CI pipelines that incorporate monitoring, alerting, and outage notifications across numerous custom software platforms. This role requires deep familiarity with integrations across numerous hosting systems, deployment methods, and containerized deployments. The position is a mix of strategy, implementation, and hands-on individual-contributor development, so strong development skills are required. You will create custom build and deployment pipeline automation, tailored to each team, to integrate new components and services.
Primary/Essential Duties and Key Responsibilities:
- You will drive impact through technical influence across the organization and play a critical role in the development of the monitoring and deployment infrastructure.
- Writing stress and regression tests to identify the application's breaking points and scalability issues, then following up by creating stories for the development team to address them
- Keeping the staging (STG) and production (PRD) environments stable and communicating their status to all stakeholders
- Making sure the team understands its SLIs, SLOs, and client-facing SLAs
- Troubleshooting existing production issues and collaborating with multiple teams to resolve their underlying causes
- Applying lessons learned from incidents to improve tooling and visibility
- Helping teams weigh the complexity of features, patterns, or decisions against how understandable they are
- Building trustworthy, secure, and reliable infrastructure
- Making the infrastructure easy to integrate with and manage, so that it scales as development and adoption grow
Knowledge, Skills, and Abilities:
- Strong understanding of AWS
- Containerized deployment (Docker, etc.)
- Experience with CI/CD tooling (Jenkins)
- Advanced experience with at least one programming language (Java with the TestNG framework, or Node.js)
- Understanding of Linux and file systems
- Basic knowledge of SAP, SFDC (Salesforce), TIBCO, and Siebel systems
- Design using antifragility patterns
- Cloud deployments
- Microservice patterns and deployment
- Experience with high-availability (HA) and distributed IaaS environments