Engineering Manager, SRE
ringDNA is seeking an Engineering Manager to help build the Site Reliability Engineering (SRE) team.
The SRE team works around the clock to keep ringDNA and our customers protected. As manager of the SRE team, you are responsible for detecting and resolving production incidents within minutes. This objective is met by monitoring the company’s core services, reacting to problems, and proactively addressing issues before they impact performance, security, or availability. You must be a hands-on leader who can build out the company’s SRE capabilities and team as needed.
What You'll Do
You will use your balance of technical expertise, leadership skills, and managerial experience to build SRE core capabilities for ringDNA and eventually supervise the day-to-day responsibilities of front-line Site Reliability Engineers.
You will set technical direction on incident bridges and marshal resources accordingly. You will ensure that investigations follow appropriate troubleshooting paths, and that monitoring, triage, and change execution processes are optimal.
You will drive continuous improvement while streamlining how we run our operations. This will involve building and maintaining strong relationships with connected areas of the business, ensuring the SRE team is a key stakeholder in any process or procedural enhancement.
The leader in this role must demonstrate a strong focus on engineering and infrastructure operations practices, service ownership, agile leadership and people management skills.
Your day-to-day responsibilities include:
- Keep all user-facing services and ringDNA production systems running smoothly 24/7/365
- Act in key support roles during major incidents
- Lead RCAs and partner with the Engineering and Product Management teams to permanently fix issues
- Drive the team to be proactive in diagnostics, detection and configuration of applications
- Build competencies in the SRE team to respond to incidents in a timely manner and identify root causes
- Work successfully across teams (Engineering, Product Management, QA) by fostering positive, influential relationships
- Automate manual and repetitive processes to support SRE objectives
- Help to scale infrastructure from a technical and financial planning perspective
- Continue to mature the company’s disaster recovery strategy
- Fully leverage our existing logging and monitoring services and propose new ones as needed
- Lead the evolution of our incident and change management processes
Who You Are
- 7+ years of Infrastructure Engineering or Operations experience
- 3+ years managing Site Reliability, NOC, or mixed operations teams, preferably in globally distributed environments
- Expertise in AWS and related services
- Experience on a 24/7/365 operations team, managing data centers and infrastructure
- Passion for teamwork and collaboration, adaptability, communication, problem solving, customer focus, results, and innovation
- Strong understanding of enterprise monitoring systems and their administration, such as New Relic and Sumo Logic
- Track record of team building, including employee development with experience successfully coaching individuals to achieve goals
- Experience with Salesforce
- Background in Incident Management and strong understanding of ITIL service operations and Scrum methodologies
- Experience designing, developing, debugging, and operating resilient distributed systems