Engineering Manager, SRE

Revenue.io

Sorry, this job was removed at 4:00 p.m. (PST) on Wednesday, September 25, 2019

ringDNA is seeking an Engineering Manager to help build the Site Reliability Engineering (SRE) team.

The SRE team works 24/7 to keep ringDNA and our customers protected. As manager of the SRE team, you are responsible for detecting and resolving production incidents within minutes. This objective is met by monitoring the company’s core services, reacting to problems, and proactively addressing issues before they impact performance, security, or availability. You must be a hands-on leader who can build out the company’s SRE capabilities and team as needed.

What You'll Do

You will use your balance of technical expertise, leadership skills, and managerial experience to build SRE core capabilities for ringDNA and eventually supervise the day-to-day responsibilities of front-line Site Reliability Engineers.

You will set technical direction on incident bridges and marshal resources accordingly. You will ensure that investigations are following appropriate troubleshooting paths, and that monitoring, triage and change execution processes are optimal.

You will drive continuous improvement while streamlining how we run our operations. This will involve building and maintaining strong relationships with connected areas of the business, ensuring the SRE team is a vital stakeholder in any process and procedural enhancements.

The leader in this role must demonstrate a strong focus on engineering and infrastructure operations practices, service ownership, agile leadership and people management skills.

Your day-to-day responsibilities include:

Keep all user-facing services and ringDNA production systems running smoothly 24/7/365
Act in key support roles during major incidents
Lead RCAs and partner with the Engineering and Product Management teams to permanently fix issues
Drive the team to be proactive in diagnostics, detection and configuration of applications
Build competencies in the SRE team to respond to incidents in a timely manner and identify root causes
Work successfully across teams (Engineering, Product Management, QA) by fostering positive, influential relationships
Automate manual and repetitive processes to support SRE objectives
Help to scale infrastructure from a technical and financial planning perspective
Continue to mature the company’s disaster recovery strategy
Fully leverage our existing logging and monitoring services and propose new ones as needed
Lead the evolution of our incident and change management processes

Who You Are

7+ years of Infrastructure Engineering or Operations experience
3+ years managing Site Reliability, NOC, or mixed operations teams, preferably in globally distributed environments
Expertise in AWS and related services
Experience on a 24/7/365 operations team, managing data centers and infrastructure
Passion for teamwork and collaboration, adaptability, communication, problem solving, customer focus, results, and innovation
Strong understanding of enterprise monitoring systems and their administration, such as New Relic and Sumo Logic
Track record of team building, including employee development with experience successfully coaching individuals to achieve goals
Experience with Salesforce
Background in Incident Management and strong understanding of ITIL service operations and Scrum methodologies
Experience designing, developing, debugging, and operating resilient distributed systems
