Site Reliability Engineering Manager
Job Description Summary
We are seeking a talented Site Reliability Engineering Manager to join the fast-moving, innovative CCC ONE product team. We build enterprise-class, hosted solutions that span multiple data centers and public clouds and serve hundreds of thousands of end users. This is a great opportunity for a highly motivated person interested in leading a team responsible for the overall health, performance and operational design of our systems and applications. In this position, you will work alongside DevOps, software developers, database administrators, network engineers, systems engineers and information security in an agile environment, building modern software solutions and infrastructure.
Job Duties
Responsibilities
- Serve as a hands-on manager of a team of software/system engineers
- Own end-to-end availability and performance of key systems and services
- Triage potential application issues received through various channels and work with the appropriate teams to drive them to resolution
- Lead by example, mentor the team and establish credibility through quality technical execution
- Gain and disseminate knowledge of our complex applications
- Deliver application metrics and operational intelligence
- Manage on-call rotations with support from development teams
Qualifications
- 2+ years of management experience leading an engineering team, including technical deep dives into code, networking, operating systems and/or storage
- 5+ years of experience working in an Agile/Scrum development environment
- 5+ years of work experience with Microsoft technologies, preferably focused on .NET and C#
- Proven ability to design and configure monitoring and alerting solutions across multiple systems and services using tools such as Prometheus, Grafana, Kibana and/or Application Insights
- Experience preparing and presenting operational artifacts to senior management
- Experience with DevOps tools (Azure DevOps, Puppet, Chef or Ansible), processes and culture with a focus on automation
- Working knowledge of databases including SQL, indexing and schema design
- Familiarity with the technical considerations involved in designing complex systems at large scale
- Production troubleshooting skills that span systems, networks and code
- Desire to build, grow and improve a team
- Ability to encourage and foster a culture of visibility and transparency across teams
Other Beneficial Skills
- Experience with Microsoft Azure and/or AWS
- Experience with Kubernetes and/or Azure Service Fabric
- Technical knowledge of SQL Server internals with emphasis on query performance
- Experience with queuing frameworks and message brokers