The Senior Site Reliability Engineer (SRE) plays a pivotal role in ensuring that BlackLine’s services and infrastructure are carefully planned and deployed at the time, in the place, and in the configuration best suited to serving BlackLine’s clients. The SRE role sits at the nexus of capacity planning, technical project execution, product planning, business analysis, site reliability, and software engineering.
The Senior Site Reliability Engineer is responsible for assessing, testing, tracking, predicting, and reporting on all aspects of a suite of production applications related to scalability, performance, responsiveness, capacity, and availability.
Roles and Responsibilities
- Develop and maintain subject matter expertise in BlackLine’s service and infrastructure architecture, operation, and performance characteristics
- Act as a primary resource for the Support organization in responding to customer escalations for performance or availability issues
- Identify and communicate issues or conditions that currently prevent, or may in the future prevent, BlackLine services and infrastructure from performing as needed to meet customer expectations; act to resolve them, including determining the root cause, facilitating development of a solution, and gathering a cross-functional team as needed
- Improve and maintain a continuous metric framework that observes, records, and trends real-time availability data for all of our clients
- Develop and maintain on-premises and cloud capacity plans that ensure we are delivering a BlackLine service that is performant and cost-effective
- Collaborate with development and other technology teams on requirements definition, observability standards, capacity planning, and process refinement
- Improve the BlackLine SaaS service experience by discovering and highlighting optimization opportunities in existing code to address application availability, performance, observability, efficiency, and security challenges.
- Develop tools and systems to automate the identification, analysis, and remediation of application events, infrastructure issues, or requests.
- Establish and maintain Key Performance Indicators (KPIs) for the overall health of the service and build tools to exercise and evaluate whether these KPIs are being met.
- Work cross-functionally with other teams to surface common pain points, architect solutions, establish conventions, and evangelize application development and operations best practices.
- Transform discoveries into requests to others or action items for you and your team.
- Regularly learn new systems and tools as the BlackLine platform and ecosystem evolve.
- Contribute knowledge, skills, and personal qualities to a dedicated team of top engineers solving real-life problems in a bleeding-edge, high-performance, and high-traffic environment.
- Publish performance findings, conclusions, and recommendations
- Produce second-tier analysis of capacity constraint points and performance, and discuss it with the development and infrastructure teams
- Support integration of performance data into customer experience analytics tools and reporting
- Ensure application and infrastructure capacity management efforts have verifiable capacity data to support business cases
- Monitor industry trends and keep abreast of new tools and technologies.
- Participate in our on-call rotation, act as crisis manager/tier 3 technical support for major incidents, and conduct incident reviews
- Other duties as assigned
Years of Experience in Related Field: 5+ Years
Education: BS in Computer Science or equivalent work experience
Technical/Specialized Knowledge, Skills, and Abilities:
- Extensive knowledge of managing cloud platforms and cloud native tools.
- Demonstrated expertise with networking and distributed systems.
- Capable of participating in and leading customer-facing performance evaluations and briefings
- Intermediate knowledge of at least two of the following programming languages: C#, Visual Basic, PowerShell, Java, Go, Linux Shell, Ruby.
- Significant experience in a lead role on a software development or operations team.
- Intermediate-level knowledge of IIS and Windows Server or Linux and Apache. Intermediate-level knowledge of Windows- and Linux-based systems and of automating the management of core kernel and system configurations; experience with Java and Python.
- Intermediate-level knowledge of configuration management tools.
- Experience with container orchestration platforms like Kubernetes.
- Intermediate-level knowledge of deploying and managing observability tools such as Elastic, Kibana, and Prometheus.
- Capable of producing clean, readable code in a multi-developer team environment.
- Energized by a fast-paced, iterative approach.
- Eager to learn and absorb new information.
- Must maintain the highest level of integrity, courtesy and respect while interacting with internal and external customers, employees and business contacts
- Excellent oral and written communication skills
- Ability to interface with internal technical experts using professional interpersonal skills
- Experience analyzing datasets to draw conclusions and graphing the data that supports those conclusions
- Exhibit creative problem-solving, logical troubleshooting and analytical skills
- Basic proficiency in application load balancing methods (F5 LTM, Windows NLB, etc.)
- Working knowledge of TCP/IP and networking concepts
- Proficiency with statistical concepts: confidence intervals, hypothesis testing, and sampling
- Knowledge of operating system concepts such as CPU, memory, and CPU and disk queues, and of graphing and analyzing these metrics over time
- Must possess strong organizational skills and be able to work with minimal oversight
- Ability to understand new technologies quickly and adapt these into daily work and goals
- Prior C#, ASP.NET, Ruby, Go or Java development experience, preferably in an agile SaaS environment.
- Significant experience with open source platforms and technologies.
- Experience with software development processes and methodologies.
- Track record of architecting, developing, and implementing robust, distributed online solutions.