Senior Site Reliability Engineer

Sorry, this job was removed at 11:16 a.m. (PST) on Thursday, April 21, 2022

Job Summary

The Senior Site Reliability Engineer (SRE) plays a pivotal role in ensuring that BlackLine’s services and infrastructure are carefully planned and deployed at the time, in the place, and in the configuration best suited to serving BlackLine’s clients. The SRE role sits at the nexus of capacity planning, technical project execution, product planning, business analysis, site reliability, and software engineering.

The Senior Site Reliability Engineer is responsible for assessing, testing, tracking, predicting, and reporting on all aspects of a suite of production applications from a scalability, performance, responsiveness, capacity, and availability perspective.

Roles and Responsibilities (listed in order of importance)

Develop and maintain subject matter expertise in BlackLine’s service and infrastructure architecture, operation, and performance characteristics
Act as a primary resource for the Support organization in responding to customer escalations for performance or availability issues
Identify and communicate issues or conditions that currently prevent, or may in the future prevent, BlackLine services and infrastructure from performing as needed to meet customer expectations; act to resolve them by determining the root cause, facilitating development of a solution, and gathering a cross-functional team as needed
Improve and maintain a continuous metrics framework that observes, records, and trends real-time availability data for all of our clients
Develop and maintain on-premises and cloud capacity plans that ensure we are delivering a BlackLine service that is performant and cost-effective
Collaborate with development and other technology teams on requirements definition, observability standards, capacity planning, and process refinement
Improve the BlackLine SaaS service experience by discovering and highlighting optimization opportunities with existing code to address application availability, performance, observability, efficiency, and security challenges.
Develop tools and systems to automate the identification, analysis, and remediation of application events, infrastructure issues, or requests.
Establish and maintain Key Performance Indicators (KPIs) for the overall health of the service and build tools to exercise and evaluate whether these KPIs are being met.
Work cross-functionally with other teams to surface common pain points, architect solutions, establish conventions, and evangelize application development and operations best practices.
Transform discoveries into requests to others or action items for you and your team.
Regularly learn new systems and tools as the BlackLine platform and ecosystem evolve.
Contribute knowledge, skills, and personal qualities to a dedicated team of top engineers solving real-life problems in a bleeding-edge, high-performance, and high-traffic environment.
Publish performance findings, conclusions, and recommendations
Create second-tier analysis of capacity constraint points and performance, and discuss it with development and infrastructure teams
Support integration of performance data into customer experience analytics tools and reporting
Ensure application and infrastructure capacity management efforts have verifiable capacity data to support business cases
Monitor industry trends and keep abreast of new tools and technologies.
Participate in our on-call rotation, act as crisis manager/tier 3 technical support for major incidents, and conduct incident reviews
Other duties as assigned

Required Qualifications

Years of Experience in Related Field: 5+ Years

Education: BS in Computer Science or equivalent work experience

Technical/Specialized Knowledge, Skills, and Abilities:

A minimum of five years of experience with a significant subset of the following technologies: GCP, AWS, Azure, Kubernetes, HTML, CSS, XML, SOAP, Ajax, JavaScript, IIS, MSSQL, MySQL, Go, Jenkins, Chef, PowerShell, WMI, Java, Apache, Tomcat, SSL, Docker
Extensive knowledge of managing cloud platforms and cloud native tools.
Demonstrated expertise with networking and distributed systems.
Capable of participating in and leading customer-facing performance evaluations and briefings
Intermediate knowledge of at least two of the following programming languages: C#, Visual Basic, PowerShell, Java, Go, Linux Shell, Ruby.
Demonstrated history of developing or operating production web applications and solid understanding of HTTP(S), HTML, JavaScript, CSS, and XML.
Significant experience in a lead role on a software development or operations team.
Intermediate-level knowledge of IIS and Windows Server or Linux and Apache. Intermediate-level knowledge of Windows- and Linux-based systems and of automating the management of core kernel and system configurations; experience with Java and Python.
Intermediate-level knowledge of configuration management tools.
Experience with container orchestration platforms like Kubernetes.
Intermediate-level knowledge of deploying and managing observability tools such as Elastic, Kibana, Prometheus, etc.
Capable of producing clean, readable code in a multi-developer team environment.
Someone energized by a fast-paced, iterative approach.
Eager to learn and soak in new information.
Must maintain the highest level of integrity, courtesy and respect while interacting with internal and external customers, employees and business contacts
Excellent oral and written communication skills
Ability to interface with internal technical experts using professional interpersonal skills
Experience analyzing datasets to draw conclusions and graphing the data that supports those conclusions
Exhibit creative problem-solving, logical troubleshooting and analytical skills
Basic proficiency in application load balancing methods (F5 LTM, Windows NLB, etc.)
Working knowledge of TCP/IP and networking concepts
Proficiency with statistical concepts: confidence intervals, hypothesis testing, sampling
Understanding of operating system concepts such as CPU, memory, and CPU and disk queues, and of graphing/analyzing these over time
Must possess strong organizational skills and be able to work with minimal oversight
Ability to understand new technologies quickly and adapt these into daily work and goals

Preferred Qualifications

Prior C#, ASP.NET, Ruby, Go or Java development experience, preferably in an agile SaaS environment.
Significant experience with open source platforms and technologies.
Experience with software development processes and methodologies.
Track record of architecting, developing, and implementing robust, distributed online solutions.