Niche

Senior Site Reliability Engineer

Posted 2 Hours Ago
Remote
Hiring Remotely in USA
Senior level
The Senior Site Reliability Engineer at Niche will manage cloud infrastructure, oversee incident responses, mentor team members, and promote best practices to ensure reliability across distributed systems and applications.
The summary above was generated by AI

About Niche

Niche is the leader in school search. Our mission is to make researching and enrolling in schools easy, transparent, and free. With in-depth profiles on every school and college in America, 140 million reviews and ratings, and powerful search tools, we help millions of people find the right school for them. We also help thousands of schools recruit more best-fit students by highlighting what makes them great and making it easier to visit and apply.

Niche is all about finding where you belong, and that mission inspires how we operate every day. We want Niche to be a place where people truly enjoy working and can thrive professionally.


About The Role


**We are currently recruiting for this role in Argentina and Brazil only. We will not be considering US-based candidates at this time. All interviews are being held remotely. If there are preparations we can make to help ensure you have a comfortable and positive interview experience, please let us know.**

**Only applications/resumes written in English will be accepted.**


We are looking for an experienced, proactive, and systematic Senior Site Reliability Engineer to join the SRE team at Niche. As a Senior SRE, you will take ownership of reliability outcomes for critical services, lead incident response efforts, and mentor team members while driving improvements across our platform. You will architect scalable solutions, champion reliability best practices, and influence technical decisions across the engineering organization. The engineering focus is on distributed systems, automation, observability, and cloud infrastructure across AWS and GCP environments. You will lead 24/7 on-call rotations, drive incident resolution, and help shape the reliability culture that serves millions of users researching schools and colleges.

 

What You Will Do

  • Own and architect cloud infrastructure across AWS and GCP, including EC2, EKS/Kubernetes, RDS, ElastiCache, S3, and networking components (VPCs, load balancers, DNS), driving improvements that increase reliability and reduce operational burden
  • Lead the design and implementation of secrets management strategies using HashiCorp Vault and other tools, establishing organizational standards for secure configuration management
  • Architect and evolve infrastructure-as-code practices using Terraform, driving adoption of patterns that improve consistency and reduce deployment risk
  • Design and optimize deployment pipelines and CI/CD systems, troubleshoot complex deployment failures with Git and FluxCD, and establish best practices for safe, reliable releases
  • Support database operations including migrations and performance tuning
  • Own Kafka clusters and message queue systems, including architecture decisions, capacity planning, and troubleshooting complex processing issues
  • Participate in 24/7 on-call rotations, responding to alerts, triaging incidents, and coordinating with development teams to resolve production issues
  • Design and implement monitoring, alerting, and observability strategies using Prometheus, Grafana, Sumo Logic, and related tools, establishing organizational standards that catch issues before customers notice them
  • Define and own Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for critical services, balancing business needs with engineering resources
  • Lead blameless post-mortems, write comprehensive incident analyses that teach others, and drive systemic improvements that prevent entire classes of incidents
  • Champion access controls, IAM policies, and security configurations across cloud environments, ensuring infrastructure meets compliance and security requirements
  • Identify and eliminate systemic sources of operational toil by designing automation, building self-service tooling, and improving developer workflows that scale the team's impact
  • Lead AI-assisted automation initiatives to streamline SRE processes, implementing solutions that reduce toil and improve incident response
  • Partner with product development teams as the reliability subject matter expert, providing architecture guidance, production readiness reviews, and proactive capacity planning
  • Mentor and coach SRE team members, helping them develop technical skills and operational judgment through pairing, code review, and incident response shadowing
  • Lead knowledge sharing initiatives, demos, and cross-team collaboration to elevate reliability culture and operational excellence across the engineering organization

During the First Month:

  • Learn about Niche and how we work by meeting with team members across the company in our onboarding meetings
  • Shadow SRE team members to learn about our tech stack (AWS, GCP, Kubernetes, Terraform, Vault), the products we support, and our development standards
  • Gain access to production systems, observability tools, and documentation
  • Begin contributing to bug fixes, documentation improvements, and small infrastructure tasks for initial exposure and impact

Within 3 Months:

  • Gain familiarity with our platform's underlying application stacks, deployment processes, and software development lifecycle
  • Collaborate with SRE team members to implement new features and improvements within our infrastructure
  • Participate in code reviews and provide constructive feedback on infrastructure changes
  • Have the skills and knowledge to help analyze and resolve production issues, becoming a participant in on-call rotations
  • Begin partnering with product development teams to provide platform reliability guidance

Within 6 Months:

  • Continue gaining exposure to critical subsystems including databases, data automation, Kafka, task orchestration, and observability platforms
  • Be an advocate for reliability standards within your product team partnerships, empowering developers to move faster with confidence
  • Contribute to automation initiatives that reduce operational toil and improve team efficiency
  • Support compliance efforts by maintaining security controls and contributing to audit evidence collection

Within 12 Months:

  • Confidently troubleshoot and resolve complex production issues across our distributed systems
  • Identify areas for improvement in our infrastructure, research best practices, and make recommendations to the team
  • Use your growing knowledge of our applications to help developers implement changes that increase reliability
  • Contribute to defining SLIs and SLOs for services and help establish observability coverage
  • Practice and help define what it means to be an SRE at Niche

What We Are Looking For

  • Required:

    • 5+ years of experience with cloud platforms (AWS or GCP) and container orchestration systems (Kubernetes/Docker)
    • Experience with cloud networking concepts and services including VPCs, subnets, security groups, NAT gateways, VPC peering, load balancers, and DNS management (Route 53, Cloud DNS)
    • Strong programming skills in one or more languages (Python, Go, Bash) with demonstrated ability to build automation and tooling
    • Advanced experience with Infrastructure as Code tools (Terraform, Helm, Ansible) including module design and organizational standards
    • Deep understanding of Linux systems administration and networking fundamentals (TCP/IP, DNS, load balancing, distributed systems)
    • Experience with SQL databases (PostgreSQL, MySQL, or SQL Server) including performance tuning and capacity planning
    • Experience designing and operating CI/CD pipelines for reliable software delivery
    • Track record of leading incident response and driving complex issues to resolution
    • Demonstrated ability to mentor engineers and contribute to team technical growth
    • Excellent collaboration and communication skills, with ability to influence technical decisions across teams

  • Preferred:

    • Experience designing and implementing observability strategies using Prometheus, Grafana, Datadog, Sumo Logic, or similar platforms
    • Deep understanding of SRE principles including SLIs, SLOs, error budgets, toil reduction, and reliability engineering practices
    • Experience operating message queue systems (Kafka, RabbitMQ, or similar) at scale
    • Experience with secrets management tools (HashiCorp Vault, AWS Secrets Manager) including design of organizational policies
    • Experience with cloud systems infrastructure design, capacity planning, and cost optimization
    • Interest in leveraging AI and automation tooling (such as MCP servers, agentic workflows, or LLM-assisted operations) to streamline SRE responsibilities
    • Bachelor's degree in Computer Science, a related field, or equivalent experience

Interview Process

Candidate experience is a top priority for our talent and hiring teams. We believe in providing a transparent, authentic, and comprehensive interview process where you have the opportunity to learn about us while we get to know you and your experience. The interview process is outlined here:

  • Phone Screen with Talent Acquisition Partner - 30 Minutes

  • Video Interview with Hiring Manager - 45 Minutes

  • Team Interview - 45-60 Minutes 

  • Leadership Interview - 30 Minutes 



Top Skills

AWS
Bash
Docker
GCP
Git
Go
Grafana
Kafka
Kubernetes
Prometheus
Python
SQL
Sumo Logic
Terraform
