Chess.com

Senior SRE - Distributed Systems & Cloud Infrastructure

Posted 11 Hours Ago
Remote
Hiring Remotely in USA
Senior level
The role involves architecting and optimizing cloud-native services, deep performance tuning of applications, incident response, and managing large-scale distributed systems.
The summary above was generated by AI

About Us


Chess.com is one of the largest gaming sites in the world and the #1 platform for playing, learning, and enjoying chess.


We are a team of 600+ fully remote people in 60+ countries working hard to serve the global chess community. We support 200M+ chess players worldwide with the best possible product, content, and tools!


We are a tech company. A gaming company. A content company. And we do it all with passion and commitment to the game. Above all, we prize our mission-driven, flat, life-celebrating, no-corporate culture, and we look forward to meeting you and learning more about what you can bring to the team.



About You

  • You’re a passionate member of the Chess.com community, with an acute understanding of our users and their needs.
  • You have advanced expertise in distributed systems and several years of experience integrating and optimizing cloud-native services using Kubernetes, Golang, and TypeScript at scale.
  • You excel at deep-diving into both application code and core system internals to optimize performance and architect robust solutions.
  • You thrive in globally distributed teams, are humble and humorous, and take strong ownership of your work.
  • You’re enthusiastic about tackling the complexities of high-traffic, data-intensive environments and are eager to push the limits of infrastructure reliability and scalability for Chess.com.



What You'll Do

Architect & Optimize Infrastructure:

  • Lead the design and optimization of cloud-native services using Kubernetes, Terraform, and GitOps tools like ArgoCD.
  • Develop high-performance integration patterns and manage scalable, distributed systems handling extensive data volumes.

Deep Performance Tuning:

  • Dive into Golang and TypeScript codebases to identify and resolve performance bottlenecks at scale.
  • Optimize infrastructure and application code to achieve aggressive performance and reliability targets, with a focus on chess programming at the bit level.

Collaboration & Best Practices:

  • Work closely with development teams to refine cloud service integration architectures and implement best practices.
  • Monitor and enhance system reliability and performance through effective collaboration and innovative solutions.

Incident Response & Operational Excellence:

  • Participate in incident response for critical infrastructure issues, ensuring rapid resolution and minimal downtime.
  • Drive improvements in infrastructure reliability, scalability, and operational efficiency.

Infrastructure & Automation:

  • Utilize Terraform and Kubernetes to manage and scale our cloud infrastructure, ensuring robust, automated deployment processes.



Required Skills

High-Scale Cloud Operations:

  • 5+ years of experience managing and scaling large-scale, cloud-native distributed systems.
  • Deep understanding of Kubernetes, Terraform, and GitOps practices.
  • Expertise in observability practices and the ability to support incident response and on-call rotations.

Advanced Development in Golang:

  • Extensive experience in high-performance service development with Golang.
  • Proven ability to profile and optimize applications for high throughput and reliable operation.

Distributed Systems Expertise:

  • Strong knowledge of distributed systems design, failure modes, and robust architectural principles.
  • Experience with data modeling and indexing strategies to support efficient service operations.

Performance Optimization:

  • Demonstrated experience improving system reliability and performance through deep code-level and architectural analysis.

Communication & Collaboration:

  • Excellent written and verbal communication skills.
  • Experience working in globally distributed teams.



Preferred Skills

Chess Programming:

  • Experience in chess programming, including bit-level manipulations and optimizations.
  • C/C++ experience.

Observability & Cloud Practices:

  • Familiarity with modern observability tools and practices.
  • Hands-on experience with Kubernetes and cloud-native workflows.



About the Opportunity

  • This is a full-time opportunity
  • We are 100% remote (work from anywhere!)

You can learn more about us here:

  • https://www.chess.com/article/view/how-chess-com-virtual-team-works-together
  • https://www.chess.com/about

Top Skills

ArgoCD
GitOps
Go
Kubernetes
Terraform
TypeScript
