Our team is growing because our user base is growing - 30% in the last year!
We’re looking to add a Senior Site Reliability Engineer who will bring their passion for optimizing new and existing systems, building infrastructure and eliminating redundant work through testing and automation. You will have the opportunity to solve challenging problems, work with the latest technologies, collaborate with a kind and capable team, and contribute to a one-of-a-kind app.
If you're tired of responding to calls at all hours for preventable issues, you're going to love this team. We focus on stability and reliability to keep one of the largest social networking sites live and enjoyed by 15M active users 24/7 around the world.
Here's a sneak peek at some of what this team is up to:
- 300+ EC2 Instances
- 25k API calls per second
- 30k Chat system actions per second
- Kubernetes CI/CD
In this role, you will work closely with the site reliability team and our engineering leadership to build and maintain a robust, reliable product while proactively looking for improvements to our operations ecosystem.
Location: West Hollywood or Remote from the U.S. or Canada
- Deploy and support the full lifecycle of applications running on Amazon Web Services
- Design and document high-traffic systems using containers, Kubernetes, and technologies like Redis, Consul, Elasticsearch, and RabbitMQ
- Develop effective monitoring and response tools to both identify and address reliability risks using Datadog, PagerDuty, and similar systems
- Engage with Product Engineering teams to triage production outages and carry forward action items
- Build robust systems using auto-scaling and self-healing, and identify risks to keep uptime as high as possible
- Write, test and maintain automation tools in Bash, Python, or other languages
- Practice Security by Design and regular security analysis/auditing of systems
- Work with CI/CD tools to create smooth release and rollback functionality even for complex distributed systems
- B.S. in Computer Science or equivalent experience
- 4 years of experience working in high-volume, large-scale environments
- Ability to perform root cause analysis on stability and performance-related events
- Extensive experience with AWS technologies (preferred: EC2, RDS, S3, EKS, Route53, Pinpoint)
- Experience with large-scale Kubernetes deployments (EKS is a plus)
- Experience with alerting and monitoring systems (Datadog is a plus)
- Experience with writing code around infrastructure automation (Bash, Python, Java)
- Experience with distributed system design, implementation, and capacity planning
- Passionate about testing software and systems
- Strong infrastructure, information, and network security experience
- Experience with Jenkins Pipeline and integrations for CI/CD
- Plus: Experience with database management (MySQL, DynamoDB, Redis)
- Plus: Experience with EFK stack (Elasticsearch, Fluentd, Kibana)
- Plus: Big data experience, including Apache Hive, Apache Airflow, Cloudera, Spark, EMR, Kafka
- 100% covered medical, dental and vision insurance
- Generous Parental Leave
- Flexible Time Off Policy
- Competitive Salaries
- 401(k) Matching
As an equal opportunity employer, we are committed to diversity in the workforce. In accordance with applicable law, we prohibit discrimination against any applicant or employee based on any legally recognized basis, including, but not limited to, race, color, religion, sex, sexual orientation, gender identity, age, national origin or ancestry, physical or mental disability, genetic information, veteran status, uniformed service member status, or any other status protected by federal, state, or local law.
*Recruiting firms that submit resumes to Grindr without first entering into a written contract will not be entitled to any compensation on candidates referred by that firm.