Title
Site Reliability Engineer
Description
We are looking for a Site Reliability Engineer (SRE) to join our growing technology team. As an SRE, you will be responsible for ensuring the reliability, availability, and performance of our production systems and services. You will work closely with software engineering, DevOps, and infrastructure teams to build and maintain scalable systems that support our business goals.
The ideal candidate will have a strong background in systems engineering, automation, and monitoring. You will be expected to design and implement tools and processes that improve system reliability and reduce manual intervention. Your work will directly impact the user experience by minimizing downtime and ensuring high availability of services.
In this role, you will develop and maintain infrastructure as code, implement monitoring and alerting systems, and participate in incident response and postmortem analysis. You will also contribute to capacity planning, performance tuning, and disaster recovery strategies.
We value a proactive mindset, strong problem-solving skills, and a passion for continuous improvement. If you thrive in a fast-paced environment and enjoy working on complex technical challenges, we encourage you to apply.
This is an excellent opportunity to be part of a dynamic team that is shaping the future of our technology infrastructure. You will have the chance to work with cutting-edge tools and technologies, and to make a meaningful impact on the reliability and performance of our systems.
Responsibilities
- Design, build, and maintain scalable and reliable infrastructure
- Develop and implement monitoring, alerting, and logging systems
- Automate operational tasks to improve system efficiency
- Participate in on-call rotations and incident response
- Conduct root cause analysis and postmortems for system failures
- Collaborate with development teams to improve system architecture
- Ensure high availability and performance of production systems
- Implement security best practices and compliance standards
- Contribute to capacity planning and disaster recovery strategies
- Continuously improve deployment and release processes
Requirements
- Bachelor’s degree in Computer Science or related field
- 3+ years of experience in site reliability or systems engineering
- Proficiency in scripting languages such as Python, Bash, or Go
- Experience with cloud platforms like AWS, GCP, or Azure
- Strong knowledge of Linux/Unix systems
- Familiarity with containerization and orchestration tools (Docker, Kubernetes)
- Experience with CI/CD pipelines and infrastructure as code tools (Terraform, Ansible)
- Understanding of networking, load balancing, and security principles
- Excellent problem-solving and troubleshooting skills
- Strong communication and collaboration abilities
Potential interview questions
- What experience do you have with cloud infrastructure?
- Can you describe a time you resolved a major system outage?
- What monitoring tools have you used in previous roles?
- How do you approach automating repetitive tasks?
- What is your experience with container orchestration platforms?
- How do you ensure system security and compliance?
- Describe your experience with infrastructure as code.
- How do you handle being on-call and responding to incidents?
- What strategies do you use for capacity planning?
- How do you collaborate with development teams to improve reliability?