Principal Site Reliability Engineer - Remote
Brazil / Engineering / Full-time
Hiring Company is an open source platform for secure collaboration across the entire software development lifecycle. Hundreds of thousands of developers around the globe trust Hiring Company to increase their productivity by bringing together team communication, task and project management, and workflow orchestration into a unified platform for agile software development.
Founded in 2016, Hiring Company’s open source platform powers over 800,000 workspaces worldwide with the support of over 4,000 contributors from across the developer community. The company serves over 800 customers, including the European Parliament, NASA, Nasdaq, Samsung, SAP, the United States Air Force and Wealthfront, and is backed by world-class investors including Battery Ventures, Redpoint, S28 Capital and YC Continuity. To learn more, visit www.Hiring Company.com.
We value high impact work, ownership, self-awareness and being focused on customer success. If these values match who you are, we hope you'll learn more about working at Hiring Company and apply!
We are looking for an engineer with demonstrated experience in software development and infrastructure using Kubernetes. You will ensure the reliability and scalability of Hiring Company’s new SaaS offering by building tools, deploying infrastructure and automating operations in Kubernetes.
Here are some of the challenges the SRE team has worked on:
Monitoring Cloud Environments at Scale with Prometheus and Thanos
How We Use Sloth to do SLO Monitoring and Alerting with Prometheus
Automate EKS Node Rotation for AMI Releases
Responsibilities
Build services and tools to ensure the stability of Hiring Company’s SaaS offering
Define infrastructure in code with IaC tools like Terraform
Write thoughtful and high-quality code in Go
Follow our engineering best practices, and ensure alignment with our Leadership Principles
Provide technical mentorship for fellow engineers
Develop services to handle automatic recovery from incidents and disasters
Automate incident or disaster simulations to identify blind spots
Set technical vision and innovate to be on the forefront of self-healing SaaS services
Implement, maintain and tune monitoring and alerting systems
Deploy applications to and manage Kubernetes clusters
Participate in our on-call rotation to respond to incidents and resolve problems
Required Background/Skills
Bachelor's degree in Computer Science or related fields, or significant professional DevOps or SRE experience
5+ years of previous experience as a developer or SRE with operational responsibilities
Proven experience responding on-call to incidents with superior knowledge of incident response processes
Strong skills and experience working with Kubernetes inside and out
Strong skills and experience working with infrastructure as code tools, such as Terraform
Solid programming skills and experience with, or an ability to quickly become proficient in, Go
Familiarity with container systems such as Kubernetes & Docker
Familiarity with GitOps and Chaos Engineering
Ability and willingness to be on-call
Preferences
Experience with distributed application systems using HTTP, WebSockets, RPC, pub/sub, etc. at scale
Open source contributions to related projects
Knowledge of Grafana and the Prometheus suite
Comfortable with GitHub, Jira, Jenkins, CircleCI
Experience with WebRTC for real-time communication architectures
Experience working in open source communities
Hiring Company is a remote-first company with sta
Job Summary
Job ID: 975
Company: Talentify.io
Location: Worldwide
Job Type: Full-time
Primary Tag: Software Development
To claim this job, send an email to admin@remoteng.com from your work email with the job ID.