Director of Engineering, Site Reliability
New York, NY
Sapphire Digital is looking for an operational leader with experience building technical teams, scaling management teams, and driving their success while building consensus and alignment throughout the company. This person should have proven success leading a dynamic, passionate team that loves to code, firefight, build in additional resiliency, and get the most out of service incidents.
In this position, you'll be responsible for:
- Directing a team of leaders and building a best-in-class incident response organization based on Site Reliability Engineering principles.
- Establishing best-practice SLIs/SLOs/SLAs to help guide the organization to continuously achieve high standards for platform health.
- Leading from the front lines, aiming to reduce customer pain and increase customer success and satisfaction.
- Conducting incident lifecycle and blameless postmortem exercises to identify resiliency and reliability improvements.
- Creating, maintaining, training, and optimizing an equitable on-call rotation of support personnel who can share the burden of off-hours support while documenting and transferring knowledge and learnings.
- Working closely with leaders in Engineering, Product Management, and Implementation Client Services to align on strategy and lead efforts that identify and reduce toil and deliver reliable, performant products.
- Communicating progress, challenges, and key decisions broadly within the organization.
- Oversight and optimization of AWS infrastructure using configuration management and infrastructure-as-code best practices.
- Triaging, routing, and resolution of issues and incidents identified by both internal and external stakeholders.
- Advising and guiding other organizational teams with a focus on automation, maintainability, reliability, performance, and security.
- Leading, advising, and analyzing load and performance testing exercises to identify performance bottlenecks and breakpoints, and determine infrastructure needs accordingly.
- Measurement, monitoring, and reporting of availability, latency, and overall system health based on SLIs/SLOs/SLAs.
- Engagement in capacity planning, demand forecasting, software performance analysis, and systems tuning.
- Managing the CI/CD pipeline and migration of client software releases through QA, UAT, and production environments to ensure high-quality, on-time delivery of all dependencies.
- Documentation of tribal knowledge to reduce knowledge silos and reliance on institutional memory to support and maintain reliable systems.
You might be a good fit if you have:
- Current or past experience in an engineering leadership role overseeing Engineering Managers or Team Leads, with a history of delivering reliable products running at scale
- A track record of building high-performing, self-sufficient teams. This could range from hiring the right talent to mentoring and growing it internally.
- A strategic background and comfort leading strategy for the future of the SRE org and the business at large
- Hands-on experience managing operations of large-scale, internet-centric production environments for application or infrastructure services serving tens to millions of end users.
- Deep understanding of Site Reliability Engineering (SRE) philosophy, Chaos Engineering, technologies, platforms and tools, SLI/SLO/SLA management, incident resolution, and automation.
- Demonstrated knowledge of and experience with DevOps/SRE culture
- Experience working with AWS
- Exceptional communication skills.
- Mastery of application, data, and infrastructure architecture disciplines
- Command of architecture, design, and business processes
- Knowledge of industry-wide technology trends and best practices
- Experience with modern programming languages
- Expertise using solid engineering practices to design, code, test, and deliver software via multiple technology stacks.
- Mastery of some infrastructure components (e.g., routing, load balancers, cloud products, container systems, compute, storage).
- A BS/BA degree or equivalent experience
- Hands-on experience with cloud-based technologies and tools, especially in deployment, configuration management, monitoring, and operations, such as Puppet, Monit, Kibana, Datadog, New Relic, Slack, etc.
- Software engineering and/or site reliability engineering experience in one or more of the following languages and technologies: Ruby, Python, Angular or other SPA-based web front-end technologies, and shell scripting (Unix/Linux).
- Developed monitoring tools and log analysis tools to manage operations.
- Managed and/or influenced infrastructure services to ensure application service uptime and user experience.
- Developed and managed operations leveraging key event streaming, messaging and DB services such as RDS, DynamoDB, Amazon SQS, or RabbitMQ
- Prior experience in large-scale internet companies/technologies, where uptime and continuous availability were core to the business
- Worked with other engineering teams to design reusable patterns to deploy to applications, provide governance around adoption, and influence application development teams on roadmaps and designs.
- Identified automation opportunities and partnered with engineering and data teams to implement them, driving down toil and reducing technical debt.
- An understanding of networking and cloud technologies, for example security, load balancing, and network routing protocols
EOE Committed to Diversity