Director, Platform Reliability
The Director, Platform Reliability is responsible for ensuring system reliability while rapidly delivering technology products.
This role will provide senior technical leadership and guidance across applications, platforms, and infrastructure to improve the design and operation of systems, making them scalable, reliable, and efficient while ensuring performance and high availability of products and services.
They will help guide development, platform, and operations teams to adopt Platform Reliability Engineering (PRE) principles that will enable highly available and resilient systems.
This role will lead a team of Platform Reliability Engineers who will treat the resiliency and stability of the production systems as their primary focus while remaining committed to new features and operational improvements through the application of software engineering practices to operations.
They will guide and mentor their teams to anticipate and respond to outages, continuously improve existing systems and services, and seek new ways of improving reliability.
You’re great at
Setting up monitoring systems and application metrics (SLOs) and supervising them to predict and detect failures
Providing solutions to Technology Product Delivery teams and other stakeholders for resilience, availability, disaster recovery, and monitoring
Providing training programs for both internal and external team members
Working with other leaders to determine the prioritization of reliability fixes and improvements
Analyzing the current technology environment to detect critical deficiencies and develop solutions for improvement
Participating in periodic rotations (when required) with the rest of the PRE team to support and maintain the high availability of systems
Leading the PRE team in meeting and exceeding SLOs, using best practices around automation, and pushing changes that improve reliability and velocity
Working collaboratively with other stakeholders in IT & Operations to embed reliability in software platforms and frameworks from the onset of initiatives
Instilling the practice of proactive health monitoring and identification of reliability vulnerabilities
Managing reliability of critical infrastructure platforms
Improving and maintaining site availability, scalability, service and system performance
Developing Linux / Windows operating system standards around deployment and configuration
Proactively identifying incidents and categorizing them by severity, scope, and priority to drive their resolution
Working with other teams to identify and resolve day-to-day issues
Tracking metrics to ensure that SRE teams facilitate post-mortems once incidents have been resolved
Developing SLO standards around incident practices and policies
Minimizing the latency of services across networks
Investigating system errors and problems, including bottleneck analysis of the system at scale
Who are you?
You have 10+ years of experience in an operational role, DevOps, SRE, or Software Engineering
You have 10+ years of experience developing applications with an active user base, and deploying to production
You have 5+ years of change management experience
You must have previous experience managing teams and coordinating internal and external stakeholder groups
You have demonstrated success working in varied and complex technical environments involving multiple stakeholders with competing objectives
You have direct experience with ITIL processes, Agile ways of working, and DevOps
You have previous experience managing multi-disciplinary and cross-functional teams with considerable decision-making autonomy
You have a university degree in a relevant field (Computer Science, Business Administration, Engineering) or equivalent combination of education and operations management experience
You have strong operational understanding of the payments industry, services and technology
You have excellent people management, communication, problem-solving, and conflict management skills; you are client-focused, customer-driven, and able to work under pressure
You have the ability to investigate complex technical issues and propose alternative solutions, as well as project and production methodologies
You have a strong understanding of Agile and Lean methodologies for requirements and design
You have an in-depth understanding of CI / CD principles, including deployment pipelines and zero-downtime deployment techniques
You have intimate knowledge of IT lifecycle processes and tools
You have a deep knowledge of the IT service management process
You have knowledge of logging, metrics, and tracing solutions such as Elasticsearch, Grafana, Prometheus, and Zipkin
You have a strong understanding of security best practices for cloud architecture, applications, and procedures
You are proficient with public cloud platforms such as Azure, AWS or Google Cloud Platform
You have a superior understanding of containerization and some experience with Docker
You have expertise in cloud architecture and design patterns
You have experience with PaaS offerings such as PCF, OpenShift, and PKS
You have expertise with automation tooling (Jenkins, Terraform, Puppet, Ansible, etc.)
You have knowledge of Linux Shell, Windows PowerShell and Python scripting
How we work
We know that exceptional people have great ideas and are passionate about their work. Our culture encourages excellence and actively rewards contributions with:
You’re surrounded by talented people every day who are driven by their passion for a common goal.
Core Values:
They define us. Living them helps us be the best at what we do.
Compensation & Benefits:
Pay is driven by individual and corporate performance and we provide a multitude of benefits and perks.
To ensure you are the best at what you do, we invest in you.