Director, Platform Reliability
Interac Inc.
Toronto, Ontario, Canada
5d ago

The Director, Platform Reliability is responsible for ensuring system reliability while rapidly delivering technology products.

This role will provide senior technical leadership and guidance across applications, platforms, and infrastructure to improve the design and operation of systems, making them scalable, reliable, and efficient while ensuring the performance and high availability of products and services.

They will help guide development, platform, and operations teams to adopt Platform Reliability Engineering (PRE) principles that will enable highly available and resilient systems.

This role will lead a team of Platform Reliability Engineers who treat the resiliency and stability of production systems as their primary focus, while remaining committed to new features and operational improvements through the application of software engineering practices to operations.

They will guide and mentor their teams to anticipate and respond quickly to outages, continuously improve existing systems and services, and seek new ways of improving reliability.

You’re great at

  • Setting up monitoring systems and application metrics (SLOs), and supervising them to predict and detect failures
  • Providing solutions to Technology Product Delivery teams and other stakeholders for resilience, availability, disaster recovery, and monitoring
  • Providing training programs for both internal and external team members
  • Working with other leaders to determine the prioritization of reliability fixes / improvements
  • Analyzing the current technology environment to detect critical deficiencies and develop solutions for improvement
  • Participating in periodic rotations (when required) with the rest of the PRE team to support and maintain the high availability of systems
  • Leading PRE team in meeting and exceeding SLOs, using best practices around automation, pushing changes that improve reliability and velocity
  • Working collaboratively with other stakeholders in IT & Operations to embed reliability in software platforms and frameworks from the onset of initiatives
  • Instilling the practice of proactive health monitoring, and identification of reliability vulnerabilities
  • Managing reliability of critical infrastructure platforms
  • Improving and maintaining site availability, scalability, service and system performance
  • Developing Linux / Windows Operating System standards around deployment and configuration
  • Proactively identifying incidents and categorizing them by severity, scope and priority, with a focus on resolution
  • Working with other teams to identify and resolve day-to-day issues
  • Tracking metrics to ensure that SRE teams facilitate post-mortems once incidents have been resolved
  • Developing SLO standards around incident practices and policies
  • Minimizing the latency of services across networks
  • Investigating system errors and problems, analyzing bottlenecks in the system at scale, etc.
Who are you?

  • You have 10+ years of experience in an operational role, DevOps, SRE, or Software Engineering
  • You have 10+ years of experience developing applications with an active user base, and deploying to production
  • You have 5+ years of change management experience
  • You must have previous experience managing teams and coordinating internal and external stakeholder groups
  • You have demonstrated success working in varied and complex technical environments involving multiple stakeholders with competing objectives
  • You have direct experience with ITIL processes, Agile ways of working, and DevOps
  • You have previous experience managing multi-disciplinary and cross-functional teams with considerable decision-making autonomy
  • You have a university degree in a relevant field (Computer Science, Business Administration, Engineering) or equivalent combination of education and operations management experience
  • You have strong operational understanding of the payments industry, services and technology
  • You have excellent people management, communication, problem-solving and conflict management skills, a client-focused, customer-driven approach, and the ability to work under pressure
  • You have the ability to investigate complex technical issues and propose alternative solutions and project and production methodologies
  • You have a strong understanding of Agile and Lean methodologies for requirements and design
  • You have an in-depth understanding of CI / CD principles, including deployment pipelines and zero-downtime deployment techniques
  • You have intimate knowledge of IT lifecycle processes and tools
  • You have a deep knowledge of the IT service management process
  • You have knowledge in logging, metrics, and tracing solutions such as Elasticsearch, Grafana, Prometheus, Zipkin
  • You have a strong understanding of security best practices for cloud architecture, applications, and procedures
  • You are proficient with public cloud platforms such as Azure, AWS or Google Cloud Platform
  • You have a superior understanding of containerization, including some experience with Docker
  • You have expertise in cloud architecture and design patterns
  • You have experience with PaaS offerings such as PCF, OpenShift, PKS
  • You have expertise with automation tooling (Jenkins, Terraform, Puppet, Ansible, etc.)
  • You have knowledge of Linux Shell, Windows PowerShell and Python scripting
How we work

    We know that exceptional people have great ideas and are passionate about their work. Our culture encourages excellence and actively rewards contributions with:

    Connection:

    You’re surrounded by talented people every day who are driven by their passion for a common goal.

    Core Values:

    They define us. Living them helps us be the best at what we do.

    Compensation & Benefits:

    Pay is driven by individual and corporate performance, and we provide a multitude of benefits and perks.

    Education:

    To ensure you are the best at what you do, we invest in you.