Microsoft 365 (M365) Intelligent Conversation and Communications Cloud (IC3)
Intelligent Conversation and Communications Cloud (IC3) powers billions of real-time customer conversations across Microsoft’s first-party (Teams, Skype) and second-party (Dynamics) solutions.
IC3 enables reliable, high-quality audio/video calling, meeting, and messaging services that work every time, from anywhere, seamlessly across all customer touchpoints.
IC3 makes conversations on our platforms more intelligent in real time, empowering best-in-class productivity tools for the modern workplace, where every call, meeting, or chat makes the next one better.
About the Team
We are the team that powers all messaging scenarios across IC3. We develop some of the largest-scale, business-critical services at Microsoft.
Our services run in every region, serving hundreds of millions of active users and processing billions of messages a day. The microservices we build must be highly scalable, highly available, and extremely performant in geo-redundant, multi-tenant systems, and must honor obligations for data sovereignty, privacy, security, and compliance.
Site Reliability Engineer, IC3 Messaging Services
We are looking for a passionate Site Reliability Engineer to join our team, which manages the planet-scale infrastructure for Messaging Services.
In this role, you'll be responsible for building tools to automate and streamline our processes, and for managing and improving deployment and live-site infrastructure across a portfolio of messaging services.
You will have the opportunity to work with a highly collaborative and fun team in a fast-learning environment.
Key responsibilities:
Design, write, and deliver software to optimize all aspects of deployments (resources / applications) through infrastructure-as-code.
Optimize service releases by improving Azure DevOps release pipelines.
Drive services towards reliable, predictable deployments, achieving better time-to-deploy metrics for services across Microsoft Teams.
Develop safe rollout plans for a portfolio of services to prevent outages.
Build, run, and improve critical service environments in large-scale data centers.
Learn and enhance existing tools, and develop new tools to meet new scale and feature demands, with the aim of reducing manual intervention and improving prevention, detection, and mitigation of service impacts.
Manage worldwide capacity for a portfolio of services to meet usage growth and efficiency requirements.
Coordinate planning and execution with internal engineering teams, business partners and technical leaders across the division.
Influence and collaborate across orgs to bring best practices, architectures, standards, and methods for large-scale distributed systems.
Analyze data and provide operational insights into service reliability and customer experience to Design and Product teams.
Essential qualifications:
BS / BSE in Computer Science, Management Information Systems, or another technical discipline, or equivalent education.
2+ years as a Site Reliability Engineer / Developer working on large-scale / distributed systems.
2+ years implementing / automating using CI/CD tools.
Good knowledge of networking fundamentals and troubleshooting tools.
Proven experience creating distributed systems tools of moderate to high complexity.
Ability to manage and deliver multiple project phases at the same time.
Strong analytical, problem-solving, and organizational skills.
Excellent written and oral communication skills.
Ability to deal with the ambiguity associated with working in a fast-paced and changing environment.
Strong Windows OS / Linux troubleshooting experience.
Preferred qualifications:
3+ years of Azure development experience (ARM templates, Azure Monitor, PowerShell, Kubernetes, Docker, etc.).
2+ years automating builds / releases using YAML.