Site Reliability Engineer II - CTJ - Top Secret
Location: Reston
Posted on: June 23, 2025
Job Description:
Do you have a passion for high-scale services and working with
some of Microsoft’s most critical customers? We’re looking for a
Site Reliability Engineer II with the right mix of software
development, online services experience, and passion for quality to
envision, design, and deliver Office 365 government cloud service
offerings. Office 365 is at the center of Microsoft’s cloud-first,
devices-first strategy as it brings together cloud versions of our
most trusted communication and collaboration products like
Exchange, SharePoint, and Teams with our cross-platform desktop
suites and mobile apps. The Office 365 Enterprise Cloud team works
with Microsoft’s largest enterprise and government customers to
deliver features that meet their specific needs and enable cloud
adoption. As you would expect, our customers have the highest
expectations for feature quality, security, reliability,
availability, and performance. The Site Reliability Engineering
(SRE) team provides leadership, direction and accountability for
application architecture, system design, and end-to-end
implementation. As a Site Reliability Engineer, you will identify
and deliver software improvements using your expertise in software
development, complexity analysis, and scalable system design.
Collaboration skills will be required to work closely with other
engineering teams to ensure services/systems are highly stable and
performant, meeting the expectations of our government customers
and users. At Microsoft, we can offer you great teams, exciting
challenges, and a fun place to work. The work environment empowers
you to have a positive impact on millions of end users.

The right candidate for this job:
- Is passionate about distributed systems and working with highly scalable services.
- Enjoys new technological challenges and is motivated to solve them.
- Is excited about making better software and continuously improving the development, integration, and deployment processes.
- Is a smart, highly motivated self-starter who thrives in a bottom-up, fast-paced, highly technical environment.
- Is an effective collaborator, experienced in creating technical partnerships across teams.
- Has an unwavering passion for meeting customer demands and delivering a dial tone service.

Microsoft’s mission is to empower every person and every
organization on the planet to achieve more. As employees we come
together with a growth mindset, innovate to empower others, and
collaborate to realize our shared goals. Each day we build on our
values of respect, integrity, and accountability to create a
culture of inclusion where everyone can thrive at work and beyond.

Qualifications

Required/Minimum Qualifications:
Master’s Degree in Computer Science, Information Technology, or
related field
OR Bachelor’s Degree in Computer Science, Information Technology, or
related field AND 1 year of technical experience in software
engineering, network engineering, or systems administration
OR 4 years of technical experience in software engineering, network
engineering, or systems administration

Other Requirements:

Security Clearance Requirements: Candidates must be able to meet
Microsoft, customer, and/or government security screening
requirements for this role. These requirements include, but are not
limited to, the following specialized security screenings:
Candidates must have an active Top Secret clearance and be willing
to upgrade to TS/SCI (with polygraph). This role will require
candidates to maintain the TS/SCI (with polygraph) clearance.
Failure to maintain or obtain the appropriate clearance and/or
customer screening requirements may result in employment action up
to and including termination.

Microsoft Cloud Background Check: This position will
be required to pass the Microsoft Cloud background check upon
hire/transfer and every two years thereafter.

Clearance Verification: This position requires successful
verification of the stated security clearance to meet federal
government customer requirements. You will be asked to provide
clearance verification information prior to an offer of employment.

Citizenship & Citizenship Verification: This position requires
verification of U.S. citizenship due to citizenship-based legal
restrictions. Specifically, this position supports United States
federal, state, and/or local government agency customers and is
subject to certain citizenship-based restrictions where required or
permitted by applicable law. To meet this legal requirement,
citizenship will be verified via a valid passport, other approved
documents, or a verified U.S. government clearance.

Preferred/Additional Qualifications:
Master’s Degree in Computer Science, Information Technology, or
related field AND 1 year of technical experience in software
engineering, network engineering, or systems administration
OR Bachelor’s Degree in Computer Science, Information Technology, or
related field AND 2 years of technical experience in software
engineering, network engineering, or systems administration
OR 5 years of technical experience in software engineering, network
engineering, or systems administration

Site Reliability Engineering IC3 - The typical base pay range for
this role across the U.S. is USD $100,600 - $199,000 per year. There
is a different range applicable to specific work locations, within
the San Francisco Bay area and New York City metropolitan area, and
the base pay range for this role in those locations is USD $131,400 -
$215,400 per year.

Microsoft will accept applications for the role until June 26, 2025.

Responsibilities

Technical Knowledge and Domain-Specific Expertise
Demonstrates expertise in distributed
systems design, interactions between cloud technology layers and
components, common dependencies at scale, and the code that defines
infrastructures. Can identify and recommend optimal configurations
of cloud technology solutions and modify the code base that defines
systems or cloud technologies to improve the reliability and
operability of supported products with minimal guidance from other
engineers. Develops an understanding of the code, features, and
operations of specific products at scale as required to contribute
to incremental improvements in product availability, reliability,
efficiency, observability, and/or performance; participates in
on-boarding, code/design reviews, and regular meetings with the
engineering teams that develop and/or manage those products.
Researches and maintains an awareness of industry trends, advances
in distributed systems and cloud technologies, new tools, and/or
processes for maintaining and improving product availability,
reliability, efficiency, observability, and/or performance.
Contributes to the implementation of new solutions within their
team by identifying ways they can be applied to solve persistent
problems.

Contributions to Development and Design
Leverages
technical expertise in large scale distributed systems and specific
products, as well as objective insights drawn from analyses of
production telemetry data to suggest changes or add-ons to product
features or code to improve the availability, reliability,
efficiency, observability, and performance of product components or
features supported by their team. Develops and tests basic changes
to optimize code and improve the observability, reliability and
operability of a defined range of platform, system, or product
components or features with direction from other engineers. Engages
with product engineering teams by participating in code/design
reviews, regular meetings, on-call rotations and incident responses
throughout product development and operations cycles; leverages
technical expertise on underlying systems/platforms and insights
drawn from engagements with product engineering teams and telemetry
analyses to propose potential improvements in code base and designs
across components and features of one or more products.

Driving Operational Excellence
Independently develops code or scripts that
automate the performance of repetitive and easily scalable
operations processes (e.g., monitoring, alerting, deploying
products and updates) across components and features of products
operating at scale. Leverages technical expertise and telemetry
analysis across a range of components and/or features to identify
patterns and opportunities to implement configuration and data
changes for one or more platforms, systems, or products in
production using code, tooling, and automation. Identifies
opportunities to leverage existing tools and automation to enable
product engineering teams to increase the velocity at which they
can reliably and safely implement changes in production; monitors
the effects of changes across multiple components or features
within a single platform or system. Designs, develops, and
maintains telemetry pipelines and monitoring tools that detail
operations metrics (e.g., availability, reliability, performance,
efficiency) of product components and features operating at scale.
Independently performs analyses using existing tools and/or models
to identify insights and shares them with product engineering teams
to directly contribute to improvements in product development
and/or operations; monitors the impact of changes on operations
metrics (e.g., Time-to-X). Independently uses existing tools and/or
models to troubleshoot problems or flaws affecting the
availability, reliability, performance, and/or efficiency of
components and features; proposes solutions that will resolve and
prevent recurring issues and brings them to the attention of their
Site Reliability Engineering (SRE) and/or product engineering
teams. Responds to incidents during regular on-call rotations by
identifying the level of impact, troubleshooting issues, and
deploying appropriate fixes to resolve root cause(s); alerts
product teams and owners to major customer impacting issues and
escalates resolution of highly impactful issues affecting multiple
components or features to other engineers or engineering teams as
needed. Shares details related to incidents and their resolution
through post-mortem reports and during regular review meetings.
Develops alerts and instrumentation across components and features
to monitor product capacity and resource demands and analyze
telemetry data using existing capacity planning models; draws
insights from analyses of capacity and resource data to optimize
component and feature code to manage resources and capacity across
a limited range of use conditions and system parameters. Utilizes
insights from performance and resource monitoring tools to identify
whether there is a need to optimize the efficiency of component and
feature code, or if changes to compute resources are required;
models the predicted effect of changes to code and/or compute
resources across components or features to document the efficacy of
proposed solutions. Shares insights and best practices that can be
applied to improve development and operations of system, platform,
or product components and features by participating in code/design
reviews, incident drills and debriefs, and regular meetings, as
well as interactions with more experienced SREs and members of
product engineering teams.

Additional Duties
Design, develop, and
deliver the required software engineering to serve and protect O365
government clouds. Own deployment, availability, reliability,
performance and customer escalation targets for sovereign
environments. Proactively identify and reduce issues through
design, testing, and implementation of software-based solutions.
Collaborate with Engineering and Program Management partners to
translate customer, business, and technical requirements into
architectural designs and feature releases. Drive efficiencies
through software improvement and root cause analysis resulting in
service delivery, maturity, and scalability. Work within a highly
skilled team of engineers to deliver revolutionary improvements to
the cloud and scale them.

Other
Embody our culture and values.