CISO series: Lessons learned from the Microsoft SOC—Part 3a: Choosing SOC tools

The Lessons learned from the Microsoft SOC blog series is designed to share our approach and experience with security operations center (SOC) operations. Our learnings in the series come primarily from Microsoft’s corporate IT security operation team, one of several specialized teams in the Microsoft Cyber Defense Operations Center (CDOC).

Over the course of the series, we’ve discussed how we operate our SOC at Microsoft. In the last two posts, Part 2a: Organizing people and Part 2b: Career paths and readiness, we discussed how to support our most valuable resources—people—based on successful job performance.

We’ve also included lessons learned from the Microsoft Detection and Response Team (DART) to help our customers respond to major incidents, as well as insights from the other internal SOC teams.

For a visual depiction of our SOC philosophy, download our Minutes Matter poster. To learn more about our security operations, watch CISO Spotlight Series: The people behind the cloud.

As part of Cybersecurity Awareness Month, today’s installment focuses on the technology that enables our people to accomplish their mission. We share our current approach to technology, how our tooling evolved over time, and what we learned along the way. We hope you can use what we learned to improve your own security operations.

Our strategic approach to technology

Ultimately, the role of technology in a SOC is to help empower people to better contain risk from adversary attacks. Our design for the modern enterprise SOC has moved away from the classic model of relying primarily on alerts generated by static queries in an on-premises security information and event management (SIEM) system. The volume and sophistication of today’s threats have outpaced the ability of this model to detect and respond to threats effectively.

We also found that augmenting this model with disconnected point solutions led to additional complexity and didn’t necessarily speed up analysis, prioritization, orchestration, and execution of response actions.

Selecting the right technology

Every tool we use must enable the SOC to better achieve its mission and provide meaningful improvement before we invest in purchasing and integrating it. Each tool must also meet rigorous requirements for the sheer scale and global footprint of our environment and the top-shelf skill level of the adversaries we face, as well as efficiently enable our analysts to provide high quality outcomes. The tools we selected support a range of scenarios.

In addition to enabling first-line responders to rapidly remediate threats, we must also enable deep subject matter experts in security and data science to reason over immense volumes of data as they hunt for highly skilled and well-funded nation-state-level adversaries.

Making the unexpected choice

Even though many of the tools we currently use are made by Microsoft, they still must meet our stringent requirements. All SOC tools—no matter who makes them—are strictly vetted and we don’t hesitate to reject tools that don’t work for our purposes. For example, our SOC rejected Microsoft’s Advanced Threat Analytics tool because of the infrastructure required to scale it up (despite some promising detection results in a pilot). Its successor, Azure Advanced Threat Protection (Azure ATP), solved this infrastructure challenge by shifting to a SaaS architecture and is now in active use daily.

Our SOC analysts work with Microsoft engineering and third-party tool providers to drive their requirements and provide feedback. As an example, our SOC team has a weekly meeting with the Windows Defender ATP team to review learnings and findings, request features or changes, share engineering progress on requested features, and share attacker research from both teams. Even today, as we roll out Azure Sentinel, our SOC is actively working with the engineering team to ensure key requirements are met, so we can fully retire our legacy SIEM (more details below). Additionally, we regularly invite engineers from our product groups to join us in the SOC to learn how the technology is applied by our experts.

History and evolution to broad and deep tooling

Microsoft’s Corporate IT SOC protects a cross-platform environment with a significant population of Windows, Linux, and Mac devices running a variety of Microsoft and non-Microsoft software. This environment is approximately 95 percent hosted in the cloud today. The tooling used in this SOC has evolved significantly over the years, starting from the classic model centered around an on-premises SIEM.

Phase 1—Classic on-premises SIEM-centric model

This is the common model in which all event data is fed into an on-premises SIEM, where analytics (primarily static queries refined over time) are performed on the data.

We experienced a set of challenges that we now view as natural limitations of this model. These challenges included:

Overwhelming event volume—High volume and growth (on the scale of 20+ billion events a day currently) exceeded the capacity of the on-premises SIEM to handle it.
Analyst overload and fatigue—The static rulesets generated excessive amounts of false positive alerts that led to alert fatigue.
Poor investigation workflow—Investigation of events using the SIEM was clunky and required manual queries and manual steps when switching between tools.
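
To make the static-ruleset limitation concrete, here is a minimal Python sketch of what a static query might look like. The event fields, account names, and threshold are all hypothetical and are not taken from our SIEM. The rule fires on a fixed count of failed logons and has no way to tell a misconfigured service account from an attacker.

```python
from collections import Counter

# Hypothetical fixed threshold, chosen only for illustration.
FAILED_LOGON_THRESHOLD = 5

def static_failed_logon_rule(events):
    """Return accounts whose failed-logon count meets the fixed threshold."""
    failures = Counter(e["account"] for e in events if e["type"] == "logon_failure")
    return sorted(acct for acct, count in failures.items() if count >= FAILED_LOGON_THRESHOLD)

# Made-up events: a service account retrying an expired password (benign)
# and a user who mistyped a password twice.
events = (
    [{"account": "svc-backup", "type": "logon_failure"}] * 6
    + [{"account": "alice", "type": "logon_failure"}] * 2
    + [{"account": "alice", "type": "logon_success"}]
)

alerts = static_failed_logon_rule(events)
print(alerts)
```

In this made-up data the only alert is for the benign service account, which is the kind of false positive that accumulates into alert fatigue when many such rules run at scale.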

Phase 2—Bolster on-premises SIEM weaknesses with cloud analytics and deep tools

We introduced three strategic shifts designed to address the shortcomings of the classic model:

1. Cloud-based log analytics—To address the SIEM scalability challenges discussed previously, we introduced cloud data lake and machine learning technology to more efficiently store and analyze events. This took pressure off our legacy SIEM and allowed our hunters to embrace the scale of cloud computing to apply advanced techniques like machine learning to reason over the data. We were early adopters of this technology before many current commercial offerings had matured, so we ended up with several “generations” of custom technology that we later had to reconcile and consolidate (into the Log Analytics technology that now powers Azure Sentinel).

Lesson learned: “Good enough” and “supported” is better than “custom.”

Adopt commercial products if they meet at least the “Pareto 80 percent” of your needs, because supporting custom implementations (and rationalizing them later) takes resources and effort away from hunting and other core mission priorities.

2. Specialized high-quality tooling—To address analyst overload and poor workflow challenges, we tested and adopted specialized tooling designed to:

Produce high quality alerts (versus high quantity of detailed data).
Enable analysts to rapidly investigate and remediate compromised assets.

It is hard to overstate the benefits of this incredibly successful integration of technology. These tools had a powerful positive impact on our analyst morale and productivity, driving significant improvements in our SOC’s mean time to acknowledge (MTTA) and mean time to remediate (MTTR).

We attribute much of the success of these tools to the direct real-world input that was used to design them.
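
As a rough illustration of how these two metrics can be computed, the Python sketch below averages elapsed time across a handful of made-up incident records. The field names and timestamps are hypothetical, and measuring MTTR from incident creation (rather than from acknowledgement) is an assumption; use whichever definition your SOC has standardized on.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records; field names and times are illustrative only.
incidents = [
    {"created": datetime(2019, 10, 1, 9, 0),
     "acknowledged": datetime(2019, 10, 1, 9, 4),
     "remediated": datetime(2019, 10, 1, 9, 30)},
    {"created": datetime(2019, 10, 1, 11, 0),
     "acknowledged": datetime(2019, 10, 1, 11, 6),
     "remediated": datetime(2019, 10, 1, 11, 50)},
    {"created": datetime(2019, 10, 2, 14, 0),
     "acknowledged": datetime(2019, 10, 2, 14, 2),
     "remediated": datetime(2019, 10, 2, 14, 40)},
]

def mean_minutes(records, start_field, end_field):
    """Average elapsed minutes between two timestamps across all records."""
    return mean(
        (r[end_field] - r[start_field]).total_seconds() / 60 for r in records
    )

# Assumption: both metrics are measured from the moment the incident is created.
mtta = mean_minutes(incidents, "created", "acknowledged")
mttr = mean_minutes(incidents, "created", "remediated")
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Tracking both numbers over time, rather than as a one-off snapshot, is what makes it possible to see whether a tooling change actually helped.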

SOC—The engineering group spent approximately 18-24 months with our SOC team focused on learning about SOC analyst needs, thought processes, pain points, and more while designing and building the first release of Windows Defender ATP. These teams still stay in touch weekly.
DART team—The engineering group directly integrated analysis and hunting techniques that DART developed to rapidly find and evict advanced adversaries from customers.

Here’s a quick summary of the key tools. We’ll share more details on how we use them in our next blog:

Endpoint—Microsoft Defender ATP is the default starting point for analysts for almost any investigation (regardless of the source of the alert) because of its powerful visibility and investigation capabilities.
Email—Office 365 ATP’s integration with Office 365 Exchange Online helps analysts rapidly find and remove phishing emails from mailboxes. The integration with Microsoft Defender ATP and Azure ATP enables analysts to handle common cases extremely quickly, which led to growth in our analyst caseload (in a good way ☺).
Identity—Integrating Azure ATP helped complete the triad of the most attacked/utilized resources (Endpoint-Email-Identity) and enabled analysts to smoothly pivot across them (and added some useful detections too).
We also added Microsoft Cloud App Security and Azure Security Center to provide high-quality detections and improve the investigation experience.

Even before adding the Automated investigations technology (originally acquired from Hexadite), we found that Microsoft Defender ATP’s Endpoint Detection and Response (EDR) solution increased our SOC’s efficiency to the point where investigation team analysts could start doing more proactive hunting part-time (often by sifting through lower priority alerts from Microsoft Defender ATP).

Lesson learned: Enable rapid end-to-end workflow for common Email-Endpoint-Identity attacks.

Ensure your technology investments optimize the analyst workflow to detect, investigate, and remediate common attacks. Microsoft Defender ATP and its connected tools (Office 365 ATP, Azure ATP) were a game changer in our SOC and enabled us to consistently remediate these attacks within minutes. This is our number one recommendation to SOCs, as it helped with:

Commodity attacks—Efficiently dispatch (a high volume of) commodity attacks in the environment.

Targeted attacks—Mitigate the impact of advanced attacks by severely limiting the time attack operators have to laterally traverse and explore, hide, set up command/control (C2), etc.

3. Mature case management—To further improve analyst workflow challenges, we transitioned the analyst’s primary queue to our case management service hosted by a commercial SaaS provider. This further reduced our dependency on our legacy SIEM (primarily hosting legacy static analytics that had been refined over time).

Lesson learned: Single queue

Regardless of the size and tooling of your SOC, it’s important to have a single queue and to govern its quality.

This can be implemented as a case management solution, the alert queue in a SIEM, or as simply as the alert list in the Microsoft Threat Protection tool for smaller organizations. Having a single place to go for reactive analysis and ensuring that place produces high-quality alerts are key enablers of SOC effectiveness and responsiveness. As a complement to the quality piece, you should also have a proactive hunting activity to ensure that attacker activities are not lost in high-noise detections.
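
The single-queue idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of our case management service: the feed names, alert fields, and severity ranking are all hypothetical. Alerts from separate tools are pushed into one priority queue, so analysts always pull the most severe, oldest alert next.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical severity ranking; a lower rank is handled first.
SEVERITY_RANK = {"high": 0, "medium": 1, "low": 2}

@dataclass(order=True)
class QueuedAlert:
    rank: int
    created: str  # ISO 8601 UTC timestamp, which sorts chronologically as text
    source: str = field(compare=False)
    title: str = field(compare=False)

def build_queue(*feeds):
    """Merge alerts from any number of feeds into one priority queue."""
    queue = []
    for feed in feeds:
        for alert in feed:
            heapq.heappush(queue, QueuedAlert(
                SEVERITY_RANK[alert["severity"]],
                alert["created"], alert["source"], alert["title"]))
    return queue

def drain(queue):
    """Yield alerts in the order an analyst would work them."""
    while queue:
        yield heapq.heappop(queue)

# Made-up alerts from three hypothetical feeds.
endpoint_feed = [{"source": "endpoint", "severity": "medium",
                  "created": "2019-10-01T09:05:00Z", "title": "Suspicious script execution"}]
email_feed = [{"source": "email", "severity": "high",
               "created": "2019-10-01T09:10:00Z", "title": "Phishing email reported"},
              {"source": "email", "severity": "low",
               "created": "2019-10-01T09:00:00Z", "title": "Bulk mail spike"}]
identity_feed = [{"source": "identity", "severity": "high",
                  "created": "2019-10-01T09:02:00Z", "title": "Unusual sign-in"}]

work_order = [a.title for a in drain(build_queue(endpoint_feed, email_feed, identity_feed))]
print(work_order)
```

Governing quality would sit in front of a queue like this, for example by reviewing which sources are allowed to feed it; that step is omitted from the sketch.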

Phase 3—Modernize SIEM to cloud native

Our current focus is the transition of the remaining SIEM functions from our legacy capability to Azure Sentinel.

We’re now focused on refining our tool strategy and architecture into a model designed to optimize both breadth (unified view of all events) and depth capabilities. The specialized high-quality tooling (depth tooling) works great for monitoring the “front door” and some hunting but isn’t the only tooling we need.

We’re now in the early stages of operating Microsoft’s Azure Sentinel technology in our SOC to completely replace our legacy on-premises SIEM. This task is a bit simpler for us than most, as we have years of experience using the underlying event log analysis technology that powers Azure Sentinel (Azure Monitor technology, which was previously known as Azure Log Analytics and Operations Management Suite (OMS)).

Our SOC analysts have also been contributing heavily to Azure Sentinel and its community (queries, dashboards, etc.) to share what we have learned about adversaries with our customers.

Learn more details about this SOC and download slides from the CISO Workshop.

Lesson learned: Side-by-side transition state

Based on our experience and conversations with customers, we expect transitioning to cloud analytics like Azure Sentinel will often include a side-by-side configuration with an existing legacy SIEM. This could include a:

Short-term transition state—For organizations that are committed to rapidly retiring a legacy SIEM in favor of Azure Sentinel (often to reduce cost/complexity) and need operational continuity during this short bridge period.

Medium-term coexistence—For organizations with significant investment in an on-premises SIEM and/or a longer-term plan for cloud migration. These organizations recognize the power of Data Gravity: placing analytics closer to the cloud data avoids the costs and challenges of transferring logs to and from the cloud.

Managing the SOC investigations across the SIEM platforms can be accomplished with reasonable efficiency using either a case management tool or the Microsoft Graph Security API (synchronizing Alerts between the two SIEM platforms).
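
To illustrate what synchronizing alerts between two platforms involves, here is a minimal Python sketch of the reconciliation logic alone. It does not call the Microsoft Graph Security API or any SIEM; the dictionaries stand in for whatever each platform returns, and the alert IDs, status values, and the "newest update wins" rule are all assumptions made for the example.

```python
def reconcile(legacy, cloud):
    """Merge two alert stores keyed by alert ID; the newest update wins."""
    merged = {}
    for alert_id in legacy.keys() | cloud.keys():
        candidates = [a for a in (legacy.get(alert_id), cloud.get(alert_id))
                      if a is not None]
        # ISO 8601 UTC timestamps compare correctly as plain strings.
        merged[alert_id] = max(candidates, key=lambda a: a["updated"])
    return merged

def pending_writes(current, merged):
    """Alerts a platform must create or update to match the merged state."""
    return {i: a for i, a in merged.items() if current.get(i) != a}

# Made-up alert state on each side (hypothetical IDs and status values).
legacy = {
    "A1": {"status": "new", "updated": "2019-10-01T09:00:00Z"},
    "A2": {"status": "resolved", "updated": "2019-10-01T10:00:00Z"},
}
cloud = {
    "A2": {"status": "in_progress", "updated": "2019-10-01T09:30:00Z"},
    "A3": {"status": "new", "updated": "2019-10-01T09:45:00Z"},
}

merged = reconcile(legacy, cloud)
to_legacy = sorted(pending_writes(legacy, merged))
to_cloud = sorted(pending_writes(cloud, merged))
print("write to legacy:", to_legacy)
print("write to cloud:", to_cloud)
```

Keeping the reconciliation rule separate from the platform calls makes it easy to change the conflict policy (for example, preferring one platform as the source of truth during a short transition) without touching the integration code.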

Microsoft is continuing to invest in more detailed guidance that documents what we learn from this process, and in refining the technology to support it.

Learn more

To learn more, read previous posts in the “Lessons learned from the Microsoft SOC” series.

Also, see our full CISO series.

Watch the CISO Spotlight Series: The people behind the cloud.

For a visual depiction of our SOC philosophy, download our Minutes Matter poster.

Stay tuned for the next segment in “Lessons learned from the Microsoft SOC” where we dive into more of the analyst experience of using these tools to rapidly investigate and remediate attacks. In the meantime, bookmark the Security blog to keep up with our expert coverage on security matters. Also, follow us at @MSFTSecurity for the latest news and updates on cybersecurity.
