SCOM of the Earth: Replacing Operations Manager with Azure Monitor (Part Two)

In this blog, we continue where we left off in part one, spending a bit more time expanding on the capabilities of Azure Monitor. Specifically, how powerful Log Analytics & KQL can be, saving us huge amounts of time and preventing alert fatigue. If you haven’t already decided whether or not to use SCOM or Azure monitor, head over to the Xello comparison article here.

For now, lets dive in!

Kusto Query Language (KQL) – Not your average query tool.

Easily the biggest change that Microsoft recommends when moving from SCOM to Azure Monitor is to change your alerting mindset. Often organisations get bogged down in resolving meaningless alerts – Azure Monitor enables administrators to query data on the fly, acting on what they know to be bad, rather than what is defined in a SCOM Management Pack. To provide these fast queries, Microsoft developed Kusto Query Language – a big data analytics cloud service optimised for interactive ad-hoc queries over structured, semi-structured, and unstructured data. Getting started is pretty simple and Microsoft have provided cheat-sheets for those of you familiar with SQL or Splunk queries.

What logs do I have?

By default, Azure Monitor will collect and store platform performance data for 30 days. This might be adequate for simple analysis of your virtual machines, but ongoing investigations and detailed monitoring will quickly fall over with this constraint. Enabling extra monitoring is quite simple. Navigate to your work space, select advanced settings, and then data.

From here, you can on board extra performance metrics, event logs and custom logs as required. I’ve already completed this task, electing to on board some Service, Authentication, System & Application events as well as guest level performance counters. While you get platform metrics for performance by default, on-boarding metrics from the guest can be an invaluable tool – Comparing the two can indicate where systems are failing & if you have an underlying platform issue!

Initially, I just want to see what servers I’ve on-boarded so here we run our first KQL Query:

Heartbeat | summarize count() by Computer

A really quick query and an even quicker response! I can instantly see I have two servers connected to my work space, with a count of heartbeats. If I found no heartbeats, something has gone wrong in my on-boarding process and we should investigate the monitoring agent health.

Show me something useful!

While a heartbeat is a good indicator of a machine being online, it doesn’t really show me any useful data. Perhaps I have a CPU performance issue to investigate. How do I query for that?

Perf | where Computer == “svdcprod01.corp.contoso.com” and ObjectName == “Processor” and TimeGenerated > ago(12h) | summarize avg(CounterValue) by bin(TimeGenerated, 1minutes) | render timechart

Looks like a bit, but in reality this query is quite simple. First, I select my Performance data. Next I filter this down. I want data from my domain controller, specifically CPU performance events from the last 12 hours. Once I have my events, I request a 1 minutes summary of the CPU value and push that into a nice time chart! The result?

Using this graph, you can pretty quickly identify two periods when my CPU has spiked beyond a “normal level”. On the left, I spike twice above 40%. On the right, I have a huge spoke to over 90%. Here is where Microsoft’s new monitoring advice really comes into effect – Monitor what you know, when you need it. As this is a lab domain controller, I know it turns on at 8 am every morning. Note there is no data in the graph prior to this time? I also know that I’ve installed AD Connect & the Okta agent – The CPU increases twice an hour as each data sync occurs. With this context, I can quickly pick that the 90% CPU spike is of concern. I haven’t setup an alert for performance yet, and I don’t have to. I can investigate when and if I have an issue & trace this back with data! My next question is – What started this problem?

If you inspect the usage on the graph, you can quickly ascertain that the major spike started around 11:15 – As the historical data indicates this is something new, it’s not a bad assumption that this is something new happening on the server. Because I have configured auditing on my server and elected to ingest these logs, I can run the following query:

SecurityEvent | where EventID == “4688” and TimeGenerated between(datetime(“2019-07-14 1:15:00”) .. datetime(“2019-07-14 1:25:00”))

This quickly returns me out a manageable 75 records. Should I wish, I could probably manually look through this and find my problem. But where is the fun in that? A quick scan reveals that our friend xelloadmin appears to be logged into the server during the specified time frame. Updated Query?

SecurityEvent | where EventID == “4688” and Account contains “xelloadmin” and TimeGenerated between(datetime(“2019-07-14 1:15:00”) .. datetime(“2019-07-14 1:25:00”))

By following a “filter again” approach you can quickly bring large 10,000 row data sets to a manageable number. This is also great for security response, as ingesting a the correct events will allow you to reconstruct exactly what has happened on a server without even logging in!
Thanks to my intelligent filtering, I’m now able to zero in on what appears to be a root cause. It appears that xelloadmin launched two cmd.exe processes less than a second apart, exactly prior to the CPU spike. Time to log in and check!

Sure enough, these look like the culprits! Terminating both process has resulted in the following graph!

Let’s create alerts and dashboards!

I’m sure you’re thinking at this point, that everything I’ve detailed out is after the fact – More importantly, I had to actively look for this data. You’re not wrong to be concerned about this. Again, this is the big change in mindset that Microsoft is pushing with Azure Monitor – Less alerting is better. Your applications are fault tolerant, loosely coupled and scale to meet demand already right?

If you need an alert, make sure it matters first. Thankfully, configuration is extremely simple should you require one!
First, work out your alert criteria- What defines that something has gone wrong? In my case, I would like to know when the CPU has spiked to over a threshold. We can then have a look in the top right of our query window- You should notice a “new alert rule” icon. Clicking this will give you a screen like the following:

The condition is where the magic happens – Microsoft has been gracious enough to provide some pre-canned conditions, and you can write your own KQL should you desire. For the purpose of this blog, we’re going to use a Microsoft rule.

As you can see, this rule is configured to trigger when CPU hits 50% – Our earlier spike thanks to the careless admin would definitely be picked up by this! Once I’m happy with my alert rule, I can configure my actions – Here is where you can integrate to existing tools like ServiceNow, JIRA or send SMS/Email alerts. For my purposes, I’m going to setup email alerts.
Finally, I configure some details about my alert and click save!

Next time my CPU spikes, I will get an email from Microsoft to my specified address and I can begin investigating in almost realtime!

The final, best and easiest way for administrators to get quick insights into their infrastructure is by building a dashboard. This process is extremely simple – Work out your metrics, write your queries and pin the results.

You will be prompted to select your desired dashboard – If you haven’t already created one, you can deploy a new one within your desired resource group! With a properly configured workspace and the right queries, you could easily build a dashboard like the one shown below. For those of you who have Azure Policy in place, please note that custom dashboards deploy to the Central US region by default, and you will need to allow an exception to your policy to create them.

Final Thoughts

If you’ve stuck with me for this entire blog post, thank you! Hopefully by now you’re well aware of the benefits of Azure monitor over System Center Operations Manager. If you missed our other blogs, head on over to Part One or our earlier comparison article! As Always, please feel free to reach out should you have any questions, and stay tuned for my next blog post where I look at replacing System Center Orchestrator with cloud native services!