SCOM of the Earth: Replacing Operations Manager with Azure Monitor (Part Two)

In this blog, we continue where we left off in part one, spending a bit more time expanding on the capabilities of Azure Monitor – specifically, how powerful Log Analytics and KQL can be, saving us huge amounts of time and preventing alert fatigue. If you haven’t yet decided between SCOM and Azure Monitor, head over to the Xello comparison article here.

For now, let’s dive in!

Kusto Query Language (KQL) – Not your average query tool.

Easily the biggest change Microsoft recommends when moving from SCOM to Azure Monitor is a shift in your alerting mindset. Organisations often get bogged down resolving meaningless alerts – Azure Monitor enables administrators to query data on the fly, acting on what they know to be bad rather than on what a SCOM management pack defines as bad. To provide these fast queries, Microsoft developed the Kusto Query Language (KQL), the language behind its big data analytics service optimised for interactive, ad-hoc queries over structured, semi-structured and unstructured data. Getting started is pretty simple, and Microsoft has provided cheat sheets for those of you familiar with SQL or Splunk queries.
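For readers coming from SQL, the mental model translates almost directly: pick a table, pipe it through filters and aggregations, then optionally render or rank the result. As a minimal sketch (using the standard Event table that onboarded Windows event logs land in), the following lists the five servers producing the most error events over the last day:

Event
| where TimeGenerated > ago(24h) and EventLevelName == "Error"  // only error-level Windows events
| summarize Errors = count() by Computer                        // count errors per server
| top 5 by Errors desc                                          // the five noisiest servers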

What logs do I have?

By default, Azure Monitor will collect and store platform performance data for 30 days. This might be adequate for simple analysis of your virtual machines, but ongoing investigations and detailed monitoring will quickly fall over within this constraint. Enabling extra monitoring is quite simple: navigate to your workspace, select Advanced settings, and then Data.

From here, you can onboard extra performance metrics, event logs and custom logs as required. I’ve already completed this task, electing to onboard some Service, Authentication, System and Application events as well as guest-level performance counters. While you get platform metrics for performance by default, onboarding metrics from inside the guest can be invaluable – comparing the two can indicate where systems are failing and whether you have an underlying platform issue!

Initially, I just want to see which servers I’ve onboarded, so here we run our first KQL query:

Heartbeat | summarize count() by Computer  

A really quick query and an even quicker response! I can instantly see I have two servers connected to my workspace, along with a count of heartbeats for each. If I find no heartbeats, something has gone wrong in my onboarding process and I should investigate the monitoring agent’s health.
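In a larger environment, a sketch like the one below is a handy follow-up for spotting agents that have gone quiet – the ten-minute window is an arbitrary choice, so tune it to how often you expect heartbeats:

Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer  // most recent heartbeat per server
| where LastHeartbeat < ago(10m)                            // anything silent for ten minutes or more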

Show me something useful!

While a heartbeat is a good indicator of a machine being online, it doesn’t really show me any useful data. Perhaps I have a CPU performance issue to investigate. How do I query for that?


Perf
| where Computer == "svdcprod01.corp.contoso.com" and ObjectName == "Processor" and TimeGenerated > ago(12h)
| summarize avg(CounterValue) by bin(TimeGenerated, 1m)
| render timechart

This looks like a lot, but in reality the query is quite simple. First, I select my performance data. Next, I filter it down: I want data from my domain controller, specifically processor counters from the last 12 hours. Once I have my events, I summarise the counter value into one-minute averages and push the result into a nice time chart. The result?

[Screenshot: CPU performance time chart produced by the query]

Using this graph, you can pretty quickly identify two periods where my CPU has spiked beyond a “normal” level. On the left, I spike twice above 40%. On the right, I have a huge spike to over 90%. Here is where Microsoft’s new monitoring advice really comes into effect – monitor what you know, when you need it. As this is a lab domain controller, I know it turns on at 8 am every morning – notice there is no data in the graph prior to that time. I also know that I’ve installed AD Connect and the Okta agent – the CPU increases twice an hour as each data sync occurs. With this context, I can quickly pick that the 90% CPU spike is of concern. I haven’t set up an alert for performance yet, and I don’t have to – I can investigate when and if I have an issue and trace it back with data! My next question is: what started this problem?

If you inspect the usage on the graph, you can quickly ascertain that the major spike started around 11:15. As the historical data shows nothing like it, it’s not a bad assumption that something new has started happening on the server. Because I have configured auditing on my server and elected to ingest these logs, I can run the following query:


SecurityEvent
| where EventID == 4688 and TimeGenerated between (datetime(2019-07-14 1:15:00) .. datetime(2019-07-14 1:25:00))

This quickly returns a manageable 75 records. Should I wish, I could probably look through these manually and find my problem – but where is the fun in that? A quick scan reveals that our friend xelloadmin appears to be logged into the server during the specified time frame. Updated query?

SecurityEvent
| where EventID == 4688 and Account contains "xelloadmin" and TimeGenerated between (datetime(2019-07-14 1:15:00) .. datetime(2019-07-14 1:25:00))

By following this “filter again” approach, you can quickly bring large, 10,000-row datasets down to a manageable number. This is also great for security response, as ingesting the correct events will allow you to reconstruct exactly what has happened on a server without even logging in!
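As an aside, the same narrowing works well as a summary. The hedged sketch below counts process-creation events by account and executable, which makes it easy to spot an account that is suddenly launching far more processes than usual (the one-hour window is arbitrary):

SecurityEvent
| where EventID == 4688 and TimeGenerated > ago(1h)
| summarize Launches = count() by Account, NewProcessName  // who launched what, and how often
| sort by Launches desc                                    // busiest account/process pairs first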
Thanks to this intelligent filtering, I’m now able to zero in on what appears to be a root cause: xelloadmin launched two cmd.exe processes less than a second apart, immediately prior to the CPU spike. Time to log in and check!

Sure enough, these look like the culprits! Terminating both processes has resulted in the following graph!

Let’s create alerts and dashboards!

I’m sure you’re thinking at this point that everything I’ve detailed is after the fact – more importantly, that I had to actively look for this data. You’re not wrong to be concerned about this. Again, this is the big change in mindset that Microsoft is pushing with Azure Monitor – less alerting is better. Your applications are fault tolerant, loosely coupled and scale to meet demand already, right?
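In that spirit, an on-demand sweep can stand in for a whole class of standing CPU alerts – run it when you actually care. A hedged sketch (the 80% threshold and one-day window are arbitrary choices):

Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time" and TimeGenerated > ago(1d)
| summarize MaxCPU = max(CounterValue) by Computer  // worst CPU reading per server in the last day
| where MaxCPU > 80                                 // only servers that breached the example threshold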

If you need an alert, make sure it matters first. Thankfully, configuration is extremely simple should you require one!
First, work out your alert criteria – what defines that something has gone wrong? In my case, I would like to know when the CPU has spiked over a threshold. Then have a look in the top right of the query window – you should notice a “New alert rule” icon. Clicking this will give you a screen like the following:


The condition is where the magic happens – Microsoft has been gracious enough to provide some pre-canned conditions, and you can write your own KQL should you desire. For the purpose of this blog, we’re going to use a Microsoft rule. 
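If you did want to roll your own condition, a log alert is just another KQL query. A minimal sketch might look like the following – the five-minute bin is an arbitrary choice, and the actual threshold and evaluation frequency are configured on the alert rule itself rather than in the query:

Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AggregatedValue = avg(CounterValue) by bin(TimeGenerated, 5m), Computer  // five-minute CPU average per server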


As you can see, this rule is configured to trigger when CPU hits 50% – our earlier spike, courtesy of the careless admin, would definitely be picked up by this! Once I’m happy with my alert rule, I can configure my actions. Here is where you can integrate with existing tools like ServiceNow or Jira, or send SMS/email alerts. For my purposes, I’m going to set up email alerts.
Finally, I configure some details about my alert and click save!

Next time my CPU spikes, I will get an email from Microsoft to my specified address and I can begin investigating in near real time!

The final, best and easiest way for administrators to get quick insights into their infrastructure is by building a dashboard. This process is extremely simple – work out your metrics, write your queries and pin the results.
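As an example of the kind of query worth pinning, the sketch below charts available memory across every onboarded server over the last day, assuming you have onboarded the Memory\Available MBytes guest counter – a nice companion to the CPU chart we built earlier:

Perf
| where ObjectName == "Memory" and CounterName == "Available MBytes" and TimeGenerated > ago(24h)
| summarize avg(CounterValue) by bin(TimeGenerated, 15m), Computer  // 15-minute average per server
| render timechart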

You will be prompted to select your desired dashboard – if you haven’t already created one, you can deploy a new one within your desired resource group. With a properly configured workspace and the right queries, you could easily build a dashboard like the one shown below. For those of you who have Azure Policy in place, note that custom dashboards deploy to the Central US region by default, so you will need to allow an exception to your policy to create them.

[Screenshot: example monitoring dashboard built from pinned queries]

Final Thoughts

If you’ve stuck with me for this entire blog post, thank you! Hopefully by now you’re well aware of the benefits of Azure Monitor over System Center Operations Manager. If you missed our other blogs, head on over to Part One or our earlier comparison article! As always, please feel free to reach out should you have any questions, and stay tuned for my next blog post, where I look at replacing System Center Orchestrator with cloud-native services!

SCOM of the Earth: Replacing Operations Manager with Azure Monitor (Part One)

When I first interviewed at Xello, I was asked what I thought of the System Center suite.

Working primarily with Configuration Manager, I had just started managing my day-to-day with Service Manager and Orchestrator. I saw how versatile the platform was, and I loved it. I had also seen how much extra work can be involved when it is deployed and used poorly.

System Center is without a doubt an all-powerful product, provided you spend the time to implement it well. Your IT team needs to understand the day-to-day operations and have the right people or vendors to pop the hood as required. You need to invest the time; like a lot of Microsoft products, you only get out what you put in.

As I spend more and more time working within the Azure cloud, I find myself less and less in love with System Center. I’ve found I can do nearly anything that System Center does, without the overhead and with less effort. In this technical blog series, I’ll be covering how and why you should replace System Center with Azure native services, step by step.

Blog One? System Center Operations Manager.

What is System Center Operations Manager (SCOM)?

First cab off the rank for this series is System Center Operations Manager, because if we can’t tell what’s going on in our environment, how can we manage it?

For the uninitiated, SCOM is a holistic monitoring solution for your on-premises and cloud environments. Operations Manager has deep integration with Windows services out of the box, and you can extend it using management packs (MPs) to monitor just about anything. The last time I worked on SCOM, I was able to extend it to monitor a mainframe from the early 1990s.

SCOM deployed as a single server or full HA

While you can often get away with a single-server deployment, SCOM is usually deployed in a fully highly available manner. SCOM can also be deployed in a hierarchy like Configuration Manager, extended into Azure, and used across multiple AD forests.

How do we know we’ve succeeded in replacing SCOM?

Before we get into the nitty-gritty, we need to define our success criteria. Personally, I love uptime and cost as measures of success. After all, if I can’t consume my service at any time for a low cost, have I really done a good job?

I’ll apply the generic Azure service level agreement (SLA) of 99.9% uptime as my key requirement – roughly 45 minutes of allowable downtime a month (0.1% of a 30-day month is about 43 minutes).

In order to satisfy this requirement, I estimate I will need a distributed deployment with a minimum of two servers in my management pool. For demonstration purposes, we’ll assume I’m managing 500 servers in my environment. Using the Microsoft SCOM sizing planner and design advice, I think I’ll need three servers: two management servers and one database/reporting server.

My management servers will be A4v2-sized ($667.53 for the pair), with my SQL database server being a D4v3 ($835.91).

Cost for compute? $1503.44 per month. 

For storage, I’ll use the default recommendations for disk size and single disks where possible. 1000GB for my data, with a full year of retention. Three 250GB disks for my operating systems and installations. 

My storage cost? $250.96 per month.

Total cost to beat? 

$1,754.39 per month, or $21,052.68 per annum.

Azure Monitor: Cloud native monitoring that’s cheaper than SCOM

Azure Monitor dashboard example

My replacement of choice for SCOM is Azure Monitor, a cloud-native, serverless monitoring solution. Xello has covered the key differences between SCOM and Azure Monitor previously.

Azure Monitor is integrated into the Microsoft Azure public cloud platform by default, and you’ve probably already used it if you’re an Azure administrator. You’ll have seen it on every virtual machine’s overview page, and as options under each service’s monitoring section.

Under the hood, Azure Monitor uses Log Analytics workspaces to store logs, Kusto Query Language (KQL) to search them, and solutions to add pre-built dashboards and queries.

Azure Monitor pricing can be a bit frustrating to predict in advance, but we are going to give it a crack using the calculator.

Our SCOM deployment was monitoring 500 servers, providing integrated alerting, emails, dashboards, reporting and log warehousing. Assuming we deliver the same and ingest about 0.5 GB of data per VM a month, we are going to start with a $1134.99/month ingestion cost. Storage is free for the first 31 days, and the remaining 334 days set us back approximately $96 a month.

We definitely want to monitor the core metrics of our virtual machines – CPU, RAM and disk. These three alert rules are another $205.96.

Now for my favourite part: The free stuff. 

  • Dashboards? Free.
  • 1000 ITSM alerts? Free.
  • 1000 emails? Free.
  • Push Notifications? 1000 for free.
  • Webhooks? 100,000 for free.

Aside from some metric costs, alerting and monitoring is largely cost-free and integrates with a large number of services out of the box with Azure Monitor. Always a bonus when you’re selling a solution to management!

Total cost?


$1,436.48 per month, or $17,237.40 per annum

In summary, Azure Monitor is definitely cheaper than SCOM in the long-run.

If you’re struggling to understand the differences, Microsoft has an excellent webinar on this process with the top recommendation being to swap out your alerting mindset.

So, what’s the deployment like?

Getting started with Azure Monitor deployment vs SCOM

Getting started with an Azure monitoring deployment is an extremely simple three step process.

It’s also a lot faster than deploying SCOM out of the box, which is a much more tedious process, to say the least.

1. Deploy an OMS Workspace

There aren’t too many options here to be confused about. Simply select a name, location and resource group and you’re on your way. Pricing tiers can no longer be selected; expect an update from Microsoft to remove the option.



2. Onboard your servers

If you’re in Azure, the portal can do all the work for you.


If you are migrating from SCOM, the monitoring agent is the same and can be reconfigured using your workspace ID and key.

3. Set up a data retention policy

This is currently hidden under “Usage and estimated costs”, and you can retain data for up to 730 days within a workspace.


Longer term retention is available, but you can’t query the data on demand.

SCOM vs Azure Monitor: A more cost-effective type of monitoring

Switching to Azure Monitor comes with a change in thinking and a genuine shift in how we manage and approach monitoring.

Treat your monitoring like a SIEM and tune it to the nth degree. If you have more than the free 1,000 alerts a month, you have tuning to do. Getting your alert volume down ensures that your engineers only react when it matters, and to the right signals.

Work to understand your monitoring needs better. I’ve made heavy assumptions for our costings today, but there is a whole host of strategies for deploying long-term alerting and monitoring. Start by answering some basic questions about your business:

  • Are you regulated?
  • Do you have a tiered model for internal systems?
  • Do you require full administrative separation for your log data?
  • Can you collect less data? 

All of these questions help to reduce cost and maintenance effort further, making your life easier. On the last question in particular, a quick look at which machines send the most data is a good place to start – see the sketch below.
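As a hedged example, the query below uses the standard _IsBillable and _BilledSize columns to rank servers by the volume of billable data they sent in the last day. It unions every table in the workspace, so expect it to take a little while on a big workspace:

union withsource = SourceTable *
| where TimeGenerated > ago(1d) and _IsBillable == true
| summarize BillableGB = sum(_BilledSize) / pow(1024, 3) by Computer  // bytes converted to GB, per server
| sort by BillableGB desc                                             // heaviest senders first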

Replacing SCOM with Azure Monitor: Next steps

I hope you enjoyed my comparison of SCOM vs Azure Monitor – and why it’s time to replace System Center Operations Manager with the latest Azure services for lower costs.

Stay tuned for my next blog post, where I’ll work through visualising and analysing my collected logs in a meaningful manner!

Originally Posted on xello.com.au