Capturing your knowledge/intelligence should be SIMPLE

November 22nd, 2014 by Daniele Muscetta

Lately this blog has been very personal. This post is about stuff I do at work, so if you are not one of my IT readers, don't worry.

For my IT readers, an interruptions from guitars and music on this blog to share some personal reflection on OpInsights and SCOM.

SCOM is very powerful. You know I have always been a huge fan of 2007 and worked myself on the 2012 release. But, compared to its predecessor – MOM – in SCOM it has always been very hard to author management packs – multiple tools, a lot of documentation… here we are, more than 6 years later, and the first 2 comments on an old post on the momteam blog still strike me hard every time I read it:

whatever happened to click,click,done?

You would think that things have changed, but SCOM is fundamentally complex, and even with the advances in tooling (VSAE, MPAuthor, etc) writing MPs is still black magic, if you ask some users.

I already blogged about me exporting and MP and converting its event-based alerting rules to OpInsights searches.

Well, writing those alerting rules in SCOM needs a lot of complex XML – you might not need to know how to write it (but you often have to attempt dechipering it) and even if you create rules with a wizard, it will produce a lot of complex XML for you.

In the screenshot below, the large XML chunk that is needed to pick up a specific eventId from a specific log and a specific source: the key/important information is only a small fraction of it, while the rest is ‘packaging’:

image

I want OpInsights to be SIMPLE.

If there is one thing I want the most for this project, is this.

That's why the same rule can now be expressed with a simple filter search in OpInsights, where all you need is just that key information

EventID=1037 Source="Microsoft-Windows-IIS-W3SVC" EventLog=System

and you essentially don't have to care about any sort of packaging nor mess with XML.

Click, click – filters/facets in the UI let you refine your criteria. And your saved searches too. And they execute right away, there is not even a ‘Done’ button to press. You might just be watching those searches pinned to tiles in your dashboard. All it took was identify the three key pieces of info, no complex XML wrapping needed!

Ok, granted – there ARE legitimate, more complex, scenarios for which you need complex data sources/collectors and specialized/well thought data shaping, not just events – and we use those powerful capabilities of the MMA agent in intelligence packs. But at its core, the simple search language and explor-ability of the data are meant to bring back SIMPLE to the modern monitoring world. Help us prioritize what data sources you need first!

PS – if you have no idea what I was talking about – thanks for making it till here, but don’t worry: either you are not an IT person, which means simply ignore this; or – if you are an IT person – go check out Azure Operational Insights!

Operations Manager 2012 SP1 CTP2 is out, and my TechED NA talk (MGT302)

June 16th, 2012 by Daniele Muscetta

As you might have already heard, this has been an amazing week at TechEd North America: System Center 2012 has been voted as the Best Microsoft Product at TechEd, and we have released the Community Technology Preview (CTP2) of all System Center 2012 SP1 components.

I wrote a (quick) list of the changes in Operations Manager CTP2 in this other blog post and many of those are related to APM (formerly AVIcode technology). I have also demoed some of these changes in my session on thursday – you can watch the recording here. I think one of the most-awaited change is support for monitoring Windows Services written in .NET – but there is more than that!

In the talk I also covered a bit of Java monitoring (which is the same as in 2012, no changes in SP1) and my colleague  Åke Pettersson talked about Synthetic Transactions, and how to bring all together (synthetic and APM) in a single new dashboard (also shipping in SP1 CTP2) that gives you a 360 degrees view of your applications. The CTP2 documentation covers both the changes to APM as well as how to light up this new dashboard.

When it comes to synthetics  – I know you have been using them from your own agents/watcher nodes – but to have a complete picture from the outside in (or last mile), we have now also announced the Beta of Global Service Monitoring (it was even featured in the Keynote!) – where essentially we extend your OpsMgr infrastructure to the cloud, and allow you to upload your tests to our Azure-based service and we will run those tests against your Internet-facing applications from our watcher nodes in various datacenters around the globe and feed back the data to your OpsMgr infrastructure, so that you can see how your application is available and responding from those locations. You can sign up for the consumer preview of GSM from the connect site.

Enjoy your beta testing! (Isn’t that what weekends are for, geeks?)

A couple of OpsMgr / APM Posts

January 24th, 2012 by Daniele Muscetta

Just some shameless personal plug here, pointing out that I recently wrote two technical posts on the momteam blog about the APM feature in Operations Manager 2012 – maybe you want to check them out:

  1. APM object model – describes the object model that gets created by the APM Template/Wizard when you configure .NET application monitoring
  2. Custom APM Rules for Granular Alerting – explains how you can leverage management pack authoring techniques to create alerting rules with super-granular criteria’s (building beyond what the GUI would let you do)

Hope you find them useful – if you are one of my “OpsMgr readers” Smile

I have been chosen; Farewell my friends…

July 7th, 2011 by Daniele Muscetta

I have been in Premier Field Engineering for nearly 7 years (it was not even called PFE when I joined – it was just "another type of support"…) and I have to admit that it has been a fun, fun ride: I worked with awesome people and managed to make a difference with our products and services for many customers – directly working with some of those customers, as well as indirectly thru the OpsMgr Health Check program – the service I led for the last 3+ years, which nowadays gets delivered hundreds of times a year around the globe by my other fellow PFEs.

But it is time to move on: I have decided to go thru a big life change for me and my family, and I won't be working as a Premier Field Engineer anymore as of next week.

But don't panic – I am staying at Microsoft!

I have actually never been closer to Microsoft than now: we are packing and moving to Seattle the coming weekend, and on July 18th I will start working as a Program Manager in the Operations Manager product team, in Redmond. I am hoping this will enable me to make a difference with even more customers.

Exciting times ahead – wish me luck!

 

That said – PFE is hiring! If you are interested in working for Microsoft – we have open positions (including my vacant position in Italy) for almost all the Microsoft technologies. Simply visit http://careers.microsoft.com and search on “PFE”.

As for the OpsMgr Health Check, don't you worry: it will continue being improved – I left it in the hands of some capable colleagues: Bruno Gabrielli, Stefan Stranger and Tim McFadden – and they have a plan and commitment to update it to OpsMgr 2012.

OpsMgr Agents and Gateways Failover Queries

December 23rd, 2010 by Daniele Muscetta

The following article by Jimmy Harper explains very well how to set up agents and gateways’ failover paths thru Powershell http://blogs.technet.com/b/jimmyharper/archive/2010/07/23/powershell-commands-to-configure-gateway-server-agent-failover.aspx . This is the approach I also recommend, and that article is great – I encourage you to check it out if you haven’t done it yet!

Anyhow, when checking for the actual failover paths that have been configured, the use of Powershell suggested by Jimmy is rather slow – especially if your agent count is high. In the Operations Manager Health Check tool I was also using that technique at the beginning, but eventually moved to the use of SQL queries just for performance reasons. Since then, we have been using these SQL queries quite successfully for about 3 years now.

But this the season of giving… and I guess SQL Queries can be a gift, right? Therefore I am now donating them as Christmas Gift to the OpsMrg community Smile

Enjoy – and Merry Christmas!

 

--GetAgentForWhichServerIsPrimary
SELECT SourceBME.DisplayName as Agent,TargetBME.DisplayName as Server
FROM Relationship R WITH (NOLOCK) 
JOIN BaseManagedEntity SourceBME 
ON R.SourceEntityID = SourceBME.BaseManagedEntityID 
JOIN BaseManagedEntity TargetBME 
ON R.TargetEntityID = TargetBME.BaseManagedEntityID 
WHERE R.RelationshipTypeId = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceCommunication() 
AND SourceBME.DisplayName not in (select DisplayName 
from dbo.ManagedEntityGenericView WITH (NOLOCK) 
where MonitoringClassId in (select ManagedTypeId 
from dbo.ManagedType WITH (NOLOCK) 
where TypeName = 'Microsoft.SystemCenter.GatewayManagementServer') 
and IsDeleted ='0') 
AND SourceBME.DisplayName not in (select DisplayName from dbo.ManagedEntityGenericView WITH (NOLOCK) 
where MonitoringClassId in (select ManagedTypeId from dbo.ManagedType WITH (NOLOCK) 
where TypeName = 'Microsoft.SystemCenter.ManagementServer') 
and IsDeleted ='0') 
AND R.IsDeleted = '0'


--GetAgentForWhichServerIsFailover
SELECT SourceBME.DisplayName as Agent,TargetBME.DisplayName as Server
FROM Relationship R WITH (NOLOCK) 
JOIN BaseManagedEntity SourceBME 
ON R.SourceEntityID = SourceBME.BaseManagedEntityID 
JOIN BaseManagedEntity TargetBME 
ON R.TargetEntityID = TargetBME.BaseManagedEntityID 
WHERE R.RelationshipTypeId = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceSecondaryCommunication() 
AND SourceBME.DisplayName not in (select DisplayName 
from dbo.ManagedEntityGenericView WITH (NOLOCK) 
where MonitoringClassId in (select ManagedTypeId 
from dbo.ManagedType WITH (NOLOCK) 
where TypeName = 'Microsoft.SystemCenter.GatewayManagementServer') 
and IsDeleted ='0') 
AND SourceBME.DisplayName not in (select DisplayName 
from dbo.ManagedEntityGenericView WITH (NOLOCK) 
where MonitoringClassId in (select ManagedTypeId 
from dbo.ManagedType WITH (NOLOCK) 
where TypeName = 'Microsoft.SystemCenter.ManagementServer') 
and IsDeleted ='0') 
AND R.IsDeleted = '0'


--GetGatewayForWhichServerIsPrimary
SELECT SourceBME.DisplayName as Gateway, TargetBME.DisplayName as Server
FROM Relationship R WITH (NOLOCK) 
JOIN BaseManagedEntity SourceBME 
ON R.SourceEntityID = SourceBME.BaseManagedEntityID 
JOIN BaseManagedEntity TargetBME 
ON R.TargetEntityID = TargetBME.BaseManagedEntityID 
WHERE R.RelationshipTypeId = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceCommunication() 
AND SourceBME.DisplayName in (select DisplayName 
from dbo.ManagedEntityGenericView WITH (NOLOCK) 
where MonitoringClassId in (select ManagedTypeId 
from dbo.ManagedType WITH (NOLOCK) 
where TypeName = 'Microsoft.SystemCenter.GatewayManagementServer') 
and IsDeleted ='0') 
AND R.IsDeleted = '0'
    

--GetGatewayForWhichServerIsFailover
SELECT SourceBME.DisplayName As Gateway, TargetBME.DisplayName as Server
FROM Relationship R WITH (NOLOCK) 
JOIN BaseManagedEntity SourceBME 
ON R.SourceEntityID = SourceBME.BaseManagedEntityID 
JOIN BaseManagedEntity TargetBME 
ON R.TargetEntityID = TargetBME.BaseManagedEntityID 
WHERE R.RelationshipTypeId = dbo.fn_ManagedTypeId_MicrosoftSystemCenterHealthServiceSecondaryCommunication() 
AND SourceBME.DisplayName in (select DisplayName 
from dbo.ManagedEntityGenericView WITH (NOLOCK) 
where MonitoringClassId in (select ManagedTypeId 
from dbo.ManagedType WITH (NOLOCK) 
where TypeName = 'Microsoft.SystemCenter.GatewayManagementServer') 
and IsDeleted ='0') 
AND R.IsDeleted = '0'


--xplat agents
select bme2.DisplayName as XPlatAgent, bme.DisplayName as Server
from dbo.Relationship r with (nolock) 
join dbo.RelationshipType rt with (nolock) 
on r.RelationshipTypeId = rt.RelationshipTypeId 
join dbo.BasemanagedEntity bme with (nolock) 
on bme.basemanagedentityid = r.SourceEntityId 
join dbo.BasemanagedEntity bme2 with (nolock) 
on r.TargetEntityId = bme2.BaseManagedEntityId 
where rt.RelationshipTypeName = 'Microsoft.SystemCenter.HealthServiceManagesEntity' 
and bme.IsDeleted = 0 
and r.IsDeleted = 0 
and bme2.basemanagedtypeid in (SELECT DerivedTypeId 
FROM DerivedManagedTypes with (nolock) 
WHERE BaseTypeId = (select managedtypeid 
from managedtype where typename = 'Microsoft.Unix.Computer') 
and DerivedIsAbstract = 0)

Got Orphaned OpsMgr Objects?

December 17th, 2010 by Daniele Muscetta

Have you ever wondered what would happen if, in Operations Manager, you’d delete a Management Server or Gateway that managed objects (such as network devices) or has agents pointing uniquely to it as their primary server?

The answer is simple, but not very pleasant: you get ORPHANED objects, which will linger in the database but you won’t be able to “see” or re-assign anymore from the GUI.

So the first thing I want to share is a query to determine IF you have any of those orphaned agents. Or even if you know, since you are not able to "see" them from the console, you might have to dig their name out of the database. Here's a query I got from a colleague in our reactive support team:


-- Check for orphaned health services (e.g. agent).
declare @DiscoverySourceId uniqueidentifier;
SET @DiscoverySourceId = dbo.fn_DiscoverySourceId_User();
SELECT TME.[TypedManagedEntityid], HS.PrincipalName
FROM MTV_HealthService HS
INNER JOIN dbo.[BaseManagedEntity] BHS WITH(nolock)
ON BHS.[BaseManagedEntityId] = HS.[BaseManagedEntityId]
-- get host managed computer instances
INNER JOIN dbo.[TypedManagedEntity] TME WITH(nolock)
ON TME.[BaseManagedEntityId] = BHS.[TopLevelHostEntityId]
AND TME.[IsDeleted] = 0
INNER JOIN dbo.[DerivedManagedTypes] DMT WITH(nolock)
ON DMT.[DerivedTypeId] = TME.[ManagedTypeId]
INNER JOIN dbo.[ManagedType] BT WITH(nolock)
ON DMT.[BaseTypeId] = BT.[ManagedTypeId]
AND BT.[TypeName] = N'Microsoft.Windows.Computer'
-- only with missing primary
LEFT OUTER JOIN dbo.Relationship HSC WITH(nolock)
ON HSC.[SourceEntityId] = HS.[BaseManagedEntityId]
AND HSC.[RelationshipTypeId] = dbo.fn_RelationshipTypeId_HealthServiceCommunication()
AND HSC.[IsDeleted] = 0
INNER JOIN DiscoverySourceToTypedManagedEntity DSTME WITH(nolock)
ON DSTME.[TypedManagedEntityId] = TME.[TypedManagedEntityId]
AND DSTME.[DiscoverySourceId] = @DiscoverySourceId
WHERE HS.[IsAgent] = 1
AND HSC.[RelationshipId] IS NULL;

Once you have identified the agent you need to re-assign to a new management server, this is doable from the SDK. Below is a powershell script I wrote which will re-assign it to the RMS. It has to run from within the OpsMgr Command Shell.
You still need to change the logic which chooses which agent – this is meant as a starting base… you could easily expand it into accepting parameters and/or consuming an input text file, or using a different Management Server than the RMS… you get the point.

  1. $mg = (get-managementgroupconnection).managementgroup
  2. $mrc = Get-RelationshipClass | where {$_.name –like "*Microsoft.SystemCenter.HealthServiceCommunication*"}
  3. $cmro = new-object Microsoft.EnterpriseManagement.Monitoring.CustomMonitoringRelationshipObject($mrc)
  4. $rms = (get-rootmanagementserver).HostedHealthService
  5. $deviceclass = $mg.getmonitoringclass(“HealthService”)
  6. $mc = Get-connector | where {$_.Name –like “*MOM Internal Connector*”}
  7. Foreach ($obj in $mg.GetMonitoringObjects($deviceclass))
  8. {
  9.     #the next line should be changed to pick the right agent to re-assign
  10.     if ($obj.DisplayName -match 'dsxlab')
  11.     {
  12.                 Write-host $obj.displayname
  13.                 $imdd = new-object Microsoft.EnterpriseManagement.ConnectorFramework.IncrementalMonitoringDiscoveryData
  14.                 $cmro.SetSource($obj)
  15.                 $cmro.SetTarget($rms)
  16.                 $imdd.Add($cmro)
  17.                 $imdd.Commit($mc)
  18.     }
  19. }

Similarly, you might get orphaned network devices. The script below is used to re-assign all Network Devices to the RMS. This script is actually something I have had even before the other one (yes, it has been sitting in my "digital drawer" for a couple of years or more…) and uses the same concept – only you might notice that the relation's source and target are "reversed", since the relationships are different:

  • the Management Server (source) "manages" the Network Device (target)
  • the Agent (source) "talks" to the Management Server (target)

With a bit of added logic it should be easy to have it work for specific devices.

  1. $mg = (get-managementgroupconnection).managementgroup
  2. $mrc = Get-RelationshipClass | where {$_.name –like "*Microsoft.SystemCenter.HealthServiceShouldManageEntity*"}
  3. $cmro = new-object Microsoft.EnterpriseManagement.Monitoring.CustomMonitoringRelationshipObject($mrc)
  4. $rms = (get-rootmanagementserver).HostedHealthService
  5. $deviceclass = $mg.getmonitoringclass(“NetworkDevice”)
  6. Foreach ($obj in $mg.GetMonitoringObjects($deviceclass))
  7. {
  8.                 Write-host $obj.displayname
  9.                 $imdd = new-object Microsoft.EnterpriseManagement.ConnectorFramework.IncrementalMonitoringDiscoveryData
  10.                 $cmro.SetSource($rms)
  11.                 $cmro.SetTarget($obj)
  12.                 $imdd.Add($cmro)
  13.                 $mc = Get-connector | where {$_.Name –like “*MOM Internal Connector*”}
  14.                 $imdd.Commit($mc)
  15. }

Disclaimer

The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my own personal opinion. All code samples are provided "AS IS" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.

OpsMgr Event IDs Spreadsheet

June 22nd, 2010 by Daniele Muscetta

I work in support (mostly with System Center Operations Manager, as you know), and I work with event logs every day. The following are typical situations:

  1. I get a colleague or a customer telling me “I am having a problem and the SCOM agent is showing 21037 events and 20002 events.  What’s wrong with it?”   
  2. I want to tune an OpsMgr environment and reduce load on the database by turning off a few event collections, as my friend Kevin Holman suggests here http://blogs.technet.com/kevinholman/archive/2009/11/25/tuning-tip-turning-off-some-over-collection-of-events.aspx .
  3. I am analyzing, sorting and grouping Events with Powershell like I have written on my blog lately http://www.muscetta.com/2009/12/16/opsmgr-eventlog-analysis-with-powershell/ but I can’t read those long descriptions properly.
  4. I exported an EVT from a customer environment and I load it on a machine that does not have OpsMgr message DLLs installed – all I see are EventIDs and type (Warning, Error) – but no real description – and I still want to figure out what those events are trying to tell me.

Getting to the point: I, like everyone – don’t have every OpsMgr event memorized.

This is why I thought of building this spreadsheet, and I hope it might come in handy to more people.

The spreadsheet contains an “AllEvents” list – and then the same events are broken down by event source as well:

clip_image002

When you want to search for an events (in one of the situations described above) just open up the spreadsheet, go to the “AllEvents” tab, hit CTRL+F (“Find”) and type in the Event ID you are searching for:

clip_image004

And this will take you to the row containing the event, so you can look up its description:

clip_image006

The description shows the event standard text (which is in the message DLL, therefore is the part you will not see if opening an EVT on another machine that does not have OpsMgr installed), and where the event parameters are (%1, %2, etc – which will be the strings you see in the EVT anyway).

That way you can get an understanding of what the original message would have looked like on the original machine.

This is just one possible usage pattern of this reference. It can also be useful to just read/study the events, learning about new ones you have never encountered, or remembering those you HAVE seen in the past but did not quite remember. And of course you can also find other creative ways to use it.

You can get it from here.

 

A few last words to give due credit: this spreadsheet has been compiled by using Eventlog Explorer (http://blogs.technet.com/momteam/archive/2008/04/02/eventlog-explorer.aspx ) to extract the event information out of the message DLLs on a OpsMgr2007 R2 installation. That info has been then copied and pasted in Excel in order to have an “offline” reference. Also I would like to thank Kevin Holman for pointing me to Eventlog Explorer first, and then for insisting I should not keep this spreadsheet in my drawer, as it could be useful to more people!

Audit Collection Services Database Partitions Size Report

May 5th, 2010 by Daniele Muscetta

A number of people I have talked to liked my previous post on ACS sizing. One thing that was not extremely easy or clear to them in that post was *how* exactly I did one thing I wrote:

[…] use the dtEvent_GUID table to get the number of events for that day, and use the stored procedure “sp_spaceused”  against that same table to get an overall idea of how much space that day is taking in the database […]

To be completely honest, I do not expect people to do this manually a hundred times if they have a hundred partitions. In fact, I have been doing this for a while with a script which will do the looping for me and run that sp_spaceused for me a number of time. I cannot share that script, but I do realize that this automation is very useful, therefore I wrote a “stand-alone” SQL query which, using a couple of temporary tables, produces a similar type of output. I also went a step further and packaged it into a SQL Server Reporting Services Report for everyone’s consumption. The report should look like the following screenshot, featuring a chart and the table with the numerical information about each and every partition in the database:

ACS Partitions Report

You can download the report from here.

You need to upload it to your report server, and change the data source to the shared Data Source that also the built-in ACS Reports use, and it should work.

[NOTE/UPDATE May 4th 2011: This report has a few bugs. I have posted the updated query on http://www.muscetta.com/2011/05/04/improved-acs-partitions-query/ . I am sorry I can't provide a ready made report with the fix right now. Make sure you understand this and don't implement it without testing.]

Enjoy!

A few thoughts on sizing Audit Collection System

March 18th, 2010 by Daniele Muscetta

People were already collecting logs with MOM, so why not the security log? Some people were doing that, but it did not scale enough; for this reason, a few years ago Eric Fitzgerald announced that he was working on Microsoft Audit Collection System. Anyhow, the tool as it was had no interface… and the rest is history: it has been integrated into System Center Operations Manager. Anyhow, ACS remains a lesser-known component of OpsMgr.

There are a number of resources on the web that is worth mentioning and linking to:

and, of course, many more, I cannot link them all.

As for myself, I have been playing with ACS since those early beta days (before I joined Microsoft and before going back to MOM, when I was working in Security), but I never really blogged about this piece.

Since I have been doing quite a lot of work around ACS lately, again, I thought it might be worth consolidating some thoughts about it, hence this post.

Anatomy of an “Online” Sizing Calculation

What I would like to explain here is the strategy and process I go thru when analyzing the data stored in a ACS database, in order to determine a filtering strategy: what to keep and what not to keep, by applying a filter on the ACS Collector.

So, the first thing I usually start with is using one of the many “ACS sizer” Excel spreadsheets around… which usually tell you that you need more space than it really is necessary… basically giving you a “worst case” scenario. I don’t know how some people can actually do this from a purely theoretical point of view, but I usually prefer a bottom up approach: I look at the actual data that the ACS is collecting without filters, and start from there for a better/more accurate sizing.

In the case of a new install this is easy – you just turn ACS on, set the retention to a few days (one or two weeks maximum), give the DB plenty of space to make sure it will make it, add all your forwarders… sit back and wait.

Then you come back 2 weeks later and start looking at the data that has been collected.

What/How much data are we collecting?

First of all, if we have not changed the default settings, the grooming and partitioning algorithm will create new partitioned tables every day. So my first step is to see how big each “partition” is.

But… what is a partition, anyway? A partition is a set of 4 tables joint together:

  1. dtEvent_GUID
  2. dtEventData_GUID
  3. dtPrincipal_GUID
  4. dtSTrings_GUID

where GUID is a new GUID every day, and of course the 4 tables that make up a daily partition will have the same GUID.

The dtPartition table contains a list of all partitions and their GUIDs, together with their start and closing time.

Just to get a rough estimate we can ignore the space used by the last three tables – which are usually very small – and only use the dtEvent_GUID table to get the number of events for that day, and use the stored procedure “sp_spaceused”  against that same table to get an overall idea of how much space that day is taking in the database.

By following this process, I come up with something like the following:

Partition ID Status Partition Start Time Partition Close Time Rows Reserved  KB Total GB
9b45a567_c848_4a32_9c35_39b402ea0ee2 0 2/1/2010 2:00 2/1/2010 2:00 29,749,366 7,663,488 7,484
8d8c8ee1_4c5c_4dea_b6df_82233c52e346 2 1/31/2010 2:00 2/1/2010 2:00 28,067,438 9,076,904 8,864
34ce995b_689b_46ae_b9d3_c644cfb66e01 2 1/30/2010 2:00 1/31/2010 2:00 30,485,110 9,857,896 9,627
bb7ea5d3_f751_473a_a835_1d1d42683039 2 1/29/2010 2:00 1/30/2010 2:00 48,464,952 15,670,792 15,304
ee262692_beae_4d81_8079_470a54567946 2 1/28/2010 2:00 1/29/2010 2:00 48,980,178 15,836,416 15,465
7984b5b8_ddea_4e9c_9e51_0ee7a413b4c9 2 1/27/2010 2:00 1/28/2010 2:00 51,295,777 16,585,408 16,197
d93b9f0e_2ec3_4f61_b5e0_b600bbe173d2 2 1/26/2010 2:00 1/27/2010 2:00 53,385,239 17,262,232 16,858
8ce1b69a_7839_4a05_8785_29fd6bfeda5f 2 1/25/2010 2:00 1/26/2010 2:00 55,997,546 18,105,840 17,681
19aeb336_252d_4099_9a55_81895bfe5860 2 1/24/2010 2:00 1/24/2010 2:00 28,525,304 7,345,120 7,173
1cf70e01_3465_44dc_9d5c_4f3700dc408a 2 1/23/2010 2:00 1/23/2010 2:00 26,046,092 6,673,472 6,517
f5ec207f_158c_47a8_b15f_8aab177a6305 2 1/22/2010 2:00 1/22/2010 2:00 47,818,322 12,302,208 12,014
b48dabe6_a483_4c60_bb4d_93b7d3549b3e 2 1/21/2010 2:00 1/21/2010 2:00 55,060,150 14,155,392 13,824
efe66c10_0cf2_4327_adbf_bebb97551c93 2 1/20/2010 2:00 1/20/2010 2:00 58,322,217 15,029,216 14,677
0231463e_8d50_4a42_a834_baf55e6b4dcd 2 1/19/2010 2:00 1/19/2010 2:00 61,257,393 15,741,248 15,372
510acc08_dc59_482e_a353_bfae1f85e648 2 1/18/2010 2:00 1/18/2010 2:00 64,579,122 16,612,512 16,223

If you have just installed ACS and let it run without filters with your agents for a couple of weeks, you should get some numbers like those above for your “couple of weeks” of analysis. If you graph your numbers in Excel (both size and number of rows/events per day) you should get some similar lines that show a pattern or trend:

Trend: Space user by day

Trend: Number of events by day

So, in my example above, we can clearly observe a “weekly” pattern (monday-to-friday being busier than the weekend) and we can see that – for that environment – the biggest partition is roughly 17GB. If we round this up to 20GB – and also considering the weekends are much quieter – we can forecast 20*7 = 140GB per week. This has an excess “buffer” which will let the system survive event storms, should they happen. We also always recommend having some free space to allow for re-indexing operations.

In fact, especially when collecting everything without filters, the daily size is a lot less predictable: imagine worms “trying out” administrator account’s passwords, and so on… those things can easily create event storms.

Anyway, in the example above, the customer would have liked to keep 6 MONTHS (180days) of data online, which would become 20*180 = 3600GB = THREE TERABYTE and a HALF! Therefore we need a filtering strategy – and badly – to reduce this size.

[edited on May 7th 2010 – if you want to automate the above analysis and produce a table and graphs like those just shown, you should look at my following post.]

Filtering Strategies

Ok, then we need to look at WHAT actually comprises that amount of events we are collecting without filters. As I wrote above, I usually run queries to get this type of information.

I will not get into HOW TO write a filter here – a collector’s filter is a WMI notification query and it is already described pretty well elsewhere how to configure it.

Here, instead, I want to walk thru the process and the queries I use to understand where the noise comes from and what could be filtered – and get an estimate of how much space we could be saving if filter one way or another.

Number of Events per User

–event count by User (with Percentages)
declare @total float
select @total = count(HeaderUser) from AdtServer.dvHeader
select count(HeaderUser),HeaderUser, cast(convert(float,(count(HeaderUser)) / (convert(float,@total)) * 100) as decimal(10,2))
from AdtServer.dvHeader
group by HeaderUser
order by count(HeaderUser) desc

In our example above, over the 14 days we were observing, we obtained percentages like the following ones:

#evt HeaderUser Account Percent
204,904,332 SYSTEM 40.79 %
18,811,139 LOCAL SERVICE 3.74 %
14,883,946 ANONYMOUS LOGON 2.96 %
10,536,317 appintrauser 2.09 %
5,590,434 mossfarmusr

Just by looking at this, it is pretty clear that filtering out events tracked by the accounts “SYSTEM”, “LOCAL SERVICE” and “ANONYMOUS”, we would save over 45% of the disk space!

Number of Events by EventID

Similarly, we can look at how different Event IDs have different weights on the total amount of events tracked in the database:

–event count by ID (with Percentages)
declare @total float
select @total = count(EventId) from AdtServer.dvHeader
select count(EventId),EventId, cast(convert(float,(count(EventId)) / (convert(float,@total)) * 100) as decimal(10,2))
from AdtServer.dvHeader
group by EventId
order by count(EventId) desc

We would get some similar information here:

Event ID Meaning Sum of events Percent
538 A user logged off 99,494,648 27.63
540 Successful Network Logon 97,819,640 27.16
672 Authentication Ticket Request 52,281,129 14.52
680 Account Used for Logon by (Windows 2000) 35,141,235 9.76
576 Specified privileges were added to a user's access token. 26,154,761 7.26
8086 Custom Application ID 18,789,599 5.21
673 Service Ticket Request 10,641,090 2.95
675 Pre-Authentication Failed 7,890,823 2.19
552 Logon attempt using explicit credentials 4,143,741 1.15
539 Logon Failure – Account locked out 2,383,809 0.66
528 Successful Logon 1,764,697 0.49

Also, do not forget that ACS provides some report to do this type of analysis out of the box, even if for my experience they are generally slower – on large datasets – than the queries provided here. Also, a number of reports have been buggy over time, so I just prefer to run queries and be on the safe side.

Below an example of such report (even if run against a different environment – just in case you were wondering why the numbers were not the same ones :-)):Event Counts ACS Default Report

The numbers and percentages we got from the two queries above should already point us in the right direction about what we might want to adjust in either our auditing policy directly on Windows and/or decide if there is something we want to filter out at the collector level (here you should ask yourself the question: “if they aren’t worth collecting are they worth generating?” – but I digress).

Also, a permutation of the above two queries should let you see which user is generating the most “noise” in regards to some events and not other ones… for example:

–event distribution for a specific user (change the @user) – with percentages for the user and compared with the total #events in the DB
declare @user varchar(255)
set @user = 'SYSTEM'
declare @total float
select @total = count(Id) from AdtServer.dvHeader
declare @totalforuser float
select @totalforuser = count(Id) from AdtServer.dvHeader where HeaderUser = @user
select count(Id), EventID, cast(convert(float,(count(Id)) / convert(float,@totalforuser) * 100) as decimal(10,2)) as PercentageForUser, cast(convert(float,(count(Id)) / (convert(float,@total)) * 100) as decimal(10,2)) as PercentageTotal
from AdtServer.dvHeader
where HeaderUser = @user
group by EventID
order by count(Id) desc

The above is particularly important, as we might want to filter out a number of events for the SYSTEM account (i.e. logons that occur when starting and stopping services) but we might want to keep other events that are tracked by the SYSTEM account too, such as an administrator having wiped the Security Log clean – which might be something you want to keep:

Event ID 517 Audit Log was cleared

of course the amount of EventIDs 517 over the total of events tracked by the SYSTEM account will not be as many, and we can still filter the other ones out.

Number of Events by EventID and by User

We could also combine the two approaches above – by EventID and by User:

select count(Id),HeaderUser, EventId

from AdtServer.dvHeader

group by HeaderUser, EventId

order by count(Id) desc

This will produce a table like the following one

SQL Query: Events by EventID and by User

which can be easily copied/pasted into Excel in order to produce a pivot Table:

Pivot Table

Cluster EventLog Replication

One more aspect that is less widely known, but I think is worth showing, is the way that clusters behave when in ACS. I don’t mean all clusters… but if you keep the “eventlog replication” feature of clusters enabled (you should disable it also from a monitoring perspective, but I digress), each cluster node’s security eventlog will have events not just for itself, but for all other nodes as well.

Albeit I have not found a reliable way to filter out – other than disabling eventlog replication altogether.

Anyway, just to get an idea of how much this type of “duplicate” events weights on the total, I use the following query, that tells you how many events for each machine are tracked by another machine:

–to spot machines that are cluster nodes with eventlog repliation and write duplicate events (slow)

select Count(Id) as Total,replace(right(AgentMachine, (len(AgentMachine) – patindex('%\%',AgentMachine))),'$',") as ForwarderMachine, EventMachine

from AdtServer.dvHeader

–where ForwarderMachine <> EventMachine

group by EventMachine,replace(right(AgentMachine, (len(AgentMachine) – patindex('%\%',AgentMachine))),'$',")

order by ForwarderMachine,EventMachine

Cluster Events

Those presented above are just some of the approaches I usually look into at first. Of course there are a number more. Here I am including the same queries already shown in action, plus a few more that can be useful in this process.

I have even considered building a page with all these queries – a bit like those that Kevin is collecting for OpsMgr (we actually wrote some of them together when building the OpsMgr Health Check)… shall I move the below queries on such a page? I though I’d list them here and give some background on how I normally use them, to start off with.

Some more Useful Queries

–top event ids
select count(EventId), EventId
from AdtServer.dvHeader
group by EventId
order by count(EventId) desc

–event count by ID (with Percentages)
declare @total float
select @total = count(EventId) from AdtServer.dvHeader
select count(EventId),EventId, cast(convert(float,(count(EventId)) / (convert(float,@total)) * 100) as decimal(10,2))
from AdtServer.dvHeader
group by EventId
order by count(EventId) desc

–which machines have ever written event 538
select distinct EventMachine, count(EventId) as total
from AdtServer.dvHeader
where EventID = 538
group by EventMachine

–machines
select * from dtMachine

–machines (more readable)
select replace(right(Description, (len(Description) – patindex('%\%',Description))),'$',")
from dtMachine

–events by machine
select count(EventMachine), EventMachine
from AdtServer.dvHeader
group by EventMachine

–rows where EventMachine field not available (typically events written by ACS itself for chekpointing)
select *
from AdtServer.dvHeader
where EventMachine = 'n/a'

–event count by day
select convert(varchar(20), CreationTime, 102) as Date, count(EventMachine) as total
from AdtServer.dvHeader
group by convert(varchar(20), CreationTime, 102)
order by convert(varchar(20), CreationTime, 102)

–event count by day and by machine
select convert(varchar(20), CreationTime, 102) as Date, EventMachine, count(EventMachine) as total
from AdtServer.dvHeader
group by EventMachine, convert(varchar(20), CreationTime, 102)
order by convert(varchar(20), CreationTime, 102)

–event count by machine and by date (distinuishes between AgentMachine and EventMachine
select convert(varchar(10),CreationTime,102),Count(Id),EventMachine,AgentMachine
from AdtServer.dvHeader
group by convert(varchar(10),CreationTime,102),EventMachine,AgentMachine
order by convert(varchar(10),CreationTime,102) desc ,EventMachine

–event count by User
select count(Id),HeaderUser
from AdtServer.dvHeader
group by HeaderUser
order by count(Id) desc

–event count by User (with Percentages)
declare @total float
select @total = count(HeaderUser) from AdtServer.dvHeader
select count(HeaderUser),HeaderUser, cast(convert(float,(count(HeaderUser)) / (convert(float,@total)) * 100) as decimal(10,2))
from AdtServer.dvHeader
group by HeaderUser
order by count(HeaderUser) desc

–event distribution for a specific user (change the @user) – with percentages for the user and compared with the total #events in the DB
declare @user varchar(255)
set @user = 'SYSTEM'
declare @total float
select @total = count(Id) from AdtServer.dvHeader
declare @totalforuser float
select @totalforuser = count(Id) from AdtServer.dvHeader where HeaderUser = @user
select count(Id), EventID, cast(convert(float,(count(Id)) / convert(float,@totalforuser) * 100) as decimal(10,2)) as PercentageForUser, cast(convert(float,(count(Id)) / (convert(float,@total)) * 100) as decimal(10,2)) as PercentageTotal
from AdtServer.dvHeader
where HeaderUser = @user
group by EventID
order by count(Id) desc

–to spot machines that write duplicate events (such as cluster nodes with eventlog replication enabled)
select Count(Id),EventMachine,AgentMachine
from AdtServer.dvHeader
group by EventMachine,AgentMachine
order by EventMachine

–to spot machines that are cluster nodes with eventlog repliation and write duplicate events (better but slower)
select Count(Id) as Total,replace(right(AgentMachine, (len(AgentMachine) – patindex('%\%',AgentMachine))),'$',") as ForwarderMachine, EventMachine
from AdtServer.dvHeader
–where ForwarderMachine <> EventMachine
group by EventMachine,replace(right(AgentMachine, (len(AgentMachine) – patindex('%\%',AgentMachine))),'$',")
order by ForwarderMachine,EventMachine

–which user and from which machine is target of elevation (network service doing "runas" is a 552 event)
select count(Id),EventMachine, TargetUser
from AdtServer.dvHeader
where HeaderUser = 'NETWORK SERVICE'
and EventID = 552
group by EventMachine, TargetUser
order by count(Id) desc

–by hour, minute and user
–(change the timestamp)… this query is useful to search which users are active in a given time period…
–helpful to spot "peaks" of activities such as password brute force attacks, or other activities limited in time.
select datepart(hour,CreationTime) as Hours, datepart(minute,CreationTime) as Minutes, HeaderUser, count(Id) as total
from AdtServer.dvHeader
where CreationTime < '2010-02-22T16:00:00.000'
and CreationTime > '2010-02-22T15:00:00.000'
group by datepart(hour,CreationTime), datepart(minute,CreationTime),HeaderUser
order by datepart(hour,CreationTime), datepart(minute,CreationTime),HeaderUser

SCX Evolutions

July 19th, 2009 by Daniele Muscetta

During the beta of the Cross-Platform extensions and of System Center Operations Manager 2007 R2, the product team had promised to eventually release the SCX Providers'source code.

Now that this promise has been mantained, and the SCX providers have been released on Codeplex at http://xplatproviders.codeplex.com/ it should be finally possible to entirely build your own unsupported agent package, starting from source code, without having to modify the original package as I have shown earlier on this blog.
Of course this will still be unsupported by Microsoft Product support, but will eventually work just fine!
This is an extraordinary event in my opinion, as it is not a common event that Microsoft releases code as open source, especially when this is part of one of the product it sells. I suspect we will see more of this as we going forward.

Also, at R2 release time, some official documentation about buildilng Cross-Plaform Management Packs has been published on Technet.

Anyway, I have in the past posted a number of posts on my blog under this tag http://www.muscetta.com/tag/xplat/ (I will continue to use that tag going forward) which show/describe how I hacked/modified both the existing MPs AND the SCX agent package to let it run on unsupported distributions (and I think they are still useful as they show a number of techniques about how to test, understand and troubleshoot the Xplat agent a bit. In fact, I have first learned how to understand and modify the RedHat MPs to monitor CentOS and eventually even modified the RPM package to run on Ubuntu (which also works on Debian 5/Lenny), eventually, as you can see because I am now using it to monitor – from home, across the Internet – the machine running this blog:

www.muscetta.com Performance in OpsMgr

Or even, with or without OpsMgr 2007 R2, you could write your own scripts to interact with those providers, by using your favourite Scripting Language.

After all, those experimentations with Xplat got me a fame of being a "Unix expert at Microsoft" (this expression still makes me laugh), as I was tweeting here:
Unix expert at Microsoft

But really, I have never hidden my interest for interoperability and the fact that I have been using Linux quite a bit in the past, and still do.

Also, one more related information is that the fine people at Xandros have released their Bridgeways Management Packs and at the same time also started their own blog at http://blog.xplatxperts.com/ where they discuss some troubleshooting techniques for the Xplat agent, both similar to what I have been writing about here and also – of course – specific to their own providers, that are in their XSM namespace.

Disclaimer

The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my own personal opinion. All code samples are provided "AS IS" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.
THIS WORK IS NOT ENDORSED AND NOT EVEN CHECKED, AUTHORIZED, SCRUTINIZED NOR APPROVED BY MY EMPLOYER, AND IT ONLY REPRESENT SOMETHING WHICH I'VE DONE IN MY FREE TIME. NO GUARANTEE WHATSOEVER IS GIVEN ON THIS. THE AUTHOR SHALL NOT BE MADE RESPONSIBLE FOR ANY DAMAGE YOU MIGHT INCUR WHEN USING THIS INFORMATION. The solution presented here IS NOT SUPPORTED by Microsoft.

Installing the OpsMgr 2007 R2 SCX Agent on Ubuntu

May 30th, 2009 by Daniele Muscetta

You know since the beta1 of Xplat I have been busy with modifying the Redhat management pack and monitor CentOS with OpsMgr. Now, CentOS is a distribution that is pretty similar to RedHat, so the RPM package just runs, and it is only a matter of hacking a modified MP.

I never went really further in my experiments, mostly due to lack of time… but then yesterday I got a comment to this older post asking about Ubuntu. Of course I know about Ubuntu, and have been using Debian-based distributions for years. I actually even prefer them over RPM-based distributions such as RedHat or SuSE (personal preference). Heck, even this weblog is running on Debian!

Anyway, I never really tried to see if one of the existing RPM packages for RedHat or SuSE could be modified to run on Ubuntu. I will eventually test this on Debian too, but for now I used Ubuntu which tends to have slightly newer packages and libraries, overall. The machine I tested on is a Ubuntu Server 8.04.2. Older/newer versions might slightly differ.

BEWARE THAT ALL THAT FOLLOWS BELOW IS NOT SUPPORTED BY MICROSOFT. It is only described here for EXPERIMENTAL (==fun) purpose. DO NOT USE THIS IN A PRODUCTION ENVIRONMENT.

So, you are warned. Now let’s hack it.

The first thing to do is to copy the Redhat agent’s RPM package off your OpsMgr2007 R2 server in the “usual” path “C:Program FilesSystem Center Operations manager 2007AgentManagementUnixAgents”. Let’s grab the RHEL5 agent, which is called scx-1.0.4-248.rhel.5.x86.rpm in R2 RTM.

First we need to CONVERT the RPM package to the DEB package format used by Ubuntu, by using the ALIEN package:

sudo apt-get update
sudo apt-get install alien
sudo bash
alien -k scx-1.0.4-248.rhel.5.x86.rpm –scripts
dpkg -i scx_1.0.4-248_i386.deb

image

The converted package will install… but the script execution will fail in a few places – most notably in the generation of the certificate, as it is not able to locate the right openssl libraries, as shown in the screenshot above.

If the libssl.so.6 file cannot be found, you might be missing the “libssl-dev” package, which you can install as follows:

apt-get install libssl-dev

But even if it is installed, you will find that the files are still missing. This is not really true: actually, the files are there, but on Ubuntu they have a different name than on RedHat, that’s all. You can therefore create hardlinks to the “right” files, so that they are aliased and get found afterwards:

cd /usr/lib
ln -s libcrypto.so.0.9.8 libcrypto.so.6
ln -s libssl.so.0.9.8 libssl.so.6

So now when installing the package, the certificate generation will work:

image

You are nearly ready to go. You have to start the service by using the init scripts – the “service” command is RedHat-specific, that will still fail.

/etc/init.d/scx-cimd start is the “standard” way of starting daemons from init on Unix.

But it still fails, as it seems that the init script provided in the RedHat package is really searching for a file called “functions” which is present on RedHat and on CentOS, which provides re-usable functions for startup scripts to include:

image

How do you fix this? I just copied the /etc/init.d/functions file from a CentOS box to my Ubuntu box.

I copied it via SCP from the CentOS box I have:

cd /etc/init.d

scp root@centos.huis.dom:/etc/init.d/functions .

You can probably also find and fetch the file from the Internet (both CentOS and RedHat should have accessible repositories with all the files in their distributions, since it is open sourced).

After you have the file in place, the init script will be able to include it, will find the functions it needs, and the daemon/service will now start (even if with minor errors I have not investigated for now, but that don’t seem to be causing troubles):

image

and here you can see it is finally running:

image

So let’s try to issue a few queries as shown in a previous posts:

image

IT WORKS!!!

But… there is a “but”: not all classes actually return instances and values just yet. Most notably the “SCX_OperatingSystem” class does not seem to return anything right awy. That is a very important class, because is the one we would use to first discover the Operating System object in the Management Packs. So we need to fix it. The reason why the class does not return anything, is that the SCX provider is looking into the /etc/redhat-release file to return what OS version/distribution the machine is running. And the file is obviously not there on Ubuntu.

On all Linuxes there is a similar file, called /etc/issue… which again, we can copy with the other name and trick the provider into working:

cd /etc

cp issue redhat-release

And NOW, the SCX_OperatingSystem Class also returns an instance:

image

The next step would be “cooking” an MP to discover Ubuntu. More on this on a later post (maybe). I did not test all classes and their implementation… you can try to poke at them by following the instructions and commands on my previous post here. But this should get you started.

Disclaimer

The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my own personal opinion. All code samples are provided "AS IS" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.
THIS WORK IS NOT ENDORSED AND NOT EVEN CHECKED, AUTHORIZED, SCRUTINIZED NOR APPROVED BY MY EMPLOYER, AND IT ONLY REPRESENT SOMETHING WHICH I'VE DONE IN MY FREE TIME. NO GUARANTEE WHATSOEVER IS GIVEN ON THIS. THE AUTHOR SHALL NOT BE MADE RESPONSIBLE FOR ANY DAMAGE YOU MIGHT INCUR WHEN USING THIS INFORMATION. The solution presented here IS NOT SUPPORTED by Microsoft.

Cross Platform in OpsMgr 2007 R2 Release Candidate

March 27th, 2009 by Daniele Muscetta

You have heard it all over the place, System Center Operations Manager 2007 R2 has reached the Release Candidate milestone and the RC bits have been made available on connect.microsoft.com.

As it is becoming a tradition for me with each new release, I want to take a look at the Unix Monitoring stuff like I did since beta1 of Xplat, passing thru beta2. I am an integration freak and I have always insisted that interoperability is key. I will leave the most obvious “release notes” kind of things out of here, such as saying that there are now agents for the x64 version of linux distro’s, and so on…. you can read this stuff in the release notes already and in a zillion of other places.

Let’s instead look at my first impression ( = I am amazed: this product is really getting awesome) and let’s do a bit of digging, mostly to note what changed since my previous posts on Xplat (which, by the way, is the MOST visited post on this blog I ever published) – of course there is A LOT more that has changed under the hood… but those are code changes, improvements, polishing of the product itself… while that would be interesting from a code perspective, here I am more interested in what the final user (the System Administrator) will ultimately interact with directly, and what he might need to troubleshoot and understand how the pieces fit together to realize Unix Monitoring in OpsMgr.

After having hacked the RedHat MP to work on my CentOS box (as usual), I started to take a look at what is installed on the Linux box. Here are the new services:

ps -Af | grep scx

You will notice the daemons have changed names and get launched with new parameters.

Of course when you see who uses port 1270 everything becomes clearer:

netstat -anp | grep 1270

Therefore I can place the two new names and understand that SCXCIMSERVER is the WSMAN implementation, while SCXCIMPROVAGT is the CIM/WBEM implementation.

There is one more difference at the “service” (or “daemon”) level: the fact that there is only ONE init script now: /etc/init.d/scx-cimd

/etc/init.d/scx-cimd

So basically the SCX “Agent” will start and stop as a single thing, even if it is composed of multiple executables that will spawn various processes.

Another difference: if we look in “familiar” locations like /etc/opt/microsoft/scx/bin/tools/ we see that a number of configuration files is either empty (0 bytes) or missing (like the one described on Ander’s blog to enable verbose logging of WSMan requests), when compared to earlier versions:

/etc/opt/microsoft/scx/conf

But that is because I have been told we now have a nice new tool called scxadmin under /opt/microsoft/scx/bin/tools/ , which will let you configure those things:

/opt/microsoft/scx/bin/tools/scxadmin

Therefore you would enable VERBOSE logging for all components by issuing the command

./scxadmin -log-set all verbose

and you will bring it back to a less noisy setting of logging only errors with

./scxadmin -log-set all errors

the logs will be written under /var/opt/microsoft/scx/log just like they did before.

Other than this, a lot of the troubleshooting techniques I showed in one of my previous posts, like how to query CIM classes directly or thru WSMAN remotely by using winrm – they should really stay the same. I will mention them again here for reference.

SCXCIMCLI is a useful and simple tool used to query CIM directly. You can roughly compare it to wbemtest.exe in the WIndows world (other than not having a UI). This utility can also be found in /opt/microsoft/scx/bin/tools

A couple of examples of the most common/useful things you would do with scxcimcli:

1) Enumerate all Classes whose name contains “SCX_” in the root/scx namespace (the classes our Management packs use):

./scxcimcli nc -n root/scx -di |grep SCX_ | sort

./scxcimcli nc -n root/scx -di |grep SCX | sort

2) Execute a Query

./scxcimcli xq "select * from SCX_OperatingSystem" -n root/scx

./scxcimcli xq "select * from SCX_OperatingSystem" -n root/scx

Also another thing that you might want to test when troubleshooting discoveries, is running the same queries through WS-Man (possibly from the same Management Server that will or should be managing that unix box). I already showed this in the past, it is the following command:

winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_OperatingSystem?__cimnamespace=root/scx -username:root -password:password -r:https://linuxbox.mydomain.com:1270/wsman -auth:basic –skipCACheck

but if you launch it that way it will now return an error like the following (or at least it did in my test lab):

Fault
Code
Value = SOAP-ENV:Sender
Subcode
Value = wsman:EncodingLimit
Reason
Text = UTF-16 is not supported; Please use UTF-8
Detail
FaultDetail = http://schemas.dmtf.org/wbem/wsman/1/wsman/faultDetail/CharacterSet

Error number:  -2144108468 0x8033804C
The WS-Management service does not support the character set used in the request
. Change the request to use UTF-8 or UTF-16.

the error message is pretty self explanatory: you need to specify the UTF-8 Character set. You can do it by adding the “-encoding” qualifier:

winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_OperatingSystem?__cimnamespace=root/scx -username:root -password:password -r:https://linuxbox.mydomain.com:1270/wsman -auth:basic –skipCACheck –encoding:UTF-8

Hope the above is useful to figure out the differences between the earlier beta releases of the System Center CrossPlatform extensions and the version built in OpsMgr 2007 R2 Release Candidate.

There are obviously a million of other things in R2 worth writing about (either related to the Unix monitoring or to everything else) and I am sure posts will start to appear on the many, more active, blogs out there (they have already started appearing, actually). I have not had time to dig further, but will likely do so AFTER Easter – as the next couple of weeks I will be travelling, working some of the time (but without my test environment and good connectivity) AND visiting relatives the rest of the time.

One last thing I noticed about the Unix/Cross Platform Management Packs in R2 Release Candidate… their current “release date” exposed by the MP Catalog Web Service is the 20th of March

image

…which happens to be my Birthday – therefore they must be a present for me! :-)

Disclaimer

The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my own personal opinion. All code samples are provided "AS IS" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.
THIS WORK IS NOT ENDORSED AND NOT EVEN CHECKED, AUTHORIZED, SCRUTINIZED NOR APPROVED BY MY EMPLOYER, AND IT ONLY REPRESENT SOMETHING WHICH I'VE DONE IN MY FREE TIME. NO GUARANTEE WHATSOEVER IS GIVEN ON THIS. THE AUTHOR SHALL NOT BE MADE RESPONSIBLE FOR ANY DAMAGE YOU MIGHT INCUR WHEN USING THIS INFORMATION. The solution presented here IS NOT SUPPORTED by Microsoft.

Programmatically Check for Management Pack updates in OpsMgr 2007 R2

November 29th, 2008 by Daniele Muscetta

One of the cool new features of System Center Operations Manager 2007 R2 is the possibility to check and update Management Packs from the catalog on the Internet directly from the Operators Console:

Select Management Packs from Catalog

Even if the backend for this feature is not yet documented, I was extremely curious to see how this had actually been implemented. Especially since it took a while to have this feature available for OpsMgr, I had the suspicion that it could not be as simple as one downloadable XML file, like the old MOM2005's MPNotifier had been using in the past.

Therefore I observed the console's traffic through the lens of my proxy, and got my answer:

ISA Server Log

So that was it: a .Net Web Service.

I tried to ask the web service itself for discovery information, but failed:

WSDL

Since there is no WSDL available, but I badly wanted to interact with it, I had to figure out: what kind of requests would be allowed to it, how should they be written, what methods could they call and what parameters should I pass in the call. In order to get started on this, I thought I could just observe its network traffic. And so I did… I fired up Network Monitor and captured the traffic:

Microsoft Network Monitor 3.2

Microsoft Network Monitor is beautiful and useful for this kind of stuff, as it lets you easily identify which application a given stream of traffic belongs to, just like in the picture above. After I had isolated just the traffic from the Operations Console, I then saved those captures packets in CAP format and opened it again in Wireshark for a different kind of analysis – "Follow TCP Stream":

Wireshark: Follow TCP Stream

This showed me the reassembled conversation, and what kind of request was actually done to the Web Service. That was the information I needed.

Ready to rock at this point, I came up with this Powershell script (to be run in OpsMgr Command Shell) that will:

1) connect to the web service and retrieve the complete MP list for R2 (this part is also useful on its own, as it shows how to interact with a SOAP web service in Powershell, invoking a method of the web service by issuing a specially crafted POST request. To give due credit, for this part I first looked at this PERL code, which I then adapted and ported to Powershell);

2) loop through the results of the "Get-ManagementPack" opsmgr cmdlet and compare each MP found in the Management Group with those pulled from the catalog;

3) display a table of all imported MPs with both the version imported in your Management Group AND the version available on the catalog:

Script output in OpsMgr Command Shell

Remember that this is just SAMPLE code, it is not meant to be used in production environment and it is worth mentioning again that OpsMgr2007 R2 this is BETA software at the time of writing, therefore this functionality (and its implementation) might change at any time, and the script will break. Also, at present, the MP Catalog web service still returns slightly older MP versions and it is not yet kept in sync and updated with MP Releases, but it will be ready and with complete/updated content by the time R2 gets released.

Disclaimer

The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my own personal opinion. All code samples are provided "AS IS" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.
THIS WORK IS NOT ENDORSED AND NOT EVEN CHECKED, AUTHORIZED, SCRUTINIZED NOR APPROVED BY MY EMPLOYER, AND IT ONLY REPRESENT SOMETHING WHICH I'VE DONE IN MY FREE TIME. NO GUARANTEE WHATSOEVER IS GIVEN ON THIS. THE AUTHOR SHALL NOT BE MADE RESPONSIBLE FOR ANY DAMAGE YOU MIGHT INCUR WHEN USING THIS INFORMATION. The solution presented here IS NOT SUPPORTED by Microsoft.

CentOS discovery in OpsMgr2007 R2 beta

November 23rd, 2008 by Daniele Muscetta

Here we go again. Now that the OpsMgr2007 R2 beta is out, with an improved and revamped version of the System Center Cross Platform Extensions, I faced the issue of how to upgrade my test lab.

I have to say that OpsMgr2007 R2 beta release notes explain the known issues, and I had no trouble whatsoever upgrading the windows part. It just took its time (I am running virtual machines in my test lab, that don't have the best performance), but it went smoothly and without a glitch. In a couple of hours I had everything upgraded: databases, RMS, reporting, agents, gateway. All right then. The new purple icons in System Center look cute, and the new UI has some great stuff, such as a long-awaited way to update your management packs directly from the Internet, better display of Overrides (kind of what we used to rely on Override Explorer for)… and  A LOT more new stuff that I won't be wasting my Sunday writing about since everybody else has already done it two days ago:

opsmgr aggregated feed on Twitter

Therefore let's get back to my upgrade, which is a lot more interesting (to me) than the marketing tam-tam :-)

As part of the upgrade to R2, I had to first uninstall the Xplat beta refresh bits, which I had installed, including all Unix Management Packs. Including my CentOS Management Pack I had improvised.

So this is the new start page of the integrated Discovery Wizard:

Discovery Wizard

Looks nice and integrates the functionality of discovering and deploying Windows machines, SNMP Devices, and Unix/Linux machines.

Of course, my CentOS machine would not be discovered, and showed up as an unsupported platform. Of course my old Management Pack I had hacked together in XPlat Beta 1 did not work anymore. Therefore, I figured out I had to see what changes were there, and how to make it work again (of course it IS possible – It is NOT SUPPORTED, but I don't care, as long as it works).

Since the existing agent could not be discovered, the first step I took was logging on the Linux box, un-install the old agent, and install the new one:

XPlat Agent RPM Install on CentOS

There I tried to discover again, but of course it still failed.

At that point I started taking a look at the new layout of things on the unix side. Most stuff is located in the same directories where beta1 was installed, and there are a bunch of useful commands under /opt/microsoft/scx/bin/tools.
You can check out the Open Pegasus version used:

[root@centos tools]# ./scxcimconfig –version
Version 2.7.0

Let's take a look at what SCX classes we have available:

./scxcimcli nc -n root/scx -di |grep SCX | sort

./scxcimcli nc -n root/scx -di |grep SCX | sort

Nice. That's the stuff we will be querying over WS-Man from the Management Server.

So let's look at the OS Discovery, and we test it from the OpsMgr 2007 box:

winrm enumerate http://schemas.microsoft.com/wbem/wscim/1/cim-schema/2/SCX_OperatingSystem?__cimnamespace=root/scx -username:root -password:password -r:https://centos:1270/wsman -auth:basic -skipCACheck

it returns results:

OS WS-Man Query

At first I assumed this worked like in Beta1, therefore I exported RedHat management pack and I made my own version of it, replacing the strings it is expecting to find to discover CentOS instead than Redhat.

While the MP was syntactically correct and would import fine, the Discovery wizard still didn't work.

I took one more look at the discoveries in the MP, and I found there are two more, targeted to Management Server, which is probably what gets used by the Discovery Wizard to understand what kind of agent kit needs to be deployed.

MP XML - Discoveries

So basically this discovery checks for the returned value from the module to determine if the discovered platform is a supported one:

Discovery Settings

But how does the module get its data?

Look at the layout of the /AgentManagement/UnixAgents folder on the Management Server:

/AgentManagement/unixAgents

That's it: GetOSVersion.sh – a shell script. A nice, open, clear text, hackable shell script. Let's take a look at it:

Discovery Script Hack

So that's it, and how my modification looks like. What happens during the discovery wizard is that we probably copy the script over SCP to the box, execute it, look at a number of things, and return the discovery data we need.

If you do those steps manually, you see how the script returns something very similar to a PropertyBag, just like discoveries done by VBScript on Windows machines:

Discovery Script Output

So after modifying the script… here we go. The Wizard now thinks CentOS is Red Hat, and can install an agent on it:

Discovery Wizard

Deploying Agent

Only when the Management Server discovery finally considers the CentOS machine worth managing, then the other discoveries that use WS-Man queries start kicking in, like the old one did, and find the OS objects and all the other hosted objects. In order for this to work you don't only need to hack the shell script, but to have a hacked MP – the "regular" Red Har one won't find CentOS, which is and remains an UNSUPPORTED platform.

CentOS Health Model

Disclaimer

The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my own personal opinion. All code samples are provided "AS IS" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.
THIS WORK IS NOT ENDORSED AND NOT EVEN CHECKED, AUTHORIZED, SCRUTINIZED NOR APPROVED BY MY EMPLOYER, AND IT ONLY REPRESENT SOMETHING WHICH I'VE DONE IN MY FREE TIME. NO GUARANTEE WHATSOEVER IS GIVEN ON THIS. THE AUTHOR SHALL NOT BE MADE RESPONSIBLE FOR ANY DAMAGE YOU MIGHT INCUR WHEN USING THIS INFORMATION. The solution presented here IS NOT SUPPORTED by Microsoft.

Protecting custom Resolution State in OpsMgr 2007

September 13th, 2008 by Daniele Muscetta

In System Center Operations Manager 2007, you can add and remove resolution states for your alerts at will. Other than states "0" ("New") and "255" ("Closed") you can create other 254 resolution states to suit your needs. This is a simple feature that was already present in previous MOM versions, and it is very useful to do a kind of tricks with your alerts. The amount of possible states you can create should be able to satisfy any kind of alert and incident management process you might have in place, and any kind of filtering or forwarding or escalation need you might want to perform by using resolution states.

image

By default, only OpsMgr Administrators can change these settings, with the exception of the two built-in states of "New" and "Closed": those two states are REQUIRED if you want the product to continue working, therefore the GUI won't let you change, edit or delete them. Which is good.

This is not true for your own resolution states, which can be edited or even deleted any time. All that is really saved in an alert when you change an alert's resolution state is the NUMBER associated with it. In fact you even use that number when querying for alerts in the Command Shell:

get-alert | where {$_.resolutionstate -eq 0}

That means that if by accident you delete a resolution state you have defined, you won't see its description anymore in the GUI. Also, if you try to re-organize your resolution state, you can easily change the IDs for existing ones… Sure, you need to have the permissions in order to change or delete them, but what if you have implemented your important Alert and Incident management process by using resolution states and you want a bit of extra protection from mistakes or unintended deletion for them?

Then you can protect them by making the product think they were "built-in" too, just like "New" and "Closed".

How would you do this? In an UNSUPPORTED WAY: editing the database :-) In fact, those resolution states are written in a table in the database, called "ResolutionState" (who would have guessed it?), that looks like the following picture:

dbo.ResolutionState

Can you see the "IsPredefined" column? That can be set to "True" or "False" and that value is used by the SDK service to tell the GUI if that Resolution State can be edited/deleted or not.

Of course changing the database directly IS NOT SUPPORTED by Microsoft. You do this at your own risk, and if it was me, I would *NEVER* touch, change or remove the default two states ("New" and "Closed") as THAT really would BREAK the product. For example, Alerts that are not set to "Closed" (255) won't be ever groomed. And that is VERY BAD. NEVER, NEVER DO THAT.

On the other end, changing a custom Resolution State to make the product believe it is Predefined/Built-in has not had any negative impact in my (limited) testing so far, and has added the advantage of "protecting" my resolution state from unintended deletion, as shown below:

image

As usual, do this at your own risk. Remember what's written in my Disclaimer:

The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my own personal opinion. All code samples are provided "AS IS" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.
THIS WORK IS NOT ENDORSED AND NOT EVEN CHECKED, AUTHORIZED, SCRUTINIZED NOR APPROVED BY MICROSOFT, AND IT ONLY REPRESENT SOMETHING WHICH I'VE DONE IN MY FREE TIME. NO GUARANTEE WHATSOEVER IS GIVEN ON THIS. THE AUTHOR SHALL NOT BE MADE RESPONSIBLE FOR ANY DAMAGE YOU MIGHT INCUR WHEN USING THIS HACK.

CentOS 5 Management Pack for OpsMgr SCX

May 13th, 2008 by Daniele Muscetta

As I mentioned here, I have been testing the SCX beta.

Not having one of the "supported" platforms pushed me into playing with the provided Management Packs, and in turn I managed to use the MP for Red Hat Enterprise Linux 5 as a base, and replaced a couple of strings in the discoveries in order to get a working CentOS 5 Management Pack.

CentOS_HealthExplorer01_NEW

I still have not looked into the "hardware" monitors and health model / service model, so those are not currently monitored. But it is a start.

A lot of people have asked me a lot of information and would like to get the file – both in the blog's comment, on the newsgroup, or via mail. I am sorry, but I cannot provide you with the file, because it has not been throughly tested and might render your systems unstable, and also because there might be licensing and copyright issues that I have not checked within Microsoft.

Keep also in mind that using CentOS as a monitored platform is NOT a SUPPORTED scenario/platform for SCX. I only used it because I did not have a Suse or Redhat handy that day, and because I wanted to understand how the Management Packs using WS-Man worked.

This said, should you wish to try to do the same "MP Hacking" I did,  I pretty much explained all you need to know in my previous post and its comments, so that should not be that difficult.

Actually, I still think that the best way to figure out how things are done is by looking at the actual implementation, so I encourage you to look at the management packs and figure out how those work. There are a few mature tools out there that will help you author/edit Management Packs if you don't want to edit the XML directly: the Authoring Console, and Silect MP Studio Lite, for example. If you want to delve in the XML details, instead, then I suggest you read the Authoring Guide and peek at Steve Wilson's AuthorMPs.com site.

Disclaimer
The information in this weblog is provided "AS IS" with no warranties, and confers no rights. This weblog does not represent the thoughts, intentions, plans or strategies of my employer. It is solely my own personal opinion. All code samples are provided "AS IS" without warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability and/or fitness for a particular purpose.
THIS WORK IS NOT ENDORSED AND NOT EVEN CHECKED, AUTHORIZED, SCRUTINIZED NOR APPROVED BY MY EMPLOYER, AND IT ONLY REPRESENT SOMETHING WHICH I'VE DONE IN MY FREE TIME. NO GUARANTEE WHATSOEVER IS GIVEN ON THIS. THE AUTHOR SHALL NOT BE MADE RESPONSIBLE FOR ANY DAMAGE YOU MIGHT INCUR WHEN USING THIS PROGRAM.

Create a Script-Based Unit Monitor in OpsMgr2007 via the GUI

May 10th, 2007 by Daniele Muscetta

Warning for people who landed here: this post is VERY OLD. It was written in the early days of struggling with OpsMgr 2007, and when nobody really knew how to do things.
I found that this way was working – and it surely does – but what is described here is NOT the recommended way to do things nowadays. This post was only meant to fill in a gap I was feeling existed, back in 2007.
But as time passes, and documentation gets written, knowledge improves.
Therefore, I recommend you read the newly released Composition chapter of the MP Authoring Guide instead
http://technet.microsoft.com/en-us/library/ff381321.aspx – and start building your custom modules to embed scripts as Brian Wren describes in there, so that you can share them between multiple rules and monitors.

This said, below is the original post.

Create a Script-Based Unit Monitor in OpsMgr2007 via the GUI

There is not a lot of documentation for System Center Operations Manager 2007 yet.
It is coming, but there's a lot of things that changed since the previous release and I think some more would only help. Also, a lot of the content I am seeing is either too newbie-oriented or too developer-oriented, for some reason.

I have not yet seen a tutorial, webcast or anything that explains how to create a simple unit monitor that uses a VBS script using the GUI.

So this is how you do it:

Go to the "Authoring" space of OpsMgr 2007 Operations Console.
Select the "Management Pack objects", then "Monitors" node. Right click and choose "Create a monitor" -> "Unit Monitor".

You get the "Create a monitor" wizard open:
wizard02

Choose to create a two-states unit monitor based on a script. Creating a three- state monitor would be pretty similar, but I'll show you the most simple one.
Also, choose a Management pack that will contain your script and unit monitor, or create a new management pack.
wizard03

Choose a "monitor target" (object classes or instances – see this webcast about targeting rules and monitors: www.microsoft.com/winme/0703/28666/Target_Monitoring_Edit… ) and the aggregate rollup monitor you want to roll the state up to.

Choose a schedule, that is: how often would you like your script to run. For demonstration purposes I usually choose a very short interval such a two or three minutes. For production environments, tough, choose a longer time range.
wizard04

Choose a name for your script, complete with a .VBS extension, and write the code of the script in the rich text box:
wizard05

As the sample code and comments suggest, you should use a script that checks for the stuff you want it to check, and returns a "Property Bag" that can be later interpreted by OpsMgr workflow to change the monitor's state.
This is substantially different than scripting in MOM 2005, where you could only launch scripts as responses, loosing all control over their execution.

For demonstration purpose, use the following script code:
 

On Error Resume Next
Dim oAPI, oBag
Set oAPI = CreateObject("MOM.ScriptAPI")
Set oBag = oAPI.CreateTypedPropertyBag(StateDataType)
Const FOR_APPENDING = 8
strFileName = "c:\testfolder\testfile.txt"
strContent = "test "
Set objFS = CreateObject("Scripting.FileSystemObject")
Set objTS = objFS.OpenTextFile(strFileName,FOR_APPENDING)
If Err.Number <> 0 Then
Call oBag.AddValue("State","BAD")
Else
Call oBag.AddValue("State","GOOD")
objTS.Write strContent
End If
Call oAPI.Return(oBag)

[edited on 29th of May as pointed out by Ian: if you cut and paste the example script you might need to change the apostrophes (“) as that causes the script to fail when run – it is an issue with the template of this blog.] [edited on 30th of May: I fixed the blog so that now post content shows just plain, normal double quotes instead than fancy ones. It seems like a useful thing when from time to time I post code…]

The script will try to write into the file c:\testfolder\testfile.txt.
If it finds the file and manages to write (append text) to it, it will return the property "State" with a value of "GOOD".
If it fails (for example if the file does not exist), it will return the property "State" with a value of "BAD".

In MOM 2005 you could only let script generate Events or Alerts directly as a mean to communicate their results back to the monitoring engine. In OpsMgr 2007 you can let your script spit out a property bag and then continue the monitoring workflow and decide what to do depending on the script's result.

wizard06

So the next step is to go and check for the value of the property we return in the property bag, to determine which status the monitor will have to assume.

We use the syntax Property[@Name='State'] in the parameter field, and we search for a content that means an unhealthy condition:

wizard07

Or for the healty one:
wizard08

Then we decide which status will the monitor have to assume in the healty and unhealty conditions (Green/Yellow or Green/Red usually)
wizard09

Optionally, we can decide to raise an Alert when the status changes to unhealthy, and close it again when it goes back to healty.

wizard10

Now our unit monitor is done.
All we have to do is waiting it gets pushed down to the agent(s) that should execute it, and wait for its status to change.
In fact it should go to the unhealthy state first.
To test that it works, just create the text file it will be searching for, and wait for it to run again, and the state should be reset to Healthy.

Have fun with more complex scripts!