LEM performance checks

Updated August 31st, 2016

Overview

This article provides guidelines to help troubleshoot problems with rules firing or a system that is running slowly.

Environment

All LEM versions            

Steps to Verify

 

For information

  • Because of the dynamic changes in auditing, LEM is deployed with thin provisioning, but LEM performance typically improves if thick provisioning is used in VMware.
  • Misconfiguration will adversely affect LEM performance.
  • Missing or insufficient reservations can result in file or filesystem corruption, in addition to other errors.
         (Support will need to verify files and filesystems, for example /, /tmp, or /usr/local/contego.)
     

Performance reports run from Reports Console  (Save all as Crystal Report .rpt)

  • Event Summary - Top Level Statistics - Report - for last 30 days > save in Crystal Reports format (rpt). (Lists traffic volume, hourly and daily; does not include FileAudit traffic.)
  • Event Summary - Graphs - Report - for last 30 days > save in Crystal Reports format (rpt). (Quick graphical view of traffic volume.)
  • Agent Maintenance Report - for last 24 hours > save in Crystal Reports format (rpt). (Agent connection issues, showing duplicate and unknown agents.)
  • SolarWinds Actions Report - for last 24 hours > save in Crystal Reports format (rpt). (Often overlooked; you may see a high volume of emails sent that were not shown in Inferred Alerts.)
  • Tool Maintenance by Alias Report - for last 24 hours > save in Crystal Reports format (rpt). (A high volume of unmatched data consumes resources to process the errors.)
  • Database Maintenance Report - no timeframe. (Shows retention and traffic volume for the last 20 days on a bar graph.)
  • Inferred Alerts by Inference Rule - for last 24 hours > save in Crystal Reports format (rpt). ***Can take 30 minutes*** (These are the rules that fired to "infer", not necessarily rules firing and sending emails.)

nDepth Searches (created from GUI-console, Explore > nDepth)

  • Rules - InternalRuleFired for the last 24 hours
  • Agent  - InternalUnknownAgent and InternalDuplicateNode  for the last 24 hours
  • Un-matched/Mis-matched data - AnyAlert.EventInfo=*unmatched*  for the last 24 hours
  • Incorrect Security Connector - AnyAlert.EventInfo=*vista alert*  for the last 24 hours
  • Windows noise - AnyAlert.EventInfo=*windows filtering platform* for the last 24 hours

Connector Configurations

  • Verify that no connectors (syslog or agent) are logging to the nDepth "RAW" database unless the RAW database is enabled. RAW is not needed to meet auditing requirements, but an auditor may require it.
    To verify that the RAW database is enabled, look for the orange/blue slider to the right of the search-time drop-down under Explore > nDepth.
    Note that connectors refer to "ndepth", which is the same thing as the "RAW" database.
    By default, log data is unchanged but placed into separate fields/tables in the Alert database.
  • A connector must be configured for every type of data received in order for the data to be inserted into an LEM database.
    Check the agent connectors (Manage > Nodes) and syslog connectors (Manage > Appliances, gear on the left) to verify which connector is configured to read which log file.
    Pay close attention to firewalls sending data to a single log file (and a single connector). Many devices can easily share log files, but some cannot.
    For example, the five most common Cisco device types (IOS, CatOS, WLC, VPN, Nexus) cannot share log files, so each needs its own log file.

Example of the 5 devices:

  • Cisco Pix and IOS - local2.log
  • Cisco VPN - local3.log
  • Cisco WLC - local4.log
  • Cisco Nexus - local5.log
  • Cisco CatOS - local6.log

View the following Web Console monitor filters

  1. All Alerts, looking for anything unusual (look for WFP, very high FileAudit, unmatched data, Vista connector alert)
  2. Rule Activity (if the time is off by 5 minutes, rules will not fire. If you see no rules firing, this could also mean that so many rules are firing that LEM cannot keep the console up to date.)
  3. SolarWinds (Trigeo) Alerts (unmatched data, agent connection issues, wrong security event connector used “vista alert”, and so on).

Check policies  

Verify the Event Distribution Policy (exercise caution, as this can categorically drop data without inserting it into the database)

  1. To view WFP that is dropped, go to Manage > Appliance > Policy and search for "security". It is difficult to tell the volume of WFP if it is dropped. The preferred approach is to remove WFP in the Windows AD GPOs.
  2. Audit Alert is the area that most events pass through. (Use caution when changing any policies. Even when data is dropped, resources are still needed to process it.)

 

***The guidance below assumes the RAW database is disabled, there is little to no unmatched data, there are no agent connection issues and no WFP noise, rules are properly constructed, and only a few hundred rules fire per day.***

 

LEM reservations are typically defined by the volume of traffic received. Note that a misconfigured LEM can result in the need for higher reservations. In addition, if LEM is queueing data, something is misconfigured. Queueing is visible in a CMC/PuTTY session under the "appliance" menu with the "diskusage" command. (Expect to see only "Database Queue ... xxxx alerts waiting in memory", and nothing else waiting or queued.)

Here are the reservations needed to handle the specified events per day:

  • 0->15 million events/day => set 8GB of RAM & 2 CPUs (this is the default; CPUs can be any of processors/cores/sockets)
  • 15->35 million events/day => set 16GB of RAM & 4 CPUs
  • 35->60 million events/day => set 24GB of RAM & 6 CPUs
  • 60->90 million events/day => set 32GB of RAM & 8 CPUs
  • 90->130 million events/day => set 64GB of RAM & 10 CPUs
  • 130->200 million events/day => set 128GB of RAM & 12 CPUs
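
The tiers above amount to a simple lookup from daily event volume to a reservation. The following is a minimal sketch of that lookup, not a SolarWinds tool: the function name `recommend_reservation` is hypothetical, and treating a volume that lands exactly on a boundary (for example 15 million) as belonging to the lower tier is an assumption, since the list does not say which side the boundary falls on.

```python
# Tiers transcribed from the list above:
# (upper bound in million events/day, RAM in GB, number of CPUs).
RESERVATION_TIERS = [
    (15, 8, 2),
    (35, 16, 4),
    (60, 24, 6),
    (90, 32, 8),
    (130, 64, 10),
    (200, 128, 12),
]


def recommend_reservation(events_per_day):
    """Return (ram_gb, cpus) for a daily event volume.

    Returns None above 200 million events/day, where the list gives no guidance.
    Assumption: a value exactly on a boundary belongs to the lower tier.
    """
    if events_per_day < 0:
        raise ValueError("events_per_day must be non-negative")
    millions = events_per_day / 1_000_000
    for upper_bound, ram_gb, cpus in RESERVATION_TIERS:
        if millions <= upper_bound:
            return ram_gb, cpus
    return None


if __name__ == "__main__":
    print(recommend_reservation(22_000_000))  # prints (16, 4)
```

The daily volume itself would come from the Event Summary reports described earlier.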

Conditions to watch for:

- If enabling the RAW database, increase reservations to the next level.
- For every few thousand rules firing per day, increase reservations to the next level.
- For agent connection issues (unknown/duplicate) exceeding 500 per day, increase reservations to the next level.
- For every few thousand unmatched data errors, increase reservations to the next level.
- For every few thousand "vista alert detected" errors, increase reservations to the next level.
- When 25% of the data is WFP, increase reservations to the next level.
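
As a rough illustration of how these adjustments stack on top of a base reservation level, here is a sketch under stated assumptions. The function `adjusted_reservation` and its parameters are hypothetical. It assumes the increases are additive (the article does not say whether they are), and it leaves the "every few thousand" conditions to the caller as `extra_levels`, because the article does not give an exact count for them.

```python
# Reservation levels from the article, lowest to highest: (RAM in GB, CPUs).
RESERVATION_LEVELS = [(8, 2), (16, 4), (24, 6), (32, 8), (64, 10), (128, 12)]


def adjusted_reservation(base_level, raw_enabled=False, agent_issues_per_day=0,
                         wfp_fraction=0.0, extra_levels=0):
    """Return (ram_gb, cpus) after applying the 'next level' conditions.

    base_level   -- index into RESERVATION_LEVELS chosen from traffic volume
    extra_levels -- levels the caller adds for the 'every few thousand'
                    conditions (rules fired, unmatched data, vista alerts)
    Assumption: each condition that applies adds one level, capped at the top.
    """
    if not 0 <= base_level < len(RESERVATION_LEVELS):
        raise ValueError("base_level out of range")
    bump = extra_levels
    if raw_enabled:
        bump += 1                      # RAW database enabled
    if agent_issues_per_day > 500:
        bump += 1                      # unknown/duplicate agents over 500 per day
    if wfp_fraction >= 0.25:
        bump += 1                      # 25% or more of the data is WFP
    level = min(base_level + bump, len(RESERVATION_LEVELS) - 1)
    return RESERVATION_LEVELS[level]


if __name__ == "__main__":
    # Base level 1 (16GB / 4 CPUs) with agent issues and heavy WFP traffic.
    print(adjusted_reservation(1, agent_issues_per_day=600, wfp_fraction=0.3))
```

In practice, fixing the underlying misconfiguration is preferable to raising reservations, as noted above.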

http://www.solarwinds.com/documentation/LEM/Docs/LEMDeploymentGuide.pdf

Verify the VM storage performance (Support must run the IOPS test until it is available from the console)

 

  1. Open a PuTTY root session and run:
    cd /tmp
    dd if=/dev/zero of=/tmp/test bs=64k count=16k conv=fdatasync
    rm /tmp/test
  2. Record the results. For example, 12MB/sec is very bad, 400MB/sec is very good.
  3. Here are some rough numbers, but be aware that these levels are dynamic:
    <20MB/sec - unacceptable level, and the VM needs faster storage access.
    30->100MB/sec - can probably handle up to 15 million events/day (speed is still marginal)
    100->200MB/sec - can probably handle up to 30 million events/day
    200->300MB/sec - can probably handle up to 60 million events/day
    300->400MB/sec - can probably handle up to 120 million events/day
    >400MB/sec - can probably handle up to 200 million events/day
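
To make step 2 less error-prone, the throughput can be read from the summary line that dd prints and matched against the bands in step 3. The sketch below assumes the GNU dd summary format (a line ending in something like `411 MB/s`). The function names are hypothetical, and the 20 to 30 MB/sec range returns None because the list above does not cover it.

```python
import re

# Conversion of the decimal units GNU dd may print into MB/s.
_UNIT_TO_MB = {"kB": 1e-3, "MB": 1.0, "GB": 1e3}


def parse_dd_rate_mb_per_s(dd_output):
    """Extract the throughput in MB/s from dd's summary line."""
    match = re.search(r"([\d.]+)\s*(kB|MB|GB)/s", dd_output)
    if match is None:
        raise ValueError("no throughput figure found in dd output")
    return float(match.group(1)) * _UNIT_TO_MB[match.group(2)]


def storage_capacity_millions(mb_per_s):
    """Map a throughput to the event capacity (in millions) from the list.

    Returns 0 for the unacceptable band (<20 MB/s) and None for the
    20 to 30 MB/s range, which the list does not cover.
    Assumption: a value exactly on a boundary belongs to the lower band.
    """
    if mb_per_s < 20:
        return 0
    if mb_per_s < 30:
        return None
    for upper_bound, capacity in [(100, 15), (200, 30), (300, 60), (400, 120)]:
        if mb_per_s <= upper_bound:
            return capacity
    return 200


if __name__ == "__main__":
    sample = (
        "16384+0 records in\n"
        "16384+0 records out\n"
        "1073741824 bytes (1.1 GB) copied, 2.61 s, 411 MB/s\n"
    )
    rate = parse_dd_rate_mb_per_s(sample)
    print(rate, storage_capacity_millions(rate))
```

The sample text is illustrative only, not output captured from a real LEM appliance.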

 

Disk Latency Issue in VM environment
***If disk latency issues exist, it may impact the IOPs as well. Latency issues will affect writing and/or reading to the disk.***

  1. Get performance results from the VM host (vSphere console or Hyper-V console).
    /Success_Center/New_Articles/LEM_Performance_-_How_can_I_test_Disk_Latency
  2. Collect a debug from cmc -> manager -> debug and upload it to LeapFile.
  3. Look for errors or problems, ask others in support, or send to Jira to have DEV read the debug.

 

From Manager

  1. Type rcc
  2. Type pumpstatus

 

PUTTY

  1. Verify if LEM is queueing data -> open a putty/cmc session to the LEM
  2. Run diskusage (appliance menu) (du -h) and note any queueing in "alerts queued" or "alerts in memory" (ignore Database Queue -> "xxx alerts in memory").
  3. Run top (appliance menu) and note the "load average" at the very top. (Less than 1.0 is very good, less than 2.0 is OK, and 9->10 is very bad and means LEM is locked.)
  4. Run checklogs (appliance menu) to see if any one facility is overloaded with data.
  5. (If "local2.log" receives more than 10GB per day, it may not be able to rotate. Use setlogrotate and limitsyslog to manage this.)
  6. Run viewsysinfo (manager) and check the reservations.
  7. If using VMware, view the reservation in VMware; if using Hyper-V, view the reservation in Hyper-V. (Memory requirements are consistent, but keep in mind that the hardware CPUs used will vary quite a bit.) We do not specify hardware requirements (including disk); we do specify the virtual machine requirements.
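
The load average bands from step 3 can be applied to the header line that top prints. This is a minimal sketch assuming the standard Linux top header (`... load average: 0.52, 0.58, 0.59`) and using the first (1-minute) value. The function names are hypothetical, and the "elevated" label for values between 2.0 and 9.0 is an assumption, since the article gives no label for that range.

```python
import re


def parse_load_average(top_header):
    """Return the 1-minute load average from top's header line."""
    match = re.search(r"load average:\s*([\d.]+)", top_header)
    if match is None:
        raise ValueError("no load average found in header")
    return float(match.group(1))


def classify_load(load):
    """Label a load average using the bands from step 3."""
    if load < 1.0:
        return "very good"
    if load < 2.0:
        return "ok"
    if load >= 9.0:
        return "very bad (LEM locked)"
    return "elevated"  # assumption: the article does not name this range


if __name__ == "__main__":
    header = "top - 10:15:01 up 12 days,  3:02,  1 user,  load average: 0.52, 0.58, 0.59"
    load = parse_load_average(header)
    print(load, classify_load(load))  # prints 0.52 very good
```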

 

If the environment has both VMware ESX and ESXi, make sure only the ESXi connector is used; otherwise the ESX connector takes the data and produces unmatched data.

 

  • Database Maintenance Report  -->  no timeframe  --> save in PDF format.
  • Agent Maintenance Report  -->  for last 24 hours  -->  save in Crystal Reports format (rpt).
  • Inferred Alerts by Inference Rule  -->  for last 24 hours  --> save in Crystal Reports format (rpt).
  • Tool Maintenance by Alias Report  --> for last 24 hours  -->  save in Crystal Reports format (rpt).
  • Solarwinds Actions Report  -->  for last 24 hours  -->  save in Crystal Reports format (rpt).
  • Event Summary - Graphs - Report --> for last 30 days --> save in Crystal Reports format (rpt).
  • Event Summary - Top Level Statistics - Report --> for last 30 days --> save in Crystal Reports format (rpt).