Before you begin:
By default, devices monitored by NPM are polled for data every nine minutes. It might take some time before all the nodes you added have data you can review.
As described in the topic Identify and troubleshoot a node that has a problem, an alert is triggered when a node goes down. Alerts can also be triggered when an interface has a problem, such as high utilization or the interface going down.
The Nodes with Problems resource provides information about the interfaces associated with each node. A square in the bottom-right corner of the node icon indicates that the node has an interface with a problem:
- A red square indicates that one or more interfaces are down.
- A gray square indicates that the status of one or more interfaces is unknown.
In your environment, you might not have any down interfaces. To find an interface with issues that need to be investigated, click My Dashboards > Network > Network Top 10 to open the Network Top 10 view. Review the following resources on this page.
This resource shows the interface's transmit and receive utilization as a percentage of total interface speed. By default, utilization rates from 70% to 90% are shown in yellow (warning), and utilization over 90% is shown in red (danger). These thresholds are configurable.
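The default thresholds described above can be summarized as a simple mapping from a utilization percentage to a status color. This is a minimal sketch for illustration only; the function name is hypothetical, and in a real deployment the thresholds are whatever you have configured in NPM.

```python
# Hypothetical sketch of NPM's default interface-utilization thresholds:
# warning (yellow) at 70%, danger (red) above 90%. Thresholds are
# configurable in NPM; these are the documented defaults.

def classify_utilization(percent: float) -> str:
    """Map an interface utilization percentage to a status color."""
    if percent > 90:
        return "red"      # danger: utilization over 90%
    if percent >= 70:
        return "yellow"   # warning: 70-90% utilization
    return "green"        # below the warning threshold

print(classify_utilization(45))   # green
print(classify_utilization(75))   # yellow
print(classify_utilization(95))   # red
```

Note that 90% itself falls in the warning band; only utilization over 90% is flagged as danger.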
Any interface with high utilization deserves more investigation.
This resource shows how much actual traffic is on an interface. Usually, WAN interfaces will be on this list because of the volume of traffic they process.
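Traffic volume and percent utilization are related by the interface's speed: the same absolute traffic can be negligible on a fast link and saturating on a slow one. A minimal sketch of that calculation, with hypothetical numbers:

```python
def utilization_percent(observed_bps: float, interface_speed_bps: float) -> float:
    """Express observed traffic as a percentage of total interface speed."""
    return 100.0 * observed_bps / interface_speed_bps

# Hypothetical example: a 1 Gbps WAN interface carrying 850 Mbps of traffic
# is at 85% utilization -- high volume AND high utilization.
print(utilization_percent(850_000_000, 1_000_000_000))  # 85.0

# The same 850 Mbps on a 10 Gbps core link is only 8.5% utilization.
print(utilization_percent(850_000_000, 10_000_000_000))  # 8.5
```

This is why WAN interfaces often top the traffic-volume list without necessarily being the most heavily utilized interfaces.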
This resource shows:
If an interface is down (red), that generally means there is no connection:
Once you have found an interface with a problem (or, if all your interfaces are healthy, an interface with high utilization, errors, or discards), click the interface name in any resource. The Interface Details page opens.
The Node Details page can help you diagnose an interface problem. Click the node name at the top of the Interface Details page to open the Node Details page.
Examine the following resources on this page.
This resource shows the average load on the CPU for this node. In this case, the load spiked dramatically around 1:30 PM, which warrants further investigation.
This resource shows the latency (response time) and packet loss for the entire node. A spike in response time occurred at the same time as the spike in average CPU load (shown above), suggesting that the two events are correlated.
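Eyeballing two charts for coincident spikes can be checked numerically with a correlation coefficient. This sketch uses entirely hypothetical samples (NPM does not expose data this way out of the box; you would export or query the polled values yourself):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical polling samples bracketing the 1:30 PM spike.
cpu_load = [12, 14, 13, 78, 81, 75, 20]   # average CPU load, %
latency  = [8, 9, 8, 120, 140, 110, 15]   # response time, ms

r = pearson(cpu_load, latency)
print(round(r, 2))  # a value near 1.0 means the spikes coincide
```

A coefficient near 1.0 supports the visual impression that the CPU and latency spikes are related; correlation alone does not tell you which one caused the other.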
These resources point to an unexplained increase in traffic at approximately 1:30 PM, leading to higher interface utilization, CPU load, and dropped packets. Because the values are not yet critical and no alerts have been triggered, this might not be a concern, but if you want to continue troubleshooting, you could perform the following actions: