Updated May 23rd, 2016
This article describes the failover process and how to recover from a failover.
A failover should not be confused with a switchover.
- A switchover is a controlled switch (initiated from the SolarWinds Orion Failover Manager) between the primary and secondary servers.
- A failover may happen when one or more of the following have failed on the active server: power, hardware, or communications.
- Before beginning a failover, the passive server waits the preconfigured length of time between missed Heartbeats. It then automatically assumes the active role and starts to execute the protected applications.
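The heartbeat wait described above can be pictured with a small sketch. This is only an illustration, not the Orion Failover Engine implementation: the class name `HeartbeatWatchdog`, the `timeout_s` parameter, and the timestamps are all hypothetical, and the real product may count missed Heartbeats differently.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWatchdog:
    """Hypothetical model of the passive server waiting on Heartbeats."""

    timeout_s: float                 # assumed: the preconfigured wait, in seconds
    last_heartbeat_s: float = 0.0    # time the last Heartbeat was received

    def record_heartbeat(self, now_s: float) -> None:
        # Called whenever a Heartbeat arrives from the active server.
        self.last_heartbeat_s = now_s

    def should_fail_over(self, now_s: float) -> bool:
        # A failover begins only after the full wait has elapsed
        # without any Heartbeat from the active server.
        return now_s - self.last_heartbeat_s >= self.timeout_s


watchdog = HeartbeatWatchdog(timeout_s=15.0)
watchdog.record_heartbeat(now_s=100.0)
print(watchdog.should_fail_over(now_s=110.0))  # False: still inside the wait
print(watchdog.should_fail_over(now_s=115.0))  # True: wait has elapsed
```

A switchover, by contrast, would not pass through a check like this at all, because it is started deliberately from the SolarWinds Orion Failover Manager.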
The failover process
When the passive server detects that the active server is no longer running properly, it assumes the role of the active server by initiating a failover and takes the following steps:
- It applies any intercepted updates that are currently saved in the receive queue, that is, the log of update records that have been saved on the passive server, but not applied to the replicated files.
The size of the receive queue influences the length of time it takes to complete the failover process. If the receive queue is large, the system must wait for all of the receive queue updates to be applied before the rest of the process can take place. When there are no more complete update records to be applied, any incomplete update records will be discarded. An update record can only be applied if all earlier update records were applied, and the completion status for the update is in the receive queue.
- The passive server changes its role and mode of operation from passive to active.
The server’s principal identity is enabled. This principal IP address can only be enabled on one of the two servers at any time. When the principal identity is enabled, any clients that were connected to the server before failover will now be able to reconnect.
- The newly active server starts intercepting updates to the protected data. Any updates to the protected data will be saved in the local send queue.
- The newly active server starts all protected applications. The applications will be able to use the replicated application data to recover, and then accept reconnections from any clients. Any updates that the applications make to the protected data will be intercepted and logged.
At this stage, the originally active server is “offline”. The originally passive server has taken over the role of the active server and is running the protected applications. Because the originally active server stopped abruptly, the protected applications may have lost some data. The application clients can reconnect to the application and continue running as before.
- NOTE: During a failover, the data held in the send queue is lost.
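The rule for applying the receive queue in the first step can be sketched as follows. This is a simplified model under stated assumptions, not the product's code: `UpdateRecord`, its `sequence` and `complete` fields, and `drain_receive_queue` are hypothetical names, and the sketch assumes that a complete record queued behind an incomplete one is left unapplied along with it.

```python
from dataclasses import dataclass


@dataclass
class UpdateRecord:
    sequence: int    # assumed: position of the record in the update log
    complete: bool   # True if its completion status reached the receive queue


def drain_receive_queue(queue):
    """Return (applied, not_applied) sequence numbers for a receive queue."""
    applied, not_applied = [], []
    blocked = False
    for record in sorted(queue, key=lambda r: r.sequence):
        # A record is applied only if every earlier record was applied
        # and its own completion status is present.
        if not blocked and record.complete:
            applied.append(record.sequence)
        else:
            blocked = True
            not_applied.append(record.sequence)
    return applied, not_applied


queue = [
    UpdateRecord(1, True),
    UpdateRecord(2, True),
    UpdateRecord(3, False),  # incomplete: nothing from here on is applied
    UpdateRecord(4, True),
]
print(drain_receive_queue(queue))  # ([1, 2], [3, 4])
```

The loop also shows why a large receive queue lengthens the failover: every applicable record is processed before the role change and the later steps can begin.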
How to recover from a failover
A failover has occurred and the secondary server is now running as the active server.
- At this point, check the event logs on both servers to determine the cause of the failover. If you are unsure how to do this, use the Orion Failover Engine Log Collector tool to collect information and send the output to SolarWinds Support. See SWREFID - 1950 How to Retrieve the Orion Failover Engine Logs and Other Useful Information for Support Purposes.
If any of the following has occurred (on the primary server), performing a switchover back to the primary server may not be possible until other important actions are carried out. Orion Failover Engine should not be restarted until these issues have been resolved:
- Hard Disk Failure - Disk may need replacing.
- Power Failure - Power may need to be restored to the primary server.
- Virus - Server should be cleaned of all viruses before starting Orion Failover Engine.
- Communications - Physical network hardware may need to be replaced.
- Blue Screen - Cause should be determined and resolved. This may require you to submit the Blue Screen dump file to SolarWinds Support for analysis.
- Run the Server Configuration wizard and check that the server is set to primary and passive. Click Finish to accept the changes.
- Disconnect the SolarWinds Channel network cables or disable the network card.
- Resolve the problem (see the list of possible failures above).
- Reboot this server, then reconnect the SolarWinds Channel network cables or re-enable the network card.
- After the reboot, check that the Taskbar icon now reflects the changes by showing P / - (primary and passive).
- On the secondary active server or from a remote client, launch the SolarWinds Orion Failover Manager and confirm that the secondary server is reporting as active.
If the secondary server is not displaying as active, follow the steps below:
- If the SolarWinds Orion Failover Manager is unable to connect remotely, try running it locally. If you are still unable to connect locally, check that the service is running via the Service Control Manager. If it is not, check the event logs for a cause.
- Run the Server Configuration wizard and check that the server is set to secondary and active. Click Finish to accept the changes.
- Determine if the protected application is accessible from clients. If it is, then start Orion Failover Engine on the secondary server.
If the application is not accessible, check the application logs to determine why the application is not running.
- Run the Server Configuration wizard and check that the server is set to secondary and active.
- Click Finish to accept any changes. At this stage, you should now be ready to start Orion Failover Engine on the secondary active server.
NOTE: The data on this server should be the most up to date and this server should also be the live server on your network. Once Orion Failover Engine starts, it will overwrite all the protected data (configured in the File Filter list) on the primary passive server. If you are not sure that the data on the active server is 100% up to date, please contact SolarWinds Support. Only go on to the next step if you are sure that you want to overwrite the protected data on the passive server.
- Start Orion Failover Engine on the secondary active server and check that the Taskbar icon now reflects the correct status by showing S / A (secondary and active).
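The checks in this procedure can be summarized in a short sketch. It is only a memory aid, not part of any SolarWinds tool: the function names and the `data_confirmed_current` flag are hypothetical, and the status strings follow the Taskbar notation used above (P / - and S / A).

```python
def taskbar_status(identity: str, role: str) -> str:
    """Format a server's identity and role the way the Taskbar icon shows it."""
    identity_code = {"primary": "P", "secondary": "S"}[identity]
    role_code = {"active": "A", "passive": "-"}[role]
    return f"{identity_code} / {role_code}"


def ready_to_start_engine(primary, secondary, data_confirmed_current: bool) -> bool:
    """Check the preconditions before starting the engine on the secondary.

    `primary` and `secondary` are (identity, role) tuples as set in the
    Server Configuration wizard. `data_confirmed_current` is a hypothetical
    flag meaning you are sure the active server's data is up to date,
    because starting the engine overwrites the protected data on the
    passive server.
    """
    return (
        primary == ("primary", "passive")
        and secondary == ("secondary", "active")
        and data_confirmed_current
    )


print(taskbar_status("primary", "passive"))   # P / -
print(taskbar_status("secondary", "active"))  # S / A
print(ready_to_start_engine(("primary", "passive"), ("secondary", "active"), True))  # True
```

If the last check is false for any reason, the safer course described above is to stop and contact SolarWinds Support before starting Orion Failover Engine.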