In a current project we are looking at several ways to monitor the Oracle Private Cloud Appliance hardware stack. Some metrics, such as memory and CPU usage, are collected in the Grafana environment, but we cannot tell when a compute or management node has a hardware failure. Some hardware events are visible in the Service Enclave User Interface, but there is still no mechanism in place to raise an alert. And since there is still no EM13c plugin for the Oracle Private Cloud Appliance X9-2 (no idea whether one will ever be released, as there was for the X8), we need another way to get notified.
IPMItool
The ipmitool is a powerful command-line utility used to manage and retrieve information from the Integrated Lights Out Manager (ILOM) of Oracle servers. It allows administrators to perform tasks such as monitoring hardware status, managing power settings, and accessing system event logs. By leveraging IPMI (Intelligent Platform Management Interface) commands, ipmitool can interact directly with the ILOM to gather critical server data and perform remote management operations.
And this is what we need.
Node ILOM – Enable IPMI V2.0 Sessions
To use ipmitool from the command line, we first have to enable the IPMI v2.0 session protocol. The ILOM web interfaces of the compute and management nodes have PCA-internal IP addresses. To get access, we must tunnel HTTPS through the management node VIP.
If you don’t enable v2.0, you will get this error when trying to connect to an ILOM:
[root@pcamn01 ~]# ipmitool -I lanplus -H 100.96.0.64 -U root sdr elist full
Password:
Error: Unable to establish IPMI v2 / RMCP+ session
Commands to get the internal ILOM IP addresses of the nodes when connected by SSH to a management node:
-- for compute nodes
# pca-admin compute node list
-- for management nodes
# pca-admin management node list
As a result, you get a list of the internal IP addresses of the ILOMs. Such an IP can be used to create a tunnel. In this example, I open a tunnel from the command shell of my Windows VDI workspace and then access the machine's ILOM from my local browser. The tunnel forwards local port 443 to port 443 on 100.96.0.64 (which is the ILOM address of PCA compute node 01, by the way). Log in to the management node with the root password.
C:\Users\Homer> ssh root@<vip-ip-of-the-management-node> -L 443:100.96.0.64:443
root@<vip-ip-of-the-management-node>'s password:
Last login: Wed May 15 18:51:32 2024 from pcamn01
[root@pcamn01 ~]#
Then open a browser and go to https://127.0.0.1, where the ILOM login screen appears. Log in with the root password.
ILOM Administration :: Management Access :: IPMI: enable the checkbox for v2.0 Sessions and save the settings.
Query the ILOM with the IPMItool
Once enabled, you can use ipmitool from one of the management nodes to gather data from the hardware. Use the -I lanplus parameter, which selects the IPMI v2.0 (RMCP+) interface. Sample command to gather chassis information; the root password is the same as for the ILOM login above:
[root@pcamn01 ~]# ipmitool -I lanplus -H 100.96.0.64 -U root chassis status -v
Password:
System Power         : on
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : always-off
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false
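For an automated check, the chassis status output can be reduced to the fields that indicate a problem. The following is only a minimal sketch: chassis_faults is a hypothetical helper name, and it assumes the "name : value" layout shown above, where healthy fault fields read "false".

```shell
#!/usr/bin/env bash
# Sketch: read `ipmitool ... chassis status` output on stdin and print every
# "... Fault" or "Power Overload" field whose value is not "false".
# chassis_faults is a hypothetical helper, not part of ipmitool.
chassis_faults() {
  awk -F':' '
    /Fault|Overload/ {
      name = $1; value = $2
      gsub(/^[ \t]+|[ \t]+$/, "", name)    # trim padding around the field name
      gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim padding around the value
      if (value != "false") { print name "=" value; found = 1 }
    }
    # Exit status 1 signals "at least one fault field is set".
    END { exit (found ? 1 : 0) }
  '
}
```

It could be used as `ipmitool -I lanplus -H 100.96.0.64 -U root chassis status | chassis_faults`, with the non-zero exit status serving as the trigger for an alert.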
To get the list of events, you can use ipmitool's sunoem cli subcommand, which passes a command to the ILOM command-line interface.
[root@pcamn01 ~]# ipmitool -I lanplus -H 100.96.0.64 -U root sunoem cli "show /SP/logs/event/list"
Password:
Connected. Use ^D to exit.
-> show /SP/logs/event/list
Event
ID     Date/Time                 Class     Type      Severity
-----  ------------------------  --------  --------  --------
289    Mon Apr 22 05:52:02 2024  IPMI      Log       minor
       ID = 116 : 04/22/2024 : 05:52:02 : System Firmware Progress : BIOS : System boot initiated : Asserted
288    Mon Apr 22 05:51:53 2024  IPMI      Log       minor
       ID = 115 : 04/22/2024 : 05:51:53 : System Firmware Progress : BIOS : Hard-disk initialization : Asserted
287    Mon Apr 22 05:51:52 2024  IPMI      Log       minor
       ID = 114 : 04/22/2024 : 05:51:52 : System Firmware Progress : BIOS : Keyboard controller initialization : Asserted
286    Mon Apr 22 05:51:52 2024  IPMI      Log       minor
       ID = 113 : 04/22/2024 : 05:51:52 : System Firmware Progress : BIOS : Video initialization : Asserted
285    Mon Apr 22 05:51:51 2024  IPMI      Log       minor
       ID = 112 : 04/22/2024 : 05:51:51 : System Firmware Progress : BIOS : USB resource configuration : Asserted
284    Mon Apr 22 05:51:48 2024  IPMI      Log       minor
       ID = 111 : 04/22/2024 : 05:51:48 : System Firmware Progress : BIOS : PCI resource configuration : Asserted
283    Mon Apr 22 05:51:42 2024  IPMI      Log       minor
Paused: press any key to continue, or 'q' to quit
Session closed
Disconnected
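Most entries in such a log are routine boot messages, so a check job would probably want only the more severe events. The sketch below filters the event list by severity. ilom_events is a hypothetical helper name, and the default severities (critical, major) are an assumption, since the sample above only contains minor events. It relies on the two-line layout shown above: a header line starting with the numeric event ID and ending with the severity, followed by one description line.

```shell
#!/usr/bin/env bash
# Sketch: read "show /SP/logs/event/list" output on stdin and print only the
# events whose severity matches the regex given as $1.
# ilom_events is a hypothetical helper; the default severities are assumed.
ilom_events() {
  local want="${1:-}"
  [ -n "$want" ] || want='^(critical|major)$'
  awk -v want="$want" '
    { sub(/\r$/, "") }                      # drop a trailing carriage return, if any
    /^Paused:/ { keep = 0; next }           # pager prompt is not an event description
    # Event header: numeric ID first, severity last (9 fields in the sample).
    $1 ~ /^[0-9]+$/ && NF >= 9 {
      keep = (tolower($NF) ~ want)
      if (keep) header = $0
      next
    }
    # The next non-empty line carries the description of a kept event.
    keep && NF > 0 {
      sub(/^[ \t]+/, "")
      print header " | " $0
      keep = 0
    }
  '
}
```

Called without an argument it prints only the assumed critical and major events; `ilom_events minor` would reproduce the entries from the sample. Note that the output above stops at the pager prompt, so only the first page of the log is seen this way.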
Show the temperatures reported by the node's ILOM:
[root@pcamn01 ~]# ipmitool -I lanplus -H 100.96.0.64 -U root sdr type temperature
Password:
T_IN_ZONE01      | 36h | ok  |  7.0 | 43 degrees C
T_IN_ZONE1       | 37h | ok  |  7.0 | 41 degrees C
T_IN_ZONE23      | 38h | ok  |  7.0 | 44 degrees C
T_OUT_ZONE2      | 35h | ok  |  7.0 | 42 degrees C
PS0/T_IN         | D8h | ok  | 10.0 | 40 degrees C
PS0/T_OUT        | E0h | ok  | 10.0 | 44 degrees C
PS1/T_IN         | DFh | ok  | 10.1 | 43 degrees C
PS1/T_OUT        | E1h | ok  | 10.1 | 47 degrees C
T_AMB            | 39h | ok  | 23.0 | 28 degrees C
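To get such readings into a time-series database, the pipe-separated table has to be turned into something machine-readable. As a rough sketch, the function below converts it to InfluxDB line protocol (measurement, tags, then fields). The helper name temps_to_influx, the measurement name ilom_temperature and the tag and field names are all made up for illustration, and rows without a numeric "degrees" reading are simply skipped.

```shell
#!/usr/bin/env bash
# Sketch: convert `ipmitool ... sdr type temperature` output (stdin) into
# InfluxDB line protocol. $1 is used as the value of the "host" tag.
# temps_to_influx, ilom_temperature and the tag/field names are hypothetical.
temps_to_influx() {
  local host="$1"
  awk -F'|' -v host="$host" '
    NF >= 5 {
      name = $1; status = $3; reading = $5
      gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around the sensor name
      gsub(/[ \t]/, "", status)           # status column, e.g. "ok"
      gsub(/[ ,=]/, "_", name)            # avoid characters special in tag values
      split(reading, part, " ")           # "43 degrees C" -> part[1] = "43"
      if (part[1] ~ /^-?[0-9]+(\.[0-9]+)?$/ && part[2] == "degrees")
        printf "ilom_temperature,host=%s,sensor=%s,status=%s celsius=%s\n",
               host, name, status, part[1]
    }
  '
}
```

For the T_AMB row above, `temps_to_influx 100.96.0.64` would emit `ilom_temperature,host=100.96.0.64,sensor=T_AMB,status=ok celsius=28`, which a collector that understands line protocol could ingest.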
Other ILOM CLI commands run through sunoem cli, such as show faulty or show /SP/faultmgmt, are also very helpful for gathering hardware events.
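All commands so far stop at an interactive Password: prompt, which does not work for a scheduled job. ipmitool can instead read the password from the IPMI_PASSWORD environment variable when called with -E. The wrapper below is a small sketch built on that option. The function names ilom and ilom_faults and the ILOM_USER and IPMITOOL variables are hypothetical conveniences, not part of ipmitool or the PCA software.

```shell
#!/usr/bin/env bash
# Sketch: non-interactive wrapper around the ipmitool calls used in this post.
# Assumes IPMI_PASSWORD is exported by the caller (ipmitool reads it with -E).
# ilom, ilom_faults, ILOM_USER and IPMITOOL are hypothetical names.
IPMITOOL="${IPMITOOL:-ipmitool}"    # override to point at another binary

# ilom HOST ARGS... : run any ipmitool command against one ILOM.
ilom() {
  local host="$1"
  shift
  "$IPMITOOL" -I lanplus -H "$host" -U "${ILOM_USER:-root}" -E "$@"
}

# ilom_faults HOST : run the ILOM "show faulty" command via sunoem cli.
ilom_faults() {
  ilom "$1" sunoem cli "show faulty"
}
```

With the password exported, `ilom 100.96.0.64 sdr type temperature` or `ilom_faults 100.96.0.64` can then run from cron without a prompt. Keeping the password in the environment of a root-only job is a trade-off that should be revisited once a dedicated non-root ILOM user exists.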
Next Level
Now we know how to gather ILOM information from a Private Cloud Appliance X9-2 compute or management node. The final goal is to run check jobs on a regular basis on a virtual machine and push the values to Grafana. I am thinking about a combination of Telegraf, InfluxDB and Grafana. And we need a user other than root. But this is a topic for a future blog post 🙂
Links: