Upgrading OS to Oracle Linux 8

The Big Data Service (BDS) cluster OS is upgraded from Oracle Linux 7.9 to Oracle Linux 8.10. All packages in the cluster are upgraded to the equivalent packages built for OL8.

The Leapp utility is used for the upgrade. For more information, see Upgrading Systems With Leapp.

Prerequisites

The OL8 patch is an OS patch (ol8.10-x86_64-2.0.0.0-0.0) that becomes available after the following version prerequisites are fulfilled:

| Version type | Required version before upgrade | Version after upgrade |
| --- | --- | --- |
| bds version | 3.0.29.5 | 3.1.0.2 |
| odh version | 2.0.10 | 2.0.10 |
| os version | ol7.9-x86_64-1.29.1.999-0.0 | ol8.10-x86_64-2.0.0.0-0.0 |

Other OS patch prerequisites also apply. See Updating Big Data Service Clusters.
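As an illustration of the version prerequisites in the table above, a check along the following lines could confirm whether a cluster is eligible. How BDS actually exposes these versions isn't shown here, so the input dict is an assumption for the sake of the example.

```python
# Illustrative prerequisite check based on the version table above.
# The input dict of current versions is an assumption for this sketch.
REQUIRED_BEFORE_UPGRADE = {
    "bds": "3.0.29.5",
    "odh": "2.0.10",
    "os": "ol7.9-x86_64-1.29.1.999-0.0",
}

def missing_prerequisites(current_versions):
    """Return the version types that don't match the required pre-upgrade versions."""
    return [name for name, required in REQUIRED_BEFORE_UPGRADE.items()
            if current_versions.get(name) != required]

# A cluster still on an older bds version doesn't see the OL8 patch yet:
print(missing_prerequisites({
    "bds": "3.0.29.4",
    "odh": "2.0.10",
    "os": "ol7.9-x86_64-1.29.1.999-0.0",
}))  # → ['bds']
```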

OL8 Upgrade Restrictions

The OL8 upgrade doesn't support the following scenarios:

  • FIPS enabled clusters.
  • Clusters with bare metal shape nodes.
  • Clusters with Cloud SQL nodes.

Features Supported for OL8 Upgrade

Dryrun only mode: Allows the customer to dry run the upgrade and then stop, identifying potential risks and blockers before any packages are changed.

The dry run checks prerequisites such as network and storage requirements, and any configuration that would render a node unusable after the upgrade. When the dryrunOnly option is selected, patching stops after the dry run so that the customer can review the warnings and the actual package upgrade plan before any upgrade is performed. If dryrunOnly isn't selected, patching continues when the dry run detects no errors; otherwise patching stops.

After the dry run succeeds, an aggregated report is generated on the MN0 node under /opt/oracle/bds/ospatch/ol7-8-upgrade/dryrun/:

  • leapp-preupgrade-report-aggregated.json: Shows potential risks for the upgrade, classified by severity and aggregated by the nodes affected. These risks need review, but resolving them isn't required unless a risk is classified at the "high (inhibitor)" level. The customer must resolve upgrade inhibitors, and any warnings they deem important.
  • package-diff-aggregated.json: Shows the difference between the actual package upgrade plan and the prepared upgrade plan shown in the Console or API response.
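The exact schema of leapp-preupgrade-report-aggregated.json isn't documented here, so as a sketch, filtering the report for inhibitors might look like the following; the field names ("severity", "title", "nodes") are assumptions for illustration only.

```python
import json

# Assumed, illustrative shape for leapp-preupgrade-report-aggregated.json;
# the real report's schema may differ.
sample_report = json.loads("""
[
  {"severity": "high (inhibitor)", "title": "Unsupported kernel module loaded", "nodes": ["wn2"]},
  {"severity": "medium", "title": "Deprecated configuration option", "nodes": ["mn0", "un0"]}
]
""")

# Inhibitors must be resolved before the upgrade can proceed.
inhibitors = [entry for entry in sample_report
              if entry["severity"] == "high (inhibitor)"]
for entry in inhibitors:
    print(f'{entry["title"]}: {", ".join(entry["nodes"])}')
# → Unsupported kernel module loaded: wn2
```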

Patching by batches of nodes: The OL8 upgrade supports patching in batches, which minimizes downtime and limits the blast radius when patching fails.

On average, one batch of nodes takes an hour to complete patching. Supported configurations are:

| Patching configuration | Patching behaviour | Number of batches | Approximate time to complete patching |
| --- | --- | --- | --- |
| Downtime patching | All nodes are taken down for patching. | 1 | 1 hour |
| Patching by Availability Domain (AD) or Fault Domain (FD) | Patch the following nodes individually in sequence: mn0, un0, wn0, mn1, un1. Then patch the remaining nodes by the AD they're assigned to (or by FD in a single-AD region). | 8 | 9 hours |
| Patching by a specified batch size | Patch the following nodes individually in sequence: mn0, un0, wn0, mn1, un1. Then patch the remaining nodes according to the specified batch size. | Decided by the customer | Not applicable |

Failure or Rollback Behavior

During the OL8 upgrade, the nodes being patched reboot. If any failure happens before the node reboots, an automatic rollback is performed and the node isn't affected. After the reboot, rollback isn't possible.

After patching each batch, BDS performs a health check on the cluster, regardless of whether the batch succeeded. If the number of failed nodes exceeds the customer-specified failure tolerance, patching stops; otherwise patching continues to the next batch.

After the patch work request finishes with no failures, the work request is marked as successful and the cluster bds version is updated to 3.1.0.2.
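The stop-or-continue flow described above can be sketched as follows. The function names, and the idea of a patch_batch callable returning a failed-node count, are assumptions for illustration, not the actual BDS implementation.

```python
def patch_cluster(batches, patch_batch, failure_tolerance):
    """Patch batches in order; stop when failed nodes exceed the tolerance.

    Returns True only when every node patched successfully, matching the
    rule that the work request is marked successful only with no failures.
    """
    failed_nodes = 0
    for batch in batches:
        failed_nodes += patch_batch(batch)  # assumed to return the failed-node count
        if failed_nodes > failure_tolerance:
            return False  # health check exceeds tolerance: patching stops
    return failed_nodes == 0

# With a tolerance of 0, a single failed node in the second batch stops patching:
outcome = patch_cluster(
    batches=[["wn0", "wn1"], ["wn2", "wn3"]],
    patch_batch=lambda batch: 1 if "wn2" in batch else 0,
    failure_tolerance=0,
)
print(outcome)  # → False
```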

If a failure occurs, take action based on the following scenarios:

| Scenario | Action required |
| --- | --- |
| Cluster is healthy | If all nodes are healthy: retrigger patching; patching continues on the nodes that aren't yet updated. If some nodes are unhealthy: retrigger patching; BDS retries patching on the failed nodes to bring them up. If nodes still fail after the retry, remove the nodes from the Oracle Cloud Console and retry patching. |
| Cluster fails | If some nodes fail, fix the issue on the node or remove the nodes from the cluster in the Oracle Cloud Console, and then try to fix the service health in Ambari. If the cluster is fixed, run /home/opc/cloud/flask-microservice/bigdataservice/devops/reset_cluster_state.py to set the cluster to ACTIVE; the cluster becomes active in 10 minutes, and then retry the OS patch. If the cluster can't be fixed, contact Oracle Support. |

OL8 Upgrade Log Locations

| Location | Purpose |
| --- | --- |
| OCI work request log/error | Shows patching progress and errors encountered, at a high level. |
| MN0: /home/opc/cloud/flask-microservice/logs/celery-dataplane.err.log | MN0 orchestrates patching across the cluster. Check this log for orchestration-related errors. |
| On each node: /var/log/leapp/leapp-upgrade.log | Leapp upgrade utility log. |
| On each node: /var/log/leapp/ol8-post-upgrade.log and /opt/oracle/bds/ospatch/ol8.10-x86_64-2.0.0.0-0.0/logs/* | BDS configuration update logs. |