Merge pull request #2065 from swapdisk/lvm_exercises
LVM exercises
IPvSean authored Jan 19, 2024
2 parents 41c3193 + 066426c commit 4f2f71f
Showing 13 changed files with 98 additions and 12,830 deletions.
Binary file modified decks/ansible_ripu.pdf
6 changes: 4 additions & 2 deletions exercises/ansible_ripu/1.5-custom-modules/README.md
@@ -10,7 +10,7 @@
- [Step 1 - What are Custom Modules?](#step-1---what-are-custom-modules)
- [Step 2 - Install a Leapp Custom Actor](#step-2---install-a-leapp-custom-actor)
- [Step 3 - Generate a New Pre-upgrade Report](#step-3---generate-a-new-pre-upgrade-report)
- [Step 4 - Learn More About Developing Leapp Custom Actors](#step-4---learn-more-about-developing-leapp-custom-actors)
- [Step 4 - Learn More About Customizing the In-place Upgrade](#step-4---learn-more-about-customizing-the-in-place-upgrade)
- [Conclusion](#conclusion)

## Optional Exercise
@@ -139,7 +139,9 @@ We are now ready to try running a pre-upgrade report including the checks from o

- Now generate another pre-upgrade report after rebooting. Verify that this inhibitor finding has disappeared with the new report.

### Step 4 - Learn More About Developing Leapp Custom Actors
### Step 4 - Learn More About Customizing the In-place Upgrade

Read the knowledge article [Customizing your Red Hat Enterprise Linux in-place upgrade](https://access.redhat.com/articles/4977891) to understand best practices for handling the upgrade of third-party packages, whether by providing custom repositories for the in-place upgrade or by developing custom actors.
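
As a taste of what that article covers, below is a minimal sketch of making a third-party repository available to Leapp during the upgrade. The repo ID, name, and baseurl are made-up placeholders; follow the article for the exact options and workflow appropriate to your environment.

```
# Illustrative only: register a third-party repo with Leapp so its packages are
# considered during the upgrade. The repo ID and baseurl below are placeholders.
cat > /etc/leapp/files/leapp_upgrade_repositories.repo <<'EOF'
[custom-thirdparty]
name=Custom third-party packages for the target RHEL release
baseurl=https://repo.example.com/thirdparty/el8/$basearch/
enabled=1
gpgcheck=1
EOF
```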

The gritty details of developing Leapp custom actors are beyond the scope of this workshop. Here are some resources you can check out to learn more on your own:

24 changes: 12 additions & 12 deletions exercises/ansible_ripu/2.2-snapshots/README.md
@@ -62,25 +62,29 @@ The following sections explain the pros and cons in detail.

The Logical Volume Manager (LVM) is a set of tools included in RHEL that provide a way to create and manage virtual block devices known as logical volumes. LVM logical volumes are typically used as the block devices from which RHEL OS filesystems are mounted. The LVM tools support creating and rolling back logical volume snapshots. Automating these actions from an Ansible playbook is relatively simple.
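
To make that concrete, here is a minimal sketch of the underlying LVM commands; the volume group, logical volume, and snapshot names and sizes are illustrative, and the workshop automation wraps equivalent steps in Ansible roles.

```
# Create a 2 GiB snapshot of the root logical volume before the upgrade
lvcreate --snapshot --size 2G --name root_snap /dev/rootvg/root

# ...run the in-place upgrade...

# Roll back by merging the snapshot into its origin; for an in-use root volume
# the merge completes when the origin is next activated (typically after a reboot)
lvconvert --merge /dev/rootvg/root_snap

# Or, if the upgrade succeeded and the snapshot is no longer needed, drop it
lvremove /dev/rootvg/root_snap
```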

Logical volumes are contained in a storage pool known as a volume group. The storage available in a volume group comes from one or more physical volumes, that is, block devices underlying actual disks or disk partitions. Typically, the logical volumes where the RHEL OS is installed will be in a "rootvg" volume group. If best practices are followed, logical volumes for applications and app data will be isolated in a separate volume group, "appvg" for example.
> **Note**
>
> The snapshot and rollback automation capability implemented for our workshop lab environment creates LVM snapshots managed using Ansible roles from the [`infra.lvm_snapshots`](https://github.com/swapdisk/infra.lvm_snapshots#readme) collection.
Logical volumes are contained in a storage pool known as a volume group. The storage available in a volume group comes from one or more physical volumes, that is, block devices underlying actual disks or disk partitions. Typically, the logical volumes where the RHEL OS is installed will be in a "rootvg" volume group. If best practices are followed, applications and app data will be isolated in their own logical volumes, either in the same volume group or in a separate volume group, "appvg" for example.

To create logical volume snapshots, there must be free space in the volume group. That is, the total size of the logical volumes in the volume group must be less than the total size of the volume group. The `vgs` command can be used to query volume group free space. For example:

```
# vgs
  VG         #PV #LV #SN Attr   VSize   VFree
  rootvg       1   3   0 wz--n- 950.06g 422.06g
  VG         #PV #LV #SN Attr   VSize   VFree
  VolGroup00   1   7   0 wz--n-  29.53g   9.53g
```

In the example above, the rootvg volume group total size is about 950 GiB and there is about 422 GiB of free space in the volume group. There is plenty of free space to allow for creating snapshot volumes in this volume group.
In the example above, the `VolGroup00` volume group total size is 29.53 GiB and there is 9.53 GiB of free space in the volume group. This should be enough free space to support rolling back a RHEL upgrade.

If there is not enough free space in the volume group, there are a few ways we can make space available:

- Adding another physical volume to the volume group (i.e., `pvcreate` and `vgextend`), as shown in the example after this list. For a VM, you would first configure an additional virtual disk.
- Temporarily removing a logical volume you don't need. For example, on bare metal servers, there is often a large, empty /var/crash filesystem. Removing this filesystem from `/etc/fstab` and then using `lvremove` to remove the logical volume from which it was mounted will free up space in the volume group.
- Reducing the size of one or more logical volumes. This is tricky because first the filesystem in the logical volume needs to be shrunk. XFS filesystems do not support shrinking. EXT filesystems do support shrinking, but not while the filesystem is mounted. This option can be difficult and should only be considered as a last resort, entrusted to a very experienced Linux admin.
- Reducing the size of one or more logical volumes. This is tricky because first the filesystem in the logical volume needs to be shrunk. XFS filesystems do not support shrinking. EXT filesystems do support shrinking, but not while the filesystem is mounted. Until recently, this way of freeing up volume group space was considered a last resort to be attempted by only the most skilled Linux admin, but it is now possible to safely automate shrinking logical volumes using the [`shrink_lv`](https://github.com/swapdisk/infra.lvm_snapshots/tree/main/roles/shrink_lv#readme) role of the aforementioned `infra.lvm_snapshots` collection.
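
For example, the first option above might look like this, assuming the new disk shows up as `/dev/sdb` and the volume group is named `VolGroup00` as in the lab environment:

```
# After attaching a new disk, initialize it as a physical volume and
# add it to the volume group, then confirm the additional free space
pvcreate /dev/sdb
vgextend VolGroup00 /dev/sdb
vgs VolGroup00
```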

After a snapshot is created, COW data will start to utilize the free space of the snapshot logical volume as blocks are written to the origin logical volume. Unless the snapshot is created with the same size as the origin, there is a chance that the snapshot could fill up and become invalid. Testing should be performed during the development of the LVM snapshot automation to determine snapshot sizing with enough cushion to prevent this. The `snapshot_autoextend_percent` and `snapshot_autoextend_threshold` settings in lvm.conf can also be used to reduce the risk of snapshots running out of space.
After a snapshot is created, COW data will start to utilize the free space of the snapshot logical volume as blocks are written to the origin logical volume. Unless the snapshot is created with the same size as the origin, there is a chance that the snapshot could fill up and become invalid. Testing should be performed during the development of the LVM snapshot automation to determine snapshot sizing with enough cushion to prevent this. The `snapshot_autoextend_percent` and `snapshot_autoextend_threshold` settings in lvm.conf can also be used to reduce the risk of snapshots running out of space. The [`lvm_snapshots`](https://github.com/swapdisk/infra.lvm_snapshots/tree/main/roles/lvm_snapshots#readme) role of the `infra.lvm_snapshots` collection supports variables that may be used to automatically configure the autoextend settings.
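
For example, you might check and then adjust these settings as sketched below; the threshold and percent values here are only illustrative starting points, not recommendations.

```
# Show the current autoextend settings
lvmconfig activation/snapshot_autoextend_threshold activation/snapshot_autoextend_percent

# In /etc/lvm/lvm.conf, settings like these grow a snapshot by 20% of its
# size once it becomes 70% full:
#   snapshot_autoextend_threshold = 70
#   snapshot_autoextend_percent = 20
```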

Unless you have the luxury of creating snapshots with the same size as their origin volumes, LVM snapshot sizing needs to be thoroughly tested and free space usage carefully monitored. However, if that challenge can be met, LVM snapshots offer a reliable snapshot solution without the headache of depending on external infrastructure such as VMware.

@@ -102,13 +106,9 @@ VMware snapshots work great when they can be automated. If you are considering t

Amazon Elastic Block Store (Amazon EBS) provides the block storage volumes used for the virtual disks attached to AWS EC2 instances. When a snapshot is created for an EBS volume, the COW data is written to Amazon S3 object storage.

> **Note**
>
> The snapshot and rollback automation capability implemented for our workshop lab environment uses EBS snapshots.
While EBS snapshots operate independently from the guest OS running on the EC2 instance, the similarity to VMware snapshots ends there. An EBS snapshot saves the data of the source EBS volume, but does not save the state or memory of the EC2 instance to which the volume is attached. Also unlike with VMware, EBS snapshots can be created for an OS volume only while leaving any separate application volumes as is.

Automating EBS snapshot creation and rollback is fairly straightforward assuming your playbooks can access the required AWS APIs. The tricky bit of the automation is identifying the EC2 instance and attached EBS volume that correspond to the target host in the Ansible inventory managed by AAP. For the snapshot automation we implemented for our workshop lab environment, we solved this by setting tags on our EC2 instances.
Automating EBS snapshot creation and rollback is fairly straightforward assuming your playbooks can access the required AWS APIs. The tricky bit of the automation is identifying the EC2 instance and attached EBS volume that correspond to the target host in the Ansible inventory managed by AAP, but this can be solved by setting identifying tags on your EC2 instances.
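
As an illustration of that approach, the commands below use a made-up `Name` tag value to look up an instance's root EBS volume and then snapshot it. The tag value, device name, and volume ID are placeholders, and a playbook would typically do the equivalent with Ansible's AWS modules rather than the CLI.

```
# Look up the root EBS volume of the instance tagged Name=cute-bedbug (placeholder tag value)
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=cute-bedbug" \
  --query 'Reservations[].Instances[].BlockDeviceMappings[?DeviceName==`/dev/sda1`].Ebs.VolumeId' \
  --output text

# Snapshot that volume before the upgrade (volume ID is a placeholder)
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
  --description "pre-upgrade snapshot"
```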

#### Break Mirror

@@ -128,7 +128,7 @@ Read the article [ReaR: Backup and recover your Linux server with confidence](ht

### Step 3 - Snapshot Scope

The best practice for allocating the local storage of a RHEL server is to configure volumes that separate the OS from the apps and app data. For example, the OS filesystems would be under a "rootvg" volume group while the apps and app data would be in an "appvg" volume group. This separation helps isolate the storage usage requirements of these two groups so they can be managed based on their individual requirements and are less likely to impact each other. For example, the backup profile for the OS is likely different than for the apps and app data.
The best practice for allocating the local storage of a RHEL server is to configure volumes that separate the OS from the apps and app data. For example, the OS filesystems would be under a "rootvg" volume group while the apps and app data would be in an "appvg" volume group or at least in their own dedicated logical volumes. This separation helps isolate the storage usage requirements of these two groups so they can be managed based on their individual requirements and are less likely to impact each other. For example, the backup profile for the OS is likely different than for the apps and app data.
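
For example, on a server following this practice, `lvs` might report a layout like this; the names and sizes are illustrative.

```
# lvs -o vg_name,lv_name,lv_size
  VG     LV      LSize
  rootvg root    10.00g
  rootvg var     10.00g
  rootvg swap     4.00g
  appvg  applv   50.00g
  appvg  datalv 100.00g
```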

This practice helps to enforce a key tenet of the RHEL in-place upgrade approach: the OS upgrade should leave the applications untouched, with the expectation that system library forward compatibility and middleware runtime abstraction reduce the risk of the RHEL upgrade impacting app functionality.

41 changes: 2 additions & 39 deletions exercises/ansible_ripu/3.1-rm-rf/README.md
@@ -51,32 +51,7 @@ In the next exercise, we will be rolling back the RHEL upgrade on one of our ser

Verify you see a root prompt like the example above.

### Step 2 - Choose your Poison

The `rm -rf /*` command appears frequently in the urban folklore about Unix disasters. The command recursively and forcibly tries to delete every directory and file on a system. When it is run with root privileges, this command will quickly break everything on your pet app server and render it unable to reboot ever again. However, there are much less spectacular ways to mess things up.

Mess up your app server by choosing one of the following suggestions or dream up your own.

#### Delete everything

- As mentioned already, `rm -rf /*` can be fun to try. Expect to see lots of warnings and error messages. Even with root privileges, there will be "permission denied" errors because of read-only objects under pseudo-filesystems like `/proc` and `/sys`. Don't worry, irreparable damage is still being done.

You might be surprised that you will get back to a shell prompt after this. While all files have been deleted from the disk, already running processes like your shell will continue to be able to access any deleted files to which they still have an open file descriptor. Built-in shell commands may even still work, but most commands will result in a "command not found" error.

If you want to reboot the instance to prove that it will not come back up, you will not be able to use the `reboot` command; however, `echo b > /proc/sysrq-trigger` might work.

#### Uninstall glibc

- The command `rpm -e --nodeps glibc` will uninstall the glibc package, removing the standard C library upon which all other libraries depend. The damage done by this command is just as bad as the previous example, but without all the drama. This package also provides the dynamic linker/loader, so now commands will fail with errors like this:

```
[root@cute-bedbug ~]# reboot
-bash: /sbin/reboot: /lib64/ld-linux-x86-64.so.2: bad ELF interpreter: No such file or directory
```

If you want to do a `reboot` command, use `echo b > /proc/sysrq-trigger` instead.

#### Break the application
### Step 2 - Break your application

- In [Exercise 1.6: Step 5](../1.6-my-pet-app/README.md#step-5---run-another-pre-upgrade-report), we observed a pre-upgrade finding warning of a potential risk that our `temurin-17-jdk` 3rd-party JDK runtime package might be removed during the upgrade in case it had unresolvable dependencies. Of course, we know this did not happen because our pet app is still working perfectly.

@@ -97,23 +72,11 @@ Mess up your app server by choosing one of the following suggestions or dream up

This is a realistic example of application impact that can be reversed by rolling back the upgrade.

#### Wipe the boot record

- The `dd if=/dev/zero of=/sys/block/* count=1` command will clobber the master boot record of your instance. It's rather insidious because you will see that everything continues to function perfectly after running this command, but after you do a `reboot` command, the instance will not come back up again.

#### Fill up your disk

- Try the `while fallocate -l9M $((i++)); do true; done; yes > $((i++))` command. While there are many ways you can consume all the free space in a filesystem, this command gets it done in just a couple seconds. Use a `df -h /` command to verify your root filesystem is at 100%.

#### Set off a fork bomb

- The shell command `:(){ :|:& };:` will trigger a [fork bomb](https://en.wikipedia.org/wiki/Fork_bomb). When this is done with root privileges, system resources will be quickly exhausted resulting in the server entering a "hung" state. Use the fork bomb if you want to demonstrate rolling back a server that has become unresponsive.

## Conclusion

Congratulations, you have trashed one of your app servers. Wasn't that fun?

In the next exercise, you will untrash it by rolling back.
In the next exercise, you will untrash it by rolling back the upgrade.

---

14 changes: 8 additions & 6 deletions exercises/ansible_ripu/3.2-rollback/README.md
@@ -6,7 +6,7 @@
- [Table of Contents](#table-of-contents)
- [Objectives](#objectives)
- [Guide](#guide)
- [Step 1 - Launch the Rollback Workflow Job Template](#step-1---launch-the-rollback-workflow-job-template)
- [Step 1 - Launch the Rollback Job Template](#step-1---launch-the-rollback-job-template)
- [Step 2 - Observe the Rollback Job Output](#step-2---observe-the-rollback-job-output)
- [Step 3 - Check the RHEL Version](#step-3---check-the-rhel-version)
- [Conclusion](#conclusion)
@@ -26,21 +26,23 @@ We are now here in our exploration of the RHEL in-place automation workflow:

After rolling back, the pet app server will be restored to the state it was in just before entering the upgrade phase of the workflow.

### Step 1 - Launch the Rollback Workflow Job Template
### Step 1 - Launch the Rollback Job Template

In this step, we will be rolling back the RHEL in-place upgrade on one of our pet application servers.

- Return to the AAP Web UI tab in your web browser. Navigate to Resources > Templates and then open the "AUTO / 03 Rollback" job template. Here is what it looks like:

![AAP Web UI showing the rollback job template details view](images/rollback_template.svg)

- Click the "Launch" button which will bring up a the survey prompt. We only want to do a rollback of one server. To do this, choose the "ALL_rhel" option under "Select inventory group" and then enter the hostname of your chosen pet app server under the "Enter server name" prompt. For example:
- Click the "Launch" button which will bring up the prompts for submitting the job starting with the limit and variables prompts. We only want to do a rollback of one server. To do this, enter the hostname of your chosen pet app server under the "Limit" prompt. For example:

![AAP Web UI showing the rollback job survey prompt](images/rollback_survey.svg)
![AAP Web UI showing the rollback job limit and variables prompts](images/rollback_prompts.svg)

Click the "Next" button to proceed.

- Next you will see the job preview prompt, for example:
![AAP Web UI showing the rollback job survey prompt](images/rollback_survey.svg)

- Next we see the job template survey prompt asking us to select an inventory group. We already limited the job to one server, so just choose the "ALL_rhel" option and click the "Next" button. This will bring you to the preview of the selected job options and variable settings, for example:

![AAP Web UI showing the rollback job preview prompt](images/rollback_preview.svg)

@@ -56,7 +58,7 @@ After launching the rollback playbook job, the AAP Web UI will navigate automati

![Rollback job "PLAY RECAP" as seen at the end of the job output](images/rollback_job_recap.svg)

Notice in the example above, rolling back was done in just under 2 minutes.
Notice in the example above, we see the job completed in just under 3 minutes. However, most of that time was spent in the final "Wait for the snapshot to drain" task which holds the job until the snapshot merges finish in the background. The instance was actually rolled back and service ready in just under a minute. Impressive, right?
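
If you are curious about what is happening during that final task, here is a hedged example of checking the merge progress from the rolled-back host; `VolGroup00` matches the earlier `vgs` example, and your volume group name may differ.

```
# Watch the background snapshot merge drain; Cpy%Sync reports merge progress
lvs -a -o lv_name,lv_attr,copy_percent VolGroup00
```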

### Step 3 - Check the RHEL Version
