Enhancement for alarms #215

pixelsoccupied · 2024-09-24T17:13:08Z

This enhancement proposes re-architecting the Alarm server as specified in the O-RAN InfrastructureMonitoring Service API spec.

Notable changes include:

Data returned from API calls closely follows what O-RAN defines
Dynamically checking for and gathering cluster resources, such as PrometheusRule, during init
Combining servers into the same code base
Introducing persistent storage via Postgres

This enhancement includes all the code tested during the spike, the k8s resources needed to deploy through the operator, and other libraries/tools that can be used to develop this quickly.

co-authored with @browsell and @Jennifer-chen-rh

openshift-ci · 2024-09-24T17:13:12Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md

Jennifer-chen-rh

General question about the DB serial number: does it increase when a DB row is inserted, or when a DB row is updated?

docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md

pixelsoccupied · 2024-09-25T14:22:28Z

@Jennifer-chen-rh on entry. More on SERIAL datatypes: https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-SERIAL. But we can have a custom type that is incremented on insertion or update.

Jennifer-chen-rh

@pixelsoccupied aha, I felt something was missing in the PR. Now I've figured out that the section we discussed about aging out history table entries is not here.

openshift-ci · 2024-09-25T22:06:26Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jennifer-chen-rh. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Jennifer-chen-rh

@pixelsoccupied

The data structures are still public
The alarm notification event object structure is missing

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

…deploying

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

…rver and add david as reviewer

mlguerrero12 · 2024-10-22T14:23:05Z

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

+For a given OCP release, the alarmDefinitions and probableCauses are fixed, so these can be built up front. For CaaS alarms only one resource type, “NodeCluster”, all alarms map to it.
+
+1. Query all managed clusters to get list of unique major.minor versions
+   - Need to monitor for major.minor new versions


I think that instead of relying on restarts, we could define a communication channel between the Resource server and the Alarm server. We could define some internal endpoints for the alarm dictionary here, and the Resource server can send operations that follow the lifecycle of a ResourceType. If a ResourceType is created, a Post here can instruct the alarm server to follow the steps you mention below to add a new dictionary for this new ResourceType. It can also delete and update accordingly. This better covers what is mentioned in 3.7.9 O-RAN USE-CASES.

Yeah, that's a great point! We touched on this briefly a few weeks ago, but we can start with this approach for now and maybe move this code to the resource server eventually? Basically someone has to get the managedCluster CR and read its content... and in this case it's us for now.

But yeah, I agree... ultimately the Resource server should watch/get cluster resources, and we (alarms) should simply be notified about the change and update our DB as needed.

On the other hand, the resource server exposes endpoints that need to return a resourceType with its alarm dictionary, right? Does that mean we need to expose an additional internal GET endpoint for it to retrieve that (assuming we are not sharing the table)? Probably a question for #262

cc @bartwensley @browsell @alegacy

No, I'm assuming we share the same table. What I meant was to have internal endpoints in the Alarm server so that the Resource server can notify operations on ResourceTypes. For instance, if a new ResourceType is added, then they could send us a Post (with all needed information) for us to determine whether we need to add a new dictionary for this new resource type. If so, then we inspect the managed cluster and get the rules to add a new dictionary. Same for Delete and Update. The idea is to modify the DB accordingly. They still need to go to the DB whenever a query on a resource type is performed.

What I'm looking for with this is a more dynamic way to update the dictionaries instead of relying only on restarts.

I've updated my PR with info related to our discussion with respect to the interactions between the two components.
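
To make the proposal above concrete, here is a minimal sketch of what such an internal endpoint on the Alarm server might look like. Everything in it is an assumption for illustration only: the path `/internal/v1/resource-types`, the `resourceTypeEvent` payload, and the in-memory `dictionaryStore` (which stands in for the shared DB table) are hypothetical names, not part of the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// resourceTypeEvent is a hypothetical payload the Resource server could send
// when a ResourceType is created, updated, or deleted.
type resourceTypeEvent struct {
	ResourceTypeID string `json:"resourceTypeId"`
	Version        string `json:"version"`
}

// dictionaryStore stands in for the shared DB table in this sketch.
// It maps a resource type ID to the version its dictionary was built for.
type dictionaryStore struct {
	mu    sync.Mutex
	dicts map[string]string
}

func newDictionaryStore() *dictionaryStore {
	return &dictionaryStore{dicts: make(map[string]string)}
}

// ServeHTTP accepts POST/PUT (create or update) and DELETE notifications.
func (s *dictionaryStore) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodPost, http.MethodPut, http.MethodDelete:
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	var ev resourceTypeEvent
	if err := json.NewDecoder(r.Body).Decode(&ev); err != nil || ev.ResourceTypeID == "" {
		http.Error(w, "invalid resource type event", http.StatusBadRequest)
		return
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if r.Method == http.MethodDelete {
		delete(s.dicts, ev.ResourceTypeID)
	} else {
		// A real server would inspect a managed cluster here and parse its
		// PrometheusRules before persisting the dictionary.
		s.dicts[ev.ResourceTypeID] = ev.Version
	}
	w.WriteHeader(http.StatusNoContent)
}

// has reports whether a dictionary exists for the given resource type.
func (s *dictionaryStore) has(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.dicts[id]
	return ok
}

// send drives the handler in-process and returns the HTTP status code.
func send(h http.Handler, method, body string) int {
	req := httptest.NewRequest(method, "/internal/v1/resource-types", strings.NewReader(body))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	store := newDictionaryStore()
	body := `{"resourceTypeId":"rt-1","version":"4.16"}`
	fmt.Println(send(store, http.MethodPost, body), store.has("rt-1"))
	fmt.Println(send(store, http.MethodDelete, body), store.has("rt-1"))
}
```

With the shared-table assumption, the Resource server would still read dictionaries straight from the DB; the endpoint only tells the Alarm server when to rebuild or drop one.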

mlguerrero12 · 2024-10-22T14:51:19Z

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

+    `entity_type` data should be coming from Inventory API but for now we can hard-code it. `telco-model-OpenShift-<Full Version>`
+- `alarm_definitions` reflects Rules in PromRule CR. We only grab the full set based on unique entries in `Versions` table
+    - Use ACM to get credentials of the unique major.minor clusters and retrieve all the PrometheusRules from them to parse. 
+      E.g if we are managing 3 clusters 4.16.2, 4.17.2 and 4.16.8, Pick 4.16.8 and 4.17.2 which effectively represents all the rules in 4.16.z and 4.17.z clusters. 


There is a possibility that cluster 4.16.2 has some custom rules (added manually). These won't be added to the dictionary, but alertmanager will still receive alerts from them. We could still create an alarm record and send a notification for these; some fields will be empty. What do you think of this?
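
For reference, the version selection described in the quoted hunk (pick one representative cluster per major.minor stream) could be sketched roughly as follows. The function name `latestPerMinor` is hypothetical, and the sketch assumes plain `major.minor.patch` strings; anything else, such as pre-release tags, is skipped.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// latestPerMinor returns, for each major.minor stream, the version with the
// highest patch number. Inputs that are not major.minor.patch are ignored.
func latestPerMinor(versions []string) []string {
	best := make(map[string]int) // "major.minor" -> highest patch seen
	for _, v := range versions {
		parts := strings.Split(v, ".")
		if len(parts) != 3 {
			continue
		}
		patch, err := strconv.Atoi(parts[2])
		if err != nil {
			continue
		}
		key := parts[0] + "." + parts[1]
		if cur, ok := best[key]; !ok || patch > cur {
			best[key] = patch
		}
	}
	out := make([]string, 0, len(best))
	for key, patch := range best {
		out = append(out, key+"."+strconv.Itoa(patch))
	}
	sort.Strings(out) // lexical order, only to keep the output deterministic
	return out
}

func main() {
	fmt.Println(latestPerMinor([]string{"4.16.2", "4.17.2", "4.16.8"}))
}
```

Note that this is exactly where the custom-rule concern arises: rules present only on 4.16.2 never reach the dictionary because 4.16.8 is the cluster that gets inspected.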

bartwensley · 2024-10-22T20:24:52Z

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

+### Jobs to Initialize Alarm Server DB
+We need two Jobs that can help with DB
+
+- One job that creates a Database using `CREATE DATBASE` cmd


nit: spelling "DATBASE"

alegacy · 2024-10-28T16:17:08Z

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

+-- Table: alarm_dictionary
+DROP TABLE IF EXISTS alarm_dictionary CASCADE;
+CREATE TABLE alarm_dictionary (
+          resource_type_id UUID PRIMARY KEY,


I realize that there should be only 1 tuple per resource type, but we should still maintain a separate UUID for the "alarm_dictionary_id" to decouple those two tables.

Addressing all the new comments for DB in #294 (keeping everything here unresolved until then)

alegacy · 2024-10-28T16:18:31Z

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

+           management_interface_id ManagementInterfaceID[] DEFAULT ARRAY['O2IMS']::ManagementInterfaceID[],
+           pk_notification_field TEXT[] DEFAULT ARRAY['alarm_definition_id']::TEXT[],
+           created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
+           CONSTRAINT fk_alarm_definitions_version FOREIGN KEY (alarm_dictionary_version) REFERENCES versions(version_number),


This method of linking these two tables seems a bit more complicated than necessary. I would prefer to see a column in this table for alarm_dictionary_id and a FK constraint pointing directly to the other table since these are 1:N rather than M:N.
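
As a rough illustration of the 1:N shape being suggested, the relationship could be modeled in Go as below. This is only a sketch under assumptions: the struct and field names are hypothetical stand-ins for the table columns, and `definitionsByDictionary` merely mimics what a foreign key constraint on `alarm_dictionary_id` would enforce.

```go
package main

import "fmt"

// AlarmDictionary has its own ID, decoupled from the resource type it describes.
type AlarmDictionary struct {
	ID             string // alarm_dictionary_id
	ResourceTypeID string
	Version        string
}

// AlarmDefinition points directly at its parent dictionary (1:N).
type AlarmDefinition struct {
	ID                string
	AlarmDictionaryID string // would be a foreign key to AlarmDictionary.ID
	Name              string
}

// definitionsByDictionary groups definitions under their parent dictionary and
// rejects any definition that references an unknown dictionary, which is the
// check a foreign key constraint would perform in the database.
func definitionsByDictionary(dicts []AlarmDictionary, defs []AlarmDefinition) (map[string][]AlarmDefinition, error) {
	grouped := make(map[string][]AlarmDefinition, len(dicts))
	for _, d := range dicts {
		grouped[d.ID] = nil
	}
	for _, def := range defs {
		if _, ok := grouped[def.AlarmDictionaryID]; !ok {
			return nil, fmt.Errorf("definition %q references unknown dictionary %q", def.ID, def.AlarmDictionaryID)
		}
		grouped[def.AlarmDictionaryID] = append(grouped[def.AlarmDictionaryID], def)
	}
	return grouped, nil
}

func main() {
	dicts := []AlarmDictionary{{ID: "dict-1", ResourceTypeID: "rt-1", Version: "4.16"}}
	defs := []AlarmDefinition{
		{ID: "def-1", AlarmDictionaryID: "dict-1", Name: "KubeNodeNotReady"},
		{ID: "def-2", AlarmDictionaryID: "dict-1", Name: "KubePodCrashLooping"},
	}
	grouped, err := definitionsByDictionary(dicts, defs)
	fmt.Println(len(grouped["dict-1"]), err)
}
```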

alegacy · 2024-10-28T16:19:14Z

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

+          alarm_dictionary_schema_version VARCHAR(50) DEFAULT 'TBD-O-RAN-DEFINED' NOT NULL,
+          entity_type VARCHAR(255) NOT NULL,
+          vendor VARCHAR(255) NOT NULL,
+          management_interface_id ManagementInterfaceID[] DEFAULT ARRAY['O2IMS']::ManagementInterfaceID[],


management_interface_id is defined as a list in the spec, but we will only ever produce rows with 'O2IMS' as a value so there is no need to store it as an array.

alegacy · 2024-10-28T16:20:38Z

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

+           alarm_last_change VARCHAR(50) NOT NULL,
+           alarm_description TEXT NOT NULL,
+           proposed_repair_actions TEXT NOT NULL,
+           alarm_dictionary_version VARCHAR(50) NOT NULL, -- Links alarm_dictionary and alarm_definitions


alarm_dictionary_version should be replaced by alarm_dictionary_id as mentioned below (see constraint)

alegacy · 2024-10-28T16:21:11Z

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

+           alarm_additional_fields JSONB,
+           alarm_change_type AlarmLastChangeType DEFAULT 'added' NOT NULL,
+           clearing_type ClearingType DEFAULT 'automatic' NOT NULL,
+           management_interface_id ManagementInterfaceID[] DEFAULT ARRAY['O2IMS']::ManagementInterfaceID[],


similar comment to above. This will only ever be "O2IMS" so no need to store as array.

mlguerrero12 · 2024-11-05T11:12:17Z

docs/enhancements/infrastructure-monitoring-service-api/alarms.md

+6. Apply the required CR to activate the internal endpoint for alertmanager notification. See [here](#steps-for-internalv1caas-alertsalertmanager) for more.
+
+7. The server should now be ready to take requests. 
+


I know it is implicit above, but I think we should clearly say somewhere that we assume the rules of all resources of a specific resource type are the same. This means we won't see rules added to a specific resource unless that resource happens to be the one selected for inspection (not deterministic).

Also, inspired by #262, we could/should implement a sync mechanism for the alarm dictionary. This way we could overcome the limitation of not seeing newly added (or deleted) rules. In the documentation, we could say that once the system is up, rule modifications should be applied to all resources of a specific resource type and that the system will pick up these changes every X (e.g. 5 minutes).
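
A periodic sync along these lines might look roughly like the sketch below. It is not the #262 mechanism; the names `diffRules` and `syncLoop`, and the `fetch`/`apply` callbacks, are hypothetical, and rules are reduced to plain name strings for brevity.

```go
package main

import (
	"context"
	"fmt"
	"sort"
	"time"
)

// diffRules compares the rule names already in the dictionary with the names
// observed on the inspected cluster and reports what was added or removed.
func diffRules(known, observed []string) (added, removed []string) {
	knownSet := make(map[string]bool, len(known))
	for _, r := range known {
		knownSet[r] = true
	}
	observedSet := make(map[string]bool, len(observed))
	for _, r := range observed {
		observedSet[r] = true
	}
	for r := range observedSet {
		if !knownSet[r] {
			added = append(added, r)
		}
	}
	for r := range knownSet {
		if !observedSet[r] {
			removed = append(removed, r)
		}
	}
	sort.Strings(added)
	sort.Strings(removed)
	return added, removed
}

// syncLoop calls fetch on every tick until ctx is cancelled and hands any
// difference from the previous snapshot to apply.
func syncLoop(ctx context.Context, interval time.Duration, fetch func() []string, apply func(added, removed []string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	var known []string
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			observed := fetch()
			added, removed := diffRules(known, observed)
			if len(added) > 0 || len(removed) > 0 {
				apply(added, removed)
			}
			known = observed
		}
	}
}

func main() {
	// Short interval and timeout so the demo exits quickly; a real deployment
	// would use something like the 5 minutes mentioned above.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	fetch := func() []string { return []string{"KubeNodeNotReady"} }
	syncLoop(ctx, 10*time.Millisecond, fetch, func(added, removed []string) {
		fmt.Println("added:", added, "removed:", removed)
	})
}
```

The limitation raised above still applies: the loop only sees rules on whichever cluster `fetch` inspects, so the documentation would need to state that rule changes must be applied to all resources of a resource type.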

openshift-ci · 2024-11-29T02:05:32Z

@pixelsoccupied: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | `c543980` | link | true | `/test markdownlint` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-merge-robot · 2024-11-29T02:05:43Z

PR needs rebase.

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 24, 2024

Jennifer-chen-rh reviewed Sep 24, 2024