From e60ffbc57163ed79d7a77fed47c2ea18a83329d2 Mon Sep 17 00:00:00 2001
From: Nahian Pathan
Date: Tue, 3 Sep 2024 16:31:25 -0400
Subject: [PATCH] Enhancement for alarms

---
 .gitignore    |   3 +
 .../alarms.md | 667 ++++++++++++++++++
 2 files changed, 670 insertions(+)
 create mode 100644 docs/enhancements/infrastructure-monitoring-service-api/alarms.md

diff --git a/.gitignore b/.gitignore
index dac0af27..7912c37a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -6,3 +6,6 @@
 # Folders
 .idea/
+
+# macOS auto generated
+.DS_Store
diff --git a/docs/enhancements/infrastructure-monitoring-service-api/alarms.md b/docs/enhancements/infrastructure-monitoring-service-api/alarms.md
new file mode 100644
index 00000000..0267acb4
--- /dev/null
+++ b/docs/enhancements/infrastructure-monitoring-service-api/alarms.md
@@ -0,0 +1,667 @@
+---
+title: lifecycle-of-infrastructure-monitoring-alarms-api
+authors:
+  - @browsell
+  - @Jennifer-chen-rh
+  - @pixelsoccupied
+reviewers: # Include a comment about what domain expertise a reviewer is expected to bring and what area of the enhancement you expect them to focus on. For example: - "@networkguru, for networking aspects, please look at IP bootstrapping aspect"
+  - TBD
+approvers: # A single approver is preferred; the role of the approver is to raise important questions, help ensure the enhancement receives reviews from all applicable areas/SMEs, and determine when consensus is achieved such that the EP can move forward to implementation. Having multiple approvers makes it difficult to determine who is responsible for the actual approval.
+  - @browsell
+api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None"
+  - TBD
+creation-date: 2024-08-26
+last-updated: yyyy-mm-dd
+tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
+  - TBD
+see-also:
+  - "None"
+replaces:
+  - "None"
+superseded-by:
+  - "None"
+---
+
+# Lifecycle of infrastructure monitoring alarms API
+
+# Table of Contents
+- [Summary](#summary)
+- [Goals](#goals)
+- [Key o-ran data structures](#key-o-ran-data-structures)
+- [InfrastructureMonitoring Service API](#infrastructure-monitoring-service-alarms-api)
+- [Database schema](#schema)
+- [Init behaviour](#init)
+- [Ready behaviour](#ready)
+  - [Find AlarmDefinitionID and ProbableCauseID from current Alerts](#for-a-given-resourcetypeid-and-alarmname-coming-from-am-alert-find-the-alarmdefinitionid-and-probablecauseid)
+  - [Notification tracking](#notification-tracking)
+  - [Cleaning historical data](#daily-archive-cleanup)
+- [Kubernetes](#k8s-resources)
+- [Tooling](#tooling-and-general-dev-guidelines)
+- [Future Updates](#future-updates)
+
+## Summary
+
+`o-ran` requires an `InfrastructureMonitoring Service API`, a collection of APIs that clients can query to
+monitor the health of the `o-cloud`. This enhancement describes the initialization and ready phases of the `InfrastructureMonitoring Service API`.
+
+More specifically, we will describe everything needed for `Alarms`.
+
+At a high level, this service can be viewed as a thin wrapper around the ACM observability stack that translates
+OCP cluster resources into data structures understood and defined by the `o-ran` spec. Among other things,
+the service exposes APIs, configures the Alertmanager deployment, reads PrometheusRules from managed clusters and finally
+stores data in persistent storage.
+
+### Goals
+- Define the steps to initialize the service and, once ready, to serve API calls
+- Define the database schema
+- Define K8s CRs
+- Define developer tools
+
+## Key o-ran data structures
+`InfrastructureMonitoring Service API Alarms` primarily deals with the following o-ran data structures during initialization.
+The comments for each attribute are taken from the o-ran spec doc.
+
+Please note that this is not an exhaustive list, but it should help the reader get a feel for the Alarm-specific data we are dealing with.
+
+- AlarmDictionary
+
+  This is primarily the link between Alarms and Inventory. A ResourceType (currently we are mostly dealing with type "cluster") can have exactly one AlarmDictionary.
+
+  ```go
+  // 3.2.6.2.8-1
+  type AlarmDictionary struct {
+      AlarmDictionaryVersion string `json:"alarmDictionaryVersion"` // M, 1, Version of the Alarm Dictionary. Version is vendor defined such that the version of the dictionary can be associated with a specific version of the software delivery of this product.
+      AlarmDictionarySchemaVersion string `json:"alarmDictionarySchemaVersion"` // M, 1, Version of the Alarm Dictionary Schema to which this alarm dictionary conforms. Note: The specific value for this should be defined in the IM/DM specification for the Alarm Dictionary Model Schema when it is published at a future date
+      EntityType string `json:"entityType"` // M, 1, O-RAN entity type emitting the alarm: This shall be unique per vendor ResourceType.model and ResourceType.version
+      Vendor string `json:"vendor"` // M, 1, Vendor of the Entity Type to whom this Alarm Dictionary applies. This should be the same value as in the ResourceType.vendor attribute.
+      ManagementInterfaceID []ManagementInterfaceID `json:"managementInterfaceId"` // M, 1..N, List of management interfaces over which alarms are transmitted for this Entity Type. RESTRICTION: For the O-Cloud IMS Services this value is limited to O2IMS.
+      PKNotificationField []string `json:"pkNotificationField"` // M, 1..N, Identifies which field or list of fields in the alarm notification contains the primary key (PK) into the Alarm Dictionary for this interface; i.e. which field contains the Alarm Definition ID.
+      AlarmDefinition []AlarmDefinition `json:"alarmDefinition"` // M, 1..N, List of alarms that can be detected against this ResourceType
+  }
+  ```
+
+- AlarmDefinition
+
+  AlarmDefinition stores a rule and its metadata, which is continuously evaluated to determine whether an alert should fire for a Resource Type. For caas, this is effectively the content of `PrometheusRules`.
+
+  ```go
+  // 3.2.6.2.9-1
+  type AlarmDefinition struct {
+      AlarmDefinitionID uuid.UUID `json:"alarmDefinitionID"` // M, 1, Provides a unique identifier of the alarm being raised. This is the Primary Key into the Alarm Dictionary.
+      AlarmName string `json:"alarmName"` // M, 1, Provides a short name for the alarm.
+      AlarmLastChange string `json:"alarmLastChange"` // M, 1, Indicates the Alarm Dictionary Version in which this alarm last changed.
+      AlarmChangeType AlarmLastChangeType `json:"alarmChangeType"` // M, 1, Indicates the type of change that occurred during the alarm last change; added, deleted, modified.
+      AlarmDescription string `json:"alarmDescription"` // M, 1, Provides a longer descriptive meaning of the alarm condition and a description of the consequences of the alarm condition. This is intended to be read by an operator to give an idea of what happened and a sense of the effects, consequences, and other impacted areas of the system
+      ProposedRepairActions string `json:"proposedRepairActions"` // M, 1, Provides guidance for proposed repair actions.
+      ClearingType ClearingType `json:"clearingType"` // M, 1, Whether the alarm is cleared automatically or manually
+      ManagementInterfaceID []ManagementInterfaceID `json:"managementInterfaceId,omitempty"` // M, 0..N, List of management interfaces over which alarms are transmitted for this Entity Type. RESTRICTION: For the O-Cloud IMS Services this value is limited to O2IMS.
+      PKNotificationField []string `json:"pkNotificationField,omitempty"` // M, 0..N, Identifies which field or list of fields in the alarm notification contains the primary key (PK) into the Alarm Dictionary for this interface; i.e. which field contains the Alarm Definition ID.
+      AlarmAdditionalFields []AttributeValuePair `json:"alarmAdditionalFields,omitempty"` // M, 0..N, List of metadata key-value pairs used to associate meaningful metadata to the related resource type.
+  }
+  ```
+
+- ProbableCause
+
+  ProbableCause is a subset of the data present in AlarmDefinition.
+
+  ```go
+  // 2.1.3.3
+  type ProbableCause struct {
+      ProbableCauseID uuid.UUID `json:"probableCauseId"` // M, Identifier of the ProbableCause.
+      Name string `json:"name"` // M, Human readable text of the probable cause.
+      Description string `json:"description"` // M, Any additional information beyond the name to describe the probableCause
+  }
+  ```
+
+- AlarmEventRecord
+
+  AlarmEventRecord is how we represent an alert that is firing or resolved. An alert coming from Alertmanager maps 1:1 to an instance of AlarmEventRecord.
+
+  ```go
+  type AlarmEventRecord struct {
+      AlarmEventRecordID uuid.UUID `json:"alarmEventRecordId"` // M, Identifier of an entry in the AlarmEventRecord. Locally unique within the scope of an O-Cloud instance.
+      ResourceID uuid.UUID `json:"resourceId"` // M, A reference to the resource instance which caused the alarm.
+      AlarmDefinitionID uuid.UUID `json:"alarmDefinitionId"` // M, A reference to the Alarm Definition record in the Alarm Dictionary associated with the referenced Resource Type.
+      ProbableCauseID uuid.UUID `json:"probableCauseId"` // M, A reference to the ProbableCause of the Alarm.
+      AlarmRaisedTime time.Time `json:"alarmRaisedTime"` // M, This field is populated with a Date/Time stamp value when the AlarmEventRecord is created.
+      AlarmChangedTime *time.Time `json:"alarmChangedTime,omitempty"` // M, This field is populated with a Date/Time stamp value when any value of the AlarmEventRecord is modified.
+      AlarmClearedTime *time.Time `json:"alarmClearedTime,omitempty"` // M, This field is populated with a Date/Time stamp value when the alarm condition is cleared.
+      AlarmAcknowledgedTime *time.Time `json:"alarmAcknowledgedTime,omitempty"` // M, This field is populated with a Date/Time stamp value when the alarm condition is acknowledged.
+      AlarmAcknowledged bool `json:"alarmAcknowledged"` // M, This is a Boolean value defaulted to FALSE. When a system acknowledges an alarm, it is then set to TRUE.
+      PerceivedSeverity PerceivedSeverity `json:"perceivedSeverity"` // M, This is an enumerated set of values which identify the perceived severity of the alarm.
+      Extensions []KeyValue `json:"extensions"` // M, These are unspecified (not standardized) properties (keys) which are tailored by the vendor or operator to extend the information provided about the O-Cloud Alarm.
+  }
+  ```
+
+- AlarmSubscriptionInfo
+
+  This stores information about a subscriber that needs to be notified when an Alarm is raised.
+
+  ```go
+  // 3.3.6.2.3
+  type AlarmSubscriptionInfo struct {
+      SubscriptionID uuid.UUID `json:"subscriptionID"` // M, Identifier of the subscription. Locally unique within the scope of an O-Cloud instance.
+      ConsumerSubscriptionID uuid.UUID `json:"consumerSubscriptionId"` // O, The consumer may provide its identifier for tracking, routing, or identifying the subscription used to report the event.
+      Filter string `json:"filter"` // O, Criteria for events which do not need to be reported or will be filtered by the subscription notification service. Therefore, if a filter is not provided then all events are reported.
+      Callback string `json:"callback"` // M, The fully qualified URI to a consumer procedure which can process a Post of the AlarmEventNotification.
+  }
+  ```
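+
+The enumerated types referenced in the structs above (`ManagementInterfaceID`, `AlarmLastChangeType`, `ClearingType`, `PerceivedSeverity`) are not expanded in this document. A minimal sketch of how they could be declared, with value sets mirroring the Postgres ENUMs in the [schema](#schema) below:
+
+```go
+// Sketch only: value sets mirror the Postgres ENUMs defined in the schema
+// section; the o-ran spec may mandate different encodings.
+type ManagementInterfaceID string
+
+const (
+	ManagementInterfaceO1     ManagementInterfaceID = "O1"
+	ManagementInterfaceO2DMS  ManagementInterfaceID = "O2DMS"
+	ManagementInterfaceO2IMS  ManagementInterfaceID = "O2IMS"
+	ManagementInterfaceOpenFH ManagementInterfaceID = "OpenFH"
+)
+
+type AlarmLastChangeType string
+
+const (
+	AlarmLastChangeAdded    AlarmLastChangeType = "added"
+	AlarmLastChangeDeleted  AlarmLastChangeType = "deleted"
+	AlarmLastChangeModified AlarmLastChangeType = "modified"
+)
+
+type ClearingType string
+
+const (
+	ClearingTypeAutomatic ClearingType = "automatic"
+	ClearingTypeManual    ClearingType = "manual"
+)
+
+type PerceivedSeverity string
+
+const (
+	PerceivedSeverityCritical      PerceivedSeverity = "CRITICAL"
+	PerceivedSeverityMajor         PerceivedSeverity = "MAJOR"
+	PerceivedSeverityMinor         PerceivedSeverity = "MINOR"
+	PerceivedSeverityWarning       PerceivedSeverity = "WARNING"
+	PerceivedSeverityIndeterminate PerceivedSeverity = "INDETERMINATE"
+	PerceivedSeverityCleared       PerceivedSeverity = "CLEARED"
+)
+```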
+
+## Infrastructure Monitoring Service Alarms API
+| **Endpoint** | **HTTP Method** | **Description** | **Input Payload** | **Returned Data** |
+|---|---|---|---|---|
+| `/O2ims_infrastructureMonitoring/{apiVersion}/alarms` | GET | Retrieve the list of alarms. | Optional query parameter `filter` | A list of `AlarmEventRecord` |
+| `/O2ims_infrastructureMonitoring/{apiVersion}/alarms/{alarmEventRecordId}` | GET | Retrieve exactly one alarm identified by `alarmEventRecordId`. | None | Exactly one `AlarmEventRecord` |
+| `/O2ims_infrastructureMonitoring/{apiVersion}/alarms/{alarmEventRecordId}` | PATCH | Modify exactly one alarm identified by `alarmEventRecordId` to acknowledge it. | `AlarmEventRecordModifications` (no `perceivedSeverity`, only `alarmAcknowledged`) | `AlarmEventRecordModifications` |
+| `/O2ims_infrastructureMonitoring/{apiVersion}/alarmSubscriptions` | GET | Retrieve the list of alarm subscriptions. | Optional query parameter `filter` | A list of `AlarmSubscriptionInfo` |
+| `/O2ims_infrastructureMonitoring/{apiVersion}/alarmSubscriptions` | POST | Create a new alarm subscription. | `AlarmSubscriptionInfo` | Exactly one `AlarmSubscriptionInfo` |
+| `/O2ims_infrastructureMonitoring/{apiVersion}/alarmSubscriptions/{alarmSubscriptionId}` | GET | Retrieve exactly one subscription using `alarmSubscriptionId`. | None | Exactly one `AlarmSubscriptionInfo` |
+| `/O2ims_infrastructureMonitoring/{apiVersion}/alarmSubscriptions/{alarmSubscriptionId}` | DELETE | Delete exactly one subscription using `alarmSubscriptionId`. | None | None |
+| `/O2ims_infrastructureMonitoring/{apiVersion}/probableCause` | GET | Retrieve all probable causes. | None | A list of `ProbableCause` |
+| `/O2ims_infrastructureMonitoring/{apiVersion}/probableCause/{probableCauseId}` | GET | Retrieve exactly one probable cause using `probableCauseId`. | None | Exactly one `ProbableCause` |
+
+| **Internal Endpoint** | **HTTP Method** | **Description** | **Input Payload** | **Returned Data** |
+|---|---|---|---|---|
+| `/internal/v1/caas-alerts/alertmanager` | POST | Alertmanager notifications come through here | https://prometheus.io/docs/alerting/latest/configuration/#webhook_config | None |
+| `/internal/v1/hardware-alerts/{hw-vendor-name}` | POST | TBD | TBD | TBD |
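+
+The `AlarmEventRecordModifications` payload used by the PATCH endpoint is not expanded elsewhere in this document; a sketch of what it could look like, with the field set assumed from the table notes above (pointer fields distinguish "not provided" from zero values):
+
+```go
+// Sketch only: PATCH body for acknowledging an alarm. Per the endpoint table,
+// only alarmAcknowledged is honoured for now; perceivedSeverity is listed for
+// completeness but not accepted.
+type AlarmEventRecordModifications struct {
+	AlarmAcknowledged *bool              `json:"alarmAcknowledged,omitempty"`
+	PerceivedSeverity *PerceivedSeverity `json:"perceivedSeverity,omitempty"`
+}
+```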
+
+### `alarms` family
+
+#### Steps for `/O2ims_infrastructureMonitoring/{apiVersion}/alarms` with GET
+1. Get all alarms from the `alarm_event_record` table (optionally using the `?filter` param values)
+2. Respond with the retrieved list of `AlarmEventRecord` and an appropriate code
+
+#### Steps for `/O2ims_infrastructureMonitoring/{apiVersion}/alarms/{alarmEventRecordId}` with GET
+1. Client calls with an AlarmEventRecordID
+2. Search both `alarm_event_record` and `alarm_event_record_archive` with the `AlarmEventRecordID`
+3. Respond with the retrieved instance of `AlarmEventRecord` and an appropriate code
+
+#### Steps for `/O2ims_infrastructureMonitoring/{apiVersion}/alarms/{alarmEventRecordId}` with PATCH
+1. Client calls with an AlarmEventRecordID and `AlarmEventRecordModifications` as the patch payload
+2. If `AlarmEventRecordModifications.alarmAcknowledged` is true, update the `alarm_event_record` table
+3. Respond with `AlarmEventRecordModifications` and an appropriate code
+
+### `alarmSubscriptions` family
+
+#### Steps for `/O2ims_infrastructureMonitoring/{apiVersion}/alarmSubscriptions` with GET
+1. Query the `alarm_subscription_info` storage (optionally using the `?filter` param values)
+2. Respond with a list of `AlarmSubscriptionInfo` and an appropriate code
+
+#### Steps for `/O2ims_infrastructureMonitoring/{apiVersion}/alarmSubscriptions` with POST
+1. Client calls with an `AlarmSubscriptionInfo` payload
+2. Validate the filter (e.g. check that the referenced columns actually exist)
+3. Insert into `alarm_subscription_info`; for now we limit subscriptions to 5 (if there are already 5, return an error)
+4. Respond with `AlarmSubscriptionInfo` and an appropriate code
+
+#### Steps for `/O2ims_infrastructureMonitoring/{apiVersion}/alarmSubscriptions/{alarmSubscriptionId}` with GET
+1. Client calls with an `alarmSubscriptionId` (update to the new standard)
+2. Query the `alarm_subscription_info` table using `alarmSubscriptionId`
+3. Respond with the retrieved instance of `AlarmSubscriptionInfo` and an appropriate code
+
+#### Steps for `/O2ims_infrastructureMonitoring/{apiVersion}/alarmSubscriptions/{alarmSubscriptionId}` with DELETE
+1. Client calls with an `alarmSubscriptionId`
+2. Delete the row in `alarm_subscription_info` using `alarmSubscriptionId`
+3. No special response (only an appropriate code)
+
+### `probableCause` family
+
+#### Steps for `/O2ims_infrastructureMonitoring/{apiVersion}/probableCause` with GET
+1. Query the `probable_causes` and `alarm_definitions` tables to fetch the probableCause ID, alarm name and alarm description
+2. Respond with a list of `ProbableCause` and an appropriate code
+
+#### Steps for `/O2ims_infrastructureMonitoring/{apiVersion}/probableCause/{probableCauseId}` with GET
+1. Query the `probable_causes` and `alarm_definitions` tables using `probableCauseId` to fetch the probableCause ID (should be the same as the `probableCauseId` from the input), alarm name and alarm description
+2. Respond with the retrieved instance of `ProbableCause` and an appropriate code
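+
+Tying the PATCH steps above to the Go 1.22 `net/http` routing mentioned in [Tooling](#tooling-and-general-dev-guidelines), a minimal handler sketch. `ackAlarm` is a hypothetical storage helper, and `AlarmEventRecordModifications` is the sketch type from the API section above:
+
+```go
+package main
+
+import (
+	"context"
+	"encoding/json"
+	"net/http"
+
+	"github.com/google/uuid"
+)
+
+// ackAlarm is a hypothetical storage helper; it would UPDATE the matching
+// alarm_event_record row, setting alarm_acknowledged and alarm_acknowledged_time.
+func ackAlarm(ctx context.Context, id uuid.UUID) error { return nil }
+
+func patchAlarm(w http.ResponseWriter, r *http.Request) {
+	// Go 1.22 path wildcards are read back with PathValue.
+	id, err := uuid.Parse(r.PathValue("alarmEventRecordId"))
+	if err != nil {
+		http.Error(w, "invalid alarmEventRecordId", http.StatusBadRequest)
+		return
+	}
+	var mods AlarmEventRecordModifications
+	if err := json.NewDecoder(r.Body).Decode(&mods); err != nil {
+		http.Error(w, err.Error(), http.StatusBadRequest)
+		return
+	}
+	if mods.AlarmAcknowledged != nil && *mods.AlarmAcknowledged {
+		if err := ackAlarm(r.Context(), id); err != nil {
+			http.Error(w, err.Error(), http.StatusInternalServerError)
+			return
+		}
+	}
+	w.Header().Set("Content-Type", "application/json")
+	_ = json.NewEncoder(w).Encode(mods)
+}
+
+func main() {
+	mux := http.NewServeMux()
+	// Method + pattern routing is built in as of Go 1.22; no gorilla/mux needed.
+	mux.HandleFunc("PATCH /O2ims_infrastructureMonitoring/{apiVersion}/alarms/{alarmEventRecordId}", patchAlarm)
+	_ = http.ListenAndServe(":8080", mux)
+}
+```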
+
+### `internal` family
+
+NOTE: These APIs are not exposed to clients and are available only internally
+
+#### Steps for `/internal/v1/caas-alerts/alertmanager`
+This API is activated only after the ACM Alertmanager is configured so that it can call back into the alarm service
+
+A minimal configuration
+```yaml
+route:
+  receiver: webhook_receiver
+receivers:
+  - name: webhook_receiver
+    webhook_configs:
+      - url: "o-ran-inventory-api-alarms.kubernetes.svc/internal/v1/caas-alerts/alertmanager"
+        send_resolved: true
+```
+
+```shell
+oc -n open-cluster-management-observability create secret generic alertmanager-config --from-file=alertmanager.yaml --dry-run=client -o=yaml \
+  | oc -n open-cluster-management-observability replace -f -
+```
+
+1. Example payload
+   ```json
+   {
+     "receiver": "webhook_receiver",
+     "status": "firing",
+     "alerts": [
+       {
+         "status": "firing",
+         "labels": {
+           "alertname": "UpdateAvailable",
+           "channel": "stable-4.16",
+           "managed_cluster": "89070983-a62f-4dbe-9457-7e0c27832c63",
+           "namespace": "openshift-cluster-version",
+           "openshift_io_alert_source": "platform",
+           "prometheus": "openshift-monitoring/k8s",
+           "severity": "info",
+           "upstream": ""
+         },
+         "annotations": {
+           "description": "For more information refer to 'oc adm upgrade' or https://console-openshift-console.apps.cnfdf27.sno.telco5gran.eng.rdu2.redhat.com/settings/cluster/.",
+           "summary": "Your upstream update recommendation service recommends you update your cluster."
+         },
+         "startsAt": "2024-08-28T11:39:17.958Z",
+         "endsAt": "0001-01-01T00:00:00Z",
+         "generatorURL": "https://console-openshift-console.apps.cnfdf27.sno.telco5gran.eng.rdu2.redhat.com/monitoring/graph?g0.expr=sum+by+%28channel%2C+namespace%2C+upstream%29+%28cluster_version_available_updates%29+%3E+0&g0.tab=1",
+         "fingerprint": "91406cd113ad87e5"
+       }
+     ]
+   }
+   ```
+2. Sync the `alarm_event_record` table
+   - Update rows to "resolved" that are missing in the current payload (i.e. previously seen but somehow missed the "resolved" notification)
+     ```sql
+     UPDATE alarm_event_record
+     SET status = 'resolved', alarm_cleared_time = CURRENT_TIMESTAMP
+     WHERE (finger_print, alarm_raised_time) NOT IN (
+         ('Something-Thats-Only-Now-Available', '2023-09-01 10:02:00+00')
+     );
+     ```
+   - Create new AlarmEventRecords. An alert entry payload will require the Alarm Definition ID and Probable Cause ID, which can be retrieved as shown [here](#for-a-given-resourcetypeid-and-alarmname-coming-from-am-alert-find-the-alarmdefinitionid-and-probablecauseid)
+   - Upsert an AlarmEventRecord for all the "firing" and "resolved" alerts, as indicated by `alerts[].status`, i.e. a bulk insert + update operation
+     ```sql
+     -- ...INSERT, followed by...
+     ON CONFLICT ON CONSTRAINT unique_finger_print_alarm_raised_time DO UPDATE -- defined in schema
+     SET status = EXCLUDED.status,
+         alarm_cleared_time = EXCLUDED.alarm_cleared_time,
+         alarm_changed_time = EXCLUDED.alarm_changed_time;
+     ```
+3. Grab the subscriptions and send notifications.
+   - The database interaction is further explained [here](#notification-tracking)
+4. Move all the `status: resolved` rows from `alarm_event_record` to `alarm_event_record_archive`
+
+Eventually, data in `alarm_event_record_archive` will be cleared (hardcoded to 24hr) as seen [here](#daily-archive-cleanup)
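+
+To parse the payload above, the handler needs a Go representation of the Alertmanager webhook message; a sketch covering only the fields this flow uses (the full shape is documented at the webhook_config link in the internal endpoint table):
+
+```go
+package main
+
+import "time"
+
+// Sketch only: subset of the Alertmanager webhook payload consumed by the
+// /internal/v1/caas-alerts/alertmanager handler.
+type WebhookMessage struct {
+	Receiver string  `json:"receiver"`
+	Status   string  `json:"status"` // "firing" or "resolved"
+	Alerts   []Alert `json:"alerts"`
+}
+
+type Alert struct {
+	Status       string            `json:"status"`
+	Labels       map[string]string `json:"labels"`      // alertname, managed_cluster, severity, ...
+	Annotations  map[string]string `json:"annotations"` // summary, description, ...
+	StartsAt     time.Time         `json:"startsAt"`
+	EndsAt       time.Time         `json:"endsAt"`
+	GeneratorURL string            `json:"generatorURL"`
+	Fingerprint  string            `json:"fingerprint"` // stored as alarm_event_record.finger_print
+}
+```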
+
+## Schema
+We only take care of Alarms data, contained within a dedicated DB. This approach allows for
+- Decoupling: each microservice can independently manage its own data, allowing it to evolve its schema as needed
+- Scale: no cross-service dependencies
+
+Each table is modeled after the o-ran data structures. The DB in our case may be called `o-ran-infrastructure-monitoring-alarms`
+
+The init SQL may look like the following:
+```sql
+CREATE DATABASE "o-ran-infrastructure-monitoring-alarms";
+
+-- ENUM for ManagementInterfaceID
+DROP TYPE IF EXISTS ManagementInterfaceID CASCADE;
+CREATE TYPE ManagementInterfaceID AS ENUM ('O1', 'O2DMS', 'O2IMS', 'OpenFH');
+
+-- ENUM for AlarmLastChangeType
+DROP TYPE IF EXISTS AlarmLastChangeType CASCADE;
+CREATE TYPE AlarmLastChangeType AS ENUM ('added', 'deleted', 'modified');
+
+-- ENUM for ClearingType
+DROP TYPE IF EXISTS ClearingType CASCADE;
+CREATE TYPE ClearingType AS ENUM ('automatic', 'manual');
+
+-- Table: versions, links alarm_dictionary and alarm_definitions using version.
+DROP TABLE IF EXISTS versions CASCADE;
+CREATE TABLE versions (
+    version_number VARCHAR(50) PRIMARY KEY
+);
+
+-- Table: alarm_dictionary
+DROP TABLE IF EXISTS alarm_dictionary CASCADE;
+CREATE TABLE alarm_dictionary (
+    resource_type_id UUID PRIMARY KEY,
+    alarm_dictionary_version VARCHAR(50) NOT NULL, -- Links alarm_dictionary and alarm_definitions
+    alarm_dictionary_schema_version VARCHAR(50) DEFAULT 'TBD-O-RAN-DEFINED' NOT NULL,
+    entity_type VARCHAR(255) NOT NULL,
+    vendor VARCHAR(255) NOT NULL,
+    management_interface_id ManagementInterfaceID[] DEFAULT ARRAY['O2IMS']::ManagementInterfaceID[],
+    pk_notification_field TEXT[] DEFAULT ARRAY['resource_type_id']::TEXT[],
+    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
+    CONSTRAINT fk_alarm_dictionary_version FOREIGN KEY (alarm_dictionary_version) REFERENCES versions(version_number)
+);
+
+-- Table: alarm_definitions
+DROP TABLE IF EXISTS alarm_definitions CASCADE;
+CREATE TABLE alarm_definitions (
+    alarm_definition_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    alarm_name VARCHAR(255) NOT NULL,
+    alarm_last_change VARCHAR(50) NOT NULL,
+    alarm_description TEXT NOT NULL,
+    proposed_repair_actions TEXT NOT NULL,
+    alarm_dictionary_version VARCHAR(50) NOT NULL, -- Links alarm_dictionary and alarm_definitions
+    alarm_additional_fields JSONB,
+    alarm_change_type AlarmLastChangeType DEFAULT 'added' NOT NULL,
+    clearing_type ClearingType DEFAULT 'automatic' NOT NULL,
+    management_interface_id ManagementInterfaceID[] DEFAULT ARRAY['O2IMS']::ManagementInterfaceID[],
+    pk_notification_field TEXT[] DEFAULT ARRAY['alarm_definition_id']::TEXT[],
+    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
+    CONSTRAINT fk_alarm_definitions_version FOREIGN KEY (alarm_dictionary_version) REFERENCES versions(version_number),
+    CONSTRAINT unique_alarm_name_last_change UNIQUE (alarm_name, alarm_last_change)
+);
+
+-- Table: probable_causes
+DROP TABLE IF EXISTS probable_causes CASCADE;
+CREATE TABLE probable_causes (
+    probable_cause_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    alarm_definition_id UUID UNIQUE,
+    FOREIGN KEY (alarm_definition_id) REFERENCES alarm_definitions(alarm_definition_id) ON DELETE CASCADE
+);
+
+-- Create a new probable cause entry for each new alarm_definition_id
+CREATE OR REPLACE FUNCTION insert_probable_cause()
+    RETURNS TRIGGER AS $$
+BEGIN
+    -- Insert a new row into probable_causes with the alarm_definition_id from the new alarm_definition
+    INSERT INTO probable_causes (alarm_definition_id)
+    VALUES (NEW.alarm_definition_id);
+
+    RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trg_insert_probable_cause
+    AFTER INSERT ON alarm_definitions
+    FOR EACH ROW
+EXECUTE FUNCTION insert_probable_cause();
+
+-- Table: alarm_subscription_info
+DROP TABLE IF EXISTS alarm_subscription_info CASCADE;
+CREATE TABLE alarm_subscription_info (
+    subscription_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    consumer_subscription_id UUID,
+    filter TEXT,
+    callback TEXT NOT NULL,
+    largest_number_alarm_event_seen_so_far BIGINT NOT NULL DEFAULT 0,
+    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
+);
+
+DROP TYPE IF EXISTS Status CASCADE;
+CREATE TYPE Status AS ENUM ('resolved', 'firing');
+DROP TYPE IF EXISTS PerceivedSeverity CASCADE;
+CREATE TYPE PerceivedSeverity AS ENUM ('CRITICAL', 'MAJOR', 'MINOR', 'WARNING', 'INDETERMINATE', 'CLEARED');
+
+-- SEQUENCE: counter to keep track of the latest events and use it to notify only about the latest
+CREATE SEQUENCE alarm_sequence_seq
+    START WITH 1
+    INCREMENT BY 1
+    NO MINVALUE
+    NO MAXVALUE
+    CACHE 1;
+
+-- Table: alarm_event_record
+DROP TABLE IF EXISTS alarm_event_record CASCADE;
+CREATE TABLE alarm_event_record (
+    alarm_event_record_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
+    alarm_definition_id UUID NOT NULL,
+    probable_cause_id UUID,
+    status Status DEFAULT 'firing' NOT NULL,
+    alarm_raised_time TIMESTAMPTZ NOT NULL,
+    alarm_changed_time TIMESTAMPTZ,
+    alarm_cleared_time TIMESTAMPTZ,
+    alarm_acknowledged_time TIMESTAMPTZ,
+    alarm_acknowledged BOOLEAN NOT NULL DEFAULT FALSE,
+    perceived_severity PerceivedSeverity NOT NULL,
+    extensions JSONB,
+    created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP,
+    finger_print TEXT NOT NULL,
+    alarm_sequence_number BIGINT DEFAULT nextval('alarm_sequence_seq'),
+    resource_id UUID NOT NULL,
+    resource_type_id UUID NOT NULL,
+    CONSTRAINT fk_resource_type FOREIGN KEY (resource_type_id) REFERENCES alarm_dictionary (resource_type_id),
+    CONSTRAINT unique_finger_print_alarm_raised_time UNIQUE (finger_print, alarm_raised_time)
+);
+
+-- Ownership of alarm_sequence_seq
+ALTER SEQUENCE alarm_sequence_seq OWNED BY alarm_event_record.alarm_sequence_number;
+
+-- Update the sequence if resolved or alarm_changed_time changed
+CREATE OR REPLACE FUNCTION set_alarm_sequence_on_update()
+    RETURNS TRIGGER AS $$
+BEGIN
+    IF (NEW.status = 'resolved' AND OLD.status IS DISTINCT FROM 'resolved')
+        OR (NEW.alarm_changed_time IS DISTINCT FROM OLD.alarm_changed_time) THEN
+        NEW.alarm_sequence_number := nextval('alarm_sequence_seq');
+    END IF;
+    RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+-- Attach the trigger to alarm_event_record
+CREATE TRIGGER trg_alarm_sequence_update
+    BEFORE UPDATE ON alarm_event_record
+    FOR EACH ROW
+EXECUTE FUNCTION set_alarm_sequence_on_update();
+
+-- Table: alarm_event_record_archive is identical to alarm_event_record. Use this to eventually store events that are considered historical (i.e. status: resolved)
+DROP TABLE IF EXISTS alarm_event_record_archive CASCADE;
+CREATE TABLE alarm_event_record_archive
+(LIKE alarm_event_record INCLUDING ALL);
+```
+
+Please note that the script above was used for testing and needs some updates for production, e.g. use `CREATE TABLE IF NOT EXISTS` and remove the `DROP TABLE IF EXISTS` statements.
+Apply it during deployment as specified [here](#k8s-resources) via a migration tool as specified [here](#tooling-and-general-dev-guidelines).
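+
+The migration flow could look like the following sketch, assuming [golang-migrate](https://github.com/golang-migrate/migrate) with the schema split into versioned files under a `migrations/` directory (the directory name and DSN are illustrative):
+
+```go
+package main
+
+import (
+	"embed"
+	"errors"
+	"log"
+
+	"github.com/golang-migrate/migrate/v4"
+	_ "github.com/golang-migrate/migrate/v4/database/postgres"
+	"github.com/golang-migrate/migrate/v4/source/iofs"
+)
+
+//go:embed migrations/*.sql
+var migrations embed.FS
+
+func main() {
+	src, err := iofs.New(migrations, "migrations")
+	if err != nil {
+		log.Fatal(err)
+	}
+	// DSN is illustrative; real values come from the Secrets/ConfigMap
+	// described in the K8s resources section.
+	m, err := migrate.NewWithSourceInstance("iofs", src,
+		"postgres://user:pass@postgres:5432/alarms?sslmode=disable")
+	if err != nil {
+		log.Fatal(err)
+	}
+	// Apply all pending migrations; ErrNoChange means the schema is current.
+	if err := m.Up(); err != nil && !errors.Is(err, migrate.ErrNoChange) {
+		log.Fatal(err)
+	}
+}
+```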
+
+## Init
+Notes:
+- Interacting with the following tables: `versions`, `alarm_dictionary` and `alarm_definitions`
+- `alarm_dictionary` has the primary key `resource_type_id`, ensuring that each ResourceType has a unique AlarmDictionary.
+- There's a one-to-many relation between `alarm_dictionary` and `alarm_definitions` via the major.minor OCP version. Referential integrity and version normalization are handled through the `versions` table
+
+```sql
+INSERT INTO versions (version_number)
+VALUES
+    ('4.16'),
+    ('4.17');
+
+-- Most of the data in this table will come from k8s calls for now
+INSERT INTO alarm_dictionary (alarm_dictionary_version, resource_type_id, entity_type, vendor)
+VALUES
+    ('4.16', 'b3e7149e-d471-4d0f-aaa6-d5e9aa9e713a', 'telco-model-OpenShift-4.16.2', 'Red Hat'),
+    ('4.16', '481688c8-2782-4534-a9de-88ca5154411d', 'telco-model-OpenShift-4.16.8', 'Red Hat'),
+    ('4.17', 'f8b9e100-fd9f-4923-b96f-89418e9c2560', 'telco-model-OpenShift-4.17.2', 'Red Hat');
+
+-- Insert into alarm_definitions (alarm_dictionary_version is required by the schema's NOT NULL and FK constraints)
+INSERT INTO alarm_definitions (alarm_name, alarm_last_change, alarm_description, proposed_repair_actions, alarm_dictionary_version, alarm_additional_fields)
+VALUES
+    ('NodeClockNotSynchronising', '4.16', 'Clock not synchronising.', 'Clock at {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.', '4.16', '{"CustomKey": "CustomValue"}'),
+    ('LowMemory', '4.16', 'Low memory.', 'Low memory, with a longer description to help user fix the issue', '4.16', '{"CustomKey2": "CustomValue"}'),
+    ('NodeClockNotSynchronising', '4.17', 'Clock not synchronising.', 'Clock at {{ $labels.instance }} is not synchronising. Ensure NTP2 is configured on this host.', '4.17', '{"CustomKey3": "CustomValue"}');
+
+-- probable_causes rows are auto-populated by the schema trigger
+```
+Notes on the Init phase
+- The `versions` table reflects the unique `major.minor` versions of the `ManagedClusters` currently deployed.
+  To get the available `major.minor` versions, we can list the `ManagedCluster` CRs and look for the label `openshiftVersion-major-minor`.
+  ```shell
+  oc get managedclusters
+  ```
+  ```yaml
+  apiVersion: cluster.open-cluster-management.io/v1
+  kind: ManagedCluster
+  metadata:
+    labels:
+      openshiftVersion-major-minor: "4.16"
+  ```
+
+- `alarm_dictionary` essentially links `ResourceTypeID` and `Version`. The conversion can be seen below.
+
+  | managed_cluster<br>`from alerts` | resourceID<br>`same as managed_cluster` | resourceTypeID for caas<br>`derived from (resourceID + ResourceKindLogical + ResourceClassCompute)` |
+  |----------------------------------|------------------------------------------|------------------------------------------------------------------------------------------------------|
+  | f90561e2-6420-4924-b081-f4f8eaf50618 | f90561e2-6420-4924-b081-f4f8eaf50618 | 4586f964-6c6f-407b-9b18-cb3c9a712ec4 |
+
+  `entity_type` data should be coming from the Inventory API, but for now we can hard-code it with the `telco-model-OpenShift-` prefix.
+- `alarm_definitions` reflects the rules in the `PrometheusRule` CRs. We only grab the full set once per unique entry in the `versions` table
+  ```yaml
+  # Partial PrometheusRule to show NodeClockNotSynchronising
+  apiVersion: monitoring.coreos.com/v1
+  kind: PrometheusRule
+  metadata:
+    name: node-exporter-rules
+    namespace: openshift-monitoring
+  spec:
+    groups:
+      - name: node-exporter
+        rules:
+          - alert: NodeClockNotSynchronising #### (alarm_definitions.alarm_name)
+            annotations:
+              description: Clock at {{ $labels.instance }} is not synchronising. Ensure NTP is configured on this host.
+              runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/NodeClockNotSynchronising.md #### (alarm_definitions.proposed_repair_actions)
+              summary: Clock not synchronising. #### (alarm_definitions.alarm_description)
+            expr: |
+              min_over_time(node_timex_sync_status{job="node-exporter"}[5m]) == 0
+              and
+              node_timex_maxerror_seconds{job="node-exporter"} >= 16
+            for: 10m
+            labels:
+              severity: critical #### (alarm_definitions.extensions)
+  ```
+  - Use ACM to get the credentials of the unique major.minor clusters and retrieve all the PrometheusRules from them to parse.
+    E.g. if we are managing 3 clusters at 4.16.2, 4.17.2 and 4.16.8, pick 4.16.8 and 4.17.2, which effectively represent all the rules in 4.16.z and 4.17.z clusters.
+- Build out an in-memory mapping between cluster ID, resource type ID and resource ID as needed for quick lookup during runtime.
+
+## Ready
+
+### For a given ResourceTypeID and AlarmName (coming from AM alert), find the AlarmDefinitionID and ProbableCauseID
+- Find the resourceType ID and alert name from the current Alertmanager payload
+  - Get the managedcluster-id and alert name from the alertmanager alerts (e.g. NodeClockNotSynchronising)
+  - Ask inventory for the ResourceType ID using the managedcluster-id (e.g. b3e7149e-d471-4d0f-aaa6-d5e9aa9e713a)
+  ```go
+  // should be something we compute (see the v5UUID sketch after this section)
+  func getResourceTypeID(managedClusterId uuid.UUID, class ResourceClass, resourceType ResourceType) uuid.UUID {
+      return v5UUID(fmt.Sprintf("%s-%s-%s", managedClusterId.String(), class, resourceType))
+  }
+  ```
+- Run the following. `resource_type_id` should be enough to uniquely identify the corresponding alarm dictionary
+  and its associated alarm definitions; finally, use `alarm_name` to filter out the exact rule.
+  ```sql
+  SELECT
+      ad.alarm_definition_id,
+      pc.probable_cause_id
+  FROM
+      alarm_definitions ad
+  JOIN
+      alarm_dictionary adict
+      ON ad.alarm_dictionary_version = adict.alarm_dictionary_version
+  JOIN
+      probable_causes pc
+      ON ad.alarm_definition_id = pc.alarm_definition_id
+  WHERE
+      adict.resource_type_id = '481688c8-2782-4534-a9de-88ca5154411d'
+      AND ad.alarm_name = 'LowMemory';
+  ```
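+
+The `v5UUID` helper referenced in the sketch above is not defined in this document; a minimal version, assuming `github.com/google/uuid` and a placeholder namespace:
+
+```go
+package main
+
+import "github.com/google/uuid"
+
+// namespaceOCloud is a placeholder namespace; any fixed UUID works as long as
+// every component derives IDs with the same one.
+var namespaceOCloud = uuid.NameSpaceDNS
+
+// v5UUID deterministically derives a version-5 (SHA-1 based) UUID from name,
+// so the same managed cluster + class + type always maps to the same
+// ResourceTypeID.
+func v5UUID(name string) uuid.UUID {
+	return uuid.NewSHA1(namespaceOCloud, []byte(name))
+}
+```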
+
+### Notification tracking
+- Collect all subscription info, including ID, callback and filter
+- For each subscription, collect all the `AlarmEventRecord` rows based on the sequence number and optionally the filter.
+  Here we are collecting everything that is "CRITICAL" (note: the schema column is `subscription_id`)
+  ```sql
+  SELECT aer.*
+  FROM alarm_event_record aer
+  JOIN alarm_subscription_info asi ON asi.subscription_id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11'
+  WHERE aer.alarm_sequence_number > asi.largest_number_alarm_event_seen_so_far
+    AND aer.perceived_severity = 'CRITICAL'
+  ORDER BY aer.alarm_sequence_number;
+  ```
+- Process and notify by deriving the `AlarmEventRecordModifications` o-ran DS and POSTing it to the callback.
+- Update the sequence for the subscription, indicating the latest event sent so far
+  ```go
+  var largestProcessedSequenceNumber int64
+  // for each alarm, update the largest sequence number we've processed
+  largestProcessedSequenceNumber = max(alarm.SequenceNumber, largestProcessedSequenceNumber)
+  // The last row in the query is always the largest (ORDER BY defaults to ascending)
+  ```
+  And finally update
+  ```sql
+  UPDATE alarm_subscription_info
+  SET largest_number_alarm_event_seen_so_far = $largestProcessedSequenceNumber
+  WHERE subscription_id = 'a2eebc99-9c0b-4ef8-bb6d-6bb9bd380a13';
+  ```
+- NOTE: `alarm_sequence_number` is handled automatically inside the DB. When the sequence increments, the subscriber is notified.
+  See the notification conditions [here](#conditions-for-notifying-subscriber)
+
+### Conditions for Notifying subscriber
+- When an alarm is `firing` for the first time
+- When an alarm is updated to `resolved`
+- When `alarm_changed_time` changes (could be multiple times)
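+
+Putting the notification pieces above together, a sketch of the dispatch loop. The payload shape is a placeholder (the exact o-ran notification structure is TBD here), and `updateHighWaterMark` is a hypothetical helper that runs the UPDATE shown above:
+
+```go
+package main
+
+import (
+	"bytes"
+	"context"
+	"encoding/json"
+	"net/http"
+
+	"github.com/google/uuid"
+)
+
+// updateHighWaterMark is a hypothetical helper that runs the
+// largest_number_alarm_event_seen_so_far UPDATE shown above.
+func updateHighWaterMark(ctx context.Context, id uuid.UUID, largest int64) error { return nil }
+
+// event pairs the record payload with the row's alarm_sequence_number.
+type event struct {
+	Record   AlarmEventRecord
+	Sequence int64
+}
+
+// notifySubscriber POSTs each selected event to the subscription callback and
+// then advances the subscription's high-water mark.
+func notifySubscriber(ctx context.Context, client *http.Client, sub AlarmSubscriptionInfo, events []event) error {
+	var largest int64
+	for _, ev := range events {
+		// Placeholder notification payload; consumerSubscriptionId lets the
+		// consumer route the event, per the AlarmSubscriptionInfo definition.
+		body, err := json.Marshal(map[string]any{
+			"consumerSubscriptionId": sub.ConsumerSubscriptionID,
+			"alarmEventRecord":       ev.Record,
+		})
+		if err != nil {
+			return err
+		}
+		req, err := http.NewRequestWithContext(ctx, http.MethodPost, sub.Callback, bytes.NewReader(body))
+		if err != nil {
+			return err
+		}
+		req.Header.Set("Content-Type", "application/json")
+		resp, err := client.Do(req)
+		if err != nil {
+			return err
+		}
+		resp.Body.Close()
+		largest = max(largest, ev.Sequence)
+	}
+	return updateHighWaterMark(ctx, sub.SubscriptionID, largest)
+}
+```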
+
+### Daily archive cleanup
+Run this using a k8s CronJob CR at the start of every hour
+
+```sql
+DELETE FROM alarm_event_record_archive
+WHERE alarm_cleared_time < NOW() - INTERVAL '24 hour' AND status = 'resolved';
+```
+We can apply the CR before the server starts and remove it during shutdown as part of teardown, e.g. inside `server.RegisterOnShutdown`
+
+## K8s resources
+We will need a few K8s resources that will eventually be applied by the Operator.
+
+#### Alarm server
+This is essentially a typical CRUD app, and we need the following
+
+- Deployment: It should have one initContainer that runs to completion to perform the DB migration, and one main container which starts the main server.
+  No HA, so set replicas to 1. It should also contain all the ENV variables needed to talk to the postgres deployment: DB_HOST, DB_PORT, DB_NAME, DB_USER etc.
+  Suitable resource requests should be provided, though a CRUD app like this does not need much memory or CPU.
+- Secrets and ConfigMap: DB creds and configs should be read from here. We can probably reuse the configs from Postgres (seen below) for user and password.
+- Service: Expose and balance using `ClusterIP` (though to start with we will set replicas to 1)
+- Ingress: Expose the service so that it can be called by users from outside the cluster
+
+#### Postgres
+This deployment can be leveraged by many microservices, each creating its own database.
+
+- Deployment: One replica should be good enough for our case.
+  A default username, password and dbname should be provided (note: this is simply to let the postgres pod initialize).
+  The deployment should also mount the right path `/var/lib/postgresql/data` (please check the latest docs).
+- PersistentVolumeClaim: At least a 20Gi PVC should be used. In testing, a DB with about 60k rows took roughly ~2GiB.
+- Service: ClusterIP should be good as it's only used within the cluster.
+- Secrets and Config: default creds needed to spin up postgres
+
+## Tooling and general dev guidelines
+- The HTTP server should be built with the latest Go 1.22 `net/http` std lib. The latest update to the package brings in
+  many requested features, including URI pattern matching in the router (see the handler sketch in the `alarms` family section). This allows us to drop the third-party lib `gorilla/mux`.
+- Prefer creating structs to hold HTTP data for idiomatic Go code.
+- The OpenAPI spec should be the source of truth. Beyond standardization, free validation and documentation,
+  we can leverage a code generator such as [oapi-codegen](https://github.com/oapi-codegen/oapi-codegen), allowing us to avoid writing boilerplate code.
+- For Postgres communication, use the [pgx](https://github.com/jackc/pgx) v5 library. This Go Postgres driver offers many
+  important features such as automatic type mapping and detailed error reporting (it also captures performance info).
+  There are also many ORM and SQL query builder libraries, but pgx looks like the best of both worlds.
+- DB migration is generally handled with a separate tool. [golang-migrate](https://github.com/golang-migrate/migrate) is commonly used for this, and we can call it during service init (see the sketch at the end of the Schema section).
+- A Cobra CLI should be used to have better control of the servers. Each microservice should own its verbs, allowing them to develop independently. E.g.
+  ```shell
+  oran-o2ims alarms -h
+  oran-o2ims alarms start -h
+  oran-o2ims alarms db-migration -h
+  ```
+
+## Future Updates
+Phase 2: CaaS + H/W Alarms
+- Add H/W alarms with phase 1 capabilities
+
+Phase 3: MNO Support
+- Phase 1 + 2 capabilities
+- Scale target: 2 MNO clusters with 7 nodes
+
+Phase 4: GA
+- Scale to 3500 SNO clusters per hub
+- Scale to `TBD` MNO clusters per hub
+- Increased number of subscriptions `TBD`
+- Support alarm suppression
+- Support configurable historical alarm retention period
+- HA `TBD`