-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhancement for alarms #215
base: main
Are you sure you want to change the base?
Conversation
Skipping CI for Draft Pull Request. |
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General question about DB serial number. It will increase with the entry of DB rows or increase with DB row update?
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/lifecycle.md
Outdated
Show resolved
Hide resolved
@Jennifer-chen-rh on entry. More on SERIAL datatypes https://www.postgresql.org/docs/current/datatype-numeric.html#DATATYPE-SERIAL. But we can have custom type which be incremented on insertion or update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pixelsoccupied aha, I felt something missing in PR. Now I figured out that the section we discussed about history table entry age out not here.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- the datastructures are still public
- Missing the the alarm notification event object structure
9e69f7c
to
ce256ad
Compare
ce256ad
to
e60ffbc
Compare
docs/enhancements/infrastructure-monitoring-service-api/alarms.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/alarms.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/alarms.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/alarms.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/alarms.md
Outdated
Show resolved
Hide resolved
docs/enhancements/infrastructure-monitoring-service-api/alarms.md
Outdated
Show resolved
Hide resolved
…rver and add david as reviewer
For a given OCP release, the alarmDefinitions and probableCauses are fixed, so these can be built up front. For CaaS alarms only one resource type, “NodeCluster”, all alarms map to it. | ||
|
||
1. Query all managed clusters to get list of unique major.minor versions | ||
- Need to monitor for major.minor new versions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that instead of relying of restarts, we could define a communication channel between the Resource server and the Alarm server. We could define some internal endpoints for the alarm dictionary here and the Resource server can send operations that follow the lifecycle of a ResourceType. If a ResourceType is created, a Post here can instruct the alarm server to follow the steps you mention below to add a new dictionary for this new ResourceType. It can also delete and update accordingly. This covers better what it is mentioned in 3.7.9 O-RAN USE-CASES
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's a great point! Touched on this briefly few weeks ago but for now we can start with this approach for now and maybe move this code resource server eventually? Basically someone has to get the managedCluster CR and read the content....and in this case it's us for now.
But yeah I agree....ultimately Resource server should watch/get cluster resources and we (alarms) should simply be notified about the change and update our db as needed.
And on the other hand resource server exposes endpoints that needs to return resourceType with its alarm dictionary right? Does that mean we need expose an additional internal GET endpoint for it to retrieve that (assuming we are not sharing the table)? probably a question for #262
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I'm assuming we share the same table. What I meant was to have internal endpoints in the Alarm server so that the Resource server can notify operations on ResourceTypes. For instance, if a new ResourceType is added, then they could send us a Post (with all needed information) for us to determine whether we need to add a new dictionary for this new resource type. If so, then we inspect the managed cluster and get the rules to add a new dictionary. Same for Delete and Update. The idea is to modify the DB accordingly. They still need to go to the DB whenever a query on a resource type is performed.
What I'm looking with this is to have a more dynamic way to update the dictionaries instead on relying only on restarts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated my PR with info related to our discussion with respect to the interactions between the two components.
`entity_type` data should be coming from Inventory API but for now we can hard-code it. `telco-model-OpenShift-<Full Version>` | ||
- `alarm_definitions` reflects Rules in PromRule CR. We only grab the full set based on unique entries in `Versions` table | ||
- Use ACM to get credentials of the unique major.minor clusters and retrieve all the PrometheusRules from them to parse. | ||
E.g if we are managing 3 clusters 4.16.2, 4.17.2 and 4.16.8, Pick 4.16.8 and 4.17.2 which effectively represents all the rules in 4.16.z and 4.17.z clusters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There exist the possibility that the cluster 4.16.2 has some custom rules (added manually). These won't be added to the dictionary but alertmanager will still receive them. We could still create an alarm record and send a notification for these, there will be some empty fields. What do you think of this?
### Jobs to Initialize Alarm Server DB | ||
We need two Jobs that can help with DB | ||
|
||
- One job that creates a Database using `CREATE DATBASE` cmd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: spelling "DATBASE"
-- Table: alarm_dictionary | ||
DROP TABLE IF EXISTS alarm_dictionary CASCADE; | ||
CREATE TABLE alarm_dictionary ( | ||
resource_type_id UUID PRIMARY KEY, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i realize that there should be only 1 tuple per resource type but we should still maintain a separate UUID for the "alarm_dictionary_id" to decouple those two tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressing all the new comments for DB in #294 (keeping everything here unresolved till)
management_interface_id ManagementInterfaceID[] DEFAULT ARRAY['O2IMS']::ManagementInterfaceID[], | ||
pk_notification_field TEXT[] DEFAULT ARRAY['alarm_definition_id']::TEXT[], | ||
created_at TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP, | ||
CONSTRAINT fk_alarm_definitions_version FOREIGN KEY (alarm_dictionary_version) REFERENCES versions(version_number), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method of linking these two tables seems a bit more complicated than necessary. I would prefer to see a column in this table for alarm_dictionary_id
and a FK constraint pointing directly to the other table since these are 1:N rather than M:N.
alarm_dictionary_schema_version VARCHAR(50) DEFAULT 'TBD-O-RAN-DEFINED' NOT NULL, | ||
entity_type VARCHAR(255) NOT NULL, | ||
vendor VARCHAR(255) NOT NULL, | ||
management_interface_id ManagementInterfaceID[] DEFAULT ARRAY['O2IMS']::ManagementInterfaceID[], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
management_interface_id
is defined as a list in the spec, but we will only ever produce rows with 'O2IMS' as a value so there is no need to store it as an array.
alarm_last_change VARCHAR(50) NOT NULL, | ||
alarm_description TEXT NOT NULL, | ||
proposed_repair_actions TEXT NOT NULL, | ||
alarm_dictionary_version VARCHAR(50) NOT NULL, -- Links alarm_dictionary and alarm_definitions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alarm_dictionary_version
should be replaced by alarm_dictionary_id
as mentioned below (see constraint)
alarm_additional_fields JSONB, | ||
alarm_change_type AlarmLastChangeType DEFAULT 'added' NOT NULL, | ||
clearing_type ClearingType DEFAULT 'automatic' NOT NULL, | ||
management_interface_id ManagementInterfaceID[] DEFAULT ARRAY['O2IMS']::ManagementInterfaceID[], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar comment to above. This will only ever be "O2IMS" so no need to store as array.
6. Apply the required CR to activate the internal endpoint for alertmanager notification. See [here](#steps-for-internalv1caas-alertsalertmanager) for more. | ||
|
||
7. The server should now be ready to take requests. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it is implicit above, but I think we should clearly say somewhere that we will assume rules of all resources of a specific resource type are the same. This means we won't be able to see added rules to a specific resource, unless this one is selected for inspection (not deterministic).
Also, inspired from #262, we could/should implement a sync mechanism for the alarm dictionary. This way we could overcome the limitation of not seeing newly added (or deleted) rules. In the documentation, we could say that once the system is up, rule modifications should be applied to all resources of a specific resource type and that the system will pick up these changes every X (i.e. 5 minutes) time.
@pixelsoccupied: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
This enhancement talks about re-architecting the Alarm server as specified in
InfrastructureMonitoring Service API
o-ran spec.Notable changes include:
This enhancement includes all the code tested during spike, the k8s resources needed to deploy through operator and other libraries/tool that can be used to quickly develop this.
co-authored with @browsell and @Jennifer-chen-rh