Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Cluster API deployment method for TAS #108

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

criscola
Copy link

This PR adds a new deployment method for the Telemetry Aware Scheduler using Cluster API.

It would be great if someone could try it out and let me know if there's any issue/suggestion to improve it. The deployment was tested on Kubernetes v1.25 with GCP provider.

@criscola
Copy link
Author

criscola commented Nov 4, 2022

I've seen that in the meantime you uploaded a CONTRIBUTORS.md. It asks to sign-off using our real name, but I signed-off using my pseudonym "criscola". Should I amend/resubmit the PR?

@madalazar
Copy link
Contributor

madalazar commented Nov 4, 2022

Hi!
If it's not too much of a bother, I would prefer if you were to amend the commits to use your full name and then just update the PR.
Thank you!

@criscola
Copy link
Author

FYI I stumbled upon CAPD, or CAPI in Docker. It's different because it uses the new ClusterClass resource, so requires a few different steps. Since I'm tasked to deploy it locally, I will also submit my findings here. This is nice because it doesn't require users to spin up instances in providers (and spend money), so it becomes more accessible to deploy the TAS for local testing.

@criscola criscola force-pushed the feature/cluster-api branch 2 times, most recently from 903efc3 to 63ea8d3 Compare November 18, 2022 11:16
criscola and others added 3 commits November 18, 2022 14:06
Avoids failures when applying the TAS Service Account through ClusterResourceSet. Relevant error message was: "failed to create object /v1, Kind=ServiceAccount /telemetry-aware-scheduling-service-account: an empty namespace may not be set during creation".

Signed-off-by: Cristiano Colangelo <git@criscola.dev>
Signed-off-by: Cristiano Colangelo <git@criscola.dev>
Signed-off-by: Madalina Lazar <madalina.lazar@intel.com>
@criscola criscola force-pushed the feature/cluster-api branch from a15fd67 to 4d7d6df Compare November 18, 2022 13:06
@criscola
Copy link
Author

criscola commented Nov 18, 2022

I amended my commits and signed-off with my name. Let me know if there is anything else. I managed to make CAPD work as well but discovered something that will likely require a PR on CAPD side, so I'm holding it off for now as I could be adding outdated information by then.

madalazar added a commit to madalazar/platform-aware-scheduling that referenced this pull request Nov 22, 2022
Signed-off-by: Madalina Lazar <madalina.lazar@intel.com>
@madalazar
Copy link
Contributor

Hi,
Everything looks ok, from what I could see. Just wanted to also give you an update, we started looking into this PR and between this and other small things, we plan on working and this PR this year.
I will keep you updated as we work through this.
Thank you!

@criscola
Copy link
Author

criscola commented Jan 3, 2023

Quick update: I will add also the instructions on the CAPD (Cluster API Docker) deployment for local testing/development soon.

@criscola
Copy link
Author

Hi! I added the specific notes for a Docker deployment, it's very similar to a deployment on a generic provider like GCP, but there are some caveats because it uses an experimental feature called ClusterClass, plus a few steps for the local KinD cluster setup/kubeconfig export. Let me know if there is any comment on this. I'm always using the Docker provider now for local development (saves us a few bucks and it's also a nicer user experience as it's faster).

Copy link
Contributor

@madalazar madalazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had issues setting up TAS with cluster API after following this wiki. There are one/two issues that might be the problem (i.e. naming of capi-quickstart.yaml / your-manifest.yaml).
The rest of these comments are regarding the readability/structure of the wiki.

Comment on lines 140 to 141
ClusterResourceSets resources are already given to you in `clusterresourcesets.yaml`.
Apply them to the management cluster with `kubectl apply -f clusterresourcesets.yaml`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this refer to docker/clusterresourcesets.yaml? And for generic, it also refers to it's own clusterresourcesets.yaml

Copy link
Author

@criscola criscola Jan 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CRS resources are actually the same. Do you think it would make more sense to have only one file for both vs duplicating it, if yes any idea where to put the CRS file in the folder tree?
Edit: I created a shared folder and referenced the common resources in generic and docker guides. Should be more maintainable. If they go out of sync we can always move them out of the shared folder.

Copy link
Contributor

@madalazar madalazar Jan 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having the shared folder and just referencing the files is a good idea. Do we still need the clusterresourcessets.yaml file in both (docker, generic) folders?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope I think we can just have it in the shared folder, it's also linked in both guides so users should find it easily.

@criscola
Copy link
Author

@madalazar I addressed all the comments - thanks for the detailed feedback. I also improved a few parts/solved a few problems. Feel free to continue with the review.

> capi-quickstart.yaml
```

If Kind was running correctly, and the Docker provider was initialized with the previous command, the command will return nothing to indicate success.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is needed, seems like a remnant from the Docker/Kind installation

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be something I added after this comment, should it be removed?

@madalazar
Copy link
Contributor

I want to go through the installation steps in the Readme one last time just to check that everything is in order. I'm planning to do that this week/Tuesday the latest. I will update the PR

@madalazar
Copy link
Contributor

@criscola HI, sorry it took so long for me to get back to you with a reply. Between then and now I've been looking at the security side of this solution (the new components that are brought in are they secure(d), if not what's missing, the infrastructure that we are creating is that secure etc.).

The biggest problem that I found (and that is because I have been able to test only the CAPI Docker provider part of the solution) is with a component that CAPD is introducing: capi-quickstart-lb. This component opens up a port to the outside world on the host that it's installed on and even though the connection is secure, the cyphers use are out-of-date.
I cut kubernetes-sigs/cluster-api#8245 to the CAPI team to see if I they would fix it or propose other solutions. This is still WIP.

There are 5 more issues, but I think they should be more manageable to address:

ConfigMap/custom-metrics-configmap
ConfigMap/custom-metrics-tls-secret-configmap
ConfigMap/tas-configmap
ConfigMap/tas-tls-secret-configmap

With most of the issues being the presence of 'secret', 'key' in the config maps:

HIGH: ConfigMap 'custom-metrics-configmap' in 'default' namespace stores secrets in key(s) or value(s) '{"        secret"}'
HIGH: ConfigMap 'custom-metrics-tls-secret-configmap' in 'default' namespace stores sensitive contents in key(s) or value(s) '{"  tls.crt", "  tls.key"}'
HIGH: ConfigMap 'custom-metrics-tls-secret-configmap' in 'default' namespace stores secrets in key(s) or value(s) '{"custom-metrics-tls-secret.yaml"}'
HIGH: ConfigMap 'tas-configmap' in 'default' namespace stores sensitive contents in key(s) or value(s) '{"                  - key", "            - -key", "         key"}'
HIGH: ConfigMap 'tas-configmap' in 'default' namespace stores secrets in key(s) or value(s) '{"          secret"}'
HIGH: ConfigMap 'tas-tls-secret-configmap' in 'default' namespace stores sensitive contents in key(s) or value(s) '{"  tls.crt", "  tls.key"}'
HIGH: ConfigMap 'tas-tls-secret-configmap' in 'default' namespace stores secrets in key(s) or value(s) '{"tas-tls-secret.yaml"}'

For this particular issues, could we instead of config maps just install the components as they come:

helm install node-exporter deploy/charts/prometheus_node_exporter_helm_chart/
helm install prometheus deploy/charts/prometheus_helm_chart/
--- create key ----
--- create tls secret from key ----
helm install prometheus-adapter deploy/charts/prometheus_custom_metrics_helm_chart/
--- configure scheduler  using configuration script ---
--- create extender secret  from key ----
kubectl apply -f deploy/
kubectl apply -f ../shared/clusterresourcesets.yaml`
kubectl apply -f capi-quickstart.yaml

I'm in the process of releasing a new version of TAS and tried to get this PR added to it, but because of these issues it might not happen.

Let me know what you think of my suggestions and apologies again for the delay,
Madalina

@criscola
Copy link
Author

criscola commented Mar 10, 2023

Hi! Yeah I see about the docker container. Keep in mind this is not supposed to be used for any production release, only development, so I wouldn't worry about it (but good you opened an issue, let's see if anything moves there).

For the configmaps, those are necessary if you wanted to automate the deployment through CRS resources. The user could gitignore them if it wants. Not sure if I would remove them since the purpose of this is a bit to have everything in declarative configuration. Moreover I put a disclaimer on the docs to tell the user to not commit TLS secrets to their repository as it's bad security practice.

@madalazar
Copy link
Contributor

Hi, here are a couple of updates from my side:

In the capi-quickstart.yaml, we need to make the following changes:

    • pass the ciphers through the "extraArgs" parameter of the KubeadmControlPlaneTemplate component
   256         clusterConfiguration:
    257           apiServer:
    258             certSANs:
    259               - localhost
    260               - 127.0.0.1
    261               - 0.0.0.0
    262               - host.docker.internal
    263             extraArgs:
    264               tls-cipher-suites: TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
    265           controllerManager:
    • update the podSecurityStandard property of the Cluster component to "enabled: false"
364  apiVersion: cluster.x-k8s.io/v1beta1
365  kind: Cluster
...
    381  topology:
    382     class: quick-start
    383     controlPlane:
    384       metadata: {}
    385       replicas: 1
    386     variables:
    387       - name: imageRepository
    388         value: ""
    389       - name: etcdImageTag
    390         value: ""
    391       - name: coreDNSImageTag
    392         value: ""
    393       - name: podSecurityStandard
    394         value:
    395           audit: restricted
    396           enabled: false
    397           enforce: baseline
    398           warn: restricted
...

Apologies for the delay,
Madalina

@criscola
Copy link
Author

criscola commented Apr 20, 2023

Hi Madalina, thanks for your answer. Sorry I'm a bit tied up with another task right now but as soon as I'm done with it I want to address your points/update the PR. Thanks for the patience.

@criscola
Copy link
Author

Hi Madalina, one question about your last point:

I thought about this for a while and I think it's best for us to not use config maps to deploy TAS at all. We don't support it and don't have plans to do so in the near future. Instead, I would like to replace steps 5 and 6 with links to https://github.com/intel/platform-aware-scheduling/blob/master/telemetry-aware-scheduling/docs/custom-metrics.md#quick-install and https://github.com/intel/platform-aware-scheduling/tree/master/telemetry-aware-scheduling#deploy-tas.

In my opinion, this defeats a bit the purpose of the Cluster API deployment option, because it prevents a fully automatic deployment of the TAS. When we use ClusterResourceSet, we effectively store the resources to deploy in ConfigMap which will then used to deploy to workload clusters from the management cluster. Everything is then stored in git, in a true GitOps way. There are other ways to do this, but since CRS are from Kubernetes SIG, this seems a good default method to include here, in our case, we use ArgoCD to manage Prometheus/Adapter but we started from CRS.

Is the concern mainly with an increased maintenance burden on the contributors' side? As long as helm charts are used, I believe the user will always need to render the resources in the same way, so there should be no additional work to be done in the future on our side. Of course if something changes to the other configs (like the scheduler's) then it needs to be changed in the CAPI manifests but the change should be well localized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants