In T339890, we made adjustments to the deployment settings of the recommendation-api-ng until we reached a state where it could run on LiftWing. In this task, our goal is to deploy the rec-api on the staging environment first. If everything goes smoothly, we will then deploy it to the production environment.
Description
Details
Status | Assigned | Task
---|---|---
Open | None | T296994 Observations from research study for Section Translation on Thai Wikipedia
Open | None | T293648 Content Translation Recommendations API
Resolved | kevinbazira | T308164 Migrate Content Translation Recommendation API to Lift Wing
Resolved | kevinbazira | T339890 Host the recommendation-api container on LiftWing
Resolved | kevinbazira | T347015 Deploy the recommendation-api-ng on LiftWing
Resolved | klausman | T347262 Set SLO for the recommendation-api-ng service hosted on LiftWing
Resolved | klausman | T347263 Create external endpoint for recommendation-api-ng hosted on LiftWing
Resolved | kevinbazira | T354601 Fix rec-api-ng relative paths handling
Resolved | kevinbazira | T347475 Investigate recommendation-api-ng internal endpoint failure
Resolved | kevinbazira | T348607 Configure envoy settings to enable rec-api-ng container to access endpoints external to k8s/LiftWing
Event Timeline
The attempt to deploy the rec-api on LW staging returned:
kevinbazira@deploy1002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ helmfile -e ml-staging-codfw sync
Affected releases are:
  main (wmf-stable/python-webapp) UPDATED
skipping missing values file matching "values-main.yaml"
helmfile.yaml: basePath=.
Upgrading release=main, chart=wmf-stable/python-webapp
Release "main" does not exist. Installing it now.
FAILED RELEASES:
NAME
main
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:
PATH:
  /usr/bin/helm3
ARGS:
  0: helm3 (5 bytes)
  1: upgrade (7 bytes)
  2: --install (9 bytes)
  3: --reset-values (14 bytes)
  4: main (4 bytes)
  5: wmf-stable/python-webapp (24 bytes)
  6: --timeout (9 bytes)
  7: 600s (4 bytes)
  8: --atomic (8 bytes)
  9: --namespace (11 bytes)
  10: recommendation-api-ng (21 bytes)
  11: --values (8 bytes)
  12: /tmp/values285361875 (20 bytes)
  13: --values (8 bytes)
  14: /tmp/values965689366 (20 bytes)
  15: --values (8 bytes)
  16: /tmp/values329015421 (20 bytes)
  17: --values (8 bytes)
  18: /tmp/values041198520 (20 bytes)
  19: --history-max (13 bytes)
  20: 10 (2 bytes)
  21: --kubeconfig=/etc/kubernetes/recommendation-api-ng-deploy-ml-staging-codfw.config (81 bytes)
ERROR:
  exit status 1
EXIT STATUS
  1
STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/recommendation-api-ng-deploy-ml-staging-codfw.config
  Error: release main failed, and has been uninstalled due to atomic being set: timed out waiting for the condition
COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/recommendation-api-ng-deploy-ml-staging-codfw.config
  Release "main" does not exist. Installing it now.
  Error: release main failed, and has been uninstalled due to atomic being set: timed out waiting for the condition
On IRC, @elukey advised that we use deploy2002 due to a recent datacenter switchover. The second attempt to deploy on staging returned:
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ helmfile -e ml-staging-codfw sync
Affected releases are:
  main (wmf-stable/python-webapp) UPDATED
skipping missing values file matching "values-main.yaml"
helmfile.yaml: basePath=.
Upgrading release=main, chart=wmf-stable/python-webapp
Release "main" does not exist. Installing it now.
FAILED RELEASES:
NAME
main
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:
PATH:
  /usr/bin/helm3
ARGS:
  0: helm3 (5 bytes)
  1: upgrade (7 bytes)
  2: --install (9 bytes)
  3: --reset-values (14 bytes)
  4: main (4 bytes)
  5: wmf-stable/python-webapp (24 bytes)
  6: --timeout (9 bytes)
  7: 600s (4 bytes)
  8: --atomic (8 bytes)
  9: --namespace (11 bytes)
  10: recommendation-api-ng (21 bytes)
  11: --values (8 bytes)
  12: /tmp/values722100947 (20 bytes)
  13: --values (8 bytes)
  14: /tmp/values904989206 (20 bytes)
  15: --values (8 bytes)
  16: /tmp/values292295293 (20 bytes)
  17: --values (8 bytes)
  18: /tmp/values984229304 (20 bytes)
  19: --history-max (13 bytes)
  20: 10 (2 bytes)
  21: --kubeconfig=/etc/kubernetes/recommendation-api-ng-deploy-ml-staging-codfw.config (81 bytes)
ERROR:
  exit status 1
EXIT STATUS
  1
STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/recommendation-api-ng-deploy-ml-staging-codfw.config
  Error: release main failed, and has been uninstalled due to atomic being set: timed out waiting for the condition
COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/recommendation-api-ng-deploy-ml-staging-codfw.config
  Release "main" does not exist. Installing it now.
  Error: release main failed, and has been uninstalled due to atomic being set: timed out waiting for the condition
I tried to sync and got the same error that Kevin reported above.
However, I saw that a pod created during the process was failing with CrashLoopBackOff.
The only issue I saw was the probes failing (via kubectl describe):
Normal   Killing    4m16s                   kubelet  Container recommendation-api-ng-main failed liveness probe, will be restarted
Warning  Unhealthy  4m6s (x4 over 4m36s)    kubelet  Liveness probe failed: dial tcp 10.194.61.146:8080: connect: connection refused
Warning  Unhealthy  3m56s (x11 over 4m43s)  kubelet  Readiness probe failed: Get "http://10.194.61.146:8080/": dial tcp 10.194.61.146:8080: connect: connection refused
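A "connection refused" on a probe generally means that nothing is accepting connections on the probed port at that moment. As a rough illustration only (this is not the kubelet's implementation, and the helper name `tcp_probe` is hypothetical), a TCP-style check can be sketched as a plain connection attempt:

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections and timeouts alike.
        return False


if __name__ == "__main__":
    # Stand-in for the application: a listener on an OS-assigned local port.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        listening_port = server.getsockname()[1]
        print("probe while listening:", tcp_probe("127.0.0.1", listening_port))
    # The listener is closed now, so the same probe no longer connects.
    print("probe after close:", tcp_probe("127.0.0.1", listening_port))
```

If the application binds a different port than the one the probe targets, every probe fails in this way even though the process itself may be healthy.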
Change 958976 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[operations/deployment-charts@master] ml-services: update recommendation-api-ng port
Change 958977 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[research/recommendation-api@master] blubber: update recommendation-api-ng port
Change 958977 merged by jenkins-bot:
[research/recommendation-api@master] blubber: update recommendation-api-ng port
Change 958976 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update recommendation-api-ng image
Change 958978 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):
[operations/deployment-charts@master] ml-services: update recommendation-api-ng readiness_probe
Change 958978 merged by jenkins-bot:
[operations/deployment-charts@master] ml-services: update recommendation-api-ng readiness_probe
Change 959789 had a related patch set uploaded (by Elukey; author: Elukey):
[operations/deployment-charts@master] ml-services: enable the ingress module for recommendation-api-ng
Change 959789 merged by Elukey:
[operations/deployment-charts@master] ml-services: enable the ingress module for recommendation-api-ng
The recommendation-api-ng has been successfully deployed on LiftWing staging:
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services$ curl https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/spec
{"basePath":"/api","consumes":["application/json"],"definitions":{"Article":{"properties":{"pageviews":{"description":"pageviews","type":"integer"},"rank":{"description":"rank","type":"number"},"title":{"description":"title","type":"string"},"wikidata_id":{"description":"wikidata_id","type":"string"}},"required":["rank","title","wikidata_id"],"type":"object"}},"info":{"title":"API","version":"1.0"},"paths":{"/":{"get":{"deprecated":true,"description":"Gets recommendations of source articles that are missing in the target","operationId":"get_legacy_article","parameters":[{"description":"Source wiki project language code","in":"query","name":"s","required":true,"type":"string"},{"description":"Target wiki project language code","in":"query","name":"t","required":true,"type":"string"},{"default":12,"description":"Number of recommendations to fetch","in":"query","maximum":24,"minimum":0,"name":"n","type":"integer"},{"description":"Seed article for personalized recommendations that can also be a list separated by \"|\"","in":"query","name":"article","pattern":"^([^|]+(\\|[^|]+)*)?$","type":"string"},{"default":true,"description":"Whether to include pageview counts","in":"query","name":"pageviews","type":"boolean"},{"collectionFormat":"multi","default":"morelike","description":"Which search algorithm to use if a seed is specified","enum":["morelike","wiki"],"in":"query","name":"search","type":"string"}],"responses":{"200":{"description":"Success","schema":{"items":{"$ref":"#/definitions/Article"},"type":"array"}}},"tags":["default"]}},"/spec":{"get":{"operationId":"get_spec","responses":{"200":{"description":"Success"}},"tags":["default"]}}},"produces":["application/json"],"responses":{"MaskError":{"description":"When any error occurs on mask"},"ParseError":{"description":"When a mask can't be parsed"}},"swagger":"2.0","tags":[{"description":"Default namespace","name":"default"}]}
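The spec returned above declares the query parameters of the legacy "/" endpoint: `s` and `t` are required, and `n` defaults to 12 with a range of 0 to 24. A minimal sketch of building a request URL against the staging endpoint follows. The function name `build_recommendation_url` and the `en`/`fr` language codes are illustrative assumptions, and the sketch only builds the URL without sending a request:

```python
from typing import Optional
from urllib.parse import urlencode

# Base URL from the curl command above; "/api" is the spec's basePath and
# "/" is the path of the legacy recommendations endpoint.
BASE_URL = "https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/"

# Limits for "n" as declared in the spec (minimum 0, maximum 24, default 12).
N_MIN, N_MAX, N_DEFAULT = 0, 24, 12


def build_recommendation_url(source: str, target: str,
                             count: int = N_DEFAULT,
                             seed: Optional[str] = None) -> str:
    """Build a query URL for the "/" endpoint described in the spec."""
    if not source or not target:
        raise ValueError("both 's' (source) and 't' (target) are required")
    if not N_MIN <= count <= N_MAX:
        raise ValueError(f"'n' must be between {N_MIN} and {N_MAX}")
    params = {"s": source, "t": target, "n": count}
    if seed is not None:
        # One seed article, or several separated by "|", per the spec.
        params["article"] = seed
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical language pair, chosen only for illustration.
    print(build_recommendation_url("en", "fr", count=3))
```

Validating `n` on the client side mirrors the bounds the spec advertises, so an out-of-range value is caught before any request is made.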
The next step is to deploy it to prod. Below are a few notes from @elukey:
- In staging we have a single VIP/LB endpoint, so we don't have to add a new one each time, but the TLS config and the Istio routing config need to be enabled.
- For prod we'll need to add a dedicated LV/VIP endpoint (recommendation-api-ng.discovery.wmnet); elukey or Tobias will handle that task.
The recommendation-api-ng has successfully been deployed to LiftWing production in both eqiad and codfw:
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kube_env recommendation-api-ng ml-serve-eqiad
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
recommendation-api-ng-main-5c4f58c685-jts5q   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-m4mfs   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-q8vdn   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-snn99   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-tfdmt   2/2     Running   0          8m24s
...
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kube_env recommendation-api-ng ml-serve-codfw
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
recommendation-api-ng-main-5c4f58c685-dc22j   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-mqm97   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-s47l8   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-w8s2p   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-zw9tx   2/2     Running   0          7m37s
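A check like the one above can also be done programmatically. The sketch below parses the text output of `kubectl get pods` and lists pods that are not fully ready. The helper name `unready_pods` is hypothetical, the sample rows are copied from the eqiad output above, and the parsing assumes the default column layout (NAME, READY, STATUS, ...):

```python
from typing import List

# Two rows copied from the eqiad output above, plus the header line.
SAMPLE_OUTPUT = """\
NAME                                          READY   STATUS    RESTARTS   AGE
recommendation-api-ng-main-5c4f58c685-jts5q   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-m4mfs   2/2     Running   0          8m24s
"""


def unready_pods(kubectl_output: str) -> List[str]:
    """Return names of pods that are not Running with all containers ready."""
    not_ready = []
    # The first line is assumed to be the column header.
    for line in kubectl_output.strip().splitlines()[1:]:
        fields = line.split()
        if len(fields) < 3:
            continue  # Skip truncation markers such as "..." and blank lines.
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            not_ready.append(name)
    return not_ready


if __name__ == "__main__":
    print("pods not ready:", unready_pods(SAMPLE_OUTPUT))
```

An empty result means every listed pod reports all of its containers ready, which matches the `2/2 Running` rows shown for both clusters.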