
Deploy the recommendation-api-ng on LiftWing
Closed, Resolved, Public

Description

In T339890, we made adjustments to the deployment settings of the recommendation-api-ng until we reached a state where it could run on LiftWing. In this task, our goal is to deploy the rec-api on the staging environment first. If everything goes smoothly, we will then deploy it to the production environment.

Event Timeline

The first attempt to deploy the rec-api on LiftWing staging timed out and was rolled back:

kevinbazira@deploy1002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ helmfile -e ml-staging-codfw sync
Affected releases are:
  main (wmf-stable/python-webapp) UPDATED

skipping missing values file matching "values-main.yaml"
helmfile.yaml: basePath=.
Upgrading release=main, chart=wmf-stable/python-webapp
Release "main" does not exist. Installing it now.


FAILED RELEASES:
NAME
main
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:

PATH:
  /usr/bin/helm3

ARGS:
  0: helm3 (5 bytes)
  1: upgrade (7 bytes)
  2: --install (9 bytes)
  3: --reset-values (14 bytes)
  4: main (4 bytes)
  5: wmf-stable/python-webapp (24 bytes)
  6: --timeout (9 bytes)
  7: 600s (4 bytes)
  8: --atomic (8 bytes)
  9: --namespace (11 bytes)
  10: recommendation-api-ng (21 bytes)
  11: --values (8 bytes)
  12: /tmp/values285361875 (20 bytes)
  13: --values (8 bytes)
  14: /tmp/values965689366 (20 bytes)
  15: --values (8 bytes)
  16: /tmp/values329015421 (20 bytes)
  17: --values (8 bytes)
  18: /tmp/values041198520 (20 bytes)
  19: --history-max (13 bytes)
  20: 10 (2 bytes)
  21: --kubeconfig=/etc/kubernetes/recommendation-api-ng-deploy-ml-staging-codfw.config (81 bytes)

ERROR:
  exit status 1

EXIT STATUS
  1

STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/recommendation-api-ng-deploy-ml-staging-codfw.config
  Error: release main failed, and has been uninstalled due to atomic being set: timed out waiting for the condition

COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/recommendation-api-ng-deploy-ml-staging-codfw.config
  Release "main" does not exist. Installing it now.
  Error: release main failed, and has been uninstalled due to atomic being set: timed out waiting for the condition

On IRC, @elukey advised that we use deploy2002 because of a recent datacenter switchover. The second attempt to deploy on staging, run from deploy2002, returned:

kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ helmfile -e ml-staging-codfw sync
Affected releases are:
  main (wmf-stable/python-webapp) UPDATED

skipping missing values file matching "values-main.yaml"
helmfile.yaml: basePath=.
Upgrading release=main, chart=wmf-stable/python-webapp
Release "main" does not exist. Installing it now.


FAILED RELEASES:
NAME
main
in ./helmfile.yaml: failed processing release main: command "/usr/bin/helm3" exited with non-zero status:

PATH:
  /usr/bin/helm3

ARGS:
  0: helm3 (5 bytes)
  1: upgrade (7 bytes)
  2: --install (9 bytes)
  3: --reset-values (14 bytes)
  4: main (4 bytes)
  5: wmf-stable/python-webapp (24 bytes)
  6: --timeout (9 bytes)
  7: 600s (4 bytes)
  8: --atomic (8 bytes)
  9: --namespace (11 bytes)
  10: recommendation-api-ng (21 bytes)
  11: --values (8 bytes)
  12: /tmp/values722100947 (20 bytes)
  13: --values (8 bytes)
  14: /tmp/values904989206 (20 bytes)
  15: --values (8 bytes)
  16: /tmp/values292295293 (20 bytes)
  17: --values (8 bytes)
  18: /tmp/values984229304 (20 bytes)
  19: --history-max (13 bytes)
  20: 10 (2 bytes)
  21: --kubeconfig=/etc/kubernetes/recommendation-api-ng-deploy-ml-staging-codfw.config (81 bytes)

ERROR:
  exit status 1

EXIT STATUS
  1

STDERR:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/recommendation-api-ng-deploy-ml-staging-codfw.config
  Error: release main failed, and has been uninstalled due to atomic being set: timed out waiting for the condition

COMBINED OUTPUT:
  WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/kubernetes/recommendation-api-ng-deploy-ml-staging-codfw.config
  Release "main" does not exist. Installing it now.
  Error: release main failed, and has been uninstalled due to atomic being set: timed out waiting for the condition

I tried to sync and got the same error that Kevin reported above.
However, I saw that a pod was created during the process and that it was failing with CrashLoopBackOff.
The only issue I found (via kubectl describe) was the probes failing:

  Normal   Killing    4m16s                   kubelet            Container recommendation-api-ng-main failed liveness probe, will be restarted
  Warning  Unhealthy  4m6s (x4 over 4m36s)    kubelet            Liveness probe failed: dial tcp 10.194.61.146:8080: connect: connection refused
  Warning  Unhealthy  3m56s (x11 over 4m43s)  kubelet            Readiness probe failed: Get "http://10.194.61.146:8080/": dial tcp 10.194.61.146:8080: connect: connection refused
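
"connection refused" in the events above means that nothing accepted TCP connections on port 8080 of the pod IP when the kubelet probed it. As a rough illustration only, the same kind of check can be sketched in Python. The function name tcp_probe is hypothetical and is not part of the deployment tooling; the sketch only tests whether a TCP connection can be opened, which is a simplification of what the kubelet's probes do.

```python
import socket


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it approximates a TCP-level
    probe and does not send an HTTP request.
    """
    try:
        # create_connection raises OSError (including ConnectionRefusedError
        # and timeouts) when the connection cannot be established.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling tcp_probe("10.194.61.146", 8080) from a host that can reach the pod network would be expected to return False for as long as the container listens on a different port than the one the probes target.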

Change 958976 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update recommendation-api-ng port

https://gerrit.wikimedia.org/r/958976

Change 958977 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[research/recommendation-api@master] blubber: update recommendation-api-ng port

https://gerrit.wikimedia.org/r/958977

Change 958977 merged by jenkins-bot:

[research/recommendation-api@master] blubber: update recommendation-api-ng port

https://gerrit.wikimedia.org/r/958977

Change 958976 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update recommendation-api-ng image

https://gerrit.wikimedia.org/r/958976

Change 958978 had a related patch set uploaded (by Kevin Bazira; author: Kevin Bazira):

[operations/deployment-charts@master] ml-services: update recommendation-api-ng readiness_probe

https://gerrit.wikimedia.org/r/958978

Change 958978 merged by jenkins-bot:

[operations/deployment-charts@master] ml-services: update recommendation-api-ng readiness_probe

https://gerrit.wikimedia.org/r/958978

Change 959789 had a related patch set uploaded (by Elukey; author: Elukey):

[operations/deployment-charts@master] ml-services: enable the ingress module for recommendation-api-ng

https://gerrit.wikimedia.org/r/959789

Change 959789 merged by Elukey:

[operations/deployment-charts@master] ml-services: enable the ingress module for recommendation-api-ng

https://gerrit.wikimedia.org/r/959789

The recommendation-api-ng has been successfully deployed on LiftWing staging:

kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services$ curl https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/spec
{"basePath":"/api","consumes":["application/json"],"definitions":{"Article":{"properties":{"pageviews":{"description":"pageviews","type":"integer"},"rank":{"description":"rank","type":"number"},"title":{"description":"title","type":"string"},"wikidata_id":{"description":"wikidata_id","type":"string"}},"required":["rank","title","wikidata_id"],"type":"object"}},"info":{"title":"API","version":"1.0"},"paths":{"/":{"get":{"deprecated":true,"description":"Gets recommendations of source articles that are missing in the target","operationId":"get_legacy_article","parameters":[{"description":"Source wiki project language code","in":"query","name":"s","required":true,"type":"string"},{"description":"Target wiki project language code","in":"query","name":"t","required":true,"type":"string"},{"default":12,"description":"Number of recommendations to fetch","in":"query","maximum":24,"minimum":0,"name":"n","type":"integer"},{"description":"Seed article for personalized recommendations that can also be a list separated by \"|\"","in":"query","name":"article","pattern":"^([^|]+(\\|[^|]+)*)?$","type":"string"},{"default":true,"description":"Whether to include pageview counts","in":"query","name":"pageviews","type":"boolean"},{"collectionFormat":"multi","default":"morelike","description":"Which search algorithm to use if a seed is specified","enum":["morelike","wiki"],"in":"query","name":"search","type":"string"}],"responses":{"200":{"description":"Success","schema":{"items":{"$ref":"#/definitions/Article"},"type":"array"}}},"tags":["default"]}},"/spec":{"get":{"operationId":"get_spec","responses":{"200":{"description":"Success"}},"tags":["default"]}}},"produces":["application/json"],"responses":{"MaskError":{"description":"When any error occurs on mask"},"ParseError":{"description":"When a mask can't be parsed"}},"swagger":"2.0","tags":[{"description":"Default namespace","name":"default"}]}
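
The spec returned above describes a single (deprecated) GET endpoint at "/" under the basePath "/api", with required parameters s and t, an optional n between 0 and 24 (default 12), and an optional seed parameter named article. A minimal sketch of building a request URL that respects those constraints follows. The function name build_recommendation_url is hypothetical, and the base URL is simply the staging host from the curl command above.

```python
from typing import Optional
from urllib.parse import urlencode

# Staging host and port copied from the curl command above;
# "/api" is the basePath reported by the spec.
BASE_URL = "https://recommendation-api-ng.k8s-ml-staging.discovery.wmnet:31443/api/"


def build_recommendation_url(source: str, target: str, count: int = 12,
                             seed: Optional[str] = None) -> str:
    """Build a query URL for the legacy "/" endpoint (hypothetical helper)."""
    if not source or not target:
        raise ValueError("both s (source) and t (target) are required")
    if not 0 <= count <= 24:
        raise ValueError("n must be between 0 and 24")
    params = {"s": source, "t": target, "n": count}
    if seed is not None:
        # Several seed articles can be joined with "|"; urlencode escapes it.
        params["article"] = seed
    return BASE_URL + "?" + urlencode(params)
```

For example, build_recommendation_url("en", "fr", 5) yields the base URL followed by ?s=en&t=fr&n=5, which could then be passed to curl from a deployment host.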

The next step is to deploy it to prod. Below are a few notes from @elukey:

  • in staging we have a single VIP/LB endpoint, so we don't have to add a new one for each service, but the TLS config and the Istio routing config still need to be enabled
  • for prod we'll need to add a dedicated LVS/VIP endpoint (recommendation-api-ng.discovery.wmnet); elukey or Tobias will handle that task

The recommendation-api-ng has successfully been deployed to LiftWing production in both eqiad and codfw:

kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kube_env recommendation-api-ng ml-serve-eqiad
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
recommendation-api-ng-main-5c4f58c685-jts5q   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-m4mfs   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-q8vdn   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-snn99   2/2     Running   0          8m24s
recommendation-api-ng-main-5c4f58c685-tfdmt   2/2     Running   0          8m24s
...
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kube_env recommendation-api-ng ml-serve-codfw
kevinbazira@deploy2002:/srv/deployment-charts/helmfile.d/ml-services/recommendation-api-ng$ kubectl get pods
NAME                                          READY   STATUS    RESTARTS   AGE
recommendation-api-ng-main-5c4f58c685-dc22j   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-mqm97   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-s47l8   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-w8s2p   2/2     Running   0          7m37s
recommendation-api-ng-main-5c4f58c685-zw9tx   2/2     Running   0          7m37s
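
The pod listings above were checked by eye. As an optional sketch, the same check (every pod Running with all containers ready) could be scripted by parsing the text that kubectl get pods prints. The function name unready_pods is hypothetical, and the parsing assumes the default column layout shown above (NAME, READY, STATUS, ...).

```python
from typing import List


def unready_pods(kubectl_output: str) -> List[str]:
    """Return names of pods that are not Running or not fully ready.

    Hypothetical helper; expects the default `kubectl get pods` table
    with a header row followed by one pod per line.
    """
    bad = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank lines and "..." elisions
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            bad.append(name)
    return bad
```

An empty list means every listed pod reported all of its containers ready, which matches the 2/2 Running state shown for both eqiad and codfw.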