Background
The NLLB-200 machine translation system provided by a research team at Meta (Facebook) was initially running on AWS hosting managed by Meta as a temporary solution. We recently migrated it to an AWS account managed by WMF (T321781). This allowed us to keep supporting the initial set of communities, several of which had no previous machine translation options. However, the budget constraints of this approach prevent us from using the machine translation system to its full potential. Hosting the system directly on Wikimedia infrastructure was not an option because it depends on NVIDIA GPUs and hence on non-free CUDA drivers.
A recent exploration by @santhosh found an alternative way to get the same or better performance using only CPUs. This is achieved by a one-time conversion of the model to a special format with the help of CTranslate2, which optimizes the model for inference in low-processor and low-memory settings. A version of this is running at https://translate.wmcloud.org/. It provides good translation performance, but it runs on a cloud VM.
The WMF Language team would like to host this system in production.
As per the team's consensus, the MT service will be called the "MinT" machine translation service. This name is only used when exposing it as an option to users.
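As a rough illustration of the one-time conversion step, the sketch below builds the command line for CTranslate2's `ct2-transformers-converter` tool. The checkpoint name, the output directory, and the `int8` quantization setting are assumptions made for illustration; the task does not state which values were actually used.

```python
import shlex


def build_convert_command(model_name, output_dir, quantization="int8"):
    """Build the argv for CTranslate2's Transformers converter CLI.

    All argument values are illustrative assumptions, not the settings
    used for the deployed service.
    """
    return [
        "ct2-transformers-converter",
        "--model", model_name,
        "--output_dir", output_dir,
        "--quantization", quantization,
    ]


# Hypothetical checkpoint and output directory, for illustration only.
cmd = build_convert_command("facebook/nllb-200-distilled-600M", "nllb-200-ct2")

# Print the command rather than running it: the conversion downloads the
# model and needs the ctranslate2 and transformers packages installed.
print(shlex.join(cmd))
```

The printed command can be run once by hand (or passed to `subprocess.run(cmd, check=True)`); the resulting directory is what the CPU-only service would load at startup.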
Plan
A rough plan for the next steps (the order isn't strict; things can be done in parallel):
- Request to deploy a new service T329971: New Service Deployment Request: NNLB-200 for machine translation
- Develop an application to expose the machine translation service
- Adding Prometheus metrics support to the app
- Adding structured logging per ECS to the app
- Adding a /healthz endpoint for readiness probes (the app is expected to take some time to load and expand the 3GB model in memory)
- Adding capability to the app to fetch the 3GB model over HTTP (the exact location is still an open question; people.wikimedia.org or, per @elukey's suggestion, Swift are good candidates, the latter vastly preferred down the line) -> App can fetch models from https://people.wikimedia.org/~santhosh/nllb/ at run time
- Moving the repo to gerrit https://gerrit.wikimedia.org/g/mediawiki/services/machinetranslation
- Provide an architecture diagram to give a high-level understanding of how this fits in our infrastructure
- Enable the deployment pipeline and get the first container built
- Get a new helm chart using the create_service.sh script of the deployment-charts repo
- Using that first container, validate that the helm chart works
- Get a namespace on the proper cluster to deploy the service into
- Deploy (first deploy with help, subsequent deploys will be done by the Language team)
- Remove Flores
- Remove configuration (https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/922064)
- Remove key/secret T337284: Remove Flores key from production
- QA tests: each language was checked as it was enabled (T326578, T339105, T340953, T336683, T333969)
- Announce: blog post
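
For the /healthz readiness item in the plan above, a minimal sketch of the idea using only the Python standard library could look like the following. It assumes the app signals readiness through a flag that is set once the model has finished loading; the names `model_ready`, `HealthHandler`, `start_server`, and `probe` are hypothetical and not taken from the actual service code.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical flag: the app would set this after the model is in memory.
model_ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        ready = model_ready.is_set()
        body = json.dumps({"status": "ok" if ready else "loading"}).encode()
        # 503 keeps the pod out of rotation until the model is loaded.
        self.send_response(200 if ready else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def start_server(port=0):
    """Serve /healthz on localhost in a background thread (port 0 = any free port)."""
    server = ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(server):
    """Return the HTTP status code that /healthz currently reports."""
    conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1])
    try:
        conn.request("GET", "/healthz")
        response = conn.getresponse()
        response.read()
        return response.status
    finally:
        conn.close()
```

A readiness probe pointed at this endpoint would see 503 while the model is still loading and 200 once `model_ready.set()` has been called, so traffic is only routed to instances that can actually translate.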