Allow lazy loading of translations in gettext. · Issue #79809 · python/cpython · GitHub

Open
s-ball mannequin opened this issue Dec 31, 2018 · 2 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@s-ball
Mannequin
s-ball mannequin commented Dec 31, 2018
BPO 35628
Nosy @s-ball

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2018-12-31.13:14:02.336>
labels = ['3.8', 'type-feature', 'library']
title = 'Allow lazy loading of translations in gettext.'
updated_at = <Date 2018-12-31.13:14:02.336>
user = 'https://github.com/s-ball'

bugs.python.org fields:

activity = <Date 2018-12-31.13:14:02.336>
actor = 's-ball'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2018-12-31.13:14:02.336>
creator = 's-ball'
dependencies = []
files = []
hgrepos = []
issue_num = 35628
keywords = []
message_count = 1.0
messages = ['332815']
nosy_count = 1.0
nosy_names = ['s-ball']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue35628'
versions = ['Python 3.8']

@s-ball
Mannequin Author
s-ball mannequin commented Dec 31, 2018

When working on i18n, I realized that msgfmt.py does not generate any hash table. Going one step further, I realized that gettext.py would not have used it anyway, because it unconditionally loads the whole translation file and contains the following TODO message:

TODO:

  • Lazy loading of .mo files. Currently the entire catalog is loaded into
    memory, but that's probably bad for large translated programs. Instead,
    the lexical sort of original strings in GNU .mo files should be exploited
    to do binary searches and lazy initializations. Or you might want to use
    the undocumented double-hash algorithm for .mo files with hash tables, but
    you'll need to study the GNU gettext code to do this.

I have studied the code and found that it should not be too complex to implement in pure Python. I have posted a message on python-ideas about it, and here are my conclusions:

Features:
========
The gettext module should be able to lazily load catalogs from mo
files. This lazy loading should be optional and make use of the hash tables
from mo files when they are present, or fall back to a binary search. The
translation strings should be cached for better performance.

API changes:
============
Three functions from the gettext module would gain two new optional
parameters, caching and keepopen:

gettext.bindtextdomain(domain, localedir=None) would become
gettext.bindtextdomain(domain, localedir=None, caching=None, keepopen=False)

gettext.translation(domain, localedir=None, languages=None, class_=None,
fallback=False, codeset=None) would become
gettext.translation(domain, localedir=None, languages=None, class_=None,
fallback=False, codeset=None, caching=None, keepopen=False)

gettext.install(domain, localedir=None, codeset=None, names=None) would
become
gettext.install(domain, localedir=None, codeset=None, names=None,
caching=None, keepopen=False)

The new caching parameter could receive the following values:
caching=None: revert to the previous eager loading of the full catalog.
It would be the default, so that existing applications see no change
caching=1: lazy loading with unlimited cache
caching=n where n is a positive (>=0) integer value: lazy loading with an
LRU cache limited to n strings
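
The bounded and unbounded cache behaviours described above can be illustrated with a small sketch. `TranslationCache` and its `limit` argument are hypothetical names chosen for this example and are not part of gettext; how the proposed `caching` values would map onto `limit` is left open here.

```python
from collections import OrderedDict


class TranslationCache:
    """Hypothetical cache for looked-up translations (not part of gettext)."""

    def __init__(self, limit=None):
        # limit=None: never evict; limit=n: keep at most n entries.
        self.limit = limit
        self._data = OrderedDict()

    def get(self, key):
        # Returns None on a miss; a hit becomes the most recently used entry.
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        # Evict least recently used entries until the limit is respected.
        while self.limit is not None and len(self._data) > self.limit:
            self._data.popitem(last=False)

    def __len__(self):
        return len(self._data)


cache = TranslationCache(limit=2)
cache.put("Hello", "Bonjour")
cache.put("Bye", "Au revoir")
cache.get("Hello")            # "Hello" is now the most recently used entry
cache.put("Thanks", "Merci")  # evicts "Bye", the least recently used entry
```

An `OrderedDict` keeps the sketch short; `functools.lru_cache` would be an alternative if the lookup were wrapped in a function instead of a mapping.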

The keepopen parameter would be a boolean:
keepopen=False (default): the mo file is opened just before loading a
translation string and closed immediately after - it is also opened once
when the GNUTranslations class is initialized to load the file description
keepopen=True: the mo file is kept open during the lifetime of the
GNUTranslations object.
This parameter is ignored if caching is None
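
The two keepopen modes can be sketched as follows. `MoReader` and `read_at` are hypothetical names used only for illustration, and a throwaway temporary file stands in for a real mo file.

```python
import os
import tempfile


class MoReader:
    """Hypothetical helper: read byte ranges, optionally keeping the file open."""

    def __init__(self, path, keepopen=False):
        self.path = path
        self.keepopen = keepopen
        # With keepopen=True the handle lives as long as the reader does.
        self._fp = open(path, "rb") if keepopen else None

    def read_at(self, offset, length):
        if self._fp is not None:
            self._fp.seek(offset)
            return self._fp.read(length)
        # keepopen=False: open, read and close again for every request.
        with open(self.path, "rb") as fp:
            fp.seek(offset)
            return fp.read(length)

    def close(self):
        if self._fp is not None:
            self._fp.close()
            self._fp = None


# Demo on a throwaway file standing in for a .mo file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
demo_path = tmp.name

one_shot = MoReader(demo_path)                   # keepopen=False
persistent = MoReader(demo_path, keepopen=True)
first = one_shot.read_at(2, 3)
second = persistent.read_at(7, 2)
persistent.close()
os.unlink(demo_path)
```

The trade-off is the usual one: keeping the file open avoids repeated open/close calls, at the cost of holding a file descriptor for the lifetime of the object.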

Implementation:
==============
The current GNUTranslations class loads the content of the mo file to
build a dictionary where the original strings are the keys and the
translated strings the values. Plural forms get special processing: the
key is a 2-tuple (singular original string, order), and the value is the
corresponding translated string - order=0 is normally for the singular
translated string.
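
The dictionary layout described above can be inspected with a small sketch. `build_mo` is a hypothetical helper written for this example; it produces a minimal little-endian mo file without a hash table and without a metadata entry. `_catalog` is a private attribute of the CPython implementation and is read here for illustration only.

```python
import gettext
import io
import struct


def build_mo(entries):
    """Build minimal little-endian .mo bytes (no hash table) from {bytes: bytes}."""
    items = sorted(entries.items())      # originals must be sorted bytewise
    n = len(items)
    strings_start = 28 + 16 * n          # 7-word header + two (length, offset) tables
    tables, blob = [b"", b""], b""
    for column in (0, 1):                # 0: originals, 1: translations
        for pair in items:
            text = pair[column]
            tables[column] += struct.pack("<II", len(text), strings_start + len(blob))
            blob += text + b"\0"         # every string is NUL-terminated
    header = struct.pack("<7I", 0x950412DE, 0, n, 28, 28 + 8 * n, 0, 0)
    return header + tables[0] + tables[1] + blob


mo_bytes = build_mo({
    b"Hello": b"Bonjour",
    b"apple\0apples": b"pomme\0pommes",  # plural entry: forms separated by NUL
})
translations = gettext.GNUTranslations(io.BytesIO(mo_bytes))
catalog = translations._catalog          # private attribute, illustration only

print(catalog)
```

The plural entry ends up as two keys, `("apple", 0)` and `("apple", 1)`, which is the key shape a lazy mapping would have to reproduce.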

The proposed implementation would simply replace this dictionary with a
special mapping subclass when caching is not None. That subclass would
use the same keys as the original dictionary and would:

  • first search in its cache
  • if not found in cache and the hash table has a non-zero size, search for
    the original string by hash
  • if not found in cache and the hash table has a zero size, search for the
    original string with a binary search algorithm
  • if a string is found, add it to the LRU cache, possibly evicting the
    oldest entry (or entries)

That should make it possible to implement the new feature with minimal
refactoring of the gettext module.
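
A minimal sketch of the cache-then-binary-search path, under several simplifying assumptions: `LazyCatalog` and `build_mo` are hypothetical names, the mo data is held in memory as bytes instead of being read from a file, only little-endian data is handled, keys are raw byte strings, and neither plural entries nor the hash-table path are covered.

```python
import struct
from collections import OrderedDict


def build_mo(entries):
    """Build minimal little-endian .mo bytes (no hash table) from {bytes: bytes}."""
    items = sorted(entries.items())
    n = len(items)
    strings_start = 28 + 16 * n
    tables, blob = [b"", b""], b""
    for column in (0, 1):
        for pair in items:
            text = pair[column]
            tables[column] += struct.pack("<II", len(text), strings_start + len(blob))
            blob += text + b"\0"
    header = struct.pack("<7I", 0x950412DE, 0, n, 28, 28 + 8 * n, 0, 0)
    return header + tables[0] + tables[1] + blob


class LazyCatalog:
    """Hypothetical lazy lookup over .mo bytes: cache first, then binary search."""

    def __init__(self, buf, cache_limit=None):
        self.buf = buf
        header = struct.unpack_from("<5I", buf, 0)
        magic, _version, self.count, self.orig_table, self.trans_table = header
        if magic != 0x950412DE:
            raise ValueError("only little-endian .mo data is handled in this sketch")
        self.cache = OrderedDict()
        self.cache_limit = cache_limit   # None means unlimited
        self.searches = 0                # counts binary searches, for the demo

    def _string(self, table, index):
        length, offset = struct.unpack_from("<II", self.buf, table + 8 * index)
        return self.buf[offset:offset + length]

    def _search(self, msgid):
        # Originals are sorted bytewise, so a classic binary search applies.
        self.searches += 1
        low, high = 0, self.count
        while low < high:
            mid = (low + high) // 2
            candidate = self._string(self.orig_table, mid)
            if candidate == msgid:
                return self._string(self.trans_table, mid)
            if candidate < msgid:
                low = mid + 1
            else:
                high = mid
        return None

    def lookup(self, msgid):
        if msgid in self.cache:
            self.cache.move_to_end(msgid)
            return self.cache[msgid]
        found = self._search(msgid)
        if found is not None:
            self.cache[msgid] = found
            while self.cache_limit is not None and len(self.cache) > self.cache_limit:
                self.cache.popitem(last=False)
        return found


lazy = LazyCatalog(build_mo({b"Hello": b"Bonjour", b"Bye": b"Au revoir", b"Thanks": b"Merci"}))
hit = lazy.lookup(b"Thanks")
again = lazy.lookup(b"Thanks")   # served from the cache, no second search
miss = lazy.lookup(b"Unknown")   # misses are not cached in this sketch
```

Only the strings actually compared during a search are sliced out of the buffer; a file-backed variant would replace the slicing in `_string` with seek-and-read calls.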

But I also propose to change msgfmt.py to build the hash table. IMHO, that function should live in the standard library, probably as a submodule of gettext, so that various Python projects (pybabel, django) can use it directly instead of developing their own.
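
To give an idea of what building and probing such a hash table could involve, here is a minimal sketch. It assumes a PJW-style 32-bit string hash, a prime table size of about 4/3 of the number of strings, and a double-hashing probe, which is how the GNU gettext scheme is commonly described. All function names are hypothetical, and the output has not been checked against files produced by GNU msgfmt.

```python
import math


def hash_string(data):
    """PJW-style 32-bit hash over bytes (assumed to mirror GNU gettext's)."""
    hval = 0
    for byte in data:
        hval = ((hval << 4) + byte) & 0xFFFFFFFF
        top = hval & 0xF0000000
        if top:
            hval ^= top >> 24
            hval ^= top
    return hval


def next_prime(n):
    """Smallest prime >= max(n, 3); trial division is enough for a sketch."""
    candidate = max(n, 3)
    while any(candidate % d == 0 for d in range(2, math.isqrt(candidate) + 1)):
        candidate += 1
    return candidate


def build_hash_table(sorted_msgids):
    """Return a list of slots; 0 means empty, k > 0 means string index k - 1."""
    size = next_prime(len(sorted_msgids) * 4 // 3)
    table = [0] * size
    for index, msgid in enumerate(sorted_msgids):
        hval = hash_string(msgid)
        slot = hval % size
        step = 1 + hval % (size - 2)      # second hash: probe increment
        while table[slot]:
            slot = (slot + step) % size
        table[slot] = index + 1
    return table


def find_index(table, sorted_msgids, msgid):
    """Probe the table for msgid; return its index in sorted_msgids or None."""
    size = len(table)
    hval = hash_string(msgid)
    slot = hval % size
    step = 1 + hval % (size - 2)
    while table[slot]:
        index = table[slot] - 1
        if sorted_msgids[index] == msgid:
            return index
        slot = (slot + step) % size
    return None                           # reached an empty slot: not present


msgids = sorted([b"Bye", b"Hello", b"Thanks", b"Welcome"])
table = build_hash_table(msgids)
```

Because the table size is prime and always larger than the number of strings, every probe sequence eventually reaches either the wanted string or an empty slot, so lookups of unknown strings terminate.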

I will probably submit a PR in a while, but it will require some time to propose a full implementation with correct test coverage.

@s-ball s-ball mannequin added 3.8 (EOL) end of life stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Dec 31, 2018
@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@encukou encukou removed the 3.8 (EOL) end of life label Mar 5, 2025
@StanFromIreland
Contributor

I am going to try to implement this. msgfmt.py will have to be modified to generate files with the hash table, but that will be its own PR.

performance notes

I created a small benchmarking setup (a .mo file with ~10000 unique entries, looking up the middle entry). I found that a C program was roughly 32x faster than its Python equivalent at retrieving that entry.

I also tested the effect of hash tables on the C implementation and found a 10% speedup. I thought it would have more of an effect; realistically, Python will probably only see a 5% speedup.
