Sentence CHRF silently accepts single reference as string #121
Thanks for reporting this. The PR synchronizes the behaviors across metrics.
Thanks! A similar issue, I think, concerns what CHRF does in sacrebleu/sacrebleu/metrics/chrf.py, line 136 in 65747b6. If you agree, perhaps you could change this as well?
I did not understand the latter issue. I accept that having a compat API and an object-oriented API at the same time is kind of confusing, and the latter should be supported with more examples/docs and maybe unit tests to discover potential flaws and argument inconsistencies.
In this function, there is already code to "add some robustness to the input arguments" (sacrebleu/sacrebleu/metrics/chrf.py, line 146 in 65747b6).
The code intercepts strings passed to this function and wraps them into lists. Could it also intercept, or give an error message for, the case where the references argument is a simple flat list of strings rather than a list of reference lists? Is that perhaps something worth changing as well? Let me know if this is clearer.
Suggestion:
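The suggested snippet itself is not preserved in this thread; a hypothetical check along these lines (function name and placement are assumptions, not the actual proposal) would reject a flat list of strings where a list of reference lists is expected:

```python
# Hypothetical guard, not the original suggestion from this thread:
# reject a bare string or a flat list of strings where a list of
# reference lists (e.g. [[ref1, ref2]]) is expected.
def check_references(references):
    if isinstance(references, str):
        raise TypeError("references must not be a bare string")
    if references and all(isinstance(r, str) for r in references):
        raise TypeError(
            "references must be a sequence of reference lists, "
            "e.g. [[ref]], not a flat list of strings")
```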
Oh, I see. As I am not the original author of those "robustness" parts, I was not able to clearly enumerate which scenarios are handled by these checks myself either.
Agreed, I totally understand your reasoning.
You suggest the latter for all metrics, right?
Sorry, I somehow did not read your post from a month ago closely enough (= my answer was a bit strange). In principle yes, but I think backwards compatibility would be a problem: it would not so much break existing code as keep it working while silently producing the wrong results. Edit: ah, you probably mean only
Not removing, but didn't you suggest adding the following as a third check to fix the issue?
Hm, I don't think I understand, sorry :( I thought you were saying that you would not want to add another of those checks, but rather explain to users how to use the functions correctly? (Yes, I did suggest this check for corpus CHRF and just checked: it works the same way for all metrics, yes.)
Sorry, as a quick fix I wanted to incorporate your suggestion rather than the API fix I previously suggested. Could you also paste a small example that triggers the current issue, so that we can add it to the tests?
Sure, something like this?

```python
>>> import sacrebleu
>>> hyp = "Es wird mit Gesamtkosten von 18.4 Millionen Franken gerechnet."
>>> ref = "Es werden Gesamtkosten von 18.4 Millionen Franken veranschlagt."

# correct
>>> sacrebleu.corpus_chrf(hyp, [[ref]])
chrF2 = 0.678

# incorrect, but there is an isinstance check for this
>>> sacrebleu.corpus_chrf(hyp, ref)
chrF2 = 0.678

# no check for this situation
>>> sacrebleu.corpus_chrf(hyp, [ref])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3-cpu/lib/python3.6/site-packages/sacrebleu/compat.py", line 103, in corpus_chrf
    return metric.corpus_score(hypotheses, references)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3-cpu/lib/python3.6/site-packages/sacrebleu/metrics/chrf.py", line 158, in corpus_score
    raise EOFError("Source and reference streams have different lengths!")
EOFError: Source and reference streams have different lengths!
```
- pin mecab version to 1.0.3 for Python 3.5 support
- assert if refs are not in a list (#121)
This is also happening with BLEU. I suggest keeping the latter issue as it is for now, and coming up with a well-defined API rather than adding another isinstance fix, which would add a third accepted form for the references argument.
By the way, this score is wrong as well: the official chrF script gives 7.57 instead of 45.5 for this.
I don't understand why this score is wrong. Does this mean CHRF was wrong in sacreBLEU this entire time? If yes, are there no unit / regression tests for CHRF?
Probably not at corpus level, because for that we had unit tests. But when I try the following for your example, I don't get 45.5.
The current code does smoothing by computing the effective order, which was introduced by @mjpost when adding chrF to sacreBLEU. I guess this was taken from the original chrF implementation, but I cannot confirm it (see below). When skimming through the original chrF paper (2015) and the chrF++ paper (2017) just now, I could not find anything about smoothing. The current chrF++ implementation does not compute the effective order; instead it does smoothing with the 1e-16 constant.
Yes, I didn't notice any kind of smoothing in chrF++ either. This explains why the numbers are the same in my current branch, where I synced things from chrF++. The only difference between chrF and chrF++ is the word n-gram component.
But there is the 1e-16 smoothing as I wrote above.
According to the papers yes. But we cannot be sure before inspecting the original chrF implementation, which I could not find anymore.
Perhaps stay in sync with chrF++, and with chrF as implemented in the GitHub chrF repo.
We can have the effective order configurable as well, similar to BLEU.
Since none of the other tools (i.e. chrF++, NLTK and Moses) apply effective-order smoothing, I am now slightly leaning towards dropping it and adopting the same approach as chrF++. This will change scores for sentence-level chrF with sacreBLEU >= 2.0.0, but I think it's important to be in sync with the other implementations. (The new scores should still correlate perfectly with the old ones, I presume.)
Yes, effective order is basically just another smoothing method. I asked Maja for her opinion on this; perhaps we get a reply before version 2.0.0 happens.
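For illustration, here is a minimal sketch contrasting the two smoothing strategies discussed above; this is a simplified stand-in, not the actual sacreBLEU, chrF++.py, or Moses code, and the helper name is made up:

```python
EPS = 1e-16

def average_f_score(precisions, recalls, beta=2.0, effective_order=False):
    """Average per-order F_beta scores.

    An entry of None in precisions/recalls means no n-grams of that order
    existed at all (e.g. for a very short sentence).
    """
    factor = beta ** 2
    f_scores = []
    for prec, rec in zip(precisions, recalls):
        if prec is None or rec is None:
            if effective_order:
                # Effective-order smoothing: simply skip empty orders.
                continue
            # Epsilon smoothing (chrF++.py style): count them as ~zero.
            prec, rec = EPS, EPS
        denom = factor * prec + rec
        f_scores.append((1 + factor) * prec * rec / denom if denom > 0 else 0.0)
    return sum(f_scores) / len(f_scores) if f_scores else 0.0

# Example: only orders 1-3 had any n-grams for this short sentence.
prec = [0.8, 0.5, 0.3, None, None, None]
rec = [0.7, 0.4, 0.2, None, None, None]
print(average_f_score(prec, rec, effective_order=True))   # averages 3 orders
print(average_f_score(prec, rec, effective_order=False))  # averages 6 orders
```

With epsilon smoothing, the empty higher orders drag the average towards zero; with effective order they are excluded, which yields higher sentence-level scores for very short segments, matching the 45.5 vs. 7.57 discrepancy discussed above.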
Returning to the original issue, my plan for >=2.0.0 is to disallow all kinds of obscure input type manipulations and expect the user to follow the type signatures.
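Under that plan, calls would be expected to match the annotated types exactly. A rough sketch of the intended shapes, assuming the compat functions keep their current names:

```python
import sacrebleu

hyps = ["Es wird mit Gesamtkosten von 18.4 Millionen Franken gerechnet."]
refs = ["Es werden Gesamtkosten von 18.4 Millionen Franken veranschlagt."]

# Corpus level: a list of hypotheses and a list of reference streams
# (here a single reference stream with one reference per segment).
sacrebleu.corpus_chrf(hyps, [refs])

# Sentence level: a single hypothesis string and a list of reference strings.
sacrebleu.sentence_chrf(hyps[0], refs)

# A bare string where a list is expected would raise an exception instead of
# being silently wrapped or iterated character by character.
```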
- Build: Add Windows and OS X testing to github workflow
- Improve documentation and type annotations.
- Drop `Python < 3.6` support and migrate to f-strings.
- Drop input type manipulation through `isinstance` checks. If the user does not obey to the expected annotations, exceptions will be raised. Robustness attempts lead to confusions and obfuscated score errors in the past (fixes #121)
- Use colored strings in tabular outputs (multi-system evaluation mode) through the help of `colorama` package.
- tokenizers: Add caching to tokenizers which seem to speed up things a bit.
- `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds for a particular test set evaluation. (fixes #46)
- Signature: Formatting changed (mostly to remove '+' separator as it was interfering with chrF++). The field separator is now '|' and key values are separated with ':' rather than '.'.
- Metrics: Scale all metrics into the [0, 100] range (fixes #140)
- BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
- BLEU: allow modifying max_ngram_order (fixes #156)
- CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
- CHRF: Added chrF+ support through `word_order` argument. Added test cases against chrF++.py. Exposed it through the CLI (--chrf-word-order) (fixes #124)
- CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing). This way, the scores obtained are exactly the same as chrF++, Moses and NLTK implementations. We keep the effective ordering as the default for compatibility, since this only affects sentence-level scoring with very short sentences. (fixes #144)
- CLI: Allow modifying TER arguments through CLI. We still keep the TERCOM defaults.
- CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
- CLI: Added `--format/-f` flag. The single-system output mode is now `json` by default. If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your shell.
- CLI: sacreBLEU now supports evaluating multiple systems for a given test set in an efficient way. Through the use of `tabulate` package, the results are nicely rendered into a plain text table, LaTeX, HTML or RST (cf. --format/-f argument). The systems can be either given as a list of plain text files to `-i/--input` or as a tab-separated single stream redirected into `STDIN`. In the former case, the basenames of the files will be automatically used as system names.
- Statistical tests: sacreBLEU now supports confidence interval estimation through bootstrap resampling for single-system evaluation (`--confidence` flag) as well as paired bootstrap resampling (`--paired-bs`) and paired approximate randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40 and fixes #78).
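To illustrate the chrF-related entries above, a small sketch of the object-oriented API with the new arguments (the exact import path and defaults are assumptions based on the notes above):

```python
from sacrebleu.metrics import CHRF

hyps = ["Es wird mit Gesamtkosten von 18.4 Millionen Franken gerechnet."]
refs = [["Es werden Gesamtkosten von 18.4 Millionen Franken veranschlagt."]]

# chrF++-style scoring: word_order=2 adds word n-grams on top of the
# character n-grams, and eps_smoothing=True matches chrF++.py / Moses / NLTK
# instead of the default effective-order smoothing.
chrf_pp = CHRF(word_order=2, eps_smoothing=True)
print(chrf_pp.corpus_score(hyps, refs))
```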
Is this chrF score error in sacreBLEU still present?
No, the error was fixed in 2.0.0, as indicated above by the closing of this issue.
Hi Matt, all,

While trying to debug an issue related to sentence-level CHRF, I noticed that the compat function sentence_chrf expects a list of references as input, but happily accepts a single string (a list of characters) and then only uses the first character as the reference. While the sentence_score function for BLEU intercepts strings instead of lists of strings (sacrebleu/sacrebleu/metrics/bleu.py, line 227 in 65747b6), the one for CHRF does not (sacrebleu/sacrebleu/metrics/chrf.py, line 125 in 65747b6).

Could this perhaps be changed / standardized for all metrics?
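A minimal sketch of the misuse described above, assuming a pre-2.0 sacreBLEU where the compat function still accepts the string silently (no outputs shown; the point is only the accepted argument shapes):

```python
import sacrebleu

hyp = "Es wird mit Gesamtkosten von 18.4 Millionen Franken gerechnet."
ref = "Es werden Gesamtkosten von 18.4 Millionen Franken veranschlagt."

# Intended usage: references passed as a list of strings.
sacrebleu.sentence_chrf(hyp, [ref])

# The problem described above: a bare string is accepted without complaint,
# but it is iterated like a list, so effectively only its first character
# acts as the reference and the resulting score is meaningless.
sacrebleu.sentence_chrf(hyp, ref)
```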