Sentence CHRF silently accepts single reference as string · Issue #121 · mjpost/sacrebleu · GitHub

Sentence CHRF silently accepts single reference as string #121

Closed
bricksdont opened this issue Oct 22, 2020 · 26 comments · Fixed by #152

@bricksdont
bricksdont commented Oct 22, 2020

Hi Matt, all

While trying to debug an issue related to sentence-level CHRF, I noticed that the compat function sentence_chrf expects a list of references as input but happily accepts a single string (a sequence of characters), and then only uses the first character as the reference:

>>> from sacrebleu import sentence_chrf
>>> sentence_chrf("hi there", "hi there")
chrF2 = 0.455
>>> sentence_chrf("hi there", "h")
chrF2 = 0.455

While the sentence_score function for BLEU intercepts a string passed instead of a list of strings:

assert not isinstance(references, str), "sentence_score needs a list of references, not a single string"

the one for CHRF does not:

def sentence_score(self, hypothesis: str, references: List[str]) -> CHRFScore:

Could this perhaps be changed / standardized for all metrics?
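
For illustration, a minimal standalone guard in the spirit of the BLEU assert above might look like the sketch below (the helper name check_references is hypothetical, not part of sacrebleu):

from typing import Sequence

def check_references(references: Sequence[str]) -> None:
    # Reject a bare string so it is not silently iterated character by character.
    if isinstance(references, str):
        raise TypeError("expected a sequence of reference strings, got a single string")
    if not all(isinstance(ref, str) for ref in references):
        raise TypeError("every reference must be a string")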

@ozancaglayan
Collaborator

Thanks for reporting this. The PR synchronizes the behaviors across metrics.

@bricksdont
Author
bricksdont commented Oct 22, 2020

Thanks!

A similar issue, I think, is that CHRF corpus_score expects List[Iterable[str]] as input but silently accepts List[str], which will fail in zip_longest with a confusing error message. In the edge case where the number of hypothesis sentences equals the number of references, it might not even fail at all and could produce erroneous results. (Edit: I do not think that last part is true, sorry!)

def corpus_score(self, sys_stream: Union[str, Iterable[str]],

If you agree, perhaps you could change this as well?

bricksdont changed the title Sentence CHRF silently accepts single reference Sentence CHRF silently accepts single reference as string Oct 22, 2020
@ozancaglayan
Collaborator
ozancaglayan commented Oct 26, 2020

I did not understand the latter issue. The signature of corpus_chrf was synchronized with BLEU's so that they share a common signature. Could you elaborate?

I agree that having both a compat API and an object-oriented API at the same time is somewhat confusing; the latter should be supported with more examples/docs and maybe unit tests to uncover potential flaws and argument inconsistencies.

@bricksdont
Author
bricksdont commented Oct 26, 2020

In this function, there is already code to "add some robustness to the input arguments":

# Add some robustness to the input arguments

The code intercepts strings as inputs to this function and makes them into lists. Could it also intercept / give an error message if the references argument is a simple list List[str] instead of the expected List[List[str]]?

If the input is List[str], i.e. a single reference where the user forgot to wrap it in an extra [], then this function will zip the hypothesis sentences with the individual characters of the reference string, and the error message in zip_longest will be confusing (the streams were of different lengths, or similar).

Is that perhaps something worth changing as well?

Let me know if this is more clear.
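
To make the failure mode concrete, here is a standalone sketch (not sacrebleu code): iterating a bare string yields characters, so zip_longest pairs one hypothesis with many one-character "sentences".

from itertools import zip_longest

hypotheses = ["hi there"]       # the hypothesis stream: one sentence
ref_stream = "hi there friend"  # accidentally a bare string instead of a list of sentences

# Iterating a string yields its characters, so the reference "stream" looks like
# 15 one-character sentences next to a single hypothesis sentence.
pairs = list(zip_longest(hypotheses, ref_stream))
print(len(pairs))  # 15, not 1, hence the confusing "different lengths" error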

@bricksdont
Author

Suggestion:

if isinstance(ref_streams[0], str):
        ref_streams = [ref_streams]

@ozancaglayan
Collaborator

Oh, I see. As I am not the original author of those "robustness" parts, I wasn't able to clearly enumerate myself which scenarios are handled by these isinstance checks. I would actually suggest removing all of this and selecting a single stable signature for the corpus_* methods, assuming that users follow the documentation.
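
For illustration, such a single stable signature might look like the sketch below (type names only; the actual API may end up different):

from typing import Sequence

def corpus_score(self, hypotheses: Sequence[str],
                 references: Sequence[Sequence[str]]) -> "Score":
    # hypotheses: one string per segment.
    # references: one stream per reference set, each with one string per segment.
    ...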

@bricksdont
Author

Agreed, I totally understand your reasoning.

@ozancaglayan
Collaborator

You suggest the latter for all metrics, right?

@bricksdont
Author
bricksdont commented Nov 24, 2020

Sorry, I somehow did not read your post a month ago closely enough (= my answer was a bit strange).

In principle yes, but I think that would be a problem for backwards compatibility: it would not so much break existing code as keep working while silently producing wrong results.

Edit: ah, you probably mean only the corpus_* methods? There, removing the isinstance checks would perhaps throw errors in existing code that relies on them. SacreBLEU does have the concept of backwards-incompatible major versions, right? If yes, then let's remove those checks.

@ozancaglayan
Collaborator

Not removing, but didn't you suggest adding the following as a third check, to fix the issue?

if isinstance(ref_streams[0], str):
        ref_streams = [ref_streams]

@bricksdont
Author

Hm, I don't think I understand, sorry :(

I thought you were saying that you would not want to add another of those checks, but rather explain to users how to use the functions correctly?

(Yes, I did suggest this check for corpus chrF, and I just checked: it works the same way for all metrics.)

@ozancaglayan
Collaborator

Sorry, as a quick fix I wanted to incorporate your suggestion rather than the API change I previously suggested. Could you also paste a small example that triggers the current issue, so that we can add it to the tests?

@bricksdont
Author

Sure, something like this?

>>> import sacrebleu
>>> hyp = "Es wird mit Gesamtkosten von 18.4 Millionen Franken gerechnet."
>>> ref = "Es werden Gesamtkosten von 18.4 Millionen Franken veranschlagt."
# correct
>>> sacrebleu.corpus_chrf(hyp, [[ref]])
chrF2 = 0.678
# incorrect, but there is an isinstance check for this
>>> sacrebleu.corpus_chrf(hyp, ref)
chrF2 = 0.678
# no check for this situation
>>> sacrebleu.corpus_chrf(hyp, [ref])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3-cpu/lib/python3.6/site-packages/sacrebleu/compat.py", line 103, in corpus_chrf
    return metric.corpus_score(hypotheses, references)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3-cpu/lib/python3.6/site-packages/sacrebleu/metrics/chrf.py", line 158, in corpus_score
    raise EOFError("Source and reference streams have different lengths!")
EOFError: Source and reference streams have different lengths!

ozancaglayan added a commit that referenced this issue Nov 25, 2020
- pin mecab version to 1.0.3 for Python 3.5 support
- assert if refs are not in a list (#121)
@ozancaglayan
Collaborator

This is also happening with BLEU. I suggest keeping the latter issue as it is for now and coming up with a well-defined API, rather than adding another isinstance fix, which would introduce a third possibility for the type of the ref_streams argument.

@ozancaglayan
Collaborator
ozancaglayan commented Feb 26, 2021

By the way,

>>> sentence_chrf("hi there", ["h"])
chrF2 = 0.455

This score is wrong as well. The official chrF script gives 7.57 instead of 45.5 for this. In the refactor2021 branch the numbers match, probably because I changed how the chrF statistics are computed by syncing the behavior with the chrF++.py script.

@bricksdont
Author
bricksdont commented Feb 26, 2021

I don't understand why this score is wrong. Does this mean CHRF was wrong in SacreBLEU this entire time? If yes, are there no unit / regression tests for CHRF?

@ozancaglayan
Collaborator

Probably not at corpus level, because for that we had unit tests. But when I try the following for your example, I don't get 45.5:

git clone https://github.com/m-popovic/chrF.git
echo 'hi there' > h
echo 'h' > r
python chrF/chrF++.py -H h -R r -nw 0 -s

@martinpopel
Collaborator

The current code does smoothing by computing the effective order, which was introduced by @mjpost when adding chrF to sacreBLEU. I guess this was taken from the original chrF implementation, but I cannot confirm it (see below).

When now skimming through the original chrF paper (2015) and the chrF++ paper (2017), I could not find anything about smoothing.

- The current chrF++ implementation does not compute the effective order; instead it does smoothing with the 1e-16 constant.
- The chrF++ code seems to simulate the original chrF metric with -nw 0 (and chrF+ with -nw 1), but I am not sure whether this is the only difference between chrF and chrF++.
- Unfortunately, I was not able to find the original chrF implementation. I weakly remember @m-popovic uploaded the script on her old website (not on GitHub, nor any other versioning system), but I could not find that website anymore.
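
For concreteness, here is a rough standalone sketch of the two smoothing styles being contrasted (simplified; the real scripts differ in detail): the effective-order variant averages F-scores only over n-gram orders that actually occur in the sentence, while the chrF++-style variant averages over all orders and relies on a tiny constant to avoid division by zero.

def fscore(match, hyp_total, ref_total, beta=2.0, eps=1e-16):
    # Precision/recall with the chrF++-style epsilon guard against empty counts.
    prec = match / hyp_total if hyp_total else eps
    rec = match / ref_total if ref_total else eps
    denom = beta ** 2 * prec + rec
    return (1 + beta ** 2) * prec * rec / denom if denom else eps

# stats: one (match, hyp_total, ref_total) tuple per n-gram order
def chrf_effective_order(stats):
    # Average only over orders that actually have n-grams (the "effective order").
    scores = [fscore(m, h, r) for (m, h, r) in stats if h > 0 or r > 0]
    return sum(scores) / len(scores) if scores else 0.0

def chrf_eps_smoothing(stats):
    # Average over all orders; empty orders contribute roughly zero via the epsilon.
    return sum(fscore(m, h, r) for (m, h, r) in stats) / len(stats)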

@ozancaglayan
Collaborator

Yes, I didn't notice any kind of smoothing in chrF++ either. This explains why the numbers are the same in my current branch, where I synced things from chrF++.

The only difference between chrF and chrF++ is -nw being 2 in the latter.
I don't know what would be the correct way of proceeding here.

@martinpopel
Collaborator

> I didn't notice any kind of smoothing either in the chrF++

But there is the 1e-16 smoothing, as I wrote above.
The same is used in NLTK; Moses uses something different (and the default beta there is 3 instead of 2).

> The only difference between chrF and chrF++ is -nw being 2 in the latter.

According to the papers, yes. But we cannot be sure before inspecting the original chrF implementation, which I could not find anymore.

> I don't know what would be the correct way of proceeding here.

Perhaps stay in sync with chrF++ and with chrF as implemented in the GitHub repo chrF.

@ozancaglayan
Collaborator

We could make the effective order configurable as well, similar to BLEU.

@ozancaglayan
Collaborator

Since none of the other tools (i.e. chrF++, NLTK and Moses) apply effective order smoothing, I am now slightly leaning towards dropping it and adopting the same approach as chrF++. This will change sentence-level chrF scores with sacreBLEU >= 2.0.0, but I think it's important to be in sync with the other implementations. (The new scores should still correlate perfectly with the old ones, I presume.)

@bricksdont
Author

Yes, effective order is basically just another smoothing method. I asked Maja for her opinion on this; perhaps we will get a reply before version 2.0.0 happens.

@ozancaglayan
Collaborator

Returning to the original issue, my plan for >= 2.0.0 is to disallow all the obscure input type manipulations and expect the user to follow the type signatures.

In [12]: sentence_chrf("hi there", "hi there")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-12-91fdfab4b6df> in <module>
----> 1 sentence_chrf("hi there", "hi there")

~/git/sacrebleu/sacrebleu/compat.py in sentence_chrf(hypothesis, references, char_order, word_order, beta, remove_whitespace)
    135         beta=beta,
    136         whitespace=not remove_whitespace)
--> 137     return metric.sentence_score(hypothesis, references)
    138 
    139 

~/git/sacrebleu/sacrebleu/metrics/chrf.py in sentence_score(self, hypothesis, references)
    264         :return: A `CHRFScore` object.
    265         """
--> 266         return super().sentence_score(hypothesis, references)
    267 
    268     def corpus_score(self, hypotheses: Sequence[str],

~/git/sacrebleu/sacrebleu/metrics/base.py in sentence_score(self, hypothesis, references)
    149         :return: A `Score` object.
    150         """
--> 151         self._check_sentence_score_args(hypothesis, references)
    152 
    153         stats = self._extract_corpus_statistics(

~/git/sacrebleu/sacrebleu/metrics/base.py in _check_sentence_score_args(self, hyp, refs)
    107 
    108         if err_msg:
--> 109             raise RuntimeError(f'{prefix}: {err_msg}')
    110 
    111     def _check_corpus_score_args(self, hyps: Sequence[str],

RuntimeError: CHRF: The argument `refs` should be a sequence of strings.

@ozancaglayan ozancaglayan linked a pull request Mar 26, 2021 that will close this issue
ozancaglayan added a commit that referenced this issue Jul 18, 2021
  - Build: Add Windows and OS X testing to github workflow
  - Improve documentation and type annotations.
  - Drop `Python < 3.6` support and migrate to f-strings.
  - Drop input type manipulation through `isinstance` checks. If the user does not obey
    to the expected annotations, exceptions will be raised. Robustness attempts lead to
    confusions and obfuscated score errors in the past (fixes #121)
  - Use colored strings in tabular outputs (multi-system evaluation mode) through
    the help of `colorama` package.
  - tokenizers: Add caching to tokenizers which seem to speed up things a bit.
  - `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds
    for a particular test set evaluation. (fixes #46)
  - Signature: Formatting changed (mostly to remove '+' separator as it was
    interfering with chrF++). The field separator is now '|' and key values
    are separated with ':' rather than '.'.
  - Metrics: Scale all metrics into the [0, 100] range (fixes #140)
  - BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
  - BLEU: allow modifying max_ngram_order (fixes #156)
  - CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
  - CHRF: Added chrF+ support through `word_order` argument. Added test cases against chrF++.py.
    Exposed it through the CLI (--chrf-word-order) (fixes #124)
  - CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing).
    This way, the scores obtained are exactly the same as chrF++, Moses and NLTK implementations.
    We keep the effective ordering as the default for compatibility, since this only
    affects sentence-level scoring with very short sentences. (fixes #144)
  - CLI: Allow modifying TER arguments through CLI. We still keep the TERCOM defaults.
  - CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
  - CLI: Added `--format/-f` flag. The single-system output mode is now `json` by default.
    If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your
    shell.
  - CLI: sacreBLEU now supports evaluating multiple systems for a given test set
    in an efficient way. Through the use of `tabulate` package, the results are
    nicely rendered into a plain text table, LaTeX, HTML or RST (cf. --format/-f argument).
    The systems can be either given as a list of plain text files to `-i/--input` or
    as a tab-separated single stream redirected into `STDIN`. In the former case,
    the basenames of the files will be automatically used as system names.
  - Statistical tests: sacreBLEU now supports confidence interval estimation
    through bootstrap resampling for single-system evaluation (`--confidence` flag)
    as well as paired bootstrap resampling (`--paired-bs`) and paired approximate
    randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40 and fixes #78).
@RamoramaInteractive
RamoramaInteractive commented Apr 24, 2022

Is this chrF score error in sacreBLEU still present?

@martinpopel
Collaborator

No, the error was fixed in 2.0.0, as indicated above by closing this issue.
When you try sentence_chrf("hi there", "hi there"), it now correctly fails with RuntimeError: CHRF: The argument 'refs' should be a sequence of strings.
You need to pass a list of references (sentence_chrf("hi there", ["hi there"])).
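
For reference, a quick usage sketch of the call shapes the fixed API expects, assuming the 2.x signatures described above (scores omitted, since they depend on the sacrebleu version):

import sacrebleu

hyp = "hi there"
ref = "hi there"

# Sentence level: one hypothesis string and a list of reference strings.
print(sacrebleu.sentence_chrf(hyp, [ref]))

# Corpus level: a list of hypotheses and a list of reference streams,
# each stream holding one reference per hypothesis.
print(sacrebleu.corpus_chrf([hyp], [[ref]]))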
