Sentence CHRF silently accepts single reference as string · Issue #121 · mjpost/sacrebleu · GitHub

Sentence CHRF silently accepts single reference as string #121

Closed
bricksdont opened this issue Oct 22, 2020 · 26 comments · Fixed by #152

@bricksdont
bricksdont commented Oct 22, 2020

Hi Matt, all

While trying to debug an issue related to sentence-level CHRF, I noticed that the compat function sentence_chrf expects a list of references as input but happily accepts a single string (a sequence of characters), and then only uses the first character as the reference:

>>> from sacrebleu import sentence_chrf
>>> sentence_chrf("hi there", "hi there")
chrF2 = 0.455
>>> sentence_chrf("hi there", "h")
chrF2 = 0.455

While the sentence_score function for BLEU intercepts a string passed instead of a list of strings:

assert not isinstance(references, str), "sentence_score needs a list of references, not a single string"

the one for CHRF does not:

def sentence_score(self, hypothesis: str, references: List[str]) -> CHRFScore:

Could this perhaps be changed / standardized for all metrics?
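
For illustration, a minimal standalone guard in the spirit of the BLEU assert above might look like the sketch below (the helper name check_references is hypothetical, not part of sacrebleu):

from typing import Sequence

def check_references(references: Sequence[str]) -> None:
    # Reject a bare string so it is not silently iterated character by character.
    if isinstance(references, str):
        raise TypeError("expected a sequence of reference strings, got a single string")
    if not all(isinstance(ref, str) for ref in references):
        raise TypeError("every reference must be a string")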

@ozancaglayan
Collaborator

Thanks for reporting this. The PR synchronizes the behaviors across metrics.

@bricksdont
Author
bricksdont commented Oct 22, 2020

Thanks!

A similar issue, I think, is that CHRF corpus_score expects List[Iterable[str]] as input but silently accepts List[str], which will fail in zip_longest with a confusing error message. In the edge case where the number of hypothesis sentences equals the number of references, it might not even fail at all and could produce erroneous results. (Edit: I do not think that last part is true, sorry!)

def corpus_score(self, sys_stream: Union[str, Iterable[str]],

If you agree, perhaps you could change this as well?

bricksdont changed the title Sentence CHRF silently accepts single reference Sentence CHRF silently accepts single reference as string Oct 22, 2020
@ozancaglayan
Collaborator
ozancaglayan commented Oct 26, 2020

I did not understand the latter issue. The signature of corpus_chrf was synchronized with BLEU's so that they share a common signature. Could you elaborate?

I agree that having both a compat API and an object-oriented API at the same time is somewhat confusing; the latter should be supported with more examples/docs and maybe unit tests to uncover potential flaws and argument inconsistencies.

@bricksdont
Author
bricksdont commented Oct 26, 2020

In this function, there is already code to "add some robustness to the input arguments":

# Add some robustness to the input arguments

The code intercepts strings as inputs to this function and makes them into lists. Could it also intercept / give an error message if the references argument is a simple list List[str] instead of the expected List[List[str]]?

If the input is List[str], i.e. a single reference where the user forgot to wrap it in an extra [], then this function will zip the hypothesis sentences with the individual characters of the reference string, and the error message in zip_longest will be confusing (the streams were of different lengths, or similar).

Is that perhaps something worth changing as well?

Let me know if this is more clear.
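
To make the failure mode concrete, here is a standalone sketch (not sacrebleu code): iterating a bare string yields characters, so zip_longest pairs one hypothesis with many one-character "sentences".

from itertools import zip_longest

hypotheses = ["hi there"]       # the hypothesis stream: one sentence
ref_stream = "hi there friend"  # accidentally a bare string instead of a list of sentences

# Iterating a string yields its characters, so the reference "stream" looks like
# 15 one-character sentences next to a single hypothesis sentence.
pairs = list(zip_longest(hypotheses, ref_stream))
print(len(pairs))  # 15, not 1, hence the confusing "different lengths" error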

@bricksdont
Author

Suggestion:

if isinstance(ref_streams[0], str):
        ref_streams = [ref_streams]

@ozancaglayan
Collaborator

Oh, I see. As I am not the original author of those "robustness" parts, I wasn't able to clearly enumerate myself which scenarios are handled by these isinstance checks. I would actually suggest removing all of this and selecting a single stable signature for the corpus_* methods, assuming that users follow the documentation.
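
For illustration, such a single stable signature might look like the sketch below (type names only; the actual API may end up different):

from typing import Sequence

def corpus_score(self, hypotheses: Sequence[str],
                 references: Sequence[Sequence[str]]) -> "Score":
    # hypotheses: one string per segment.
    # references: one stream per reference set, each with one string per segment.
    ...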

@bricksdont
Author

Agreed, I totally understand your reasoning.

@ozancaglayan
Collaborator

You suggest the latter for all metrics, right?

@bricksdont
Author
bricksdont commented Nov 24, 2020

Sorry, I somehow did not read your post a month ago closely enough (= my answer was a bit strange).

In principle yes, but I think that would be a problem for backwards compatibility: it would not so much break existing code as keep working while silently producing wrong results.

Edit: ah, you probably mean only the corpus_* methods? There, removing the isinstance checks would perhaps throw errors in existing code that relies on them. SacreBLEU does have the concept of backwards-incompatible major versions, right? If yes, then let's remove those checks.

@ozancaglayan
Collaborator

Not removing, but didn't you suggest adding the following as a third check, to fix the issue?

if isinstance(ref_streams[0], str):
        ref_streams = [ref_streams]

@bricksdont
Author

Hm, I don't think I understand, sorry :(

I thought you were saying that you would not want to add another of those checks, but rather explain to users how to use the functions correctly?

(Yes, I did suggest this check for corpus chrF, and I just checked: it works the same way for all metrics.)

@ozancaglayan
Collaborator

Sorry, as a quick fix I wanted to incorporate your suggestion rather than the API change I previously suggested. Could you also paste a small example that triggers the current issue, so that we can add it to the tests?

@bricksdont
Author

Sure, something like this?

>>> import sacrebleu
>>> hyp = "Es wird mit Gesamtkosten von 18.4 Millionen Franken gerechnet."
>>> ref = "Es werden Gesamtkosten von 18.4 Millionen Franken veranschlagt."
# correct
>>> sacrebleu.corpus_chrf(hyp, [[ref]])
chrF2 = 0.678
# incorrect, but there is an isinstance check for this
>>> sacrebleu.corpus_chrf(hyp, ref)
chrF2 = 0.678
# no check for this situation
>>> sacrebleu.corpus_chrf(hyp, [ref])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3-cpu/lib/python3.6/site-packages/sacrebleu/compat.py", line 103, in corpus_chrf
    return metric.corpus_score(hypotheses, references)
  File "/net/cephfs/scratch/mathmu/map-volatility/venvs/sockeye3-cpu/lib/python3.6/site-packages/sacrebleu/metrics/chrf.py", line 158, in corpus_score
    raise EOFError("Source and reference streams have different lengths!")
EOFError: Source and reference streams have different lengths!

ozancaglayan added a commit that referenced this issue Nov 25, 2020
- pin mecab version to 1.0.3 for Python 3.5 support
- assert if refs are not in a list (#121)
@ozancaglayan
Collaborator

This is also happening with BLEU. I suggest keeping the latter issue as it is for now and coming up with a well-defined API, rather than adding another isinstance fix, which would introduce a third possibility for the type of the ref_streams argument.

@ozancaglayan
Collaborator
ozancaglayan commented Feb 26, 2021

By the way,

>>> sentence_chrf("hi there", ["h"])
chrF2 = 0.455

This score is wrong as well. The official chrF script gives 7.57 instead of 45.5 for this. In the refactor2021 branch the numbers match, probably because I changed how the chrF statistics are computed by syncing the behavior with the chrF++.py script.

@bricksdont
Author
bricksdont commented Feb 26, 2021

I don't understand why this score is wrong. Does this mean CHRF was wrong in SacreBLEU this entire time? If yes, are there no unit / regression tests for CHRF?

@ozancaglayan
Collaborator

Probably not at corpus level, because for that we had unit tests. But when I try the following for your example, I don't get 45.5:

git clone https://github.com/m-popovic/chrF.git
echo 'hi there' > h
echo 'h' > r
python chrF/chrF++.py -H h -R r -nw 0 -s

@martinpopel
Collaborator

The current code does smoothing by computing the effective order, which was introduced by @mjpost when adding chrF to sacreBLEU. I guess this was taken from the original chrF implementation, but I cannot confirm it (see below).

When now skimming through the original chrF paper (2015) and the chrF++ paper (2017), I could not find anything about smoothing.

- The current chrF++ implementation does not compute the effective order; instead it does smoothing with the 1e-16 constant.
- The chrF++ code seems to simulate the original chrF metric with -nw 0 (and chrF+ with -nw 1), but I am not sure whether this is the only difference between chrF and chrF++.
- Unfortunately, I was not able to find the original chrF implementation. I weakly remember @m-popovic uploaded the script on her old website (not on GitHub, nor any other versioning system), but I could not find that website anymore.
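
For concreteness, here is a rough standalone sketch of the two smoothing styles being contrasted (simplified; the real scripts differ in detail): the effective-order variant averages F-scores only over n-gram orders that actually occur in the sentence, while the chrF++-style variant averages over all orders and relies on a tiny constant to avoid division by zero.

def fscore(match, hyp_total, ref_total, beta=2.0, eps=1e-16):
    # Precision/recall with the chrF++-style epsilon guard against empty counts.
    prec = match / hyp_total if hyp_total else eps
    rec = match / ref_total if ref_total else eps
    denom = beta ** 2 * prec + rec
    return (1 + beta ** 2) * prec * rec / denom if denom else eps

# stats: one (match, hyp_total, ref_total) tuple per n-gram order
def chrf_effective_order(stats):
    # Average only over orders that actually have n-grams (the "effective order").
    scores = [fscore(m, h, r) for (m, h, r) in stats if h > 0 or r > 0]
    return sum(scores) / len(scores) if scores else 0.0

def chrf_eps_smoothing(stats):
    # Average over all orders; empty orders contribute roughly zero via the epsilon.
    return sum(fscore(m, h, r) for (m, h, r) in stats) / len(stats)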

@ozancaglayan
Collaborator

Yes, I didn't notice any kind of smoothing in chrF++ either. This explains why the numbers are the same in my current branch, where I synced things from chrF++.

The only difference between chrF and chrF++ is -nw being 2 in the latter.
I don't know what would be the correct way of proceeding here.

@martinpopel
Collaborator

> I didn't notice any kind of smoothing either in the chrF++

But there is the 1e-16 smoothing, as I wrote above.
The same is used in NLTK; Moses uses something different (and the default beta there is 3 instead of 2).

> The only difference between chrF and chrF++ is -nw being 2 in the latter.

According to the papers, yes. But we cannot be sure before inspecting the original chrF implementation, which I could not find anymore.

> I don't know what would be the correct way of proceeding here.

Perhaps stay in sync with chrF++ and with chrF as implemented in the GitHub repo chrF.

@ozancaglayan
Collaborator

We could make the effective order configurable as well, similar to BLEU.

@ozancaglayan
Collaborator

Since none of the other tools (i.e. chrF++, NLTK and Moses) apply effective order smoothing, I am now slightly leaning towards dropping it and adopting the same approach as chrF++. This will change sentence-level chrF scores with sacreBLEU >= 2.0.0, but I think it's important to be in sync with the other implementations. (The new scores should still correlate perfectly with the old ones, I presume.)

@bricksdont
Author

Yes, effective order is basically just another smoothing method. I asked Maja for her opinion on this; perhaps we will get a reply before version 2.0.0 happens.

@ozancaglayan
Collaborator

Returning to the original issue, my plan for >= 2.0.0 is to disallow all the obscure input type manipulations and expect the user to follow the type signatures.

In [12]: sentence_chrf("hi there", "hi there")
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-12-91fdfab4b6df> in <module>
----> 1 sentence_chrf("hi there", "hi there")

~/git/sacrebleu/sacrebleu/compat.py in sentence_chrf(hypothesis, references, char_order, word_order, beta, remove_whitespace)
    135         beta=beta,
    136         whitespace=not remove_whitespace)
--> 137     return metric.sentence_score(hypothesis, references)
    138 
    139 

~/git/sacrebleu/sacrebleu/metrics/chrf.py in sentence_score(self, hypothesis, references)
    264         :return: A `CHRFScore` object.
    265         """
--> 266         return super().sentence_score(hypothesis, references)
    267 
    268     def corpus_score(self, hypotheses: Sequence[str],

~/git/sacrebleu/sacrebleu/metrics/base.py in sentence_score(self, hypothesis, references)
    149         :return: A `Score` object.
    150         """
--> 151         self._check_sentence_score_args(hypothesis, references)
    152 
    153         stats = self._extract_corpus_statistics(

~/git/sacrebleu/sacrebleu/metrics/base.py in _check_sentence_score_args(self, hyp, refs)
    107 
    108         if err_msg:
--> 109             raise RuntimeError(f'{prefix}: {err_msg}')
    110 
    111     def _check_corpus_score_args(self, hyps: Sequence[str],

RuntimeError: CHRF: The argument `refs` should be a sequence of strings.

@ozancaglayan ozancaglayan linked a pull request Mar 26, 2021 that will close this issue
ozancaglayan added a commit that referenced this issue Jul 18, 2021
  - Build: Add Windows and OS X testing to github workflow
  - Improve documentation and type annotations.
  - Drop `Python < 3.6` support and migrate to f-strings.
  - Drop input type manipulation through `isinstance` checks. If the user does not obey
    to the expected annotations, exceptions will be raised. Robustness attempts lead to
    confusions and obfuscated score errors in the past (fixes #121)
  - Use colored strings in tabular outputs (multi-system evaluation mode) through
    the help of `colorama` package.
  - tokenizers: Add caching to tokenizers which seem to speed up things a bit.
  - `intl` tokenizer: Use `regex` module. Speed goes from ~4 seconds to ~0.6 seconds
    for a particular test set evaluation. (fixes #46)
  - Signature: Formatting changed (mostly to remove '+' separator as it was
    interfering with chrF++). The field separator is now '|' and key values
    are separated with ':' rather than '.'.
  - Metrics: Scale all metrics into the [0, 100] range (fixes #140)
  - BLEU: In case of no n-gram matches at all, skip smoothing and return 0.0 BLEU (fixes #141).
  - BLEU: allow modifying max_ngram_order (fixes #156)
  - CHRF: Added multi-reference support, verified the scores against chrF++.py, added test case.
  - CHRF: Added chrF+ support through `word_order` argument. Added test cases against chrF++.py.
    Exposed it through the CLI (--chrf-word-order) (fixes #124)
  - CHRF: Add possibility to disable effective order smoothing (pass --chrf-eps-smoothing).
    This way, the scores obtained are exactly the same as chrF++, Moses and NLTK implementations.
    We keep the effective ordering as the default for compatibility, since this only
    affects sentence-level scoring with very short sentences. (fixes #144)
  - CLI: Allow modifying TER arguments through CLI. We still keep the TERCOM defaults.
  - CLI: Prefix metric-specific arguments with --chrf and --ter. To maintain compatibility, BLEU argument names are kept the same.
  - CLI: Added `--format/-f` flag. The single-system output mode is now `json` by default.
    If you want to keep the old text format persistently, you can export `SACREBLEU_FORMAT=text` into your
    shell.
  - CLI: sacreBLEU now supports evaluating multiple systems for a given test set
    in an efficient way. Through the use of `tabulate` package, the results are
    nicely rendered into a plain text table, LaTeX, HTML or RST (cf. --format/-f argument).
    The systems can be either given as a list of plain text files to `-i/--input` or
    as a tab-separated single stream redirected into `STDIN`. In the former case,
    the basenames of the files will be automatically used as system names.
  - Statistical tests: sacreBLEU now supports confidence interval estimation
    through bootstrap resampling for single-system evaluation (`--confidence` flag)
    as well as paired bootstrap resampling (`--paired-bs`) and paired approximate
    randomization tests (`--paired-ar`) when evaluating multiple systems (fixes #40 and fixes #78).
@RamoramaInteractive
RamoramaInteractive commented Apr 24, 2022

Is this chrF score error in sacreBLEU still present?

@martinpopel
Collaborator

No, the error was fixed in 2.0.0, as indicated above by closing this issue.
When you try sentence_chrf("hi there", "hi there"), it now correctly fails with RuntimeError: CHRF: The argument 'refs' should be a sequence of strings.
You need to pass a list of references (sentence_chrf("hi there", ["hi there"])).
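
For reference, a quick usage sketch of the call shapes the fixed API expects, assuming the 2.x signatures described above (scores omitted, since they depend on the sacrebleu version):

import sacrebleu

hyp = "hi there"
ref = "hi there"

# Sentence level: one hypothesis string and a list of reference strings.
print(sacrebleu.sentence_chrf(hyp, [ref]))

# Corpus level: a list of hypotheses and a list of reference streams,
# each stream holding one reference per hypothesis.
print(sacrebleu.corpus_chrf([hyp], [[ref]]))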
