Allow empty replacement strings in contrib/unaccent. · postgrespro/postgres@97c40ce · GitHub
    Allow empty replacement strings in contrib/unaccent.
    This is useful in languages where diacritic signs are represented as
    separate characters; it's also one step towards letting unaccent be used
    for arbitrary substring substitutions.

    In passing, improve the user documentation for unaccent, which was sadly
    vague about some important details.

    Mohammad Alhashash, reviewed by Abhijit Menon-Sen
    1 parent 5586327 commit 97c40ce

    2 files changed: +54 -11 lines changed


    contrib/unaccent/unaccent.c

    Lines changed: 23 additions & 6 deletions
    @@ -104,11 +104,21 @@ initTrie(char *filename)
     
             while ((line = tsearch_readline(&trst)) != NULL)
             {
    -            /*
    -             * The format of each line must be "src trg" where src and trg
    -             * are sequences of one or more non-whitespace characters,
    -             * separated by whitespace. Whitespace at start or end of
    -             * line is ignored.
    +            /*----------
    +             * The format of each line must be "src" or "src trg", where
    +             * src and trg are sequences of one or more non-whitespace
    +             * characters, separated by whitespace. Whitespace at start
    +             * or end of line is ignored. If trg is omitted, an empty
    +             * string is used as the replacement.
    +             *
    +             * We use a simple state machine, with states
    +             *    0   initial (before src)
    +             *    1   in src
    +             *    2   in whitespace after src
    +             *    3   in trg
    +             *    4   in whitespace after trg
    +             *    -1  syntax error detected (line will be ignored)
    +             *----------
                  */
                 int state;
                 char *ptr;
    @@ -160,7 +170,14 @@ initTrie(char *filename)
                     }
                 }
     
    -            if (state >= 3)
    +            if (state == 1 || state == 2)
    +            {
    +                /* trg was omitted, so use "" */
    +                trg = "";
    +                trglen = 0;
    +            }
    +
    +            if (state > 0)
                     rootTrie = placeChar(rootTrie,
                                          (unsigned char *) src, srclen,
                                          trg, trglen);
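
The state machine described in the new comment can be sketched as a standalone function. This is an illustrative reconstruction, not the code from unaccent.c: the `Rule` struct and the `parse_rule_line` name are hypothetical, and the sketch walks the line byte by byte with `isspace()`, which is an assumption made to keep it self-contained (the real loop body is not shown in this diff and operates on database-encoded text).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result type; the diff itself hands src/trg to placeChar(). */
typedef struct
{
    const char *src;     /* start of the source token within the line */
    size_t      srclen;  /* length of the source token in bytes */
    const char *trg;     /* start of the replacement token, or "" */
    size_t      trglen;  /* length of the replacement in bytes, may be 0 */
} Rule;

/*
 * Parse one rules-file line of the form "src" or "src trg".
 * Returns 1 and fills *rule on success, 0 for blank or malformed lines.
 *
 * States, as listed in the commit's comment:
 *   0 initial, 1 in src, 2 whitespace after src,
 *   3 in trg, 4 whitespace after trg, -1 syntax error.
 */
int
parse_rule_line(const char *line, Rule *rule)
{
    int         state = 0;
    const char *ptr;

    rule->src = rule->trg = "";
    rule->srclen = rule->trglen = 0;

    for (ptr = line; *ptr != '\0'; ptr++)
    {
        /* Whitespace never starts a token; it only ends src or trg. */
        if (isspace((unsigned char) *ptr))
        {
            if (state == 1)
                state = 2;
            else if (state == 3)
                state = 4;
            continue;
        }

        switch (state)
        {
            case 0:             /* first byte of src */
                rule->src = ptr;
                rule->srclen = 1;
                state = 1;
                break;
            case 1:             /* continue src */
                rule->srclen++;
                break;
            case 2:             /* first byte of trg */
                rule->trg = ptr;
                rule->trglen = 1;
                state = 3;
                break;
            case 3:             /* continue trg */
                rule->trglen++;
                break;
            default:            /* a third token, or already in error */
                state = -1;
                break;
        }
    }

    /* trg was omitted, so the replacement is the empty string. */
    if (state == 1 || state == 2)
    {
        rule->trg = "";
        rule->trglen = 0;
    }

    return state > 0;
}
```

With this sketch, a line holding only a source token yields a rule with `trglen == 0`, which mirrors the change from `state >= 3` to `state > 0` in the diff: before the commit such a line ended in state 1 or 2 and was dropped. A line with a third token ends in state -1 and is rejected.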

    doc/src/sgml/unaccent.sgml

    Lines changed: 31 additions & 5 deletions
    @@ -45,9 +45,9 @@
        <itemizedlist>
         <listitem>
          <para>
    -      Each line represents a pair, consisting of a character with accent
    -      followed by a character without accent. The first is translated into
    -      the second. For example,
    +      Each line represents one translation rule, consisting of a character with
    +      accent followed by a character without accent. The first is translated
    +      into the second. For example,
     <programlisting>
     &Agrave; A
     &Aacute; A
    @@ -57,6 +57,27 @@
     &Aring; A
     &AElig; A
     </programlisting>
    +      The two characters must be separated by whitespace, and any leading or
    +      trailing whitespace on a line is ignored.
    +     </para>
    +    </listitem>
    +
    +    <listitem>
    +     <para>
    +      Alternatively, if only one character is given on a line, instances of
    +      that character are deleted; this is useful in languages where accents
    +      are represented by separate characters.
    +     </para>
    +    </listitem>
    +
    +    <listitem>
    +     <para>
    +      As with other <productname>PostgreSQL</> text search configuration files,
    +      the rules file must be stored in UTF-8 encoding. The data is
    +      automatically translated into the current database's encoding when
    +      loaded. Any lines containing untranslatable characters are silently
    +      ignored, so that rules files can contain rules that are not applicable in
    +      the current encoding.
          </para>
         </listitem>
        </itemizedlist>
    @@ -132,8 +153,8 @@ mydb=# select ts_headline('fr','H&ocirc;tel de la Mer',to_tsquery('fr','Hotels')
     
       <para>
        The <function>unaccent()</> function removes accents (diacritic signs) from
    -   a given string. Basically, it's a wrapper around the
    -   <filename>unaccent</> dictionary, but it can be used outside normal
    +   a given string. Basically, it's a wrapper around
    +   <filename>unaccent</>-type dictionaries, but it can be used outside normal
        text search contexts.
       </para>
     
    @@ -145,6 +166,11 @@ mydb=# select ts_headline('fr','H&ocirc;tel de la Mer',to_tsquery('fr','Hotels')
     unaccent(<optional><replaceable class="PARAMETER">dictionary</replaceable>, </optional> <replaceable class="PARAMETER">string</replaceable>) returns <type>text</type>
     </synopsis>
     
    +  <para>
    +   If the <replaceable class="PARAMETER">dictionary</replaceable> argument is
    +   omitted, <literal>unaccent</> is assumed.
    +  </para>
    +
       <para>
        For example:
     <programlisting>
