Feature/escape unicode control chars by cpjulia · Pull Request #14805 · arangodb/arangodb · GitHub

Feature/escape unicode control chars #14805

Merged 27 commits on Sep 30, 2021
Commits
4bfa630
Added parser for retaining or escaping control and unicode characters…
cpjulia Sep 18, 2021
600deca
Merge branch 'devel' of https://github.com/arangodb/arangodb into fea…
cpjulia Sep 20, 2021
feb58b7
Added unicode escaping for 4 bytes representation, parsing for broken…
cpjulia Sep 20, 2021
616265e
Merge branch 'devel' of https://github.com/arangodb/arangodb into fea…
cpjulia Sep 20, 2021
f48f87a
Added more tests
cpjulia Sep 20, 2021
4cc1e64
Removed unused functions, updated CHANGELOG, removed unused include i…
cpjulia Sep 20, 2021
2448e9c
Resolved CHANGELOG conflict from merge with devel
cpjulia Sep 20, 2021
3c29365
Update tests/Logger/EscaperTest.cpp
cpjulia Sep 20, 2021
bfa05eb
Update lib/Logger/LoggerFeature.cpp
cpjulia Sep 20, 2021
0ca0d25
Update lib/Logger/LoggerFeature.h
cpjulia Sep 20, 2021
39f97e1
Update lib/Logger/Escaper.h
cpjulia Sep 20, 2021
7a3480f
Update lib/Logger/Escaper.cpp
cpjulia Sep 20, 2021
963921b
Update CHANGELOG
cpjulia Sep 20, 2021
015a188
Update CHANGELOG
cpjulia Sep 21, 2021
a007926
Update CHANGELOG
cpjulia Sep 21, 2021
2f8f3a2
Updated CHANGELOG
cpjulia Sep 21, 2021
3067a5a
Added more tests, updated CHANGELOG
cpjulia Sep 21, 2021
482e186
Update tests/Logger/EscaperTest.cpp
cpjulia Sep 22, 2021
0ea373f
Update tests/Logger/EscaperTest.cpp
cpjulia Sep 22, 2021
c9ad897
Update CHANGELOG
cpjulia Sep 22, 2021
832f569
Update CHANGELOG
cpjulia Sep 22, 2021
7950a3f
Update CHANGELOG
cpjulia Sep 22, 2021
7818ca2
Merge branch 'devel' of github.com:arangodb/arangodb into feature/esc…
jsteemann Sep 22, 2021
a3d9738
Merge branch 'devel' into feature/escape-unicode-control-chars
cpjulia Sep 27, 2021
d9385b1
Merge branch 'devel' of https://github.com/arangodb/arangodb into fea…
cpjulia Sep 29, 2021
432a737
Merge branch 'feature/escape-unicode-control-chars' of https://github…
cpjulia Sep 29, 2021
92f3cbf
Merge branch 'devel' into feature/escape-unicode-control-chars
mchacki Sep 30, 2021
28 changes: 28 additions & 0 deletions CHANGELOG
@@ -1,6 +1,34 @@
devel
-----

* The server now has two flags to control the escaping of control and Unicode
characters in the log. The flag `--log.escape` is now deprecated; the new
flags `--log.escape-control-chars` and `--log.escape-unicode-chars` should be
used instead.

- `--log.escape-control-chars`: this flag applies to the control characters,
which have hex codes below `\x20`, and also to the character `DEL`, with hex
code `\x7f`. When its value is set to false, the control character is
retained: its actual value is displayed if it is a visible character, and a
space ` ` character is displayed if it is not. The same applies to the `DEL`
character (code `\x7f`), which lies outside the range below `\x20` but is
likewise not visible. For example, the control character `\n` is visible, so
a `\n` is displayed in the log, whereas the control character `BEL` is not
visible, so a space ` ` is displayed. When its value is set to true, the hex
code for the character is displayed; for example, the `BEL` character is
displayed as its hex code, `\x07`.
The default value for this flag is `true` for compatibility with
previous versions.

- `--log.escape-unicode-chars`: when its value is set to false, the Unicode
character is retained, and its actual value is displayed. For
example, `犬` is displayed as `犬`. When its value is set to true,
the character is escaped, and the hex code for the character is displayed.
For example, `犬` is displayed as its hex code, `\u72AC`.
The default value for this flag is `false` for compatibility with
previous versions.
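
To make the `--log.escape-control-chars=true` case concrete, here is a minimal sketch of the hex rendering the entry describes. The function name `renderEscaped` is hypothetical and not part of ArangoDB; the sketch covers ASCII input only and always uses the `\xNN` form, whereas the `ControlCharsEscaper` added in this PR additionally uses the short forms `\n`, `\r` and `\t`.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not ArangoDB API): renders one ASCII byte the way the
// changelog describes for --log.escape-control-chars=true.
std::string renderEscaped(unsigned char c) {
  assert(c < 0x80);  // this sketch only handles ASCII input

  // control characters are the codes below 0x20, plus DEL (0x7f)
  bool isControl = (c < 0x20) || (c == 0x7f);
  if (!isControl) {
    return std::string(1, static_cast<char>(c));  // visible: keep as-is
  }

  static char const* digits = "0123456789ABCDEF";
  std::string out = "\\x";
  out += digits[c >> 4];    // high nibble
  out += digits[c & 0x0F];  // low nibble
  return out;
}
```

Under these assumptions, `renderEscaped(0x07)` yields the four characters `\x07`, and a visible character such as `A` is returned unchanged.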

* Fixed BTS-582: ArangoDB client EXE package for Windows has incorrect metadata.

* Fixed BTS-575: Windows EXE installer doesn't replace service during upgrade in
62 changes: 0 additions & 62 deletions lib/Basics/tri-strings.cpp
@@ -303,68 +303,6 @@ char* TRI_SHA256String(char const* source, size_t sourceLen, size_t* dstLen) {
  return (char*)dst;
}

////////////////////////////////////////////////////////////////////////////////
/// @brief escapes special characters using C escapes
/// the target buffer must have been allocated already and big enough to hold
/// the result of at most (4 * inLength) + 2 bytes!
////////////////////////////////////////////////////////////////////////////////

char* TRI_EscapeControlsCString(char const* in, size_t inLength, char* out,
                                size_t* outLength, bool appendNewline) {
  if (out == nullptr) {
    return nullptr;
  }

  char* qtr = out;
  char const* ptr;
  char const* end;

  for (ptr = in, end = ptr + inLength; ptr < end; ptr++, qtr++) {
    uint8_t n;

    switch (*ptr) {
      case '\n':
        *qtr++ = '\\';
        *qtr = 'n';
        break;

      case '\r':
        *qtr++ = '\\';
        *qtr = 'r';
        break;

      case '\t':
        *qtr++ = '\\';
        *qtr = 't';
        break;

      default:
        n = (uint8_t)(*ptr);

        if (n < 32) {
          uint8_t n1 = n >> 4;
          uint8_t n2 = n & 0x0F;

          *qtr++ = '\\';
          *qtr++ = 'x';
          *qtr++ = (n1 < 10) ? ('0' + n1) : ('A' + n1 - 10);
          *qtr = (n2 < 10) ? ('0' + n2) : ('A' + n2 - 10);
        } else {
          *qtr = *ptr;
        }

        break;
    }
  }

  if (appendNewline) {
    *qtr++ = '\n';
  }

  *qtr = '\0';
  *outLength = static_cast<size_t>(qtr - out);
  return out;
}

////////////////////////////////////////////////////////////////////////////////
/// @brief unescapes unicode escape sequences
18 changes: 0 additions & 18 deletions lib/Basics/tri-strings.h
@@ -109,24 +109,6 @@ void TRI_FreeString(char*) noexcept;

char* TRI_SHA256String(char const* source, size_t sourceLen, size_t* dstLen);

////////////////////////////////////////////////////////////////////////////////
/// @brief returns the maximum result length for an escaped string
/// (4 * inLength) + 2 bytes!
////////////////////////////////////////////////////////////////////////////////

constexpr size_t TRI_MaxLengthEscapeControlsCString(size_t inLength) {
  return (4 * inLength) + 2;  // for newline and 0 byte
}
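
The `(4 * inLength) + 2` bound follows from the removed escaping routine: each input byte expands to at most four output bytes (`\xNN`), plus one byte for an optional newline and one for the terminating 0 byte. A minimal sketch that checks this, assuming hypothetical helper names `maxEscapedLength` and `escapedLength` that do not exist in ArangoDB:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same arithmetic as the removed TRI_MaxLengthEscapeControlsCString.
constexpr std::size_t maxEscapedLength(std::size_t inLength) {
  return (4 * inLength) + 2;  // 4 bytes per input byte, plus newline and 0 byte
}

// Hypothetical helper: bytes needed for the escaped form of `in`, including
// the optional trailing newline and the terminating 0 byte.
std::size_t escapedLength(std::string const& in, bool appendNewline) {
  std::size_t n = 0;
  for (unsigned char c : in) {
    if (c == '\n' || c == '\r' || c == '\t') {
      n += 2;  // short forms \n, \r, \t
    } else if (c < 32) {
      n += 4;  // hex form \xNN
    } else {
      n += 1;  // copied unchanged
    }
  }
  n += (appendNewline ? 1 : 0) + 1;
  assert(n <= maxEscapedLength(in.size()));  // the bound always holds
  return n;
}
```

The bound is reached exactly when every input byte needs the `\xNN` form and a newline is appended; for ordinary log messages it overestimates considerably.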

////////////////////////////////////////////////////////////////////////////////
/// @brief escapes special characters using C escapes
/// the target buffer must have been allocated already and big enough to hold
/// the result of at most (4 * inLength) + 2 bytes!
////////////////////////////////////////////////////////////////////////////////

char* TRI_EscapeControlsCString(char const* in, size_t inLength, char* out,
                                size_t* outLength, bool appendNewline);

////////////////////////////////////////////////////////////////////////////////
/// @brief unescapes unicode escape sequences
///
1 change: 1 addition & 0 deletions lib/CMakeLists.txt
@@ -159,6 +159,7 @@ add_library(arango STATIC
  Endpoint/EndpointIpV6.cpp
  Endpoint/EndpointList.cpp
  Futures/Future.cpp
  Logger/Escaper.cpp
  Logger/LogAppender.cpp
  Logger/LogAppenderFile.cpp
  Logger/LogAppenderSyslog.cpp
211 changes: 211 additions & 0 deletions lib/Logger/Escaper.cpp
@@ -0,0 +1,211 @@
////////////////////////////////////////////////////////////////////////////////
/// DISCLAIMER
///
/// Copyright 2014-2021 ArangoDB GmbH, Cologne, Germany
/// Copyright 2004-2014 triAGENS GmbH, Cologne, Germany
///
/// Licensed under the Apache License, Version 2.0 (the "License");
/// you may not use this file except in compliance with the License.
/// You may obtain a copy of the License at
///
/// http://www.apache.org/licenses/LICENSE-2.0
///
/// Unless required by applicable law or agreed to in writing, software
/// distributed under the License is distributed on an "AS IS" BASIS,
/// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
/// See the License for the specific language governing permissions and
/// limitations under the License.
///
/// Copyright holder is ArangoDB GmbH, Cologne, Germany
///
/// @author Julia Puget
////////////////////////////////////////////////////////////////////////////////

#include "Escaper.h"
#include "Basics/debugging.h"

namespace arangodb {

void ControlCharsSuppressor::writeCharIntoOutputBuffer(uint32_t c, char*& output, int numBytes) {
  // every control character is replaced by a single space
  *output++ = ' ';
}

void ControlCharsEscaper::writeCharIntoOutputBuffer(uint32_t c, char*& output, int numBytes) {
  switch (c) {
    case '\n':
      *output++ = '\\';
      *output++ = 'n';
      break;

    case '\r':
      *output++ = '\\';
      *output++ = 'r';
      break;

    case '\t':
      *output++ = '\\';
      *output++ = 't';
      break;

    default: {
      uint8_t n1 = c >> 4;
      uint8_t n2 = c & 0x0F;

      *output++ = '\\';
      *output++ = 'x';
      *output++ = (n1 < 10) ? ('0' + n1) : ('A' + n1 - 10);
      *output++ = (n2 < 10) ? ('0' + n2) : ('A' + n2 - 10);
    }
  }
}

void UnicodeCharsRetainer::writeCharIntoOutputBuffer(uint32_t c, char*& output, int numBytes) {
  if (numBytes == 2) {
    uint16_t num1 = c & 0xffff;
    *output++ = ((num1 >> 6) & 0x1f) | 0xc0;
    *output++ = (num1 & 0x3f) | 0x80;
  } else if (numBytes == 3) {
    uint16_t num1 = c & 0xffff;
    *output++ = ((num1 >> 12) & 0x0f) | 0xe0;
    *output++ = ((num1 >> 6) & 0x3f) | 0x80;
    *output++ = (num1 & 0x3f) | 0x80;
  } else if (numBytes == 4) {
    *output++ = ((c >> 18) & 0x07) | 0xF0;
    *output++ = ((c >> 12) & 0x3f) | 0x80;
    *output++ = ((c >> 6) & 0x3f) | 0x80;
    *output++ = (c & 0x3f) | 0x80;
  }
}
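
The bit manipulation above is standard UTF-8 encoding: the code point's bits are spread over a lead byte and one to three continuation bytes of the form `10xxxxxx`. A self-contained sketch of the same technique, where `encodeUtf8` is a hypothetical name and the byte count is derived from the code point instead of being passed in as `numBytes`:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Unlike the retainer above, it picks the sequence length itself.
std::string encodeUtf8(uint32_t cp) {
  assert(cp <= 0x10FFFF);  // largest valid Unicode code point
  std::string out;
  if (cp < 0x80) {  // 1 byte: 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {  // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For the CHANGELOG's example character `犬` (U+72AC), this produces the three bytes `E7 8A AC`.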

void UnicodeCharsEscaper::writeCharHelper(uint16_t c, char*& output) {
  *output++ = '\\';
  *output++ = 'u';

  uint16_t i1 = (c & 0xF000) >> 12;
  uint16_t i2 = (c & 0x0F00) >> 8;
  uint16_t i3 = (c & 0x00F0) >> 4;
  uint16_t i4 = (c & 0x000F);

  *output++ = (i1 < 10) ? ('0' + i1) : ('A' + i1 - 10);
  *output++ = (i2 < 10) ? ('0' + i2) : ('A' + i2 - 10);
  *output++ = (i3 < 10) ? ('0' + i3) : ('A' + i3 - 10);
  *output++ = (i4 < 10) ? ('0' + i4) : ('A' + i4 - 10);
}

void UnicodeCharsEscaper::writeCharIntoOutputBuffer(uint32_t c, char*& output, int numBytes) {
  if (numBytes == 4) {
    // a character that needs 4 bytes in UTF-8 lies above U+FFFF, so it is
    // escaped as a UTF-16 surrogate pair: a high surrogate followed by a
    // low surrogate
    TRI_ASSERT(c >= 0x10000U);
    c -= 0x10000U;
    uint16_t high = (uint16_t) (((c & 0xffc00U) >> 10) + 0xd800);
    writeCharHelper(high, output);
    uint16_t low = (c & 0x3ffU) + 0xdc00U;
    writeCharHelper(low, output);
  } else {
    writeCharHelper(c, output);
  }
}
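
The surrogate-pair arithmetic can be checked in isolation. The following sketch uses the same computation with `std::string` output; `appendHex4` and `escapeCodePoint` are hypothetical names chosen for illustration, not functions from this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: appends \uXXXX with upper-case hex digits.
static void appendHex4(uint16_t v, std::string& out) {
  static char const* digits = "0123456789ABCDEF";
  out += "\\u";
  for (int shift = 12; shift >= 0; shift -= 4) {
    out += digits[(v >> shift) & 0x0F];
  }
}

// Hypothetical helper: escapes one code point, using a UTF-16 surrogate pair
// for code points above U+FFFF.
std::string escapeCodePoint(uint32_t cp) {
  assert(cp <= 0x10FFFF);
  std::string out;
  if (cp >= 0x10000) {
    uint32_t v = cp - 0x10000;  // 20 significant bits remain
    appendHex4(static_cast<uint16_t>(0xD800 + (v >> 10)), out);    // high surrogate: top 10 bits
    appendHex4(static_cast<uint16_t>(0xDC00 + (v & 0x3FF)), out);  // low surrogate: bottom 10 bits
  } else {
    appendHex4(static_cast<uint16_t>(cp), out);
  }
  return out;
}
```

For U+72AC the result is `\u72AC`, matching the CHANGELOG example; for U+1F600 the result is `\uD83D\uDE00`.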

template <typename ControlCharHandler, typename UnicodeCharHandler>
size_t Escaper<ControlCharHandler, UnicodeCharHandler>::determineOutputBufferSize(
    std::string const& message) const {
  return message.size() * std::max(this->_controlHandler.maxCharLength(),
                                   this->_unicodeHandler.maxCharLength());
}

template <typename ControlCharHandler, typename UnicodeCharHandler>
void Escaper<ControlCharHandler, UnicodeCharHandler>::writeIntoOutputBuffer(
    std::string const& message, char*& buffer) {
  unsigned char const* p = reinterpret_cast<unsigned char const*>(message.data());
  unsigned char const* end = p + message.length();
  while (p < end) {
    unsigned char c = *p;
    if (c < 128) {  // the character is ASCII
      // control characters have codes below 0x20; DEL (0x7f) is handled the
      // same way because it is not a visible character
      if (c < 0x20 || c == 0x7f) {
        // retain or escape the control character
        this->_controlHandler.writeCharIntoOutputBuffer(c, buffer, 1);
      } else {  // visible ASCII character
        *buffer++ = c;
      }
      p++;
    } else if (c < 224) {
      // lead byte of a 2-byte sequence; note that stray continuation bytes
      // (0x80..0xBF) also take this branch
      if ((p + 1) >= end) {  // no second byte available, so it's broken unicode
        *buffer++ = '?';
        p++;
        continue;
      }
      uint8_t d = (uint8_t) * (p + 1);
      if ((d & 0xC0) == 0x80) {  // second byte is a valid continuation byte
        // retain or escape the unicode character represented by 2 bytes
        this->_unicodeHandler.writeCharIntoOutputBuffer(
            ((c & 0x1F) << 6) | (d & 0x3F), buffer, 2);
        ++p;
      } else {  // second byte is not a valid continuation byte
        *buffer++ = '?';
      }
      p++;
    } else if (c < 240) {  // lead byte of a 3-byte sequence
      if ((p + 2) >= end) {  // fewer than 2 bytes follow, so it's broken unicode
        *buffer++ = '?';
        p++;
        continue;
      }
      uint8_t d = (uint8_t) * (p + 1);
      if ((d & 0xC0) == 0x80) {  // second byte is a valid continuation byte
        ++p;
        uint8_t e = (uint8_t) * (p + 1);
        if ((e & 0xC0) == 0x80) {  // third byte is a valid continuation byte
          ++p;
          // retain or escape the unicode character represented by 3 bytes
          this->_unicodeHandler.writeCharIntoOutputBuffer(
              ((c & 0x0F) << 12) | ((d & 0x3F) << 6) | (e & 0x3F), buffer, 3);
        } else {  // third byte is not a valid continuation byte
          *buffer++ = '?';
        }
      } else {  // second byte is not a valid continuation byte
        *buffer++ = '?';
      }
      p++;
    } else if (c < 248) {  // lead byte of a 4-byte sequence
      if ((p + 3) >= end) {  // fewer than 3 bytes follow, so it's broken unicode
        *buffer++ = '?';
        p++;
        continue;
      }
      uint8_t d = (uint8_t) * (p + 1);
      if ((d & 0xC0) == 0x80) {  // second byte is a valid continuation byte
        ++p;
        uint8_t e = (uint8_t) * (p + 1);
        if ((e & 0xC0) == 0x80) {  // third byte is a valid continuation byte
          ++p;
          uint8_t f = (uint8_t) * (p + 1);
          if ((f & 0xC0) == 0x80) {  // fourth byte is a valid continuation byte
            p++;
            // retain or escape the unicode character represented by 4 bytes
            this->_unicodeHandler.writeCharIntoOutputBuffer(
                ((c & 0x07) << 18) | ((d & 0x3F) << 12) | ((e & 0x3F) << 6) | (f & 0x3F),
                buffer, 4);
          } else {  // fourth byte is not a valid continuation byte
            *buffer++ = '?';
          }
        } else {  // third byte is not a valid continuation byte
          *buffer++ = '?';
        }
      } else {  // second byte is not a valid continuation byte
        *buffer++ = '?';
      }
      p++;
    } else {  // not ASCII and not the lead byte of a 2-, 3- or 4-byte sequence
      *buffer++ = '?';
      // invalid UTF-8 sequence: stop processing the rest of the message
      break;
    }
  }
}
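
For comparison, the decoding loop can be written more compactly by first deriving the expected sequence length from the lead byte. The sketch below is not the PR's implementation and differs from it in ways that are assumptions of this example: it rejects stray continuation bytes (`0x80`..`0xBF`) as lead bytes, it always resumes one byte after a bad lead byte, and it keeps going after an invalid lead byte instead of stopping. Like the PR code, it does not reject overlong encodings or surrogate code points. The names `sequenceLength` and `decodeUtf8` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: expected length of a sequence starting with `lead`,
// or 0 if `lead` cannot start a sequence.
static int sequenceLength(unsigned char lead) {
  if (lead < 0x80) return 1;  // ASCII
  if (lead < 0xC0) return 0;  // continuation byte, cannot be a lead byte
  if (lead < 0xE0) return 2;
  if (lead < 0xF0) return 3;
  if (lead < 0xF8) return 4;
  return 0;  // 0xF8..0xFF never appear in UTF-8
}

// Hypothetical helper: decodes UTF-8 into code points, emitting '?' for each
// byte that cannot be decoded.
std::vector<uint32_t> decodeUtf8(std::string const& in) {
  static unsigned char const leadMask[] = {0x00, 0x7F, 0x1F, 0x0F, 0x07};
  std::vector<uint32_t> out;
  std::size_t i = 0;
  while (i < in.size()) {
    unsigned char lead = static_cast<unsigned char>(in[i]);
    int len = sequenceLength(lead);
    assert(len >= 0 && len <= 4);
    // the sequence is broken if the lead byte is invalid or bytes are missing
    bool ok = len > 0 && i + static_cast<std::size_t>(len) <= in.size();
    uint32_t cp = 0;
    if (ok) {
      cp = lead & leadMask[len];
      for (int k = 1; k < len; ++k) {
        unsigned char b = static_cast<unsigned char>(in[i + k]);
        if ((b & 0xC0) != 0x80) {  // not of the form 10xxxxxx
          ok = false;
          break;
        }
        cp = (cp << 6) | (b & 0x3F);
      }
    }
    if (ok) {
      out.push_back(cp);
      i += static_cast<std::size_t>(len);
    } else {
      out.push_back('?');
      i += 1;  // resynchronize on the next byte
    }
  }
  return out;
}
```

A truncated sequence such as `E7 8A` yields two `?` entries here, which is also what the loop in the PR produces for that input.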

template class Escaper<ControlCharsSuppressor, UnicodeCharsRetainer>;

template class Escaper<ControlCharsSuppressor, UnicodeCharsEscaper>;

template class Escaper<ControlCharsEscaper, UnicodeCharsRetainer>;

template class Escaper<ControlCharsEscaper, UnicodeCharsEscaper>;
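
The four explicit instantiations cover every combination of the two handler policies, so a caller can choose one at runtime from two boolean settings. How the PR's calling code does this is not shown in this section; the following is only a reduced sketch of the policy-template idea with a single policy parameter, and all names in it (`SuppressControl`, `EscapeControl`, `process`, `processWithFlag`) are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical policy: replaces a control character with a space.
struct SuppressControl {
  static void write(unsigned char, std::string& out) { out += ' '; }
};

// Hypothetical policy: writes a control character as \xNN.
struct EscapeControl {
  static void write(unsigned char c, std::string& out) {
    assert(c < 0x20 || c == 0x7f);  // only called for control characters
    static char const* digits = "0123456789ABCDEF";
    out += "\\x";
    out += digits[c >> 4];
    out += digits[c & 0x0F];
  }
};

// The policy is fixed at compile time, so the per-character loop contains no
// runtime branch on the flag.
template <typename ControlPolicy>
std::string process(std::string const& msg) {
  std::string out;
  for (unsigned char c : msg) {
    if (c < 0x20 || c == 0x7f) {
      ControlPolicy::write(c, out);
    } else {
      out += static_cast<char>(c);
    }
  }
  return out;
}

// The runtime flag is examined once per message to select the instantiation.
std::string processWithFlag(std::string const& msg, bool escapeControl) {
  return escapeControl ? process<EscapeControl>(msg)
                       : process<SuppressControl>(msg);
}
```

With a second policy parameter for Unicode handling, the same pattern gives the four combinations instantiated above.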

} // namespace arangodb