Port/rewrite builtins/string to Rust#9854
Conversation
0635cf8 to
be98fb2
Compare
It's fine to remove those if we have equivalent tests elsewhere - either as a unit test or as an integration test, for everything that's still sensible to test. A bunch of these cases are already covered by the littlecheck tests.
Ah. It crashes when it tries to output a NUL, which we need to allow. I had so far only seen it try that when it got a NUL passed through as an argument, which in the C++ version is impossible (because we pass args as a C-style NUL-delimited array). But The error is in fish-shell/fish-rust/src/builtins/shared.rs Lines 90 to 92 in a75de42 That's bogus, the output stream can and should get NULs. I'll see if I can figure it out - the It's also fine to disable these tests for now, so you're unblocked.
We've been trying to first do as direct a port as possible. It's fine to do this later, but divide-and-conquer - first port what we have, then think about making it nice. |
|
Try this for append(): use crate::wchar_ffi::ToCppWString;
// ...
pub fn append<Str: AsRef<wstr> + ToCppWString>(&mut self, s: Str) -> bool {
self.ffi().append(&s.into_cpp())
}This builds and runs for me, and causes Edit: Now available as #9859. |
78cd564 to
cfdd217
Compare
3eb99c2 to
ca5a147
Compare
|
This is an incredible amount of work! Thank you!!
One possibility to avoid the whole object-safe nonsense is to make it concrete, via a generic handler, and then maybe use impl SubCommandRunner {
fn run<T: StringSubCommand>(&mut self, cmd: T) -> Option<c_int> { ... }
fn run_subcommand(&mut self, cmd: &str) {
match cmd {
"collect" => self.run(Collect::default()),
"escape" => self.run(Escape::default()),
...
}
}The advantage is that you don't have to sweat "object safe;" it does require an awkward conversion to |
ca16bd5 to
5aa0b21
Compare
ad8eb3f to
5a04a9b
Compare
d0afdac to
ed62f66
Compare
f1dfa8c to
8d925f4
Compare
aec45b4 to
8d2fa70
Compare
8d2fa70 to
81e36a5
Compare
| let mut args_read = Vec::with_capacity(args.len()); | ||
| args_read.extend_from_slice(args); | ||
|
|
||
| let mut w = wgetopter_t::new(Self::SHORT_OPTIONS, Self::LONG_OPTIONS, args); |
There was a problem hiding this comment.
The copy of args into args_read is ugly.
wgetopter_t wants exclusive access to args (because it will rearrange args so positionals are next to each other).
How about giving wgetopt accessor methods like w.cmd() and w.argv()?
There's another issue: args: &mut [&wstr] has the wrong type (for all builtins).
In practice it is always underpinned by a &mut Vec<WString> we should change
it to args: &mut [WString] to avoid the need to allocate a slice-of-slices;
Here's some of that change I made a while ago:
Details
diff --git a/fish-rust/src/wgetopt.rs b/fish-rust/src/wgetopt.rs
index bc4224c98..0ff56323f 100644
--- a/fish-rust/src/wgetopt.rs
+++ b/fish-rust/src/wgetopt.rs
@@ -23,7 +23,9 @@ License along with the GNU C Library; see the file COPYING.LIB. If
not, write to the Free Software Foundation, Inc., 675 Mass Ave,
Cambridge, MA 02139, USA. */
-use crate::wchar::{wstr, WExt, L};
+use std::ops;
+
+use crate::wchar::{wstr, WExt, WString, L};
/// Describe how to deal with options that follow non-option ARGV-elements.
///
@@ -64,14 +66,14 @@ fn empty_wstr() -> &'static wstr {
Default::default()
}
-pub struct wgetopter_t<'opts, 'args, 'argarray> {
+pub struct wgetopter_t<'opts, 'argarray> {
/// Argv.
- argv: &'argarray mut [&'args wstr],
+ argv: &'argarray mut [WString],
/// For communication from `getopt` to the caller. When `getopt` finds an option that takes an
/// argument, the argument value is returned here. Also, when `ordering` is RETURN_IN_ORDER, each
/// non-option ARGV-element is returned here.
- pub woptarg: Option<&'args wstr>,
+ woptarg_idx: Option<(usize, usize)>,
shortopts: &'opts wstr,
longopts: &'opts [woption<'opts>],
@@ -80,7 +82,13 @@ pub struct wgetopter_t<'opts, 'args, 'argarray> {
/// returned was found. This allows us to pick up the scan where we left off.
///
/// If this is empty, it means resume the scan by advancing to the next ARGV-element.
+ nextchar_slice: ops::Range<usize>,
/// Index in ARGV of the next element to be scanned. This is used for communication to and from
/// the caller and for communication between successive calls to `getopt`.
@@ -153,11 +161,11 @@ pub const fn wopt(name: &wstr, has_arg: woption_argument_t, val: char) -> woptio
woption { name, has_arg, val }
}
-impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
+impl<'opts, 'argarray> wgetopter_t<'opts, 'argarray> {
pub fn new(
shortopts: &'opts wstr,
longopts: &'opts [woption],
- argv: &'argarray mut [&'args wstr],
+ argv: &'argarray mut [WString],
) -> Self {
return wgetopter_t {
woptopt: '?',
@@ -168,9 +176,9 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
initialized: false,
last_nonopt: 0,
missing_arg_return_colon: false,
- nextchar: Default::default(),
+ nextchar_slice: 0..0,
ordering: Ordering::PERMUTE,
- woptarg: None,
+ woptarg_idx: None,
woptind: 0,
};
}
@@ -185,6 +193,23 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
return self._wgetopt_internal(opt_index, false);
}
+ pub fn woptarg(&self) -> Option<&wstr> {
+ self.woptarg_idx
+ .map(|(woptind, start)| &self.argv[woptind][start..])
+ }
+
+ pub fn cmd(&self) -> &wstr {
+ &self.argv[0]
+ }
+
+ pub fn argv(&self) -> &[WString] {
+ self.argv
+ }
+
+ pub fn nextchar(&self) -> &wstr {
+ &self.argv[self.woptind][self.nextchar_slice.clone()]
+ }
+
/// \return the number of arguments.
fn argc(&self) -> usize {
return self.argv.len();
@@ -241,7 +266,7 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
self.first_nonopt = 1;
self.last_nonopt = 1;
self.woptind = 1;
- self.nextchar = empty_wstr();
+ self.nextchar_slice = 0..0;
let mut optstring = self.shortopts;
@@ -321,7 +346,7 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
if self.ordering == Ordering::REQUIRE_ORDER {
return None;
}
- self.woptarg = Some(self.argv[self.woptind]);
+ self.woptarg_idx = Some((self.woptind, 0));
self.woptind += 1;
return Some(char::from(1));
}
@@ -332,15 +357,15 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
} else {
1
};
- self.nextchar = self.argv[self.woptind][skip..].into();
+ self.nextchar_slice = skip..(self.argv[self.woptind].len());
return Some(char::from(0));
}
/// Check for a matching short opt.
fn _handle_short_opt(&mut self) -> char {
// Look at and handle the next short option-character.
- let mut c = self.nextchar.char_at(0);
- self.nextchar = &self.nextchar[1..];
+ let mut c = self.nextchar().char_at(0);
+ self.nextchar_slice = self.nextchar_slice.start + 1..self.nextchar_slice.end;
let temp = match self.shortopts.chars().position(|sc| sc == c) {
Some(pos) => &self.shortopts[pos..],
@@ -348,14 +373,14 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
};
// Increment `woptind' when we start to process its last character.
- if self.nextchar.is_empty() {
+ if self.nextchar().is_empty() {
self.woptind += 1;
}
if temp.is_empty() || c == ':' {
self.woptopt = c;
- if !self.nextchar.is_empty() {
+ if !self.nextchar().is_empty() {
self.woptind += 1;
}
return '?';
@@ -367,17 +392,17 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
if temp.char_at(2) == ':' {
// This is an option that accepts an argument optionally.
- if !self.nextchar.is_empty() {
- self.woptarg = Some(self.nextchar);
+ if !self.nextchar().is_empty() {
+ self.woptarg_idx = Some((self.woptind, 0));
self.woptind += 1;
} else {
- self.woptarg = None;
+ self.woptarg_idx = None;
}
- self.nextchar = empty_wstr();
+ self.nextchar_slice = 0..0;
} else {
// This is an option that requires an argument.
- if !self.nextchar.is_empty() {
- self.woptarg = Some(self.nextchar);
+ if !self.nextchar().is_empty() {
+ self.woptarg_idx = Some((self.woptind, 0));
// If we end this ARGV-element by taking the rest as an arg, we must advance to
// the next element now.
self.woptind += 1;
@@ -391,10 +416,10 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
} else {
// We already incremented `woptind' once; increment it again when taking next
// ARGV-elt as argument.
- self.woptarg = Some(self.argv[self.woptind]);
+ self.woptarg_idx = Some((self.woptind, 0));
self.woptind += 1;
}
- self.nextchar = empty_wstr();
+ self.nextchar_slice = 0..0;
}
return c;
@@ -409,21 +434,23 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
retval: &mut char,
) {
self.woptind += 1;
- assert!(self.nextchar.char_at(nameend) == '\0' || self.nextchar.char_at(nameend) == '=');
- if self.nextchar.char_at(nameend) == '=' {
+ assert!(
+ self.nextchar().char_at(nameend) == '\0' || self.nextchar().char_at(nameend) == '='
+ );
+ if self.nextchar().char_at(nameend) == '=' {
if pfound.has_arg != woption_argument_t::no_argument {
- self.woptarg = Some(self.nextchar[(nameend + 1)..].into());
+ self.woptarg_idx = Some((self.woptind, nameend + 1));
} else {
- self.nextchar = empty_wstr();
+ self.nextchar_slice = 0..0;
*retval = '?';
return;
}
} else if pfound.has_arg == woption_argument_t::required_argument {
if self.woptind < self.argc() {
- self.woptarg = Some(self.argv[self.woptind]);
+ self.woptarg_idx = Some((self.woptind, 0));
self.woptind += 1;
} else {
- self.nextchar = empty_wstr();
+ self.nextchar_slice = 0..0;
*retval = if self.missing_arg_return_colon {
':'
} else {
@@ -433,7 +460,7 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
}
}
- self.nextchar = empty_wstr();
+ self.nextchar_slice = 0..0;
*longind = option_index;
*retval = pfound.val;
}
@@ -451,7 +478,7 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
// Test all long options for either exact match or abbreviated matches.
for (option_index, p) in self.longopts.iter().enumerate() {
// Check if current option is prefix of long opt
- if p.name.starts_with(&self.nextchar[..nameend]) {
+ if p.n
4D24
ame.starts_with(&self.nextchar()[..nameend]) {
if nameend == p.name.len() {
// The current option is exact match of this long option
pfound = Some(*p);
@@ -483,14 +510,14 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
let mut indfound: usize = 0;
let mut nameend = 0;
- while self.nextchar.char_at(nameend) != '\0' && self.nextchar.char_at(nameend) != '=' {
+ while self.nextchar().char_at(nameend) != '\0' && self.nextchar().char_at(nameend) != '=' {
nameend += 1;
}
let pfound = self._find_matching_long_opt(nameend, &mut exact, &mut ambig, &mut indfound);
if ambig && !exact {
- self.nextchar = empty_wstr();
+ self.nextchar_slice = 0..0;
self.woptind += 1;
*retval = '?';
return true;
@@ -509,9 +536,9 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
|| !self
.shortopts
.as_char_slice()
- .contains(&self.nextchar.char_at(0))
+ .contains(&self.nextchar().char_at(0))
{
- self.nextchar = empty_wstr();
+ self.nextchar_slice = 0..0;
self.woptind += 1;
*retval = '?';
return true;
@@ -541,7 +568,7 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
/// If a char in OPTSTRING is followed by a colon, that means it wants an arg, so the following text
/// in the same ARGV-element, or the text of the following ARGV-element, is returned in `optarg`.
/// Two colons mean an option that wants an optional arg; if there is text in the current
- /// ARGV-element, it is returned in `w.woptarg`, otherwise `w.woptarg` is set to zero.
+ /// ARGV-element, it is returned in `w.woptarg()`, otherwise `w.woptarg()` is set to zero.
///
/// If OPTSTRING starts with `-` or `+', it requests different methods of handling the non-option
/// ARGV-elements. See the comments about RETURN_IN_ORDER and REQUIRE_ORDER, above.
@@ -563,9 +590,9 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
if !self.initialized {
self._wgetopt_initialize();
}
- self.woptarg = None;
+ self.woptarg_idx = None;
- if self.nextchar.is_empty() {
+ if self.nextchar().is_empty() {
let narg = self._advance_to_next_argv();
if narg != Some(char::from(0)) {
return narg;
@@ -586,7 +613,7 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
//
// This distinction seems to be the most useful approach.
if !self.longopts.is_empty() && self.woptind < self.argc() {
- let arg = self.argv[self.woptind];
+ let arg = &self.argv[self.woptind];
#[allow(clippy::if_same_then_else)]
#[allow(clippy::needless_bool)]There was a problem hiding this comment.
I agree the cloning is ugly, but this is independent of this PR and present in other subcommands as well, I would prefer to not batch that into this already quite large change
|
I've reviewed everything except match and replace. Details |
There was a problem hiding this comment.
The prr-comments got kinda mangled
fish-rust/src/builtins/string.rs
Outdated
| while let Some(ec_pos) = ins.slice_from(pos).find_char('\x1B') { | ||
| pos += ec_pos; | ||
| if let Some(len) = escape_code_length(ins.slice_from(pos)) { | ||
| let sub = ins.slice_from(pos).slice_to(len); |
There was a problem hiding this comment.
I vote for using bracket indexing for the sake of consistency
I agree, but then we should probably just remove slice_{to,from} from WExt
| let sub = ins.slice_from(pos).slice_to(len); | |
| let sub = &ins[pos..pos + len]; |
There was a problem hiding this comment.
I agree, but then we should probably just remove slice_{to,from} from WExt
yeah. I'm not sure what others think
There was a problem hiding this comment.
The logical distinction of slice_to and slice_from is that they operate on char counts, instead of byte counts.
In a hypothetical future where we try to switch to UTF-8, that distinction will become important. I worry that the bracket indexing will introduce bugs at that point. But I'm open to other ideas for how to handle that.
There was a problem hiding this comment.
If that's the case we should warn for all usage of wstr-slicing and indexing with a clippy lint. I think the point here is that we want to write code that's portable across string representations, and that's not necessarily that simple. I am very much in support of full unicode-support with graphemes and UTF8, so we should prepare for it.
But: this should have an API closer to slicing, e.g. like
fn slice_chars<I>(&self, index: I) -> Self
where
I: RangeBounds<usize> + SliceIndex<[char], Output = [char]>,so we can still use slice-like operations
There was a problem hiding this comment.
Interesting; a clippy lint banning str slicing seems easy and robust enough
| } | ||
| } | ||
|
|
||
| let ourmax: usize = self.max.unwrap_or(min_width); |
There was a problem hiding this comment.
nit regarding the earlier // TODO: Can we have negative width comment.
If easily possible, I'd have preferred putting the behavior fix commit first, so that such questions never arise.
One issue here is that the earlier Rust code before the fix panicked with a capacity overflow, and backporting to C++ so you have to rebase with a C++ bugfix is weird, so you kinda need to squash the fix with the port, which is not ideal. I don't like this situation. In this case it was easier to fix the bug than port the bug
There was a problem hiding this comment.
and backporting to C++ so you have to rebase with a C++ bugfix is weird, so you kinda need to squash the fix with the port, which is not ideal. I don't like this situation. In this case it was easier to fix the bug than port the bug
one recipe is to rebase, breaking before the rewrite to add a commit that fixes the bug in C++. Then revert that commit and continue the rebase. There are no conflicts because of the revert. Of course the bug fix needs to be applied again on the Rust side (before or after the rebase), and squashed (along with the revert; both need to be squashed into the rewrite commit to minimize noise)
There was a problem hiding this comment.
I think the conclusion here is to squash this all into one commit, I don't think this needs to be backported for a 3.6.2 release, as fixing the C++ code is not as trivial as #9900, and less important. If we want to backport this please inform me and make a PR towards master so I can rebase on that when I rebase for #9900
There was a problem hiding this comment.
any order is fine here (hence nit) but in general it's better to put "smaller" commits first because the larger ones are more likely to be changed further.
Squashing the behavior change into the port is not something we should do (but there is no need to right)
There was a problem hiding this comment.
Just to make it clear: I do not want to write a C++ version of the behavior fix. The initial porting-commit still has bugged behavior (but different, as in crashing!).
My understanding is that we do not want to backport the fix, which will leave the "broken" commit in place if we do not squash it. But as you are telling me, that is what we want? If so please mark this as resolved.
I'd have preferred putting the behavior fix commit first, so that such questions never arise.
I agree that we should ideally have the fix separately, but without writing a C++ version of the fix there is very little I can do to improve the commit ordering.
fish-rust/src/builtins/tests/string_tests.rs
Outdated
BD9E
span>
Show resolved
Hide resolved
fixed |
6d37e91 to
b9b840f
Compare
|
FYI I plan on finishing reviewing this weekend. |
There was a problem hiding this comment.
So once again this is an incredible amount of work and it looks overall fantastic. Thank you!!
Regarding splitting subcommands into separate modules: either seems fine to me, and in general I defer to the person doing the actual work. While this diverges from the C++ I understand why you chose to do it (string is pretty weird) and I'm very pleased with the new structure.
Regarding bracket-slicing [start..] vs slice_from(start): I'm not going to be a stickler about the difference for now. If/when we later move to UTF-8 we will need to audit all the slices; the fact that this can be done with a custom clippy lint rule is really good info (I didn't know you could do that!) so no reason to sweat it now. Same story for len() / char_count().
I spotted some small bugs, see my comments.
The biggest missing piece here is localization - a lot of the C++ error messages were wrapped in _(...) to plug them into wgettext and much of that has been lost. You can fix that now, or defer it and I or others can go back and add them after merge.
I don't expect to have any further comments after the small bugs are fixed. Once again I think is awesome.
| } | ||
| } | ||
| } else { | ||
| let n = arg.len(); |
There was a problem hiding this comment.
This is an example of a place we'd have to audit and fix should we switch to UTF-8 strings, since string length gets the number of chars, not number of bytes. You can switch to char_count() now, or defer it, since we really will do that audit.
|
|
||
| impl StringSubCommand<'_> for Unescape { | ||
| const LONG_OPTIONS: &'static [woption<'static>] = &[ | ||
| wopt(L!("no-quoted"), no_argument, 'q'), |
There was a problem hiding this comment.
no-quoted has short option 'n'. Distressing that this wasn't caught by the tests.
There was a problem hiding this comment.
I think this is rather a fault in that no-quoted does not mean anything for unescape — it did nothing and does nothing, the NO_QUOTED flag is not defined for unescapeflags
fish-shell/fish-rust/src/common.rs
Line 134 in ade6650
It is also not documented that it should support no-quoted https://fishshell.com/docs/current/cmds/string-unescape.html#synopsis
There was a problem hiding this comment.
I fixed this so it is the correct n, but left the option still there
fefcb4e to
d8b406d
Compare
- Add test to verify piped string replace exit code Ensure fields parsing error messages are the same. Note: C++ relied upon the value of the parsed value even when `errno` was set, that is defined behaviour we should not rely on, and cannot easilt be replicated from Rust. Therefore the Rust version will change the following error behaviour from: ```shell > string split --fields=a "" abc string split: Invalid fields value 'a' > string split --fields=1a "" abc string split: 1a: invalid integer ``` To: ```shell > string split --fields=a "" abc string split: a: invalid integer > string split --fields=1a "" abc string split: 1a: invalid integer ```
Padding with an unprintable character is now disallowed, like it was for other zero-length characters. `string shorten` now ignores escape sequences and non-printable characters when calculating the visible width of the ellipsis used (except for `\b`, which is treated as a width of -1). Previously `fish_wcswidth` returned a length of -1 when the ellipsis-str contained any non-printable character, causing the command to poentially print a larger width than expected. This also fixes an integer overflows in `string shorten`'s `max` and `max2`, when the cumulative sum of character widths turned negative (e.g. with any non-printable characters, or `\b` after the changes above). The overflow potentially caused strings containing non-printable characters to be truncated. This adds test that verify the fixed behaviour.
d8b406d to
a8d6806
Compare
There was a problem hiding this comment.
Now rebased and squashed. Also reduced the number of import statements. There are some minor nits still open:
- whether we want to backport the non-printable-fixes.
- For the UTF8-compat-changes I would much rather do that separately.
- I still think the string-translation-stuff is messy, but I can't do anything about it until we have an xgettext-like thing working for Rust code
There are a lot of TODO's that are better suited for later PR's, this should be mergeable as is
| self.fields = arg.unwrap().try_into().map_err(|e| match e { | ||
| FieldParseError::Number => StringError::NotANumber, | ||
| FieldParseError::Range => { | ||
| invalid_args!("%ls: Invalid range value for field '%ls'\n", name, arg) |
There was a problem hiding this comment.
These should now all support translation — how we extract the translation strings I leave as a future work as I don't know whether that will happen before/after macro-expansion
| streams.err.append(wgettext_fmt!( | ||
| BUILTIN_ERR_COMBO2, | ||
| cmd, | ||
| wgettext!("--invert and --groups-only are mutually exclusive") |
There was a problem hiding this comment.
These also now have translation support added back
- The dermination is from commit 7988cff - See PR fish-shell#9139
|
Merged. What can I say, except thank you. |
Capturespublic to build (so far)Individual sub-commands to review:
Please feel free to only look at the overall setup/some individual subcommands and not review everything at once, this is quite a lot
Merging is blocked on fish-shell/rust-pcre2#1, unless you want to always point to my fork