8000 Port/rewrite builtins/string to Rust by henrikhorluck · Pull Request #9854 · fish-shell/fish-shell · GitHub
[go: up one dir, main page]

Skip to content

Port/rewrite builtins/string to Rust#9854

Merged
ridiculousfish merged 4 commits intofish-shell:masterfrom
henrikhorluck:riir/builtins/string
Jul 28, 2023
Merged

Port/rewrite builtins/string to Rust#9854
ridiculousfish merged 4 commits intofish-shell:masterfrom
henrikhorluck:riir/builtins/string

Conversation

@henrikhorluck
Copy link
Contributor
@henrikhorluck henrikhorluck commented Jun 20, 2023
  • - For regex replace/subsitution, I'm planning on going through rust-pcre2
  • - This requires a patched rust-pcre2 that makes Captures public to build (so far)
  • - I'm not a particular fan of how each subcommand was set up before or with this new setup, still a lot of boilerplate/duplication

Individual sub-commands to review:

  • Collect
  • Escape
  • Join
  • Length
  • Lower
  • Match
  • Pad
  • Repeat
  • Replace
  • Shorten
  • Split
  • Sub
  • Trim
  • Unescape
  • Upper

Please feel free to only look at the overall setup/some individual subcommands and not review everything at once, this is quite a lot

Merging is blocked on fish-shell/rust-pcre2#1, unless you want to always point to my fork

henrikhorluck

This comment was marked as outdated.

@faho
Copy link
Member
faho commented Jun 22, 2023

builtin_string is still used in test-setup https://github.com/fish-shell/fish-shell/blob/master/src/fish_tests.cpp#L5088, replacing it is a bit convoluted

It's fine to remove those if we have equivalent tests elsewhere - either as a unit test or as an integration test, for everything that's still sensible to test.

A bunch of these cases are already covered by the littlecheck tests.

This is now partly blocked on #9739, as most of the remaining failing tests are because of it

Ah. It crashes when it tries to output a NUL, which we need to allow.

I had so far only seen it try that when it got a NUL passed through as an argument, which in the C++ version is impossible (because we pass args as a C-style NUL-delimited array).

But string can also get a NUL from its stdin, and is supposed to handle that sensibly, so we need to allow that (and then, possibly, to avoid a behavior change truncate after a NUL for arguments passed to a builtin).

The error is in

pub fn append<Str: AsRef<wstr>>(&mut self, s: Str) -> bool {
self.ffi().append1(c_str!(s))
}
, which tries to append a "C string" (with NUL-terminator).

That's bogus, the output stream can and should get NULs. I'll see if I can figure it out - the c_str! needs to go and be replaced with something that gets us a CxxWString with the NULs intact.

It's also fine to disable these tests for now, so you're unblocked.

Thoughts on using PCRE2_LITERAL to handle the literal case of StringReplacer::Literal? It would remove quite a bit of code, and quite trivial to implement support in rust-pcre2

We've been trying to first do as direct a port as possible. It's fine to do this later, but divide-and-conquer - first port what we have, then think about making it nice.

@faho
Copy link
Member
faho commented Jun 22, 2023

Try this for append():

use crate::wchar_ffi::ToCppWString;

// ...

pub fn append<Str: AsRef<wstr> + ToCppWString>(&mut self, s: Str) -> bool {
    self.ffi().append(&s.into_cpp())
}

This builds and runs for me, and causes echo a\x00b to print a, a NUL and then a b.

Edit: Now available as #9859.

@henrikhorluck henrikhorluck force-pushed the riir/builtins/string branch from 78cd564 to cfdd217 Compare June 22, 2023 18:10
@faho faho mentioned this pull request Jun 22, 2023
3 tasks
@henrikhorluck henrikhorluck force-pushed the riir/builtins/string branch 2 times, most recently from 3eb99c2 to ca5a147 Compare June 23, 2023 16:40
@henrikhorluck henrikhorluck marked this pull request as ready for review June 23, 2023 23:42
@zanchey zanchey added the rust label Jun 24, 2023
@zanchey zanchey added this to the fish 3.7.0 milestone Jun 24, 2023
@ridiculousfish
Copy link
Member

This is an incredible amount of work! Thank you!!

I'd very much appreciate some feedback/potential alternatives to the whole subcommand-setup, I'm not a fan or either the old one or the new, old because it mixed all the command-options, and new because it's quite a bit of boilerplate.

One possibility to avoid the whole object-safe nonsense is to make it concrete, via a generic handler, and then maybe use match:

impl SubCommandRunner {

     fn run<T: StringSubCommand>(&mut self, cmd: T) -> Option<c_int> { ... }

    fn run_subcommand(&mut self, cmd: &str) {
        match cmd {
            "collect" => self.run(Collect::default()),
            "escape" => self.run(Escape::default()),
            ...
        }
    }

The advantage is that you don't have to sweat "object safe;" it does require an awkward conversion to &str but that's not a big deal since we are matching ASCII only. I'm not sure how much boilerplate it removes though.

@henrikhorluck henrikhorluck force-pushed the riir/builtins/string branch 3 times, most recently from ca16bd5 to 5aa0b21 Compare June 30, 2023 01:38
@henrikhorluck henrikhorluck force-pushed the riir/builtins/string branch 3 times, most recently from ad8eb3f to 5a04a9b Compare July 4, 2023 22:41
@henrikhorluck henrikhorluck force-pushed the riir/builtins/string branch from f1dfa8c to 8d925f4 Compare July 11, 2023 21:39
@henrikhorluck henrikhorluck force-pushed the riir/builtins/string branch from aec45b4 to 8d2fa70 Compare July 13, 2023 09:52
@henrikhorluck henrikhorluck force-pushed the riir/builtins/string branch from 8d2fa70 to 81e36a5 Compare July 13, 2023 09:56
@zanchey zanchey requested a review from ridiculousfish July 13, 2023 10:10
let mut args_read = Vec::with_capacity(args.len());
args_read.extend_from_slice(args);

let mut w = wgetopter_t::new(Self::SHORT_OPTIONS, Self::LONG_OPTIONS, args);
Copy link
Contributor
@krobelus krobelus Jul 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The copy of args into args_read is ugly.
wgetopter_t wants exclusive access to args (because it will rearrange args so positionals are next to each other).
How about giving wgetopt accessor methods like w.cmd() and w.argv()?

There's another issue: args: &mut [&wstr] has the wrong type (for all builtins).
In practice it is always underpinned by a &mut Vec<WString> we should change
it to args: &mut [WString] to avoid the need to allocate a slice-of-slices;

Here's some of that change I made a while ago:

Details
diff --git a/fish-rust/src/wgetopt.rs b/fish-rust/src/wgetopt.rs
index bc4224c98..0ff56323f 100644
--- a/fish-rust/src/wgetopt.rs
+++ b/fish-rust/src/wgetopt.rs
@@ -23,7 +23,9 @@ License along with the GNU C Library; see the file COPYING.LIB.  If
 not, write to the Free Software Foundation, Inc., 675 Mass Ave,
 Cambridge, MA 02139, USA.  */

-use crate::wchar::{wstr, WExt, L};
+use std::ops;
+
+use crate::wchar::{wstr, WExt, WString, L};

 /// Describe how to deal with options that follow non-option ARGV-elements.
 ///
@@ -64,14 +66,14 @@ fn empty_wstr() -> &'static wstr {
     Default::default()
 }

-pub struct wgetopter_t<'opts, 'args, 'argarray> {
+pub struct wgetopter_t<'opts, 'argarray> {
     /// Argv.
-    argv: &'argarray mut [&'args wstr],
+    argv: &'argarray mut [WString],

     /// For communication from `getopt` to the caller. When `getopt` finds an option that takes an
     /// argument, the argument value is returned here. Also, when `ordering` is RETURN_IN_ORDER, each
     /// non-option ARGV-element is returned here.
-    pub woptarg: Option<&'args wstr>,
+    woptarg_idx: Option<(usize, usize)>,

     shortopts: &'opts wstr,
     longopts: &'opts [woption<'opts>],
@@ -80,7 +82,13 @@ pub struct wgetopter_t<'opts, 'args, 'argarray> {
     /// returned was found. This allows us to pick up the scan where we left off.
     ///
     /// If this is empty, it means resume the scan by advancing to the next ARGV-element.
+    nextchar_slice: ops::Range<usize>,

     /// Index in ARGV of the next element to be scanned. This is used for communication to and from
     /// the caller and for communication between successive calls to `getopt`.
@@ -153,11 +161,11 @@ pub const fn wopt(name: &wstr, has_arg: woption_argument_t, val: char) -> woptio
     woption { name, has_arg, val }
 }

-impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
+impl<'opts, 'argarray> wgetopter_t<'opts, 'argarray> {
     pub fn new(
         shortopts: &'opts wstr,
         longopts: &'opts [woption],
-        argv: &'argarray mut [&'args wstr],
+        argv: &'argarray mut [WString],
     ) -> Self {
         return wgetopter_t {
             woptopt: '?',
@@ -168,9 +176,9 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
             initialized: false,
             last_nonopt: 0,
             missing_arg_return_colon: false,
-            nextchar: Default::default(),
+            nextchar_slice: 0..0,
             ordering: Ordering::PERMUTE,
-            woptarg: None,
+            woptarg_idx: None,
             woptind: 0,
         };
     }
@@ -185,6 +193,23 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
         return self._wgetopt_internal(opt_index, false);
     }

+    pub fn woptarg(&self) -> Option<&wstr> {
+        self.woptarg_idx
+            .map(|(woptind, start)| &self.argv[woptind][start..])
+    }
+
+    pub fn cmd(&self) -> &wstr {
+        &self.argv[0]
+    }
+
+    pub fn argv(&self) -> &[WString] {
+        self.argv
+    }
+
+    pub fn nextchar(&self) -> &wstr {
+        &self.argv[self.woptind][self.nextchar_slice.clone()]
+    }
+
     /// \return the number of arguments.
     fn argc(&self) -> usize {
         return self.argv.len();
@@ -241,7 +266,7 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
         self.first_nonopt = 1;
         self.last_nonopt = 1;
         self.woptind = 1;
-        self.nextchar = empty_wstr();
+        self.nextchar_slice = 0..0;

         let mut optstring = self.shortopts;

@@ -321,7 +346,7 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
             if self.ordering == Ordering::REQUIRE_ORDER {
                 return None;
             }
-            self.woptarg = Some(self.argv[self.woptind]);
+            self.woptarg_idx = Some((self.woptind, 0));
             self.woptind += 1;
             return Some(char::from(1));
         }
@@ -332,15 +357,15 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
         } else {
             1
         };
-        self.nextchar = self.argv[self.woptind][skip..].into();
+        self.nextchar_slice = skip..(self.argv[self.woptind].len());
         return Some(char::from(0));
     }

     /// Check for a matching short opt.
     fn _handle_short_opt(&mut self) -> char {
         // Look at and handle the next short option-character.
-        let mut c = self.nextchar.char_at(0);
-        self.nextchar = &self.nextchar[1..];
+        let mut c = self.nextchar().char_at(0);
+        self.nextchar_slice = self.nextchar_slice.start + 1..self.nextchar_slice.end;

         let temp = match self.shortopts.chars().position(|sc| sc == c) {
             Some(pos) => &self.shortopts[pos..],
@@ -348,14 +373,14 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
         };

         // Increment `woptind' when we start to process its last character.
-        if self.nextchar.is_empty() {
+        if self.nextchar().is_empty() {
             self.woptind += 1;
         }

         if temp.is_empty() || c == ':' {
             self.woptopt = c;

-            if !self.nextchar.is_empty() {
+            if !self.nextchar().is_empty() {
                 self.woptind += 1;
             }
             return '?';
@@ -367,17 +392,17 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {

         if temp.char_at(2) == ':' {
             // This is an option that accepts an argument optionally.
-            if !self.nextchar.is_empty() {
-                self.woptarg = Some(self.nextchar);
+            if !self.nextchar().is_empty() {
+                self.woptarg_idx = Some((self.woptind, 0));
                 self.woptind += 1;
             } else {
-                self.woptarg = None;
+                self.woptarg_idx = None;
             }
-            self.nextchar = empty_wstr();
+            self.nextchar_slice = 0..0;
         } else {
             // This is an option that requires an argument.
-            if !self.nextchar.is_empty() {
-                self.woptarg = Some(self.nextchar);
+            if !self.nextchar().is_empty() {
+                self.woptarg_idx = Some((self.woptind, 0));
                 // If we end this ARGV-element by taking the rest as an arg, we must advance to
                 // the next element now.
                 self.woptind += 1;
@@ -391,10 +416,10 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
             } else {
                 // We already incremented `woptind' once; increment it again when taking next
                 // ARGV-elt as argument.
-                self.woptarg = Some(self.argv[self.woptind]);
+                self.woptarg_idx = Some((self.woptind, 0));
                 self.woptind += 1;
             }
-            self.nextchar = empty_wstr();
+            self.nextchar_slice = 0..0;
         }

         return c;
@@ -409,21 +434,23 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
         retval: &mut char,
     ) {
         self.woptind += 1;
-        assert!(self.nextchar.char_at(nameend) == '\0' || self.nextchar.char_at(nameend) == '=');
-        if self.nextchar.char_at(nameend) == '=' {
+        assert!(
+            self.nextchar().char_at(nameend) == '\0' || self.nextchar().char_at(nameend) == '='
+        );
+        if self.nextchar().char_at(nameend) == '=' {
             if pfound.has_arg != woption_argument_t::no_argument {
-                self.woptarg = Some(self.nextchar[(nameend + 1)..].into());
+                self.woptarg_idx = Some((self.woptind, nameend + 1));
             } else {
-                self.nextchar = empty_wstr();
+                self.nextchar_slice = 0..0;
                 *retval = '?';
                 return;
             }
         } else if pfound.has_arg == woption_argument_t::required_argument {
             if self.woptind < self.argc() {
-                self.woptarg = Some(self.argv[self.woptind]);
+                self.woptarg_idx = Some((self.woptind, 0));
                 self.woptind += 1;
             } else {
-                self.nextchar = empty_wstr();
+                self.nextchar_slice = 0..0;
                 *retval = if self.missing_arg_return_colon {
                     ':'
                 } else {
@@ -433,7 +460,7 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
             }
         }

-        self.nextchar = empty_wstr();
+        self.nextchar_slice = 0..0;
         *longind = option_index;
         *retval = pfound.val;
     }
@@ -451,7 +478,7 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
         // Test all long options for either exact match or abbreviated matches.
         for (option_index, p) in self.longopts.iter().enumerate() {
             // Check if current option is prefix of long opt
-            if p.name.starts_with(&self.nextchar[..nameend]) {
+            if p.n
4D24
ame.starts_with(&self.nextchar()[..nameend]) {
                 if nameend == p.name.len() {
                     // The current option is exact match of this long option
                     pfound = Some(*p);
@@ -483,14 +510,14 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
         let mut indfound: usize = 0;

         let mut nameend = 0;
-        while self.nextchar.char_at(nameend) != '\0' && self.nextchar.char_at(nameend) != '=' {
+        while self.nextchar().char_at(nameend) != '\0' && self.nextchar().char_at(nameend) != '=' {
             nameend += 1;
         }

         let pfound = self._find_matching_long_opt(nameend, &mut exact, &mut ambig, &mut indfound);

         if ambig && !exact {
-            self.nextchar = empty_wstr();
+            self.nextchar_slice = 0..0;
             self.woptind += 1;
             *retval = '?';
             return true;
@@ -509,9 +536,9 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
             || !self
                 .shortopts
                 .as_char_slice()
-                .contains(&self.nextchar.char_at(0))
+                .contains(&self.nextchar().char_at(0))
         {
-            self.nextchar = empty_wstr();
+            self.nextchar_slice = 0..0;
             self.woptind += 1;
             *retval = '?';
             return true;
@@ -541,7 +568,7 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
     /// If a char in OPTSTRING is followed by a colon, that means it wants an arg, so the following text
     /// in the same ARGV-element, or the text of the following ARGV-element, is returned in `optarg`.
     /// Two colons mean an option that wants an optional arg; if there is text in the current
-    /// ARGV-element, it is returned in `w.woptarg`, otherwise `w.woptarg` is set to zero.
+    /// ARGV-element, it is returned in `w.woptarg()`, otherwise `w.woptarg()` is set to zero.
     ///
     /// If OPTSTRING starts with `-` or `+', it requests different methods of handling the non-option
     /// ARGV-elements. See the comments about RETURN_IN_ORDER and REQUIRE_ORDER, above.
@@ -563,9 +590,9 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
         if !self.initialized {
             self._wgetopt_initialize();
         }
-        self.woptarg = None;
+        self.woptarg_idx = None;

-        if self.nextchar.is_empty() {
+        if self.nextchar().is_empty() {
             let narg = self._advance_to_next_argv();
             if narg != Some(char::from(0)) {
                 return narg;
@@ -586,7 +613,7 @@ impl<'opts, 'args, 'argarray> wgetopter_t<'opts, 'args, 'argarray> {
         //
         // This distinction seems to be the most useful approach.
         if !self.longopts.is_empty() && self.woptind < self.argc() {
-            let arg = self.argv[self.woptind];
+            let arg = &self.argv[self.woptind];

             #[allow(clippy::if_same_then_else)]
             #[allow(clippy::needless_bool)]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree the cloning is ugly, but this is independent of this PR and present in other subcommands as well, I would prefer to not batch that into this already quite large change

@krobelus
Copy link
Contributor
krobelus commented Jul 15, 2023

I've reviewed everything except match and replace.
Github rejects my prr comments so I'm pasting my comments here.
It's in standard diff format so it should be easy to go-to-source etc.

Details
> diff --git a/fish-rust/src/builtins/string.rs b/fish-rust/src/builtins/string.rs
> new file mode 100644
> index 00000000000..e32a46ee2d9
> --- /dev/null
> +++ b/fish-rust/src/builtins/string.rs
> @@ -0,0 +1,493 @@
> +use std::borrow::Cow;
> +use std::fs::File;
> +use std::io::{BufRead, BufReader, Read};
> +use std::os::fd::FromRawFd;
> +
> +use crate::builtins::shared::{
> +    builtin_missing_argument, builtin_print_error_trailer, builtin_print_help, io_streams_t,
> +    BUILTIN_ERR_INVALID_SUBCMD, BUILTIN_ERR_MISSING_SUBCMD, BUILTIN_ERR_NOT_NUMBER,
> +    BUILTIN_ERR_TOO_MANY_ARGUMENTS, BUILTIN_ERR_UNKNOWN, STATUS_CMD_OK, STATUS_INVALID_ARGS,
> +};
> +use crate::common::str2wcstring;
> +use crate::ffi::parser_t;
> +use crate::wchar::{wstr, WString, L};
> +use crate::wchar_ext::WExt;
> +use crate::wcstringutil::fish_wcwidth_visible;
> +use crate::wgetopt::{wgetopter_t, woption};
> +use crate::wutil::{sprintf, wgettext_fmt};
> +use libc::c_int;
> +
> +mod collect;
> +mod escape;
> +mod join;
> +mod length;
> +mod r#match;
> +mod pad;
> +mod repeat;
> +mod replace;
> +mod shorten;
> +mod split;
> +mod sub;
> +mod transform;
> +mod trim;
> +mod unescape;
> +
> +macro_rules! string_error {
> +    (
> +    $streams:expr,
> +    $string:expr
> +    $(, $args:expr)+
> +    $(,)?
> +    ) => {
> +        $streams.err.append(L!("string "));
> +        $streams.err.append(wgettext_fmt!($string, $($args),*));
> +    };
> +}
> +pub(self) use string_error;
> +
> +fn string_unknown_option(
> +    parser: &mut parser_t,
> +    streams: &mut io_streams_t,
> +    subcmd: &wstr,
> +    opt: &wstr,
> +) {
> +    string_error!(streams, BUILTIN_ERR_UNKNOWN, subcmd, opt);
> +    builtin_print_error_trailer(parser, streams, L!("string"));
> +}
> +
> +trait StringSubCommand<'args> {
> +    const SHORT_OPTIONS: &'static wstr;
> +    const LONG_OPTIONS: &'static [woption<'static>];
> +
> +    /// Parse and store option specificed by the associated short or long option.
> +    fn parse_opt(&mut self, optarg: Option<&'args wstr>, c: char) -> Result<(), StringError>;
> +
> +    fn parse_opts(
> +        &mut self,
> +        args: &mut [&'args wstr],
> +        parser: &mut parser_t,
> +        streams: &mut io_streams_t,
> +    ) -> Result<usize, Option<c_int>> {
> +        let cmd = args[0];
> +        let mut args_read = Vec::with_capacity(args.len());
> +        args_read.extend_from_slice(args);
> +
> +        let mut w = wgetopter_t::new(Self::SHORT_OPTIONS, Self::LONG_OPTIONS, args);
> +        while let Some(c) = w.wgetopt_long() {
> +            match c {
> +                ':' => {
> +                    streams.err.append(L!("string ")); // clone of string_error
> +                    builtin_missing_argument(parser, streams, cmd, args_read[w.woptind - 1], false);
> +                    return Err(STATUS_INVALID_ARGS);
> +                }
> +                '?' => {
> +                    string_unknown_option(parser, streams, cmd, args_read[w.woptind - 1]);
> +                    return Err(STATUS_INVALID_ARGS);
> +                }
> +                c => {
> +                    let retval = self.parse_opt(w.woptarg, c);
> +                    if let Err(e) = retval {
> +                        let long_name = Self::LONG_OPTIONS.iter().find_map(|o| {
> +                            if o.val == c {
> +                                Some(o.name.to_owned())
> +                            } else {
> +                                None
> +                            }
> +                        });
> +                        let opt_name = long_name.unwrap_or_else(|| sprintf!("-%c", c));
> +                        e.print_error(&args_read, parser, streams, w.woptarg, w.woptind, &opt_name);
> +                        return Err(e.retval());
> +                    }
> +                }
> +            }
> +        }
> +
> +        return Ok(w.woptind);
> +    }
> +
> +    /// Take any positional arguments after options have been parsed.
> +    #[allow(unused_variables)]
> +    fn take_args(
> +        &mut self,
> +        optind: &mut usize,
> +        args: &[&'args wstr],
> +        streams: &mut io_streams_t,
> +    ) -> Option<c_int> {
> +        STATUS_CMD_OK
> +    }
> +
> +    /// Perform the business logic of the command.
> +    fn handle(
> +        &mut self,
> +        parser: &mut parser_t,
> +        streams: &mut io_streams_t,
> +        optind: &mut usize,
> +        args: &[&'args wstr],
> +    ) -> Option<c_int>;
> +
> +    fn run(
> +        &mut self,
> +        parser: &mut parser_t,
> +        streams: &mut io_streams_t,
> +        args: &mut [&'args wstr],
> +    ) -> Option<c_int> {
> +        if args.len() >= 3 && (args[2] == "-h" || args[2] == "--help") {
> +            let string_dash_subcmd = WString::from(args[0]) + L!("-") + args[1];
> +            builtin_print_help(parser, streams, &string_dash_subcmd);
> +            return STATUS_CMD_OK;
> +        }
> +
> +        let args = &mut args[1..];
> +
> +        let mut optind = match self.parse_opts(args, parser, streams) {
> +            Ok(optind) => optind,
> +            Err(retval) => return retval,
> +        };
> +
> +        let retval = self.take_args(&mut optind, args, streams);
> +        if retval != STATUS_CMD_OK {
> +            return retval;
> +        }
> +
> +        if streams.stdin_is_directly_redirected() && args.len() > optind {
> +            string_error!(streams, BUILTIN_ERR_TOO_MANY_ARGUMENTS, args[0]);
> +            return STATUS_INVALID_ARGS;
> +        }
> +
> +        return self.handle(parser, streams, &mut optind, args);
> +    }
> +}
> +
> +/// This covers failing argument/option parsing
> +enum StringError {
> +    InvalidArgs(&'static wstr),
> +    /// Like above but automatically `Invalid <opt_name> value '<arg>'`
> +    InvalidValue,
> +    NotANumber,
> +    UnknownOption,
> +}
> +
> +enum RegexError {
> +    Compile(WString, pcre2::Error),
> +    InvalidCaptureGroupName(WString),
> +    InvalidEscape(WString),
> +}
> +
> +impl RegexError {
> +    fn print_error(&self, args: &[&wstr], streams: &mut io_streams_t) {
> +        let cmd = args[0];
> +        use RegexError::*;
> +        match self {
> +            Compile(pattern, e) => {
> +                string_error!(
> +                    streams,
> +                    "%ls: Regular expression compile error: %ls\n",
> +                    cmd,
> +                    &WString::from(e.error_message())
> +                );
> +                string_error!(streams, "%ls: %ls\n", cmd, pattern);
> +                string_error!(streams, "%ls: %*ls\n", cmd, e.offset().unwrap(), "^");
> +            }
> +            InvalidCaptureGroupName(name) => {
> +                streams.err.append(wgettext_fmt!(
> +                    "Modification of read-only variable \"%ls\" is not allowed\n",
> +                    name
> +                ));
> +            }
> +            InvalidEscape(pattern) => {
> +                string_error!(
> +                    streams,
> +                    "%ls: Invalid escape sequence in pattern \"%ls\"\n",
> +                    cmd,
> +                    pattern
> +                );
> +            }
> +        }
> +    }
> +}
> +
> +impl From<crate::wutil::wcstoi::Error> for StringError {
> +    fn from(_: crate::wutil::wcstoi::Error) -> Self {
> +        StringError::NotANumber
> +    }
> +}
> +
> +impl From<std::num::TryFromIntError> for StringError {
> +    fn from(_: std::num::TryFromIntError) -> Self {
> +        // the result of `try_into` failing on the result of fish_wcsto{i,l}
> +        StringError::InvalidValue
> +    }
> +}
> +
> +impl From<&'static wstr> for StringError {
> +    fn from(s: &'static wstr) -> Self {
> +        StringError::InvalidArgs(s)
> +    }
> +}
> +
> +impl StringError {
> +    fn print_error(
> +        &self,
> +        args: &[&wstr],
> +        parser: &mut parser_t,
> +        streams: &mut io_streams_t,
> +        optarg: Option<&wstr>,
> +        optind: usize,
> +        option_name: &wstr,
> +    ) {
> +        let cmd = args[0];
> +        use StringError::*;
> +        match self {
> +            InvalidArgs(_) | InvalidValue => {
> +                let msg = match *self {
> +                    InvalidArgs(s) => s.to_owned(),
> +                    _ => sprintf!("Invalid %s value", option_name),
> +                };
> +                let error_msg = wgettext_fmt!("%ls: %ls '%ls'\n", cmd, msg, optarg.unwrap());

This translates only `"%ls: %ls '%ls'\n"`, not the dynamic messages.
We should rather put the already translated message inside `InvalidArgs`.
In practice it will probably work fine to store the translated string as `&'static wstr` in `InvalidArgs`, and format it here,
however that doesn't feel safe so I think we should store a WString that is already formatted.

> +
> +                streams.err.append(L!("string "));
> +                streams.err.append(error_msg);
> +            }
> +            NotANumber => {
> +                string_error!(streams, BUILTIN_ERR_NOT_NUMBER, cmd, optarg.unwrap());
> +            }
> +            UnknownOption => {
> +                string_unknown_option(parser, streams, cmd, args[optind - 1]);
> +            }
> +        }
> +    }
> +
> +    fn retval(&self) -> Option<c_int> {
> +        STATUS_INVALID_ARGS
> +    }
> +}
> +
> +#[derive(Default, PartialEq, Clone, Copy)]
> +enum Direction {
> +    #[default]
> +    Left,
> +    Right,
> +}
> +
> +pub(self) fn width_without_escapes(ins: &wstr, start_pos: usize) -> usize {
> +    let mut width = 0i32;
> +    for c in ins[start_pos..].chars() {
> +        let w = fish_wcwidth_visible(c);
> +        // We assume that this string is on its own line,
> +        // in which case a backslash can't bring us below 0.
> +        if w > 0 || width > 0 {
> +            width += w;
> +        }
> +    }
> +    // ANSI escape sequences like \e\[31m contain printable characters. Subtract their width
> +    // because they are not rendered.
> +    let mut pos = start_pos;
> +    while let Some(ec_pos) = ins.slice_from(pos).find_char('\x1B') {
> +        pos += ec_pos;
> +        if let Some(len) = escape_code_length(ins.slice_from(pos)) {
> +            let sub = ins.slice_from(pos).slice_to(len);

I vote for using bracket indexing for the sake of consistency

> diff --git a/fish-rust/src/builtins/string/join.rs b/fish-rust/src/builtins/string/join.rs
> new file mode 100644
> index 00000000000..19dae7f9060
> --- /dev/null
> +++ b/fish-rust/src/builtins/string/join.rs
> @@ -0,0 +1,106 @@
> +use super::{string_error, Arguments, StringError, StringSubCommand};
> +use crate::builtins::shared::{
> +    io_streams_t, BUILTIN_ERR_ARG_COUNT0, STATUS_CMD_ERROR, STATUS_CMD_OK, STATUS_INVALID_ARGS,
> +};
> +use crate::ffi::parser_t;
> +use crate::wchar::{wstr, L};
> +use crate::wgetopt::{wopt, woption, woption_argument_t::*};
> +use crate::wutil::wgettext_fmt;
> +
> +pub struct Join<'args> {
> +    quiet: bool,
> +    no_empty: bool,
> +    pub is_join0: bool,
> +    sep: &'args wstr,
> +}
> +
> +impl Default for Join<'_> {
> +    fn default() -> Self {
> +        Self {
> +            quiet: false,
> +            no_empty: false,
> +            is_join0: false,
> +            sep: L!("\0"),
> +        }
> +    }
> +}
> +
> +impl<'args> StringSubCommand<'args> for Join<'args> {
> +    const LONG_OPTIONS: &'static [woption<'static>] = &[
> +        wopt(L!("quiet"), no_argument, 'q'),
> +        wopt(L!("no-empty"), no_argument, 'n'),
> +    ];
> +    const SHORT_OPTIONS: &'static wstr = L!(":qn");
> +
> +    fn parse_opt(&mut self, _optarg: Option<&wstr>, c: char) -> Result<(), StringError> {
> +        match c {
> +            'q' => self.quiet = true,
> +            'n' => self.no_empty = true,
> +            _ => return Err(StringError::UnknownOption),
> +        }
> +        return Ok(());
> +    }
> +
> +    fn take_args(
> +        &mut self,
> +        optind: &mut usize,
> +        args: &[&'args wstr],
> +        streams: &mut io_streams_t,
> +    ) -> Option<libc::c_int> {
> +        if self.is_join0 {
> +            return STATUS_CMD_OK;
> +        }
> +
> +        let Some(arg) = args.get(*optind).copied() else {
> +           string_error!(streams, BUILTIN_ERR_ARG_COUNT0, args[0]);
> +           return STATUS_INVALID_ARGS;
> +       };
> +        *optind += 1;
> +        self.sep = arg;
> +
> +        STATUS_CMD_OK
> +    }
> +
> +    fn handle(
> +        &mut self,
> +        _parser: &mut parser_t,
> +        streams: &mut io_streams_t,
> +        optind: &mut usize,
> +        args: &[&wstr],
> +    ) -> Option<libc::c_int> {
> +        let sep = &self.sep;
> +        let mut nargs = 0usize;
> +        let mut print_newline = true;

`print_newline` is false only in a really special case (when a final newline is missing
from stdin). Let's give it a better name (everywhere) to communicate that it only applies
to the last line. maybe `print_trailing_newline` or `terminate_last_line`.

> diff --git a/fish-rust/src/builtins/string/pad.rs b/fish-rust/src/builtins/string/pad.rs
> new file mode 100644
> index 00000000000..4f1ca2fffa3
> --- /dev/null
> +++ b/fish-rust/src/builtins/string/pad.rs
> @@ -0,0 +1,104 @@
> +use std::borrow::Cow;
> +
> +use super::{width_without_escapes, Arguments, Direction, StringError, StringSubCommand};
> +use crate::builtins::shared::{io_streams_t, STATUS_CMD_OK};
> +use crate::fallback::fish_wcwidth;
> +use crate::ffi::parser_t;
> +use crate::wchar::{wstr, WString, L};
> +use crate::wgetopt::{wopt, woption, woption_argument_t::*};
> +use crate::wutil::fish_wcstol;
> +
> +pub struct Pad {
> +    char_to_pad: char,
> +    pad_char_width: usize,
> +    pad_from: Direction,
> +    width: usize,
> +}
> +
> +impl Default for Pad {
> +    fn default() -> Self {
> +        Self {
> +            char_to_pad: ' ',
> +            pad_char_width: 1,
> +            pad_from: Direction::Left,
> +            width: 0,
> +        }
> +    }
> +}
> +
> +impl StringSubCommand<'_> for Pad {
> +    const LONG_OPTIONS: &'static [woption<'static>] = &[
> +        wopt(L!("quiet"), no_argument, 'q'),
> +        wopt(L!("right"), no_argument, 'r'),
> +        wopt(L!("width"), required_argument, 'w'),
> +    ];
> +    const SHORT_OPTIONS: &'static wstr = L!(":c:qrw:");
> +
> +    fn parse_opt(&mut self, optarg: Option<&wstr>, c: char) -> Result<(), StringError> {
> +        match c {
> +            'c' => {
> +                let [pad_char] = optarg.unwrap().as_char_slice() else {
> +                    return Err(L!("Padding should be a character").into());

This error message is missing both translation support, see my other comment about wgettext_fmt. Maybe we should keep using string_error here.

> +                };
> +                let pad_char_width = fish_wcwidth(*pad_char);
> +                if pad_char_width <= 0 {
> +                    return Err(L!("Invalid padding character of width zero").into());
> +                }
> +                self.pad_char_width = pad_char_width as usize;
> +                self.char_to_pad = *pad_char;
> +            }
> +            'r' => self.pad_from = Direction::Right,
> +            'w' => self.width = fish_wcstol(optarg.unwrap())?.try_into()?,
> +            _ => return Err(StringError::UnknownOption),
> +        }
> +        return Ok(());
> +    }
> +
> +    fn handle<'args>(
> +        &mut self,
> +        _parser: &mut parser_t,
> +        streams: &mut io_streams_t,
> +        optind: &mut usize,
> +        args: &[&'args wstr],
> +    ) -> Option<libc::c_int> {
> +        let mut max_width = 0usize;
> +        let mut inputs: Vec<(Cow<'args, wstr>, usize)> = Vec::new();
> +        let mut print_newline = true;
> +
> +        for (arg, went_newline) in Arguments::new(args, optind, streams, true) {

typo:

```suggestion
       for (arg, want_newline) in Arguments::new(args, optind, streams, true) {
```


> diff --git a/fish-rust/src/builtins/string/repeat.rs b/fish-rust/src/builtins/string/repeat.rs
> new file mode 100644
> index 00000000000..b1278b66f6c
> --- /dev/null
> +++ b/fish-rust/src/builtins/string/repeat.rs
> @@ -0,0 +1,142 @@
> +use super::{Arguments, StringError, StringSubCommand};
> +use crate::builtins::shared::{io_streams_t, STATUS_CMD_ERROR, STATUS_CMD_OK};
> +use crate::ffi::parser_t;
> +use crate::wchar::{wstr, WString, L};
> +use crate::wchar_ext::WExt;
> +use crate::wgetopt::{wopt, woption, woption_argument_t::*};
> +use crate::wutil::fish_wcstol;
> +
> +#[derive(Default)]
> +pub struct Repeat {
> +    count: usize,
> +    max: usize,
> +    quiet: bool,
> +    no_newline: bool,
> +}
> +
> +impl StringSubCommand<'_> for Repeat {
> +    const LONG_OPTIONS: &'static [woption<'static>] = &[
> +        wopt(L!("count"), required_argument, 'n'),
> +        wopt(L!("max"), required_argument, 'm'),
> +        wopt(L!("quiet"), no_argument, 'q'),
> +        wopt(L!("no-newline"), no_argument, 'N'),
> +    ];
> +    const SHORT_OPTIONS: &'static wstr = L!(":n:m:qN");
> +
> +    fn parse_opt(&mut self, optarg: Option<&wstr>, c: char) -> Result<(), StringError> {
> +        match c {
> +            'n' => self.count = fish_wcstol(optarg.unwrap())?.try_into()?,
> +            'm' => self.max = fish_wcstol(optarg.unwrap())?.try_into()?,
> +            'q' => self.quiet = true,
> +            'N' => self.no_newline = true,
> +            _ => return Err(StringError::UnknownOption),
> +        }
> +        return Ok(());
> +    }
> +
> +    fn handle(
> +        &mut self,
> +        _parser: &mut parser_t,
> +        streams: &mut io_streams_t,
> +        optind: &mut usize,
> +        args: &[&wstr],
> +    ) -> Option<libc::c_int> {
> +        if self.max == 0 && self.count == 0 {
> +            // XXX: This used to be allowed, but returned 1.
> +            // Keep it that way for now instead of adding an error.
> +            // streams.err.append(L"Count or max must be greater than zero");
> +            return STATUS_CMD_ERROR;
> +        }
> +
> +        let mut all_empty = true;
> +        let mut first = true;
> +        let mut print_newline = true;
> +
> +        for (w, want_newline) in Arguments::new(args, optind, streams, true) {
> +            print_newline = want_newline;
> +            if w.is_empty() {
> +                continue;
> +            }
> +
> +            all_empty = false;
> +
> +            if self.quiet {
> +                // Early out if we can - see #7495.
> +                return STATUS_CMD_OK;
> +            }
> +
> +            if !first {
> +                streams.out.append1('\n');
> +            }
> +            first = false;
> +
> +            // The maximum size of the string is either the "max" characters,
> +            // or it's the "count" repetitions, whichever ends up lower.
> +            let max = if self.max == 0 || (self.count > 0 && w.len() * self.count < self.max) {
> +                w.len() * self.count
> +            } else {
> +                self.max
> +            };
> +
> +            // Reserve a string to avoid writing constantly.
> +            // The 1500 here is a total gluteal extraction, but 500 seems to perform slightly worse.
> +            let chunk_size = 1500;
> +            // The + word length is so we don't have to hit the chunk size exactly,
> +            // which would require us to restart in the middle of the string.
> +            // E.g. imagine repeating "12345678". The first chunk is hit after a last "1234",
> +            // so we would then have to restart by appending "5678", which requires a substring.
> +            // So let's not bother.
> +            //
> +            // Unless of course we don't even print the entire word, in which case we just need max.
> +            let mut chunk = WString::with_capacity(max.min(chunk_size + w.len()));
> +
> +            let mut i = max;
> +            loop {
> +                if i >= w.len() {
> +                    chunk.push_utfstr(&w);
> +                } else {
> +                    chunk.push_utfstr(w.slice_to(i));
> +                    break;
> +                }
> +
> +                if chunk.len() >= chunk_size {
> +                    // We hit the chunk size, write it repeatedly until we can't anymore.
> +                    streams.out.append(&chunk);
> +                    while i >= chunk.len() {
> +                        streams.out.append(&chunk);
> +                        // We can easily be asked to write *a lot* of data,
> +                        // so we need to check every so often if the pipe has been closed.
> +                        // If we didn't, running `string repeat -n LARGENUMBER foo | pv`
> +                        // and pressing ctrl-c seems to hang.
> +                        if streams.out.flush_and_check_error() != STATUS_CMD_OK.unwrap() {
> +                            return STATUS_CMD_ERROR;
> +                        }
> +                        i -= chunk.len();
> +                    }
> +                    chunk.clear();
> +                }
> +
> +                let Some(new_i) = i.checked_sub(w.len()) else {
> +                    break
> +                };
> +                i = new_i;

I don't think there's the possibility of overflow here because otherwise the C++ code  would have a possible infinite loop, no?
Can we keep the loop with `while i > 0`?

> diff --git a/fish-rust/src/builtins/string/shorten.rs b/fish-rust/src/builtins/string/shorten.rs
> new file mode 100644
> index 00000000000..17611c1df37
> --- /dev/null
> +++ b/fish-rust/src/builtins/string/shorten.rs
> @@ -0,0 +1,244 @@
> +use super::{
> +    escape_code_length, width_without_escapes, Arguments, Direction, StringError, StringSubCommand,
> +};
> +use crate::builtins::shared::{io_streams_t, STATUS_CMD_ERROR, STATUS_CMD_OK};
> +use crate::common::get_ellipsis_str;
> +use crate::ffi::parser_t;
> +use crate::wchar::{wstr, WString, L};
> +use crate::wchar_ext::WExt;
> +use crate::wcstringutil::fish_wcwidth_visible;
> +use crate::wcstringutil::split_string;
> +use crate::wgetopt::{wopt, woption, woption_argument_t::*};
> +use crate::wutil::fish_wcstol;
> +
> +pub struct Shorten<'args> {
> +    ellipsis: &'args wstr,
> +    ellipsis_width: usize,
> +    max: Option<usize>,
> +    no_newline: bool,
> +    quiet: bool,
> +    shorten_from: Direction,
> +}
> +
> +impl Default for Shorten<'_> {
> +    fn default() -> Self {
> +        Self {
> +            ellipsis: get_ellipsis_str(),
> +            ellipsis_width: width_without_escapes(get_ellipsis_str(), 0),
> +            max: None,
> +            no_newline: false,
> +            quiet: false,
> +            shorten_from: Direction::Right,
> +        }
> +    }
> +}
> +
> +impl<'args> StringSubCommand<'args> for Shorten<'args> {
> +    const LONG_OPTIONS: &'static [woption<'static>] = &[
> +        // FIXME: documentation says it's --char
> +        wopt(L!("chars"), required_argument, 'c'),

right. Maybe it's better to call it `--ellipsis` but that can be fixed later

> +        wopt(L!("max"), required_argument, 'm'),
> +        wopt(L!("no-newline"), no_argument, 'N'),
> +        wopt(L!("left"), no_argument, 'l'),
> +        wopt(L!("quiet"), no_argument, 'q'),
> +    ];
> +    const SHORT_OPTIONS: &'static wstr = L!(":c:m:Nlq");
> +
> +    fn parse_opt(&mut self, optarg: Option<&'args wstr>, c: char) -> Result<(), StringError> {
> +        match c {
> +            'c' => {
> +                self.ellipsis = optarg.unwrap();
> +                self.ellipsis_width = width_without_escapes(self.ellipsis, 0);
> +            }
> +            'm' => self.max = Some(fish_wcstol(optarg.unwrap())?.try_into()?),
> +            'N' => self.no_newline = true,
> +            'l' => self.shorten_from = Direction::Left,
> +            'q' => self.quiet = true,
> +            _ => return Err(StringError::UnknownOption),
> +        }
> +        return Ok(());
> +    }
> +
> +    fn handle(
> +        &mut self,
> +        _parser: &mut parser_t,
> +        streams: &mut io_streams_t,
> +        optind: &mut usize,
> +        args: &[&wstr],
> +    ) -> Option<libc::c_int> {
> +        let mut min_width = usize::MAX;
> +        let mut inputs = Vec::new();
> +
> +        let iter = Arguments::new(args, optind, streams, true);
> +
> +        if Some(0) == self.max {

not a fan of yoda condition

> +            // Special case: Max of 0 means no shortening.
> +            // This makes this more reusable, so you don't need special-cases like
> +            //
> +            // if test $shorten -gt 0
> +            //     string shorten -m $shorten whatever
> +            // else
> +            //     echo whatever
> +            // end
> +            for (arg, _) in iter {
> +                streams.out.append(arg);
> +                streams.out.append1('\n');
> +            }
> +            return STATUS_CMD_OK;
> +        }
> +
> +        for (arg, _) in iter {
> +            // Visible width only makes sense line-wise.
> +            // So either we have no-newlines (which means we shorten on the first newline),
> +            // or we handle the lines separately.
> +            let mut splits = split_string(&arg, '\n').into_iter();
> +            if self.no_newline && splits.len() > 1 {
> +                let mut s = match self.shorten_from {
> +                    Direction::Right => splits.next(),
> +                    Direction::Left => splits.last(),
> +                }
> +                .unwrap();
> +                s.push_utfstr(self.ellipsis);
> +                let width = width_without_escapes(&s, 0);
> +
> +                if width > 0 && width < min_width {
> +                    min_width = width;
> +                }
> +                inputs.push(s);
> +            } else {
> +                for s in splits {
> +                    let width = width_without_escapes(&s, 0);
> +                    if width > 0 && width < min_width {
> +                        min_width = width;
> +                    }
> +                    inputs.push(s);
> +                }
> +            }
> +        }
> +
> +        let ourmax: usize = self.max.unwrap_or(min_width);

nit regarding the earlier `// TODO: Can we have negative width` comment.
If easily possible, I'd have preferred putting the behavior fix commit first, so that such questions never arise.

> diff --git a/fish-rust/src/builtins/string/trim.rs b/fish-rust/src/builtins/string/trim.rs
> new file mode 100644
> index 00000000000..ed36dbbedd7
> --- /dev/null
> +++ b/fish-rust/src/builtins/string/trim.rs
> @@ -0,0 +1,99 @@
> +use super::{Arguments, StringError, StringSubCommand};
> +use crate::builtins::shared::{io_streams_t, STATUS_CMD_ERROR, STATUS_CMD_OK};
> +use crate::ffi::parser_t;
> +use crate::wchar::{wstr, L};
> +use crate::wchar_ext::WExt;
> +use crate::wgetopt::{wopt, woption, woption_argument_t::*};
> +
> +pub struct Trim<'args> {
> +    chars_to_trim: &'args wstr,
> +    left: bool,
> +    right: bool,
> +    quiet: bool,
> +}
> +
> +impl Default for Trim<'_> {
> +    fn default() -> Self {
> +        Self {
> +            // from " \f\n\r\t\v"
> +            chars_to_trim: L!(" \x0C\n\r\x09\x0B"),
> +            left: false,
> +            right: false,
> +            quiet: false,
> +        }
> +    }
> +}
> +
> +impl<'args> StringSubCommand<'args> for Trim<'args> {
> +    const LONG_OPTIONS: &'static [woption<'static>] = &[
> +        wopt(L!("chars"), required_argument, 'c'),
> +        wopt(L!("left"), no_argument, 'l'),
> +        wopt(L!("right"), no_argument, 'r'),
> +        wopt(L!("quiet"), no_argument, 'q'),
> +    ];
> +    const SHORT_OPTIONS: &'static wstr = L!(":c:lrq");
> +
> +    fn parse_opt(&mut self, optarg: Option<&'args wstr>, c: char) -> Result<(), StringError> {
> +        match c {
> +            'c' => self.chars_to_trim = optarg.unwrap(),
> +            'l' => self.left = true,
> +            'r' => self.right = true,
> +            'q' => self.quiet = true,
> +            _ => return Err(StringError::UnknownOption),
> +        }
> +        return Ok(());
> +    }
> +
> +    fn handle(
> +        &mut self,
> +        _parser: &mut parser_t,
> +        streams: &mut io_streams_t,
> +        optind: &mut usize,
> +        args: &[&wstr],
> +    ) -> Option<libc::c_int> {
> +        // If neither left or right is specified, we do both.
> +        if !self.left && !self.right {
> +            self.left = true;
> +            self.right = true;
> +        }
> +
> +        let mut ntrim = 0;
> +
> +        let to_trim_end = |str: &wstr| -> usize {
> +            str.chars()
> +                .rev()
> +                .take_while(|&c| self.chars_to_trim.contains(c))
> +                .count()
> +        };
> +
> +        let to_trim_start = |str: &wstr| -> usize {
> +            str.chars()
> +                .take_while(|&c| self.chars_to_trim.contains(c))
> +                .count()
> +        };

yeah I guess we don't have `trim_end_matches()` yet

> +
> +        for (arg, want_newline) in Arguments::new(args, optind, streams, true) {
> +            let trim_start = self.left.then(|| to_trim_start(&arg)).unwrap_or(0);
> +            // collision is only an issue if the whole string is getting trimmed
> +            let trim_end = (self.right && trim_start != arg.len())
> +                .then(|| to_trim_end(&arg))
> +                .unwrap_or(0);
> +
> +            ntrim += trim_start + trim_end;
> +            if !self.quiet {
> +                streams.out.append(&arg[trim_start..arg.len() - trim_end]);
> +                if want_newline {
> +                    streams.out.append1('\n');
> +                }
> +            } else if ntrim > 0 {
> +                return STATUS_CMD_OK;
> +            }
> +        }
> +
> +        if ntrim > 0 {
> +            STATUS_CMD_OK
> +        } else {
> +            STATUS_CMD_ERROR
> +        }
> +    }
> +}
> diff --git a/fish-rust/src/builtins/string/unescape.rs b/fish-rust/src/builtins/string/unescape.rs
> new file mode 100644
> index 00000000000..7f982848582
> --- /dev/null
> +++ b/fish-rust/src/builtins/string/unescape.rs
> @@ -0,0 +1,54 @@
> +use super::{Arguments, StringError, StringSubCommand};
> +use crate::builtins::shared::{io_streams_t, STATUS_CMD_ERROR, STATUS_CMD_OK};
> +use crate::common::{unescape_string, UnescapeStringStyle};
> +use crate::ffi::parser_t;
> +use crate::wchar::{wstr, L};
> +use crate::wgetopt::{wopt, woption, woption_argument_t::*};
> +
> +#[derive(Default)]
> +pub struct Unescape {
> +    no_quoted: bool,
> +    style: UnescapeStringStyle,
> +}
> +
> +impl StringSubCommand<'_> for Unescape {
> +    const LONG_OPTIONS: &'static [woption<'static>] = &[
> +        wopt(L!("no-quoted"), no_argument, 'q'),
> +        wopt(L!("style"), required_argument, '\u{1}'),
> +    ];
> +    const SHORT_OPTIONS: &'static wstr = L!(":q");
> +
> +    fn parse_opt(&mut self, optarg: Option<&wstr>, c: char) -> Result<(), StringError> {
> +        match c {
> +            'q' => self.no_quoted = true,
> +            '\u{1}' => self.style = optarg.unwrap().try_into()?,
> +            _ => return Err(StringError::UnknownOption),
> +        }
> +        return Ok(());
> +    }
> +
> +    fn handle(
> +        &mut self,
> +        _parser: &mut parser_t,
> +        streams: &mut io_streams_t,
> +        optind: &mut usize,
> +        args: &[&wstr],
> +    ) -> Option<libc::c_int> {
> +        let mut nesc = 0;
> +        for (arg, want_newline) in Arguments::new(args, optind, streams, true) {
> +            if let Some(res) = unescape_string(&arg, self.style) {
> +                streams.out.append(res);
> +                if want_newline {
> +                    streams.out.append1('\n');
> +                }
> +                nesc += 1;
> +            }
> +        }
> +
> +        if nesc > 0 {
> +            STATUS_CMD_OK
> +        } else {
> +            STATUS_CMD_ERROR
> +        }
> +    }
> +}
> diff --git a/fish-rust/src/builtins/tests/mod.rs b/fish-rust/src/builtins/tests/mod.rs
> index d718bc4f77c..b
4D1C
25db0384a0 100644
> --- a/fish-rust/src/builtins/tests/mod.rs
> +++ b/fish-rust/src/builtins/tests/mod.rs
> @@ -1 +1,2 @@
> +mod string_tests;
>  mod test_tests;
> diff --git a/fish-rust/src/builtins/tests/string_tests.rs b/fish-rust/src/builtins/tests/string_tests.rs
> new file mode 100644
> index 00000000000..5927246536c
> --- /dev/null
> +++ b/fish-rust/src/builtins/tests/string_tests.rs
> @@ -0,0 +1,305 @@
> +use crate::ffi_tests::add_test;
> +
> +add_test! {"test_string", || {
> +    use crate::ffi::parser_t;
> +    use crate::ffi;
> +    use crate::builtins::string::string;
> +    use crate::wchar_ffi::WCharFromFFI;
> +    use crate::common::{EscapeStringStyle, escape_string};
> +    use crate::wchar::wstr;
> +    use crate::wchar::L;
> +    use crate::builtins::shared::{STATUS_CMD_ERROR,STATUS_CMD_OK, STATUS_INVALID_ARGS};
> +
> +    use crate::future_feature_flags::{scoped_test, FeatureFlag};
> +
> +    // avoid 1.3k L!()'s
> +    macro_rules! test_cases {
> +        ([$($x:expr),*], $rc:expr, $out:expr) => { (vec![$(L!($x)),*], $rc, L!($out)) };
> +        [$($x:tt),* $(,)?] => { [$(test_cases!$x),*] };
> +    }
> +
> +    // TODO: these should be individual tests, not all in one, port when we can run these with `cargo test`
> +    fn string_test(mut args: Vec<&wstr>, expected_rc: Option<i32>, expected_out: &wstr) {
> +        let parser: &mut parser_t = unsafe { &mut *parser_t::principal_parser_ffi() };
> +        let mut streams = ffi::make_test_io_streams_ffi();
> +        let mut io = crate::builtins::shared::io_streams_t::new(streams.pin_mut());
> +
> +        let rc = string(parser, &mut io, args.as_mut_slice()).expect("string failed");
> +
> +        assert_eq!(expected_rc.unwrap(), rc, "string builtin returned unexpected return code");
> +
> +        let string_stream_contents = &ffi::get_test_output_ffi(&streams);
> +        let actual = escape_string(&string_stream_contents.from_ffi(), EscapeStringStyle::default());
> +        let expected = escape_string(expected_out, EscapeStringStyle::default());
> +        assert_eq!(expected, actual, "string builtin returned unexpected output");
> +    }
> +
> +    let tests = test_cases![
> +        (["string", "escape"], STATUS_CMD_ERROR, ""),
> +        (["string", "escape", ""], STATUS_CMD_OK, "''\n"),
> +        (["string", "escape", "-n", ""], STATUS_CMD_OK, "\n"),
> +        (["string", "escape", "a"], STATUS_CMD_OK, "a\n"),
> +        (["string", "escape", "\x07"], STATUS_CMD_OK, "\\cg\n"),
> +        (["string", "escape", "\"x\""], STATUS_CMD_OK, "'\"x\"'\n"),
> +        (["string", "escape", "hello world"], STATUS_CMD_OK, "'hello world'\n"),
> +        (["string", "escape", "-n", "hello world"], STATUS_CMD_OK, "hello\\ world\n"),
> +        (["string", "escape", "hello", "world"], STATUS_CMD_OK, "hello\nworld\n"),
> +        (["string", "escape", "-n", "~"], STATUS_CMD_OK, "\\~\n"),
> +
> +        (["string", "join"], STATUS_INVALID_ARGS, ""),
> +        (["string", "join", ""], STATUS_CMD_ERROR, ""),
> +        (["string", "join", "", "", "", ""], STATUS_CMD_OK, "\n"),
> +        (["string", "join", "", "a", "b", "c"], STATUS_CMD_OK, "abc\n"),
> +        (["string", "join", ".", "fishshell", "com"], STATUS_CMD_OK, "fishshell.com\n"),
> +        (["string", "join", "/", "usr"], STATUS_CMD_ERROR, "usr\n"),
> +        (["string", "join", "/", "usr", "local", "bin"], STATUS_CMD_OK, "usr/local/bin\n"),
> +        (["string", "join", "...", "3", "2", "1"], STATUS_CMD_OK, "3...2...1\n"),
> +        (["string", "join", "-q"], STATUS_INVALID_ARGS, ""),
> +        (["string", "join", "-q", "."], STATUS_CMD_ERROR, ""),
> +        (["string", "join", "-q", ".", "."], STATUS_CMD_ERROR, ""),
> +
> +        (["string", "length"], STATUS_CMD_ERROR, ""),
> +        (["string", "length", ""], STATUS_CMD_ERROR, "0\n"),
> +        (["string", "length", "", "", ""], STATUS_CMD_ERROR, "0\n0\n0\n"),
> +        (["string", "length", "a"], STATUS_CMD_OK, "1\n"),
> +
> +    // #if WCHAR_T_BITS > 16
> +        (["string", "length", "\u{2008A}"], STATUS_CMD_OK, "1\n"),
> +    // #endif

This #if we can drop because we have stronger guarantees now.
In Rust, we only deal with platform-specific wchar_t when converting to and from OS strings using mbrtowc which is not what we test here.

Copy link
Contributor Author
@henrikhorluck henrikhorluck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prr-comments got kinda mangled

while let Some(ec_pos) = ins.slice_from(pos).find_char('\x1B') {
pos += ec_pos;
if let Some(len) = escape_code_length(ins.slice_from(pos)) {
let sub = ins.slice_from(pos).slice_to(len);
Copy link
Contributor Author
@henrikhorluck henrikhorluck Jul 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for using bracket indexing for the sake of consistency

I agree, but then we should probably just remove slice_{to,from} from WExt

Suggested change
let sub = ins.slice_from(pos).slice_to(len);
let sub = &ins[pos..pos + len];

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but then we should probably just remove slice_{to,from} from WExt

yeah. I'm not sure what others think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logical distinction of slice_to and slice_from is that they operate on char counts, instead of byte counts.

In a hypothetical future where we try to switch to UTF-8, that distinction will become important. I worry that the bracket indexing will introduce bugs at that point. But I'm open to other ideas for how to handle that.

Copy link
Contributor Author
@henrikhorluck henrikhorluck Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If that's the case we should warn for all usage of wstr-slicing and indexing with a clippy lint. I think the point here is that we want to write code that's portable across string representations, and that's not necessarily that simple. I am very much in support of full unicode-support with graphemes and UTF8, so we should prepare for it.

But: this should have an API closer to slicing, e.g. like

fn slice_chars<I>(&self, index: I) -> Self
    where
        I: RangeBounds<usize> + SliceIndex<[char], Output = [char]>,

so we can still use slice-like operations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting; a clippy lint banning str slicing seems easy and robust enough

}
}

let ourmax: usize = self.max.unwrap_or(min_width);
Copy link
Contributor Author
@henrikhorluck henrikhorluck Jul 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit regarding the earlier // TODO: Can we have negative width comment.
If easily possible, I'd have preferred putting the behavior fix commit first, so that such questions never arise.

One issue here is that the earlier Rust code before the fix panicked with a capacity overflow, and backporting to C++ so you have to rebase with a C++ bugfix is weird, so you kinda need to squash the fix with the port, which is not ideal. I don't like this situation. In this case it was easier to fix the bug than port the bug

Copy link
Contributor
@krobelus krobelus Jul 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and backporting to C++ so you have to rebase with a C++ bugfix is weird, so you kinda need to squash the fix with the port, which is not ideal. I don't like this situation. In this case it was easier to fix the bug than port the bug

one recipe is to rebase, breaking before the rewrite to add a commit that fixes the bug in C++. Then revert that commit and continue the rebase. There are no conflicts because of the revert. Of course the bug fix needs to be applied again on the Rust side (before or after the rebase), and squashed (along with the revert; both need to be squashed into the rewrite commit to minimize noise)

Copy link
Contributor Author
@henrikhorluck henrikhorluck Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the conclusion here is to squash this all into one commit, I don't think this needs to be backported for a 3.6.2 release, as fixing the C++ code is not as trivial as #9900, and less important. If we want to backport this please inform me and make a PR towards master so I can rebase on that when I rebase for #9900

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any order is fine here (hence nit) but in general it's better to put "smaller" commits first because the larger ones are more likely to be changed further.
Squashing the behavior change into the port is not something we should do (but there is no need to right)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make it clear: I do not want to write a C++ version of the behavior fix. The initial porting-commit still has bugged behavior (but different, as in crashing!).

My understanding is that we do not want to backport the fix, which will leave the "broken" commit in place if we do not squash it. But as you are telling me, that is what we want? If so please mark this as resolved.

I'd have preferred putting the behavior fix commit first, so that such questions never arise.

I agree that we should ideally have the fix separately, but without writing a C++ version of the fix there is very little I can do to improve the commit ordering.

@krobelus
Copy link
Contributor

The prr-comments got kinda mangled

fixed

@henrikhorluck henrikhorluck force-pushed the riir/builtins/string branch 2 times, most recently from 6d37e91 to b9b840f Compare July 15, 2023 20:14
@ridiculousfish
Copy link
Member

FYI I plan on finishing reviewing this weekend.

Copy link
Member
@ridiculousfish ridiculousfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So once again this is an incredible amount of work and it looks overall fantastic. Thank you!!

Regarding splitting subcommands into separate modules: either seems fine to me, and in general I defer to the person doing the actual work. While this diverges from the C++ I understand why you chose to do it (string is pretty weird) and I'm very pleased with the new structure.

Regarding bracket-slicing [start..] vs slice_from(start): I'm not going to be a stickler about the difference for now. If/when we later move to UTF-8 we will need to audit all the slices; the fact that this can be done with a custom clippy lint rule is really good info (I didn't know you could do that!) so no reason to sweat it now. Same story for len() / char_count().

I spotted some small bugs, see my comments.

The biggest missing piece here is localization - a lot of the C++ error messages were wrapped in _(...) to plug them into wgettext and much of that has been lost. You can fix that now, or defer it and I or others can go back and add them after merge.

I don't expect to have any further comments after the small bugs are fixed. Once again I think is awesome.

}
}
} else {
let n = arg.len();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of a place we'd have to audit and fix should we switch to UTF-8 strings, since string length gets the number of chars, not number of bytes. You can switch to char_count() now, or defer it, since we really will do that audit.


impl StringSubCommand<'_> for Unescape {
const LONG_OPTIONS: &'static [woption<'static>] = &[
wopt(L!("no-quoted"), no_argument, 'q'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no-quoted has short option 'n'. Distressing that this wasn't caught by the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is rather a fault in that no-quoted does not mean anything for unescape — it did nothing and does nothing, the NO_QUOTED flag is not defined for unescapeflags

pub struct UnescapeFlags: u32 {
. It should probably rather be removed, but that is a potentially breaking change.

It is also not documented that it should support no-quoted https://fishshell.com/docs/current/cmds/string-unescape.html#synopsis

Copy link
Contributor Author
@henrikhorluck henrikhorluck Jul 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this so it is the correct n, but left the option still there

@henrikhorluck henrikhorluck force-pushed the riir/builtins/string branch 4 times, most recently from fefcb4e to d8b406d Compare July 24, 2023 07:09
- Add test to verify piped string replace exit code

Ensure fields parsing error messages are the same.

Note: C++ relied upon the value of the parsed value even when `errno` was set,
that is defined behaviour we should not rely on, and cannot easilt be replicated from Rust.
Therefore the Rust version will change the following error behaviour from:

```shell
> string split --fields=a "" abc
string split: Invalid fields value 'a'
> string split --fields=1a "" abc
string split: 1a: invalid integer
```

To:

```shell
> string split --fields=a "" abc
string split: a: invalid integer
> string split --fields=1a "" abc
string split: 1a: invalid integer
```
Padding with an unprintable character is now disallowed, like it was for other
zero-length characters.

`string shorten` now ignores escape sequences and non-printable characters
when calculating the visible width of the ellipsis used (except for `\b`,
which is treated as a width of -1).
Previously `fish_wcswidth` returned a length of -1 when the ellipsis-str
contained any non-printable character, causing the command to poentially
print a larger width than expected.

This also fixes an integer overflows in `string shorten`'s
`max` and `max2`, when the cumulative sum of character widths turned negative
(e.g. with any non-printable characters, or `\b` after the changes above).
The overflow potentially caused strings containing non-printable characters
to be truncated.

This adds test that verify the fixed behaviour.
@henrikhorluck henrikhorluck force-pushed the riir/builtins/string branch from d8b406d to a8d6806 Compare July 24, 2023 07:25
Copy link
Contributor Author
@henrikhorluck henrikhorluck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now rebased and squashed. Also reduced the number of import statements. There are some minor nits still open:

  • whether we want to backport the non-printable-fixes.
  • For the UTF8-compat-changes I would much rather do that separately.
  • I still think the string-translation-stuff is messy, but I can't do anything about it until we have an xgettext-like thing working for Rust code

There are a lot of TODO's that are better suited for later PR's, this should be mergeable as is

self.fields = arg.unwrap().try_into().map_err(|e| match e {
FieldParseError::Number => StringError::NotANumber,
FieldParseError::Range => {
invalid_args!("%ls: Invalid range value for field '%ls'\n", name, arg)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should now all support translation — how we extract the translation strings I leave as a future work as I don't know whether that will happen before/after macro-expansion

streams.err.append(wgettext_fmt!(
BUILTIN_ERR_COMBO2,
cmd,
wgettext!("--invert and --groups-only are mutually exclusive")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These also now have translation support added back

@ridiculousfish ridiculousfish merged commit 2ec3633 into fish-shell:master Jul 28, 2023
@ridiculousfish
Copy link
Member

Merged. What can I say, except thank you.

@henrikhorluck henrikhorluck deleted the riir/builtins/string branch July 28, 2023 08:13
@henrikhorluck henrikhorluck mentioned this pull request Aug 14, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 20, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants

0