Bug #21783
open{Method,UnboundMethod,Proc}#source_location returns columns in bytes and not in characters
Description
The documentation says:
= Proc.source_location
(from ruby core)
------------------------------------------------------------------------
prc.source_location -> [String, Integer, Integer, Integer, Integer]
------------------------------------------------------------------------
Returns the location where the Proc was defined. The returned Array
contains:
(1) the Ruby source filename
(2) the line number where the definition starts
(3) the column number where the definition starts
(4) the line number where the definition ends
(5) the column number where the definitions ends
This method will return nil if the Proc was not defined in Ruby (i.e.
native).
So it talks about column numbers, so it should be a number of characters and not of bytes.
But currently it's a number of bytes:
$ ruby --parser=prism -ve 'def été; end; p method(:été).source_location'
ruby 4.0.0dev (2025-12-14T07:11:02Z master 711d14992e) +PRISM [x86_64-linux]
["-e", 1, 0, 1, 14]
$ ruby --parser=parse.y -ve 'def été; end; p method(:été).source_location'
ruby 4.0.0dev (2025-12-14T07:11:02Z master 711d14992e) [x86_64-linux]
["-e", 1, 0, 1, 14]
The last number should be 12 because "def été; end".size is 12 characters.
This is a Ruby-level API so I would never expect "byte columns" here, I think it's clear it should be a number of "editor columns" i.e. a number of characters.
Updated by Eregon (Benoit Daloze) 3 days ago
- Description updated (diff)
Updated by Eregon (Benoit Daloze) 3 days ago
- Related to Feature #21005: Update the source location method to include line start/stop and column start/stop details added
Updated by Eregon (Benoit Daloze) 3 days ago
- Related to Feature #6012: Proc#source_location also return the column added
Updated by kddnewton (Kevin Newton) 3 days ago
I think this is a documentation issue, as both parsers/compilers operate in terms of bytes. Changing this to characters would likely be a noticeable difference in speed, and quite a bit of code change. (Either both parsers/compilers would have to do this work initially, as that's where the numbers come from, or the source_location function would have to re-parse the source, which is not possible in some cases.) All of that is to say, please do not change this, it will be a ton of work for minimal benefit.
Updated by Eregon (Benoit Daloze) 3 days ago
Updating the docs is one solution, so at least it's consistent between docs and behavior.
I think as a Ruby-facing API it's weird that it operates in terms of bytes (and source_location does not have a byte prefix to indicate that).
I think most programmers when they hear line 4 column 6 they expect the 6th character on the 4th line, not the character starting at the 6th byte (actually hard to find in an editor, most editors don't show "byte columns", in fact it's not even possible to place the cursor at some byte positions, every programmer always think in characters when looking at source code).
For example, one might expect that highlighting with ^ based on the return values from source_location works, but it doesn't:
def underline(callable)
file, start_line, start_column, end_line, end_column = callable.source_location
raise unless start_line == end_line
source = File.readlines(file)[start_line-1]
puts source
puts ' '*start_column + '^'*(end_column-start_column)
end
my_proc = proc { ascii-only }
underline my_proc
my_proc = proc { il était une fois un été }
underline my_proc
gives
$ ruby underline.rb
my_proc = proc { ascii-only }
^^^^^^^^^^^^^^
my_proc = proc { il était une fois un été }
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Either both parsers/compilers would have to do this work initially, as that's where the numbers come from, or the source_location function would have to re-parse the source, which is not possible in some cases.
This is a good point, I didn't realize that.
I think it would still be worth it to change the parsers/compilers to compute the proper character column for literal lambdas, blocks and methods, and probably wouldn't be very expensive given most source files are ASCII-only and potentially the parsers could even use the knowledge that a given line is ASCII-only so it would still be as fast even if the file contains a few non-ASCII characters.
If columns would e.g. appear in error messages, I think everyone would expect them to be character columns, not byte columns.
For example gcc shows character columns, as one would expect:
int main() {
/* été */ notexist
}
gcc test.c
test.c: In function ‘main’:
test.c:2:15: error: ‘notexist’ undeclared (first use in this function)
2 | /* été */ notexist
| ^~~~~~~~
Note it's 2:15 (i.e. character columns), not 2:17 (byte columns).
The highlighting also needs to use character columns of course.
Updated by Eregon (Benoit Daloze) 3 days ago
From https://bugs.ruby-lang.org/issues/6012#note-25 @matz (Yukihiro Matsumoto) said adding column was OK, but not byte offsets.
I'm not sure what were his reasons, but maybe it's that byte offsets are too low-level for source_location?
If so, I would think byte columns are also too low level and it should be character columns instead.
From a user POV character columns seem better and more expected.
OTOH, I understand the reservation from @kddnewton (Kevin Newton) and I share it as a Ruby implementer, it's much simpler to return byte columns.
For example in TruffleRuby we currently save location information by having int32_t start_offset; int32_t length; in every Truffle AST node, i.e. byte offset and byte length.
Returning byte columns from that is easy and only requires the "newline offsets" array, and not the actual source code.
To return character columns, TruffleRuby would need to read from the beginning of the line to the byte offset to find how many characters that is, and keep the source code in memory (currently TruffleRuby does keep it in memory, but it might not in the future).
I have also seen this in the context of adding Prism.node_for and for that usage having byte columns is actually easier than character columns, OTOH it's not hard to convert from character columns to byte columns in that case and I already wrote the logic for that (because I expected source_location would return character columns, even before reading the docs).
It is of course possible to convert from character column to byte column and vice versa, but it requires access to the source code, which is not always available (e.g. eval).
Updated by kddnewton (Kevin Newton) 3 days ago
Honestly if we're interpreting column as something visual like you're implying, we're also going to run into issues with grapheme clusters and east asian width and all the other implications for whatever "character" actually means. I think we would also have to return the encoding of the source file inside that array in order for it to make any sense.
Updated by matz (Yukihiro Matsumoto) 3 days ago
I'd like to cancel source_location to have column information in 4.0, due to this concern. In my personal opinion, I am leaning toward byte index, though.
Matz.
Updated by Eregon (Benoit Daloze) 2 days ago
matz (Yukihiro Matsumoto) wrote in #note-8:
I'd like to cancel
source_locationto have column information in 4.0, due to this concern.Matz.
Thank you for the quick reply, I think that would be the worst outcome though, https://bugs.ruby-lang.org/issues/6012 was already opened 14 years ago and I have seen multiple users needing this in the last years.
It is one of the features I'm most looking forward to in Ruby 4.0 (in fact it's one of two features that really interests me in 4.0: that and Ractor improvements, the rest looks rather unexciting to me).
IOW, I would much rather have byte columns in 4.0 than no columns at all.
If we delay this, we'll implicitly tell people that using RubyVM::AbstractSyntaxTree is the only way to get column information, and that's bad because it only works on CRuby and it's not a proper API.
Alternative Ruby implementations might have to define their own API to get column information, vs just using the one we have agreed on in #6012 (I'd much rather not get there).
In my personal opinion, I am leaning toward byte index, though.
Let's go with byte columns then? I can make a PR to document that.
It seems you and @kddnewton (Kevin Newton) agree on that, and I'm basically hesitating which is best, but definitely better byte columns than columns.
I see it has already been reverted though :/ https://github.com/ruby/ruby/commit/065c48cdf11a1c4cece84db44ed8624d294f8fd5
Updated by byroot (Jean Boussier) 2 days ago
Maybe I'm totally off, but I expect this data to be used to extract the source code, e.g show a snippet of code in an error message, or something akin to that, hence byte offsets seem actually more convenient? (and performant).
But yes, it definitely need to be explicitly documented.
Updated by Eregon (Benoit Daloze) 2 days ago
· Edited
I made a PR to re-add {Method,UnboundMethod,Proc}#source_location and fix all known issues: https://github.com/ruby/ruby/pull/15580
@matz (Yukihiro Matsumoto) Would it be OK to merge it? 🙏
For context I opened this issue because I was surprised at the semantics and it was documented in character columns.
If the documentation stated the columns are in bytes I would have thought "somewhat unexpected for me, but I can deal with it, moving on".
So for the next person, if they look at the docs it should be clear now with this PR.
I would have never expected this issue to revert the feature, this is certainly not what I want.
I opened this issue to show the doc & implementation inconsistency, explain my expectations, and discuss what fix makes sense.
I think @kddnewton (Kevin Newton) makes a good point about grapheme clusters and east asian width, where even character width is not enough.
And it's probably not reasonable to ask parsers to handle those cases either.
So I now think byte columns is a good choice, as long as it's properly documented.
I think this is a case of we shouldn't let perfection (it's not really achievable here) get in the way of usefulness, from the saying Perfect is the enemy of good.
BTW using a variant of my C example from above with different compilers shows some variety:
int main() {
char* s = "🎉🎉🎉"; oops
}
$ gcc test.c
test.c: In function ‘main’:
test.c:2:25: error: ‘oops’ undeclared (first use in this function)
2 | char* s = "🎉🎉🎉"; oops
| ^~~~
$ clang test.c
test.c:2:31: error: use of undeclared identifier 'oops'
2 | char* s = "🎉🎉🎉"; oops
| ^
gcc shows column 25 which corresponds to nothing, ' char* s = "🎉🎉🎉"; ' is 21 characters and 30 bytes.
clang shows the column in bytes, so there is clearly some variety there, and at least clang chose to show byte columns.
Updated by Eregon (Benoit Daloze) 2 days ago
byroot (Jean Boussier) wrote in #note-10:
Maybe I'm totally off, but I expect this data to be used to extract the source code, e.g show a snippet of code in an error message, or something akin to that, hence byte offsets seem actually more convenient? (and performant).
Yes, for that case which might indeed be the most common one, it's more convenient to have the columns in bytes.
And I suspect many of these cases would likely use Prism and specifically Prism.node_for to get more information, such as showing the method call to which the block was given, highlight some specific part of the block or method, etc.
Updated by mame (Yusuke Endoh) 2 days ago
Eregon (Benoit Daloze) wrote in #note-12:
And I suspect many of these cases would likely use
Prismand specifically Prism.node_for to get more information, such as showing the method call to which the block was given, highlight some specific part of the block or method, etc.
I agree. But then, when would Proc#source_location be useful?
Updated by Eregon (Benoit Daloze) 2 days ago
mame (Yusuke Endoh) wrote in #note-13:
I agree. But then, when would
Proc#source_locationbe useful?
There are cases where it's useful to show just the block, i.e. { block body }.
And also to be able to use Prism.node_for one of course needs Proc#source_location.
Maybe you're saying Proc#source_location should always include the method call? I think it's good as-is and does not need changing (a little bit discussed in #21784).
Updated by byroot (Jean Boussier) 2 days ago
Isn't Proc.source_location very useful for things akin to power_assert?
If it was easy to extract a proc's source code I'd certainly integrate it in testing framework to improve failure rendering.
Updated by Eregon (Benoit Daloze) 2 days ago
Yes, it'd be useful for e.g. assert_raise(SomeException) { ... } in test-unit, expect { ... }.to in RSpec, -> { ... }.should raise_error(...) in MSpec, etc.
For example -> { 13 - "10" }.should raise_error(ArgumentError) in MSpec currently gives:
1)
Integer#- fixnum raises a TypeError when given a non-Integer ERROR
Expected ArgumentError
but got: TypeError (String can't be coerced into Integer)
/home/eregon/code/rubyspec/core/integer/minus_spec.rb:21:in 'Integer#-'
But it could show something like:
1)
Integer#- fixnum raises a TypeError when given a non-Integer ERROR
Expected ArgumentError
but got: TypeError (String can't be coerced into Integer)
For expression `13 - "10"`
/home/eregon/code/rubyspec/core/integer/minus_spec.rb:21:in 'Integer#-'
Updated by mame (Yusuke Endoh) 2 days ago
· Edited
And also to be able to use
Prism.node_forone of course needsProc#source_location.
Huh? Proc already has its node_id (via its ISeq), so Prism.node_for is implementable without source_location.
In fact, by doing the following, you can obtain the Prism node for Proc.
f = -> {}
require "prism"
node_id = RubyVM::InstructionSequence.of(f).to_a[4][:node_id]
pp Prism.parse_file(__FILE__).value.breadth_first_search { |node| node.node_id == node_id }
#=> @ LambdaNode (location: (1,4)-(1,9))
# ├── flags: ∅
# ├── locals: []
# ├── operator_loc: (1,4)-(1,6) = "->"
# ├── opening_loc: (1,7)-(1,8) = "{"
# ├── closing_loc: (1,8)-(1,9) = "}"
# ├── parameters: ∅
# └── body: ∅
Note that I don't say users should write RubyVM::InstructionSequence.of(-> {}).to_a[4][:node_id] manually. It should be hidden within Prism.node_for. This is what RubyVM::AbstractSyntaxTree.of does actually.
I want to say that source_location is unnecessary to implement Prism.node_for.
Regarding to power_assert, I talked with @ktsj (Kazuki Tsujimoto), and he said that Proc#source_location is not more useful than Kernel#caller_locations. Since power_assert pinpoints expressions within a block at a finer granularity, rough block's location information isn't particulary useful.
Updated by byroot (Jean Boussier) 2 days ago
Regarding to power_assert
So it's great power_assert has much more advanced parsing capabilities, but I for one would use the column information for things like Rails' assert_difference -> { User.count }, +1, as it would be quite trivial with column info in source_location.
Updated by mame (Yusuke Endoh) 2 days ago
· Edited
I have the following hypothesis: When people want column information, what they really need is an AST. #21005 was precisely such an example.
In some cases, column information alone may suffice, but obtaining the AST inevitably yields column information. #21784 discusses whether Proc#source_location should include the position of -> or not, but if the user has the AST, they can choose to include or exclude the range containing -> as desired.
Updated by Eregon (Benoit Daloze) 2 days ago
· Edited
mame (Yusuke Endoh) wrote in #note-17:
Huh?
Procalready has itsnode_id(via its ISeq), soPrism.node_foris implementable withoutsource_location.
I want to say thatsource_locationis unnecessary to implementPrism.node_for.
No, node_id is CRuby-specific and this code to extract it is a hack.
source_location with column information has the proper information, is a proper API, is portable and avoids that hack.
I think it would be unacceptable to have any functionality in Prism depending on RubyVM for example, we need to find general solutions, not CRuby-specific hacks.
FWIW I also consider the concept of node_id CRuby-specific, at least currently.
TruffleRuby doesn't have node_id for instance, but it has columns & byte offsets and those are universal concepts.
I have the following hypothesis: When people want column information, what they really need is an AST.
Yes, probably true for a lot of cases.
But such column information (or byte offsets from the start of line/source, any of these works) is also necessary to locate the right node in the AST, just the start_line like currently is insufficient (e.g. there can be multiple blocks on a single line).
For other cases it's unnecessary to have the AST, e.g. to print the source of a method, e.g. for RDoc or similar, i.e. the method_source use case (700 millions downloads, https://rubygems.org/gems/method_source).
Since power_assert pinpoints expressions within a block at a finer granularity, rough block's location information isn't particulary useful.
Then power_assert is likely incorrect/broken if there are multiple assert { ... } block on a single line.
It could identify the correct block with column information.
Updated by Eregon (Benoit Daloze) 2 days ago
· Edited
Indeed, power_assert is broken with 2 blocks on the same line:
require 'test/unit'
class FooTest < Test::Unit::TestCase
def test_foo
assert { 3.times.to_a.include?(2) }; assert { 3.times.to_a.include?(3) }
end
end
gives
==============================================================================================================================
Failure: test_foo(FooTest):
assert { 3.times.to_a.include?(2) }; assert { 3.times.to_a.include?(3) }
| | |
| | false
| [0, 1, 2]
#<Enumerator: 3:times>
pa.rb:5:in 'FooTest#test_foo'
2:
3: class FooTest < Test::Unit::TestCase
4: def test_foo
=> 5: assert { 3.times.to_a.include?(2) }; assert { 3.times.to_a.include?(3) }
6: end
7: end
==============================================================================================================================
So power_assert shows the wrong block and shows confusing information ([0, 1, 2].include?(2) is true not false).
Column information from Proc#source_location would fix this by allowing power_assert to find the correct block.
NOTE: I'm using ruby 3.4.7 because power_assert seems to not do anything for that case with ruby-master currently.
Updated by mame (Yusuke Endoh) 1 day ago
· Edited
Eregon (Benoit Daloze) wrote in #note-20:
FWIW I also consider the concept of
node_idCRuby-specific, at least currently.
Since Prism::Node#node_id exists, I don't think it's CRuby-specific anymore.
It's true that the part retrieving the node_id from a Proc is CRuby-specific, and that design requires some consideration.
A simple solution would be for Ruby to provide Proc#node_id or Method#node_id, but we need to discuss.
source_location does not uniquely identify a node.
For example, code x is represented as an AST with three nodes: ProgramNode, StatementsNode, and CallNode.
They all have the same source_location, so cannot be distinguished.
Whether TruffleRuby implements node_id is up to you, but I believe this is the consistnt approach in CRuby.
Since power_assert pinpoints expressions within a block at a finer granularity, rough block's location information isn't particulary useful.
Then
power_assertis likely incorrect/broken if there are multipleassert { ... }block on a single line.
@ktsj (Kazuki Tsujimoto) said that assert { condition ? expected == actual : expected == actual } cannot be distinguished at present.
Proc#source_location is not helpful to resolve this.
I think that what is needed is an API like RubyVM::AST.node_id_for_tracepoint to retrieve the node_id from a TracePoint.
Updated by Eregon (Benoit Daloze) 1 day ago
· Edited
mame (Yusuke Endoh) wrote in #note-22:
source_locationdoes not uniquely identify a node.
For example, codexis represented as an AST with three nodes:ProgramNode,StatementsNode, andCallNode.
They all have the samesource_location, so cannot be distinguished.
True but that's not a problem for Prism.node_for(Method | UnboundMethod | Proc) because it returns DefNode | LambdaNode | CallNode | ForNode,
i.e. we choose the relevant node to return even if multiple nodes would have the same source location.
And in general the user could choose that e.g. they want the outermost node, or a node of a specific node class.
mame (Yusuke Endoh) wrote in #note-22:
@ktsj (Kazuki Tsujimoto) said that
assert { condition ? expected == actual : expected == actual }cannot be distinguished at present.
That seems a separate and unrelated problem to having assert { ... }; assert { ... }.
(BTW no multi-line block seems currently a big limitation for power_assert)
For assert { ... }; assert { ... }, I'm convinced source_location with column information is helpful and a good portable solution.
Is there any chance to merge https://github.com/ruby/ruby/pull/15580 before the 4.0.0 release?
If not, it will establish a precedent of "asking clarification in an issue can result in removing a feature that has been on master for almost a year".
I think that's a bad precedent to set, i.e. to remove features implemented for a long time at the last minute with little evidence of a problem and not even discussing before reverting.
It's disrespectful of gems & users which have adapted to the changes.
Especially after discussing the feature for years, it feels like a betrayal to remove it now.
I think it might drive people away from Ruby.
Concretely I don't see much the point to discuss features on this tracker if they can be removed in such a way.
The only good out of that I could see is if somehow that results in adding Ruby::CodeLocation or so with an API like this in 4.1.
Or even in 4.0.X given it's a compatible change and CRuby doesn't follow SemVer anyway, so the 3rd component is a mix of SemVer minor/patch.
That would then be more flexible and e.g. expose columns, offsets, etc by having various method, and remove the need for things like RubyVM::InstructionSequence#script_lines by having a proper replacement.
In general the goal here is to avoid RubyVM since that's CRuby-only, experimental, unstable and shouldn't be used anyway.
The documentation of RubyVM makes it clear it's only for debugging, prototyping, and research and clearly it has been abused over and over, because there is no proper API for some things like column information.
After years of discussions to make some progress in this area with source_location column information, it's disheartening to see it reverted at the last minute.