8000 Add a `TimeUUID` class · Issue #231 · scylladb/python-driver · GitHub
[go: up one dir, main page]

Skip to content

Add a TimeUUID class #231

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cvybhu opened this issue Jun 23, 2023 · 15 comments
Closed

Add a TimeUUID class #231

cvybhu opened this issue Jun 23, 2023 · 15 comments
Assignees

Comments

@cvybhu
Copy link
cvybhu commented Jun 23, 2023

timeuuid is a UUID v1 - it contains a timestamp and some random bits that together form a unique identifier.
Their comparison operator should compare them primarily based on this timestamp.

timeuuid values are currently represented using uuid.UUID, but the comparison operators of uuid.UUID don't compare timeuuid values in the same way that Cassandra/Scylla does.

For example, with values 00000257-0efc-11ee-9547-00006490e9a6 and fed35080-0efb-11ee-a1ca-00006490e9a4:

Cassandra believes that fed35080-0efb-11ee-a1ca-00006490e9a4 is smaller:

CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1};
CREATE TABLE ks.tab (p int, c timeuuid, PRIMARY KEY (p, c));
INSERT INTO ks.tab (p, c) VALUES (0, 00000257-0efc-11ee-9547-00006490e9a6);
INSERT INTO ks.tab (p, c) VALUES (0, fed35080-0efb-11ee-a1ca-00006490e9a4);
SELECT p, c FROM ks.tab;
 p | c
---+--------------------------------------
 0 | fed35080-0efb-11ee-a1ca-00006490e9a4
 0 | 00000257-0efc-11ee-9547-00006490e9a6

(2 rows)

Because it has a lower timestamp value:

 timeuuid_col                         | system.totimestamp(timeuuid_col)
--------------------------------------+----------------------------------
 fed35080-0efb-11ee-a1ca-00006490e9a4 |  2023-06-19 23:49:56.990000+0000
 00000257-0efc-11ee-9547-00006490e9a6 |  2023-06-19 23:49:58.961000+0000

But UUID comparison says the opposite, as it compares the bytes in lexicographical order:

>>> import uuid
>>> a = uuid.UUID('fed35080-0efb-11ee-a1ca-00006490e9a4')
>>> b = uuid.UUID('00000257-0efc-11ee-9547-00006490e9a6')
>>> a < b
False
>>> b < a
True

It would be useful to have a class that represents timeuuid values and has the same semantics as the values in Cassandra.

Here's an example implementation that I wrote, based on the cassandra implementation:

https://gist.github.com/cvybhu/ed5b64d8b62eff51dc46258157a92e41

Ideally python driver would return values of this type when some row is selected, but this would be a breaking change.

@cvybhu
Copy link
Author
cvybhu commented Jun 23, 2023

I tried to use the existing python libraries, but it looks like we can't use any of them.

time_uuid does an unsigned comparison (In Cassandra 8080... is smaller than 0000 because it does signed comparison):

>>> import time_uuid
>>> a = time_uuid.TimeUUID('00000000-0000-1000-0000-000000000000')
>>> b = time_uuid.TimeUUID('00000000-0000-1000-8080-808080808080')
>>> a < b
True

python-timeuuid doesn't work with Python3:

root@dfc90c4a88f5:/# pip install python-timeuuid
Collecting python-timeuuid
  Using cached python-timeuuid-0.3.5.tar.gz (27 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [7 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-ptk7ypbx/python-timeuuid_4457d3d65d544d868b98787f21893a11/setup.py", line 20
          print 'Building with Cython %s' % cython_version
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

@mykaul
Copy link
mykaul commented Jun 23, 2023

How does Cassandra's driver work, btw? Can you compare?

@cvybhu
Copy link
Author
cvybhu commented Jun 23, 2023

How does Cassandra's driver work, btw? Can you compare?

Same thing.

Looking at the code, they don't have any TimeUUID class: https://github.com/search?q=repo%3Adatastax%2Fpython-driver%20timeuuid&type=code

Here's a small test case:

# cassandra-driver==3.28.0
import cassandra
from cassandra.cluster import Cluster

s = Cluster(["127.0.0.1"]).connect()
s.execute("DROP KEYSPACE IF EXISTS ks")
s.execute("CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1}")
s.execute("CREATE TABLE ks.t (p int, t timeuuid, PRIMARY KEY (p, t))")

queries = [
"INSERT INTO ks.t (p, t) VALUES (0, 00000000-0000-1000-0000-000000000000);",
"INSERT INTO ks.t (p, t) VALUES (0, 00000000-0000-1000-8080-808080808080);",
"INSERT INTO ks.t (p, t) VALUES (0, 00000000-0000-1000-7f7f-7f7f7f7f7f7f);",
"INSERT INTO ks.t (p, t) VALUES (0, 00000000-0000-1000-f7f7-f7f7f7f7f7f7);",
"INSERT INTO ks.t (p, t) VALUES (0, 00000000-0000-1000-ffff-ffffffffffff);",
"INSERT INTO ks.t (p, t) VALUES (0, 00000000-0000-1000-a1ca-00006490e9a4);"]

for q in queries:
    s.execute(q)

rows = list(s.execute("SELECT t FROM ks.t"))

print("rows:", rows)

uuids = [r[0] for r in rows]
print("\nuuids:", uuids)

sorted_uuids = sorted(uuids)
print("\nsorted_uuids:", sorted_uuids)

Output:

rows: [Row(t=UUID('00000000-0000-1000-8080-808080808080')), Row(t=UUID('00000000-0000-1000-a1ca-00006490e9a4')), Row(t=UUID('00000000-0000-1000-f7f7-f7f7f7f7f7f7')), Row(t=UUID('00000000-0000-1000-ffff-ffffffffffff')), Row(t=UUID('00000000-0000-1000-0000-000000000000')), Row(t=UUID('00000000-0000-1000-7f7f-7f7f7f7f7f7f'))]

uuids: [UUID('00000000-0000-1000-8080-808080808080'), UUID('00000000-0000-1000-a1ca-00006490e9a4'), UUID('00000000-0000-1000-f7f7-f7f7f7f7f7f7'), UUID('00000000-0000-1000-ffff-ffffffffffff'), UUID('00000000-0000-1000-0000-000000000000'), UUID('00000000-0000-1000-7f7f-7f7f7f7f7f7f')]

sorted_uuids: [UUID('00000000-0000-1000-0000-000000000000'), UUID('00000000-0000-1000-7f7f-7f7f7f7f7f7f'), UUID('00000000-0000-1000-8080-808080808080'), UUID('00000000-0000-1000-a1ca-00006490e9a4'), UUID('00000000-0000-1000-f7f7-f7f7f7f7f7f7'), UUID('00000000-0000-1000-ffff-ffffffffffff')]

@mykaul
Copy link
mykaul commented Jun 23, 2023

We need to be 'Cassandra-compatible' in that sense, even bug compatible, or be able to turn on/off different behavior. @nyh - thoughts on the above?

@cvybhu
Copy link
Author
cvybhu commented Jun 23, 2023

We can provide the wrapper as a separate class without converting the rows automatically to TimeUUID.
The users would manually convert their UUID values to TimeUUID when needed.

@nyh
Copy link
nyh commented Jun 23, 2023

We need to be 'Cassandra-compatible' in that sense, even bug compatible, or be able to turn on/off different behavior. @nyh - thoughts on the above?

I think it's a mistake to change the meaning of existing classes, it can break existing applications. In this sense, it's fine to add a new timeuuid class. Another alternative is to declare the existing situation a bug, i.e., a real application might get confused by the wrong "<" operator. If it's a bug, the it's fine to change the existing classes. But to be honest, I'm not sure that any real application tries to use <... An application typically trusts Scylla to return the correct order, and not in the habit of checking the order of the results.

@cvybhu
Copy link
Author
cvybhu commented Jun 23, 2023

A short description of how timeuuid comparison works, to make reviewing code easier.

A timeuuid is a UUID v1 value, which contains a timestamp and some other data:

image

The first 64 bits contain a 60 bit timestamp and 4 bit version

Each hex character is 4 bits, the 1 denotes the version:

00000000-0000-1000-0000-000000000000

This character will always be 1, valid timeuuids are always UUID v1.

The first 3 parts is the timestamp (+ version which is always 1).
For example in fed35080-0efb-11ee-a1ca-00006490e9a4, the fed35080-0efb-11ee represents timestamp 2023-06-19 23:49:56.990000+0000 (and version 1)

The remaining 64 bits have some special meaning in the standard, but Cassandra/Scylla just treats them like random data.

Cassandra compares timeuuid values the following way:

  1. Take the timestamp (60 bit integer), and compare them. If one is smaller, this value is smaller.
  2. Interpret the remaining 64 bits as a 64 bit signed integer, compare those integers.

https://github.com/apache/cassandra/blob/ae537abc6494564d7254a2126465522d86b44c1e/src/java/org/apache/cassandra/utils/TimeUUID.java

@fruch
Copy link
fruch commented Jun 25, 2023

I would raise is on the upstream drivers as well, as a bug. (and can open this PR on upstream as well)

as a user I would expect the object returned by the driver to do the right ordering.

@cvybhu
Copy link
Author
cvybhu commented Jun 26, 2023

I would raise is on the upstream drivers as well, as a bug. (and can open this PR on upstream as well)
as a user I would expect the object returned by the driver to do the right ordering.

Opened https://datastax-oss.atlassian.net/browse/PYTHON-1358

@nyh
Copy link
nyh commented Jun 27, 2023

@cvybhu did you notice that the Python driver already has datetime_from_uuid1() in cassandra.util?

This can be used to convert timeuuid to time and compare it like time, without needing a new class, although I admit it doesn't give you all the power of comparing all the extra bits just like Cassandra does (but does anyone need this extra power?)

There's also a delicate issue of timezone but it's not important if all you want is to compare the order.
I used this function when I translated the Cassandra tests: cassandra_tests/validation/entities/timeuuid_test.py

@cvybhu
Copy link
Author
cvybhu commented Jun 27, 2023

@cvybhu did you notice that the Python driver already has datetime_from_uuid1() in cassandra.util?

uuid.UUID already has a time attribute, which returns the integer timestamp, so it's the easiest to just use it:

>>> import uuid
>>> u = uuid.uuid1()
>>> u.time
139071763117282130
>>> 

@fruch
Copy link
fruch commented Oct 19, 2023

@cvybhu FYI someone working on this on upstream
You might want to share what we have in dtest with him

@cvybhu
Copy link
Author
cvybhu commented Oct 19, 2023

@cvybhu FYI someone working on this on upstream
You might want to share what we have in dtest with him

That's great to hear :) I shared my code with them in the JIRA issue, so they should have access to everything.

@fruch
Copy link
fruch commented Oct 19, 2023

and now with the link:
datastax#1175

@avelanarius avelanarius self-assigned this Jan 15, 2024
@avelanarius
Copy link

Closing the issue here since upstream is working on it.

@avelanarius avelanarius closed this as not planned Won't fix, can't repro, duplicate, stale Jan 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants
0