
List of posts in this series

In the previous post I defined the ordering function and found an implementation I could reuse. This was both good news and bad news: good news because someone else had already done the work of implementing an ordering function using a Feistel network, but bad news because I didn't get to have some fun writing bit-twiddling code.

In this eleventh and final post of the series, I'll reflect on the outcome of the experiment and also on what I've learned about truly open development.

Admittedly the experiment was small and bounded. But it was also non-trivial and interesting. The outcome was a fairly typical trade-off between writing new code and reusing someone else's.

What were the downsides of reusing another implementation?

Apart from not getting to write more code, the main downside was that the code wasn't quite what I would have written myself. The code was clean and understandable and had good documentation.

But I wouldn't want to ship code that panics when presented with invalid inputs. An interface is often used quite differently from how its author expects. If I make error handling explicit in the interface, that encourages all code calling the interface to handle errors suitably. The calling code might still panic, but at least the panic will point to the code that needs fixing.

That said, I was able to wrap the code in a clean interface that returns errors rather than panicking. That was fine because my performance requirements weren't too exacting. A URL shortener would call the ordering function just once per request to shorten a URL (and wouldn't call it at all when mapping a shortened URL to the original URL). There's a little extra complexity in maintaining a wrapper, but not much and certainly less than maintaining the ordering function itself.

Then there was a part of the code that I didn't understand: a test that the permutation was pseudorandom. I spent a while searching for information on testing pseudorandomness in terms of a triangular distribution, but couldn't find anything. However, this test is unlikely to cause trouble. In the worst case, it might fail if the permutation implementation was changed significantly. But in that case, the project owner would almost certainly change the test to make it pass.

What were the upsides?

The main upside was having a clean implementation with good documentation and decent tests, but with minimal effort.

Also, if I'd written my own code, it might well have remained a solo effort. But by reusing someone else's code, at least two pairs of eyes have looked over it carefully.

Then there's maintenance. If bugs are found, there's a good chance someone else will fix them. Maintaining my wrapper would be very straightforward. I'd just have to decide if and when to bump the dependency to pick up changes.

I also felt I added a little value to the reused code by proving the correctness (see the previous post) of part of it which wasn't obviously correct (at least to me).

What about the experiment as a whole?

I enjoyed the process of writing up my thinking as I went along. I don't mind working quite slowly on projects these days as I'm doing them for fun and personal satisfaction rather than for an employer and on a schedule.

Parts of the process were quite tricky to write down in real time. When I was mulling over the requirements and design, I had to force myself to write down some things that would turn out to be dead ends. If instead I'd written just one post after the project was complete, it wouldn't have had so many loose ends. But hopefully that gives readers a more realistic impression of the way I work.

Insights into the way I work

I go off on tangents occasionally. The commit history of my code describes just a small part of my development process. This was always the case throughout my career. My goal was to grow my understanding and skills as well as deliver working software. This approach helped make even the dullest piece of work enjoyable and thereby helped me do it well.

Going off on tangents is also about the way I learn: I need to explore the aspects of a problem that interest me, and I find it deeply unsatisfying to skip over things I don't understand. I guess that tended to take the edge off my raw productivity, but it also increased my longevity.

Would I repeat the experiment?

Probably not. Although I tend to write stuff down as I go along, I wouldn't necessarily want to polish the material for publication. In fact, I think going back to using a physical development notebook is preferable. The ability to work in pencil, draw rough diagrams, scribble questions, and jot down half-formed thoughts is a great freedom compared to writing up my thoughts coherently for consumption by others.

Conclusion

Working in a completely open way — by which I mean publishing my thinking as I go along, as in this experiment — certainly influences the development process and the ultimate outcome, but I'm not sure these effects will always be positive.

The extra time investment represents inertia that could reduce the chances of switching to a better design or feeling free to make certain kinds of improvement. Perhaps a sweet spot would be to keep notes as I go and use these as the basis for a blog post if that seems appropriate.

#OpenDevelopment


List of posts in this series

After reviewing the state of play in the previous post, I'm now going to start defining an ordering function in code. First I need to decide where to put the code.

Code repository

I took a closer look at the URL shortener repository. The code to generate a new short link simply returns a random adjective separated from a random name by a hyphen, e.g. “clever-knuth”. No state is maintained except for a database mapping previously generated short links to their original, unshortened URLs.

So there is actually very little in the URL shortener's repository of relevance to the ordering function I want to develop. To focus on the task at hand, I decided to create a new repository. I might eventually port some code back to my fork of the URL shortener repository, but I'll think about that later.

Repository name

As Leon Bambrick quipped, there are only two hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. I want a repository to host the ordering function. I don't want to tie the name to the specific application of URL shortening, especially as the ordering function will be much more general purpose than that. I'll probably use a Feistel network, but that's an implementation choice, so I don't think Feistel network should necessarily appear in the name. So what will the ordering function actually do? It will permute a range of integers.

After searching crates.io for “permute” and “permutation”, just in case that turned up an existing piece of code, I settled on a repository name of range-permutation-rs (the “-rs” suffix is a convention indicating Rust code).

Setting up the repository

If you're happy setting up a git repository with Rust content, you may wish to skip this section.

I created the repository in a new directory, with the usual empty commit:

mkdir range-permutation-rs
cd range-permutation-rs
git init
git commit --allow-empty

I then set up a remote repository and pushed the main branch up.

Before getting into Rust, I populated the repository with some basic documentation: a README, a LICENSE, a code of conduct, and a contributor guide. This probably seems over the top, but I think it's good practice to take 10 minutes to set these things up early, so anyone who comes across the repository knows the ground rules. I also added a .editorconfig file to set my default editor conventions and try to ensure some consistency between editors.

I intend the repository to be a library, so I then checked my Rust toolchain was up to date and used cargo to initialise the Rust side of the repository:

rustup update
cargo init --lib --name range-permutation --vcs git

The option --vcs git does not affect the existing git repository, but adds a suitable .gitignore file listing just /target.

Defining an interface

I want to start off with a very simple interface: a function that takes the size of the range and returns an ordering function that permutes integers in the specified range. What does this look like in Rust?

I want to separate out the creation of the permutation function from the function itself, because I want to develop some generic tests of the permutation function and I expect the creation function to become more complicated over time (e.g. to take a key which will act as a randomisation seed).

The creation function will take an integer to set the size of the range. The permutation function will take an integer in the range and return another in the range. I expect the permutation function not to mutate any internal state, so it will be a pure function with no side-effects. I looked at some ways of representing an integer in a range in Rust, but decided that they didn't make the functions any clearer. So I ended up with this:

pub fn create(size: u64) -> impl Fn(u64) -> Result<u64, &'static str> {
    ...
}

This is admittedly quite busy, especially if you're not used to Rust. Let's go through it in detail.

The input parameter size is an unsigned 64-bit integer specifying the size of the range we want to permute.

The return type of create is the Rust way of specifying a function that takes an unsigned 64-bit integer and returns a result which contains either an unsigned 64-bit integer (in the successful case) or a string error message (when the input is outside the range).

As a placeholder, I made the permutation function simply return its input value, wrapped in Ok(), or an error if the input was out of range.
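
Concretely, the placeholder body looked something like this (a sketch; the real code may differ in detail):

pub fn create(size: u64) -> impl Fn(u64) -> Result<u64, &'static str> {
    // Placeholder: the identity permutation, with a range check.
    move |n| {
        if n >= size {
            Err("out of range")
        } else {
            Ok(n)
        }
    }
}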

Docs and tests

Next I wrote some documentation with a simple example, added some tests, and checked that the example and the tests passed.

Good news, bad news

While I was writing the docs, I had another look at Tom Conway's feistel-permutation-rs repository. After he merged my PR to add an open source license, I was delighted to find he had written up a very clear README.

In fact, his code pretty much meets my requirements for an ordering function. I could press on with my own implementation, but my motivation has disappeared. Indeed, to write my own code would amount to NIH (“not invented here”).

Apologies to anyone who was hoping to see me develop some Feistel network code from scratch in Rust. The good news is you can now read Tom's clear and well-tested code, which I'll explore further below. The bad news (kind of) is that I enjoy writing bit-twiddling code, but now I'm not going to have the pleasure.

Implementing in terms of feistel-permutation-rs

One aspect where feistel-permutation-rs didn't match my ideal interface was the way it handled a range of numbers. The permutation (get) function asserts that the input value is in the relevant range. The effect is that the code panics if an out of range value is passed to get. I prefer to return an error rather than panic.

So I implemented my ordering function in terms of feistel-permutation-rs, like this:

// Import path assumed from the crate name.
use feistel_permutation_rs::{DefaultBuildHasher, Permutation};

pub fn create(size: u64, seed: u64) -> impl Fn(u64) -> Result<u64, &'static str> {
    let perm = Permutation::new(size, seed, DefaultBuildHasher::new());
    move |n| {
        if n >= size {
            Err("out of range")
        } else {
            Ok(perm.get(n))
        }
    }
}

As expected, I added the key or pseudorandom seed to the signature of create.

The code above is straightforward, except for the use of move to capture the value of size in the closure, rather than a reference to size. Capturing a reference fails to compile because the lifetime of the closure exceeds the lifetime of size. In other words, size goes out of scope and cannot be referred to afterwards.

With this code, it's not surprising my tests passed. So let's take a look at Tom's code.

A Feistel network: feistel.rs

The core of feistel-permutation-rs is a Feistel network with a user-specified round function, a range, and a key schedule.

The range (of values to be encrypted) is specified as an even number of bits, so that values in the range can be split into left and right portions with the same number of bits.

The round function is applied on each iteration, or “round”, of the Feistel network and is a way of making the order of the results pseudorandom. The inputs to the round function are an encryption key (described next) and the right portion of the data being encrypted. The round function is supplied via a hash function builder (conventionally named “bob” – geddit?).

The key schedule is a vector of encryption keys which are passed in turn to the round function. The number of rounds is determined by the length of the vector.

The code is straightforward¹:

pub fn encrypt(&self, x: u64) -> u64 {
    let (mut l, mut r) = self.split(x);
    for k in self.keys.iter() {
        l ^= self.hash(*k, r);
        (l, r) = (r, l);
    }
    self.combine(r, l)
}

The value to be encrypted (x) is split into left and right portions. If the number of bits in the range is 2n, the valid range of inputs is 0..2^(2n). Then the input value has 2n significant bits and these are split into left and right portions, each with n bits.
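
For concreteness, here's roughly what split and combine do (a sketch, assuming the struct stores the half-width in a field I've called half_bits; Tom's actual code may differ):

struct Feistel {
    half_bits: u32, // n: the number of bits in each half
    keys: Vec<u64>, // the key schedule
}

impl Feistel {
    // Split a 2n-bit value into its left and right n-bit halves.
    fn split(&self, x: u64) -> (u64, u64) {
        let mask = (1u64 << self.half_bits) - 1; // the low n bits
        (x >> self.half_bits, x & mask)
    }

    // Recombine left and right n-bit halves into a 2n-bit value.
    fn combine(&self, l: u64, r: u64) -> u64 {
        (l << self.half_bits) | r
    }
}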

There are some interesting (at least to me) edge cases. It's perfectly valid to specify a range of 0 bits which will then encrypt just one value (0).

It's also valid to provide an empty vector for the key schedule, in which case the encryption will be the identity function.

Crucially, the encrypt function permutes the range. The body of the for loop is clearly invertible (one-one) and it follows that so is encrypt. Since all the values of the range are mapped to distinct values, also in the range, encrypt is onto. In other words encrypt is a bijection from the range to itself, i.e. a permutation.
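
One way to make the invertibility concrete is to write the inverse down: a Feistel network is decrypted by running the same loop with the key schedule reversed, since each XOR cancels the corresponding XOR in encrypt. A sketch (mirroring, not copying, Tom's code):

pub fn decrypt(&self, y: u64) -> u64 {
    let (mut l, mut r) = self.split(y);
    for k in self.keys.iter().rev() {
        l ^= self.hash(*k, r); // cancels the XOR applied by encrypt
        (l, r) = (r, l);
    }
    self.combine(r, l)
}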

Arbitrary range permutation: permutation.rs

feistel-permutation-rs builds on the Feistel network to create a permutation of an arbitrary unsigned 64-bit range.

The first step is to build a key schedule of five keys from a single input seed. The first key in the schedule is the hash of the input seed. Each subsequent key is the hash of the previous key. This relieves the user of the responsibility to provide a complete key schedule.
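
In code, the derivation might look like this (a sketch; hash stands in for whatever hash function the crate actually uses):

fn key_schedule(seed: u64, hash: impl Fn(u64) -> u64) -> Vec<u64> {
    let mut keys = Vec::with_capacity(5);
    let mut k = hash(seed); // the first key is the hash of the seed
    for _ in 0..5 {
        keys.push(k);
        k = hash(k); // each subsequent key is the hash of the previous key
    }
    keys
}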

The second step is to obtain a permutation of an arbitrary range by taking the next even power of 2 higher than the size of the range and using a Feistel network with the corresponding number of bits.

This is combined with an algorithm that applies the Feistel network encryption and, if the encrypted value exceeds the range, keeps on applying the Feistel network encryption until the encrypted value falls in the range. The code looks like this:

pub fn get(&self, x: u64) -> u64 {
    assert!(x < self.n);
    let mut res = self.feistel.encrypt(x);
    while res >= self.n {
        res = self.feistel.encrypt(res);
    }
    res
}

To see that this provides a permutation of the range 0..n, first observe that all the results are in the range. Secondly, let's prove get() is one-one.

Proof

[Image: proof that get() is bijective, formatted with Typst.]
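
In outline, the argument runs as follows (my reconstruction in LaTeX, which may differ from the image in presentation; see footnote ² for “WLOG”):

Let $E$ denote \texttt{encrypt}, and write $\mathrm{get}(x) = E^{i}(x)$, where
$i \ge 1$ is minimal such that $E^{i}(x) < n$. Such an $i$ exists because $E$
permutes the whole $2^{2m}$-value domain, so iterating $E$ from $x$ eventually
returns to $x$ itself, and $x < n$.

Now suppose $\mathrm{get}(x_1) = \mathrm{get}(x_2)$, i.e. $E^{i}(x_1) = E^{j}(x_2)$
with $i$ and $j$ minimal as above, and WLOG $i \le j$. Since $E$ is injective,
so is $E^{i}$, and cancelling gives $x_1 = E^{\,j-i}(x_2)$. If $j > i$, then
$E^{\,j-i}(x_2)$ is an intermediate value of $x_2$'s walk and so is $\ge n$ by
minimality of $j$, contradicting $x_1 < n$. Hence $i = j$, and injectivity of
$E^{i}$ gives $x_1 = x_2$. So $\mathrm{get}$ is one-one and, being a one-one
function from the finite range $0..n$ to itself, it is onto, i.e. a permutation.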

permutation.rs also includes an interesting test that the generated permutation is random. I don't pretend to understand the relevant theory, so I wouldn't have included such a test.

Contributions

While reading through feistel-permutation-rs, I made a number of very small contributions by way of pull requests.

I also raised a “Fix test” PR, which turned out not to be valid, but which prompted Tom to add the interesting test mentioned above.

Conclusion

By depending on feistel-permutation-rs, the ordering function, with my preferred interface, is now complete.

I'll have to think about whether it's worth the effort to use this in the URL shortener. The URL shortener is sufficient for practical use and I'd have to fork the code and add a dependency on the above ordering function. I don't think I could justify making such a substantial change in the upstream repository, so I'd end up with a hard fork.

In fact, as I've been working through this experiment, I've become less keen to support a URL shortener on my site, so I may just ditch it.

But the point of the experiment was to see what it was like doing a piece of development in the open and to have some fun. The actual coding side of the experiment ended up being a lot smaller than expected, but I've enjoyed reusing and analysing Tom Conway's code.

The experiment demonstrates one of the trade-offs in open source development: whether to reuse existing code or write your own. It also demonstrates some of the costs associated with reusing code:

  • wrapping reused code to provide a different interface
  • analysing the correctness of reused code
  • taking some parts of the reused code on trust (the permutation randomness test, which I don't understand).

Footnotes:

¹ The original used a temporary value to swap l and r.

² In mathematical proofs, “without loss of generality” (“WLOG”) is a way of abbreviating a proof by considering just one of a number of very similar cases. Here the cases are symmetric. See WLOG on Wikipedia for more explanation.

#OpenDevelopment #Mathematics


List of posts in this series

It's a month since my last post in this series in which I reflected on the interactions this experiment has generated. My wife and I have spent three weeks on a road trip to Italy during which I've only fleetingly thought about the experiment. So as I re-enter, I've got an opportunity to see things from a fresh perspective.

Fresh perspective

One of the challenges of software development is to be able to keep in mind high-level requirements and goals while simultaneously diving down into the detail of algorithms, code, tests, etc.

Having written up some of the high-level requirements in posts for this experiment helps, as I can always go back to those posts if I feel I'm losing sight of the big picture. But there's also the risk that re-reading my earlier thoughts will cause me to miss some new insight. The fresh perspective made possible by a significant break is a precious thing, and I don't want to waste it.

How to proceed?

Since there are eight posts in this series, one option would be to re-read them to come back up to speed and then think about what to do next. But I might lose the benefit of my fresh perspective.

Another option would be to completely ignore previous posts, start from scratch, and build on what recollections I have. But I doubt that's going to be particularly effective.

Instead, I'm going to work through the posts slowly and, when I get fresh insights, I'll pause, and discuss them below. Let's see how this goes.

Introductory post

In the introductory post, the following jumped out:

I suspect I'll want to plug the algorithm into the existing code, but in such a way that alternative algorithms can be used in future. The idea is to come up with a usable, testable interface to the set of possible algorithms. So designing a suitable interface, in Rust, will be part of the challenge.

I think testing could be tricky, so designing the interface and some tests would be a helpful next step. It could also benefit from my current fresh perspective.

(At this point, I was tempted to start work on the interface, but I suspect I would start to lose the fresh perspective during that task, so I continued to re-read the next post.)

Requirements

The requirements were mostly about the usability — shortness, memorability, unambiguity — of shortened URLs, but also about unpredictability: generating URLs in a pseudorandom order.

I anticipate introducing an interface that returns a pseudorandom permutation of a range of integers, one at a time, using the returned integer to select a specific shortened URL, and then implementing that interface (using a Feistel network, or whatever). The URL usability requirements will be satisfied above the interface (i.e. by the code calling the interface) and the unpredictability requirement will be satisfied below the interface (i.e. by the implementation of the interface).

Given the purpose of the experiment is mainly to have fun, I intend to focus on the algorithm (the interesting bit) below the interface and make do with the current format of shortened URLs (the boring bit) above the interface. Once the interface is in place, it should be relatively easy to evolve the code above and below the interface independently should I, or anyone else, wish to do that.

Design

The design post reminded me that I'd already had the above ideas about a suitable interface! It seems my fresh perspective didn't make me think of a different approach.

To recap, the interface will provide a bijection from the range 1..n to the range 1..n, where n will be the number of possible shortened URLs. A bijection is a function that is one-one (no two distinct inputs produce the same output) and onto (each member of the output range 1..n is generated). This bijection can also be thought of as a permutation.

Having thought of this interface twice seems to be a validation that the approach is reasonable (at least to me). Alternatively, I might be failing to see a better alternative, but there's not much I can do about that right now.

Search for Feistel networks

This post found a couple of Rust open source projects that implemented a Feistel network. Whether I can use one of those or I have to “roll my own crypto” remains to be seen.

Next step?

But for the moment, I plan to look at defining a suitable Rust interface and thinking about how to test it. I have a couple of questions at this stage:

  • How easy will it be to map the range 1..n to a Feistel network, given that, as far as I've seen, Feistel networks tend to permute sequences of bits? I suppose in the worst case, I could restrict n to be a power of 2.

  • Should the interface take a key as an input, so that the pseudorandom sequence isn't predetermined by the implementation of the interface? The format of the key would seem to be a big design decision.

I'll have to think more about these questions as I start to flesh out the interface, which I'll save for the next post.

#OpenDevelopment


List of posts in this series

As a little light relief from the search for, and possible development of, an ordering function, I thought I'd mention some interactions related to this series.

Open source licensing

When I encountered the Rust feistel-permutation-rs crate, I was disappointed there was no license file. I could have called the code in the crate, but if I'd wanted to extend or modify it, the lack of an open source license would have stopped me doing that without infringing the author's copyright.

Thankfully after I raised an issue asking for a license to be added, the author responded positively and was happy to merge a pull request adding an Apache 2.0 license file (and some license headers, a necessary part of applying the Apache 2.0 license). He also added an interesting comment to the issue explaining some of the motivation for the crate (in bioinformatics).

While writing this post, I noticed that the crate appeared to be licensed as GPL-3.0-or-later on crates.io. It turned out this was declared in the Cargo.toml in the git repository for the crate. I submitted another PR to change this to the Apache 2.0 license.

Error handling

While examining the Rust cryptographic_primitives crate, I noticed that it was not clear when some of the errors were returned and also that the code sometimes panicked instead of returning an error.

I raised an issue on the panicking behaviour and the author kindly fixed this to return errors instead. He even went to the trouble of publishing a new version of the crate.

He helpfully explained his view of the likelihood of certain errors being returned and even added some tests to clarify when errors occur.

There is still a certain amount of uncertainty in my mind, so I'm not inclined to use the crate. It didn't feel appropriate to press for more clarity as the author had already been very responsive and helpful and we probably have different opinions on the acceptability of assuming that a highly unlikely error will never be returned¹.

Feedback on posts

There have been a few responses on Mastodon to some of the posts in this series. It's nice to know people are reading the posts and perhaps finding them interesting or helpful.

Introduction to the experiment

I got a couple of responses which encouraged me from the start.

James Ravenscroft wrote:

very cool – looking forward to following along!

while Henrik Jernevad wrote:

Awesome idea! 😊

Requirements

Andy Scott commented:

I enjoyed catching up on this series this morning. Reading about requirements always leaves me thinking more creatively, and now I'm aware of the Feistel cipher ;)

Ordering function

Jeff Slarve offered me an ordering function written in TypeScript. We had a nice interaction about this.

Search for Feistel networks

Tomas Stropus replied:

Oh man, I feel this so hard. This is giving me flashbacks! I recently needed a PELT library for C# and ended up in the same boat – zilch options, so I had to code it myself.

Gotta say, I'm impressed by your patience. I probably would've thrown in the towel and started rolling my own way sooner. But hey, that Cloudproof FPE does sound promising. Curious to hear how it goes when you jump back in.

Next steps

Postscript: after a month's break, the next post turned out to be looking back on the posts so far to get back up to speed.

Footnote:

¹ In my experience, I've found that if something can go wrong, it probably will at some stage in the future. On several occasions, it seemed that simply noticing the possibility of a bug caused the bug to manifest itself in production within a few days or weeks, but correlation does not imply causation.

#OpenDevelopment


List of posts in this series

In the previous post, I started looking at the Cryptographic primitives crate, but was put off by some fairly inscrutable conditions under which the code would panic. So here I continue looking at the most recently updated crates found by searching crates.io for “Feistel”.

Dandelion random

The next crate in the list, dandelion-random, refers to Feistel only in the README, although it makes the following reassuring statement:


Indeed, a Feistel round

(x, y) ⇒ (y + H(x), x)

is a permutation irrespective of any properties of H.
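
(Reassuring indeed: the inverse of that round is (u, v) ⇒ (v, u − H(v)), with the subtraction modular, so the round is invertible whatever H does.)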


But there's no Feistel network code so we pass on to bidirectional_hash.

Bidirectional hash

This is a Rust implementation of Pseudo encrypt: a keyless Feistel network with three rounds and a simple — maybe too simple — round function:

let r2 = l1 ^ ((((1366 * r1 + 150889) % 714025) as f64 / 714025.0 * 32767.0) as i32);

Arweave RS packing

This crate's documentation assumes the reader is familiar with “Arweave data”. The Arweave website is off-putting in stating “Dive in! The rabbit hole is deep, and the possibilities are endless.” Next!

ore.rs

This crate also refers to Feistel networks only in its README, to point out it doesn't use a Feistel network. Moving on...

Cloudproof FPE

This looks promising as the crate is specifically designed to offer FPEs and it has a good README. The crate has a commercial backer, but is open sourced under Affero GPLv3, which would probably be ok for my use.

Feistel permutation rs

This also looks promising. The code is very simple and there are tests. But there's no README to give any clue about the author's intentions. There's no LICENSE either, so it's not open source. It may be useful to refer to.

(Postscript: the author kindly merged a PR to add an open source license.)

Next steps

I'm probably AFK for several weeks now, but my next step may be to start to look at the Cloudproof FPE code and its documentation.

(Postscript: the next post turned out to be some light relief looking at some of the interactions with others that this experiment has generated.)

#OpenDevelopment


List of posts in this series

So far, I've adopted a top-down approach to developing a better URL shortener algorithm. I've already encountered a couple of “icebergs” (see the previous post) and am getting distracted and bogged down. Time to try a different approach: bottom-up. But what does that mean?

I certainly don't want to roll my own cryptography code if I can possibly find something suitable to reuse or adapt¹.

Searching for Rust code

Since the URL shortener code that I'm currently using is written in Rust, I'd like to implement the ordering function in Rust.

Recall an ordering function is a permutation of the numbers in the range 1..n or, more formally, a bijection from the range 1..n to itself. I also have a clue that a Feistel cipher (a.k.a. a Feistel network) might provide a suitable format-preserving encryption, particularly since Feistel ciphers are reversible (implying the one-to-one property of a bijection).

So I searched crates.io for “Feistel” and sorted by most recent update. At least this way, I'll find code that someone bothered to publish as a crate and which may be being actively developed.

[Image: crates.io search results for "Feistel"]

Let's take a look at the first of these. We may come back to the others in a later post.

Cryptographic primitives

This initially looked promising. I liked the statement in the README: “The goal is to provide a comprehensive set of cryptographic primitives that are easy to use and hard to misuse.” There was a clean separation of encryption and decryption functions in src/lib.rs and helper functions in src/helpers.rs. There were even a few simple “doc tests” in the form of examples. The code also builds cleanly using the latest Rust toolchain. So what's available in terms of Feistel networks?

pub fn feistel_network_encrypt(input: &[u8], key: &u128, rounds: u32) -> Result<Vec<u8>, &'static str>

This encryption function takes a slice of bytes and, if successful, returns an encrypted vector of bytes. The other inputs are an encryption key and the number of rounds to apply. There are a couple of warning signs however.

A function from one sequence of bytes to another doesn't appear to be suitable as an ordering function. It's doubtful that the function is onto (a property of a bijection). Also, I can't immediately see how to represent the range 0..n in such a way that I could reuse this function.

Also, the function can return errors. This would be highly inconvenient for an ordering function. Unless we could definitely avoid errors (e.g. by a suitable choice of key and number of rounds), the ordering function would not be a bijection, since some input byte slices might cause it to fail.

Let's look a bit more closely at where errors can occur. Thankfully, the possible errors are documented in the function header:

  • Input must not be empty – if the input [slice] is empty.
  • Number of rounds must be greater than 0 – if the number of rounds is less than 1.
  • HMAC creation failed – if there is an error in creating the HMAC for the round function or the Key Derivation Function.

The first two of these are avoidable by providing suitable inputs, but the third looks pretty scary and seems related to the following two lines of code in the function:

let subkeys: Vec<Vec<u8>> = kdf(&key.to_le_bytes(), rounds as usize).unwrap();

and:

let round_function_result = feistel_round_function(&right_side, &subkeys[i as usize]).unwrap();

However, the code is calling unwrap() which will panic in the error case, rather than returning a value of type Result. Quite apart from the question of what a HMAC creation failure is or when it will happen, I think the code needs to differentiate between error results and panics, so I raised an issue.
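
For what it's worth, the idiomatic fix is to propagate the failure with the ? operator rather than unwrapping, along these lines (a sketch, assuming kdf's error can be mapped onto the function's &'static str error type):

let subkeys: Vec<Vec<u8>> = kdf(&key.to_le_bytes(), rounds as usize)
    .map_err(|_| "HMAC creation failed")?; // return the error, don't panic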

(Postscript: the author kindly fixed this issue so that the code never panics.)

Next, I started reading the code to find out when HMAC creation failures would occur. I soon got into some HMAC code in another crate, so I stepped back and checked the dependency graph of the project:

$ cargo tree
cryptographic_primitives v0.1.1 (/home/glyn/dev/rust/cryptographic_primitives)
├── bitvec v1.0.1
│   ├── funty v2.0.0
│   ├── radium v0.7.0
│   ├── tap v1.0.1
│   └── wyz v0.5.1
│       └── tap v1.0.1
├── hmac v0.12.1
│   └── digest v0.10.7
│       ├── block-buffer v0.10.4
│       │   └── generic-array v0.14.7
│       │       └── typenum v1.17.0
│       │       [build-dependencies]
│       │       └── version_check v0.9.4
│       ├── crypto-common v0.1.6
│       │   ├── generic-array v0.14.7 (*)
│       │   └── typenum v1.17.0
│       └── subtle v2.5.0
└── sha2 v0.10.8
    ├── cfg-if v1.0.0
    ├── cpufeatures v0.2.12
    └── digest v0.10.7 (*)

hmac is looking like it might turn out to be a heavy dependency with quite a bit of its own complexity. I'm starting to go off the idea of reusing or adapting the cryptographic primitives crate. Maybe on reflection, I'll see that hmac is needed. Or maybe one of the other search results will turn out to be more suitable.

Either way, this is a good point at which to end the current post. In the next post, I'll look more specifically at Feistel networks.

#OpenDevelopment

Footnote:

¹ I realise that the security requirements for a URL shortener are not particularly onerous and if I get it wrong, or reuse someone else's insecure code, the worst thing that can happen is that someone will be able to guess the sequence in which shortened URLs are generated: no big deal.


List of posts in this series

Thus far, I've been ahead of the game in this experiment. Each post has been a dump of my thoughts on the various topics. Also, I've not learned a great deal from writing the posts.

This is what it's like when I develop a piece of software: my brain goes into overdrive thinking about the various aspects from requirements and design through to coding details. Unfortunately, this often happens in the middle of the night. Writing my thoughts down — in the case of this experiment, as a draft post — is a way of clearing my mind and helping get back to sleep.

Exploration starting

But at this point, I need to start exploring and understanding the ordering function.

Recall from the previous post that the ordering function is a permutation of the numbers in the range 1..n. Or, putting this more formally, an ordering function is a bijection from the range 1..n to itself. As I mentioned before, a bijection is one-one (no two distinct inputs produce the same output) and onto (each member of the output range 1..n is generated).

So far, I've skimmed an article on small-domain ciphers (Feistel cipher in CTR mode), which seem eminently suitable for the task. But I haven't dug into small-domain ciphers in general or specific examples such as Feistel ciphers, which means I'm carrying a risk.

Iceberg ahead?

The risk is the ordering function might turn out to be an “iceberg”: something that seems straightforward, but which turns out to take a lot longer to understand and implement than initially expected.

Incidentally, that's one reason why a waterfall process doesn't work too well. Some areas need investigation, and often prototyping, before their complexity is properly understood. That's one purpose of an agile “spike”, where a rough prototype of a particular aspect is developed and then put to one side while informing the proper development (with tests and what-not).

The only way to tell whether the ordering function is an iceberg is to look further into it. I'll start by exploring the idea of a small-domain cipher, which is better described as small-space encryption or format-preserving encryption.

Format-preserving encryption

Bellare et al. ([3]) state the goal of format-preserving encryption (FPE):

The goal [of FPE] is this: under the control of a symmetric key K, deterministically encrypt a plaintext X into a ciphertext Y that has the same format as X.

Examples of formats would be 16-digit credit card numbers and N-bit strings.

The symmetric key refers to encryption, but it also effectively acts as a seed: applying a suitable FPE to the ordered sequence of all values in the given format produces a pseudorandom reordering of those values, determined by the key.

The specific example of a FPE that I aim to understand is a Feistel cipher. This depends on using a cryptographically secure pseudorandom round function – another potential “iceberg”.

To get more familiar with the terminology and plug some basic gaps in my understanding, I've picked a key paper and started reading it.

Reading

I chose Bellare et al. (Format-Preserving Encryption) and have taken a look at some of the fundamental work on encryption such as the Data Encryption Standard (DES) and its successor, the Advanced Encryption Standard (AES).

AES and DES are both reasonably straightforward algorithms involving bit manipulation, the application of a “round function” over a series of rounds, and a transformation involving a lookup table to prevent the encryption being “linear” (which would make it trivially breakable).

I also dipped into Liskov et al. (Tweakable Block Ciphers) to understand the concept of “tweaks” applied to an encryption algorithm.

But since then I must admit to having diverted into reading more than I probably need to about cryptography.

Cryptography

As a subject, cryptography seems to be large and well-developed. It has a sound mathematical basis and many practical applications. As much as I find it fascinating, I'm pretty sure I'll run out of steam before I've scratched the surface.

I've been dipping into a couple of seminal documents on the subject.

I really have only dipped into these, but I've realised in doing so what a large and complex subject cryptography is. Worse than that, it has an unusual form of iceberg: there is a great deal of published material but also an unknown quantity of secret material¹, perhaps including cryptanalysis techniques for breaking current encryption algorithms.

Next steps

I'm not sure how to proceed in this open development experiment. I am fascinated by cryptography, but I'm pretty sure I haven't got the time, energy, or motivation to go into any depth.

On the other hand, the encryption requirements for an ordering function are not particularly onerous, so I might start to focus in on Feistel ciphers and get my hands dirty with some code.

For various reasons, I don't expect to do much more for several weeks, so there is likely to be a pause before any potential next post in this series.

Reflection

What I'm just starting to realise is that it's hard to post about the process of research and learning. The process is a little like solving a jigsaw with an unknown number of pieces. I have some “edge pieces” that I know will be relevant and quite a few other pieces that will probably turn out to be irrelevant. I don't understand any of the pieces well enough to explain them in writing. I'm not even sure how the pieces fit together or whether some of the pieces actually belong to a different puzzle.

This is reminiscent of a situation I faced when I was involved in pair programming for a year. If my pair needed to work on an area we didn't understand, we had to do some research and learning before we could even write a failing test. That was very difficult to do as a pair. We would typically read around a subject separately and then discuss what we had found. But at least one of us needed to have gotten our head round the subject matter before we could talk about it intelligently.

In the current experiment, the open-endedness of the research and learning process is compounded by the lack of deadline or accountability to someone else. See Tomas Stropus's recent post The Art of Finishing for his thoughts on personal projects.

#OpenDevelopment

Footnote:

¹ If I'd taken up a job offer from GCHQ, I'd presumably know a lot about the secret side.


List of posts in this series

Having spent the previous post examining the requirements of this open development experiment, I'd like to spend a little time on design before looking at the detailed algorithms and how to implement and test them. (When I say “design”, I'm referring to how the code is structured at a high level rather than “design” in the user interface sense.)

The main design decision is how to separate out these concerns:

  • The shortening algorithm,
  • The order in which shortened URLs are generated.

Let's look at the shortening algorithm first, as that's straightforward.

Shortening algorithm

The current shortening algorithm builds the paths of shortened URLs by combining a string prefix (chosen from a predefined list of p adjectives) with a string suffix (chosen from a predefined list of s nouns).

Suppose we list all p×s possibilities in the order:

(1, 1), (1, 2), ..., (1, s),
(2, 1), (2, 2), ..., (2, s),
...
(p, 1), (p, 2), ..., (p, s)

Then, for any number i in the range 1..(p×s), there is a pair (j, k) at the ith position in the list (where the first item has position 1).

So we can encapsulate the current shortening algorithm as a function from a number in the range 1..(p×s) to a string which will serve as the path of the shortened URL.

Generalising this a little, we can model a shortening algorithm as a function from a number in the range 1..n, where n is the number of possible shortened URLs, to a string path.

We can imagine plugging in any number of shortening algorithms, including the current algorithm, behind such an interface.
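
To make that concrete, here's a sketch of such a function (the word lists and names are hypothetical stand-ins, and I convert between the 1-based interface and 0-based slice indices):

// Map an index i in 1..=(p*s) to the path of a shortened URL.
fn shorten(i: u64, adjectives: &[&str], nouns: &[&str]) -> Option<String> {
    let (p, s) = (adjectives.len() as u64, nouns.len() as u64);
    if i < 1 || i > p * s {
        return None;
    }
    let j = ((i - 1) / s) as usize; // adjective index (0-based)
    let k = ((i - 1) % s) as usize; // noun index (0-based)
    Some(format!("{}-{}", adjectives[j], nouns[k]))
}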

Next let's look at the order in which shortened URLs are generated.

Generation order

Abstractly, suppose there are n possible shortened URLs. I'd like one component to enumerate the n possibilities in a random or pseudo-random order (see the requirements).

The order of the generated URLs is a permutation of the numbers in the range 1..n. In other words, each number in the range must be generated exactly once.

This is more stringent than the current algorithm which can produce duplicates (and hence require the user to retry the request). But it avoids the poor performance associated with the current algorithm, which we examined in the mathematical interlude.

We can model a permutation as a function from the range 1..n to the range 1..n which is a bijection. A bijection is one-one (no two distinct inputs produce the same output) and onto (each member of the output range 1..n is generated).

Aside

I must confess that the idea of modelling the generation order as a permutation function came from my cursory reading about small-domain ciphers (see “Next steps” in the introductory post of this series).

Design

So the core of the URL shortener will consist of two functions:

  • An ordering function that takes a number in the range 1..n and returns the index of the shortened URL to generate, and
  • A shortening function that takes the index of the shortened URL to generate and returns the path of the shortened URL.

Around this core will be some other code, which will be an adaptation of the current implementation. We'll need to keep track of (and store persistently in case the web server process is restarted) how many shortened URLs have been generated. When we receive a shortening request, we'll take that number, increment it (and persist the new value), pass the new number through the composition of the above functions to produce the path of the shortened URL, construct the complete shortened URL from that (by adding the domain etc.), store the mapping from the shortened URL to the input URL in a database, and return the shortened URL.
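
As a sketch of how the pieces compose per shortening request (persistence, URL construction, and database storage elided; all names hypothetical):

fn next_short_path(
    count: &mut u64,                    // URLs generated so far (persisted elsewhere)
    ordering: impl Fn(u64) -> u64,      // pseudorandom permutation of 1..n
    shortening: impl Fn(u64) -> String, // index of shortened URL -> path
) -> String {
    *count += 1;                        // increment (and persist) the count
    shortening(ordering(*count))        // compose the two core functions
}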

Next step

The aspect of this experiment that I find most interesting is the ordering function and I'll turn to that in the next post.

#OpenDevelopment


List of posts in this series

In my previous post, the mathematical interlude, I analysed the main downside of the current URL shortener algorithm: its very poor performance as the available shortened URLs are used up.

Again, I must reiterate that the point of this experiment is to have some fun and learn something along the way. Otherwise, I'd simply make do with the current algorithm. My need for shortened URLs is so minimal that I probably only use one or two per year. So the current number of some 26,000 shortened URLs is vastly more than I need and the current algorithm will perform quite adequately as long as I need it to.

But I feel I should at least pay lip service to the requirements.

So, what are the ideal characteristics of a URL shortener? Aside from the ability to choose a custom path instead of an automatically generated one and some runtime requirements such as suitable scalability, there are some requirements on the shortened URLs:

  • Shortness
  • Memorability
  • Unambiguity
  • Unpredictability

Shortness

While some domain names may not be as short as others, the remainder of a shortened URL should be short. This enables the shortened URL to be used where there is limited space, such as on a handwritten note, read out over the phone, etc.

Memorability

It's very handy, although not essential, that shortened URLs should be memorable.

The current algorithm builds the paths of shortened URLs by combining a random string prefix (chosen from a predefined list of p adjectives) with a random string suffix (chosen from a predefined list of s nouns). So there are p×s possible shortened URLs. The current code maintains a database which records the long equivalent of each shortened URL that has been used so far. When a new shortened URL is generated, the database is consulted and, if the shortened URL is already in use, the request fails (and it's up to the user to try again).

The database also allows the user to choose their own path, which further improves memorability.
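
A sketch of a single generation attempt under the current scheme (the rand crate and all the names here are my assumptions, not the real code):

use rand::seq::SliceRandom;
use std::collections::HashSet;

// One attempt: pick a random adjective-noun pair; fail if it's in use.
fn try_generate(
    adjectives: &[&str],
    nouns: &[&str],
    used: &HashSet<String>, // stands in for the database of existing links
) -> Option<String> {
    let mut rng = rand::thread_rng();
    let path = format!(
        "{}-{}",
        adjectives.choose(&mut rng)?,
        nouns.choose(&mut rng)?
    );
    if used.contains(&path) {
        None // already taken: the request fails and the user retries
    } else {
        Some(path)
    }
}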

Unambiguity

Shortened URLs should not be ambiguous. For instance, they should avoid characters and digits that are easily confused (o vs 0, l vs 1, s vs 5).

The current implementation's use of word combinations avoids such ambiguities at the expense of brevity.

Unpredictability

While it would be possible to generate shortened URLs in a predictable order, this is generally frowned upon in the world of URL shorteners. Instead a random, or pseudorandom, order is preferred. Why should that be?

Simplifying Links: A Deep Dive into URL Shortener System Architecture explains this in terms of the following non-functional requirement:

Unpredictability: From a security standpoint, the short links generated by the system should be highly unpredictable. This ensures that the next-in-line short URL is not serially produced, eliminating the possibility of someone guessing all the short URLs that the system has ever produced or will produce.

This requirement would seem to call for truly random shortened URLs drawn from a very large set of possibilities. However, the appearance of security is just that: an appearance. Users should not be encouraged to think that, just because a shortened URL is difficult to guess, it is somehow secure. For example, suppose a user stored some sensitive material behind a shortened URL. There are still numerous ways the shortened URL could “leak out” to someone the user never intended to have it.

So, I regard it as sufficient to generate shortened URLs in a pseudorandom sequence. The value of doing that is not security. A pseudorandom sequence improves memorability, e.g. compared to successive URLs all starting with the same adjective.

Conclusion

The above requirements are not particularly compelling and the more I think about URL shorteners, the more I dislike them. However, I am interested in some of the algorithms involved, so I'm going to pretend that a better URL shortener is actually needed and move on to examine the algorithms.

This predicament is not entirely unlike situations I faced in my career. Sometimes I found myself working on a project where there were some, or maybe multiple, aspects that I didn't find convincing. But either because of the software fashion industry or some other reason, I had to deal with situations where I felt out of step with the organisation.

Sometimes the solution was to jump ship to another project. But other times, I found some aspect of the project or some component that was valuable and/or interesting and focussed my attention on that.

Next step

In the next post, I'll look at the design, by which I mean the structure of the code to meet the above requirements.

#OpenDevelopment


List of posts in this series

This is the most mathematical post I expect to write in this series and (I sincerely hope) you don't need to read it in order to follow the rest of the series. So if you have a maths allergy, skip to the next post.

For anyone still remaining, strap in for a little maths, at approximately A level¹ or early university level.

Recap

In the previous post, I mentioned that the current URL shortener algorithm, while completely adequate for my practical use, has some unfortunate properties. In particular, the performance tails off badly as a large proportion of available shortened links is used up. Here's a graph that gives an indication of the problem:

[Graph: random cell algorithm cost]

There is a design issue in the current algorithm: a URL suffix is chosen at random from some 26,000 possibilities, but if the chosen suffix has already been used, the choice needs to be repeated. Once a large proportion of the 26,000 possibilities is in use, there is a high probability of failure and a very high CPU cost per shortening request (think denial of service).

The underlying maths

The number of attempts needed to shorten a URL successfully remains low until few possibilities remain, and the average across the whole sequence is just a little over 10, but ultimately the figure rises rapidly, as shown in the graph. But where do the figure of 10 and the shape of the graph come from?

In general, if there are n possible shortened URLs, the number of attempts in shortening n URLs using the above algorithm turns out (see below) to be nHₙ where Hₙ is the nth Harmonic number (the sum of the reciprocals of all the numbers in the range 1..n).

When n is 26,000, the total number of attempts is over 279,000, since H₂₆₀₀₀ is approximately 10.743. Another way of putting this is that, on average, each shortened URL needs just over 10 attempts.

Deriving the formula nHₙ

The number of attempts required when k out of n shortened URLs are already taken is n/(n-k). The total number of attempts for shortening n URLs is the sum of this term with k ranging from 0 to n-1, and that turns out to be nHₙ. If you'd like more detailed working, please see below.
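
In compact form, the expectation argument is (my summary; the fuller working appears below):

When $k$ of the $n$ possibilities are taken, each attempt succeeds with
probability $\frac{n-k}{n}$, so the number of attempts is geometric with
mean $\frac{n}{n-k}$. Summing over $k$:
\[
\sum_{k=0}^{n-1} \frac{n}{n-k}
  = n \sum_{k=0}^{n-1} \frac{1}{n-k}
  = n \left( \frac{1}{1} + \frac{1}{2} + \cdots + \frac{1}{n} \right)
  = n H_n .
\]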

[Figures: detailed working for the nHₙ result, formatted with Typst.]

Reminder

Like I said in the previous post, I want this series to be done in the open. So if you have feedback on this post or any of the others, I'd love to hear it. You can contact me using the link below.

Postscripts

  1. My good friend Steve Powell derived the nHₙ result independently. He also spotted a transcription error — a pair of missing parentheses — in the second sheet of working above (now fixed). Thank you Steve!

  2. The nHₙ result turned out to be the solution to the “coupon collector's problem”. The corresponding wikipedia article has a more compact derivation of the result (under “Calculating the expectation”).

  3. I replaced the handwritten working with formatted versions using Typst. I also wrote up the experience.

#OpenDevelopment #Mathematics

Footnote:

¹ For American readers, I gather high school AP courses are similar to A levels.
