My choice for the Zettel ID
I thought it a good idea to document my choice for the Zettel ID and to make it a post to see whether others agree with my thinking.
Initially I planned to use UUIDv4 encoded as base58, which gives shorter IDs like
5yiKADArgWL6xjMPkzg3wz. But then I stumbled upon Zettelkasten and read the post on using GUIDs for Zettel IDs. It was a bit of an eye-opener: my Zettel IDs also have plenty of use outside the tool in which the Zettels are written, since they form the link between them. Using Zettel IDs on physical paper, or in a digital environment where you have no clipboard or other easy way to copy them, shows the value of short Zettel IDs. If I had to pen down or retype those 22 base58 characters, it would become a pain and would likely keep me from doing so as often as I should. So I set out to design myself a better Zettel ID. Given the above, I had the following requirements. The Zettel ID must be:
- Unique, but only to me. They are personal IDs. Even if I generate them in bulk, I am the only "actor", so there is no problem of IDs being requested simultaneously.
- Low-friction to copy. This could be achieved by making them short, or just easy to remember, and unambiguous.
A simple counter (e.g. the 407th Zettel) would fit the bill on both requirements, but I did not like it; more on this later.
Sticking with random IDs, I thought about simply truncating my 22-character random ID, much like the short commit IDs used in Git (version control). The problem with short hashes is that the likelihood of collisions increases sharply. This is the basis of the birthday attack, which builds on the birthday problem. From it we can estimate how many IDs can be generated before the chance of a collision reaches a given probability, with the following formula:
sqrt(2 * (2^(log2(B) * C)) * ln(1 / (1 - P))). Here
B is the base (e.g. decimals are base10, hexadecimals are base16, alphanumerics are base62 (10 + 26 + 26)),
C is the number of characters used, and P is the probability (e.g. 0.01 = 1% and 0.5 = 50%). Note that 2^(log2(B) * C) is simply B^C, the number of possible IDs. Filling this in for Git short IDs of 7 characters gives us
sqrt(2 * (2^(log2(16) * 7)) * ln(1 / (1 - 0.01))) = approx. 2322.8717, so after roughly 2300 Git commits (or Zettels if you will) there is a 1% chance of a collision. That would not give me peace of mind in a Zettelkasten that is meant to grow big.
What about using base58 rather than base16 (i.e. hexadecimals)? That would give us
sqrt(2 * (2^(log2(58) * 7)) * ln(1 / (1 - 0.01))) = approx. 210670.28, so roughly 210k unique IDs need to be generated before there is a 1% chance of a collision. Not bad, but remember that this is only the point where the chance reaches 1%; a collision might happen well before that. As an example, the Linux kernel apparently already needs short hashes of length 12 to remain unique, even though they are at roughly 900k commits and
sqrt(2 * (2^(log2(16) * 12)) * ln(1 / (1 - 0.01))) = approx. 2378620.6, so in this case a collision happened way before the 1% mark of roughly 2400k.
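The three estimates above can be reproduced with a few lines of Python. This is only a sketch of the formula as written in this post; the function name `ids_before_collision` is made up for illustration.

```python
import math


def ids_before_collision(base: int, length: int, probability: float) -> float:
    """Approximate number of random IDs that can be generated before the
    chance of at least one collision reaches `probability`."""
    id_space = base ** length  # identical to 2^(log2(B) * C) in the formula
    return math.sqrt(2 * id_space * math.log(1 / (1 - probability)))


if __name__ == "__main__":
    print(ids_before_collision(16, 7, 0.01))   # Git short IDs, approx. 2322.87
    print(ids_before_collision(58, 7, 0.01))   # 7 base58 chars, approx. 210670.28
    print(ids_before_collision(16, 12, 0.01))  # 12 hex chars, approx. 2378620.6
```

Raising `probability` to 0.5 shows how quickly the 50% mark follows: it is only about 8 times the 1% figure, whatever the base and length.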
We could increase the length, making the chance of a collision so astronomically small that there is no point in worrying about it, but that goes against the low-friction requirement. A counter or a timestamp has the nice property that you do not have to worry about this, provided you properly keep track of the count or, in the case of a timestamp, make sure a new one is only created once the time unit you chose for your timestamps has passed.
So why not a counter? With length 3 base58 IDs, there would be
58^3 = 195112 unique IDs, likely enough for most Zettelkasten. Well, a counter requires state of some kind. That could mean remembering the count yourself or keeping track of it within the Zettelkasten application. In the first case it could become a pain, and you might lose track if you haven't used your Zettelkasten in a while. I personally would not want to remember such a thing just for the sake of having unique IDs. In the second case you would need some centralization/synchronization to keep proper track of it. What if you work on your Zettelkasten on your laptop without internet, come home, work on it on your desktop, and only then remember you did some offline work? That is a recipe for collisions right there.
Thinking some more about counters, I realized that we already have a global counter that is synchronized between devices and works well even when offline: timestamps! This is hardly a new insight, but I just had not viewed them as a globally synchronized counter before. So is my conclusion to just use IDs like the ones used in The Archive, i.e.
202004162116? They would not be a good fit for my use case: I am a programmer, I foresee myself generating Zettels programmatically in the future, and a minute-based timestamp would not be unique enough. I also think they are rather long. We are mentally trained to remember dates and times, which makes them somewhat less of a burden, but they are formatted in such a way that we mostly lose this benefit, and it takes getting used to before you can decode them quickly. And there is not just one way to format a timestamp: there could be use in keeping track of week numbers or the day of the week (something I find more useful for nearby dates), yearly quarters or seasons, school weeks, etc. The usefulness of one representation might depend on your current occupation or other things, so it is unlikely to stay static.
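To illustrate why a minute-based timestamp is not unique enough for generated Zettels, here is a small sketch. The helper `minute_id` and the ten simulated Zettels are hypothetical; the only thing taken from above is the YYYYMMDDhhmm shape of the ID.

```python
from datetime import datetime, timedelta


def minute_id(moment: datetime) -> str:
    """Format a moment as YYYYMMDDhhmm, the shape of 202004162116."""
    return moment.strftime("%Y%m%d%H%M")


# Ten hypothetical Zettels created by a script, one second apart.
start = datetime(2020, 4, 16, 21, 16, 0)
ids = [minute_id(start + timedelta(seconds=i)) for i in range(10)]

print(ids[0])         # 202004162116
print(len(set(ids)))  # 1, because all ten Zettels end up with the same ID
```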
At some point I planned to use
YYMMDD followed by the day of the week (I find myself asking that a lot), followed by a base58 encoding of every 8th millisecond. However, with a clearly readable date/time you can be assured you will use this information when looking at Zettels, like: ah, it was the Zettel from July 2017. This might be useful, but I like what Sasha keeps repeating, that we should aim for meaningful associations/links. The timestamp in itself does not tell you anything about the time relation: did you create your initial draft at that time, did you just think of a potential association but lack the time or knowledge to write about it, or was it just a structural placeholder? Who knows!
So with timestamps I would add the following requirements:
- Use timestamps only as a global counter; do not represent the ID in any date/time format.
- Make the resolution of the timestamp fine enough that it does not get in the way when IDs are generated by programs, e.g. when generating Zettels.
The most straightforward thing would be to use a millisecond timestamp, like
9198325815, but that does not meet the requirement of low friction to copy. But what if we base58 encode it, like
1f1PPNp? We end up with a short ID that looks randomly generated while still having the nice properties of timestamps. And if for some reason we do want to know the creation time of the ID, it's an encoding, so we can simply decode it to get the timestamp back. I generally like fixed-length IDs, so what length should we use? For that I used the following formula:
log2(365.2425 * 24 * 60 * 60 * 1000) / log2(58) = approx. 5.9537986, basically the bits needed to hold the millisecond timestamp of one year divided by the bits per base58 character. This means 6 characters cover only a little more than one year, so the length will have to be at least 7. This gives us the formula
log2(N * 365.2425 * 24 * 60 * 60 * 1000) / log2(58) = 7 where we are interested in N, so I put it in Wolfram Alpha, which gives us
N == 69.9682, so roughly 70 years. Enough for my remaining life.
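The same numbers can be checked without Wolfram Alpha. The sketch below solves the formula for N directly as 58^7 divided by the milliseconds in a year; the function names are made up for illustration.

```python
import math

# Milliseconds in an average Gregorian year, as used in the formulas above.
MS_PER_YEAR = 365.2425 * 24 * 60 * 60 * 1000


def chars_needed(years: float, base: int = 58) -> float:
    """Characters needed to hold a millisecond counter spanning `years`."""
    return math.log2(years * MS_PER_YEAR) / math.log2(base)


def years_covered(length: int, base: int = 58) -> float:
    """Years of millisecond timestamps that fit in `length` characters."""
    return base ** length / MS_PER_YEAR


print(chars_needed(1))   # approx. 5.954
print(years_covered(6))  # approx. 1.2
print(years_covered(7))  # approx. 69.97
```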
So what do you think, would such Zettel IDs (e.g.
1f1PPNp) be any good?
- They do not win the prize for shortest, but they should still be short enough not to keep you from copying them by hand if need be.
- They have the benefits that timestamps give, i.e. a globally synchronized counter.
- They contain the time to the millisecond if need be.
- They do not let you make meaningless associations.
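For reference, a minimal sketch of how such IDs could be encoded and decoded. Two things here are assumptions rather than part of the scheme as described: the alphabet order (digits, then lowercase, then uppercase, which is the order that maps 9198325815 to 1f1PPNp), and the epoch, which is left as a parameter since a counter starting near zero implies some custom starting date.

```python
import time

# Assumed alphabet order: digits, lowercase, uppercase; no 0, l, I or O.
ALPHABET = "123456789abcdefghijkmnopqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 58
ID_LENGTH = 7


def encode(ms: int, length: int = ID_LENGTH) -> str:
    """Encode a millisecond count as a fixed-length base58 string."""
    if not 0 <= ms < BASE ** length:
        raise ValueError("timestamp does not fit in the requested length")
    chars = []
    for _ in range(length):
        ms, remainder = divmod(ms, BASE)
        chars.append(ALPHABET[remainder])
    # Digits were produced least significant first; "1" acts as zero padding.
    return "".join(reversed(chars))


def decode(zettel_id: str) -> int:
    """Turn an ID back into the millisecond count it was made from."""
    value = 0
    for char in zettel_id:
        value = value * BASE + ALPHABET.index(char)
    return value


def new_id(epoch_ms: int) -> str:
    """ID for the current moment, counted from a hypothetical custom epoch
    given in Unix milliseconds."""
    return encode(time.time_ns() // 1_000_000 - epoch_ms)


print(encode(9198325815))  # 1f1PPNp
print(decode("1f1PPNp"))   # 9198325815
```

Because the IDs have a fixed length and the alphabet is in ascending order of value, IDs created later always decode to larger numbers, so the counter property is kept.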
Any big downsides I have missed? Or improvements that could be made keeping my use case in mind (potentially programmatically create Zettels)?