Zettelkasten Forum


My choice for the Zettel ID

I thought it a good idea to document my choice for the Zettel ID and to make it a post to see whether others agree with my thinking.

Initially I planned to use UUIDv4 encoded as base58, which gives shorter IDs like 5yiKADArgWL6xjMPkzg3wz. Then I stumbled upon Zettelkasten and read the post about using GUIDs for Zettel IDs. It was a bit of an eye opener: my Zettel IDs are also very useful outside of the tool in which the Zettels are written, forming a link between them. Using Zettel IDs on physical paper, or in a digital environment where you have no clipboard or other easy way to copy them over, shows the value of short Zettel IDs. If I had to write down or retype those 22 base58 characters, it would become a pain and would likely keep me from doing so as often as I maybe should. So I set out to design a better Zettel ID, with the following requirements:

  • It must be unique, but only to me. These are personal IDs. Even if I generate them in bulk, I am the only "actor", so IDs are never requested simultaneously.
  • The friction to copy it over should be small. This can be achieved by making it short, or just easy to remember, and unambiguous.

A simple counter (e.g. the 407th Zettel) would fit the bill on both requirements, but I did not like it; more on this later.

Keeping with random IDs, I thought about simply shortening my 22-character random ID, thinking of the short commit IDs used in Git (version control). The problem with short hashes is that the likelihood of collisions increases severely. This follows from the birthday problem, the basis of the birthday attack. With it we can estimate how many IDs it takes before a collision reaches a given probability, using the following formula: sqrt(2 * (2^(log2(B) * C)) * ln(1 / (1 - P))). Here B is the base (e.g. decimals are base10, hexadecimals are base16, alphanumerics are base62 (10 + 26 + 26)), C is the number of characters used, and P is the collision probability (e.g. 0.01 = 1% and 0.5 = 50%). Filling this in for Git short IDs gives us sqrt(2 * (2^(log2(16) * 7)) * ln(1 / (1 - 0.01))) = approx. 2322.8717, so after roughly 2300 Git commits (or Zettels if you will) there is a 1% chance of a collision. That would not give me peace of mind in a Zettelkasten that is meant to grow big.
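
The formula above is easy to check in a few lines of Python. This is only a sketch; the function name birthday_bound is made up for illustration, and base ** length is used as the equivalent of 2^(log2(B) * C).

```python
import math

def birthday_bound(base: int, length: int, probability: float) -> float:
    """Approximate number of random IDs at which the chance of
    at least one collision reaches `probability`."""
    space = base ** length  # same value as 2^(log2(B) * C)
    return math.sqrt(2 * space * math.log(1 / (1 - probability)))

# Git short IDs: 7 hexadecimal characters, 1% collision chance.
print(round(birthday_bound(16, 7, 0.01), 1))  # approx. 2322.9
```

Changing the base, the length, or the probability lets you try other ID schemes the same way.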

What about using base58 rather than base16 (i.e. hexadecimals)? That would give us sqrt(2 * (2^(log2(58) * 7)) * ln(1 / (1 - 0.01))) = approx. 210670.28, so roughly 210k unique IDs need to be generated before there is a 1% chance of a collision. Not bad, but remember that this is only the point where the chance reaches 1%; a collision might happen before that. To give an example, the Linux kernel apparently already needs short hashes of length 12 to remain unique, even though it is at roughly 900k commits and sqrt(2 * (2^(log2(16) * 12)) * ln(1 / (1 - 0.01))) = approx. 2378620.6. So in this case a collision happened way before the 1% mark of roughly 2400k.
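
To see how the risk builds up before the 1% mark, the same approximation can be turned around to give the collision probability for a given number of IDs. A minimal sketch; the function name collision_probability is an illustrative assumption.

```python
import math

def collision_probability(base: int, length: int, count: int) -> float:
    """Approximate chance that `count` random IDs of the given base and
    length contain at least one collision (birthday approximation)."""
    space = base ** length
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-(count * count) / (2 * space))

# 7-character base58 IDs at various Zettelkasten sizes.
for count in (1_000, 10_000, 100_000, 210_670):
    print(count, f"{collision_probability(58, 7, count):.6f}")
```

The last line lands at about 1%, matching the number above; the earlier lines show that the chance is small but never zero.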

We could increase the length, making the chance of a collision so astronomically small that there is no point in worrying about it, but that goes against our requirement of short IDs. A counter or a timestamp has the nice property that you do not have to worry about this, provided you properly keep track of the count or, in the case of the timestamp, make sure new ones are only created after the time frame you defined for your timestamps.

So why not a counter? With length 3 base58 IDs there would be 58^3 = 195112 unique IDs, likely enough for most Zettelkasten. Well, a counter requires state of some kind. That could mean remembering the count yourself or keeping track of it within the Zettelkasten application. In the first case it could become a pain, and you might lose track if you haven't used your Zettelkasten in a while for some reason. I personally would not want to remember such a thing just for the sake of having unique IDs. In the second case you would need some centralization/synchronization to keep proper track of it. What if you work on your Zettelkasten on your laptop without internet, come home, and work on your Zettelkasten on your desktop, only to remember later that you did some offline work? That is a recipe for collisions right there.

Thinking some more about counters, I realized that we already have a global counter that is synchronized between devices and works well even when offline: timestamps! This is hardly a new insight, but I just had not viewed them as a globally synchronized counter before. So is my conclusion to just use IDs like those used in The Archive, i.e. 202004162116? They would not be a good fit for my use case: I am a programmer and foresee myself generating Zettels programmatically in the future, and a minute-based timestamp would not be unique enough for that. I also think they are rather long. We are already mentally trained to remember dates and times, which makes that somewhat less of a burden, but they are formatted in such a way that we mostly lose this benefit and need to get used to them before we can decode them quickly again. And there is not just one way to format a timestamp: there could be use in keeping track of week numbers or the day of the week (something I find more useful for nearby dates), yearly quarters or seasons, school weeks, etc. The usefulness of one representation might also differ depending on your current occupation or other things, so it is likely not to be static.
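
To illustrate why a minute-based timestamp is not unique enough for generated Zettels, here is a small sketch. The function name minute_id is an assumption for illustration; only the YYYYMMDDHHMM format comes from the example ID above.

```python
from datetime import datetime

def minute_id(moment: datetime) -> str:
    """Format a moment as a YYYYMMDDHHMM ID, e.g. 202004162116."""
    return moment.strftime("%Y%m%d%H%M")

# Two Zettels created 43 seconds apart, within the same minute.
first = minute_id(datetime(2020, 4, 16, 21, 16, 5))
second = minute_id(datetime(2020, 4, 16, 21, 16, 48))
print(first, second, first == second)  # 202004162116 202004162116 True
```

Any program that creates more than one Zettel per minute would run into this.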

At some point I planned to use 2004162DArG, being YYMMDD followed by the day of the week (I find myself asking that a lot), followed by a base58 encoding of every 8th millisecond. However, with a clear date/time you can be sure you will be using this information when looking at Zettels, like, ah, it was the Zettel from July 2017. This might be useful, but I like what Sascha keeps repeating, that we should aim for meaningful associations/links. The timestamp in itself does not tell you anything about the time relation: did you create your initial draft at that time, or did you just think of a potential association but lack the time or knowledge to write about it, or was it just a structural placeholder? Who knows!

So with timestamps I would add the following requirements:

  • Use timestamps only as a global counter, do not represent the ID in any date/time format.
  • Make the scale of the timestamp small enough so that they do not inhibit their use when generating them in programs, e.g. when generating Zettels.

The most straightforward thing would be to use a millisecond timestamp, like 9198325815, but that does not fit the requirement of low friction when copying. What if we base58 encode it, like 1f1PPNp? We end up with a short ID that looks randomly generated while still having the nice properties of timestamps. And if for some reason we still want to know the creation time of the ID, it's an encoding, so we can simply decode it to get the timestamp back.

I generally like fixed-length IDs, so what length should we use? For that I used the following formula: log2(365.2425 * 24 * 60 * 60 * 1000) / log2(58) = approx. 5.9537986, basically the bits needed to hold the millisecond timestamp of a year divided by the bits per base58 character. This means 6 characters cover only slightly more than one year (58^6 milliseconds is about 1.2 years), so the ID has to be at least length 7. This gives us the formula log2(N * 365.2425 * 24 * 60 * 60 * 1000) / log2(58) = 7, where we are interested in N. I put it in Wolfram Alpha, which gives N == 69.9682, so roughly 70 years. Enough for my remaining life.
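
The encoding and the length calculation can be sketched as follows. Several things here are assumptions and not taken from the post: the Bitcoin-style base58 alphabet, the custom epoch of 2020-01-01 UTC, padding with the first alphabet character, and all names.

```python
import math
import time

# Assumption: the common Bitcoin-style base58 alphabet (no 0, O, I, l).
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
ID_LENGTH = 7
# Assumption: custom epoch 2020-01-01 00:00:00 UTC, in milliseconds.
EPOCH_MS = 1_577_836_800_000
MS_PER_YEAR = 365.2425 * 24 * 60 * 60 * 1000

def encode_id(timestamp_ms: int) -> str:
    """Encode milliseconds since EPOCH_MS as a fixed-length base58 string."""
    value = timestamp_ms - EPOCH_MS
    if not 0 <= value < 58 ** ID_LENGTH:
        raise ValueError("timestamp outside the range of a 7-character ID")
    chars = []
    for _ in range(ID_LENGTH):
        value, digit = divmod(value, 58)  # least significant digit first
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))

print(math.log(MS_PER_YEAR, 58))      # characters for one year, just under 6
print(58 ** ID_LENGTH / MS_PER_YEAR)  # years covered by 7 characters, about 70
print(encode_id(time.time_ns() // 1_000_000))  # an ID for "now"
```

With a different epoch or alphabet the IDs change, but the length and the roughly 70 year range stay the same.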

So what do you think, would such Zettel IDs (e.g. 1f1PPNp) be any good?

  • They do not win the prize for shortest, but they should still be short enough to not inhibit you from copying them manually if need be.
  • They have the benefits that timestamps give, i.e. a globally synchronized counter.
  • They contain the time to the millisecond if need be.
  • They do not let you make meaningless associations.
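
Getting the time back out of such an ID is just the reverse of the encoding. A minimal sketch, again assuming the Bitcoin-style base58 alphabet and a custom epoch of 2020-01-01 UTC (neither is stated in the post), so the printed time for the sample ID is not the real creation time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the common Bitcoin-style base58 alphabet (no 0, O, I, l).
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
# Assumption: custom epoch, not taken from the post.
EPOCH = datetime(2020, 1, 1, tzinfo=timezone.utc)

def decode_id(zettel_id: str) -> datetime:
    """Decode a base58 ID into the moment it encodes (millisecond precision)."""
    value = 0
    for char in zettel_id:
        # ALPHABET.index raises ValueError for characters outside base58.
        value = value * 58 + ALPHABET.index(char)
    return EPOCH + timedelta(milliseconds=value)

print(decode_id("1f1PPNp").isoformat())
```

Because decoding needs nothing but the alphabet and the epoch, both are worth writing down somewhere in the Zettelkasten itself.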

Any big downsides I have missed? Or improvements that could be made keeping my use case in mind (potentially creating Zettels programmatically)?

Comments

  • Lovely explanation of the thinking behind this! The result looks like an alphanumeric counter I saw the other day:

    The names of the files are simply three-character IDs starting from 000 and including numbers as well as lowercase characters (e.g., 0b9.md is my note about the city of Macau). This seems to work well for me.

    With this schema I can create a total of (10 + 26)³ = 45e3 notes, which sounds adequate. If I ever reach that limit, that probably means I've been very successful with my Zettelkasten; I can, at that point, adopt uppercase characters, which would increase the namespace to 238e3 notes. As of 2020-04-11 I'm still far from that, though: the last note I've created is 0r1.md (the 974th note).

    I do agree that a counter is inferior in principle, because you have to manually resolve conflicts (maybe not often, but again: in principle) and "remember" the latest ID so you can increment it properly. You may accidentally introduce gaps. (Which are to be expected with timestamps, but my OCD wouldn't be able to stand gaps in a counter :))

    A context-free ID generator is better, yes. That's why we went for timestamps.

    A context free generator can also be the output of the uuid program from the command-line, and it can be an encoded timestamp to save some space. You can type timestamps on any device though, without having to have an ID generator tool handy.

    That's why Sascha and I stick to timestamps, come hell or high water: you cannot beat its simplicity. Five-year-olds can generate the ID for you, provided you taught them how to read the calendar and clock.

    Every computational algorithm on top is a potential cause for failure when the tool doesn't work anymore, your computer breaks down, you take notes on mobile, etc. -- If you assume none of this happens and you work from 1 computer exclusively for all of eternity, most arguments pro and contra UUIDs, complex ID generators, and even counters won't matter, because in that case you don't plan for eventualities anyway and can stick to whatever ID generator you came up with.

    You might as well use the counter, then.

    When you have to rely on a computer program to create new IDs for you anyway, it doesn't matter how you reach that goal. If saving space is no. 1 priority, and depending on an ID generator is absolutely acceptable, counters as implemented in the link I shared save even more screen real estate.


    Most of this puzzles me, though.

    I think that if timestamps get on someone's nerves soooo much, then simply confine the ID to the note content (in a way that clearly marks it as the ID, not just a link) and leave it out of the file name. You will not have redundant access to the ID from a file listing, yes, but how often do you browse your notes from the command line or web interface anyway? -- And the information is not lost, it's just not encoded in the file name anymore. If you change your mind, you can automate bulk renaming your notes.

    Author at Zettelkasten.de • https://christiantietze.de/

  • As @ctietze points out, timestamp-based IDs have the HUGE advantage that they can be generated manually, w/o the need for a machine. However, for my own app, and similar to your needs, I've needed millisecond precision to allow for automated ID generation (e.g. on import) w/o the need for manual intervention. Still, the IDs shouldn't get too long, and should stay somewhat readable/recognizable. I'm thus base32-encoding the millisecond timestamps for my app. For more, see this forum comment and the preceding comments.

    I plan to also support regular timestamps (w/o the encoding). This will allow users to handcraft IDs if necessary (e.g. when the Zettel note was started in a different app).

  • @ctietze Well, that is the nice thing about using an encoded timestamp: I do not lose the ability to determine the ID at a later point as long as a timestamp is available somewhere. It could be handwritten, but most note apps, even on the phone, track the creation time. And I consider those kinds of notes to be drafts, in which I am unlikely to embed links already, as that would be inconvenient on a smartphone or on a small piece of paper. I generally only take short draft notes on my phone that I later work out, or they are throwaways (shopping list) and never get close to my Zettelkasten. However I do plan to just create a simple website that just displays and copies an ID to the clipboard, which I try to make work on my phone even in offline mode (I believe such things are possible).

    I like your point about the complexity of producing IDs and their implications, but the algorithm used in my case is just a simple base encoding, which is a trivial algorithm and base58 is well documented, so it is not as bad as you make it sound. I and others could easily reproduce it elsewhere. However your point about the dangers of overly complex ID generation is clear.

    Personally I consider the file system just to be a database, in the case of Zettelkasten at least, so my file names only contain their IDs. I do give them extensions, but only to make them more convenient to work with in editors.

    @msteffens Nice to see others with similar needs come up with a similar ID! My OCD / aesthetics won't allow for mixed ID formats, so in my situation I would probably convert them automatically when importing them in my Zettelkasten.

  • @grayen said:
    Nice to see others with similar needs come up with a similar ID! My OCD / aesthetics won't allow for mixed ID formats, so in my situation I would probably convert them automatically when importing them in my Zettelkasten.

    Yes, I'd also convert it on the fly. I just meant to allow for it as a valid input format.

  • @grayen said:
    I do plan to just create a simple website that just displays and copies an ID to the clipboard, which I try to make work on my phone even in offline mode (I believe such things are possible).

    Thought the same :) I think that's a sensible approach. It's also confusingly elaborate to have a "web service" to generate Ids for you. These modern times, I tell ya ...

    I like your point about the complexity of producing IDs and their implications, but the algorithm used in my case is just a simple base encoding, which is a trivial algorithm and base58 is well documented

    I think you downplay the problem when you call it "trivial". It's not trivial for humans. Reading the clock is simple for a 5yo, as I said, but I wouldn't even rely on human output by fellow programmers without double-checking their result when they base58-encode anything manually/on paper. See the reference with some readable code, and the IETF algorithm definition

    For tech-savvy folks who always work on their computers, I mean, go ahead and use different ID formats: I think you folks can dig yourself out of any such holes later because you are competent enough. It's just bad advice for a layperson, I think. I didn't make it clear that my critique is purely academic, inasmuch as I don't understand why people have a problem with date/time IDs, really, and there usually is no disclaimer about assumptions made ("I will 100% of the time work on my laptop, never on mobile, never from the web, never from an IBM DOS PC from 198x ..." etc.).

    The good thing is that base58-encoded date/time strings can be decoded later as well, should anyone change their mind :)

    Author at Zettelkasten.de • https://christiantietze.de/

  • @ctietze said:
    Thought the same :) I think that's a sensible approach. It's also confusingly elaborate to have a "web service" to generate Ids for you. These modern times, I tell ya ...

    Haha, thinking about it, I could also just display them on my smartwatch, modern times indeed...

    I think you downplay the problem when you call it "trivial". It's not trivial for humans. Reading the clock is simple for a 5yo, as I said, but I wouldn't even rely on human output by fellow programmers without double-checking their result when they base58-encode anything manually/on paper. See the reference with some readable code, and the IETF algorithm definition

    Completely agree, that was my skewed mindset as a programmer speaking. It's trivial if you happen to have enough experience as a programmer, so actually, not that trivial.

    And to make things worse, I am using a different base58, where lowercase characters come first. I do this because this matches the human friendly sort order, which is used when listing files. That way they remain in chronological order.

    For tech-savvy folks who always work on their computers, I mean, go ahead and use different ID formats: I think you folks can dig yourself out of any such holes later because you are competent enough. It's just bad advice for a layperson, I think. I didn't make it clear that my critique is purely academic, inasmuch as I don't understand why people have a problem with date/time IDs, really, and there usually is no disclaimer about assumptions made ("I will 100% of the time work on my laptop, never on mobile, never from the web, never from an IBM DOS PC from 198x ..." etc.).

    I agree, it's much safer to use human-producible (good luck with producing seconds since 1970 :wink:) timestamps for laypersons. If it is not future proof, I think we have bigger problems at that time.

    The good thing is that base58-encoded date/time strings can be decoded later as well, should anyone change their mind :)

    That is what makes me comfortable about using them, the information about the time is still available if I wish for it, unlike when simple counters are used.
