If you're not using date-based IDs, you're doing it wrong

pat · December 2020

I want to make a case for why you should use date-based IDs in your file names, unless you have a really, REALLY good reason not to. In fact, I’ll go so far as to say that using date-based IDs is the only correct approach to note-taking in 99.999% of cases.

Every note needs a unique ID. This is one of the core zettelkasten principles. The ID exists to give you a handle to the note, so you can find it. Simple enough.

For our purposes, date-based IDs mean the kind used by The Archive: a filename prefix. The power of date-based IDs is actually the power of filename prefixes, although the date information does have some fringe benefits.

In short, filename prefixes are a simple, fast way to find content.

Feel free to stop reading if you don’t care to know why

Performant search generally requires the use of an index. The software constructs an index, and begins by searching it. If it needs to search the content directly, it can then do so.

It’s the difference between searching for information in a book whose title and cover you recognize, versus searching for information in a stack of papers of the equivalent text. The index (book title) helps you constrain the search space. Books even usually provide an… index! so that you can find the information you’re looking for even faster.

Any filename-based ID requires a smaller search space than the associated text does. The total length of your filenames is a small fraction of the total length of your content. Here’s an illustration from the 10,000 markdown files repo:

$ ls 10000\ markdown\ files/ | wc -c
  263876

$ cat 10000\ markdown\ files/* | wc -c
-bash: /bin/cat: Argument list too long # cat won't even attempt it!

$ gdu -b 10000\ markdown\ files
46955871    10000 markdown files

Right off the bat, you can see that a naive search algorithm would have at most 263 KB of data to search using file names, compared to 45 MB of data in text content. The filenames are only 0.6% of the total size of the content.
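The same comparison can be made without hitting the shell's argument-list limit. What follows is only a sketch: the function name `index_vs_content_bytes` is made up for illustration, and it counts one extra byte per file name to mimic the newline that `ls` prints. The content total will differ slightly from `gdu -b`, which also counts the directory entry itself.

```ruby
# Compare the total size of the file names in a folder with the total
# size of the file contents. `dir` is any folder of notes.
def index_vs_content_bytes(dir)
  paths = Dir.children(dir)
             .map { |name| File.join(dir, name) }
             .select { |path| File.file?(path) }

  # +1 per name stands in for the newline that `ls | wc -c` counts.
  name_bytes = paths.sum { |path| File.basename(path).bytesize + 1 }
  content_bytes = paths.sum { |path| File.size(path) }

  {
    names: name_bytes,
    content: content_bytes,
    ratio: content_bytes.zero? ? nil : name_bytes.fdiv(content_bytes)
  }
end

# Example (hypothetical path):
#   stats = index_vs_content_bytes('10000 markdown files')
#   puts format('%.1f%%', stats[:ratio] * 100)
```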

Here’s how long it takes to search that repo for a unique string:

$ time ls 10000\ markdown\ files/yankee*
10000 markdown files/yankee loranthus.md
real    0m0.030s
user    0m0.019s
sys     0m0.011s

$ time ack yankee 10000\ markdown\ files/
real    0m1.566s
user    0m0.981s
sys     0m0.554s

That is admittedly the worst-case scenario for full-text search, because the search set doesn’t contain the term. I’m not going to get into search algorithm complexity here. The text search takes roughly 50 times as long as the filename search (1.566 s versus 0.030 s). The difference is clear.

Really, any form of file-based unique ID will be sufficient for our note-taking purposes. Still, there is a little bit more performance to be gained from using filename prefixes. With a 10-byte ID, the search space shrinks to 100 KB - 38% of the full filename space. It’s not enough to make a difference for `ls` though:

$ time ls 10000\ markdown\ files/*loranthus*
10000 markdown files/openhearted genus loranthus.md  10000 markdown files/yankee loranthus.md

real    0m0.032s
user    0m0.020s
sys     0m0.011s

Filename prefixes aren’t just fast, they’re dirt simple to use. Here’s the equivalent lookup in Ruby:

Dir['yankee*']

If you want widespread tool support, you need portable conventions. Filename prefixes are fast, portable, and simple to work with. Tools can benefit from relying on the ID being represented in the first 10 characters of the file name. The regular expression for date-based IDs is child’s play: `/\d{10}/`
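As a rough sketch of how a tool might rely on that convention: the helper names `note_id` and `find_by_id` are hypothetical, and the pattern is anchored to the start of the name (`\A`), which is an assumption added here so that digits elsewhere in a file name are not mistaken for the ID. The 10-digit width follows the post; adjust it to whatever your IDs actually use.

```ruby
# Anchored version of the ID pattern: exactly 10 digits at the very start
# of the file name. The width is taken from the post, not from any tool.
ID_PREFIX = /\A\d{10}/

# Return the ID at the start of a file name, or nil if there is none.
def note_id(path)
  File.basename(path)[ID_PREFIX]
end

# Return the names of the files in `dir` that start with `id`.
def find_by_id(dir, id)
  Dir.glob("#{id}*", base: dir)
end
```

Because the ID is always in the same place, neither helper needs to open a file.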

Filename prefixes leave room for metadata - your personal index

We’ve already seen how the filename prefixes serve as a fast index to your notes. Once you construct an index using a fixed part of the filename (the first 10 characters) and/or a specific pattern (a 10 digit sequence), you can reliably find that note.

You now have the freedom to add any additional keywords you want, to assist you in finding that note. They can be unique keywords - human-readable aliases - that you use to locate the note. You can enforce uniqueness by convention, or with tool-support. You can also use non-unique keywords, to return sets of notes - exactly like the index at the back of a book.

Search now takes place in three stages:

ID - as fast as it gets; there can be only one!
keywords - still quite fast; will only return the notes that have those keywords
full-text search - slowest; casts a wider net, and will return the most results

With proper tooling, you can apply those search stages to constrain the number of results you get. Regardless of whether you choose to work that way, the computer will benefit from that approach.
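One possible reading of those three stages, sketched in Ruby. The function name `staged_search` and the fall-through behaviour (a later stage runs only when the earlier one finds nothing) are assumptions made for illustration; a real tool could just as well combine the stages as filters.

```ruby
# Three-stage lookup: ID prefix, then file-name keywords, then full text.
# Returns the stage that produced a hit along with the matching file names.
def staged_search(dir, query)
  names = Dir.children(dir)
             .select { |name| File.file?(File.join(dir, name)) }
             .sort

  # Stage 1: the query is an ID (or the start of one).
  by_id = names.select { |name| name.start_with?(query) }
  return [:id, by_id] unless by_id.empty?

  # Stage 2: the query appears anywhere in the file name.
  needle = query.downcase
  by_keyword = names.select { |name| name.downcase.include?(needle) }
  return [:keyword, by_keyword] unless by_keyword.empty?

  # Stage 3: open every file and look inside. Slowest, widest net.
  by_text = names.select do |name|
    File.read(File.join(dir, name)).scrub.downcase.include?(needle)
  end
  [:full_text, by_text]
end
```

Only the last stage touches file contents; the first two work on the directory listing alone.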

Most importantly, you can continually update the index as you grow in understanding your content, and your links will never break.

You can still have the power that other tools provide

You might think that this approach sacrifices power, that you don’t get some of the fancy features that more sophisticated tools use.

I’m going to let you in on a little secret: those sophisticated tools are not doing anything differently. They're taking the same information, constructing indexes, and searching those indexes. They might use SQL databases to do it, but it’s fundamentally the same principle.

The difference is, they sacrifice the ability to do the simplest thing possible: to construct a powerful system using simple tools, and to gradually build more sophisticated functionality on top of that.

We’ve already seen other tools that use a similar approach, but for reasons that I can’t understand, have chosen not to support the fundamental principle of ID prefixes. Anyway, they demonstrate that programmers who apply this principle can deliver functionality that benefits you - with no effort on your part!

Why date-based IDs?

Everything I’ve said up to this point is rooted in fact. We need unique IDs, and it’s beneficial to implement them as filename prefixes. Now we take a little detour into the subjective.

Date-based IDs work well because they’re fairly logical and easy for humans to work with. That’s it. I can look at a list of IDs, and although I’m not figuring out all the dates in my brain, I can sense order in them. It’s not a collection of random characters, like short GUIDs.

I can manipulate date-based IDs. I don’t actually care to use them to indicate creation time - I have version control for that, if I really need to know. Instead, I can manipulate the order that files appear in other software - because lots of software relies solely on filename sorting. If I want certain notes to appear at the top of a list, I can set their IDs to be some time in the future or past, depending on my sorting preferences.

Essentially, date-based IDs provide a good tradeoff of enabling uniqueness for the computer’s sake, while still being relatively usable by humans.

Still not convinced?

Okay. I tried

ctietze · December 2020

This is an interesting angle, thanks for sharing the post!

Your actual point seems to be date-time based IDs are a must and keywords in the file name are a very good idea. What about a human-readable title? Optional? Leave out?

Why 10 characters? 202012301110 is 12 chars long, not 10; are you using a truncated timestamp?

Using keywords (or tags) in the filename to aid with filtering is an interesting idea! I wouldn't then care to read the file names in a listing, because that'd amount to a fair bit of noise, I imagine.

Before:

202005271519 Schumpeters Creative Destruction.txt
202010280809 Extend symbol web fonts with custom glyphs.txt
202010280907 Use web fonts for iconography in web design.txt
202010301446 Creative Destruction as commoditization of invention.txt
202010311046 Likes als Interaktion mit Respect und Disagree auf Nachrichtenplattformen ersetzen.txt
202011020957 Stir Espresso for even taste distribution and richer flavor.txt
202011031526 Low-level structure notes are like music playlists.txt

After adding hashtags to filename:

202005271519 Schumpeters Creative Destruction #market #kapitalismus.txt
202010280809 Extend symbol web fonts with custom glyphs #css #svg #web #font.txt
202010280907 Use web fonts for iconography in web design #webdesign #font.txt
202010301446 Creative Destruction as commoditization of invention #profit #invention #niche #marketshare.txt
202010311046 Likes als Interaktion mit Respect und Disagree auf Nachrichtenplattformen ersetzen #social-media #like.txt
202011020957 Stir Espresso for even taste distribution and richer flavor #coffee #barista.txt
202011031526 Low-level structure notes are like music playlists #metaphor #strukturzettel #zettelkasten.txt

I'd really like to tone-down the tags in that case. Unix tools to the rescue, basic coloring would even work in the shell (using some random colors and bold font here).
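The screenshot is not reproduced here. A minimal sketch of the idea, assuming a terminal that understands ANSI escape codes; the helper name `tone_down` and the choice of bold for the ID and dim for the tags are illustrative, not what the original screenshot used.

```ruby
BOLD  = "\e[1m"
DIM   = "\e[2m"
RESET = "\e[0m"

# Bold the leading ID and dim every #tag in a file name, leaving the
# human-readable title in the default colour.
def tone_down(name)
  name.sub(/\A\d+/) { |id| "#{BOLD}#{id}#{RESET}" }
      .gsub(/#[\w-]+/) { |tag| "#{DIM}#{tag}#{RESET}" }
end

# Example:
#   Dir.children('.').sort.each { |name| puts tone_down(name) }
```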

pat · December 2020

10 is just because I misremembered how many digits The Archive uses. I sort of feel like 14 would be better since it avoids collisions, but ultimately if you're going to script a UID gen then you'll need to account for collisions anyway, so it may not be so bad.
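For what it's worth, a scripted generator that accounts for collisions can be quite small. This is a sketch under assumptions: the name `next_id`, the 12-digit `%Y%m%d%H%M` pattern, and the strategy of stepping forward one minute on collision are all choices made for illustration.

```ruby
# Return a timestamp ID that no file in `dir` already uses as a prefix.
# On a collision, move the clock forward by `step` seconds and try again.
def next_id(dir, now: Time.now, pattern: '%Y%m%d%H%M', step: 60)
  taken = Dir.children(dir)
  loop do
    id = now.strftime(pattern)
    return id unless taken.any? { |name| name.start_with?(id) }

    now += step
  end
end
```

With a 14-digit pattern (`'%Y%m%d%H%M%S'`) you would pass `step: 1` so that a collision moves the ID forward by a second rather than a minute.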

I am opting to leave a human readable title out of the file name, and let tools parse and present it to me.

deft will show the title for org-mode and markdown files:

I actually just started digging into org-mode last night. One of the things I hoped for from The Archive was that it would hide the file names (or at least IDs), and render the human readable titles in the note. Turns out Deft does that out of the box.

You can see a mix of files using different conventions, but the most recent ones at the top have been edited to have a #+TITLE: as the first line. Looking at the screen shot, it's apparent to me that a good convention would be to have the title on the first line, tags on second line, then the content. Here's an example of how that turns out:

I am not super interested in tags though, and don't care about the ID at all. File name keywords, and content tags are shortcuts to help me find a file. But the title probably shows me most of what I need, and there's a small bit of content preview to help.
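To show how little a tool needs in order to present titles instead of file names, here is a sketch of the "title on the first line" convention. It is not how Deft does it; the helper name `display_title` is made up, and it only understands an org-mode `#+TITLE:` line or a markdown heading.

```ruby
# Return a display title for a note: its first line with org-mode or
# markdown heading markup removed. Falls back to the file name (without
# extension) when the first line is empty or the file has no content.
def display_title(path)
  first = File.foreach(path).first.to_s.strip
  title = first.sub(/\A#\+TITLE:\s*/i, '').sub(/\A#+\s+/, '')
  title.empty? ? File.basename(path, '.*') : title
end
```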

I will say I am surprised at how quickly I've taken to org-mode. I suspect I'll be abandoning The Archive and Markdown for the most part (as much as I truly have loved working with The Archive). The benefits of emacs and org-mode are just too strong for me to pass up, so I'm gradually changing .md files to .org files as I touch them. I do know that deft can be slow as the number of files increases, but I have some time to figure it out.

henrikenggaard · December 2020

I'm not using a numerical ID for my notes. Instead, I use a unique title to identify notes. This topic has been discussed in numerous iterations, but I'll give my point of view and leave it at that.

I use TiddlyWiki for my notes and it has excellent support for a title-based workflow: titles are guaranteed to be unique, can be renamed effortlessly (keeping all links intact) and can contain special characters not available in the file systems. To clear up a common misconception: TiddlyWiki can (and does in my setup) store everything in plain-text and individual files -- there is no proprietary format or inaccessible data.

With that foundation laid, here is why I vastly prefer title-based naming rather than ID-based naming: it forces/encourages me to write a good title for my notes.

Every single one of my 1001 notes has a short-ish title which can be read and which describes the content.

It is time consuming and difficult. Yes. But it is, I believe, an important step in ensuring high quality notes. Numerous times the inability to formulate a good title has prompted me to try to understand the material better.

To me, the ability to create "unnamed" notes is a disadvantage since it makes it easier to do something I don't feel that I need

Of course, this is highly personal. I don't think there is a right or wrong way. I think date IDs have a lot going for them (higher cross-tool compatibility, for one). But I also think title-based IDs serve a purpose and may indeed be preferable for some.

And regarding speed of search: I tried to add 40000 notes to TiddlyWiki at one point. Search was still as fast as I could type. I don't buy the speed argument.

sepuku · December 2020

I use a physical ZK. In a piece of software where you aren’t constrained by a physical size, you can add as many links as you want to (and actually need to when using time stamps as your unique ID), but on a 6x4 notecard, the amount of space you have is so limited that your connections to other notes need to be very carefully chosen, otherwise your cards get filled with links, and there will be no room left for actual content.

Also, when branching using a physical ZK, there is no real need to add a link to the child card, as it just lives behind the parent. That being said, trying to ensure a new entry is always linked to an existing note is great practice, so this tends to happen anyway. But writing 11a5 is much easier than looking at the current time, and writing 202012301330 twice. Also, remembering that 11 is my entry point for mental models/cognitive biases and having a cardboard marker stick out the top of the ZK box is easier than remembering 202004121734 and having a larger bit of card...

Maybe a physical ZK is one of the “really REALLY good reasons” you were referring to...

Delos · December 2020

Having a date-based UID is the best way to easily assign a unique reference, I have to agree. I experimented with as many options as I could dream up or find. By far, the easiest to use without requiring multiple lookups or memorizations was the date reference, since it is ever updating. And I do use UID and SUMMARY TITLE, as I find it has assorted benefits too.

However, having a ten digit (or more) uid makes my brain try to decipher it every time I see it and I really don't need all that precision, based on the number of notes I make in a day. Simpler is better, for me. I found someone using partial hexadecimal uid's and adapted it in order to shorten parts of the uid.

Y (year abbreviated as p for 'pandemic' since I just started late this year. Seemed semi-logical.)
M (months 1-9 then a, b, c.)
D (days 1-9 then letters up to x, skipping lowercase l and o.)
NN (two digit counter for notes made that day, starting at 01.)

So today's first note would be [pcw01] instead of [20201230104916].

On the first of next month, it will be [q1101].

Caveat: It does have one lookup for the day, but usually I can jump to the next from my last note without actually looking the letter up. For example, I can intuit pcw comes after pcv, if I have recorded notes on pcv.

Here's my chart, in case this proves useful for anyone...
a 10/Oct   e 14   i 18   n 22   s 26   w 30
b 11/Nov   f 15   j 19   p 23   t 27   x 31
c 12/Dec   g 16   k 20   q 24   u 28   Alt175 »
d 13       h 17   m 21   r 25   v 29   Alt174 «

Oh, and if I link a note, after the reason-for-the-link-text, I use [»pcw01] to send me to further information and I use [«pcw01] to back link to a supporting reference. I can separately search for the note itself, links to, links from and all links.
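The scheme above is regular enough to generate by script. A sketch, with the function name `compact_id` invented here: 'p' for 2020 and 'q' for 2021 come from the examples in the post, while continuing alphabetically up to 'z' for later years is an assumption.

```ruby
require 'date'

# Symbols for months and days: 1-9, then a..x skipping l and o,
# so a = 10, k = 20, m = 21, w = 30, x = 31.
SYMBOLS = (('1'..'9').to_a + (('a'..'x').to_a - %w[l o])).freeze

# Build an ID such as "pcw01" from a date and that day's note counter.
def compact_id(date, counter)
  offset = date.year - 2020
  # Assumption: years simply continue through the alphabet from 'p'.
  raise ArgumentError, 'year outside 2020-2030' unless (0..10).cover?(offset)

  year = ('p'.ord + offset).chr
  format('%s%s%s%02d', year, SYMBOLS[date.month - 1], SYMBOLS[date.day - 1], counter)
end

# compact_id(Date.new(2020, 12, 30), 1)  # => "pcw01"
# compact_id(Date.new(2021, 1, 1), 1)    # => "q1101"
```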

zk_1000 · December 2020

@Delos The idea to use a single letter for the year is new to me; I like it a lot. What happens in 2030: do you use a Greek letter or append another letter? A 6 character ID is still shorter than mine (shorter = better). There are also enough characters in Cyrillic, Arabic, Chinese, Tibetan and more for many centuries to come.

Delos · December 2020

I was thinking to go to A but I like your Greek Letter idea better!

pat · December 2020

@henrikenggaard said:
I use TiddlyWiki for my notes and it has excellent support for a title-based workflow: titles are guaranteed to be unique, can be renamed effortlessly (keeping all links intact)

Sure. I’m not saying it can’t work. It’s just not as good, from a technical standpoint.

The unique title approach couples identity and meaning. When one changes, both change. This means that every interface you use to work with your data must support the same convention. Fixed filename prefixes provide high flexibility for little effort.

I will tell you this: I downloaded TW and started playing with it. The first thing I did was create two Tiddlers, link one to the other, and change the title of the linked one. TW was at least kind enough to tell me that it would break the reference (I added the red box for emphasis).

I have no doubt that there’s some mechanism for changing the titles, while maintaining links. Whether it’s a plugin, configuration, or a setting I overlooked - I don’t know. The point is, out of the box, changing a title breaks the links that point to it.

Here’s a GitHub thread describing this exact issue, and a plugin that provides the desired behavior. I suppose you have it installed, or something like it? In any case, a bit of machinery is required to do the work. It’s like how if I want to change my legal name, I’ll need to do a lot of paperwork - but people can call me “Pat” or “bozo” with no issues.

I believe TW illustrates my point. As a new TW user, I change a link, find that it breaks, search for why, find a GitHub issue, scroll all the way to the bottom, find a plugin, install it, and change my note title. With fixed IDs, I just change my title and my keywords, no machinery required.

@henrikenggaard said:
Here is why I vastly prefer title-based naming rather than ID-based naming: it forces/encourages me to write a good title for my notes.

…

To me, the ability to create “unnamed” notes is a disadvantage since it makes it easier to do something I don’t feel that I need

That problem can be addressed by workflow and/or tools. You could choose to commit to a workflow where you always enter a title. If you won’t do that, you can use a tool to enforce it - which is exactly what you’ve done.

The difference is, you sacrifice the benefits that come from having an ID that never changes. With an unchanging ID, it’s trivial to add a functional layer that requires you to enter a title.

@henrikenggaard said:
And regarding speed of search: I tried to add 40000 notes to TiddlyWiki at one point. Search was still as fast as I could type. I don’t buy the speed argument

Yeah, I just checked it out, and it looks like a clever bit of software. It runs everything in memory, so it’s lightning fast as they say.

Of course, you can do this same trick with a folder of files. Moby Dick in plain-text is only 1.2 MB. If you want lightning fast processing, you don’t do it one file at a time. You load the whole set in memory, process it there, and write things to disk as needed. Again, the fixed ID makes this tooling possible.
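A sketch of that trick for a folder of notes. The names `load_notes` and `search` are hypothetical, and keying the in-memory copy by the leading run of digits is just one way to use the fixed ID.

```ruby
# Read every note in `dir` into memory once, keyed by its ID prefix.
# Files without a leading numeric ID are skipped.
def load_notes(dir, id_pattern: /\A\d+/)
  Dir.children(dir).each_with_object({}) do |name, notes|
    path = File.join(dir, name)
    id = name[id_pattern]
    next unless id && File.file?(path)

    notes[id] = { name: name, text: File.read(path) }
  end
end

# Search the in-memory copy and return the IDs of matching notes.
# No disk access happens here.
def search(notes, term)
  notes.select { |_id, note| note[:text].include?(term) }.keys
end
```

After the initial load, repeated searches cost only a pass over strings that are already in memory.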

@sepuku said:
Maybe a physical ZK is one of the “really REALLY good reasons” you were referring to…

Indeed!

@Delos said:
However, having a ten digit (or more) uid makes my brain try to decipher it every time I see it and I really don’t need all that precision, based on the number of notes I make in a day. Simpler is better, for me.

Yeah, I can see that. That is, after all, the part that I admitted was subjective. I would prefer to use UUIDs, because then I wouldn’t have to worry about collision. But then I would need my tools to hide the hideous UUID for me, which most won’t do. So whether it’s decimal or hex or ASCII doesn’t matter so much, and same goes for the length. You can come up with a scheme that works for you. As long as the ID doesn’t change, you’re good to go.

micahredding · January 2021

@ctietze said:
I'd really like to tone-down the tags in that case. Unix tools to the rescue, basic coloring would even work in the shell (using some random colors and bold font here).

Can we get something like this for the note list in the Archive? Is that already a Feature Request?

ctietze · January 2021

@micahredding Toning-down Ids or showing them differently in the note list is part of our features-to-discuss-and-plan list for upcoming iterations. That means we need to experiment and haven't decided what works best -- but we did already agree on making the app more aware of Zettel IDs (and thus making this feature configurable). No timeline for that, yet, though.

henrikenggaard · January 2021

@pat said:
I believe TW illustrates my point. As a new TW user, I change a link, find that it breaks, search for why, find a GitHub issue, scroll all the way to the bottom, find a plugin, install it, and change my note title. With fixed IDs, I just change my title and my keywords, no machinery required.

I don't really see how this is so bad? You have a problem and with no prior experience in the software found a solution -- a solution which requires no subsequent work.

But that is beside the point.

My main counter-point is simply that the notion of calling people's approaches "wrong" is... well, ... wrong, when it really doesn't need to be so black and white.

Title-based workflows work very well and carry their own benefits. But I don't even think the "advantages and disadvantages" story is that meaningful either. Tools and workflows have to fit into the context of other tools and workflows.

I understand the benefits of IDs. I really do. I have used it. Thought about what I wanted to use my note collection for and what tools I wanted to use. And I came to the conclusion that IDs would be a distraction. But that doesn't make them a bad choice for other people -- just a different choice

To illustrate my preference, I'll reiterate that all my notes have a (hopefully) well-formulated title. Why is this important? Because my primary usage of my notes is to (1) write other notes and (2) to write outlines for writing (papers). Thus, since the title is the link, I reuse the title over and over and over. Here is part of an outline from a paper:

Fiber capacity crunch

Coherent optical transmission systems are limited by nonlinear effects

The characteristics of a transmission system dictate the limiting parameter

Constellation shaping increases spectral efficiency

AWGN channel assumption

Maxwell-Boltzmann distribution is near optimal constellation shaping for AWGN channels

Increasing the single fiber achievable rate will require operation in nonlinear channels. Nonlinearity tolerance allows for more amplification.
Nonlinearity means not linear

For constellation shaping "nonlinearity" means that Maxwell-Boltzmann is suboptimal

Almost all of these are actual titles (and thus IDs/links). To enter a link, I press ctrl+L and start typing the title. Inserting the link also inserts the title and thus the item in the outline. Writing outlines is a breeze, since I can just type words (and I type words a lot faster than numbers), plus whilst searching for titles I get recommendations for other similar titles, which helps inspire the outline.

Obviously, this can be done with other IDs, too. But what support does the tooling have for it? How much workflow and tooling has to be built and understood to support that? You see what I mean? I'll pick tools to support the workflows and outputs I need.

pat · January 2021

@henrikenggaard said:
I understand the benefits of IDs. I really do. I have used it. Thought about what I wanted to use my note collection for and what tools I wanted to use. And I came to the conclusion that IDs would be a distraction. But that doesn't make them a bad choice for other people -- just a different choice

Mutable IDs are a bad choice if you want long-term scalability and flexibility.

@henrikenggaard said:
Obviously, this can be done with other IDs, too. But what support does the tooling have for it? How much workflow and tooling has to be built and understood to support that? You see what I mean? I'll pick tools to support the workflows and outputs I need.

The tooling to support immutable numeric IDs is less than is required to support mutable full-title IDs.

It's worth noting that your example outline has nothing at all to do with the ID format, only the presentation format. A TiddlyWiki plugin could just as easily use numeric IDs, present the same auto-complete list of link targets, and render links without the IDs. The screenshot in this post demonstrates how you can have file names with ID prefixes, but the tool hides them in favor of presenting the titles.

henrikenggaard · January 2021

Sorry for the slight delay in replying. I'm back at work, so I'm busier.

@pat said:

@henrikenggaard said:
I understand the benefits of IDs. I really do. I have used it. Thought about what I wanted to use my note collection for and what tools I wanted to use. And I came to the conclusion that IDs would be a distraction. But that doesn't make them a bad choice for other people -- just a different choice

Mutable IDs are a bad choice if you want long-term scalability and flexibility.

I don't really understand what you mean. What kind of scalability is limited by using titles? Why is it less flexible? I don't want to assume something you are not saying.

I included the example outline to bring an actual workflow to the table instead of making it entirely hypothetical. It removes the ambiguity about what is actually going on.

If the interaction is primarily through the titles, then, to me, it is so much simpler to just use those directly. Sure, changing them requires search and replace, but that is hardly sophisticated technology. Why is search and replace a technology trap?
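For reference, the search-and-replace being described could look something like this. It is a sketch under assumptions: links are taken to be written as `[[Title]]`, the name `rename_links` is invented, and renaming the note's own file is left out.

```ruby
# Replace every [[old_title]] link with [[new_title]] in the files
# directly under `dir`. Returns the names of the files that were changed.
def rename_links(dir, old_title, new_title)
  old_link = "[[#{old_title}]]"
  new_link = "[[#{new_title}]]"

  Dir.children(dir).sort.select do |name|
    path = File.join(dir, name)
    next false unless File.file?(path)

    text = File.read(path)
    next false unless text.include?(old_link)

    # Block form so backslashes in the new title are inserted literally.
    File.write(path, text.gsub(old_link) { new_link })
    true
  end
end
```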

Zettelkasten Forum
