A Python script for a quantitative look at your ZK

Will · June 2023

I've been working in Python on a script that presents a history of zettelkasting in a table. It shows notes created each month—a fascinating look at how time has impacted zettelkasting. I'm into quantitative analysis, and this was fun learning a few tricks in Python.

If your ZK is file-based, uses a UID as part of the file name (like 202306XXX...), and you have Python and the dependencies installed, it should work for you.

You can clone the repo and contribute if you want.
GitHub - woodenzen/monthly_stats

Sample output.

+-------+------+------+------+------+------+------+  
| Stats | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 |  
+-------+------+------+------+------+------+------+  
|  Jan  |  0   | 145  | 115  |  54  |  73  |  50  |  
|  Feb  |  0   |  86  |  75  |  45  |  45  |  49  |  
|  Mar  |  0   |  91  |  84  |  84  |  62  |  48  |  
|  Apr  |  0   | 113  |  49  |  71  |  65  |  40  |  
|  May  |  0   |  70  |  46  |  95  |  67  |  80  |  
|  Jun  |  0   |  30  |  68  |  82  |  48  |  19  |  
|  Jul  |  0   |  25  |  82  |  44  |  67  |  0   |  
|  Aug  |  0   |  10  |  89  |  43  |  89  |  0   |  
|  Sep  |  0   |  16  |  80  |  71  |  35  |  0   |  
|  Oct  |  0   |  24  |  44  |  57  |  48  |  0   |  
|  Nov  |  44  |  46  |  58  |  24  |  43  |  0   |  
|  Dec  |  58  |  66  |  89  |  37  |  61  |  0   |  
+-------+------+------+------+------+------+------+

ctietze · June 2023

Worked for me!

+-------+------+------+------+------+------+------+
| Stats | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 |
+-------+------+------+------+------+------+------+
|  Jan  |  70  |  29  |  65  |  42  |  25  |  36  |
|  Feb  |  48  |  17  |  23  |  57  |  43  |  8   |
|  Mar  |  55  |  29  |  20  |  47  |  49  |  59  |
|  Apr  |  47  |  20  |  9   |  38  |  37  |  63  |
|  May  |  31  |  27  |  24  |  72  |  75  |  52  |
|  Jun  |  25  |  20  |  41  |  52  |  58  |  33  |
|  Jul  |  18  |  32  |  32  |  58  |  53  |  0   |
|  Aug  |  22  |  17  |  31  |  81  |  48  |  0   |
|  Sep  |  41  |  33  |  12  |  29  |  51  |  0   |
|  Oct  |  23  |  38  |  29  |  81  |  57  |  0   |
|  Nov  |  25  |  24  | 121  |  52  |  45  |  0   |
|  Dec  |  20  |  36  |  38  |  55  |  23  |  0   |
+-------+------+------+------+------+------+------+

Nice table

I needed to adjust the date number pattern to remove the leading space here:

https://github.com/woodenzen/monthly_stats/blob/460bff13f02979af26acb3fc7505c96ced4041ce/monthly_stats.py#LL58C22-L58C28

(My IDs are at the beginning of the file name.)

Is there maybe a way to make the search pattern's literal space optional?

Will · June 2023

Cool. Your pull request that included the summary row for the table was merged. Thanks, it was my first pull request merge, and it went smoothly.

Is there maybe a way to make the search pattern's literal space optional?

Probably. We'd have to ask for user input. This is a little more important than it first appears. The program, as it is now, counts all the 2023XX (01-12) hits in the directory. A file could have a title with a timestamp of 202303202312. This file gets counted twice: once for the year and month (202303) and a second time for the day, hour, and minutes (202312). This is a slim chance, but one I confronted at least twice in my stats. I'm not usually at the computer zettelkasting at 11:12 PM.
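
A minimal sketch of that double count, assuming the counter tests each year-month string as a plain substring of the file name (the file name is made up, and the real script may do this differently):

```python
# Hypothetical file name; the UID 202303202312 reads as 2023-03-20 23:12.
filename = "Some note 202303202312.md"

# Naive approach: look for every 2023XX string anywhere in the name.
months_hit = [
    f"2023{month:02d}"
    for month in range(1, 13)
    if f"2023{month:02d}" in filename
]

print(months_hit)  # ['202303', '202312']
```

The second hit comes from the day, hour, and minute digits, not from a real December note.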

Seeing your stats made me realize I had hard-coded the years based on my ZK. This is another thing that the program could prompt for or check for the oldest file in the directory and make a table based on that.

Will · June 2023

I made a slight change in the program. It is now merged into the main branch. It now prompts for the oldest year you want stats for, so those of you who have older ZKs than mine will get full stats.

ctietze · June 2023

Thanks for merging so quickly!

Is there maybe a way to make the search pattern's literal space optional?

Probably. We'd have to ask for user input. This is a little more important than it first appears. The program, as it is now, counts all the 2023XX (01-12) hits in the directory. A file could have a title with a timestamp of 202303202312. This file gets counted twice: once for the year and month (202303) and a second time for the day, hour, and minutes (202312). This is a slim chance, but one I confronted at least twice in my stats. I'm not usually at the computer zettelkasting at 11:12 PM.

To avoid detecting "2023" twice in "202303202312" would require a string parser that consumes its string: i.e. move the "pointer" in the title string forward character by character, and never back; once "2023" is detected, consume up to 8 more digits (12 in total). With this, you literally can't detect the "202312" after "202303..." was consumed. This is done to parse strings, but probably overkill for this date detection.

If regular expressions are an option, you actually get this for free: regex matches don't overlap, so the same characters are never checked twice:

https://regex101.com/r/EeQKgz/1
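
A small sketch of that behavior in Python. The pattern is an assumption (a 12-digit UID starting with "20"), not necessarily the one behind the regex101 link, and the file name is made up:

```python
import re

# Assumed UID shape: 12 digits starting with "20" (YYYYMMDDHHMM).
UID_PATTERN = re.compile(r"20\d{10}")

filename = "202303202312 Some note.md"  # hypothetical file name

# findall scans left to right and does not revisit consumed characters,
# so the trailing "202312" is never reported as a separate match.
uids = UID_PATTERN.findall(filename)

print(uids)         # ['202303202312']
print(uids[0][:6])  # 202303
```

Taking the first six characters of the single match gives the year and month, so each file name is counted once.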

Seeing your stats made me realize I had hard-coded the years based on my ZK. This is another thing that the program could prompt for or check for the oldest file in the directory and make a table based on that.

Later:

I made a slight change in the program. It is now merged into the main branch. It now prompts for the oldest year you want stats for, so those of you who have older ZKs than mine will get full stats.

It is exciting to see you pick up these coding workflows and make adjustments like that!

Will · June 2023

I have made a couple of updates.
1. Thanks for mentioning using regex with the string parser. I changed the count_files_zettelkasten(partial_UID) function, and the counter will now only count the date str once per file name.
2. I changed the count_files_zettelkasten(partial_UID) function so the user can place the UID anywhere in the filename, not just at the end, which is best.
3. I added a test to the "Enter the year [XXXX]: " prompt to be sure we get a good year.

I am thinking about the year thing. There is a way to find the oldest note and use its date str and the stat_years input rather than asking the user.
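
One way this could look, as a rough sketch only. The function name `earliest_year` and the UID pattern are assumptions for illustration, not the code in the repo:

```python
import re

UID_PATTERN = re.compile(r"20\d{10}")  # assumed UID shape: 20YYMMDDHHMM


def earliest_year(filenames):
    """Return the smallest year found in any UID, or None if nothing matches."""
    years = [
        int(match.group()[:4])
        for name in filenames
        if (match := UID_PATTERN.search(name))
    ]
    return min(years, default=None)


# Hypothetical file names for illustration.
names = ["Note A 201811051230.md", "Note B 202306011015.md", "README.md"]
print(earliest_year(names))  # 2018
```

The table columns could then run from that year up to the current one.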

Will · June 2023

Another update. GitHub - woodenzen/monthly_stats now looks into your archive to find the year you started zettelkasting rather than asking you. Then it creates a table with the appropriate columns.

+-------+------+------+------+------+------+------+
| Stats | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 |
+-------+------+------+------+------+------+------+
|  Jan  |  0   | 145  | 115  |  54  |  73  |  50  |
|  Feb  |  0   |  86  |  75  |  45  |  45  |  49  |
|  Mar  |  0   |  91  |  83  |  84  |  62  |  48  |
|  Apr  |  0   | 113  |  49  |  71  |  65  |  40  |
|  May  |  0   |  70  |  46  |  95  |  67  |  80  |
|  Jun  |  0   |  30  |  68  |  82  |  48  |  30  |
|  Jul  |  0   |  25  |  82  |  44  |  67  |  0   |
|  Aug  |  0   |  10  |  89  |  43  |  89  |  0   |
|  Sep  |  0   |  16  |  80  |  71  |  35  |  0   |
|  Oct  |  0   |  24  |  44  |  57  |  48  |  0   |
|  Nov  |  44  |  46  |  58  |  24  |  43  |  0   |
|  Dec  |  58  |  66  |  89  |  37  |  61  |  0   |
+-------+------+------+------+------+------+------+
| Total | 102  | 722  | 878  | 707  | 703  | 297  |
+-------+------+------+------+------+------+------+

ctietze · June 2023

Nice!

Because my Bible notes start with 000000000001, it takes 10 minutes on my machine and also produces too many columns: 2024 in total.

But bear with me for a moment:

My notes span from 2009 to 2023, and that takes 5 seconds. The script needs to go through the process of finding stuff 14 times. Each time, it iterates over my ~8600 note file names and applies the regular expression. Plus one more iteration to find the earliest date. That's 15 × 8600 = 129,000 evaluations. And regexes aren't suuuuper fast. Scale this up by 2024/14≈144 and that takes >10min.

One way to rewrite a note finding and categorization algorithm like this would be to:

1. go through all files (ls), process each file exactly once (ignore all that aren't files, like you already do)
2. store the titles/filenames of the files in a data structure like an array, then treat this as the source of truth for future steps
3. process the filenames to match the year and month in one go (so you loop over the array only once), then increment a count in a dictionary for the year, which contains a dictionary for its months.

The upside is that dictionaries can have "holes" and you don't need 2024 array elements for 0000 and 2009--2023. Also, you don't need to order by year first. You can store 20121224 in counts[2012][12] more or less directly (you need to start the year with an empty months-dictionary if it doesn't exist yet, and start the value for the month with 0 if that doesn't exist, i.e. fill the holes).
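
A sketch of that single-pass idea. It uses `collections.defaultdict` to fill the holes automatically instead of checking for missing keys by hand; the function name `count_by_month`, the pattern, and the file names are assumptions for illustration:

```python
import re
from collections import defaultdict

# Assumed UID shape: 4-digit year, 2-digit month, then 6 more digits.
UID_PATTERN = re.compile(r"(\d{4})(\d{2})\d{6}")


def count_by_month(filenames):
    """Loop over the file names once and count notes per year and month."""
    counts = defaultdict(lambda: defaultdict(int))
    for name in filenames:
        match = UID_PATTERN.search(name)
        if match is None:
            continue  # no UID in this name, skip it
        year, month = int(match.group(1)), int(match.group(2))
        counts[year][month] += 1  # missing years/months start at 0 on demand
    return counts


# Hypothetical file names; in practice these could come from os.listdir().
names = [
    "201212241830 Xmas.md",
    "201212250900 Day after.md",
    "202306011015 Today.md",
]
counts = count_by_month(names)
print(counts[2012][12])  # 2
print(sorted(counts))    # [2012, 2023]
```

Only the years that actually occur become keys, so there is no need for one slot per year from 0000 onward.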

Will · June 2023

Thanks for the tutorial on speeding up the app by removing the repeated reading of the filenames. I feel like I graduated from 1st-grade beginner to 2nd-grade beginner.

I've created a v.2 of the app, which reads the filenames into a dictionary and then uses it "as the source of truth for future steps." This makes the program "129,000" times faster!

Another change is the app will only look at note files in the specified directory with time-based 12-digit UIDs from the 21st century and an extension of .md or .txt. (I added the .txt extension because my test archive uses the .txt extension.)
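
Roughly, the filter could be sketched like this. The names `is_note_name` and `note_files` and the exact pattern are assumptions for illustration, not copied from the repo:

```python
import re
from pathlib import Path

# Assumed rule: a 12-digit UID starting with "20" somewhere in the name,
# and the name ends in .md or .txt.
NOTE_PATTERN = re.compile(r"20\d{10}.*\.(md|txt)$")


def is_note_name(name):
    """True if the file name looks like a note under the assumed rule."""
    return bool(NOTE_PATTERN.search(name))


def note_files(directory):
    """Return the names of regular files in `directory` that look like notes."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and is_note_name(path.name)
    )


print(is_note_name("Idea 202306011015.md"))   # True
print(is_note_name("Idea 202306011015.pdf"))  # False
```

Calling `note_files()` on the archive folder would give the list of names that the counting step works from.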

Otherwise, the output looks the same.

Give it a try now.

Git Repository for Monthly Stats

ctietze · June 2023

The code's structure truly looks like a leap in thinking about the problem, congrats on graduating!

The output took 0.05s or so (compared to ~5 seconds before when capped at 2009!) and looks great:

Stats	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022	2023
Jan	0	6	1	12	59	71	75	25	37	34	70	28	65	37	22	36
Feb	0	2	0	14	48	50	25	21	5	37	48	16	23	54	41	8
Mar	0	3	6	14	29	28	26	38	11	21	55	29	20	47	49	59
Apr	1	5	8	19	39	62	8	10	14	36	47	19	9	38	37	62
May	0	6	6	21	78	67	13	21	12	42	30	26	24	72	74	52
Jun	0	2	5	22	68	40	24	31	30	43	25	20	41	52	58	43
Jul	6	0	13	29	98	56	52	21	43	102	18	32	31	58	53	0
Aug	0	0	21	15	47	18	17	14	20	62	22	17	31	81	48	0
Sep	2	2	24	11	33	49	19	30	25	47	41	33	12	28	51	0
Oct	0	1	11	11	46	53	25	4	81	52	23	38	28	79	57	0
Nov	0	1	20	27	75	31	45	15	100	84	25	24	109	49	44	0
Dec	7	0	32	10	52	41	27	10	21	73	20	36	38	52	23	0
Total	16	28	147	205	672	566	356	240	399	633	424	318	431	647	557	260

When I change the pattern to

pattern = r'\d{12}'

I also get the year 0000 for the Bible notes, but since the rest of the ID is nonsensical (i.e. month 00), the column is empty. But (!) the important insight here is that the script does indeed account for holes in the statistics. Great work.

Will · June 2023

Thanks for the feedback. It continues to be a fun learning process. Your feedback has been invaluable. I want to explore making this more graphical and self-contained so the user doesn't have to install anything but a single package. Maybe the best approach would be some browser interface. Do you have any suggestions?

The pattern = r'20\d{10}' should have limited the results to only those files containing a pattern of 20XXXXXXXXXX. This limits results to the 21st century only. I don't think you'd want pattern = r'\d{12}'. It would catch not only 0000 but 1890, 1642, 7343, and any other inappropriate year date.
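
A quick comparison of the two patterns on made-up file names (the names are hypothetical, only the two patterns come from this thread):

```python
import re

names = [
    "000000000001 Genesis.md",     # hypothetical Bible structure note
    "Quote from 189001011200.md",  # hypothetical 12-digit ID with year 1890
    "202306011015 Today.md",       # ordinary time-based UID
]

strict = re.compile(r"20\d{10}")  # 21st-century UIDs only
loose = re.compile(r"\d{12}")     # any 12 digits

strict_hits = [name for name in names if strict.search(name)]
loose_hits = [name for name in names if loose.search(name)]

print(strict_hits)      # ['202306011015 Today.md']
print(len(loose_hits))  # 3
```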

Will · June 2023

In my working branch, I'm working on prettifying the table. So far-

What conclusions can I derive from the table?

I was more prolific in note production early on.
The first couple of months of a year are my most prolific.
Mostly a practical exercise in learning to program.

ctietze · July 2023

@Will said:
I want to explore making this more graphical and self-contained so the user doesn't have to install anything but a single package. Maybe the best approach would be some browser interface. Do you have any suggestions?

Browser as in web browser? -- If so: The only way I see that a browser interface could help is if you host the web application somewhere -- but then the app doesn't have access to the local file system or its folder of notes. (Except if you bundle the web UI in an Electron app, but that weighs in at ~160MB and doesn't make things easier for you.)

Maybe check out Platypus or PyInstaller instead to make .app bundles for Mac from your script? For more, see here: https://stackoverflow.com/questions/7404792/how-to-create-mac-application-bundle-for-python-script-via-python

Packaging a thing that behaves like an app on multiple platforms is no small task, though, so maybe start with adding an "Installation" or "How to Use" section to the README, copying instructions from other Python command line tools to help the newbie install dependencies and run the script from the Terminal.

The pattern = r'20\d{10}' should have limited the results to only those files containing a pattern of 20XXXXXXXXXX. This limits results to the 21st century only. I don't think you'd want pattern = r'\d{12}'. It would catch not only 0000 but 1890, 1642, 7343, and any other inappropriate year date.

That is 100% correct!

It's an assumption you have baked in that is sensible, but that also doesn't fit a particular niche of my Zettelkasten containing structure notes of the Bible. The goal of your script is to provide statistics, so my Bible structure notes don't fit anyway, but I was curious nevertheless!

Users are just the worst, I'm sorry!

Zettelkasten Forum
