Share with us what is happening in your ZK this week. February 9, 2024


• @ZettelDistraction said:
@Jeremy said:
Now I see the discussion of the size distribution of Zettels as a good starting point for Zettel maintenance, and I wonder, first, which is currently the "best" Python script for producing the histogram?

I doubt this is the best, but it does work. It does not include the Zettel format analysis.

import os
import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns
from statistics import median

def calculate_median_word_count(word_counts):
    """Calculate the median word count from a list of word counts"""
    return median(word_counts) if word_counts else 0

def initialize_word_freq_bins(max_bin_left_endpoint=1001, bin_width=50):
    """Initialize word frequency bins"""
    word_freq_bins = {f"{i}-{i + bin_width - 1}": 0 for i in range(1, max_bin_left_endpoint, bin_width)}
    word_freq_bins[f"{max_bin_left_endpoint}+"] = 0
    return word_freq_bins

def categorize_word_count(word_count, word_freq_bins, max_bin_left_endpoint):
    """Categorize the Zettel based on the number of words into the word frequency bins"""
    if word_count >= max_bin_left_endpoint:
        word_freq_bins[f"{max_bin_left_endpoint}+"] += 1
    else:
        # Subtract 1 so boundary counts (50, 100, ...) fall into the correct bin.
        bin_start = (max(word_count - 1, 0) // 50) * 50 + 1
        bin_label = f"{bin_start}-{bin_start + 49}"
        if bin_label in word_freq_bins:
            word_freq_bins[bin_label] += 1

# Apply the seaborn whitegrid style
sns.set(style="whitegrid")

# Initialize variables
word_counts = []
word_freq_bins = initialize_word_freq_bins()

# Directory processing
zettel_directory = 'C:\\Users\\fleng\\OneDrive\\Documents\\Zettelkasten'
for file in os.listdir(zettel_directory):
    full_path = os.path.join(zettel_directory, file)
    if os.path.isfile(full_path) and file.endswith('.md'):
        with open(full_path, 'r', encoding='utf-8') as f:
            text = f.read()
        word_count = len(text.split())
        word_counts.append(word_count)
        categorize_word_count(word_count, word_freq_bins, 1001)

# Displaying word frequency bins
plt.figure(figsize=(10, 7.5))  # Set the size of the plot
plt.bar(word_freq_bins.keys(), word_freq_bins.values(), color='skyblue')
plt.title('Word Frequency Bins')
plt.xlabel('Word Count')
plt.ylabel('Number of Zettels')
plt.xticks(rotation=90)
plt.show()

# Displaying word count statistics
print(f"Total number of words: {sum(word_counts)}")
print(f"Average number of words per Zettel: {sum(word_counts) / len(word_counts) if word_counts else 0}")
print(f"Median number of words in a Zettel: {calculate_median_word_count(word_counts)}")
print(f"Minimum number of words in a Zettel: {min(word_counts, default=0)}")
print(f"Maximum number of words in a Zettel: {max(word_counts, default=0)}")
print(f"Most common word count: {Counter(word_counts).most_common(1)[0] if word_counts else 'N/A'}")
print(f"Least common word count: {Counter(word_counts).most_common()[-1] if word_counts else 'N/A'}")



And also, have people looked at the distribution of tags (how many per Zettel, how many Zettels per tag) as an indication of "goodness"?

Thanks

I haven't looked into this, but I will add it at some point.
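In the meantime, a minimal sketch of such a tag analysis, assuming tags are written inline as "#tag" in the Markdown files (the helper name `tag_distribution` and the tag regex are my assumptions, not part of the script above), could look like this:

```python
import re
from collections import Counter

# Assumed tag format: "#tag" at a word boundary, e.g. "#zettelkasten".
TAG_RE = re.compile(r'(?<!\S)#([\w-]+)')

def tag_distribution(note_texts):
    """Return (tags per note, notes per tag) for a list of note bodies."""
    tags_per_note = []
    notes_per_tag = Counter()
    for text in note_texts:
        tags = set(TAG_RE.findall(text))  # count each tag once per note
        tags_per_note.append(len(tags))
        notes_per_tag.update(tags)
    return tags_per_note, notes_per_tag

notes = ["#zettelkasten #writing first note", "#writing second", "no tags here"]
per_note, per_tag = tag_distribution(notes)
print(per_note)            # [2, 1, 0]
print(per_tag["writing"])  # 2
```

Feeding `tags_per_note` into the same histogram code as above would show how many tags a typical Zettel carries; `notes_per_tag.most_common()` shows which tags act as umbrella concepts.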

GitHub. Erdős #2. CC BY-SA 4.0. Problems worthy of attack / prove their worth by hitting back. -- Piet Hein.

• @ZettelDistraction, thanks for the code!

Will Simpson
My zettelkasten is for my ideas, not the ideas of others. I will try to remember this. I must keep doing my best even though I'm a failure. My peak cognition is behind me. One day soon, I will read my last book, write my last note, eat my last meal, and kiss my sweetie for the last time.
kestrelcreek.com

• Here is a graph of different data; surprisingly to me, it has about the same curve.

Python Code
import os
import re
import matplotlib.pyplot as plt
import numpy as np

def get_word_count(file_path):
    # Despite the name, this counts numeric link references like "[202301011234]",
    # not words. (The original regex was garbled in posting; this is a reconstruction.)
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        return len(re.findall(r'\[\d+\]', f.read()))

def get_file_paths(directory_path):
    return [os.path.join(directory_path, f) for f in os.listdir(directory_path)
            if os.path.isfile(os.path.join(directory_path, f))]

directory_path = '/Users/will/Dropbox/zettelkasten'
file_paths = get_file_paths(directory_path)
word_counts = [get_word_count(f) for f in file_paths]

# Define the bins
bins = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, np.inf]
# Use numpy's histogram function to divide the data into bins
counts, bins = np.histogram(word_counts, bins=bins)
# Convert the bins to string labels, excluding the last bin
labels = [f'{int(bins[i])}-{int(bins[i + 1]) - 1}' for i in range(len(bins) - 2)]

# Handle the last (open-ended) bin separately
labels.append(f'{int(bins[-2])}+')

# Convert the histogram data to a bar graph
plt.bar(labels, counts, color='#82D6F0', edgecolor="black", zorder=2)  # Light blue color

# Label the y-axis
plt.ylabel('Number of Zettels')

plt.grid(True, which='both', color='grey', linewidth=0.5, linestyle='--')
# Tilt the labels on the x-axis
plt.xticks(rotation=45)

# Show the plot
plt.show()



Edit @ctietze: Fixed code



• @ZettelDistraction said:
I doubt this is the best, but it does work. It does not include the Zettel format analysis.

Many thanks for that. I will try it as soon as I have a moment and will report back.

• Here is the first histogram. I think I know what the long Zettels are: dumping grounds where I save things for later (yes, I'm a collector) under large umbrella concepts, and which I have failed to keep up with and process.

Summary stats:
Average words per Zettel: 228.94
Median words per Zettel: 84

This suggests the distribution is quite skewed, and that once I have processed the really long notes I should look at breaking up some of the not-quite-so-long ones.
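A quick way to find which notes those are, assuming the same one-Markdown-file-per-Zettel layout as the scripts above (the helper `longest_notes` is my sketch, not part of the original script):

```python
import os

def longest_notes(directory, n=10):
    """Return (word_count, filename) pairs for the n longest .md files."""
    counts = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name.endswith('.md'):
            with open(path, 'r', encoding='utf-8', errors='ignore') as f:
                counts.append((len(f.read().split()), name))
    return sorted(counts, reverse=True)[:n]

# Usage: for count, name in longest_notes(zettel_directory): print(count, name)
```

Working down that list from the top tackles the dumping grounds first, and re-running the histogram afterwards shows whether the skew is shrinking.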

Thanks for the program.