@ZettelDistraction said:
@Jeremy said:
Now I see here the discussion of the size distribution of Zettels as a good starting point for Zettel maintenance, and I wonder, first: which is currently the "best" Python script for producing the histogram?
I doubt this is the best, but it does work. It does not include the Zettel format analysis.
import os
from collections import Counter
from statistics import median

import matplotlib.pyplot as plt
import seaborn as sns

def calculate_median_word_count(word_counts):
    """Calculate the median word count from a list of word counts."""
    return median(word_counts)

def initialize_word_freq_bins(max_bin_left_endpoint=1001, bin_width=50):
    """Initialize word frequency bins: "1-50", "51-100", ..., plus an overflow bin."""
    word_freq_bins = {f"{i}-{i + bin_width - 1}": 0 for i in range(1, max_bin_left_endpoint, bin_width)}
    word_freq_bins[f"{max_bin_left_endpoint}+"] = 0
    return word_freq_bins

def categorize_word_count(word_count, word_freq_bins, max_bin_left_endpoint):
    """Categorize the Zettel into a word frequency bin based on its word count."""
    if word_count >= max_bin_left_endpoint:
        word_freq_bins[f"{max_bin_left_endpoint}+"] += 1
    else:
        # (word_count - 1) // 50 keeps exact multiples of 50 in their own bin
        # (e.g. 50 belongs in "1-50", not "51-100"); empty files land in "1-50".
        bin_index = max(word_count - 1, 0) // 50
        bin_label = f"{bin_index * 50 + 1}-{bin_index * 50 + 50}"
        if bin_label in word_freq_bins:
            word_freq_bins[bin_label] += 1

# Apply the seaborn whitegrid style
sns.set_theme(style="whitegrid")

# Initialize variables
word_counts = []
word_freq_bins = initialize_word_freq_bins()

# Directory processing -- adjust the path to your own Zettelkasten folder
zettel_directory = 'C:\\Users\\fleng\\OneDrive\\Documents\\Zettelkasten'
for file in os.listdir(zettel_directory):
    full_path = os.path.join(zettel_directory, file)
    if os.path.isfile(full_path) and file.endswith('.md'):
        with open(full_path, 'r', encoding='utf-8') as f:
            text = f.read()
        word_count = len(text.split())
        word_counts.append(word_count)
        categorize_word_count(word_count, word_freq_bins, 1001)

# Displaying word frequency bins
plt.figure(figsize=(10, 7.5))  # Set the size of the plot
plt.bar(word_freq_bins.keys(), word_freq_bins.values(), color='skyblue')
plt.title('Word Frequency Bins')
plt.xlabel('Word Count')
plt.ylabel('Number of Zettels')
plt.xticks(rotation=90)
plt.show()

# Displaying word count statistics
print(f"Total number of words: {sum(word_counts)}")
print(f"Average number of words per Zettel: {sum(word_counts) / len(word_counts) if word_counts else 0:.1f}")
print(f"Median number of words in a Zettel: {calculate_median_word_count(word_counts) if word_counts else 0}")
print(f"Minimum number of words in a Zettel: {min(word_counts, default=0)}")
print(f"Maximum number of words in a Zettel: {max(word_counts, default=0)}")
print(f"Most common word count: {Counter(word_counts).most_common(1)[0] if word_counts else 'N/A'}")
print(f"Least common word count: {Counter(word_counts).most_common()[-1] if word_counts else 'N/A'}")
@Jeremy said:
And also, have people looked at the distribution of tags (how many per Zettel, how many Zettels per tag) as an indication of "goodness"?
Thanks
I haven't looked into this, but I will add it at some point.
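In the meantime, here is a minimal sketch of what that tag analysis might look like. It assumes inline #hashtag-style tags in the note text; the regex would need adjusting for YAML front matter or other tagging conventions, and the directory path is just the one from the script above.
import os
import re
from collections import Counter

# Assumed tag convention: inline #tags such as #philosophy; adjust TAG_RE as needed.
TAG_RE = re.compile(r'(?<!\S)#([A-Za-z][\w-]*)')

zettel_directory = 'C:\\Users\\fleng\\OneDrive\\Documents\\Zettelkasten'  # adjust to your folder

tags_per_zettel = []          # number of distinct tags in each note
zettels_per_tag = Counter()   # number of notes each tag appears in

for file in os.listdir(zettel_directory):
    full_path = os.path.join(zettel_directory, file)
    if os.path.isfile(full_path) and file.endswith('.md'):
        with open(full_path, 'r', encoding='utf-8', errors='ignore') as f:
            tags = set(TAG_RE.findall(f.read()))
        tags_per_zettel.append(len(tags))
        zettels_per_tag.update(tags)

print(f"Zettels scanned: {len(tags_per_zettel)}")
print(f"Average tags per Zettel: {sum(tags_per_zettel) / len(tags_per_zettel) if tags_per_zettel else 0:.2f}")
print("Ten most widely used tags:")
for tag, n in zettels_per_tag.most_common(10):
    print(f"  #{tag}: {n} Zettels")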
GitHub. Erdős #2. Problems worthy of attack / prove their worth by hitting back. -- Piet Hein. Alter ego: Erel Dogg (not the first). CC BY-SA 4.0.
@ZettelDistraction, thanks for the code!
Will Simpson
My zettelkasten is for my ideas, not the ideas of others. I don’t want to waste my time tinkering with my ZK; I’d rather dive into the work itself. My peak cognition is behind me. One day soon, I will read my last book, write my last note, eat my last meal, and kiss my sweetie for the last time. kestrelcreek.com
Here is a graph of different data, but surprisingly to me, it has about the same curve.
Links per zettel.
Python Code
import os
import re

import matplotlib.pyplot as plt
import numpy as np

def get_link_count(file_path):
    """Count wiki-style links to numeric IDs, e.g. [[202305241234]]."""
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        return len(re.findall(r'\[\[\d+\]\]', f.read()))

def get_file_paths(directory_path):
    return [os.path.join(directory_path, f) for f in os.listdir(directory_path)
            if os.path.isfile(os.path.join(directory_path, f))]

directory_path = '/Users/will/Dropbox/zettelkasten'
file_paths = get_file_paths(directory_path)
link_counts = [get_link_count(f) for f in file_paths]

# Define the bins
bins = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, np.inf]

# Use numpy's histogram function to divide the data into bins
counts, bins = np.histogram(link_counts, bins=bins)

# Convert the bins to string labels, excluding the open-ended last bin
labels = [f'{int(bins[i])}-{int(bins[i + 1]) - 1}' for i in range(len(bins) - 2)]
# Handle the last label separately, since its right edge is infinite
labels.append(f'{int(bins[-2])}+')

# Convert the histogram data to a bar graph
plt.bar(labels, counts, color='#82D6F0', edgecolor='black', zorder=2)  # Light blue color

# Set the labels for the x-axis and y-axis
plt.xlabel('Link Count Bins')
plt.ylabel('Number of Zettels')
plt.title('Link Count Frequency by Zettel')

# Add a grid
plt.grid(True, which='both', color='grey', linewidth=0.5, linestyle='--')

# Tilt the labels on the x-axis
plt.xticks(rotation=45)

# Show the plot
plt.show()
Edit @ctietze: Fixed code
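The same regex can also be turned around to count incoming links, i.e. how often each Zettel is linked to. A rough sketch, assuming the numeric IDs inside [[...]] are the link targets (the Counter keys would need mapping back to filenames for a full report):
import os
import re
from collections import Counter

directory_path = '/Users/will/Dropbox/zettelkasten'  # adjust to your folder

link_re = re.compile(r'\[\[(\d+)\]\]')
incoming = Counter()  # link target ID -> number of links pointing at it

for name in os.listdir(directory_path):
    path = os.path.join(directory_path, name)
    if os.path.isfile(path):
        with open(path, 'r', encoding='utf-8', errors='ignore') as f:
            incoming.update(link_re.findall(f.read()))

print('Most-linked-to Zettels:')
for target, n in incoming.most_common(10):
    print(f'  [[{target}]]: {n} incoming links')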
Will Simpson
My zettelkasten is for my ideas, not the ideas of others. I don’t want to waste my time tinkering with my ZK; I’d rather dive into the work itself. My peak cognition is behind me. One day soon, I will read my last book, write my last note, eat my last meal, and kiss my sweetie for the last time. kestrelcreek.com
Many thanks for that. I will try it as soon as I have a moment and will report back.
Here is the first histogram. I think I know what the long Zettels are all about: dumping grounds where I save things for later (yes, I'm a collector) under large umbrella concepts, which I have failed to keep up with and process.
Summary stats:
Average words per Zettel: 228.94
Median words per Zettel: 84
This suggests that the distribution is pretty skewed, and that once I have processed the really long notes I should see about breaking up some of the not-quite-so-long notes.
Thanks for the program.
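For hunting down those dumping grounds, a small variation on the word count script above may help; this is only a sketch, assuming the same zettel_directory as before, and it lists the longest notes first:
import os

zettel_directory = 'C:\\Users\\fleng\\OneDrive\\Documents\\Zettelkasten'  # adjust to your folder

# Collect (word_count, filename) pairs so the longest notes can be reviewed first.
sizes = []
for file in os.listdir(zettel_directory):
    full_path = os.path.join(zettel_directory, file)
    if os.path.isfile(full_path) and file.endswith('.md'):
        with open(full_path, 'r', encoding='utf-8', errors='ignore') as f:
            sizes.append((len(f.read().split()), file))

# Print the twenty longest notes
for word_count, file in sorted(sizes, reverse=True)[:20]:
    print(f'{word_count:6d}  {file}')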