@ZettelDistraction said:
@Jeremy said:
Now I see here the discussion of the size distribution of Zettels as a good starting point for Zettel maintenance, and I wonder, first: which is currently the "best" Python script for producing the histogram?
I doubt this is the best, but it does work. It does not include the Zettel format analysis.
import os
from collections import Counter
from statistics import median

import matplotlib.pyplot as plt
import seaborn as sns

def calculate_median_word_count(word_counts):
    """Calculate the median word count from a list of word counts."""
    return median(word_counts)

def initialize_word_freq_bins(max_bin_left_endpoint=1001, bin_width=50):
    """Initialize word frequency bins: "1-50", "51-100", ..., plus an overflow bin."""
    word_freq_bins = {f"{i}-{i + bin_width - 1}": 0 for i in range(1, max_bin_left_endpoint, bin_width)}
    word_freq_bins[f"{max_bin_left_endpoint}+"] = 0
    return word_freq_bins

def categorize_word_count(word_count, word_freq_bins, max_bin_left_endpoint):
    """Categorize the Zettel into a word frequency bin based on its word count."""
    if word_count >= max_bin_left_endpoint:
        word_freq_bins[f"{max_bin_left_endpoint}+"] += 1
    else:
        # (word_count - 1) // 50 keeps exact multiples of 50 in their own bin
        # (e.g. 50 belongs in "1-50", not "51-100"); empty files land in "1-50".
        bin_index = max(word_count - 1, 0) // 50
        bin_label = f"{bin_index * 50 + 1}-{bin_index * 50 + 50}"
        if bin_label in word_freq_bins:
            word_freq_bins[bin_label] += 1

# Apply the seaborn whitegrid style
sns.set_theme(style="whitegrid")

# Initialize variables
word_counts = []
word_freq_bins = initialize_word_freq_bins()

# Directory processing -- adjust the path to your own Zettelkasten folder
zettel_directory = 'C:\\Users\\fleng\\OneDrive\\Documents\\Zettelkasten'
for file in os.listdir(zettel_directory):
    full_path = os.path.join(zettel_directory, file)
    if os.path.isfile(full_path) and file.endswith('.md'):
        with open(full_path, 'r', encoding='utf-8') as f:
            text = f.read()
        word_count = len(text.split())
        word_counts.append(word_count)
        categorize_word_count(word_count, word_freq_bins, 1001)

# Displaying word frequency bins
plt.figure(figsize=(10, 7.5))  # Set the size of the plot
plt.bar(word_freq_bins.keys(), word_freq_bins.values(), color='skyblue')
plt.title('Word Frequency Bins')
plt.xlabel('Word Count')
plt.ylabel('Number of Zettels')
plt.xticks(rotation=90)
plt.show()

# Displaying word count statistics
print(f"Total number of words: {sum(word_counts)}")
print(f"Average number of words per Zettel: {sum(word_counts) / len(word_counts) if word_counts else 0:.1f}")
print(f"Median number of words in a Zettel: {calculate_median_word_count(word_counts) if word_counts else 0}")
print(f"Minimum number of words in a Zettel: {min(word_counts, default=0)}")
print(f"Maximum number of words in a Zettel: {max(word_counts, default=0)}")
print(f"Most common word count: {Counter(word_counts).most_common(1)[0] if word_counts else 'N/A'}")
print(f"Least common word count: {Counter(word_counts).most_common()[-1] if word_counts else 'N/A'}")
@Jeremy said:
And also, have people looked at the distribution of tags (how many per Zettel, how many Zettels per tag) as an indication of "goodness"?
Thanks
I haven't looked into this, but I will add it at some point.
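In the meantime, here is a minimal sketch of what that tag analysis might look like. It assumes inline #hashtag-style tags in the note text; the regex would need adjusting for YAML front matter or other tagging conventions, and the directory path is just the one from the script above.
import os
import re
from collections import Counter

# Assumed tag convention: inline #tags such as #philosophy; adjust TAG_RE as needed.
TAG_RE = re.compile(r'(?<!\S)#([A-Za-z][\w-]*)')

zettel_directory = 'C:\\Users\\fleng\\OneDrive\\Documents\\Zettelkasten'  # adjust to your folder

tags_per_zettel = []          # number of distinct tags in each note
zettels_per_tag = Counter()   # number of notes each tag appears in

for file in os.listdir(zettel_directory):
    full_path = os.path.join(zettel_directory, file)
    if os.path.isfile(full_path) and file.endswith('.md'):
        with open(full_path, 'r', encoding='utf-8', errors='ignore') as f:
            tags = set(TAG_RE.findall(f.read()))
        tags_per_zettel.append(len(tags))
        zettels_per_tag.update(tags)

print(f"Zettels scanned: {len(tags_per_zettel)}")
print(f"Average tags per Zettel: {sum(tags_per_zettel) / len(tags_per_zettel) if tags_per_zettel else 0:.2f}")
print("Ten most widely used tags:")
for tag, n in zettels_per_tag.most_common(10):
    print(f"  #{tag}: {n} Zettels")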
GitHub. Erdős #2. Problems worthy of attack / prove their worth by hitting back. -- Piet Hein. Alter ego: Erel Dogg (not the first). CC BY-SA 4.0.
@ZettelDistraction, thanks for the code!
Will Simpson
My zettelkasten is for my ideas, not the ideas of others. I don’t want to waste my time tinkering with my ZK; I’d rather dive into the work itself. My peak cognition is behind me. One day soon, I will read my last book, write my last note, eat my last meal, and kiss my sweetie for the last time. kestrelcreek.com
Here is a graph of different data, but surprisingly to me, it has about the same curve.
Links per zettel.
Python Code
import os
import re

import matplotlib.pyplot as plt
import numpy as np

def get_link_count(file_path):
    """Count wiki-style links to numeric IDs, e.g. [[202305241234]]."""
    with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
        return len(re.findall(r'\[\[\d+\]\]', f.read()))

def get_file_paths(directory_path):
    return [os.path.join(directory_path, f) for f in os.listdir(directory_path)
            if os.path.isfile(os.path.join(directory_path, f))]

directory_path = '/Users/will/Dropbox/zettelkasten'
file_paths = get_file_paths(directory_path)
link_counts = [get_link_count(f) for f in file_paths]

# Define the bins
bins = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, np.inf]

# Use numpy's histogram function to divide the data into bins
counts, bins = np.histogram(link_counts, bins=bins)

# Convert the bins to string labels, excluding the open-ended last bin
labels = [f'{int(bins[i])}-{int(bins[i + 1]) - 1}' for i in range(len(bins) - 2)]
# Handle the last label separately, since its right edge is infinite
labels.append(f'{int(bins[-2])}+')

# Convert the histogram data to a bar graph
plt.bar(labels, counts, color='#82D6F0', edgecolor='black', zorder=2)  # Light blue color

# Set the labels for the x-axis and y-axis
plt.xlabel('Link Count Bins')
plt.ylabel('Number of Zettels')
plt.title('Link Count Frequency by Zettel')

# Add a grid
plt.grid(True, which='both', color='grey', linewidth=0.5, linestyle='--')

# Tilt the labels on the x-axis
plt.xticks(rotation=45)

# Show the plot
plt.show()
Edit @ctietze: Fixed code
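The same regex can also be turned around to count incoming links, i.e. how often each Zettel is linked to. A rough sketch, assuming the numeric IDs inside [[...]] are the link targets (the Counter keys would need mapping back to filenames for a full report):
import os
import re
from collections import Counter

directory_path = '/Users/will/Dropbox/zettelkasten'  # adjust to your folder

link_re = re.compile(r'\[\[(\d+)\]\]')
incoming = Counter()  # link target ID -> number of links pointing at it

for name in os.listdir(directory_path):
    path = os.path.join(directory_path, name)
    if os.path.isfile(path):
        with open(path, 'r', encoding='utf-8', errors='ignore') as f:
            incoming.update(link_re.findall(f.read()))

print('Most-linked-to Zettels:')
for target, n in incoming.most_common(10):
    print(f'  [[{target}]]: {n} incoming links')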
Will Simpson
My zettelkasten is for my ideas, not the ideas of others. I don’t want to waste my time tinkering with my ZK; I’d rather dive into the work itself. My peak cognition is behind me. One day soon, I will read my last book, write my last note, eat my last meal, and kiss my sweetie for the last time. kestrelcreek.com
Many thanks for that. I will try it as soon as I have a moment and will report back.
Here is the first histogram. I think I know what the long Zettels are all about: dumping grounds where I save things for later (yes, I'm a collector) under large umbrella concepts, which I have failed to keep up with and process.
Summary stats:
Average words per Zettel: 228.94
Median words per Zettel: 84
This suggests that the distribution is pretty skewed, and that once I have processed the really long notes I should see about breaking up some of the not-quite-so-long notes.
Thanks for the program.
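For hunting down those dumping grounds, a small variation on the word count script above may help; this is only a sketch, assuming the same zettel_directory as before, and it lists the longest notes first:
import os

zettel_directory = 'C:\\Users\\fleng\\OneDrive\\Documents\\Zettelkasten'  # adjust to your folder

# Collect (word_count, filename) pairs so the longest notes can be reviewed first.
sizes = []
for file in os.listdir(zettel_directory):
    full_path = os.path.join(zettel_directory, file)
    if os.path.isfile(full_path) and file.endswith('.md'):
        with open(full_path, 'r', encoding='utf-8', errors='ignore') as f:
            sizes.append((len(f.read().split()), file))

# Print the twenty longest notes
for word_count, file in sorted(sizes, reverse=True)[:20]:
    print(f'{word_count:6d}  {file}')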