Zettelkasten Forum


Share with us what is happening in your ZK this week. February 9, 2024


Comments

  • @ZettelDistraction said:
    @Jeremy said:
    I see the discussion here of the size distribution of Zettels as a good starting point for Zettel maintenance, and I wonder, first, which is currently the "best" Python script for producing the histogram?

    I doubt this is the best, but it does work. It does not include the Zettel format analysis.

    import os
    import matplotlib.pyplot as plt
    from collections import Counter
    import seaborn as sns
    from statistics import median
    
    def calculate_median_word_count(word_counts):
        """Calculate the median word count from a list of word counts"""
        return median(word_counts)
    
    def initialize_word_freq_bins(max_bin_left_endpoint=1001, bin_width=50):
        """Initialize word frequency bins: 1-50, 51-100, ..., plus an overflow bin"""
        word_freq_bins = {f"{i}-{i + bin_width - 1}": 0 for i in range(1, max_bin_left_endpoint, bin_width)}
        word_freq_bins[f"{max_bin_left_endpoint}+"] = 0
        return word_freq_bins
    
    def categorize_word_count(word_count, word_freq_bins, max_bin_left_endpoint, bin_width=50):
        """Categorize the Zettel based on the number of words into the word frequency bins"""
        if word_count >= max_bin_left_endpoint:
            word_freq_bins[f"{max_bin_left_endpoint}+"] += 1
        else:
            # Subtract 1 before dividing so that exact multiples of bin_width
            # (50, 100, ..., 1000) land in the bin they close, e.g. 50 -> "1-50".
            # Empty files (0 words) are counted in the first bin.
            bin_start = ((max(word_count, 1) - 1) // bin_width) * bin_width + 1
            bin_label = f"{bin_start}-{bin_start + bin_width - 1}"
            if bin_label in word_freq_bins:
                word_freq_bins[bin_label] += 1
    
    # Apply seaborn's whitegrid style
    sns.set(style="whitegrid")
    
    # Initialize variables
    word_counts = []
    word_freq_bins = initialize_word_freq_bins()
    
    # Directory processing
    zettel_directory = 'C:\\Users\\fleng\\OneDrive\\Documents\\Zettelkasten'
    for file in os.listdir(zettel_directory):
        full_path = os.path.join(zettel_directory, file)
        if os.path.isfile(full_path) and file.endswith('.md'):
            with open(full_path, 'r', encoding='utf-8') as f:
                text = f.read()
            word_count = len(text.split())
            word_counts.append(word_count)
            categorize_word_count(word_count, word_freq_bins, 1001)
    
    # Display the word frequency bins as a bar chart
    plt.figure(figsize=(10, 7.5)) # Set the size of the plot
    plt.bar(word_freq_bins.keys(), word_freq_bins.values(), color='skyblue')
    plt.title('Word Frequency Bins')
    plt.xlabel('Word Count')
    plt.ylabel('Number of Zettels')
    plt.xticks(rotation=90)
    plt.show()
    
    # Displaying word count statistics
    print(f"Total number of words: {sum(word_counts)}")
    print(f"Average number of words per Zettel: {sum(word_counts) / len(word_counts) if word_counts else 0}")
    print(f"Median number of words in a Zettel: {calculate_median_word_count(word_counts) if word_counts else 0}")
    print(f"Minimum number of words in a Zettel: {min(word_counts, default=0)}")
    print(f"Maximum number of words in a Zettel: {max(word_counts, default=0)}")
    print(f"Most common word count: {Counter(word_counts).most_common(1)[0] if word_counts else 'N/A'}")
    print(f"Least common word count: {Counter(word_counts).most_common()[-1] if word_counts else 'N/A'}")
    
    

    And also, have people looked at the distribution of tags (how many per Zettel, how many Zettels per tag) as an indication of "goodness"?

    Thanks

    I haven't looked into this, but I will add it at some point.
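
    As a starting point for the tag question, here is a minimal sketch of both distributions (tags per Zettel, Zettels per tag). It assumes tags are written inline as `#tag` with a letter right after the hash, which may not match your notes; the names `TAG_PATTERN`, `extract_tags`, and `tag_distributions` and the sample notes are made up for illustration.

```python
import re
from collections import Counter

# Assumption: a tag is "#" followed directly by a letter, e.g. "#math".
# Markdown headings ("# Heading") and "#2" are deliberately not matched.
TAG_PATTERN = re.compile(r"(?<![\w#])#([A-Za-z][\w-]*)")

def extract_tags(text):
    """Return the set of distinct (lowercased) tags in one Zettel's text."""
    return {tag.lower() for tag in TAG_PATTERN.findall(text)}

def tag_distributions(zettels):
    """zettels maps a Zettel name to its text.

    Returns (tags_per_zettel, zettels_per_tag):
    tags_per_zettel maps a tag count to the number of Zettels with that many tags,
    zettels_per_tag maps each tag to the number of Zettels that use it.
    """
    tags_per_zettel = Counter()
    zettels_per_tag = Counter()
    for text in zettels.values():
        tags = extract_tags(text)
        tags_per_zettel[len(tags)] += 1
        zettels_per_tag.update(tags)
    return tags_per_zettel, zettels_per_tag

# Hypothetical sample notes, for illustration only
sample = {
    "202402091200.md": "# Heading\nSome text #math #logic",
    "202402091300.md": "No tags here.",
    "202402091400.md": "More #math, and a link [[202402091200]].",
}
tags_per_zettel, zettels_per_tag = tag_distributions(sample)
print(dict(tags_per_zettel))
print(zettels_per_tag.most_common())
```

    To run it on a real archive, build the `zettels` dictionary from the same directory loop as in the script above; either counter can then be passed to `plt.bar` in the same way as the word frequency bins.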

    GitHub. Erdős #2. CC BY-SA 4.0. Problems worthy of attack / prove their worth by hitting back. -- Piet Hein.

  • @ZettelDistraction, thanks for the code!

    Will Simpson
    I must keep doing my best even though I'm a failure. My peak cognition is behind me. One day soon I will read my last book, write my last note, eat my last meal, and kiss my sweetie for the last time.
    kestrelcreek.com

  • edited February 20

    Here is a graph of different data; surprisingly to me, it has about the same curve.

    Links per zettel.


    Python Code
    import os
    import re
    import matplotlib.pyplot as plt
    import numpy as np
    
    def get_link_count(file_path):
        """Count [[123]]-style links (numeric IDs only) in one file"""
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
            return len(re.findall(r'\[\[\d+\]\]', f.read()))
    
    def get_file_paths(directory_path):
        return [os.path.join(directory_path, f) for f in os.listdir(directory_path) if os.path.isfile(os.path.join(directory_path, f))]
    
    directory_path = '/Users/will/Dropbox/zettelkasten'
    file_paths = get_file_paths(directory_path)
    link_counts = [get_link_count(f) for f in file_paths]
    
    # Define the bins
    bins = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, np.inf]
    # Use numpy's histogram function to divide the data into bins
    counts, bins = np.histogram(link_counts, bins=bins)
    # Convert the bins to string labels, excluding the last bin
    labels = [f'{int(bins[i])}-{int(bins[i+1])-1}' for i in range(len(bins)-2)]
    
    # Handle the last label separately
    labels.append(f'{int(bins[-2])}+')
    
    # Convert the histogram data to a bar graph
    plt.bar(labels, counts, color='#82D6F0', edgecolor="black", zorder=2)  # Light blue color
    
    # Set the labels for the x-axis and y-axis
    plt.xlabel('Link Count Bins')
    plt.ylabel('Number of Zettels')
    plt.title('Link Count Frequency by Zettel')
    
    # Add a grid
    plt.grid(True, which='both', color='grey', linewidth=0.5, linestyle='--')
    # Tilt the labels on the x-axis
    plt.xticks(rotation=45)
    
    # Show the plot
    plt.show()
    
    


    Edit @ctietze: Fixed code


    Will Simpson
    I must keep doing my best even though I'm a failure. My peak cognition is behind me. One day soon I will read my last book, write my last note, eat my last meal, and kiss my sweetie for the last time.
    kestrelcreek.com

  • I doubt this is the best, but it does work. It does not include the Zettel format analysis.

    Many thanks for that. I will try it as soon as I have a moment and will report back.

  • Here is the first histogram. I think I know what the long Zettels are all about: they are dumping grounds where I save things for later (yes, I'm a collector) under large umbrella concepts, and which I have failed to keep up with and process.

    Summary stats:
    Average words per Zettel: 228.940876656473
    Median per Zettel: 84

    This suggests that the distribution is pretty skewed, and that once I have processed the really long notes I should see about breaking up some of the not-quite-so-long notes.
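
    A mean well above the median is the usual sign of a long right tail. The sketch below shows one way to quantify that and to list the notes worth processing first. The 90th-percentile cutoff, the function names `skew_summary` and `long_notes`, and the sample counts are assumptions chosen for illustration, not values from the histogram above.

```python
from statistics import mean, median, quantiles

def skew_summary(word_counts):
    """Return (mean, median, mean/median); a ratio well above 1 suggests a long right tail."""
    avg = mean(word_counts)
    med = median(word_counts)
    return avg, med, (avg / med if med else float("inf"))

def long_notes(word_counts_by_name, percentile=90):
    """Return (name, count) pairs above the given percentile, longest first."""
    counts = list(word_counts_by_name.values())
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cutoff = quantiles(counts, n=100, method="inclusive")[percentile - 1]
    flagged = [(name, c) for name, c in word_counts_by_name.items() if c > cutoff]
    return sorted(flagged, key=lambda item: item[1], reverse=True)

# Hypothetical word counts keyed by file name, for illustration only
sample_counts = {
    "a.md": 40, "b.md": 60, "c.md": 75, "d.md": 84, "e.md": 84,
    "f.md": 90, "g.md": 120, "h.md": 150, "i.md": 400, "j.md": 1200,
}
avg, med, ratio = skew_summary(list(sample_counts.values()))
print(f"mean={avg:.1f} median={med} ratio={ratio:.2f}")
print(long_notes(sample_counts))
```

    Lowering `percentile` after the longest notes have been split would surface the next tier of not-quite-so-long notes.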

    Thanks for the program.
