How Do You Structure Zettels from Long-form YouTube Tutorials?
I often watch 2–4 hour YouTube tutorials or lectures on topics like Python, data structures, or media theory. These videos contain both conceptual explanations and live demos, but I struggle to extract structured notes without pausing constantly.
Has anyone developed a Zettelkasten-friendly workflow for capturing insights from long-form videos?
Do you take fleeting notes first, then convert them into Zettels? Or pause regularly to write permanent notes on the spot?
Comments
I essentially process the video in two phases.
I either do phase 1 and then phase 2 for the whole video, or alternate the two phases over multiple sessions when I feel inspired to.
Phase 1
One single "literature note" for every video, in which I write my reflections, questions, ideas and points captured from the video, as raw and frictionless as I can.
I pause the video for as long as I need to write.
I use a hierarchical bullet-list style, which pushes me to be raw and concise.
Sometimes I capture a screenshot when a frame is full of meaning.
For some bullets it can be very useful to write down the video timestamp.
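A minimal sketch of how a timestamp on a bullet could be turned into a clickable link back to that moment, assuming the video is on YouTube and relying on the `t=` query parameter of its watch URLs. The helper names and the video ID `abc123` are placeholders for illustration, not part of the workflow described above.

```python
from urllib.parse import urlencode


def parse_timestamp(ts: str) -> int:
    """Convert 'H:MM:SS', 'MM:SS' or 'SS' into a number of seconds."""
    parts = [int(p) for p in ts.strip().split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unsupported timestamp: {ts!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part
    return seconds


def timestamp_link(video_id: str, ts: str) -> str:
    """Build a watch URL that starts playback at the given timestamp."""
    query = urlencode({"v": video_id, "t": f"{parse_timestamp(ts)}s"})
    return f"https://www.youtube.com/watch?{query}"


def bullet(text: str, video_id: str, ts: str) -> str:
    """Format one raw bullet with its timestamp and a link to that moment."""
    return f"- {text} ({ts}, {timestamp_link(video_id, ts)})"


if __name__ == "__main__":
    # 'abc123' is a made-up video ID used only for this example.
    print(bullet("recursion demo starts here", "abc123", "1:02:03"))
```

Keeping the human-readable timestamp next to the link means the bullet is still useful if the note is read somewhere links are not clickable.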
Phase 2
I revisit the bullet list and start deciding what to do with every bullet.
I can develop a bullet into a bigger piece of text, I can move and rearrange bullets to create small clusters, I can turn pieces of text into concepts, and I can add new bullets as further developments. All of this happens in the bullet list. I also try to write a "title", in bold, for the relevant texts or clusters.
In the next section of the literature note, I start composing an outline of the titles I've written in the bullet list. These titles almost always become links to new notes, and I can start writing the body of those notes by copying or further developing the material from the outline, giving enough context. Other times the texts are updates to already existing notes.
At the end of the process this outline represents the network of thoughts extracted from the video, and I can start distributing these links into structure notes or other notes. Sometimes the outline becomes an initial structure note about a field.
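The step from bold titles to an outline of links can be sketched roughly as follows. This assumes the bullet list is written in Markdown with `**bold**` titles and that the note app uses `[[wiki-style]]` links; both are assumptions, and the function names and sample bullets are invented for illustration.

```python
import re

# Matches Markdown bold spans such as **Some title**.
BOLD_TITLE = re.compile(r"\*\*(.+?)\*\*")


def extract_titles(bullets: str) -> list[str]:
    """Collect bold titles in order of first appearance, without duplicates."""
    seen: list[str] = []
    for line in bullets.splitlines():
        for title in BOLD_TITLE.findall(line):
            title = title.strip()
            if title and title not in seen:
                seen.append(title)
    return seen


def build_outline(bullets: str) -> str:
    """Turn every bold title into one outline entry linking to a future note."""
    return "\n".join(f"- [[{title}]]" for title in extract_titles(bullets))


if __name__ == "__main__":
    # Hypothetical phase 1 bullets, made up for this example.
    raw = (
        "- **Binary search** needs sorted input\n"
        "  - halves the range each step\n"
        "- compare with **Linear scan**\n"
        "- again **Binary search**\n"
    )
    print(build_outline(raw))
```

The outline it produces is only a starting list of candidate notes; deciding which titles deserve their own note, and writing the body with enough context, remains manual work.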
Writing main notes directly from the video is too inefficient for me.
It's better to "grind" the video into raw pieces of text while watching, and then write the main notes later, starting from the raw pieces and taking the time needed for thought.
This is most useful for YouTube content that is word-driven and may not be entirely suitable for visually oriented tutorials. One thing you could do when you import the material into your notes is have a tool transcribe the audio, so that you don't have to do it manually.
For example, I have the Snipd app, which lets me have AI summarize and transcribe one of Sascha's coaching sessions available on YouTube. I use it to seed a literature note, which looks like this:
This is prior to my own "processing", so pretty much everything is as exported from Snipd. Starting from this, I can edit or add my own notes to make it my own, reading or capturing from the screen things that are not spoken.
I don't mean to shill for Snipd, but it allows uploading videos for processing like this. I believe there are other transcription tools and services, or you can read the subtitles directly from YouTube.
Yes, transcriptions are a good resource to facilitate work, but I'm personally a bit reluctant to rely on them. Exclusively on them, at least.
A video almost always carries a wealth of information, and an effectiveness of delivery, that a transcript often doesn't capture.
For example, someone who speaks with a certain tone can highlight the importance of a concept.
Timings, body language, expressions, can make a difference when given the same amount of available textual information.
A conversation between two people on the topic can convey much more than the text. I remember di
Another example is the content of a single frame, which can be much more illuminating than the accompanying explanation.
Non-verbal communication aspects of the video, in general.
A video experience can have its own specific benefits.
Different modes, different activations: listening activates different cognitive processes than reading does. Watching and listening together can help reinforce certain concepts more effectively through multisensory engagement.
So I still recommend watching the video at least once as part of the work.
Balancing time and effort needed with the importance of the work, of course. If I need only to capture a small main concept from a video, the quickest way could be the best way.
Transcripts have their own advantages, such as speed, searchability, easier annotation, and above all the ability to pick up details from words read rather than heard.
Machine transcription and AI summaries can indeed feed into the collector's fallacy, which is a major downside users should be aware of.
That said, processing video content is far more time-consuming without transcription and timestamping. You’d have to listen and manually transcribe quotes verbatim, which is fine if you’re diving straight into processing or rephrasing. But most people can’t meticulously handle every piece of content like that, especially given the sheer volume and varying quality of material out there.
With written sources, quoting and referencing is straightforward thanks to textual data and page numbers. For videos, machine transcription and timestamping make creating the framework for structure or literature notes much more time-efficient.
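As a rough sketch of how a timestamped transcript can seed that framework, the snippet below turns subtitle text into one bullet per minute of video. It assumes the transcript is available as SubRip (`.srt`) text; the function names and the 60-second window are arbitrary choices for illustration, and real exports from a given tool may need a different parser.

```python
import re

# Start time of an SRT cue line, e.g. "00:01:10,000 --> 00:01:12,000".
CUE_TIME = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.]\d{3}\s*-->")


def parse_srt(srt_text: str) -> list[tuple[int, str]]:
    """Return (start_seconds, text) for each cue found in SRT-formatted text."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            match = CUE_TIME.match(line)
            if match:
                hours, minutes, seconds = (int(g) for g in match.groups())
                text = " ".join(lines[i + 1:])
                if text:
                    cues.append((hours * 3600 + minutes * 60 + seconds, text))
                break
    return cues


def to_bullets(cues: list[tuple[int, str]], window: int = 60) -> str:
    """Group cues into one bullet per `window` seconds, labeled by window start."""
    groups: dict[int, list[str]] = {}
    for start, text in cues:
        groups.setdefault(start // window, []).append(text)
    bullets = []
    for key in sorted(groups):
        t = key * window
        label = f"{t // 3600}:{t % 3600 // 60:02d}:{t % 60:02d}"
        bullets.append(f"- [{label}] " + " ".join(groups[key]))
    return "\n".join(bullets)
```

The result is only raw material: each bullet still has to be read, trimmed and rephrased before it becomes a note of your own, which is where the collector's fallacy mentioned above comes in.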