My very noobish attempt at making atomic extracts from a PDF

achamess · January 2018

Hi friends,

I work with PDFs a lot in my academic work, mostly reading journal articles. I commonly highlight important parts of the text. To support my writing, I want to extract each highlight as an individual, atomic text file. Doing this by hand would be tedious. My coding skills are very novice, but I know enough Python and command line to be dangerous. In any case, I set a minimal goal of getting highlights out of a Skim PDF and making individual files. Happily, I did that. The code is crude and there is a lot more to do, but I want to share it to invite anyone else who might be interested to contribute or provide feedback.

Here it is: skim_to_md

With this basic functionality in place, I will keep layering more on. My ultimate goal is to format the text with additional fields and metadata using Markdown. I could probably hard-code this into the script, but I'm thinking that using a templating engine like Django's or Jinja2 will make this workflow more flexible and extensible in the future. So that's where I'll go next. I'm totally open to changing the plan.

I realize that these extract notes are information, not knowledge, @ctietze. But they are the beginning of knowledge. I have different kinds of notes in my workflow, and these are the raw material. My knowledge zettel are where I string pieces together in an overview/outline note, linking to the raw extracts I create here.

Here is an example of what I want a full-fledged note to look like:

Title:

tags: #tag1 #tag2

Summary

Quote

[The highlighted text]

Reference: Smith et al. 2017 Journal of Neuroscience
Citekey: #smith2017a

Comments
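
A minimal sketch of how that layout could be filled in from a template. It uses Python's built-in string.Template as a stand-in for Jinja2, and the field names and the render_note helper are hypothetical, not taken from skim_to_md:

```python
from string import Template

# Hypothetical template mirroring the note layout above.
NOTE_TEMPLATE = Template(
    "Title: $title\n"
    "\n"
    "tags: $tags\n"
    "\n"
    "Summary\n"
    "\n"
    "$summary\n"
    "\n"
    "Quote\n"
    "\n"
    "$quote\n"
    "\n"
    "Reference: $reference\n"
    "Citekey: #$citekey\n"
    "\n"
    "Comments\n"
)


def render_note(title, tags, summary, quote, reference, citekey):
    """Fill the template; tags is a list of words without the leading '#'."""
    return NOTE_TEMPLATE.substitute(
        title=title,
        tags=" ".join("#" + t for t in tags),
        summary=summary,
        quote=quote,
        reference=reference,
        citekey=citekey,
    )
```

Keeping the layout in one template string means the note format can be rearranged without touching the extraction logic, which is the flexibility a full templating engine would also give.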

Stuff I need to add:

Work on naming the files more informatively
Make different Markdown fields using templates
Add reference information/citekey
Clean up the quoted text more
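
For the first and last items, here is a rough sketch of what cleaning the quote and building a more informative file name might look like. Both clean_quote and slugify are hypothetical helpers, and the rules are only a starting point:

```python
import re


def clean_quote(text):
    """Tidy text copied out of a PDF highlight."""
    # Rejoin words split by a hyphen at a line break: "neu-\nral" -> "neural".
    # Caveat: this also merges genuinely hyphenated words that happen to
    # break across lines (e.g. "well-\nknown" becomes "wellknown").
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse remaining newlines and runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def slugify(summary, max_words=8):
    """Turn a short summary into a file-name-friendly fragment."""
    words = re.findall(r"[A-Za-z0-9]+", summary.lower())
    return "-".join(words[:max_words])
```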

Interacting with these notes

And of course, once you make these, you need to work with them. At a very basic level, I could just use the Finder. But a very fast and user-friendly interface could help me make the most of these notes. nvALT.app is a great example. In the end, it's just a nice wrapper for text files in a folder, but that wrapper makes working with the files that much better. I'm trying to make Sublime Text 3 work for me. @rene's add-on gives a lot of the feel of nvALT, which is great. So I may do that. Or if Bitwriter.app ever comes out, that might be the best way to interact with the notes, since I hear it will support folders.

AsafKeller · January 2018

I am receiving the following error:
File "skim_to_md.py", line 20, in
pdf = pdf[0].replace('"','')
IndexError: list index out of range

achamess · January 2018

@AsafKeller said:
I am receiving the following error:
File "skim_to_md.py", line 20, in
pdf = pdf[0].replace('"','')
IndexError: list index out of range

Hi. Do you have the PDF in the same directory as the script? That is probably the issue.
I updated the script btw. You may want to re-download.
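
For anyone hitting the same traceback: pdf[0] raises IndexError when the list of PDFs is empty. A minimal sketch of a guard, assuming the script looks for PDFs in a directory (find_first_pdf is a hypothetical helper, not part of skim_to_md):

```python
from pathlib import Path


def find_first_pdf(directory="."):
    """Return the first PDF (sorted by name) in directory, or fail clearly."""
    pdfs = sorted(
        p for p in Path(directory).iterdir() if p.suffix.lower() == ".pdf"
    )
    if not pdfs:
        # A readable message instead of "list index out of range".
        raise FileNotFoundError(
            f"No PDF found in {Path(directory).resolve()}; "
            "put the PDF next to the script or pass its folder."
        )
    return pdfs[0]
```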

AsafKeller · January 2018

Do you have the PDF in the same directory as the script? That is probably the issue.

Yes, that was the problem. Thanks much!

achamess · January 2018

@AsafKeller said:

Do you have the PDF in the same directory as the script? That is probably the issue.

Yes, that was the problem. Thanks much!

@AsafKeller Excellent! And feel free to fork the script and make changes. You can go into the Python file and change the text layout too if you want to rearrange how the Markdown text looks.

I'm only beginning to realize the power of making my own scripts. There are a lot of other things I'd like to do to make this even easier and more powerful, like integrate with my reference manager.

BTW. Is this you?
http://blog.devontechnologies.com/2017/11/productivity²/
Is Tinderbox + DTP still your preferred PKB?

Small world! I'm also a pain researcher (MD-PhD student at Duke). I've been trying to figure out how to make a knowledgebase for all of my scholarly activities. It's been a struggle throughout grad school but I think I'm finally getting some clarity. This script (and others I plan to make) will hopefully reduce the friction and make a timeless system that I actually use.

achamess · January 2018

I should be doing other stuff, but this has been a wonderful exercise in learning some code. I integrated some AppleScript to get information from Skim.app and Papers3.app (which I use). So to use this, you must have both Skim and Papers3. But now everything is automatic, without any manual input of metadata.

https://github.com/achamess/skim_to_md_script

Stuff I added:

Automatically finds metadata from Papers3
No overwrite of previous notes
Naming of extracts contains summary one-liner (your own words)
Automatic folder creation based on name of PDF
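
As a rough illustration of the last three items (not the actual code in the repository), the folder creation, summary-based naming, and no-overwrite behavior could be sketched like this. The helper names and the numbering scheme for duplicates are assumptions:

```python
from pathlib import Path


def extract_path(pdf_path, summary, out_root="."):
    """Build a non-colliding path for a new extract note.

    The folder is named after the PDF; the file after the summary one-liner.
    """
    folder = Path(out_root) / Path(pdf_path).stem
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    # Keep only characters that are safe in file names.
    base = "".join(c for c in summary if c.isalnum() or c in " -_").strip()
    base = base or "extract"
    candidate = folder / f"{base}.md"
    counter = 2
    while candidate.exists():  # never overwrite an earlier note
        candidate = folder / f"{base} {counter}.md"
        counter += 1
    return candidate


def write_extract(pdf_path, summary, text, out_root="."):
    """Write one extract to its own Markdown file and return the path."""
    path = extract_path(pdf_path, summary, out_root)
    path.write_text(text, encoding="utf-8")
    return path
```

Running write_extract twice with the same summary produces a second file with a numeric suffix rather than replacing the first.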

Zettelkasten Forum
