Tool For Thought

Steven Johnson
Jan 29, 2005
This week’s edition of the Times Book Review features an essay that I wrote about the research system I’ve used for the past few years: a tool for exploring the couple thousand notes and quotations that I’ve assembled over the past decade, along with the text of finished essays and books. I suspect a number of you will be curious about the technical details, so I’ve put together a little overview here, along with some specific observations. For starters, though, go read the essay and then come back once you’ve got the general picture.

The software I use now is called DevonThink, and I’m sorry to report that it is only available for Mac OS X. (I know there are a number of advanced search tools available for Windows, so I’m sure most of what I describe here could be reproduced — I just don’t know enough about the search tools on that platform to recommend anything.)

I talked in the Times essay about using the tool as a springboard for new ideas and inspiration. Here’s what that process looks like in practice. This is the window that shows me an overview of part of my “research library” in DevonThink:

[screenshot: screen1.jpg]

These are all books that I’ve transcribed passages from over the past 10 years or so — you can see how many quotes I have from each book in the little number in parentheses after its title. Oftentimes I’ll start the exploration with a straightforward keyword search, in this case: “urban ecosystem.” I plug that in, and get back one result, a short quote from Manuel DeLanda’s excellent A Thousand Years of Nonlinear History.

[screenshot: screen2.jpg]

This is where it gets interesting. I take that quote, and click on the “see also” button, which generates an instant list of other documents or quotes that have some semantic connection to the original one. I can see a few words from the entry, along with the author and book title.

[screenshot: screen3.jpg]

I find another, more elaborate quote from DeLanda in that bunch:

[screenshot: screen4.jpg]

And then I perform a “see also” on that quote. I get back a few pointers to essays that I’ve actually written — and completely forgotten about — including a review of an E.O. Wilson book on biodiversity that I wrote about three years ago. Ultimately, I end up with this wonderful quote from Jane Jacobs that draws an explicit analogy between natural and man-made ecosystems. The whole process takes me no more than a minute.

[screenshot: screen5.jpg]

Over the past few years of working with this approach, I’ve learned a few key principles. The system works for three reasons:

1) The DevonThink software does a great job of making semantic connections between documents based on word frequency.

2) I have pre-filtered the results by selecting quotes that interest me, and by archiving my own prose. The signal-to-noise ratio is so high because I’ve eliminated 99% of the noise on my own.

3) Most of the entries are in a sweet spot where length is concerned: between 50 and 500 words. If I had whole eBooks in there, instead of little clips of text, the tool would be useless.
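DevonThink’s actual matching algorithm is proprietary, so this is only my own illustration of the word-frequency idea behind point #1: each quote becomes a tf-idf vector of weighted word counts, and a “see also” is just a ranking of the other quotes by cosine similarity to the one you’re looking at. The function names and the toy library in the usage below are invented for the sketch, not taken from DevonThink:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokens; a crude stand-in for real text processing."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf_vectors(docs):
    """Build a tf-idf vector (stored as a dict) for each document."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    df = Counter()                      # document frequency of each term
    for counts in tokenized:
        df.update(counts.keys())
    n = len(docs)
    vectors = []
    for counts in tokenized:
        total = sum(counts.values())
        vectors.append({
            term: (freq / total) * math.log(n / df[term])
            for term, freq in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse dict-vectors."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def see_also(query_index, vectors, top_n=3):
    """Rank every other document by similarity to the chosen one."""
    scores = [
        (cosine(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return [i for score, i in sorted(scores, reverse=True)[:top_n]]
```

With a tiny library of three passages — say, one on cities as ecosystems, one on printing, one on urban energy flows — `see_also(0, tf_idf_vectors(library))` surfaces the energy-flow passage first, because shared-but-uncommon words like “energy” carry the most weight.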

I think #3 is the point that needs to be driven home to people working on desktop search. It’s been hidden from us largely because the web itself is broken up into pages that are often in that 500-word sweet spot. Think about the difference between Google and Google Desktop: Google gives you URLs in return for your search request; Google Desktop gives you files (and email messages or web pages where appropriate). On the web, a URL is an appropriate search result because it’s generally the right scale: a single web page generally doesn’t include that much information (and of course a blog post even less). So the page Google serves up is often very tightly focused on the information you’re looking for.

But files are a different matter. Think of all the documents you have on your machine that are longer than a thousand words: business plans, articles, ebooks, pdfs of product manuals, research notes, etc. When you’re making an exploratory search through that information, you’re not looking for the files that include the keywords you’ve identified; you’re looking for specific sections of text — sometimes just a paragraph — that relate to the general theme of the search query. If I do a Google Desktop search for “Richard Dawkins” I’ll get dozens of documents back, but then I have to go through and find all the sections inside those documents that are relevant to Dawkins, which saves me almost no time.

So the proper unit for this kind of exploratory, semantic search is not the file, but rather something else, something I don’t quite have a word for: a chunk or cluster of text, something close to those little quotes that I’ve assembled in DevonThink. If I have an eBook of Manuel DeLanda’s on my hard drive, and I search for “urban ecosystem,” I don’t want the software to tell me that an entire book is related to my query. I want the software to tell me that these five separate paragraphs from this book are relevant. Until the tools can break out those smaller units on their own, I’ll still be assembling my research library by hand in DevonThink.
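To make the file-versus-chunk distinction concrete, here is a toy sketch of a search whose unit of retrieval is the passage rather than the document: the library maps each title to a list of paragraph-sized chunks, and a hit names the specific chunk. The titles and passages in the test data are invented for illustration, and the matching is deliberately naive substring containment:

```python
def search_chunks(library, query_terms):
    """Return (title, chunk_index, chunk) hits instead of whole files.

    `library` maps a book title to a list of paragraph-sized passages.
    A hit is any single passage containing every query term; matching
    is crude lowercase substring containment, so "ecosystem" will also
    match "ecosystems".
    """
    terms = [t.lower() for t in query_terms]
    hits = []
    for title, chunks in library.items():
        for i, chunk in enumerate(chunks):
            lowered = chunk.lower()
            if all(t in lowered for t in terms):
                hits.append((title, i, chunk))
    return hits
```

The payoff is in the return value: instead of learning that a whole book matched, you learn that paragraph 0 of one particular book matched, which is exactly the granularity the essay is asking for.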

I wonder whether it might be possible to have software create those smaller clippings on its own: you’d feed the program an entire eBook, and it would break it up into 200–1000 word chunks of text, based on word frequency and other cues (chapter or section breaks, perhaps). Already DevonThink can take a large collection of documents and group them into categories based on word use, so theoretically you could do the same kind of auto-classification within a document. It still wouldn’t have the pre-filtered property of my curated quotations, but it would make it far more productive to just dump a whole eBook into my digital research library.
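A minimal sketch of that auto-chunking idea, using only paragraph breaks as the structural cue (real software would presumably also use chapter headings and shifts in word frequency); the function name and the default word limits are my own assumptions:

```python
def chunk_ebook(text, min_words=200, max_words=1000):
    """Split a long text into roughly paragraph-aligned chunks.

    Blank lines mark paragraph boundaries. Paragraphs are packed
    together until a chunk reaches min_words, and an oversized run
    is sliced so that no chunk ever exceeds max_words.
    """
    chunks, current = [], []
    for para in text.split("\n\n"):
        words = para.split()
        if not words:
            continue
        current.extend(words)
        if len(current) >= min_words:
            # flush in max_words slices so no chunk exceeds the cap
            while len(current) > max_words:
                chunks.append(" ".join(current[:max_words]))
                current = current[max_words:]
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk that comes out is the right size to be a “see also” candidate on its own, which is the whole point: feed in the book once, and the library fills itself with passage-sized entries.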

The other thing that would be fascinating would be to open up these personal libraries to the external world. That would be a lovely combination of old-fashioned book-based wisdom, advanced semantic search technology, and the personality-driven filters that we’ve come to enjoy in the blogosphere. I can imagine someone sitting down to write an article about complexity theory and the web, and saying, “I bet Johnson’s got some good material on this in his ‘library.’” (You wouldn’t be able to pull down the entire database, just query it, so there wouldn’t be any potential for intellectual property abuse.) I can imagine saying to myself: “I have to write this essay on taxonomies, so I’d better sift through Weinberger’s library, and that chapter about power laws won’t be complete without a visit to Shirky’s database.”

These extra features would be wonderful, but the truth is I’m thrilled to have the software work as well as it does in its existing form. I’ve been fantasizing about precisely this kind of tool for nearly twenty years now, ever since I lost an entire semester building a HyperCard-based app for storing my notes during my sophomore year of college. There’s a longstanding assumption that the modern, web-enabled PC is the realization of the Memex, but if you go back and look at Vannevar Bush’s essay, he was describing something more specific — a personal research tool that would learn as you interacted with it. That’s what I think about whenever I use this system to stumble across a genuinely useful new idea: finally, I have a Memex!
