Document management

Posted on Sept. 12, 2021 by Ben Dickson.


For many, many years my approach to paperwork management was: the drawer. There was a drawer, and the paperwork got shoved in it. It sounds like borderline madness, but it worked remarkably well.

Filing was a non-issue (shove it in the drawer). To use some overly generous words, it was well optimized for this use case. Even finding documents wasn't too bad - if they were recent documents, they were near the top. Finding older documents was less common, and a bit more time consuming, but even finding that tax-file-number document from 10 years ago only took a few minutes.

However, with buying a house, starting a business, and so on, this scheme doesn't work quite so well.

Simple first step

Actually organising the documents in any way was a good start. A few hours sorting through The Drawer, ditching old irrelevant documents, putting similar documents together in plastic document envelope things.

Living in a country which often catches fire, I also put some important documents in a cheap fireproof chest.

Practically, it is almost reasonable to stop here.

Digital

The main problem is the inexorable march of time. Most documents I get now are no longer paper. They are PDFs.

I could print out everything and store them the same way, however:

  1. It seems wasteful to make printouts of stuff that will, mostly, never be looked at again.
  2. It's tedious - slow to do, creates more paper which in turn makes more paper to organize.
  3. Has almost no other upsides.

Options

There are many options for managing documents, but I shall restrict myself to those which are:

  1. Free and open source
  2. Actively maintained
  3. Preferably implemented in a language I'm familiar with

OCRmyPDF

A fairly "low level" tool which just takes a document as a PDF, runs OCR, then embeds the resulting text into the document in PDF/A format (an archival variation of PDF).

A resulting tool would basically take a folder full of PDFs and run the OCRmyPDF command over them. It would be up to me and the OS file-browser or similar to organize the files and search them.

With a bit of work quite a nice system could be assembled, but it seems like more effort than I'd like.
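As a rough idea of what the "folder full of PDFs" tool could look like, here is a minimal sketch in Python. It assumes the `ocrmypdf` command is installed and on the PATH, and only uses its basic `ocrmypdf input.pdf output.pdf` form. The `inbox` and `archive` folder names and the function names are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def plan_ocr_jobs(inbox: Path, archive: Path) -> list[tuple[Path, Path]]:
    """Pair each PDF in `inbox` with its destination in `archive`.

    Files which already have an output of the same name are skipped,
    so re-running the tool only processes new documents.
    """
    jobs = []
    for src in sorted(inbox.glob("*.pdf")):
        dest = archive / src.name
        if not dest.exists():
            jobs.append((src, dest))
    return jobs


def ocr_folder(inbox: Path, archive: Path) -> None:
    """Run OCRmyPDF over every not-yet-processed PDF in `inbox`."""
    if shutil.which("ocrmypdf") is None:
        raise SystemExit("ocrmypdf not found on PATH")
    archive.mkdir(parents=True, exist_ok=True)
    for src, dest in plan_ocr_jobs(inbox, archive):
        # Basic invocation: ocrmypdf <input> <output>
        subprocess.run(["ocrmypdf", str(src), str(dest)], check=True)


# Hypothetical usage (folder names are assumptions):
# ocr_folder(Path("inbox"), Path("archive"))
```

This does nothing about organizing or searching, and PDFs which already contain a text layer would probably need an extra option such as `--skip-text`, so it really is only the first slice of the work.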

There are some projects which possibly implement much of this, like https://github.com/cmccambridge/ocrmypdf-auto

paperless-ng

A long running project, with support for PDFs, images, text files, and Word/LibreOffice type documents. There are also Android and iOS applications as well as the main web interface.

It was quite simple to get running via its recommended Docker setup.

I played around with it for a bit and it works. For some reason I didn't really like its interface; I don't have any particularly well articulated reasons for this, it just seemed a little "too busy".

papermerge

Papermerge is a much younger project than paperless-ng. It is certainly not nearly as featured, but I found its interface a little nicer in the main activities of "viewing uploaded documents".

It has quite a nice function to edit PDFs by reordering or removing pages (e.g. if a PDF was scanned in the wrong order).

There are two problems I have with it currently:

  1. There's no obvious progress indication of processing - when a PDF is uploaded, the OCR can take a minute or two (running on an old HP Microserver) but the only obvious progress information was in the log view. Oddly this seems to be the case with paperless-ng also.
  2. Major thing: it doesn't embed the OCR'd text in the PDF. However, in the next release (v2.1) this will be done.

The plan

I have a Canon LiDE 300 scanner, which works fine in Debian Bullseye onwards.

My plan is to have a Raspberry Pi hooked up to this, so I can click the "scan now" button and the document gets scanned and sent to Papermerge, where it would be OCR'd.

This post (Archive.org link) has details on making the scan button work under Linux (basically using scanbd).
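The script which the button ends up triggering could be shaped something like the sketch below. It assumes SANE's `scanimage` command is installed and that the scanner's backend accepts the `--resolution` option. How the file actually gets to Papermerge is left as a `submit` callback, since that part is undecided here; the function names and the filename pattern are made up for illustration.

```python
import subprocess
from datetime import datetime
from pathlib import Path
from typing import Callable


def scan_filename(now: datetime) -> str:
    """Timestamped name so repeated button presses never collide."""
    return now.strftime("scan-%Y%m%d-%H%M%S.png")


def build_scan_command(resolution_dpi: int = 300) -> list[str]:
    """Command line for scanimage, which writes the image to stdout."""
    return ["scanimage", "--format=png", "--resolution", str(resolution_dpi)]


def scan_and_submit(out_dir: Path, submit: Callable[[Path], None]) -> Path:
    """Scan one page into `out_dir`, then hand the file to `submit`.

    `submit` is a placeholder for whatever gets the file into Papermerge
    (an upload, or a copy into a folder it imports from).
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / scan_filename(datetime.now())
    with target.open("wb") as fh:
        subprocess.run(build_scan_command(), stdout=fh, check=True)
    submit(target)
    return target
```

Keeping the hand-off as a plain function means the scanning half can be tested on the Pi before the Papermerge side is sorted out, for example by passing `print` as `submit`.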