doc-toc

Homepage: https://github.com/dalanicolai/doc-tools-toc

Author: Daniel Laurens Nicolai

Updated:

Summary

Manage outlines/table of contents of pdf and djvu documents

Commentary

doc-toc.el is a package for creating and adding Tables of Contents to pdf
and djvu documents. It includes features for extracting the Table of Contents
from the text layer of a document, or via OCR if that is necessary (or
preferred). For 'software generated' PDFs it provides the option to use
pdf.tocgen (see URL `https://krasjet.com/voice/pdf.tocgen/'). Additionally,
this package implements various features that assist in tidying up the
extracted Table of Contents, adjusting the page numbers and finally parsing
the Table of Contents into the syntax understood by the `pdfoutline' and
`djvused' commands, which are used to add the Table of Contents to pdf and
djvu files respectively.

Requirements: To use the pdf.tocgen functionality, that software has to be
installed (see URL `https://krasjet.com/voice/pdf.tocgen/'). For the
remaining functions the package requires the `pdftotext' (part of
poppler-utils), `pdfoutline' (part of fntsample) and `djvused' (part of
DjVuLibre, see URL `http://djvu.sourceforge.net/') command line utilities to
be available. Extraction with OCR additionally requires the `tesseract'
command line utility.
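
To check up front which of these utilities are missing, something like the
following Python sketch could be used. It is not part of doc-toc; the function
name `missing_tools' and the two lists are made up for this example, and the
tool names are simply the ones listed above.

```python
import shutil

# Command line utilities named in the requirements above.
REQUIRED_TOOLS = ["pdftotext", "pdfoutline", "djvused"]
OPTIONAL_TOOLS = ["tesseract"]  # only needed for extraction with OCR


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    for tool in missing_tools(REQUIRED_TOOLS + OPTIONAL_TOOLS):
        print(f"not found on PATH: {tool}")
```

The `which' parameter only exists so that the lookup can be replaced in a
test; in normal use the default `shutil.which' is what you want.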

Usage:

In each step below, check the available key bindings with C-h m. Additionally,
you can find the available commands by typing M-x followed by the mode prefix
(e.g. M-x doc-toc-cleanup), or by the prefix with two dashes (e.g. M-x
doc-toc--cleanup). If you use a completion package like Ivy or Helm, you can
simply use its fuzzy search functionality.

Extracting a TOC and adding it to a document is done in 4 steps:
1 extraction
2 cleanup
3 adjust/correct page numbers
4 add TOC to document

1. Extraction: For PDFs without TOC pages, with a very complicated TOC (i.e.
one that requires much cleanup work) or with headlines well suited for
automatic extraction (you will have to decide for yourself by trying it),
consider using the pdf.tocgen (URL `https://krasjet.com/voice/pdf.tocgen/')
functionality described below. Otherwise, start by opening a pdf or djvu
file in Emacs (the pdf-tools and djvu packages are recommended). Find the
page numbers of the TOC. Then type M-x `doc-toc-extract-pages', or M-x
`doc-toc-extract-pages-ocr' if the document has no text layer or the text
layer is bad, and answer the subsequent prompts by entering the page numbers
of the first and the last TOC page, each followed by RET. For PDF extraction
with OCR, it is currently required to view all contents pages once before
extraction (doc-toc uses the cached file data). The languages used for
tesseract OCR can be customized via the `doc-toc-ocr-languages' variable.

A buffer with the somewhat cleaned up extracted text will open in TOC-cleanup
mode. Prefix the command with the universal argument (C-u) to omit the
cleanup and get the raw text. If the extracted text is of too low quality,
you can either hack/extend the `doc-toc-extract-pages-ocr' definition, or
try to extract the text with the python document-contents-extractor script
(see URL `https://pypi.org/project/document-contents-extractor/'), which is
more configurable (you are also welcome to hack and improve that script).

The documentation at URL
`https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html' might be
useful.

For TOCs that are formatted as two columns per page, prefix the
`doc-toc-extract-pages-ocr' command with two universal arguments (C-u C-u).
Then, after you are asked for the first and last page numbers, a third prompt
asks you to set the tesseract psm (page segmentation mode) code. For the
double column layout it is best (as far as I know) to use psm code '1'.
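
To experiment with languages and psm codes outside Emacs, it can help to run
tesseract by hand on an image of a TOC page. The Python sketch below only
builds such a command line; how doc-toc itself invokes tesseract is not
described here, and the function name and the file name `toc-page.png' are
made up for this example.

```python
def tesseract_command(image_path, languages=("eng",), psm=None):
    """Build an argv list for running tesseract on a single page image.

    Passing "stdout" as the output base makes tesseract print the
    recognized text instead of writing it to a file.
    """
    command = ["tesseract", image_path, "stdout", "-l", "+".join(languages)]
    if psm is not None:
        # --psm selects the page segmentation mode, e.g. 1 for two columns.
        command += ["--psm", str(psm)]
    return command


if __name__ == "__main__":
    print(" ".join(tesseract_command("toc-page.png", ("eng", "deu"), psm=1)))
```

The resulting list can be passed to `subprocess.run' to compare the output of
different psm codes on the same page.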

Software-generated PDFs with pdf.tocgen
For 'software-generated' PDF files (i.e. PDFs not created from scans) it is
sometimes easier to use `doc-toc-extract-with-pdf-tocgen'. To use this function
you first have to provide the font properties for the different headline
levels. To do so, select a word in a headline of a certain level and then
type M-x `doc-toc-gen-set-level'. This function will ask which level you are
setting; the highest level should be level 1. After you have set the various
levels (1, 2, etc.), run M-x `doc-toc-extract-with-pdf-tocgen'. If a TOC is
extracted successfully, then in the pdftocgen-mode buffer simply press
C-c C-c to add the contents to the PDF. The contents will be added to a copy
of the original PDF with the filename output.pdf, and this copy will be
opened in a new buffer. If the pdf.tocgen option does not work well, then
continue with the steps below.

If you merely want to extract text without further processing then you can
use the command `doc-toc-extract-only'.

2. TOC-Cleanup: In this mode you can further clean up the contents to create
a list where each line has the structure:

TITLE (SOME) PAGENUMBER

There can be any number of spaces between TITLE and PAGENUMBER. The page
numbers themselves can be corrected in the next step. (If the initial TOC
looks bad/unusable, then try using the universal argument C-u before
extraction in the previous step and/or try the OCR option, with or without
the universal argument.)

A document outline supports different levels. Levels are assigned
automatically in order of increasing number of preceding spaces, i.e. the
lines with the fewest preceding spaces are assigned level 0 etc., and lines
with an equal number of preceding spaces are assigned the same level.

Contents   1
Chapter 1      2
 Section 1 3
  Section 1.1     4
Chapter 2      5
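
The level assignment described above can be illustrated with a short Python
sketch. This is not the package's own code (which is Emacs Lisp); the function
name `assign_levels' is made up, and the sample lines assume that the nesting
of the listing is written as one extra leading space per level.

```python
def assign_levels(lines):
    """Pair each line with a level derived from its leading spaces.

    The distinct indentation widths are ranked from smallest to largest,
    so the least indented lines get level 0, the next width level 1, etc.
    Lines with the same number of leading spaces get the same level.
    """
    indents = [len(line) - len(line.lstrip(" ")) for line in lines]
    rank = {width: level for level, width in enumerate(sorted(set(indents)))}
    return [(rank[width], line.strip()) for width, line in zip(indents, lines)]


toc = [
    "Contents   1",
    "Chapter 1      2",
    " Section 1 3",
    "  Section 1.1     4",
    "Chapter 2      5",
]

if __name__ == "__main__":
    for level, text in assign_levels(toc):
        print(level, text)
```

Because only the ranking of the widths matters, it makes no difference
whether a sublevel is indented by one space or by four.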

There are some handy functions to assist in the cleanup. C-c C-j jumps
automatically to the next line not ending with a number and joins it with the
next line. If the indentation structure of the different lines does not
correspond with the levels, then the levels can be set automatically from the
number of separators in the indices with M-x doc-toc-cleanup-set-level-by-index.
The default separator is a period (.), but a different separator can be
entered by preceding the function invocation with the universal argument
(C-u). Some documents contain a structure like

1 Chapter 1    1
Section 1      2

Here the indentation can be set with M-x replace-regexp, replacing ^[^0-9]
with " \&" (note the space character before the \&). This indents every line
that does not start with a digit.

Type C-c C-c when finished.
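
For readers more familiar with other regexp dialects, the same replacement
can be sketched in Python. This only illustrates what the regexp does and is
not something doc-toc runs; the function name `indent_unnumbered' is made up
for this example.

```python
import re


def indent_unnumbered(text):
    """Prepend one space to every line that does not start with a digit."""
    # \g<0> stands for the whole match, like \& in an Emacs replacement.
    return re.sub(r"^[^0-9]", r" \g<0>", text, flags=re.MULTILINE)


toc = "1 Chapter 1    1\nSection 1      2"

if __name__ == "__main__":
    print(indent_unnumbered(toc))
```

The numbered chapter line is left alone, while the unnumbered section line
gains one leading space and therefore ends up one level deeper.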

3. TOC-tabular (adjust page numbers): This mode provides the functionality
for easy adjustment of page numbers. The buffer can be navigated with the
up/down arrow keys. The left and right arrow keys decrease/increase all the
page numbers from the current line downwards (combine with SHIFT to change
only the page number of the current entry).
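
Shifting everything from the current line downwards is useful when, for
example, the printed page numbers lag behind the page numbers of the file by
a fixed offset. The Python sketch below illustrates the two operations on a
list of (title, page) pairs; the function names and the data layout are made
up for this example and do not mirror the package's internals.

```python
def shift_remaining(entries, start, delta):
    """Add `delta` to the page of entry `start` and of all entries below it."""
    return [
        (title, page + delta if index >= start else page)
        for index, (title, page) in enumerate(entries)
    ]


def shift_single(entries, index, delta):
    """Add `delta` to the page of the entry at `index` only."""
    return [
        (title, page + delta if i == index else page)
        for i, (title, page) in enumerate(entries)
    ]


entries = [("Contents", 1), ("Chapter 1", 2), ("Chapter 2", 5)]

if __name__ == "__main__":
    # Ten pages of front matter precede "Chapter 1" in the file.
    print(shift_remaining(entries, 1, 10))
```

Both functions return a new list and leave the input untouched, which keeps
the example easy to reason about.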

The TAB key jumps to the page number of the current line, while
C-right/C-left shift all remaining page numbers up/down and at the same time
jump to the page of the current line in the document window. S-up/S-down in
the tablist window scroll the document window a full page up/down and, for
pdf only, C-up/C-down scroll smoothly in that window.

Type C-c C-c when done.

4. TOC-mode (add outline to document): The text of this buffer should have
the right structure for adding the contents to the original document (or,
for PDFs, to a copy of it). Final adjustments can be made but should not be
necessary. Type C-c C-c to add the contents to the document.
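
As a rough illustration of what such a structure looks like for djvu files,
the Python sketch below renders a list of (level, title, page) entries as a
bookmarks expression of the kind `djvused' accepts for setting an outline.
The function name and the entry format are made up for this example; the
conversion done by doc-toc itself is written in Emacs Lisp and may differ in
details.

```python
def to_djvused_outline(entries):
    """Render (level, title, page) entries as a djvused bookmarks expression."""
    parts = ["(bookmarks"]
    open_levels = []  # levels of the entries whose parenthesis is still open
    for level, title, page in entries:
        # Close every open entry that is not an ancestor of this one.
        while open_levels and open_levels[-1] >= level:
            parts.append(")")
            open_levels.pop()
        escaped = title.replace("\\", "\\\\").replace('"', '\\"')
        indent = " " * (level + 1)
        parts.append(f'\n{indent}("{escaped}" "#{page}"')
        open_levels.append(level)
    # Close the remaining entries and the bookmarks form itself.
    parts.append(")" * (len(open_levels) + 1))
    return "".join(parts)


outline = [(0, "Chapter 1", 2), (1, "Section 1", 3), (0, "Chapter 2", 5)]

if __name__ == "__main__":
    print(to_djvused_outline(outline))
```

Nested entries end up inside the parentheses of their parent, which is how
the levels from the cleanup step are expressed in this format.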

By default, the TOC is simply added to the original file. For PDFs only, if
the (customizable) variable `doc-toc-replace-original-file' is nil, the TOC is
added to a copy of the original pdf file, at the path defined by the
variable `doc-toc-destination-file-name'. Either a path relative to the
directory of the original file or an absolute path can be given.
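
The relative-or-absolute rule can be pictured with the following Python
sketch. It only mirrors the description above and is not taken from the
package; the function name and the example paths are made up, and POSIX style
paths are assumed.

```python
import os.path


def destination_path(original_file, destination_file_name):
    """Resolve the destination against the directory of the original file."""
    original_dir = os.path.dirname(os.path.abspath(original_file))
    # os.path.join drops original_dir when the second argument is absolute.
    return os.path.normpath(os.path.join(original_dir, destination_file_name))


if __name__ == "__main__":
    print(destination_path("/home/user/books/book.pdf", "output.pdf"))
    print(destination_path("/home/user/books/book.pdf", "/tmp/book-toc.pdf"))
```

A relative name therefore lands next to the original document, while an
absolute path is used unchanged.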

Sometimes the `pdfoutline' or `djvused' application is not able to add the
TOC to the document. In that case you can either debug the problem by copying
the terminal command used from the `*Messages*' buffer and running it
manually in the document's folder, or you can delete the outline source
buffer and run `doc-toc--tablist-to-handyoutliner' from the tablist buffer to
get an outline source file that can be used with HandyOutliner (see URL
`http://handyoutlinerfo.sourceforge.net/'). Unfortunately, the handyoutliner
command does not take arguments, but if you customize the
`doc-toc-handyoutliner-path' and `doc-toc-file-browser-command' variables, then
Emacs will try to open HandyOutliner and the file browser so that you can
drag the files directly into HandyOutliner.

Keybindings
Key Binding        Description

all-modes (i.e. all steps)
C-c C-c            dispatch (next step)

doc-toc-cleanup-mode
C-c C-j            doc-toc--join-next-unnumbered-lines
C-c C-s            doc-toc--roman-to-arabic

doc-toc (tablist)
TAB                preview/jump-to-page
right/left         doc-toc-in/decrease-remaining
C-right/C-left     doc-toc-in/decrease-remaining and view page
S-right/S-left     in/decrease pagenumber current entry
C-down/C-up        scroll document other window (if document buffer shown)
S-down/S-up        full page scroll document other window (if document buffer shown)
C-j                doc-toc--jump-to-next-entry-by-level

Dependencies