Chapter 11. Data conversion

Table of Contents

11.1. Text data conversion tools
11.1.1. To convert a text file with iconv
11.1.2. To convert file names with iconv
11.1.3. EOL conversion
11.1.4. TAB conversion
11.1.5. Editors with auto-conversion
11.1.6. Plain text extraction
11.1.7. Highlighting and formatting plain text data
11.2. XML data
11.2.1. Basic hints for XML
11.2.2. XML processing
11.2.3. The XML data extraction
11.3. Printable data
11.3.1. Ghostscript
11.3.2. Merge two PS or PDF files
11.3.3. Printable data utilities
11.3.4. Printing with CUPS
11.4. Type setting
11.4.1. roff typesetting
11.4.2. TeX/LaTeX
11.4.3. Pretty print a manual page
11.4.4. Creating a manual page
11.5. The mail data conversion
11.5.1. Mail data basics
11.6. Graphic data tools
11.7. Miscellaneous data conversion

This chapter describes tools and tips for converting data formats on the Debian system.

Standards-based tools are in very good shape, but support for proprietary data formats is limited.

11.1. Text data conversion tools

The following packages for text data conversion caught my eye:

Table 11.1. List of text data conversion tools.

package popcon size keyword function
libc6 V:95, I:99 11296 charset The text encoding converter between locales by iconv(1). (fundamental)
recode V:1.6, I:7 780 charset+eol The text encoding converter between locales. (versatile, more aliases and features)
konwert V:0.4, I:4 192 charset The text encoding converter between locales. (fancy)
nkf V:0.4, I:2 300 charset The character set translator for Japanese.
tcs V:0.02, I:0.16 544 charset The character set translator.
unaccent V:0.02, I:0.09 60 charset Replace accented letters by their unaccented equivalent.
tofrodos V:1.3, I:8 80 eol The text format converter between DOS and Unix: fromdos(1) and todos(1)
macutils V:0.07, I:0.7 356 eol The text format converter between Macintosh and Unix: frommac(1) and tomac(1)

11.1.1. To convert a text file with iconv

iconv(1) is provided as a part of the libc6 package, so it is available on practically every system to convert the encoding of characters:

$ iconv -f encoding1 -t encoding2 input.txt >output.txt

Encoding values are case insensitive and ignore "-" and "_" for matching. Supported encodings can be checked by the "iconv -l" command.
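As a minimal sketch of the command above, the following converts a short ISO-8859-1 sample to UTF-8 and dumps the resulting bytes. The sample string and the function name "latin1_to_utf8" are made up for illustration.

```shell
# Hypothetical helper: convert standard input from ISO-8859-1 to UTF-8.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# "café" in ISO-8859-1: the last character is the single byte 0xE9 (octal 351).
# od(1) shows the bytes after conversion.
printf 'caf\351\n' | latin1_to_utf8 | od -An -tx1
```

In the dump, the single byte e9 should appear as the two-byte UTF-8 sequence c3 a9.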

Table 11.2. List of encoding values and their usage.

encoding value usage
ASCII American Standard Code for Information Interchange. 7 bit code w/o accented characters.
UTF-8 Standard multilingual compatibility for all modern OSs.
ISO-8859-1 Old standard for western European languages, ASCII + accented characters.
ISO-8859-2 Old standard for eastern European languages, ASCII + accented characters.
ISO-8859-15 Old standard for western European languages, ISO-8859-1 with euro sign.
CP850 Code page 850, Microsoft DOS characters with graphics for western European languages. ISO-8859-1 variant.
CP932 Code page 932, Microsoft Windows style Shift-JIS variant, for Japanese.
CP936 Code page 936, Microsoft Windows style GB2312, GBK, or GB18030 variant, for Simplified Chinese.
CP949 Code page 949, Microsoft Windows style EUC-KR or Unified Hangul Code variant, for Korean.
CP950 Code page 950, Microsoft Windows style Big5 variant, for Traditional Chinese.
CP1251 Code page 1251, Microsoft Windows style encoding for the Cyrillic alphabet.
CP1252 Code page 1252, Microsoft Windows style ISO-8859-1 variant (with the euro sign and other printable characters in the 0x80-0x9F range) for western European languages.
KOI8-R Old Russian UNIX standard for the Cyrillic alphabet.
ISO-2022-JP Standard encoding for Japanese e-mail which uses only 7 bit codes.
eucJP Old Japanese UNIX standard 8 bit code and completely different from Shift-JIS.
Shift-JIS JIS X 0208 Appendix 1 standard, for Japanese. See CP932 above.

[Note] Note

Some encodings are only supported for the data conversion and are not used as locale values (Section 8.3.1, “Basics of encoding”).

For character sets which fit in single byte such as ASCII and ISO-8859 character sets, the character encoding means almost the same thing as the character set.

For character sets with many characters such as JIS X 0213 for Japanese or Universal Character Set (UCS, Unicode, ISO-10646-1) for practically all languages, there are many encoding schemes to fit them into a sequence of bytes, e.g., UTF-8, UTF-16, and UTF-32 for Unicode.

For some vendor-specific encodings, the term "code page" is used as a synonym for the character encoding table.
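The difference between a character set and its encodings can be seen by counting bytes. The sketch below encodes the same character, U+00E9 ("é"), in several encodings. The function name "bytes_in" is hypothetical.

```shell
# Hypothetical helper: print how many bytes U+00E9 occupies in encoding $1.
# The input bytes \303\251 are the UTF-8 form of that character.
bytes_in() {
    printf '\303\251' | iconv -f UTF-8 -t "$1" | wc -c | tr -d ' '
}

for enc in ISO-8859-1 UTF-8 UTF-16LE UTF-32LE; do
    printf '%s: %s byte(s)\n' "$enc" "$(bytes_in "$enc")"
done
```

The explicit little-endian names are used here so that no byte order mark is added to the output.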

[Note] Note

Please note that most encoding systems share the same codes with ASCII for the 7 bit characters, but there are some exceptions. If you are converting old Japanese C programs and URL data from the casually called shift-JIS encoding format to UTF-8 format, use "CP932" as the encoding name instead of "shift-JIS" to get the expected results: 0x5C → "\" and 0x7E → "~". Otherwise, these are converted to wrong characters.

[Tip] Tip

recode(1) may be used too and offers more than the combined functionality of iconv(1), fromdos(1), todos(1), frommac(1), and tomac(1). For more, see "info recode".

11.1.2. To convert file names with iconv

Here is an example script to convert the encoding of file names in a single directory from the one used by an older OS to modern UTF-8.

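#!/bin/sh
# Encoding of the existing file names (a value from Table 11.2).
ENCDN=iso-8859-1
for x in *; do
  # Skip names that cannot be converted from $ENCDN.
  y=$(printf '%s' "$x" | iconv -f "$ENCDN" -t utf-8) || continue
  # Quote both names so that file names containing spaces survive.
  [ "$x" = "$y" ] || mv -- "$x" "$y"
done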

The "$ENCDN" variable should be set to an encoding value from Table 11.2, “List of encoding values and their usage.”.

For more complicated cases, mount the disk drive containing such file names with the proper encoding as the mount(8) option (see Section 8.3.6, “Filename encoding”) and copy the entire disk to another disk drive mounted as UTF-8 with the "cp -a" command.

11.1.3. EOL conversion

The text file format, specifically the end-of-line (EOL) code, is dependent on the platform:

Table 11.3. List of EOL styles for different platforms.

platform EOL code EOL control sequence EOL ASCII value
Debian (unix) LF ^J 10
MSDOS and Windows CR-LF ^M^J 13, 10
Apple's Macintosh CR ^M 13

The EOL format conversion programs, fromdos(1), todos(1), frommac(1), and tomac(1), are quite handy. recode(1) is also useful.
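Where those packages are not installed, the same conversions can be approximated with sed(1) and tr(1). This is only a sketch: it assumes GNU sed (for the "\r" escape), and the function names are hypothetical.

```shell
# MSDOS (CR-LF) to Unix (LF): strip a CR that ends a line.
crlf_to_lf() { sed -e 's/\r$//'; }
# Unix (LF) to MSDOS (CR-LF): append a CR to every line.
lf_to_crlf() { sed -e 's/$/\r/'; }
# Old Macintosh (CR) to Unix (LF): translate every CR to LF.
cr_to_lf() { tr '\r' '\n'; }

# Show the control characters before and after a conversion.
printf 'one\r\ntwo\r\n' | od -c
printf 'one\r\ntwo\r\n' | crlf_to_lf | od -c
```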

[Note] Note

Some data on the Debian system, such as the wiki page data for the python-moinmoin package, use MSDOS style CR-LF as the EOL code. So the above rule is just a general rule.

[Note] Note

Most editors (e.g., vim, emacs, gedit, …) can handle files in MSDOS style EOL transparently.

[Tip] Tip

The use of "sed -e '/\r$/!s/$/\r/'" instead of todos(1) is better when you want to unify the EOL style to the MSDOS style from the mixed MSDOS and Unix style. (e.g., after merging 2 MSDOS style files with diff3(1).) This is because todos adds CR to all lines.
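The effect of this sed expression on mixed input can be checked with a small sketch like the following. The sample text and the function name "unify_to_crlf" are made up, and GNU sed is assumed.

```shell
# Add a CR only to lines that do not already end in one.
unify_to_crlf() { sed -e '/\r$/!s/$/\r/'; }

# Mixed input: the first line is MSDOS style, the second is Unix style.
printf 'dos\r\nunix\n' | unify_to_crlf | od -c
```

Both lines should end in "\r \n" in the dump, and the first line should not gain a second CR.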

11.1.4. TAB conversion

There are a few popular specialized programs to convert the tab codes:

Table 11.4. List of TAB conversion commands from bsdmainutils and coreutils packages.

function bsdmainutils coreutils
expand tab to spaces "col -x" expand
unexpand tab from spaces "col -h" unexpand

indent(1) from the indent package completely reformats whitespace in C programs.

Editor programs such as vim and emacs can be used for TAB conversion, too. For example with vim, you can expand TAB with ":set expandtab" and ":%retab" command sequence. You can revert this with ":set noexpandtab" and ":%retab!" command sequence.
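As a small sketch of the coreutils commands in Table 11.4, the following expands and unexpands a sample line. The sample strings and the chosen tab widths are arbitrary.

```shell
# Expand each TAB to spaces, with tab stops every 4 columns.
printf 'a\tb\n' | expand -t 4 | od -c

# Convert runs of spaces back to TABs with tab stops every 8 columns.
# "-a" converts blanks anywhere on the line, not only leading ones.
# The format 'a%7sb' produces "a", 7 spaces, then "b".
printf 'a%7sb\n' '' | unexpand -a -t 8 | od -c
```

The first dump should show three spaces between "a" and "b". The second should show a single "\t" there.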

11.1.5. Editors with auto-conversion

Intelligent modern editors such as the vim program are quite smart and cope well with any encoding system and any file format. You should use these editors under the UTF-8 locale in a UTF-8 capable console for the best compatibility.

An old western European Unix text file, "u-file.txt", stored in the latin1 (iso-8859-1) encoding can be edited simply with vim as:

$ vim u-file.txt

This is possible since the auto detection mechanism of the file encoding in vim assumes the UTF-8 encoding first and, if it fails, assumes it to be latin1.

An old Polish Unix text file, "pu-file.txt", stored in the latin2 (iso-8859-2) encoding can be edited with vim as:

$ vim '+e ++enc=latin2 pu-file.txt'

An old Japanese unix text file, "ju-file.txt", stored in the eucJP encoding can be edited with vim as:

$ vim '+e ++enc=eucJP ju-file.txt'

An old Japanese MS-Windows text file, "jw-file.txt", stored in the so called shift-JIS encoding (more precisely: CP932) can be edited with vim as:

$ vim '+e ++enc=CP932 ++ff=dos jw-file.txt'

When a file is opened with the "enc" and "ff" options, ":w" in the Vim command line stores it in the original format and overwrites the original file. You can also specify the saving format and the file name in the Vim command line, e.g., ":w ++enc=utf8 new.txt".

Please refer to the mbyte.txt "multi-byte text support" in the vim on-line help and Table 11.2, “List of encoding values and their usage.” for the encoding values used with "++enc".

The emacs family of programs can perform the equivalent functions.

11.1.6. Plain text extraction

The following reads a web page into a text file. This is very useful when copying configurations off the Web or applying basic Unix text tools such as grep(1) to the web page.

$ w3m -dump http://www.remote-site.com/help-info.html >textfile

Similarly, you can extract plain text data from other formats using the following tools:

Table 11.5. List of tools to extract plain text data.

package popcon size keyword function
w3m V:21, I:84 1964 html→text An HTML to text converter with the "w3m -dump" command.
html2text V:14, I:40 308 html→text An advanced HTML to text converter. (ISO 8859-1)
lynx V:2, I:25 48 html→text An HTML to text converter with the "lynx -dump" command.
elinks V:2, I:6 1444 html→text An HTML to text converter with the "elinks -dump" command.
links V:3, I:9 1372 html→text An HTML to text converter with the "links -dump" command.
links2 V:0.9, I:4 3280 html→text An HTML to text converter with the "links2 -dump" command.
antiword V:1.2, I:2 796 MSWord→text,ps This converts MSWord files to plain text or ps.
catdoc V:0.9, I:2 2664 MSWord→text,TeX This converts MSWord files to plain text or TeX.
pstotext V:0.8, I:1.5 160 ps/pdf→text Extract text from PostScript and PDF files.
unhtml V:0.03, I:0.17 76 html→text Remove the markup tags from an HTML file.
odt2txt V:0.6, I:1.1 104 odt→text The converter from OpenDocument Text to text.
wpd2sxw V:0.02, I:0.14 156 WordPerfect→sxw WordPerfect to OpenOffice.org/StarOffice writer document converter.

11.1.7. Highlighting and formatting plain text data

Table 11.6. List of tools to highlight plain text data.

package popcon size keyword function
vim-runtime V:3, I:36 24676 highlight Vim can convert source code to HTML with ":source $VIMRUNTIME/syntax/html.vim" (vim MACRO)
cxref V:0.07, I:0.6 1104 c→html The converter for the C program to latex and HTML. (C language)
src2tex V:0.03, I:0.2 1968 highlight This converts many source codes to TeX. (C language)
source-highlight V:0.12, I:0.9 1940 highlight This converts many source codes to HTML, XHTML, LaTeX, Texinfo, ANSI color escape sequences and DocBook files with highlight. (C++)
highlight V:0.06, I:0.4 720 highlight This converts many source codes to HTML, XHTML, RTF, LaTeX, TeX or XSL-FO files with highlight. (C++)
grc V:0.03, I:0.12 164 text→color The generic colouriser for everything. (Python)
txt2html V:0.07, I:0.4 296 text→html Text to HTML converter. (Perl)
markdown V:0.09, I:0.4 96 text→html Markdown text document formatter to (X)HTML. (Perl)
asciidoc V:0.13, I:0.8 4316 text→any AsciiDoc text document formatter to XML/HTML. (Python)
python-docutils V:0.4, I:3 5060 text→any ReStructured Text document formatter to XML. (Python)
txt2tags V:0.06, I:0.3 1556 text→any The document conversion from text to HTML, SGML, LaTeX, man page, MoinMoin, Magic Point and PageMaker. (Python)
udo V:0.01, I:0.08 556 text→any universal document - text processing utility. (C language)
stx2any V:0.00, I:0.05 484 text→any The document converter from structured plain text to other formats. (m4)
rest2web V:0.01, I:0.10 576 text→html The document converter from ReStructured Text to html. (Python)
aft V:0.00, I:0.07 336 text→any The "free form" document preparation system. (Perl)
yodl V:0.00, I:0.06 520 text→any A pre-document language and tools to process it. (C language)
sdf V:0.00, I:0.10 1940 text→any The simple document parser. (Perl)
sisu V:0.01, I:0.08 7496 text→any The document structuring, publishing and search framework. (Ruby)

11.2. XML data

The Extensible Markup Language (XML) is a markup language for documents containing structured information.

XML.COM has good introductory information.

11.2.1. Basic hints for XML

XML text looks somewhat like HTML. It enables us to manage multiple formats of output for a document. One easy XML system is the docbook-xsl package, which is used here.

Each XML file starts with a standard XML declaration:

<?xml version="1.0" encoding="UTF-8"?>

The basic syntax for one XML element is marked up as:

<name attribute="value">content</name>

An XML element with empty content is marked up in the short form as:

<name attribute="value"/>

The "attribute="value"" in the above examples is optional.

The comment section in XML is marked up as:

<!-- comment -->

Other than adding markups, XML requires minor conversion to the content using predefined entities for the following characters:

Table 11.7. List of predefined entities for XML.

predefined entity character to be converted from
&quot; " : quote
&apos; ' : apostrophe
&lt; < : less-than
&gt; > : greater-than
&amp; & : ampersand

[Caution] Caution

"<" or "&" cannot be used literally in attribute values or element content; use the predefined entities instead.

[Note] Note

When SGML style user defined entities, e.g. "&some-tag;", are used, the first definition wins over others. The entity definition is expressed in "<!ENTITY some-tag "entity value">".
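The conversion to the predefined entities of Table 11.7 can be sketched with sed(1) as follows. The function name "xml_escape" and the sample line are hypothetical, and this is not a complete XML serializer.

```shell
# Replace the five characters of Table 11.7 with their predefined entities.
# "&" must be handled first; otherwise the "&" of the entities produced by
# the later rules would be escaped again.
xml_escape() {
    sed -e 's/&/\&amp;/g' \
        -e 's/</\&lt;/g' \
        -e 's/>/\&gt;/g' \
        -e 's/"/\&quot;/g' \
        -e "s/'/\&apos;/g"
}

printf '%s\n' 'if (a < b && c > "d") then' | xml_escape
```

The escaped text can then be placed safely as element content or inside an attribute value.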

[Note] Note

As long as the XML markup is done consistently with a certain set of tag names (with data held either as content or as attribute values), conversion to another XML format is a trivial task using Extensible Stylesheet Language Transformations (XSLT).

11.2.2. XML processing

There are many tools available to process XML files such as the Extensible Stylesheet Language (XSL).

Basically, once you create a well-formed XML file, you can convert it to any format using Extensible Stylesheet Language Transformations (XSLT).

The Extensible Stylesheet Language for Formatting Objects (XSL-FO) is supposed to be the solution for formatting. However, the fop package is still in the Debian contrib (not main) archive. So the LaTeX code is usually generated from XML using XSLT, and the LaTeX system is used to create printable files such as DVI, PostScript, and PDF.

Table 11.8. List of XML tools.

package popcon size keyword function
docbook-xml V:17, I:53 2488 xml This package contains the XML document type definition (DTD) for DocBook.
xsltproc V:5, I:51 180 xslt XSLT command line processor. (XML→ XML, HTML, plain text, etc.)
docbook-xsl V:0.9, I:6 13000 xml/xslt This contains XSL stylesheets for processing DocBook XML to various output formats with XSLT.
xmlto V:0.4, I:2 272 xml/xslt XML-to-any converter with XSLT.
dblatex V:0.2, I:1.5 6844 xml/xslt This converts Docbook files to DVI, PostScript, PDF documents with XSLT.
fop V:0.13, I:0.9 2296 xml/xsl-fo This converts Docbook XML files to PDF.

Since XML is a subset of Standard Generalized Markup Language (SGML), it can be processed by the extensive tools available for SGML, such as Document Style Semantics and Specification Language (DSSSL).

Table 11.9. List of DSSSL tools.

package popcon size keyword function
openjade V:0.5, I:3 1212 dsssl Implementation of the DSSSL language based on James Clark's Jade software.
jade V:0.6, I:2 1056 dsssl James Clark's DSSSL engine.
docbook-dsssl V:0.9, I:5 3100 xml/dsssl This contains DSSSL stylesheets for processing DocBook XML to various output formats with DSSSL.
docbook-utils V:0.3, I:2 440 xml/dsssl The utilities for Docbook files including conversion to other formats (HTML, RTF, PS, man, PDF) with docbook2* commands with DSSSL.
sgml2x V:0.01, I:0.10 216 SGML/dsssl The converter from SGML and XML using DSSSL stylesheets.

[Tip] Tip

GNOME's yelp is sometimes handy for reading DocBook XML files directly since it renders them decently on X.

11.2.3. The XML data extraction

You can extract HTML or XML data from other formats using the following tools:

Table 11.10. List of XML data extraction tools.

package popcon size keyword function
wv V:1.6, I:3 2136 MSWord→any The document converter from Microsoft Word to HTML, LaTeX, etc.
texi2html V:0.4, I:3 1752 texi→html The converter from Texinfo to HTML.
man2html V:0.2, I:1.6 372 manpage→html The converter from manpage to HTML. (CGI support)
tex4ht V:0.2, I:2 884 tex↔html The converter between (La)TeX and HTML.
xlhtml V:0.6, I:1.5 184 MSExcel→html The converter from MSExcel .xls to HTML.
ppthtml V:0.5, I:1.5 120 MSPowerPoint→html The converter from MSPowerPoint to HTML.
unrtf V:0.4, I:1.0 276 rtf→html The document converter from RTF to HTML, etc.
info2www V:0.6, I:1.4 156 info→html The converter from GNU info to HTML. (CGI support)
ooo2dbk V:0.02, I:0.2 941 sxw→xml The converter from OpenOffice.org SXW documents to DocBook XML.
wp2x V:0.01, I:0.10 240 WordPerfect→any WordPerfect 5.0 and 5.1 files to TeX, LaTeX, troff, GML and HTML.
doclifter V:0.00, I:0.04 420 troff→xml The converter from troff to DocBook XML.

For non-XML HTML files, you can convert them to XHTML, which is an instance of well-formed XML and can be processed by XML tools.

Table 11.11. List of XML pretty print tools.

package popcon size keyword function
libxml2-utils V:4, I:52 152 xml↔html↔xhtml The command line XML tool with xmllint(1). (syntax check, reformat, lint, …)
tidy V:1.7, I:14 108 xml↔html↔xhtml HTML syntax checker and reformatter.

Once proper XML is generated, you can use XSLT technology to extract data based on the mark-up context etc.

11.3. Printable data

Printable data is expressed in the PostScript format on the Debian system. Common Unix Printing System (CUPS) uses Ghostscript as its rasterizer backend program for non-PostScript printers.

11.3.1. Ghostscript

The core of printable data manipulation is the Ghostscript PostScript (PS) interpreter, which generates raster images.

The latest upstream Ghostscript from Artifex was re-licensed from AFPL to GPL and, as of the unified 8.60 release, merged all the latest ESP version changes such as the CUPS related ones.

Table 11.12. List of Ghostscript PostScript interpreters.

package popcon size description
ghostscript V:16, I:49 3316 The GPL Ghostscript PostScript/PDF interpreter
ghostscript-x V:12, I:29 256 The GPL Ghostscript PostScript/PDF interpreter - X Display support
gs-cjk-resource I:0.5 4652 Resource files for gs-cjk, Ghostscript CJK-TrueType extension
cmap-adobe-cns1 I:0.4 1588 CMaps for Adobe-CNS1 (for traditional Chinese support)
cmap-adobe-gb1 I:0.4 1580 CMaps for Adobe-GB1 (for simplified Chinese support)
cmap-adobe-japan1 I:0.9 2476 CMaps for Adobe-Japan1 (for Japanese standard support)
cmap-adobe-japan2 I:0.4 440 CMaps for Adobe-Japan2 (for Japanese extra support)
cmap-adobe-korea1 I:0.2 912 CMaps for Adobe-Korea1 (for Korean support)
libpoppler4 I:13 2206 PDF rendering library based on xpdf PDF viewer
libpoppler-glib4 V:4, I:11 388 PDF rendering library (GLib-based shared library)
poppler-data I:0.4 12276 CMaps for PDF rendering library (for CJK support: Adobe-*)

[Tip] Tip

Run "gs -h" to display the configuration of Ghostscript.

11.3.2. Merge two PS or PDF files

You can merge two PostScript (PS) or Portable Document Format (PDF) files using gs(1) of Ghostscript.

$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pswrite -sOutputFile=bla.ps -f foo1.ps foo2.ps
$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=bla.pdf -f foo1.pdf foo2.pdf

[Note] Note

PDF, a widely used cross-platform printable data format, is essentially the compressed PS format with a few additional features and extensions.

[Tip] Tip

From command line, psmerge(1) and other commands from the psutils package are useful for manipulating PostScript documents. Commands in the pdfjam package work similarly for manipulating PDF documents. pdftk(1) from the pdftk package is useful for manipulating PDF documents, too.

11.3.3. Printable data utilities

The following packages for printable data utilities caught my eye:

Table 11.13. List of printable data utilities.

package popcon size keyword function
poppler-utils V:7, I:50 428 pdf→ps,text,… PDF utilities. (pdftops, pdfinfo, pdfimages, pdftotext, and pdffonts)
psutils V:4, I:25 408 ps→ps PostScript document conversion tools
poster V:2, I:15 80 ps→ps Create large posters out of PostScript pages.
xpdf-utils V:1.8, I:8 4688 pdf→ps,text,… PDF utilities. (pdftops, pdfinfo, pdfimages, pdftotext, and pdffonts)
enscript V:2, I:20 2432 text→ps, html, rtf Converts ASCII text to Postscript, HTML, RTF or Pretty-Print.
a2ps V:2, I:9 4288 text→ps 'Anything to PostScript' converter and pretty-printer.
pdftk V:0.8, I:4 3292 pdf→pdf PDF document conversion tool: (pdftk)
mpage V:0.2, I:1.8 224 text,ps→ps Print multiple pages per sheet.
html2ps V:0.2, I:2 260 html→ps The converter from HTML to PostScript.
pdfjam V:0.2, I:1.8 112 pdf→pdf PDF document conversion tools: pdf90, pdfjoin, and pdfnup
gnuhtml2latex V:0.11, I:0.8 24 html→latex The converter from html to latex.
latex2rtf V:0.15, I:1.0 544 latex→rtf This converts documents from LaTeX to RTF which can be read by MS Word.
ps2eps V:1.3, I:10 116 ps→eps The converter from PostScript to EPS (Encapsulated PostScript).
e2ps V:0.03, I:0.15 188 text→ps Text to PostScript converter with Japanese encoding support.
impose+ V:0.02, I:0.2 180 ps→ps Postscript utilities.
trueprint V:0.02, I:0.15 188 text→ps This pretty-prints many source codes (C, C++, Java, Pascal, Perl, Pike, Sh, and Verilog) to PostScript. (C language)
pdf2svg V:0.06, I:0.4 60 pdf→svg Converter from PDF to Scalable Vector Graphics format.
pdftoipe V:0.03, I:0.18 648 pdf→ipe Converter from PDF to IPE's XML format.

11.3.4. Printing with CUPS

Both the lp(1) and lpr(1) commands offered by the Common Unix Printing System (CUPS) provide options for customized printing of printable data.

To print 3 collated copies of a file:

$ lp -n 3 -o Collate=True filename

or

$ lpr -#3 -o Collate=True filename

You can further customize printer operation by using printer option such as "-o number-up=2", "-o page-set=even", "-o page-set=odd", "-o scaling=200", "-o natural-scaling=200", etc., documented at Command-Line Printing and Options.

11.4. Type setting

The Unix troff program originally developed by AT&T can be used for simple type setting. It is usually used to create manpages.

TeX, created by Donald Knuth, is a very powerful typesetting tool and is the de facto standard. LaTeX, originally written by Leslie Lamport, enables high-level access to the power of TeX.

Table 11.14. List of type setting tools.

package popcon size keyword function
texlive-base V:6, I:20 17368 (La)TeX TeX system for typesetting, previewing and printing.
groff V:0.9, I:7 9360 troff GNU troff text-formatting system.

11.4.1. roff typesetting

Traditionally, roff is the main Unix text processing system. See roff(7), groff(7), groff(1), grotty(1), troff(1), groff_mdoc(7), groff_man(7), groff_ms(7), groff_me(7), groff_mm(7), and "info groff".

A good tutorial on the "-me" macros is available:

  • Install the groff package,
  • find "/usr/share/doc/groff/meintro.me.gz", and
  • do the following:
$ zcat /usr/share/doc/groff/meintro.me.gz | \
     groff -Tascii -me - | less -R

The following will make a completely plain text file:

$ zcat /usr/share/doc/groff/meintro.me.gz | \
    GROFF_NO_SGR=1 groff -Tascii -me - | col -b -x > meintro.txt

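For printing, use PostScript output generated from the roff source.

$ zcat /usr/share/doc/groff/meintro.me.gz | groff -Tps -me - | lpr
$ zcat /usr/share/doc/groff/meintro.me.gz | groff -Tps -me - | mpage -2 | lpr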

11.4.2. TeX/LaTeX

Preparation:

# aptitude install texlive

References for LaTeX:

  • The teTeX HOWTO: The Linux-teTeX Local Guide
  • tex(1)
  • latex(1)
  • "The TeXbook", by Donald E. Knuth, (Addison-Wesley)
  • "LaTeX - A Document Preparation System", by Leslie Lamport, (Addison-Wesley)
  • "The LaTeX Companion", by Goossens, Mittelbach, Samarin, (Addison-Wesley)

This is the most powerful typesetting environment. Many SGML processors use this as their back end text processor. LyX provided by the lyx package and GNU TeXmacs provided by the texmacs package offer a nice WYSIWYG editing environment for LaTeX, while many use Emacs and Vim as the source editor of choice.

There are many online resources available.

When documents become bigger, sometimes TeX may cause errors. You must increase pool size in "/etc/texmf/texmf.cnf" (or more appropriately edit "/etc/texmf/texmf.d/95NonPath" and run update-texmf(8)) to fix this.

[Note] Note

The TeX source of "The TeXbook" is available at http://tug.ctan.org/tex-archive/systems/knuth/dist/tex/texbook.tex.

This file contains most of the required macros. I heard that you can process this document with tex(1) after commenting lines 7 to 10 and adding "\input manmac \proofmodefalse". It's strongly recommended to buy this book (and all other books from Donald E. Knuth) instead of using the online version but the source is a great example of TeX input!

11.4.3. Pretty print a manual page

The following commands format a manual page as PostScript and print it.

$ man -Tps some_manpage | lpr
$ man -Tps some_manpage | mpage -2 | lpr

11.4.4. Creating a manual page

Although writing a manual page (manpage) in the plain troff format is possible, there are a few helper packages to create it.

Table 11.15. List of packages to help create the manpage.

package popcon size keyword function
docbook-to-man V:0.6, I:3 248 SGML→manpage The converter from DocBook SGML into roff man macros.
help2man V:0.15, I:1.1 236 text→manpage Automatic manpage generator from --help.
info2man V:0.02, I:0.16 204 info→manpage The converter from GNU info to POD or man pages.
txt2man V:0.03, I:0.2 88 text→manpage Converts flat ASCII text to man page format.

11.5. The mail data conversion

The following packages for mail data conversion caught my eye:

Table 11.16. List of packages to help mail data conversion.

package popcon size keyword function
sharutils V:4, I:57 976 mail shar(1), unshar(1), uuencode(1), uudecode(1)
mpack V:2, I:47 84 mail The encoder and decoder for MIME messages: mpack(1) and munpack(1).
tnef V:0.6, I:1.5 160 mail unpacking MIME attachments of type "application/ms-tnef" which is a Microsoft only format.
uudeview V:0.2, I:1.4 128 mail The encoder and decoder for the following formats: uuencode, xxencode, BASE64, quoted printable, and BinHex
mimedecode V:0.13, I:0.8 76 mail This decodes transfer encoded text type MIME messages.
readpst V:0.05, I:0.3 228 windows/mail This converts Outlook PST files to mbox format.

[Tip] Tip

The Internet Message Access Protocol version 4 (IMAP4) server (see Section 6.7, “POP3/IMAP4 server”) may be used to move mails out from proprietary mail systems if the mail client software can be configured to use IMAP4 server too.

11.5.1. Mail data basics

Mail (SMTP) data should be limited to 7 bit. So binary data and 8 bit text data are encoded into 7 bit format with the Multipurpose Internet Mail Extensions (MIME) and the selection of the charset (see Section 8.3.1, “Basics of encoding”).

The standard mail storage format is mbox formatted according to RFC2822 (updated RFC822). See mbox(5) (provided by the mutt package).

For European languages, "Content-Transfer-Encoding: quoted-printable" with the ISO-8859-1 charset is usually used since there are not many 8 bit characters. If European text is in UTF-8, "Content-Transfer-Encoding: quoted-printable" is also used since it is mostly 7 bit data.

For Japanese, traditionally "Content-Type: text/plain; charset=ISO-2022-JP" should be used to keep text in 7 bits. But mails from older Microsoft systems may be sent in Shift-JIS without proper declaration. If Japanese text is in UTF-8, it contains many 8 bit data and is encoded into 7 bit data by Base64. The situation of other Asian languages is similar.
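The effect of Base64 on 8 bit data can be seen with base64(1) from the coreutils package. This is only a sketch with a made-up sample string; a real mail client also adds the MIME headers and wraps long lines.

```shell
# "café" in UTF-8: the last character is the two 8 bit bytes \303\251.
# The Base64 form uses only 7 bit ASCII characters.
printf 'caf\303\251' | base64

# Decoding restores the original bytes.
printf 'caf\303\251' | base64 | base64 -d | od -An -tx1
```

With quoted-printable, the same text would stay readable as "caf=C3=A9", which is why it is preferred when most of the data is already 7 bit.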

[Note] Note

If your non-Unix mail data is accessible by a non-Debian client software which can talk to the IMAP4 server, you may be able to move them out by running your own IMAP4 server (see Section 6.7, “POP3/IMAP4 server”).

[Note] Note

If you use other mail storage formats, moving them to mbox format is a good first step. A versatile client program such as mutt(1) may be handy for this.

You can split mailbox contents into individual messages using procmail(1) and formail(1).

Each mail message can be unpacked using munpack(1) from the mpack package (or other specialized tools) to obtain the MIME encoded contents.

11.6. Graphic data tools

The following packages for graphic data conversion, editing, and organization caught my eye:

Table 11.17. List of graphic data tools.

package popcon size keyword function
gimp V:13, I:49 13468 image(bitmap) The GNU Image Manipulation Program.
imagemagick V:15, I:32 304 image(bitmap) Image manipulation programs.
graphicsmagick V:1.5, I:3 3696 image(bitmap) Image manipulation programs. (fork of imagemagick)
xsane V:6, I:41 744 image(bitmap) GTK+-based X11 frontend for SANE (Scanner Access Now Easy).
netpbm V:4, I:23 4408 image(bitmap) Graphics conversion tools.
icoutils V:0.06, I:0.5 200 png↔ico(bitmap) Converts MS Windows icons and cursors to and from PNG formats (favicon.ico)
xpm2wico V:0.02, I:0.11 80 xpm→ico(bitmap) Converts XPM to MS Windows icon formats
scribus V:0.5, I:3 26864 ps/pdf/SVG/… The Scribus DTP editor.
openoffice.org-draw V:21, I:46 8808 image(vector) OpenOffice.org office suite - drawing
inkscape V:12, I:29 61584 image(vector) The SVG (Scalable Vector Graphics) editor.
dia-gnome V:1.5, I:4 620 image(vector) Diagram editor (GNOME)
dia V:2, I:5 620 image(vector) Diagram editor (Gtk)
xfig V:2, I:4 1768 image(vector) Facility for Interactive Generation of figures under X11
pstoedit V:1.2, I:9 880 ps/pdf→image(vector) PostScript and PDF files to editable vector graphics converter. (SVG)
libwmf-bin V:1.0, I:8 88 Windows/image(vector) Windows metafile (vector graphic data) conversion tools.
fig2sxd V:0.04, I:0.3 200 fig→sxd(vector) Convert XFig files to OpenOffice.org Draw format
unpaper V:0.2, I:1.4 736 image→image Post-processing tool for scanned pages for OCR.
tesseract-ocr V:0.4, I:2 2072 image→text Free OCR software based on the HP's commercial OCR engine.
tesseract-ocr-eng V:0.12, I:0.9 1760 image→text OCR engine data: tesseract-ocr language files for English text.
clara V:0.07, I:0.3 NOT_FOUND image→text Free OCR software.
gocr V:1.0, I:6 484 image→text Free OCR software.
ocrad V:0.8, I:6 364 image→text Free OCR software.
gtkam V:0.3, I:1.9 1348 image(Exif) Manipulates digital camera photo files (GNOME) - GUI
gphoto2 V:0.4, I:3 1008 image(Exif) Manipulates digital camera photo files (GNOME) - command line
kamera V:1.1, I:20 292 image(Exif) Manipulates digital camera photo files (KDE)
jhead V:0.7, I:3 128 image(Exif) Manipulates the non-image part of Exif compliant JPEG (digital camera photo) files
exif V:0.2, I:1.7 276 image(Exif) Command-line utility to show EXIF information in JPEG files
exiftags V:0.17, I:1.0 248 image(Exif) Utility to read Exif tags from a digital camera JPEG file
exiftran V:0.19, I:1.3 92 image(Exif) Transforms digital camera jpeg images
exifprobe V:0.07, I:0.5 484 image(Exif) Reads metadata from digital pictures
dcraw V:1.1, I:6 408 image(Raw)→ppm Decodes raw digital camera images
findimagedupes V:0.08, I:0.4 136 image→fingerprint Finds visually similar or duplicate images
ale V:0.03, I:0.2 768 image→image Merges images to increase fidelity or create mosaics
imageindex V:0.04, I:0.3 192 image(Exif)→html Generates static HTML galleries from images
f-spot V:0.5, I:1.8 10508 image(Exif) Personal photo management application (GNOME)
bins V:0.02, I:0.2 2008 image(Exif)→html Generates static HTML photo albums using XML and EXIF tags
galrey V:0.01, I:0.15 116 image(Exif)→html Generates browsable HTML photo albums with thumbnails
outguess V:0.03, I:0.15 252 jpeg,png Universal Steganographic tool
qcad V:1.3, I:2 3824 DXF CAD data editor (KDE)
blender V:0.7, I:3 28588 blend, TIFF, VRML, … 3D content editor for animation etc.
open-font-design-toolkit I:0.02 36 ttf, ps, … Metapackage for open font design
fontforge V:0.2, I:1.9 6028 ttf, ps, … Font editor for PS, TrueType and OpenType fonts
xgridfit V:0.00, I:0.05 752 ttf a program for gridfitting, or "hinting," TrueType fonts
gbdfed V:0.02, I:0.15 536 bdf Editor for BDF fonts

[Tip] Tip

Search more image tools using regex "~Gworks-with::image" in aptitude(8) (see Section 2.2.5, “Search method options with aptitude”).

Although GUI programs such as gimp(1) are very powerful, command line tools such as imagemagick(1) are quite useful for automating image manipulation with scripts.

The de facto image file format of digital cameras is the Exchangeable Image File Format (EXIF), which is the JPEG image file format with additional metadata tags. It can hold information such as date, time, and camera settings.

The Lempel-Ziv-Welch (LZW) lossless data compression patent has expired. Graphics Interchange Format (GIF) utilities which use the LZW compression method are now freely available on the Debian system.

[Tip] Tip

Any digital camera or scanner with removable recording media will work with Linux through USB Mass Storage readers since it follows the Design rule for Camera Filesystem.

11.7. Miscellaneous data conversion

There are many other programs for converting data. The following packages caught my eye when searching with the regex "~Guse::converting" in aptitude(8) (see Section 2.2.5, “Search method options with aptitude”):

Table 11.18. List of miscellaneous data conversion tools.

package popcon size keyword function
alien V:1.5, I:12 276 rpm/tgz→deb The converter for the foreign package into the Debian package.
freepwing V:0.00, I:0.03 568 EB→EPWING The converter from "Electric Book" (popular in Japan) to a single JIS X 4081 format (a subset of the EPWING V1).

You can also extract data from RPM format with:

$ rpm2cpio file.src.rpm | cpio --extract