Chapter 10. Data management

Table of Contents

10.1. Sharing, copying, and archiving
10.1.1. Archive and compression tools
10.1.2. Copy and synchronization tools
10.1.3. Idioms for the archive
10.1.4. Idioms for the copy
10.1.5. Idioms for the selection of files
10.1.6. Backup and recovery
10.1.7. Backup utility suites
10.1.8. An example script for the system backup
10.1.9. A copy script for the data backup
10.1.10. Removable mass storage device
10.1.11. Sharing data via network
10.1.12. Archive media
10.2. The binary data
10.2.1. Making the disk image file
10.2.2. Writing directly to the disk
10.2.3. Mounting the disk image file
10.2.4. Making an empty disk image file
10.2.5. Viewing and editing binary data
10.2.6. Manipulating files without mounting disk
10.2.7. Data redundancy
10.2.8. Data file recovery and forensic analysis
10.2.9. Making the ISO9660 image file
10.2.10. Writing directly to the CD/DVD-R/RW
10.2.11. Mounting the ISO9660 image file
10.2.12. Splitting a large file into small files
10.2.13. Clearing file contents
10.2.14. Dummy files
10.2.15. Erasing an entire hard disk
10.2.16. Undeleting deleted but still open files
10.2.17. Searching all hardlinks
10.2.18. Invisible disk space consumption
10.3. Data security infrastructure
10.3.1. Key management for GnuPG
10.3.2. Using GnuPG with files
10.3.3. Using GnuPG with Mutt
10.3.4. Using GnuPG with Vim
10.3.5. The MD5 sum
10.4. Source code merge tools
10.4.1. Extracting differences for source files
10.4.2. Merging updates for source files
10.4.3. Updating via 3-way-merge
10.5. Version control systems
10.5.1. Comparison of VCS commands
10.6. CVS
10.6.1. Installing a CVS server
10.6.2. Using local CVS server
10.6.3. Using remote CVS pserver
10.6.4. Anonymous CVS (download only)
10.6.5. Using remote CVS through ssh
10.6.6. Creating a new CVS archive
10.6.7. Working with CVS
10.6.8. Exporting files from CVS
10.6.9. Administration of CVS
10.6.10. File permissions in CVS repository
10.6.11. Execution bit
10.7. Subversion
10.7.1. Installing a Subversion server
10.7.2. Setting up a repository
10.7.3. Configuring Apache2
10.7.4. Subversion usage examples
10.7.5. Creating a new Subversion archive
10.7.6. Working with Subversion
10.8. Git
10.8.1. Before using Git …
10.8.2. Git references
10.8.3. Git commands
10.8.4. Git for recording configuration history

This chapter describes tools and tips for managing binary and text data on the Debian system.

10.1. Sharing, copying, and archiving

The security of the data and its controlled sharing have several aspects:

  • the creation of data archive,
  • the remote storage access,
  • the duplication,
  • the tracking of the modification history,
  • the facilitation of data sharing,
  • the prevention of unauthorized file access, and
  • the detection of unauthorized file modification.

These can be realized by using some combination of:

  • archive and compression tools,
  • copy and synchronization tools,
  • network filesystems,
  • removable storage media,
  • the secure shell,
  • the authentication system,
  • version control system tools, and
  • hash and cryptographic encryption tools.

10.1.1. Archive and compression tools

Here is a summary of archive and compression tools available on the Debian system:

Table 10.1. List of archive and compression tools.

package popcon size command extension comment
tar V:62, I:99 2456 tar(1) .tar the standard archiver (de facto standard)
cpio V:33, I:99 664 cpio(1) .cpio Unix System V style archiver, use with find(1)
binutils V:49, I:78 9736 ar(1) .ar archiver for the creation of static libraries
fastjar V:4, I:39 220 fastjar(1) .jar archiver for Java (zip like)
pax V:1.5, I:5 172 pax(1) .pax new POSIX standard archiver, compromise between tar and cpio
afio V:0.3, I:1.6 240 afio(1) .afio extended cpio with per-file compression etc.
gzip V:90, I:99 292 gzip(1), zcat(1), … .gz GNU LZ77 compression utility (de facto standard)
bzip2 V:56, I:80 132 bzip2(1), bzcat(1), … .bz2 Burrows-Wheeler block-sorting compression utility with higher compression ratio than gzip(1) (slower than gzip with similar syntax)
lzma V:9, I:67 172 lzma(1) .lzma LZMA compression utility with higher compression ratio than gzip(1) (slower than gzip with similar syntax)
p7zip V:3, I:25 1052 7zr(1), p7zip(1) .7z 7-Zip file archiver with high compression ratio (LZMA compression)
p7zip-full V:11, I:21 3612 7z(1), 7za(1) .7z 7-Zip file archiver with high compression ratio (LZMA compression and others)
lzop V:0.9, I:7 144 lzop(1) .lzo LZO compression utility with higher compression and decompression speed than gzip(1) (lower compression ratio than gzip with similar syntax)
zip V:9, I:59 628 zip(1) .zip InfoZIP: DOS archive and compression tool
unzip V:20, I:71 384 unzip(1) .zip InfoZIP: DOS unarchive and decompression tool

[Warning] Warning

Do not set the "$TAPE" variable unless you know what to expect. It will change tar(1) behavior.

[Note] Note

The gzipped tar(1) archive uses the file extension ".tgz" or ".tar.gz".

[Note] Note

cp(1), scp(1) and tar(1) may have some limitations with special files. cpio(1) and afio(1) are the most versatile.

[Note] Note

cpio(1) and afio(1) are designed to be used with find(1) and other commands. They are suitable for backup scripts, since the file selection part of the script can be tested independently.

[Note] Note

afio(1) compresses each file in the archive individually. This makes afio archives much more robust against file corruption than globally compressed tar or cpio archives, and makes afio the best archive engine for backup scripts.

[Note] Note

OpenOffice data files are internally ".jar" files.

10.1.2. Copy and synchronization tools

Here is a summary of simple copy and backup tools available on the Debian system:

Table 10.2. List of copy and synchronization tools.

package popcon size tool function
coreutils V:91, I:99 12868 GNU cp Locally copy files and directories ("-a" for recursive)
openssh-client V:53, I:98 2084 scp Remotely copy files and directories (client, "-r" for recursive)
openssh-server V:66, I:78 812 sshd Remotely copy files and directories (remote server)
rsync V:16, I:42 640 - 1-way remote synchronization and backup
unison V:0.8, I:3 1644 - 2-way remote synchronization and backup
pdumpfs V:0.06, I:0.17 148 - Daily local backup using hardlinks (similar to Plan9's dumpfs)

[Tip] Tip

Running the bkup script mentioned in Section 10.1.9, “A copy script for the data backup” with the "-gl" option under cron(8) should provide functionality very similar to pdumpfs for the static data archive.

[Tip] Tip

Version control system (VCS) tools in Table 10.16, “List of version control system tools.” can function as multi-way copy and synchronization tools.

10.1.3. Idioms for the archive

Here are several ways to archive and unarchive the entire content of the directory "/source".

With GNU tar(1):

$ tar cvzf archive.tar.gz /source
$ tar xvzf archive.tar.gz
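
As a minimal sketch of the tar(1) idiom, assuming GNU tar and a throwaway directory created by mktemp(1), the following archives a small tree, unpacks it elsewhere, and compares the two. The "work", "source", and "restore" names are hypothetical and are not part of the idiom itself.

```shell
# Round-trip sketch: archive a scratch directory and unpack it elsewhere.
# All paths are hypothetical and live under a throwaway temp directory.
work=$(mktemp -d)
mkdir -p "$work/source/sub" "$work/restore"
echo "hello" > "$work/source/a.txt"
echo "world" > "$work/source/sub/b.txt"

# Create a gzipped archive; -C makes member names relative ("source/...").
tar czf "$work/archive.tar.gz" -C "$work" source

# List the contents without extracting, then extract into another place.
tar tzf "$work/archive.tar.gz"
tar xzf "$work/archive.tar.gz" -C "$work/restore"

# diff -r prints nothing and returns 0 when both trees are identical.
if diff -r "$work/source" "$work/restore/source"; then
  roundtrip=ok
else
  roundtrip=differs
fi
echo "round trip: $roundtrip"
rm -rf "$work"
```

With an absolute path such as "/source", GNU tar strips the leading "/" from member names by default, so extraction lands under the current directory.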

With cpio(1):

$ find /source -xdev -print0 | cpio -ov --null > archive.cpio; gzip archive.cpio
$ zcat archive.cpio.gz | cpio -i

With afio(1):

$ find /source -xdev -print0 | afio -ovZ0 archive.afio
$ afio -ivZ archive.afio

10.1.4. Idioms for the copy

Here are several ways to copy the entire content of a directory, locally and remotely.

  • local copy: "/source" directory → "/dest" directory
  • remote copy: "/source" directory at local host → "/dest" directory at "user@host.dom" host

With GNU cp(1) and OpenSSH scp(1):

# cp -a /source /dest
# scp -pr /source user@host.dom:/dest

With GNU tar(1):

# (cd /source && tar cf - . ) | (cd /dest && tar xvfp - )
# (cd /source && tar cf - . ) | ssh user@host.dom '(cd /dest && tar xvfp - )'
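
The tar(1) pipe can be tried without touching real data. This sketch assumes GNU tar and GNU stat(1), uses hypothetical scratch directories, and checks that a file mode and a symbolic link survive the copy.

```shell
# Local tar pipe copy between two scratch directories (hypothetical names).
work=$(mktemp -d)
mkdir -p "$work/source/sub" "$work/dest"
echo "data" > "$work/source/sub/file.txt"
chmod 640 "$work/source/sub/file.txt"
ln -s sub/file.txt "$work/source/link"

# Same pipe as the local idiom; "p" keeps permissions on extraction.
(cd "$work/source" && tar cf - .) | (cd "$work/dest" && tar xpf -)

# Compare the file mode and the symlink target on the destination side.
src_mode=$(stat -c %a "$work/source/sub/file.txt")
dst_mode=$(stat -c %a "$work/dest/sub/file.txt")
link_target=$(readlink "$work/dest/link")
echo "mode: $src_mode -> $dst_mode, link -> $link_target"
rm -rf "$work"
```

The subshells keep each "cd" local to its side of the pipe, which is why the same pattern works unchanged when the right-hand side runs under ssh(1).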

With cpio(1):

# cd /source; find . -print0 | cpio -pvdm --null --sparse /dest

With afio(1):

# cd /source; find . -print0 | afio -pv0a /dest

scp(1) can even copy files between remote hosts:

# scp -pr user1@host1.dom:/source user2@host2.dom:/dest

10.1.5. Idioms for the selection of files

find(1) is used to select files for archive and copy commands (see Section 10.1.3, “Idioms for the archive” and Section 10.1.4, “Idioms for the copy”) or for xargs(1) (see Section 9.5.9, “Repeating a command looping over files”). Its command arguments allow the selection to be refined.

Basic syntax of find(1) can be summarized as:

  • Its conditional arguments are evaluated from left to right.
  • This evaluation stops once its outcome is determined.
  • "Logical OR" (specified by "-o" between conditionals) has lower precedence than "logical AND" (specified by "-a" or nothing between conditionals).
  • "Logical NOT" (specified by "!" before a conditional) has higher precedence than "logical AND".
  • "-prune" always returns logical TRUE and, if the file is a directory, the search does not descend into it.
  • "-name" matches the base of the filename with shell glob (see Section 1.5.6, “Shell glob”) but it also matches its initial "." with metacharacters such as "*" and "?". (New POSIX feature)
  • "-regex" matches the full path with emacs style BRE (see Section 1.6.2, “Regular expressions”) as default.
  • "-size" matches the file based on the file size (value preceded by "+" for larger, preceded by "-" for smaller)
  • "-newer" matches the file newer than the one specified in its argument.
  • "-print0" always returns logical TRUE and prints the full filename (null terminated) on the standard output.

find(1) is often used with an idiomatic style. For example:

# find /path/to \
    -xdev -regextype posix-extended \
    -type f -regex ".*\.afio|.*~" -prune -o \
    -type d -regex ".*/\.git" -prune -o \
    -type f -size +99M -prune -o \
    -type f -newer /path/to/timestamp -print0

This performs the following actions:

  • search all files starting from "/path/to",
  • limit the search to the starting filesystem and use ERE (see Section 1.6.2, “Regular expressions”) instead of the default BRE,
  • exclude files matching the regex ".*\.afio" or ".*~" from the search by stopping processing,
  • exclude directories matching the regex ".*/\.git" from the search by stopping processing,
  • exclude files larger than 99 Megabytes (units of 1048576 bytes) from the search by stopping processing, and
  • print filenames which satisfy the above search conditions and are newer than "/path/to/timestamp".

Please note the idiomatic use of "-prune -o" to exclude files in the above example.
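
The "-prune -o" idiom can be tried on a scratch tree. This is only a sketch, assuming GNU find(1); the "proj" tree and its file names are made up for illustration.

```shell
# Build a small hypothetical tree with files that should be excluded.
work=$(mktemp -d)
mkdir -p "$work/proj/.git" "$work/proj/src"
touch "$work/proj/.git/config" "$work/proj/src/main.c" \
      "$work/proj/src/main.c~" "$work/proj/notes.afio"

# Each "COND -prune -o" clause drops its matches; whatever survives
# all clauses reaches the final "-print".
found=$(cd "$work" && find proj -regextype posix-extended \
    -type f -regex ".*\.afio|.*~" -prune -o \
    -type d -regex ".*/\.git" -prune -o \
    -type f -print | sort)
echo "$found"
rm -rf "$work"
```

Only "proj/src/main.c" is printed, since the editor backup file, the ".afio" archive, and everything under ".git" are pruned. Testing the selection this way before piping it into cpio(1) or afio(1) is the main benefit of keeping find(1) separate.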

[Note] Note

On non-Debian Unix-like systems, some options may not be supported by find(1). In such a case, please consider adjusting the matching methods and replacing "-print0" with "-print". You may need to adjust related commands too.

10.1.6. Backup and recovery

We all know that computers fail sometimes and that human errors damage systems and data. Backup and recovery operations are an essential part of successful system administration. All possible failure modes will hit you some day.

[Tip] Tip

Keep your backup system simple and back up your system often. Having backup data is more important than how technically good your backup method is.

There are 3 key factors which determine actual backup and recovery policy:

  1. Knowing what to back up and recover.

    • Data files directly created by you: data in "~/"
    • Data files created by applications used by you: data in "/var/" (except "/var/cache/", "/var/run/", and "/var/tmp/")
    • System configuration files: data in "/etc/"
    • Local software: data in "/usr/local/" or "/opt/"
    • System installation information: a memo in plain text on key steps (partition, …)
    • Proven set of data: confirmed by experimental recovery operations in advance
  2. Knowing how to back up and recover.

    • Secure storage of data: protection from overwrite and system failure
    • Frequent backup: scheduled backup
    • Redundant backup: data mirroring
    • Fool proof process: easy single command backup
  3. Assessing risks and costs involved.

    • Value of data when lost
    • Required resources for backup: human, hardware, software, …
    • Failure modes and their likelihood

As for secure storage of data, data should be kept at least on different disk partitions, and preferably on different disks and machines, to withstand filesystem corruption. Important data are best stored on write-once media such as CD/DVD-R to prevent overwrite accidents. (See Section 10.2, “The binary data” for how to write to the storage media from the shell command line. The GNOME desktop GUI environment gives you easy access via the menu: "Places→CD/DVD Creator".)

[Note] Note

You may wish to stop some application daemons such as MTA (see Section 6.3, “Mail transport agent (MTA)”) while backing up data.

[Note] Note

You should take extra care with the backup and restoration of identity related data files such as "/etc/ssh/ssh_host_dsa_key", "/etc/ssh/ssh_host_rsa_key", "~/.gnupg/*", "~/.ssh/*", "/etc/passwd", "/etc/shadow", "/etc/fetchmailrc", "popularity-contest.conf", "/etc/ppp/pap-secrets", and "/etc/exim4/passwd.client". Some of these data cannot be regenerated by entering the same input string to the system.

[Note] Note

If you run a cron job as a user process, you must restore files in "/var/spool/cron/crontabs" directory and restart cron(8). See Section 9.5.14, “Scheduling tasks regularly” for cron(8) and crontab(1).

10.1.7. Backup utility suites

Here is a select list of notable backup utility suites available on the Debian system:

Table 10.3. List of backup suite utilities.

package popcon size description
rdiff-backup V:1.3, I:3 764 (remote) incremental backup
dump V:0.4, I:1.6 620 4.4BSD dump(8) and restore(8) for ext2/ext3 filesystems
xfsdump V:0.3, I:1.9 684 Dump and restore with xfsdump(8) and xfsrestore(8) for XFS filesystem on GNU/Linux and IRIX
backupninja V:0.4, I:0.5 408 lightweight, extensible meta-backup system
mondo V:0.12, I:0.7 1172 Mondo Rescue: disaster recovery backup suite
sbackup V:0.08, I:0.2 488 Simple Backup Suite for GNOME desktop
keep V:0.2, I:0.5 1196 backup system for KDE
bacula-common V:1.1, I:2 832 Bacula: network backup, recovery and verification - common support files
bacula-client I:0.9 60 Bacula: network backup, recovery and verification - client meta-package
bacula-console V:0.3, I:1.2 340 Bacula: network backup, recovery and verification - text console
bacula-server I:0.6 60 Bacula: network backup, recovery and verification - server meta-package
amanda-common V:0.4, I:1.0 3120 Amanda: Advanced Maryland Automatic Network Disk Archiver (Libs)
amanda-client V:0.4, I:0.9 560 Amanda: Advanced Maryland Automatic Network Disk Archiver (Client)
amanda-server V:0.14, I:0.3 1264 Amanda: Advanced Maryland Automatic Network Disk Archiver (Server)
backuppc V:0.7, I:0.8 2082 BackupPC is a high-performance, enterprise-grade system for backing up PCs (disk based)
backup-manager V:0.4, I:0.5 660 command-line backup tool
backup2l V:0.2, I:0.3 140 low-maintenance backup/restore tool for mountable media (disk based)
faubackup V:0.19, I:0.2 156 backup system using a filesystem for storage (disk based)

Backup tools have their specialized focuses:

  • Mondo Rescue is a backup system to facilitate quick restoration of a complete system from backup CD/DVD etc. without going through the normal system installation process.
  • The sbackup and keep packages provide an easy GUI frontend for desktop users to make regular backups of user data. An equivalent function can be realized by a simple script (Section 10.1.8, “An example script for the system backup”) and cron(8).
  • Bacula, Amanda, and BackupPC are full featured backup suite utilities which focus on regular backups over the network.

Basic tools described in Section 10.1.1, “Archive and compression tools” and Section 10.1.2, “Copy and synchronization tools” can be used to facilitate system backup via custom scripts. Such scripts can be enhanced by:

  • the rdiff-backup package which enables incremental (remote) backups, and
  • the dump package which helps to archive and restore the whole filesystem incrementally and efficiently.
[Tip] Tip

See files in "/usr/share/doc/dump/" and "Is dump really deprecated?" to learn about the dump package.

10.1.8. An example script for the system backup

For a personal Debian desktop system running the unstable suite, I only need to protect personal and critical data. I reinstall the system once a year anyway. Thus I see no reason to back up the whole system or to install a full featured backup utility.

I use a simple script to make a backup archive and burn it onto CD/DVD using a GUI. Here is an example script for this.

#!/bin/sh -e
# Copyright (C) 2007-2008 Osamu Aoki <osamu@debian.org>, Public Domain
BUUID=1000; USER=osamu # UID and name of a user who accesses backup files
BUDIR="/var/backups"
XDIR0=".+/Mail|.+/Desktop"
XDIR1=".+/\.thumbnails|.+/\.?Trash|.+/\.?[cC]ache|.+/\.gvfs|.+/sessions"
XDIR2=".+/CVS|.+/\.git|.+/\.svn|.+/Downloads|.+/Archive|.+/Checkout|.+/tmp"
XSFX=".+\.iso|.+\.tgz|.+\.tar\.gz|.+\.tar\.bz2|.+\.afio|.+\.tmp|.+\.swp|.+~"
SIZE="+99M"
DATE=$(date --utc +"%Y%m%d-%H%M")
[ -d "$BUDIR" ] || mkdir -p "$BUDIR"
umask 077
dpkg --get-selections \* > /var/lib/dpkg/dpkg-selections.list
debconf-get-selections > /var/cache/debconf/debconf-selections

{
find /etc /usr/local /opt /var/lib/dpkg/dpkg-selections.list \
     /var/cache/debconf/debconf-selections -xdev -print0
find /home/$USER /root -xdev -regextype posix-extended \
  -type d -regex "$XDIR0|$XDIR1" -prune -o -type f -regex "$XSFX" -prune -o \
  -type f -size  "$SIZE" -prune -o -print0
find /home/$USER/Mail/Inbox /home/$USER/Mail/Outbox -print0
find /home/$USER/Desktop  -xdev -regextype posix-extended \
  -type d -regex "$XDIR2" -prune -o -type f -regex "$XSFX" -prune -o \
  -type f -size  "$SIZE" -prune -o -print0
} | cpio -ov --null -O $BUDIR/BU$DATE.cpio
chown $BUUID $BUDIR/BU$DATE.cpio
touch $BUDIR/backup.stamp

This is meant to be an example script, executed as root.

[Tip] Tip

You can recover debconf configuration data with "debconf-set-selections debconf-selections" and dpkg selection data with "dpkg --set-selections <dpkg-selections.list".

10.1.9. A copy script for the data backup

For the set of data under a directory tree, the copy with "cp -a" provides the normal backup.

For a set of large static data that is never overwritten, such as the data under the "/var/cache/apt/archives/" directory, hardlinks made with "cp -al" provide an alternative to the normal backup with efficient use of the disk space.
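
The effect of "cp -al" can be observed on a scratch directory. This sketch assumes GNU cp(1) and GNU stat(1); the "data" directory and "pkg.deb" file are hypothetical stand-ins for a real static data tree.

```shell
# Hardlink "copy" of a small hypothetical data tree.
work=$(mktemp -d)
mkdir -p "$work/data"
echo "static payload" > "$work/data/pkg.deb"

# "cp -al" recreates the directory tree but hard-links the files
# instead of copying their contents (same filesystem only).
cp -al "$work/data" "$work/data.bkup"

# Both names now refer to the same inode, so the link count is 2.
links=$(stat -c %h "$work/data.bkup/pkg.deb")
ino_a=$(stat -c %i "$work/data/pkg.deb")
ino_b=$(stat -c %i "$work/data.bkup/pkg.deb")
echo "link count: $links, inodes: $ino_a $ino_b"
rm -rf "$work"
```

Since both names share the same data blocks, modifying the file in place changes the "backup" too. This is why the approach only suits data that is replaced or deleted, never overwritten.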

Here is a copy script, which I named bkup, for the data backup. This script copies all (non-VCS) files under the current directory to a dated directory in the parent directory or on a remote host.

#!/bin/sh -e
# Copyright (C) 2007-2008 Osamu Aoki <osamu@debian.org>, Public Domain
# POSIX style function definitions: dash does not accept the "function" keyword
fdot(){ find . -type d \( -iname ".?*" -o -iname "CVS" \) -prune -o -print0;}
fall(){ find . -print0;}
mkdircd(){ mkdir -p "$1";chmod 700 "$1";cd "$1">/dev/null;}
FIND="fdot";OPT="-a";MODE="CPIOP";HOST="localhost";EXTP="$(hostname -f)"
BKUP="$(basename $(pwd)).bkup";TIME="$(date  +%Y%m%d-%H%M%S)";BU="$BKUP/$TIME"
while getopts gcCsStrlLaAxe:h:T f; do case $f in
g)  MODE="GNUCP";; # cp (GNU)
c)  MODE="CPIOP";; # cpio -p
C)  MODE="CPIOI";; # cpio -i
s)  MODE="CPIOSSH";; # cpio/ssh
S)  MODE="AFIOSSH";; # afio/ssh
t)  MODE="TARSSH";; # tar/ssh
r)  MODE="RSYNCSSH";; # rsync/ssh
l)  OPT="-alv";; # hardlink (GNU cp)
L)  OPT="-av";;  # copy (GNU cp)
a)  FIND="fall";; # find all
A)  FIND="fdot";; # find non CVS/ .???/
x)  set -x;; # trace
e)  EXTP="${OPTARG}";; # hostname -f
h)  HOST="${OPTARG}";; # user@remotehost.example.com
T)  MODE="TEST";; # test find mode
\?) echo "use -x for trace."
esac; done
shift $(expr $OPTIND - 1)
if [ $# -gt 0 ]; then
  for x in "$@"; do cp $OPT "$x" "$x.$TIME"; done
elif [ $MODE = GNUCP ]; then
  mkdir -p "../$BU";chmod 700 "../$BU";cp $OPT . "../$BU/"
elif [ $MODE = CPIOP ]; then
  mkdir -p "../$BU";chmod 700 "../$BU"
  $FIND|cpio --null --sparse -pvd ../$BU
elif [ $MODE = CPIOI ]; then
  $FIND|cpio -ov --null | ( mkdircd "../$BU"&&cpio -i )
elif [ $MODE = CPIOSSH ]; then
  $FIND|cpio -ov --null|ssh -C $HOST "( mkdircd \"$EXTP/$BU\"&&cpio -i )"
elif [ $MODE = AFIOSSH ]; then
  $FIND|afio -ov -0 -|ssh -C $HOST "( mkdircd \"$EXTP/$BU\"&&afio -i - )"
elif [ $MODE = TARSSH ]; then
  (tar cvf - . )|ssh -C $HOST "( mkdircd \"$EXTP/$BU\"&& tar xvfp - )"
elif [ $MODE = RSYNCSSH ]; then
  rsync -rlpt ./ "${HOST}:${EXTP}-${BKUP}-${TIME}"
else
  echo "Any other idea to backup?"
  $FIND |xargs -0 -n 1 echo
fi

These are meant to be command examples. Please read the script and edit it yourself before using it.

[Tip] Tip

I keep this bkup in my "/usr/local/bin/" directory. I issue this bkup command without any option in the working directory whenever I need a temporary snapshot backup.

[Tip] Tip

For making a snapshot history of a source file tree or a configuration file tree, it is easier and more space efficient to use git(7) (see Section 10.8.4, “Git for recording configuration history”).

10.1.10. Removable mass storage device

Removable mass storage devices may be any one of

These removable mass storage devices can be automatically mounted as a user under a modern desktop environment, such as GNOME using gnome-mount(1).

  • The mount point under GNOME is chosen as "/media/<disk_label>", which can be customized by:

    • mlabel(1) for FAT filesystem,
    • genisoimage(1) with "-V" option for ISO9660 filesystem, and
    • tune2fs(1) with "-L" option for ext2/ext3 filesystem.
  • The choice of encoding may need to be provided as a mount option (see Section 8.3.6, “Filename encoding”).
  • The ownership of the mounted filesystem may need to be adjusted for use by the normal user.
[Note] Note

Automounting under a modern desktop environment happens only when those removable media devices are not listed in "/etc/fstab".

[Tip] Tip

When a wrong mount option causes problems, erase its corresponding setting under "/system/storage/" via gconf-editor(1).

Table 10.4. List of packages which permit normal users to mount removable devices without a matching "/etc/fstab" entry.

package popcon size description
gnome-mount V:23, I:39 968 wrapper for (un)mounting and ejecting storage devices (used by GNOME)
pmount V:11, I:35 868 mount removable devices as normal user (used by KDE)
cryptmount V:0.10, I:0.5 304 Management and user-mode mounting of encrypted filesystems
usbmount I:1.8 108 automatically mount and unmount USB mass storage devices

When sharing data with another system via a removable mass storage device, you should format it with a common filesystem supported by both systems. Here is a list of filesystem choices.

Table 10.5. List of filesystem choices for removable storage devices with typical usage scenarios.

filesystem typical usage scenario
FAT12 Cross platform sharing of data on the floppy disk. (<32MiB)
FAT16 Cross platform sharing of data on the small hard disk like device. (<2GiB)
FAT32 Cross platform sharing of data on the large hard disk like device. (<8TiB, supported by MS Windows95 OSR2 and newer)
NTFS Cross platform sharing of data on the large hard disk like device. (supported natively on MS Windows NT and later version, and supported by NTFS-3G via FUSE on Linux)
ISO9660 Cross platform sharing of static data on CD-R and DVD+/-R
UDF Incremental data writing on CD-R and DVD+/-R (new)
MINIX filesystem Space efficient unix file data storage on the floppy disk.
ext2 filesystem Sharing of data on the hard disk like device with older Linux systems.
ext3 filesystem Sharing of data on the hard disk like device with current Linux systems. (journaling filesystem)

[Tip] Tip

See Section 9.4.1, “Removable disk encryption with dm-crypt/LUKS” for cross platform sharing of data using device level encryption.

The FAT filesystem is supported by almost all modern operating systems and is quite useful for the data exchange purpose via removable hard disk like media.

When formatting removable hard disk like devices for cross platform sharing of data with the FAT filesystem, the following should be safe choices:

  • Partitioning them with fdisk(8), cfdisk(8) or parted(8) (see Section 9.3.1, “Disk partition configuration”) into a single primary partition and marking it as:

    • type "6" for FAT16 for media smaller than 2GB or
    • type "c" for FAT32 (LBA) for larger media.
  • Formatting the primary partition with mkfs.vfat(8) with:

    • just its device name, e.g. "/dev/sda1" for FAT16, or
    • the explicit option and its device name, e.g. "-F 32 /dev/sda1" for FAT32.

When using the FAT or ISO9660 filesystems for sharing data, the following should be safe practice:

  • Archiving files into an archive file first using tar(1), cpio(1), or afio(1) to retain the long filename, the symbolic link, the original Unix file permission and the owner information.
  • Splitting the archive file into less than 2 GiB chunks with the split(1) command to protect it from the file size limitation.
  • Encrypting the archive file to secure its contents from the unauthorized access.
[Note] Note

For FAT filesystems, by design, the maximum file size is (2^32 - 1) bytes = (4GiB - 1 byte). For some applications on older 32 bit OSes, the maximum file size was even smaller: (2^31 - 1) bytes = (2GiB - 1 byte). Debian does not suffer from the latter problem.
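
The splitting step suggested above can be rehearsed on a small file. This is a sketch only, assuming GNU coreutils; the 2500-byte "archive.tar" and the 1000-byte piece size are stand-ins chosen so that the example runs instantly.

```shell
# Largest file a FAT filesystem can hold: 2^32 - 1 bytes.
fat_max=$(( (1 << 32) - 1 ))
echo "FAT file size limit: $fat_max bytes"

work=$(mktemp -d)
# A 2500-byte stand-in for a large archive (hypothetical file name).
head -c 2500 /dev/zero > "$work/archive.tar"

# Cut into 1000-byte pieces named archive.tar.aa, archive.tar.ab, ...
split -b 1000 "$work/archive.tar" "$work/archive.tar."
pieces=$(ls "$work"/archive.tar.?? | wc -l)

# Rejoin in glob (alphabetical) order and compare with the original.
cat "$work"/archive.tar.?? > "$work/rejoined.tar"
if cmp -s "$work/archive.tar" "$work/rejoined.tar"; then
  rejoined=identical
else
  rejoined=differs
fi
echo "$pieces pieces, rejoined copy is $rejoined"
rm -rf "$work"
```

For real media, a piece size such as "-b 2000M" keeps every piece below the 2 GiB mark.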

[Note] Note

Microsoft itself does not recommend using FAT for drives or partitions of over 200 MB. Microsoft highlights its shortcomings, such as inefficient disk space usage, in their "Overview of FAT, HPFS, and NTFS File Systems". Of course, we should normally use the ext3 filesystem for Linux.

[Tip] Tip

For more on filesystems and accessing filesystems, please read "Filesystems HOWTO".

10.1.11. Sharing data via network

When sharing data with another system via the network, you should use a common service. Here are some hints.

Table 10.6. List of the network service to choose with the typical usage scenario.

network service typical usage scenario
SMB/CIFS network mounted filesystem with Samba Sharing files via "Microsoft Windows Network". See smb.conf(5) and The Official Samba 3.2.x HOWTO and Reference Guide or the samba-doc package.
NFS network mounted filesystem with the Linux kernel Sharing files via "Unix/Linux Network". See exports(5) and Linux NFS-HOWTO.
HTTP service Sharing files between the web server/client.
HTTPS service Sharing files between the web server/client with encrypted Secure Sockets Layer (SSL) or Transport Layer Security (TLS).
FTP service Sharing files between the FTP server/client.

Although these filesystems mounted over network and file transfer methods over network are quite convenient for sharing data, these may be insecure. Their network connection must be secured by:

10.1.12. Archive media

When choosing computer data storage media for important data archive, you should be careful about their limitations. For small personal data backup, I use CD-R and DVD-R from brand name companies and store them in a cool, shaded, dry, clean environment. (Tape archive media seem to be popular for professional use.)

[Note] Note

Fire-resistant safes are meant for paper documents. Most computer data storage media have less temperature tolerance than paper. I usually rely on multiple secure encrypted copies stored in multiple secure locations.

Optimistic storage life of archive media seen on the net (mostly from vendor info):

  • 100+ years : acid free paper with ink
  • 100 years : optical storage (CD/DVD, CD/DVD-R)
  • 30 years : magnetic storage (tape, floppy)
  • 20 years : phase change optical storage (CD-RW)

These figures do not account for mechanical failures due to handling etc.

Optimistic write cycle of archive media seen on the net (mostly from vendor info):

  • 250,000+ cycles : Harddisk drive
  • 10,000+ cycles : Flash memory
  • 1,000 cycles : CD/DVD-RW
  • 1 cycle : CD/DVD-R, paper
[Caution] Caution

Figures of storage life and write cycle here should not be used for decisions on any critical data storage. Please consult the specific product information provided by the manufacturer.

[Tip] Tip

Since CD/DVD-R and paper have only 1 write cycle, they inherently prevent accidental data loss by overwriting. This is an advantage!

[Tip] Tip

If you need fast and frequent backup of a large amount of data, a hard disk on a remote host linked by a fast network connection may be the only realistic option.

10.2. The binary data

Here, we discuss direct manipulation of the binary data on storage media. See Section 9.3, “Data storage tips”, too.

10.2.1. Making the disk image file

The disk image file, "disk.img", of an unmounted device, e.g., the second SCSI drive "/dev/sdb", can be made using cp(1) or dd(1):

# cp /dev/sdb disk.img
# dd if=/dev/sdb of=disk.img

The disk image of the traditional PC's master boot record (MBR) (see Section 9.3.1, “Disk partition configuration”), which resides on the first sector of the primary IDE disk, can be made by using dd(1):

# dd if=/dev/hda of=mbr.img bs=512 count=1
# dd if=/dev/hda of=mbr-nopart.img bs=446 count=1
# dd if=/dev/hda of=mbr-part.img skip=446 bs=1 count=66
  • "mbr.img": the MBR with the partition table.
  • "mbr-nopart.img": the MBR without the partition table.
  • "mbr-part.img": the partition table of the MBR only.
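
The byte ranges used by these dd(1) commands can be checked safely on an ordinary file instead of a real disk. In this sketch, a zero-filled 1 MiB file (a hypothetical stand-in) plays the role of "/dev/hda", so no root privilege is needed.

```shell
work=$(mktemp -d)
# A zero-filled 1 MiB file stands in for the disk (hypothetical name).
dd if=/dev/zero of="$work/disk.img" bs=512 count=2048 2>/dev/null

# Same dd invocations as for the real device, reading the stand-in file.
dd if="$work/disk.img" of="$work/mbr.img" bs=512 count=1 2>/dev/null
dd if="$work/disk.img" of="$work/mbr-nopart.img" bs=446 count=1 2>/dev/null
dd if="$work/disk.img" of="$work/mbr-part.img" skip=446 bs=1 count=66 2>/dev/null

# 446 bytes of boot code + 66 bytes (partition table and signature) = 512.
size_mbr=$(wc -c < "$work/mbr.img")
size_nopart=$(wc -c < "$work/mbr-nopart.img")
size_part=$(wc -c < "$work/mbr-part.img")
echo "$size_mbr = $size_nopart + $size_part"
rm -rf "$work"
```

Note that "skip=" counts in units of "bs", which is why the partition table extraction uses "bs=1" to skip exactly 446 bytes.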

If you have a SCSI device (including the new serial ATA drive) as the boot disk, substitute "/dev/hda" with "/dev/sda".

If you are making an image of a disk partition of the original disk, substitute "/dev/hda" with "/dev/hda1" etc.

10.2.2. Writing directly to the disk

The disk image file, "disk.img", can be written to an unmounted device, e.g., the second SCSI drive "/dev/sdb" with matching size, by dd(1):

# dd if=disk.img of=/dev/sdb

Similarly, the disk partition image file, "disk.img", can be written to an unmounted partition, e.g., the first partition of the second SCSI drive "/dev/sdb1" with matching size, by dd(1):

# dd if=disk.img of=/dev/sdb1

10.2.3. Mounting the disk image file

If "disk.img" contains an image of the disk contents and the original disk had a disk configuration which gives xxxx = (bytes/sector) * (sectors/cylinder), then the following will mount it to "/mnt":

# mount -o loop,offset=xxxx disk.img /mnt

Note that most hard disks have 512 bytes/sector. This offset is to skip the MBR of the hard disk. You can omit the offset option in the above example if "disk.img" contains

  • only an image of a disk partition of the original hard disk, or
  • only an image of the original floppy disk.

10.2.4. Making an empty disk image file

An empty disk image file, "sparse", which can grow up to 5GiB can be made using dd(1) and mke2fs(8):

$ dd bs=1 count=0 if=/dev/zero of=sparse seek=5G
$ /sbin/mke2fs sparse
mke2fs 1.41.3 (12-Oct-2008)
sparse is not a block special device.
Proceed anyway? (y,n) y
...
$ du  --apparent-size -h sparse
5.0G  sparse
$ du -h sparse
83M sparse

For "sparse", its file size is 5.0 GiB and its actual disk usage is a mere 83MiB. This discrepancy is possible since ext2fs can hold sparse files.

[Tip] Tip

The actual disk usage of a sparse file grows with the data written to it.
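
The difference between apparent size and allocated space can be inspected without mke2fs(8). This sketch assumes GNU dd(1) and GNU stat(1), and uses a smaller, 5 MiB file in a scratch directory; how few blocks are actually allocated depends on the underlying filesystem.

```shell
work=$(mktemp -d)
# seek= moves the write position without writing any data, so the
# file gets its full length while (almost) no blocks are allocated.
dd bs=1 count=0 if=/dev/zero of="$work/sparse" seek=5M 2>/dev/null

# %s is the length in bytes; %b blocks of %B bytes are really allocated.
apparent=$(stat -c %s "$work/sparse")
allocated=$(( $(stat -c %b "$work/sparse") * $(stat -c %B "$work/sparse") ))
echo "apparent size: $apparent bytes, allocated: $allocated bytes"
rm -rf "$work"
```

These two numbers correspond to what "du --apparent-size" and plain "du" report.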

10.2.5. Viewing and editing binary data

The most basic way to view binary data is to use the "od -t x1" command.
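
As a small illustration, assuming od(1) from GNU coreutils, the following dumps the four bytes of the string "ABC" plus a newline. The input string is arbitrary.

```shell
# Dump four bytes as hexadecimal; the left column is the octal offset.
printf 'ABC\n' | od -t x1

# -An suppresses the offset column, which is handy in scripts.
hex=$(printf 'ABC\n' | od -An -t x1 | tr -d ' \n')
echo "$hex"
```

The last command prints "4142430a": one two-digit hexadecimal value per input byte, ending with the newline character 0a.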

Table 10.7. List of packages which view and edit binary data.

package popcon size description
coreutils V:91, I:99 12868 basic package which has od(1) to dump files in octal and other formats
bsdmainutils V:66, I:99 644 utility package which has hd(1) to dump files in ASCII, decimal, hexadecimal, and octal formats
hexedit V:0.3, I:1.9 108 viewing and editing files in hexadecimal or in ASCII
bless V:0.05, I:0.3 1240 full featured hexadecimal editor (GNOME)
khexedit V:1.4, I:10 NOT_FOUND full featured hexadecimal editor (KDE)
okteta V:0.3, I:3 1252 full featured hexadecimal editor (KDE4)
ncurses-hexedit V:0.09, I:0.6 192 editing files/disks in HEX, ASCII and EBCDIC
lde V:0.04, I:0.4 992 Linux Disk Editor
beav V:0.04, I:0.3 164 binary editor and viewer for HEX, ASCII, EBCDIC, OCTAL, DECIMAL, and BINARY formats
hex V:0.01, I:0.10 84 hexadecimal dumping tool for Japanese

[Tip] Tip

HEX is used as an acronym for hexadecimal format.

10.2.6. Manipulating files without mounting disk

There are tools to read and write files without mounting the disk.

Table 10.8. List of packages to manipulate files without mounting.

package popcon size description
mtools V:4, I:53 412 utilities for MSDOS files without mounting them.
hfsutils V:0.19, I:1.7 236 utilities for HFS and HFS+ files without mounting them.

10.2.7. Data redundancy

Linux's software RAID systems provide data redundancy at the kernel filesystem level to achieve high levels of storage reliability.

There are also tools to add data redundancy to files at the application program level to achieve high levels of storage reliability.

Table 10.9. List of tools to add data redundancy to files.

package popcon size description
par2 V:0.5, I:1.7 284 Parity Archive Volume Set, for checking and repair of files
dvdisaster V:0.16, I:0.9 1272 data loss/scratch/aging protection for CD/DVD media
ras V:0.01, I:0.2 92 utility to add redundancy files to archives for data recovery
dvbackup V:0.02, I:0.13 544 backup tool using MiniDV camcorders. This provides rsbep(1)
vdmfec V:0.00, I:0.02 88 recover lost blocks using Forward Error Correction

10.2.8. Data file recovery and forensic analysis

There are tools for data file recovery and forensic analysis.

Table 10.10. List of packages for data file recovery and forensic analysis.

package popcon size description
testdisk V:0.3, I:3 4616 utilities for partition scan and disk recovery
magicrescue V:0.07, I:0.5 336 utility to recover files by looking for magic bytes
scalpel V:0.03, I:0.2 124 frugal, high performance file carver
recover V:0.09, I:0.8 104 utility to undelete files on the ext2 filesystem
e2undel V:0.07, I:0.6 240 utility to undelete files on the ext2 filesystem
ext3grep V:0.07, I:0.5 300 tool to help recover deleted files on the ext3 filesystem
scrounge-ntfs V:0.03, I:0.4 80 data recovery program for NTFS filesystems
gzrt V:0.02, I:0.16 68 gzip recovery toolkit
sleuthkit V:0.14, I:0.6 4872 tools for forensics analysis. (Sleuthkit)
autopsy V:0.06, I:0.4 1372 graphical interface to SleuthKit
foremost V:0.09, I:0.6 140 forensics application to recover data
tct V:0.04, I:0.2 604 forensics related utilities
dcfldd V:0.03, I:0.16 124 enhanced version of dd for forensics and security
rdd V:0.02, I:0.13 200 forensic copy program

10.2.9. Making the ISO9660 image file

The ISO9660 image file, "cd.iso", from the source directory tree at "source_directory" can be made using genisoimage(1):

#  genisoimage -r -J -T -V volume_id -o cd.iso source_directory

Similarly, the bootable ISO9660 image file, "cdboot.iso", can be made from a debian-installer-like directory tree at "source_directory":

#  genisoimage -r -o cdboot.iso -V volume_id \
   -b isolinux/isolinux.bin -c isolinux/boot.cat \
   -no-emul-boot -boot-load-size 4 -boot-info-table source_directory

Here Isolinux boot loader (see Section 3.3, “Stage 2: the boot loader”) is used for booting.

Making the disk image directly from the CD-ROM device using cp(1) or dd(1) has a few problems. The first run of dd(1) may cause an error message and may yield a shorter disk image with a lost tail-end. The second run of dd(1) may yield a larger disk image with garbage data attached at the end on some systems if the data size is not specified. Only the second run of dd(1) with the correct data size specified, and without ejecting the CD after an error message, seems to avoid these problems. If, for example, the image size displayed by df(1) is 46301184 blocks, use the following command twice to get the right image (this is based on my empirical observation):

# dd if=/dev/cdrom of=cd.iso bs=2048 count=$((46301184/2))
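The division by 2 in the count can be spelled out as follows, assuming that df(1) reports the size in 1KiB blocks while dd(1) is run with 2048-byte blocks. The variable names are illustrative, and the sketch only prints the resulting command.

```shell
df_blocks=46301184      # size reported by df(1), assumed to be in 1 KiB blocks
df_block_size=1024
dd_block_size=2048      # block size passed to dd(1) as bs=

# Convert the df(1) block count into the dd(1) block count.
count=$((df_blocks * df_block_size / dd_block_size))

echo "dd if=/dev/cdrom of=cd.iso bs=${dd_block_size} count=${count}"
```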

10.2.10. Writing directly to the CD/DVD-R/RW

[Tip] Tip

DVD is only a large CD to wodim(1).

You can find a usable device by:

# wodim --devices

Then insert a blank CD-R into the device and write the ISO9660 image file, "cd.iso", to this device, e.g., "/dev/hda", with wodim(1):

# wodim -v -eject dev=/dev/hda cd.iso

If CD-RW is used instead of CD-R, do this instead:

# wodim -v -eject blank=fast dev=/dev/hda cd.iso
[Tip] Tip

If your desktop system mounts the CD automatically, unmount it by "sudo umount /dev/hda" before using wodim(1).

10.2.11. Mounting the ISO9660 image file

If "cd.iso" contains an ISO9660 image, then the following will manually mount it to "/cdrom":

# mount -t iso9660 -o ro,loop cd.iso /cdrom
[Tip] Tip

Modern desktop systems mount removable media automatically (see Section 10.1.10, “Removable mass storage device”).

10.2.12. Splitting a large file into small files

When a file is too big to back up as a single file, you can split it into chunks of, e.g., 2000MiB, back up the chunks, and later merge them back into the original file.

$ split -b 2000m large_file
$ cat x* >large_file
[Caution] Caution

Please make sure you do not have any other files starting with "x" to avoid a file name clash.
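One way to avoid the name clash is to give split(1) an explicit output prefix and to verify the merged result with cmp(1). The following is a sketch on a small random stand-in file; the 3MiB size, the 1MiB chunk size, and the "large_file.part." prefix are illustrative assumptions.

```shell
workdir=$(mktemp -d)

# Stand-in for the large file: 3 MiB of random data.
dd if=/dev/urandom of="$workdir/large_file" bs=1M count=3 2>/dev/null

# Split into 1 MiB chunks named large_file.part.aa, large_file.part.ab, ...
split -b 1M "$workdir/large_file" "$workdir/large_file.part."
chunks=$(ls "$workdir"/large_file.part.* | wc -l)

# The glob expands in sorted order, so cat(1) reassembles the chunks in sequence.
cat "$workdir"/large_file.part.* > "$workdir/restored"

if cmp -s "$workdir/large_file" "$workdir/restored"; then
    result=identical
else
    result=different
fi
echo "$chunks chunks, restored copy is $result"
rm -r "$workdir"
```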

10.2.13. Clearing file contents

In order to clear the contents of a file such as a log file, do not use rm(1) to delete the file and then create a new empty file, because the file may still be accessed in the interval between commands. The following is the safe way to clear the contents of the file.

$ :>file_to_be_cleared
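The reason this is safe is that the file is truncated in place, so any process holding it open keeps writing to the same inode. A small sketch with a hypothetical log file in a temporary directory shows that the inode number does not change (assuming GNU stat(1)):

```shell
workdir=$(mktemp -d)
log="$workdir/app.log"              # hypothetical log file
echo "old entry" > "$log"

inode_before=$(stat -c %i "$log")
: > "$log"                          # truncate in place
inode_after=$(stat -c %i "$log")
size_after=$(stat -c %s "$log")

echo "inode ${inode_before} -> ${inode_after}, size now ${size_after}"
rm -r "$workdir"
```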

10.2.14. Dummy files

The following commands will create dummy or empty files:

$ dd if=/dev/zero    of=5kb.file bs=1k count=5
$ dd if=/dev/urandom of=7mb.file bs=1M count=7
$ touch zero.file
$ : > alwayszero.file
  • "5kb.file" is 5KiB of zeros.
  • "7mb.file" is 7MiB of random data.
  • "zero.file" is a 0 byte file (if the file already exists, its contents are kept and only its mtime is updated.)
  • "alwayszero.file" is always a 0 byte file (if the file already exists, its contents are discarded and its mtime is updated.)
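The difference between the last two commands shows up only when the file already exists. The following sketch pre-fills both files with 4 bytes in a temporary directory and then applies each command:

```shell
workdir=$(mktemp -d)

# Both files start with 4 bytes of content ("foo" plus a newline).
echo foo > "$workdir/zero.file"
echo foo > "$workdir/alwayszero.file"

touch "$workdir/zero.file"          # updates mtime, keeps contents
: > "$workdir/alwayszero.file"      # truncates to 0 bytes

touch_size=$(wc -c < "$workdir/zero.file")
truncate_size=$(wc -c < "$workdir/alwayszero.file")
echo "after touch: ${touch_size} bytes, after ': >': ${truncate_size} bytes"
rm -r "$workdir"
```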

10.2.15. Erasing an entire hard disk

There are several ways to completely erase data from an entire hard-disk-like device, e.g., a USB memory stick at "/dev/sda".

[Caution] Caution

Check your USB memory stick location with mount(8) first before executing commands here. The device pointed to by "/dev/sda" may be a SCSI hard disk or serial-ATA hard disk where your entire system resides.

  • Erase all by resetting data to 0:
# dd if=/dev/zero of=/dev/sda
  • Erase all by overwriting random data:
# dd if=/dev/urandom of=/dev/sda
  • Erase all by overwriting random data very efficiently (fast):
# shred -v -n 1 /dev/sda

Since dd(1) is available from the shell of many bootable Linux CDs such as Debian installer CD, you can erase your installed system completely by running an erase command from such media on the system hard disk, e.g., "/dev/hda", "/dev/sda", etc.

10.2.16. Undeleting deleted but still open files

Even if you have accidentally deleted a file, as long as that file is still being used by some application (read or write mode), it is possible to recover such a file.

  • On one terminal:
$ echo foo > bar
$ less bar
  • Then on another terminal:
$ ps aux | grep ' less[ ]'
bozo    4775  0.0  0.0  92200   884 pts/8    S+   00:18   0:00 less bar
$ rm bar
$ ls -l /proc/4775/fd | grep bar
lr-x------ 1 bozo bozo 64 2008-05-09 00:19 4 -> /home/bozo/bar (deleted)
$ cat /proc/4775/fd/4 >bar
$ ls -l
-rw-r--r-- 1 bozo bozo 4 2008-05-09 00:25 bar
$ cat bar
foo
  • Alternatively, when you have the lsof package installed, on another terminal:
$ ls -li bar
2228329 -rw-r--r-- 1 bozo bozo 4 2008-05-11 11:02 bar
$ lsof |grep bar|grep less
less 4775 bozo 4r REG 8,3 4 2228329 /home/bozo/bar
$ rm bar
$ lsof |grep bar|grep less
less 4775 bozo 4r REG 8,3 4 2228329 /home/bozo/bar (deleted)
$ cat /proc/4775/fd/4 >bar
$ ls -li bar
2228302 -rw-r--r-- 1 bozo bozo 4 2008-05-11 11:05 bar
$ cat bar
foo
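The same recovery can be rehearsed non-interactively. In this sketch the shell itself keeps the file open on descriptor 3, standing in for less(1), and cat(1) reads the deleted file back through its inherited descriptor under "/proc/self/fd". It assumes a Linux system with "/proc" mounted; the file names are illustrative.

```shell
workdir=$(mktemp -d)
echo foo > "$workdir/bar"

exec 3< "$workdir/bar"      # keep the file open on descriptor 3
rm "$workdir/bar"           # the name is gone, the data is still held open

# cat inherits descriptor 3, so /proc/self/fd/3 points at the deleted file.
cat /proc/self/fd/3 > "$workdir/bar.recovered"
exec 3<&-                   # close the descriptor

recovered=$(cat "$workdir/bar.recovered")
echo "$recovered"
rm -r "$workdir"
```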

10.2.17. Searching all hardlinks

Files with hardlinks can be identified by "ls -li", e.g.:

$ ls -li
total 0
2738405 -rw-r--r-- 1 root root 0 2008-09-15 20:21 bar
2738404 -rw-r--r-- 2 root root 0 2008-09-15 20:21 baz
2738404 -rw-r--r-- 2 root root 0 2008-09-15 20:21 foo

Both "baz" and "foo" have a link count of "2" (>1), which shows that they have hardlinks. They share the inode number "2738404", which means they are the same hardlinked file. If you cannot find all hardlinked files by chance, you can search for them by the inode number, e.g., "2738404":

# find /path/to/mount/point -xdev -inum 2738404
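A self-contained rehearsal of this search, using a temporary directory instead of a real mount point, might look like the following. The file names mirror the listing above, and GNU stat(1) is assumed for reading the inode number.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/sub"
touch "$workdir/foo" "$workdir/bar"
ln "$workdir/foo" "$workdir/sub/baz"    # second name for the same inode

# Look up the inode of one name, then find every name sharing it.
inode=$(stat -c %i "$workdir/foo")
links=$(find "$workdir" -xdev -inum "$inode" | sort)
link_count=$(printf '%s\n' "$links" | wc -l)

echo "$links"
rm -r "$workdir"
```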

10.2.18. Invisible disk space consumption

All deleted but open files consume disk space although they are not visible from normal du(1). They can be listed with their size by:

# lsof -s -X / |grep deleted

10.3. Data security infrastructure

The data security infrastructure is provided by the combination of data encryption tool, message digest tool, and signature tool.

Table 10.11. List of data security infrastructure tools.

command package popcon size description
gpg(1) gnupg V:37, I:99 5072 GNU privacy guard - OpenPGP encryption and signing tool
N/A gnupg-doc I:1.5 4124 GNU Privacy Guard documentation
gpgv(1) gpgv V:57, I:98 392 GNU privacy guard - signature verification tool
paperkey(1) paperkey V:0.01, I:0.15 88 extract just the secret information out of OpenPGP secret keys
cryptsetup(8), … cryptsetup V:3, I:4 912 utilities for dm-crypt block device encryption supporting LUKS
ecryptfs(7), … ecryptfs-utils V:0.09, I:0.2 444 utilities for ecryptfs stacked filesystem encryption
md5sum(1) coreutils V:91, I:99 12868 compute and check MD5 message digest
sha1sum(1) coreutils V:91, I:99 12868 compute and check SHA1 message digest
openssl(1ssl) openssl V:31, I:90 2360 compute message digest with "openssl dgst" (OpenSSL)

See Section 9.4, “Data encryption tips” on dm-crypt and ecryptfs, which implement automatic data encryption infrastructure via Linux kernel modules.

10.3.1. Key management for GnuPG

Here are GNU Privacy Guard commands for the basic key management:

Table 10.12. List of GNU Privacy Guard commands for the key management

command description
gpg --gen-key generate a new key
gpg --gen-revoke my_user_ID generate revocation certificate for my_user_ID
gpg --edit-key user_ID edit key interactively, "help" for help
gpg -o file --export export all keys to file
gpg --import file import all keys from file
gpg --send-keys user_ID send key of user_ID to keyserver
gpg --recv-keys user_ID recv. key of user_ID from keyserver
gpg --list-keys user_ID list keys of user_ID
gpg --list-sigs user_ID list sig. of user_ID
gpg --check-sigs user_ID check sig. of user_ID
gpg --fingerprint user_ID check fingerprint of "user_ID"
gpg --refresh-keys update local keyring

Here is the meaning of trust code:

Table 10.13. List of the meaning of trust code.

code trust
- No owner trust assigned / not yet calculated.
e Trust calculation has failed.
q Not enough information for calculation.
n Never trust this key.
m Marginally trusted.
f Fully trusted.
u Ultimately trusted.

The following will upload my key "A8061F32" to the popular keyserver "hkp://subkeys.pgp.net":

$ gpg --keyserver hkp://subkeys.pgp.net --send-keys A8061F32

A good default keyserver set up in "~/.gnupg/gpg.conf" (or old location "~/.gnupg/options") contains:

keyserver hkp://subkeys.pgp.net

The following will obtain unknown keys from the keyserver:

$ gpg --list-sigs | \
  sed -n '/^sig.*\[User ID not found\]/s/^sig..........\(\w\w*\)\W.*/\1/p' |\
  sort | uniq | xargs gpg --recv-keys

There was a bug in OpenPGP Public Key Server (pre version 0.9.6) which corrupted key with more than 2 sub-keys. The newer gnupg (>1.2.1-2) package can handle these corrupted subkeys. See gpg(1) under "--repair-pks-subkey-bug" option.

10.3.2. Using GnuPG with files

File handling:

Table 10.14. List of GNU Privacy Guard commands on files

command description
gpg -a -s file sign file into ascii armored file.asc
gpg --armor --sign file , ,
gpg --clearsign file clear-sign message
gpg --clearsign --not-dash-escaped patchfile clear-sign patchfile
gpg --verify file verify clear-signed file
gpg -o file.sig -b file create detached signature
gpg -o file.sig --detach-sig file , ,
gpg --verify file.sig file verify file with file.sig
gpg -o crypt_file.gpg -r name -e file public-key encryption intended for name from file to binary crypt_file.gpg
gpg -o crypt_file.gpg --recipient name --encrypt file , ,
gpg -o crypt_file.asc -a -r name -e file public-key encryption intended for name from file to ASCII armored crypt_file.asc
gpg -o crypt_file.gpg -c file symmetric encryption from file to crypt_file.gpg
gpg -o crypt_file.gpg --symmetric file , ,
gpg -o crypt_file.asc -a -c file symmetric encryption from file to ASCII armored crypt_file.asc
gpg -o file -d crypt_file.gpg -r name decryption
gpg -o file --decrypt crypt_file.gpg , ,

10.3.3. Using GnuPG with Mutt

Add the following to "~/.muttrc" to keep a slow GnuPG from automatically starting, while allowing it to be used by typing "S" at the index menu.

macro index S ":toggle pgp_verify_sig\n"
set pgp_verify_sig=no

10.3.4. Using GnuPG with Vim

The gnupg plugin lets you run GnuPG transparently for files with extension ".gpg", ".asc", and ".pgp".

# aptitude install vim-scripts vim-addon-manager
$ vim-addons install gnupg

10.3.5. The MD5 sum

md5sum(1) provides a utility to make a digest file using the method in RFC 1321 and to verify each file with it.

$ md5sum foo bar >baz.md5
$ cat baz.md5
d3b07384d113edec49eaa6238ad5ff00  foo
c157a79031e1c40f85931829bc5fc552  bar
$ md5sum -c baz.md5
foo: OK
bar: OK
[Note] Note

The computation for the MD5 sum is less CPU intensive than the one for the cryptographic signature by GNU Privacy Guard (GnuPG). Usually, only the top level digest file is cryptographically signed to ensure data integrity.
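What the check reports when a file has changed can be seen with a small sketch. It recreates "foo" and "bar" in a temporary directory, appends to "bar" to simulate corruption, and then re-runs the verification; the file contents are illustrative.

```shell
workdir=$(mktemp -d)
echo foo > "$workdir/foo"
echo bar > "$workdir/bar"

# Record the digests, using relative names as in the example above.
(cd "$workdir" && md5sum foo bar > baz.md5)

# Simulate corruption of one file.
echo tampered >> "$workdir/bar"

# md5sum -c exits with a non-zero status when any file fails the check.
status=0
report=$(cd "$workdir" && md5sum -c baz.md5 2>/dev/null) || status=$?

echo "$report"
echo "exit status: $status"
rm -r "$workdir"
```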

10.4. Source code merge tools

There are many merge tools for source code. The following commands caught my eye:

Table 10.15. List of source code merge tools.

command package popcon size description
diff(1) diff V:90, I:99 764 compare files line by line
diff3(1) diff V:90, I:99 764 compare and merge three files line by line
vimdiff(1) vim V:14, I:30 1684 compare 2 files side by side in vim
patch(1) patch V:11, I:93 204 apply a diff file to an original
dpatch(1) dpatch V:2, I:15 344 manage series of patches for Debian package
diffstat(1) diffstat V:2, I:14 84 produce a histogram of changes by the diff
combinediff(1) patchutils V:2, I:15 292 create a cumulative patch from two incremental patches
dehtmldiff(1) patchutils V:2, I:15 292 extract a diff from an HTML page
filterdiff(1) patchutils V:2, I:15 292 extract or exclude diffs from a diff file
fixcvsdiff(1) patchutils V:2, I:15 292 fix diff files created by CVS that "patch" mis-interprets
flipdiff(1) patchutils V:2, I:15 292 exchange the order of two patches
grepdiff(1) patchutils V:2, I:15 292 show which files are modified by a patch matching a regex
interdiff(1) patchutils V:2, I:15 292 show differences between two unified diff files
lsdiff(1) patchutils V:2, I:15 292 show which files are modified by a patch
recountdiff(1) patchutils V:2, I:15 292 recompute counts and offsets in unified context diffs
rediff(1) patchutils V:2, I:15 292 fix offsets and counts of a hand-edited diff
splitdiff(1) patchutils V:2, I:15 292 separate out incremental patches
unwrapdiff(1) patchutils V:2, I:15 292 demangle patches that have been word-wrapped
wiggle(1) wiggle V:0.03, I:0.11 204 apply rejected patches
quilt(1) quilt V:1.1, I:6 856 manage series of patches
meld(1) meld V:0.5, I:2 2304 compare and merge files (GTK)
xxdiff(1) xxdiff V:0.2, I:1.1 1352 graphical file comparator and merge tool (plain X)
dirdiff(1) dirdiff V:0.08, I:0.5 224 display differences and merge changes between directory trees
docdiff(1) docdiff V:0.02, I:0.19 688 compare two files word by word / char by char
imediff2(1) imediff2 V:0.01, I:0.10 76 interactive full screen 2-way merge tool
makepatch(1) makepatch V:0.02, I:0.2 148 generate extended patch files
applypatch(1) makepatch V:0.02, I:0.2 148 apply extended patch files
wdiff(1) wdiff V:1.9, I:14 124 display word differences between text files

10.4.1. Extracting differences for source files

Following one of these procedures will extract differences between two source files and create unified diff files "file.patch0" or "file.patch1" depending on the file location:

$ diff -u file.old file.new > file.patch0
$ diff -u old/file new/file > file.patch1
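As an illustration of what ends up in such a file, the following sketch creates two small versions of a file in a temporary directory and extracts their difference. The file contents are made up for the example. Note that diff(1) exits with status 1 when the files differ, which is not an error here.

```shell
workdir=$(mktemp -d)
printf 'alpha\nbeta\ngamma\n' > "$workdir/file.old"
printf 'alpha\nBETA\ngamma\n' > "$workdir/file.new"

# Exit status 1 only means "differences found".
(cd "$workdir" && diff -u file.old file.new > file.patch0) || true

# A unified diff marks removed lines with "-" and added lines with "+".
removed=$(grep -c '^-beta$' "$workdir/file.patch0")
added=$(grep -c '^+BETA$' "$workdir/file.patch0")

cat "$workdir/file.patch0"
rm -r "$workdir"
```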

10.4.2. Merging updates for source files

The diff file (alternatively called patch file) is used to send a program update. The receiving party will apply this update to another file by:

$ patch -p0 file < file.patch0
$ patch -p1 file < file.patch1

10.4.3. Updating via 3-way-merge

If you have three versions of a source code, you can perform 3-way-merge effectively using diff3(1):

$ diff3 -m file.mine file.old file.yours > file
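A minimal sketch of a clean merge follows. Starting from a common ancestor "file.old", one side changes the first line and the other side changes the last line, so diff3(1) can combine both changes without conflict. The file contents are illustrative.

```shell
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\nfour\nfive\n' > "$workdir/file.old"
printf 'ONE\ntwo\nthree\nfour\nfive\n' > "$workdir/file.mine"
printf 'one\ntwo\nthree\nfour\nFIVE\n' > "$workdir/file.yours"

# diff3 -m exits with 0 when the merge has no conflicts.
merge_status=0
(cd "$workdir" && diff3 -m file.mine file.old file.yours > file) || merge_status=$?

# Join the merged lines with spaces for a compact display.
merged=$(paste -sd ' ' "$workdir/file")
echo "status=${merge_status} merged=${merged}"
rm -r "$workdir"
```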

10.5. Version control systems

Here is a summary of the version control systems (VCS) on the Debian system:

[Note] Note

If you are new to VCS, you should start learning with Git, which is growing fast in popularity.

Table 10.16. List of version control system tools.

package popcon size tool VCS type comment
cssc V:0.01, I:0.05 2168 CSSC local Clone of the Unix SCCS (deprecated)
rcs V:1.6, I:9 772 RCS local "Unix SCCS done right"
cvs V:4, I:24 3660 CVS remote The previous standard remote VCS
subversion V:10, I:32 4248 Subversion remote "CVS done right", the new de facto standard remote VCS
git-core V:6, I:10 14036 Git distributed fast DVCS in C (used by the Linux kernel and others)
mercurial V:0.9, I:4 332 Mercurial distributed DVCS in python and some C.
bzr V:0.5, I:2 20300 Bazaar distributed DVCS influenced by tla written in python (used by Ubuntu)
darcs V:0.2, I:1.6 8104 Darcs distributed DVCS with smart algebra of patches (slow).
tla V:0.17, I:1.4 1100 GNU arch distributed DVCS mainly by Tom Lord. (Historic)
monotone V:0.04, I:0.4 4752 Monotone distributed DVCS in C++

VCS is sometimes known as revision control system (RCS), or software configuration management (SCM).

Distributed VCS such as Git is the tool of choice these days. CVS and Subversion may still be useful to join some existing open source program activities.

Debian provides free VCS services via Debian Alioth service. It supports practically all VCSs. Its documentation can be found at http://wiki.debian.org/Alioth .

[Caution] Caution

The git package is "GNU Interactive Tools" which is not the DVCS.

10.5.1. Comparison of VCS commands

Here is an oversimplified comparison of native VCS commands to provide the big picture. The typical command sequence may require options and arguments.

Table 10.17. Comparison of native VCS commands.

CVS Subversion Git function
cvs init svnadmin create git init create the (local) repository
cvs login - - login to the remote repository
cvs co svn co git clone check out the remote repository as the working tree
cvs up svn up git pull update the working tree by merging the remote repository
cvs add svn add git add . add file(s) in the working tree to the VCS
cvs rm svn rm git rm remove file(s) in working tree from the VCS
cvs ci svn ci - commit changes to the remote repository
- - git commit -a commit changes to the local repository
- - git push update the remote repository by the local repository
cvs status svn status git status display the working tree status from the VCS
cvs diff svn diff git diff diff <reference_repository> <working_tree>
- - git repack -a -d; git prune repack the local repository into single pack.

[Caution] Caution

Invoking a git subcommand as "git-xyzzy" from the command line has been deprecated since early 2006.

[Tip] Tip

Git can work directly with different VCS repositories such as ones provided by CVS and Subversion, and provides the local repository for local changes with git-cvs and git-svn packages. See git for CVS users, Git for GNOME developers and Section 10.8, “Git”.

[Tip] Tip

Git has commands which have no equivalents in CVS and Subversion. "Fetch", "Rebase", "Cherrypick", …

10.6. CVS

Check

  • cvs(1),
  • "/usr/share/doc/cvs/html-cvsclient",
  • "/usr/share/doc/cvs/html-info",
  • "/usr/share/doc/cvsbook", and
  • "info cvs", for detailed information.

10.6.1. Installing a CVS server

The following setup will allow commits to the CVS repository only by a member of the "src" group, and administration of CVS only by a member of the "staff" group, thus reducing the chance of shooting oneself in the foot.

# cd /var/lib; umask 002; mkdir cvs
# export CVSROOT=/var/lib/cvs
# cd $CVSROOT
# chown root:src .
# chmod 2775 .
# cvs -d $CVSROOT init
# cd CVSROOT
# chown -R root:staff .
# chmod 2775 .
# touch val-tags
# chmod 664 history val-tags
# chown root:src history val-tags

You may restrict creation of new projects by changing the owner of the "$CVSROOT" directory to "root:staff" and its permission to "3775".

10.6.2. Using local CVS server

The following will set up shell environments for the local access to the CVS repository:

$ export CVSROOT=/var/lib/cvs

10.6.3. Using remote CVS pserver

The following will set up shell environments for the read-only remote access to the CVS repository without SSH (use RSH protocol capability in cvs(1)):

$ export CVSROOT=:pserver:account@cvs.foobar.com:/var/lib/cvs
$ cvs login

This is prone to eavesdropping attack.

10.6.4. Anonymous CVS (download only)

The following will set up shell environments for the read-only remote access to the CVS repository:

$ export CVSROOT=:pserver:anonymous@cvs.sf.net:/cvsroot/qref
$ cvs login
$ cvs -z3 co qref

10.6.5. Using remote CVS through ssh

The following will set up shell environments for the read-only remote access to the CVS repository with SSH:

$ export CVSROOT=:ext:account@cvs.foobar.com:/var/lib/cvs

You can also use public key authentication for SSH which eliminates the password prompt.

10.6.6. Creating a new CVS archive

Let's assume the following:

Table 10.18. Assumption for the CVS archive.

ITEM VALUE MEANING
source tree ~/project-x All source codes
Project name project-x Name for this project
Vendor Tag Main-branch Tag for the entire branch
Release Tag Release-initial Tag for a specific release

  • start project-x by:
$ cd ~/project-x
  • create a source tree …
$ cvs import -m "Start project-x" project-x Main-branch Release-initial
$ cd ..; rm -R ~/project-x

10.6.7. Working with CVS

To work with project-x using the local CVS repository:

$ mkdir -p /path/to; cd /path/to
$ cvs co project-x
  • get sources from CVS to local
$ cd project-x
  • make changes to the content …
$ cvs diff -u
  • similar to "diff -u repository/ local/"
$ cvs up -C modified_file
  • undo changes to a file
$ cvs ci -m "Describe change"
  • save local sources to CVS
$ vi newfile_added
$ cvs add newfile_added
$ cvs ci -m "Added newfile_added"
$ cvs up
  • merge latest version from CVS.
  • To create all newly created subdirectories from CVS, use "cvs up -d -P" instead.
  • Watch out for lines starting with "C filename" which indicates conflicting changes.
  • unmodified code is moved to .#filename.version .
  • search for "<<<<<<<" and ">>>>>>>" in the files for conflicting changes.
  • edit file to fix conflicts.
$ cvs tag Release-1
  • add release tag
  • edit further …
$ cvs tag -d Release-1
  • remove release tag
$ cvs ci -m "more comments"
$ cvs tag Release-1

  • re-add release tag

$ cd /path/to
$ cvs co -r Release-initial -d old project-x
  • get original version to "/path/to/old" directory
$ cd old
$ cvs tag -b Release-initial-bugfixes
  • create branch (-b) tag "Release-initial-bugfixes"
  • now you can work on the old version (Tag is sticky)
$ cvs update -d -P
  • don't create empty directories
  • source tree now has sticky tag "Release-initial-bugfixes"
  • work on this branch … while someone else making changes too
$ cvs up -d -P
  • sync with files modified by others on this branch
$ cvs ci -m "check into this branch"
$ cvs update -kk -A -d -P
  • remove sticky tag and forget contents
  • update from main trunk without keyword expansion
$ cvs update -kk -d -P -j Release-initial-bugfixes
  • merge from Release-initial-bugfixes branch into the main
  • trunk without keyword expansion. Fix conflicts with editor.
$ cvs ci -m "merge Release-initial-bugfixes"
$ cd /path/to
$ tar -cvzf old-project-x.tar.gz old
  • make archive. use "-j" if you want ".tar.bz2".
$ cvs release -d old
  • remove local source (optional)

Table 10.19. Notable options for CVS commands (use as first argument(s) to cvs(1)).

option meaning
-n dry run, no effect
-t display messages showing steps of cvs activity

10.6.8. Exporting files from CVS

To get the latest version from CVS, use "tomorrow":

$ cvs ex -D tomorrow module_name

10.6.9. Administration of CVS

Add alias to a project (local server):

$ export CVSROOT=/var/lib/cvs
$ cvs co CVSROOT/modules
$ cd CVSROOT
$ echo "px -a project-x" >>modules
$ cvs ci -m "Now px is an alias for project-x"
$ cvs release -d .
$ cvs co -d project px
  • check out project-x (alias:px) from CVS to directory project
$ cd project
  • make changes to the content …

In order to perform the above procedure, you should have the appropriate file permission.

10.6.10. File permissions in CVS repository

CVS will not overwrite the current repository file but replaces it with another one. Thus, write permission to the repository directory is critical. For every new repository creation, run the following to ensure this condition if needed.

# cd /var/lib/cvs
# chown -R root:src repository
# chmod -R ug+rwX   repository
# chmod    2775     repository

10.6.11. Execution bit

A file's execution bit is retained when checked out. Whenever you see execution permission problems in checked-out files, change permissions of the file in the CVS repository with the following command.

# chmod ugo-x filename

10.7. Subversion

Subversion is a next-generation version control system, intended to replace CVS, so it has most of CVS's features. Generally, Subversion's interface to a particular feature is similar to CVS's, except where there's a compelling reason to do otherwise.

10.7.1. Installing a Subversion server

You need to install subversion, libapache2-svn and subversion-tools packages to set up a server.

10.7.2. Setting up a repository

Currently, the subversion package does not set up a repository, so one must be set up manually. One possible location for a repository is in "/var/local/repos".

Create the directory:

# mkdir -p /var/local/repos

Create the repository database:

# svnadmin create /var/local/repos

Make the repository writable by the WWW server:

# chown -R www-data:www-data /var/local/repos

10.7.3. Configuring Apache2

To allow access to the repository via user authentication, add (or uncomment) the following in "/etc/apache2/mods-available/dav_svn.conf":

<Location /repos>
  DAV svn
  SVNPath /var/local/repos
  AuthType Basic
  AuthName "Subversion repository"
  AuthUserFile /etc/subversion/passwd
<LimitExcept GET PROPFIND OPTIONS REPORT>
    Require valid-user
</LimitExcept>
</Location>

Then, create a user authentication file with the command:

# htpasswd2 -c /etc/subversion/passwd some-username

Restart Apache2, and your new Subversion repository will be accessible with the URL "http://hostname/repos".

10.7.4. Subversion usage examples

The following sections teach you how to use different commands in Subversion.

10.7.5. Creating a new Subversion archive

To create a new Subversion archive, type the following:

$ cd ~/your-project         # go to your source directory
$ svn import . http://localhost/repos/project-name -m "initial project import"

This creates a directory named project-name in your Subversion repository which contains your project files. Look at "http://localhost/repos/" to see if it's there.

10.7.6. Working with Subversion

Working with project-y using Subversion:

$ mkdir -p /path/to ;cd  /path/to
$ svn co http://localhost/repos/project-y
  • Check out sources
$ cd project-y
  • do some work …
$ svn diff
  • similar to "diff -u repository/ local/"
$ svn revert modified_file
  • undo changes to a file
$ svn ci -m "Describe changes"
  • check in your changes to the repository
$ vi newfile_added
$ svn add newfile_added
$ svn add new_dir
  • recursively add all files in new_dir
$ svn add -N new_dir2
  • non recursively add the directory
$ svn ci -m "Added newfile_added, new_dir, new_dir2"
$ svn up
  • merge in latest version from repository
$ svn log
  • shows all changes committed
$ svn copy http://localhost/repos/project-y \
      http://localhost/repos/project-y-branch \
      -m "creating my branch of project-y"
  • branching project-y
$ svn copy http://localhost/repos/project-y \
      http://localhost/repos/project-y-release1.0 \
      -m "project-y 1.0 release"
  • added release tag.
  • note that branching and tagging are the same. The only difference is that branches get committed whereas tags do not.
  • make changes to branch …
$ svn merge http://localhost/repos/project-y \
   http://localhost/repos/project-y-branch
  • merge branched copy back to main copy
$ svn co -r 4 http://localhost/repos/project-y
  • get revision 4

10.8. Git

Git can do everything for both local and remote source code management. This means that you can record the source code changes without needing network connectivity to the remote repository.

10.8.1. Before using Git …

You may wish to set several global configuration settings in "~/.gitconfig", such as the name and email address used by Git:

$ git config --global user.name "Name Surname"
$ git config --global user.email yourname@example.com

If you are too used to CVS or Subversion commands, you may wish to set several command aliases:

$ git config --global alias.ci "commit -a"
$ git config --global alias.co checkout

You can check your global configuration by:

$ git config --global --list

10.8.2. Git references

There are good references for Git.

git-gui(1) and gitk(1) commands make using Git very easy.

[Warning] Warning

Do not use a tag string containing spaces even if some tools such as gitk(1) allow it. It will choke some other git commands.

10.8.3. Git commands

Even if your upstream uses a different VCS, it is a good idea to use git(1) for local activity since you can manage your local copy of the source tree without a network connection to the upstream. Here are commands used with git(1).

Table 10.20. List of git packages and commands.

command package popcon size description
N/A git-doc I:2 5808 official documentation for Git
N/A gitmagic I:0.2 560 "Git Magic" provides easier to understand guide for Git
git(7) git-core V:6, I:10 14036 Git, the fast, scalable, distributed revision control system
gitk(1) gitk V:0.6, I:3 756 The GUI Git repository browser with history
git-gui(1) git-gui V:0.2, I:2 1432 The GUI for Git (No history)
git-svnimport(1) git-svn V:0.4, I:2 496 import the data out of Subversion into Git
git-svn(1) git-svn V:0.4, I:2 496 provide bidirectional operation between the Subversion and Git
git-cvsimport(1) git-cvs V:0.14, I:1.4 624 import the data out of CVS into Git
git-cvsexportcommit(1) git-cvs V:0.14, I:1.4 624 export a commit to a CVS checkout from Git
git-cvsserver(1) git-cvs V:0.14, I:1.4 624 A CVS server emulator for Git
git-send-email(1) git-email V:0.10, I:1.3 368 send a collection of patches as email from the Git
stg(1) stgit V:0.09, I:0.6 844 quilt on top of git (Python)
git-buildpackage(1) git-buildpackage V:0.16, I:0.9 448 automate the Debian packaging with the Git
guilt(7) guilt V:0.02, I:0.09 336 quilt on top of git (SH/AWK/SED/…)

10.8.4. Git for recording configuration history

You can manually record the chronological history of configuration using Git tools. Here is a simple example for you to practice recording "/etc/apt/" contents:

$ cd /etc/apt/
$ sudo git init
$ sudo chmod 700 .git
$ sudo git add .
$ sudo git commit -a
  • commit configuration with description.
  • make modification to the configuration files
$ cd /etc/apt/
$ sudo git commit -a
  • commit configuration with description.
  • … continue your life …
$ cd /etc/apt/
$ sudo gitk --all
  • you have full configuration history with you.
[Note] Note

sudo(8) is needed to work with permissions of configuration data. For user configuration data, you may skip sudo.

[Note] Note

The "chmod 700 .git" command in the above example is needed to protect archive data from unauthorized read access.

[Tip] Tip

For a more complete setup for recording configuration history, see the etckeeper package: Section 9.2.9, “Recording changes in configuration files”.
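The same workflow can be rehearsed without sudo on a scratch directory before applying it to "/etc/apt/". The following is a sketch only: the "conf" directory, the "demo.conf" file, and the identity settings are hypothetical, and the git command is assumed to be installed.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/conf"           # hypothetical stand-in for /etc/apt/
(
    cd "$workdir/conf" || exit 1
    git init -q
    chmod 700 .git              # protect the history from other users
    # Identity for this scratch repository only (illustrative values).
    git config user.name  "Example Admin"
    git config user.email "admin@example.com"

    echo "setting=1" > demo.conf
    git add .
    git commit -q -m "initial configuration"

    echo "setting=2" > demo.conf
    git commit -q -a -m "change setting"
)

# Each recorded change shows up as one commit in the history.
commits=$(git -C "$workdir/conf" log --oneline | wc -l)
echo "recorded commits: $commits"
rm -rf "$workdir"
```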