  The InterMezzo High Availability File System HOWTO
  by the project members
  v1.0.5, August 2001

  This document explains the configuration and operation of the
  InterMezzo file system on Linux.
  ______________________________________________________________________

  Table of Contents


  1. Acknowledgements

  2. Disclaimer and License

  3. Introduction

     3.1 What is InterMezzo?
     3.2 Limitations in version 1.0.4

  4. Installing InterMezzo

     4.1 Overview
     4.2 Getting the packages
     4.3 Configuring the 2.4 kernel for booting
     4.4 Building the InterMezzo file system for a custom kernel

  5. Configuring InterMezzo

     5.1 Config files
     5.2 Formatting an InterMezzo file system
     5.3 Converting ext2/3 file systems to InterMezzo.
     5.4 Three common configurations
        5.4.1 One client and one server (typical use: laptop - desktop, backup and two web server synchronization)
        5.4.2 Two clients and one server (typical use: two remote offices)
        5.4.3 Using IPSec and ssh tunnels
        5.4.4 One client and one server on same host (typical use: testing InterMezzo)
     5.5 Configuration Checking

  6. Recovery from conflicts

  7. Debugging

  8. Using the test framework for testing and debugging

  9. How does InterMezzo work?

     9.1 Coherence and Granularity
     9.2 intermezzo.o, the kernel module
     9.3 Lento
     9.4 The KML File
     9.5 The Last_rcvd File
     9.6 Legal Transformations of the KML and Expect Files

  10. Contact Information



  ______________________________________________________________________

  11..  AAkknnoowwlleeddggeemmeennttss

  Many individuals have contributed to this HOWTO.  Among the authors
  are Peter J. Braam <mailto:braam@clusterfs.com>, Rob Simmonds, Gordon
  Matzigkeit <mailto:gord@fig.org>, Christopher Li
  <mailto:chrisl@gnuchina.org> and Shirish Phatak
  <mailto:shirish@tacitussystems.com>.


  22..  DDiissccllaaiimmeerr aanndd LLiicceennssee

  InterMezzo is an experimental file system.  It contains kernel code
  and daemons running with root permissions and is known to have bugs.
  Please back up all data when using or experimenting with InterMezzo.

  InterMezzo is covered by the GPL.  The GPL describes the warranties
  made to you, and can be found in the file COPYING.

  Copyright on InterMezzo is held by Peter J. Braam, Stelias Computing,
  Carnegie Mellon University, Phil Schwan, Los Alamos National
  Laboratory, Red Hat, Inc., TurboLinux, Inc., Tacitus Systems, Inc.
  and Mountain View Data, Inc.


  InterMezzo is a trademark of Stelias Computing.  It may be used freely
  to refer to the software on the InterMezzo Web Site
  <http://www.inter-mezzo.org>.


  33..  IInnttrroodduuccttiioonn

  33..11..  WWhhaatt iiss IInntteerrMMeezzzzoo??

  InterMezzo is a file system that maintains replicas of folder
  collections, also known as filesets, residing on multiple computers.
  It keeps these replicas in sync by building a log of modifications and
  propagating that log to other nodes.  The computers that express an
  interest in a replica are called the replicators of the fileset.
  InterMezzo has one server for the fileset, which plays an organizing
  role in exchanging updates with the replicators.


  InterMezzo has _d_i_s_c_o_n_n_e_c_t_e_d _o_p_e_r_a_t_i_o_n, i.e. it maintains a _l_o_g to
  remember all updates that need to be _f_o_r_w_a_r_d_e_d when a failed
  communication channel comes back. This is a best effort
  synchronization since during disconnected operation _c_o_n_f_l_i_c_t_i_n_g
  _u_p_d_a_t_e_s are possible, unless the configuration parameters are set to
  avoid this.


  InterMezzo uses an existing disk file system as the storage location
  for all data.  At present we support ext3; ReiserFS and XFS might be
  supported soon.  When an ext3 formatted disk volume is mounted with
  file system type intermezzo instead of ext3, the InterMezzo software
  starts managing all access to the file system.  It keeps the logs of
  modification records and negotiates permits to modify the disk file
  system, to avoid conflicting updates during connected operation.


  InterMezzo can use a basic internal file transfer mechanism or rely on
  the rsync protocol (see the Rsync web site <http://rsync.samba.org>).


  33..22..  LLiimmiittaattiioonnss iinn vveerrssiioonn 11..00..44


      SSeeccuurriittyy
        Currently you should run InterMezzo only on trusted networks --
        the root users on the replicating systems need to be equally
        trusted.  There is some rudumentary security built into the
        system yet, which is similar to NFS security (but _w_i_t_h_o_u_t root
        squash).  A good way to get a trusted network is to use IPSEC
        (see FreeSwan <http://www.freeswan.org>), CIPE (see
        <http://sites.inka.de/sites/bigred/devel/cipe.html>), or SSH
        tunnels.  The SSL utility stunnel is somewhat harder to use
        since it spawns many daemons trying to reconnect. Support for
        POSIX ACL replication is available for the 2.2 kernel and
        forthcoming for 2.4.  Some security improvements will be made as
        time progresses.


      RReeccoovveerryy
        The system currently has journal recovery in combination with
        Ext3.  After system crashes the local disk system with the KML,
        LML and last_rcvd file which contain distributed state will
        recover automatically.  Recovery with peers will normally also
        be seamless.


      CCoonnfflliicctt HHaannddlliinngg
        The system does not currently have conflict handlers but
        pessimistic, rigourous conflict detection.  More extensive
        conflict resolution tools are being developed and should be
        available with the next major release.  The design of the system
        means that conflicts can only occur when reconnecting after a
        period of disconnected operation and that conflicts can only
        occur on a client.


      FFeettcchh oonn ddeemmaanndd
        At the moment InterMezzo replicates an entire filesystem.
        However, a fetch on demand system will appear in a future
        version, which will allow partial replication of a filesystem.
        The first versions of this will fetch file data on demand but
        replicate metadata (directories and inodes) fully.  Partial
        metadata caching may be implemented in future versions.



  44..  IInnssttaalllliinngg IInntteerrMMeezzzzoo

  44..11..  OOvveerrvviieeww

  InterMezzo depends on a kernel that has the InterMezzo file system.
  There is also a user level file server and cache manager which are
  currently written in Perl.  Finally there are some utilities to make
  InterMezzo file systems.


  44..22..  GGeettttiinngg tthhee ppaacckkaaggeess

  The packages for version 1.0.4 are available from
  <ftp://ftp.inter-mezzo.org:/pub/intermezzo/1.0.4/rh7.1/RPMS>.  These
  packages should install cleanly on a RedHat 7.1 system.  Install
  either the 2.2 kernel package or the 2.4 kernel package.


  44..33..  CCoonnffiigguurriinngg tthhee 22..44 kkeerrnneell ffoorr bboooottiinngg

  In order to boot the 2.4 kernel, you need to generate an initial
  ramdisk with mkinitrd as follows:




       mkinitrd /boot/initrd-2.4.7-ext3_0.9.5-presto_1.0.4 2.4.7-ac9


  For LILO to boot this kernel, add an entry like the following to your
  /etc/lilo.conf file, then rerun lilo:



       image=/boot/vmlinuz-2.4.7_ext3_0.9.5_presto_1.0.4
               label=InterMezzo
               read-only
               root=/dev/hda1
               initrd=/boot/initrd-2.4.7-ext3_0.9.5-presto_1.0.4





  44..44..  BBuuiillddiinngg tthhee IInntteerrMMeezzzzoo ffiillee ssyysstteemm ffoorr aa ccuussttoomm kkeerrnneell

  In order to get a kernel module for your kernel, you need to have the
  .config file and the kernel sources for your kernel.  Proceed by first
  preparing your kernel sources, and then building the module:




       cd /your/source/linux
       make distclean
       cp your.config  .config
       make oldconfig dep
       cd /usr/src/presto24-1.0.04
       ./configure --enable-linuxdir=/your/source/linux
       make install




  The same procedure works for Linux 2.2 kernels.


  55..  CCoonnffiigguurriinngg IInntteerrMMeezzzzoo

  55..11..  CCoonnffiigg ffiilleess

  Your default config directory is /etc/intermezzo.  You may use the
  interactive inconfig command to generate the following configuration
  files, or manually create them.

  The config files in versions 1.0 and later use the XML format instead
  of the Perl formats found in older versions.



     //eettcc//iinntteerrmmeezzzzoo//ssyyssiidd
        Holds a name of your system, the presto device name and the IP
        bind address.  Suppose your server has the name muskox, with IP
        address 192.168.0.3, and your clients are clientA and clientB.
        The sysid file on each host would contain the host name, the
        presto device and the IP bind address. i.e., on muskox the file
        would contain:



          <sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />




     Note that in early versions of InterMezzo, this file did not
     contain the name of the presto device; this field is now required.


     //eettcc//iinntteerrmmeezzzzoo//sseerrvveerrddbb
        Holds a database of servers.  The server structure is a XML
        server element, as follows:



          <serverdb>
            <server name="muskox" ipaddr="192.168.0.3" port="2222"
              bindaddr="192.168.0.3" />
          </serverdb>




     The above contains a single server description for the server
     muskox with IP address "192.168.0.3".  The port and bindaddr are
     optional; the default port is 2222.  Without a bindaddr the server
     listens on all interfaces for requests; with it, the server only
     listens on the bindaddr address.  If you are running both a client
     and a server on the same system, you need to specify a different
     bindaddr for the server and the client(s).


     //eettcc//iinntteerrmmeezzzzoo//ffsseettddbb
        Holds a database of filesets.  The fsetdb structure is a XML
        fileset element, as follows:



          <fsetdb>
          <fileset name="yourfsetname" servername="muskox" fetchtype="bulktype" >
          <replicator>clientA</replicator>
          <replicator>clientB</replicator>
          </fileset>
          </fsetdb>




     The above contains a single fileset description for a fileset
     called yourfsetname which is served by muskox.  The fileset is
     replicated on hosts clientA and clientB.


     The _f_e_t_c_h_t_y_p_e can be the class name of a supported bulk mover.  The
     default is "Rsync", the simpler InterMezzo managed bulk mover is
     called "Desc".


     //eettcc//ffssttaabb
        To ease the mounting of InterMezzo filesets add one of the
        following to the /etc/fstab file.  For testing and developing
        using a loop device as the cache is easiest:



          /tmp/cache  /izo0  intermezzo loop,fileset=fsetname,mtpt=/izo0,
                data=journal,prestodev=/dev/intermezzo0,cache_type=ext3,noauto 0 0




     where /tmp/cache is a file associated with a loop device, /izo0 is
     a mount point (a directory), fsetname is the name of the fileset
     and /dev/intermezzo0 is the name of the presto device.  The
     creation of the cache file and the presto device is explained in
     the examples at the end of this section.  The kernel must be
     configured with loopback device support enabled to do this.


     NNOOTTEE:: The mount option data=journal is important for 2.4 kernels
     pending a bug fix in ext3.


     Using a genuine block device is a little easier, because you do not
     need to set up a loop device. To use the block device /dev/hda9,
     the /etc/fstab file should contain:



          /dev/hda9  /izo0 intermezzo fileset=fsetname,mtpt=/izo0,
          prestodev=/dev/intermezzo0,cache_type=ext3,data=journal,noauto 0 0




     _N_O_T_E_:

     +o  While the lines may wrap in this document the /etc/fstab entry
        should be a single line. The same holds for the following
        examples.

     +o  the mount point needs to be explicitly passed in the options
        (future versions of mount will not need this).

     OOtthheerr ffiilleess
        The file /izo0/.intermezzo/fsetname/kml contains kernel
        modification log (aka the KML) which keeps track of all of the
        changes made in an InterMezzo filesystem. The file
        /izo0/.intermezzo/fsetname/last_rcvd is the last_rcvd file which
        keeps track of the distributed synchronization file.  In the
        current release of InterMezzo, the KML and last_rcvd files need
        to be created (usually by running mkizofs) before first mounting
        an InterMezzo filesystem.
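
  The XML files described above are small enough to read with any XML
  parser.  The following Python sketch is not part of InterMezzo (Lento
  itself is written in Perl); it only illustrates how the documented
  attributes and defaults fit together, and how an /etc/fstab entry can
  be assembled as a single line.  The function names are made up for
  this example, and the code assumes exactly the element and attribute
  names shown in this section.

```python
import xml.etree.ElementTree as ET

DEFAULT_PORT = 2222          # default port given in this HOWTO
DEFAULT_FETCHTYPE = "Rsync"  # default bulk mover given in this HOWTO


def read_serverdb(xml_text):
    """Map each server name to its address settings (hypothetical helper)."""
    servers = {}
    for node in ET.fromstring(xml_text).findall("server"):
        servers[node.get("name")] = {
            "ipaddr": node.get("ipaddr"),
            "port": int(node.get("port", DEFAULT_PORT)),
            "bindaddr": node.get("bindaddr"),  # None: listen on all interfaces
        }
    return servers


def read_fsetdb(xml_text):
    """Map each fileset name to its server, bulk mover and replicators."""
    filesets = {}
    for node in ET.fromstring(xml_text).findall("fileset"):
        filesets[node.get("name")] = {
            "servername": node.get("servername"),
            "fetchtype": node.get("fetchtype", DEFAULT_FETCHTYPE),
            "replicators": [(r.text or "").strip()
                            for r in node.findall("replicator")],
        }
    return filesets


def fstab_entry(device, mountpoint, fileset,
                prestodev="/dev/intermezzo0", loop=False):
    """Build the /etc/fstab entry as one line, with mtpt passed explicitly."""
    options = ["loop"] if loop else []
    options += ["fileset=" + fileset, "mtpt=" + mountpoint,
                "prestodev=" + prestodev, "cache_type=ext3",
                "data=journal", "noauto"]
    return "%s  %s  intermezzo  %s  0 0" % (device, mountpoint,
                                            ",".join(options))


SERVERDB = '<serverdb><server name="muskox" ipaddr="192.168.0.3" /></serverdb>'
FSETDB = """<fsetdb>
<fileset name="shared" servername="muskox">
<replicator>clientA</replicator>
<replicator>clientB</replicator>
</fileset>
</fsetdb>"""

if __name__ == "__main__":
    print(read_serverdb(SERVERDB))
    print(read_fsetdb(FSETDB))
    print(fstab_entry("/dev/hda9", "/izo0", "shared"))
```

  A simple sanity check built on these helpers is to confirm that the
  servername of every fileset also appears in the serverdb.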



  55..22..  FFoorrmmaattttiinngg aann IInntteerrMMeezzzzoo ffiillee ssyysstteemm



  Use the mkizofs tool to format the cache, which is either a file used
  with a loop device or a block device:


       mkizofs -r fsetname -j /tmp/cache
       mkizofs -r fsetname -j /dev/hdaX



  The argument to the -r option gives the root fileset name for which an
  InterMezzo replication log will be created; the -j option causes an
  ext3 journal to be created.  Please note that this requires e2fsprogs
  version 1.22 or later (see <http://e2fsprogs.sourceforge.net>).  There
  are further options, such as specifying the filesystem type; see
  mkizofs -h.



  55..33..  CCoonnvveerrttiinngg eexxtt22//33 ffiillee ssyysstteemmss ttoo IInntteerrMMeezzzzoo..


  If you have already initialized your cache filesystem, then you must
  manually create the needed InterMezzo metadata files:



       mount -t ext2 -o loop /tmp/cache /izo0
       mkdir -p /izo0/.intermezzo/fsetname/db
       chgrp -R InterMezzo /izo0/.intermezzo
       chmod 700 /izo0/.intermezzo
       touch /izo0/.intermezzo/fsetname/{kml,lml,last_rcvd}
       tune2fs -j /tmp/cache # if file system was ext2
       umount /izo0




  This example assumes that we are using the loopback device with the
  /tmp/cache filesystem, and that the fileset will be called fsetname.


  Before you can mount these as InterMezzo you should manually replicate
  them to the replicators, so that the file systems are identical.


  55..44..  TThhrreeee ccoommmmoonn ccoonnffiigguurraattiioonnss


  Let's consider three common system configurations.  For each we will
  give the config files and the correct invocations to start the
  server/cache manager.



  55..44..11..  bbaacckkuupp aanndd ttwwoo wweebb sseerrvveerr ssyynncchhrroonniizzaattiioonn)) OOnnee cclliieenntt aanndd oonnee
  sseerrvveerr ((ttyyppiiccaall uussee:: llaappttoopp -- ddeesskkttoopp,,


  In this case we assume that the host muskox is serving the fileset
  shared and the host clientA is replicating the fileset.  The following
  files are placed on both muskox and clientA.



     //eettcc//iinntteerrmmeezzzzoo//sseerrvveerrddbb


          <serverdb>
            <server name="muskox" ipaddr="192.168.0.3" />
          </serverdb>





     //eettcc//iinntteerrmmeezzzzoo//ffsseettddbb


          <fsetdb>
          <fileset name="shared" servername="muskox" >
          <replicator>clientA</replicator>
          </fileset>
          </fsetdb>

     //eettcc//iinntteerrmmeezzzzoo//ssyyssiidd
        On muskox this contains:


          <sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />




     On clientA this contains:


          <sysid name="clientA" psdev="/dev/intermezzo0" bindaddr="192.168.0.20" />





     //eettcc//ffssttaabb
        The following line is added on both muskox and clientA:

          /tmp/fs0  /izo0      intermezzo loop,file-
          set=shared,prestodev=/dev/intermezzo0,
          mtpt=/izo0,cache_type=ext3,noauto 0 0



     //ttmmpp//ffss00
        This file and the filesystem is created using the following
        commands:


          dd if=/dev/zero of=/tmp/fs0 bs=1024 count=10k
          mkizofs -F /tmp/fs0





     //iizzoo00//..iinntteerrmmeezzzzoo//sshhaarreedd//kkmmll
        If we didn't run mkizofs above, we create the KML and last_rcvd
        files by first mounting the filesystem as ext3:


          mkdir /izo0
          mount -o loop /tmp/fs0 /izo0
          mkdir -p /izo0/.intermezzo/shared
          touch /izo0/.intermezzo/shared/{kml,last_rcvd}
          umount /izo0





     //ddeevv//iinntteerrmmeezzzzoo00
        This is created using the following commands:


          mknod /dev/intermezzo0 c 185 0
          chmod 700 /dev/intermezzo0






     //eettcc//ccoonnff..mmoodduulleess
        Your modules configuration file may also be called
        /etc/modules.conf.  Add the lines:


          alias char-major-185 intermezzo




  Before starting lento, mount the cache:


       mkdir /izo0; mount /izo0




  Now lento can be started on both muskox and clientA by typing


       lento





  55..44..22..  TTwwoo cclliieennttss aanndd oonnee sseerrvveerr ((ttyyppiiccaall uussee:: ttwwoo rreemmoottee ooffffiicceess))


     //eettcc//iinntteerrmmeezzzzoo//sseerrvveerrddbb
        The can be the same as for the one client and one server case
        above.


     //eettcc//iinntteerrmmeezzzzoo//ffsseettddbb


          <fsetdb>
          <fileset name="shared" servername="muskox" >
          <replicator>clientA</replicator>
          <replicator>clientB</replicator>
          </fileset>
          </fsetdb>





     This is the same as in the first example, but clientB is added to
     the replicators list.


     //eettcc//iinntteerrmmeezzzzoo//ssyyssiidd
        This is the same as in the first example for muskox and clientA,
        and on clientB contains the following:


          <sysid name="clientB" psdev="/dev/intermezzo0" bindaddr="192.168.0.21" />





     //eettcc//ffssttaabb
        This is the same as used with the one client and one server case
        above.


  55..44..33..  UUssiinngg IIPPSSeecc aanndd sssshh ttuunnnneellss


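  This section has not been written yet; contributions are welcome.

  As a starting point, InterMezzo can be run over an encrypted tunnel
  created with ssh port forwarding, along these lines:

       ssh -f -x -L 3333:localhost:2222 -R 3333:localhost:2222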



  55..44..44..  IInntteerrMMeezzzzoo)) OOnnee cclliieenntt aanndd oonnee sseerrvveerr oonn ssaammee hhoosstt ((ttyyppiiccaall
  uussee:: tteessttiinngg


  Suppose that we are running on the host muskox.  To run multiple
  lentos on one host we need to use ip-aliasing; the ip-aliasing option
  must be compiled into your kernel (CONFIG_IP_ALIAS).  This allows one
  interface to have more than one IP address associated with it.
  Suppose the name muskoxA1 and the IP address 192.168.0.100 are
  available.


     /etc/hosts
        Add the line:


          192.168.0.100   muskoxA1




     Then add the ip-alias by typing:

         ifconfig eth0:1 muskoxA1 up



     Then create two configuration files containing the following:

     //eettcc//iinntteerrmmeezzzzoo//ssyyssiidd


          <sysid name="muskox" psdev="/dev/intermezzo0" bindaddr="192.168.0.3" />





     //eettcc//iinntteerrmmeezzzzoo//ssyyssiidd..mmuusskkooxxAA11


          <sysid name="muskoxA1" psdev="/dev/intermezzo1" bindaddr="192.168.0.100" />




     The latter file will act as a sysid file for the lento running on
     the aliased IP address.  Note that because we are running both the
     client and the server on the same system, we have to specify
     different devices for each, namely /dev/intermezzo0 and
     /dev/intermezzo1.


     //eettcc//iinntteerrmmeezzzzoo//ffsseettddbb


          <fsetdb>
          <fileset name="shared" servername="muskox" >
          <replicator>muskoxA1</replicator>
          </fileset>
          </fsetdb>




     To run the second lento, a second presto device and loopback cache
     are required.  These are made as follows:


          mknod /dev/intermezzo1 c 185 1
          chmod 700 /dev/intermezzo1
          dd if=/dev/zero of=/tmp/fs1 bs=1024 count=10k
          mkizofs -F /tmp/fs1




     //eettcc//ffssttaabb
        Note that two entries are needed here:

        /tmp/fs0  /izo0      intermezzo loop,fileset=shared,prestodev=/dev/intermezzo0,
        mtpt=/izo0,cache_type=ext3,noauto 0 0
        /tmp/fs1  /izo1      intermezzo loop,fileset=shared,prestodev=/dev/intermezzo1,
        mtpt=/izo1,cache_type=ext3,noauto 0 0



     Now mount the two InterMezzo filesystems:


          mount /izo0
          mount /izo1




     The lento acting as the server can be started as before:


          lento




     The lento acting as the replicator has to be told which sysid file
     to read (which tells it which presto device to use).  The second
     lento is started as follows:


          lento.pl --idfile=sysid.muskoxA1








  55..55..  CCoonnffiigguurraattiioonn CChheecckkiinngg

  Currently the _c_h_e_c_k_c_o_n_f_i_g tool is not working.  The XML version of the
  config check is not ready yet.

  A script is provided to perform simple checks on the configuration
  files.  The script is called config_check and can be found in the
  .../intermezzo/tools directory.


  If Lento is using the standard system id file, /etc/intermezzo/sysid,
  the script can be run without arguments.  If a different system id
  file is being used the --idfile=my_idfile flag can be used to indicate
  this.


  It is also possible to use a configuration directory other than
  /etc/intermezzo by using the --configdir=my_confdir flag.


  66..  RReeccoovveerryy ffrroomm ccoonnfflliiccttss


  The current version of InterMezzo has a built-in recovery mechanism to
  deal with most system crashes.  Conflicts, i.e. inconsistent updates
  to client and server caches, can be avoided through configuration
  choices.


  However, during disconnected operation, conflicts can be generated if
  the configuration does not explicitly avoid them by forcing the file
  system to be read-only.  Where the client and server have inconsistent
  caches, only manual recovery can restore the system.


  The system can be recovered manually as follows:



  1. When a conflict happens, the lento which is reintegrating changes
     will die.  This lento was receiving updates from its peer, and
     typically the peer will have the latest updates.  So we are going
     to synchronize from the lento that survived to the lento that
     died.

  2. Shut down the server and client(s), unmount the caches, and remove
     the presto module from the kernel:

      umountizo ; rmmod presto

  3. Mount each cache as an ext3 filesystem:

      mount -o loop /tmp/fs0 /izo0

  4. Use rsync, tar, or another tool to synchronize the caches on the
     clients and server.  Make sure to remove files from the client that
     you don't have on the server; the caches need to be identical.

  5. Set the synced flag on the clients - this prevents the system from
     resyncing on startup.  This is done using the command below where
     SYSID is replaced with the client's sysid, and FSETNAME is replaced
     with the name of the fileset:

      touch /var/intermezzo/SYSID/FSETNAME-synced

     e.g. on client iclientA with fileset shared use:
      touch /var/intermezzo/iclientA/shared-synced

  6. The persistent databases will be out of sync at this point, so you
     must clear the KML and last_rcvd records on both the client and the
     server:

      cp /dev/null /izo0/.intermezzo/shared/kml ;
      cp /dev/null /izo0/.intermezzo/shared/last_rcvd

  7. Unmount the caches and mount them again as InterMezzo file systems.
     Restart Lento on the server and client.

  This is cumbersome, but journaled recovery is on its way.
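
  Because the procedure is easy to get wrong by hand, it can help to
  print the commands for each host before running them.  The Python
  sketch below only generates the command list from the steps above; it
  runs nothing.  The function name and its parameters are hypothetical,
  the default cache file and mount point are taken from the examples in
  this HOWTO, and the synchronization step is left as a comment because
  the choice of tool is up to you.

```python
def recovery_commands(role, sysid, fileset,
                      cache="/tmp/fs0", mountpoint="/izo0"):
    """Return the manual recovery commands for one host, in order.

    role is "client" or "server"; only clients get the synced flag.
    This is a dry-run helper: the caller decides whether to run anything.
    """
    meta = "%s/.intermezzo/%s" % (mountpoint, fileset)
    cmds = [
        "umountizo ; rmmod presto",                      # step 2
        "mount -o loop %s %s" % (cache, mountpoint),     # step 3
        "# step 4: synchronize this cache with the surviving peer here",
    ]
    if role == "client":                                 # step 5
        cmds.append("touch /var/intermezzo/%s/%s-synced" % (sysid, fileset))
    cmds += [
        "cp /dev/null %s/kml" % meta,                    # step 6
        "cp /dev/null %s/last_rcvd" % meta,
        "umount %s" % mountpoint,                        # step 7
        "mount %s" % mountpoint,
        "lento",
    ]
    return cmds


if __name__ == "__main__":
    for line in recovery_commands("client", "clientA", "shared"):
        print(line)
```

  The final mount relies on the /etc/fstab entry for the mount point, as
  in the configuration examples.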



  77..  DDeebbuuggggiinngg

  To help us find bugs we need logging information.  The logs come from
  two places: from the kernel in /var/log/messages, and from lento on
  stdout and stderr.


  The kernel debugging log slows things down enormously and is activated
  with:




       echo 4095 > /proc/sys/intermezzo/debug
       echo 1 > /proc/sys/intermezzo/trace




  The lento log can be captured from the terminal, and is activated
  using the --debuglevel=N option.  With N=1 you get many messages; with
  N=100 you get all of them.

  Mailing us the logs as well as a precise description of what you did
  to produce the bug might be enough to see what's happening.


  88..  UUssiinngg tthhee tteesstt ffrraammeewwoorrkk ffoorr tteessttiinngg aanndd ddeebbuuggggiinngg

  Read the README file in the ../intermezzo/tests directory.  The test
  framework can conveniently save all of this information for you, and
  it runs the client(s) and server on a single system.






  99..  HHooww ddooeess IInntteerrMMeezzzzoo wwoorrkk??

  InterMezzo was heavily inspired by Coda, and its current cache
  synchronization protocol is one of the many protocols that Coda
  supports.  It is likely not the best for every situation but it is as
  simple as we could make it.


  The InterMezzo filesystem keeps sets of files on multiple hosts
  synchronized.  It sits on top of the native filesystems on each host
  and keeps track of updates to the filesystems in such a way that it
  can synchronize the changes between multiple hosts.  In this document
  we describe the architectures and protocols that InterMezzo uses to
  keep files synchronized.



  99..11..

  CCoohheerreennccee aanndd GGrraannuullaarriittyy

  InterMezzo guarantees only very loose coherence between the
  filesystems.  Files are only ever handled as complete units, changes
  are not propagated until the file is closed for writing, and changes
  on one system are not necessarily reflected on another immediately.
  In InterMezzo 1.0 whole filesystems are replicated and only one host
  may have the write lock for that filesystem at any one time.

  99..22..

  iinntteerrmmeezzzzoo..oo,, tthhee kkeerrnneell mmoodduullee

  Presto is the kernel module for InterMezzo.  It implements the various
  operations associated with the InterMezzo file system under VFS and
  creates pseudo devices for communication with Lento.

  99..33..

  LLeennttoo

  Lento is a user-space daemon which handles file transfers and other
  caching issues on behalf of presto.  There is one Lento per mounted
  InterMezzo file system.

  99..44..

  TThhee KKMMLL FFiillee

  There is one KML file per mounted InterMezzo filesystem.  The KML file
  contains records of changes to the filesystem, and taken as a whole
  the KML file can provide a script for building a replica of the whole
  filesystem.

  The KML file is a series of binary records, each of which represents a
  single modification to the filesystem.  Each record is self-contained
  in that it does not have references to other records, a property which
  makes the records easy to move around.  The records are of variable
  length, and the length of the record is stored at the beginning and
  end of each record to facilitate moving forward or backward through
  the file.  A complete description of the allowed KML record formats
  doesn't exist yet.
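
  Storing the length at both ends of a record is what makes it possible
  to walk the file in either direction.  The Python sketch below shows
  that framing idea only.  It is not the real KML format, which is not
  documented here: the 4-byte little-endian length field, the choice to
  count the two length fields in the stored length, and the function
  names are all assumptions made for illustration.

```python
import struct

# Assumed length field: 4 bytes, little-endian, unsigned.
LEN = struct.Struct("<I")


def pack_record(payload):
    """Frame a payload with its total record length at both ends."""
    total = len(payload) + 2 * LEN.size
    return LEN.pack(total) + payload + LEN.pack(total)


def next_record(buf, offset):
    """Read forward: return (payload, offset of the following record)."""
    (total,) = LEN.unpack_from(buf, offset)
    payload = buf[offset + LEN.size:offset + total - LEN.size]
    return payload, offset + total


def prev_record(buf, offset):
    """Read backward: return (payload, start) of the record ending at offset."""
    (total,) = LEN.unpack_from(buf, offset - LEN.size)
    start = offset - total
    payload = buf[start + LEN.size:offset - LEN.size]
    return payload, start


if __name__ == "__main__":
    log = pack_record(b"mkdir a") + pack_record(b"create a/b")
    first, pos = next_record(log, 0)
    last, start = prev_record(log, len(log))
    print(first, pos, last, start)
```

  Because each record carries its own length and refers to no other
  record, a block of records can be copied elsewhere and still be parsed
  from either end.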

  99..55..

  TThhee LLaasstt__rrccvvdd FFiillee

  There is one last_rcvd file, called the Expect file in the rest of
  this document, per mounted InterMezzo filesystem.  The Expect file
  contains information about how this host is synchronized with the
  other hosts by holding pointers into this and other hosts' KML files.
  This information is stored in the filesystem so that it will be
  persistent across reboots.

  The Expect file has four pieces of information for each remote host.


  1. nneexxtt__ttoo__eexxppeecctt.  A pointer to the next record in the remote host's
     KML file that we expect it to send to us.  If we get a set of
     records that is does not start at this value then a message has
     been dropped somewhere and we need to renegotiate with that host.
     This is NOT a hint.

  2. nneexxtt__ttoo__sseenndd.  A pointer to the next record in our KML file that we
     intend to send to the remote host.  This is just a hint because we
     advance next_to_send as soon as we've sent data to another host,
     not when we've gotten confirmation that it has been received and
     processed.  When we send KML records to the remote host we send the
     value of next_to_send (plus the gap, below) to tell the remote host
     where the records come from in our KML.

  3. ccoonnffiirrmmeedd.  A pointer to the beginning of the next record in our
     KML file that has not yet been confirmed as received and processed
     by the remote host.  This is NOT a hint.

  4. ggaapp.  An adjustment to add to next_to_send before sending it to the
     remote host.  This lets us move records forward or back in our
     local KML file while preserving the externally visible file
     locations.  This is NOT a hint.
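
  The four values can be pictured as one small record per remote host.
  The Python sketch below is an illustration only, not the on-disk
  layout of the file: the class and method names are invented, and the
  pointers are assumed to be byte offsets into the KML files.

```python
from dataclasses import dataclass


@dataclass
class ExpectEntry:
    """Per-remote-host synchronization state (illustrative names)."""
    next_to_expect: int = 0  # offset in the remote KML we expect next
    next_to_send: int = 0    # offset in our KML to send next (a hint)
    confirmed: int = 0       # offset in our KML not yet confirmed
    gap: int = 0             # added to next_to_send before it is sent

    def offset_to_announce(self):
        """Offset told to the remote host when we send KML records."""
        return self.next_to_send + self.gap

    def accept_remote(self, start, length):
        """Accept records from the remote host only if they start where
        we expect; otherwise the caller has to renegotiate."""
        if start != self.next_to_expect:
            return False
        self.next_to_expect += length
        return True


if __name__ == "__main__":
    entry = ExpectEntry(next_to_send=100, gap=20)
    print(entry.offset_to_announce())
    print(entry.accept_remote(0, 64), entry.next_to_expect)
    print(entry.accept_remote(0, 64), entry.next_to_expect)
```

  In the usage example the second call is refused because the same
  block arrives twice, which is the dropped or repeated message case
  described under next_to_expect.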

  99..66..

  LLeeggaall TTrraannssffoorrmmaattiioonnss ooff tthhee KKMMLL aanndd EExxppeecctt FFiilleess

  In order to maintain consistency, only certain kinds of
  transformations to the KML and Expect files are allowed, and generally
  they have to be done together using transactions to make sure the
  system remains in a coherent state.


  1. AAppppeenndd aa RReeccoorrdd ttoo tthhee KKMMLL FFiillee.. This is the operation that the
     normal VFS file operations end up using.  The record is appended to
     the KML file, and no modifications are made to the Expect file.

  2. IInnccoorrppoorraattee aa RReemmoottee KKMMLL RReeccoorrdd.. In addition to performing the
     operation and appending the record to the local KML file, increment
     the next_to_expect for that host.  Modifies the KML and Expect
     files.

  3. SSeenndd KKMMLL RReeccoorrddss ttoo aa RReemmoottee HHoosstt.. A block of KML records are read
     from the KML file starting at next_to_send, and are transmitted to
     the remote machine.  next_to_send is incremented by the number of
     bytes read.  We effectively get a read lock on this section of the
     KML file. KML is read but not modified, and the Expect file is
     modified.

  4. RReecceeiivvee CCoonnffiirrmmaattiioonn ooff KKMMLL PPrroocceessssiinngg.. We receive confirmation
     from a remote host that a set of records starting at a given point
     and with a certain byte length has been received and processed.
     These offsets are from a remote host so we have to subtract off the
     gap for that host, then compare with what we think the confirmed
     pointer is, then move the confirmed pointer.  There are no KML
     modifications.

  5. OOppttiimmiizzee aa SSeeccttiioonn ooff tthhee KKMMLL FFiillee.. We obtain a write lock on the
     section to be optimized, then read the section in, perform whatever
     optimizations we desire, then write it out again.  The newly
     written section must be no larger than the previous, and if it is
     smaller a NOP block is inserted to fill out the space either before
     or after the new section. If this section is at the end of the KML
     file then the KML file can be truncated to remove the NOP block at
     the end.  Then the write lock is relenquished.  The KML file is
     modified and the Expect file is not.

  6. PPuunncchh OOuutt aa SSeeccttiioonn ooff tthhee KKMMLL FFiillee.. The section to be removed must
     not have outstanding read or write locks, and it can only have NOP
     records in it.  File system magic is then performed to release the
     appropriate file blocks to produce a sparse file.  The Expect file
     is not changed.

  7. FFrroonntt--TTrruunnccaattiioonn ooff tthhee KKMMLL FFiillee.. Instead of producing a sparse
     file you can remove the beginning of the KML file.  Like the punch
     operation, the section to be removed should have no outstanding
     read or write locks and should have only NOP operations.  The file
     should then be truncated and the gap values for all of the remote
     hosts adjusted in one transaction.

  8. SSkkiipp aa NNOOPP BBlloocckk iinn tthhee KKMMLL FFiillee.. The next_to_send and confirm
     pointers must both be pointing to the beginning of a NOP block.
     Then next_to_send, confirm, and gap can all be incremented by the
     size of the NOP block.  No changes are made in the KML file.
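
  To make the pointer arithmetic concrete, the Python sketch below
  models transformations 3, 4 and 7 on an in-memory byte string.  It is
  one reading of the rules above, not Lento's code.  The names are
  invented, record boundaries and locking are ignored, and the direction
  of the gap adjustment in front_truncate is an assumption derived from
  the definition of gap (offset seen by the remote host = local offset +
  gap).

```python
from dataclasses import dataclass


@dataclass
class Peer:
    next_to_send: int = 0  # local offset into our KML (a hint)
    confirmed: int = 0     # local offset, not yet confirmed by the peer
    gap: int = 0           # remote-visible offset = local offset + gap


def send_block(peer, kml, max_bytes):
    """Transformation 3: read from next_to_send and advance it."""
    start = peer.next_to_send
    block = kml[start:start + max_bytes]
    peer.next_to_send += len(block)
    return start + peer.gap, block  # offset as the remote host sees it


def receive_confirmation(peer, remote_start, length):
    """Transformation 4: subtract the gap, compare, move confirmed."""
    local_start = remote_start - peer.gap
    if local_start != peer.confirmed:
        raise ValueError("confirmation does not match confirmed pointer")
    peer.confirmed = local_start + length


def front_truncate(peers, kml, nbytes):
    """Transformation 7: drop the first nbytes of the KML and adjust
    every peer in one step.  The caller must already know that the
    section holds only NOP records and has no outstanding locks."""
    if any(min(p.next_to_send, p.confirmed) < nbytes for p in peers):
        raise ValueError("a peer still points into the section")
    for p in peers:
        p.next_to_send -= nbytes
        p.confirmed -= nbytes
        p.gap += nbytes  # keeps remote-visible offsets unchanged
    return kml[nbytes:]


if __name__ == "__main__":
    kml = b"a" * 10
    peer = Peer()
    print(send_block(peer, kml, 4))
    receive_confirmation(peer, 0, 4)
    kml = front_truncate([peer], kml, 4)
    print(peer, len(kml))
```

  After the truncation in the usage example the local pointers are back
  at 0 and the gap is 4, so the next block is still announced to the
  remote host at offset 4.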










  1100..  CCoonnttaacctt IInnffoorrmmaattiioonn

  The InterMezzo web site is <http://www.inter-mezzo.org>.


  General questions about InterMezzo can be sent to
  intermezzo-discuss@lists.sourceforge.net.  This list and the other
  InterMezzo mailing lists are archived on the InterMezzo web site, so
  it may be worth checking there to see if your question has already
  been answered.


  Bug reports should be filed on SourceForge.  Please include the
  version of InterMezzo you are using and a description of your system
  configuration and the problem observed.


  Also, please include all relevant logs: /var/log/messages, and the
  output of Lento (run with debugging) on server and clients.






















