InterMezzo Protocols

Peter J. Braam
Mountain View Data

Oct, 2000

Abstract:  We describe InterMezzo's protocols as they are currently
implemented with extensions we want to add in the foreseeable future.

Mounting Protocol
-----------------

A cache is an instance of a Unix mount point of an InterMezzo file
system.  Caches are subdivided into file sets.  Each file set has an
InterMezzo mountpoint in a cache.  One file set is distinguished,
namely the one that is mounted on the root directory of a cache. 

InterMezzo is partly a path based system - during reintegration the
path names of files need to be known.  At mount time the InterMezzo
modules need to pass the mount point of the file system and the file
set mounted at the root of the cache.  These two options are passed as
the "mtpt" and "fileset" mount options.  

The mtpt and fileset options are primarily passed to enable Lento to
do reintegration.  Unless both options are set, the file system must
be read-only.  If InterMezzo is used as the root file system, the
initial mount by init will be read-only.  A subsequent remount issued
by the rc scripts will set the mtpt and fileset options and make the
file system read-write.
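
The read-only rule above can be sketched as follows.  This is only an
illustration in Python, not the module's code: the comma-separated
option string follows the usual Unix mount convention, and the example
values "/izo" and "root" are hypothetical.

```python
def parse_mount_options(data):
    """Parse a comma-separated mount option string into a dict."""
    opts = {}
    for item in data.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Options without "=value" are stored as plain boolean flags.
        opts[key] = value if sep else True
    return opts


def must_be_read_only(opts):
    """Read-write is allowed only when both mtpt and fileset have values."""
    for name in ("mtpt", "fileset"):
        value = opts.get(name)
        if not isinstance(value, str) or not value:
            return True
    return False
```

With this sketch, an initial mount with only "ro" stays read-only,
while a remount with "mtpt=/izo,fileset=root" may become read-write.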

Mount points of other file sets need to be initialized before the file
system can use these file sets.  This is done with the SETFSROOT
ioctl.  The ioctl can only be made when the root file set and the
mountpoint of the cache have been set.
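
The ordering constraint on SETFSROOT can be modelled as below.  This is
a sketch under assumptions: the class and method names are invented
for illustration, and the real ioctl interface is not shown here.

```python
class Cache:
    """Toy model of a cache and the mount points of its file sets."""

    def __init__(self):
        self.mtpt = None            # mount point of the cache
        self.root_fileset = None    # file set mounted at the cache root
        self.fileset_roots = {}     # path -> name of other file sets

    def set_mount_info(self, mtpt, root_fileset):
        # Stands in for the "mtpt" and "fileset" mount options.
        self.mtpt = mtpt
        self.root_fileset = root_fileset

    def set_fsroot(self, path, fileset):
        # Stands in for the SETFSROOT ioctl: refused until the cache
        # mount point and the root file set are both known.
        if self.mtpt is None or self.root_fileset is None:
            raise RuntimeError("mtpt and root fileset must be set first")
        self.fileset_roots[path] = fileset
```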


Network Request Protocol
------------------------

Our network request protocol is layered: 

1. network layer - TCP sockets
2. packet protocol - InterMezzo's Packet protocol
3. session protocol - InterMezzo's Session protocol
4. request handling 
5. bulk data transport

Networking Layer

InterMezzo has a "Pinger" session which tries to connect to a socket
on the server.  This connect is made asynchronously.  Connections are
only initiated by clients.  If the connect attempt fails a new attempt
to connect is started a few seconds later.  If it succeeds a
connection object is established.  All further communication uses the
packet protocol over the connection.
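
The retry behaviour of the Pinger might look roughly like the
following.  This sketch is synchronous for brevity, whereas the real
connect is asynchronous; the function name, the 5 second default and
the injected callables are all assumptions made for illustration.

```python
import time


def connect_with_retry(try_connect, retry_delay=5.0, sleep=time.sleep,
                       max_attempts=None):
    """Call try_connect() until it succeeds; wait between attempts.

    try_connect is any callable that returns a connection object or
    raises OSError on failure.  Returns None if max_attempts is
    exhausted.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            return try_connect()
        except OSError:
            # Failed: start a new attempt a few seconds later.
            sleep(retry_delay)
    return None
```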

The connection is monitored.  In the absence of traffic it is
forcefully shut down.  All sessions awaiting actions from the
connection are notified of such an event.  Buffers are flushed.
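
A minimal model of this monitoring is sketched below.  The class
names, the explicit "now" arguments and the idle timeout value are
assumptions, and "flushed" is interpreted here as discarding pending
output; the text does not say which meaning is intended.

```python
class Session:
    """Stub session that records whether it was told about a shutdown."""

    def __init__(self):
        self.lost = False

    def connection_lost(self):
        self.lost = True


class MonitoredConnection:
    def __init__(self, idle_timeout, now):
        self.idle_timeout = idle_timeout
        self.last_traffic = now
        self.waiting = []     # sessions awaiting actions from us
        self.outbuf = []      # pending outgoing data
        self.closed = False

    def traffic(self, now):
        """Record that traffic was seen at time 'now'."""
        self.last_traffic = now

    def check(self, now):
        """Shut down if idle for too long; return True if shut down now."""
        if self.closed or now - self.last_traffic < self.idle_timeout:
            return False
        self.closed = True
        for session in self.waiting:
            session.connection_lost()
        self.waiting.clear()
        self.outbuf.clear()
        return True
```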

Packet protocol

The incoming networking protocol reads the stream and assembles
objects called packets from the data received.  The outgoing protocol
is capable of formatting packets into byte stream objects.  Typically
these packets are sent out in chunks of a few kilobytes.
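
Assembling packets from a byte stream can be sketched as follows.  The
wire format is not given in this document, so the 4-byte big-endian
length prefix used here is purely an assumption, as are the names
PacketAssembler and format_packet.

```python
import struct


def format_packet(payload):
    """Outgoing side: prefix the payload with its length (assumed format)."""
    return struct.pack(">I", len(payload)) + payload


class PacketAssembler:
    """Incoming side: collect stream bytes and emit complete packets."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data):
        """Add received bytes; return the list of packets now complete."""
        self._buf += data
        packets = []
        while len(self._buf) >= 4:
            (length,) = struct.unpack_from(">I", self._buf, 0)
            if len(self._buf) < 4 + length:
                break  # wait for the rest of this packet
            packets.append(bytes(self._buf[4:4 + length]))
            del self._buf[:4 + length]
        return packets
```

The assembler does not care how the stream was chunked on the way:
a packet split over several reads is returned once its last byte
arrives.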

The packets have a type: 

REQ: the first packet always sent in an exchange.  It defines which
     network peer is the client and which is the server.  It needs to
     contain a client ACT so that the response can be handled.  It
     will typically generate a new session in the server to handle
     the request.
REP: packets sent in response to the client by the server. Will
     contain a server ACT in case the client wishes to send more
     (which is not currently done).
MSG: packets sent from client to server after a REP packet has
     arrived.

Bulk transfer introduces a few more packets:

START: from server to client to indicate server is ready to
       receive data.  This packet contains ACT's for sink 
       and source in the server/client spot. 
DAT:   client to server and server to client, dispatched to ctoken. 
EOD:   client to server and server to client, dispatched to ctoken. 


Every packet is part of an exchange.  An exchange is a sequence of
packets associated with processing a request.  Every exchange starts
with a REQ packet which defines the sender of the packet as the client
in the exchange and the receiver as the server in the exchange.

Exchanges, at present, take only a few forms:

Client exchanges:
-----------------

REQ(out), REP(in)

REQ(out), [ DATA(in)+, EOD(in) ]*, REP(in)

REQ(out), [START(in), DATA(out)+, EOD(out)]*, REP(in)

Server exchanges:
-----------------

REQ(in), REP(out)

REQ(in), [ DATA(out)*, EOD(out) ]*, REP(out)

REQ(in), [ START(out), DATA(in)+, EOD(in) ]*, REP(out)
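
The client exchange forms listed above can be checked mechanically.
The sketch below encodes them as regular expressions over a packet
trace; the (type, direction) tuple representation and the function
name are assumptions made for illustration.

```python
import re

# One pattern per client form.  The plain REQ(out), REP(in) exchange is
# the zero-repetition case of either pattern.
_RECEIVE = r"REQ\(out\) (?:(?:DATA\(in\) )+EOD\(in\) )*REP\(in\) "
_SEND = r"REQ\(out\) (?:START\(in\) (?:DATA\(out\) )+EOD\(out\) )*REP\(in\) "


def is_valid_client_exchange(packets):
    """packets: list of (type, direction) tuples, e.g. ("REQ", "out")."""
    text = "".join("%s(%s) " % (ptype, direction)
                   for ptype, direction in packets)
    return any(re.fullmatch(pattern, text) is not None
               for pattern in (_RECEIVE, _SEND))
```

The server forms are the same sequences with "in" and "out" swapped.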

Exchanges are often nested: a server, while processing an exchange,
may require one or more other exchanges, which it initiates as a
client with the same or another peer.

InterMezzo manages request processing in objects called sessions that
have event handlers.  Connections send events to sessions.  In order
for network traffic to be directed to the correct session, each packet
may carry a client and a server "asynchronous completion token"
(ACT).  The connection layer is responsible for dispatching packets to
the correct session.  REQ packets are dispatched to the reqdispatcher,
which instantiates new sessions to handle the exchange.
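
Dispatching by ACT can be sketched as follows.  This is a simplified
model, not Lento's code: packets are plain dicts, and a single "act"
field stands for whichever token names the session on the receiving
side.  All class and field names are assumptions.

```python
class Session:
    """Stub session that records the packet types it was sent."""

    def __init__(self, act):
        self.act = act
        self.events = []

    def handle(self, packet):
        self.events.append(packet["type"])


class Connection:
    def __init__(self):
        self.sessions = {}   # ACT -> session
        self._next_act = 1

    def new_session(self):
        session = Session(self._next_act)
        self.sessions[session.act] = session
        self._next_act += 1
        return session

    def dispatch(self, packet):
        """Route a packet to its session; REQ creates a new session."""
        if packet["type"] == "REQ":
            session = self.new_session()   # the reqdispatcher's role
        else:
            session = self.sessions[packet["act"]]
        session.handle(packet)
        return session
```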

The following request sequences are currently in use:




REPLICATING, DATA-ON-DEMAND, PARTIAL CACHING

- Backfetch:
 current params:   'volume','path', 'ino', 'generation'
 networking internal parameters: 'client ACT' (which session is handling the fetch)
 This is made by the bulk descriptor and involves a bulk data transfer.  

- Reintegrate:
     ('REQ', ['Reintegrate_KML', $kml->{vol}->name(),
              $_[HEAP]->{replicator}->next_to_send(),
              $_[HEAP]->{replicator}->{last_remote_offset},
              $_[HEAP]->{replicator}->{last_remote_recno},
              $_[HEAP]->{replicator}->{last_local_recno},
              $_[HEAP]->{replicator}->{last_local_offset},
              $kml->{last_offset} - $_[HEAP]->{replicator}->next_to_send()]);

- Ping
	params: time()

- Sysid
	system name

- GetML
     ['GetML', $vol->name(), $local_kmlsize,
      $last_remote_recno, $last_remote_offset,
      $last_local_recno, $last_local_offset]);

- Permit:
	       ('REQ', ['Permit', $_[HEAP]->{volume}->name()]);

DATA ON DEMAND & PARTIAL CACHING:

- GetFile - made by kernel to fetch data
     ['SendFile', $_[HEAP]->{volname},
                  $_[HEAP]->{remotefile},
                  $_[HEAP]->{is_backfetch}

PARTIAL CACHING:

- GetDir - made by kernel to get directory when data is missing


Kernel

Kernel state 

Kernel state of InterMezzo is that of a running ext2 file system with
added state in:
 - dentries
 - inodes (flags)
 - non posted journal pages
 - pending upcalls (including posted journal pages)

Dentries have flags:
 - HAVE_DATA
 - HAVE_ATTR
 - IS_VOLROOT
 - HAVE_PERMIT (on volume root dentries)

Journal entries may refer to objects in one or in two volumes
(rename).  Journal entries cannot be created unless the HAVE_PERMIT
flag is set for the volumes affected by the entry.
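
The permit rule can be expressed as a check over the root dentries of
the affected volumes.  In the sketch below the flag names come from
the list above, but the bit values and the function name are
illustrative assumptions.

```python
from enum import Flag


class DentryFlags(Flag):
    # Bit values are illustrative only.
    HAVE_DATA = 1
    HAVE_ATTR = 2
    IS_VOLROOT = 4
    HAVE_PERMIT = 8


def may_journal(volume_root_flags):
    """volume_root_flags: flags of the root dentry of each volume the
    journal entry touches (one volume, or two for a rename)."""
    return all(DentryFlags.HAVE_PERMIT in flags
               for flags in volume_root_flags)
```

A rename between two volumes is therefore refused if either volume
root lacks HAVE_PERMIT.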




Recovery

We assume that writes to disks are ordered.  This appears not to be
guaranteed for SCSI drives.

When a system crashes and comes back up, how can we ensure that:

1) the CML and ext2 cache are compatible
2) partially fetched objects are detected

1) 

We will write a time stamp to the CML every few seconds. 

a) The last 30 seconds of CML may refer to objects not present on the
disk.  Remove such CML entries except when they are rmdirs and
unlinks.  

(Notice that even in the case of unlinks/rmdirs they may be for
objects which are not present on the server, i.e. objects created
and removed during the last xx seconds)
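
Step a) could be sketched as a filter over the CML.  The record
layout (dicts with "time", "op" and "path"), the function name and
the "exists" callback are assumptions for illustration; the 30 second
window is taken from the text.

```python
def prune_cml(entries, last_stamp, exists, window=30):
    """Drop recent CML entries that refer to objects missing from disk.

    entries:    CML records in log order
    last_stamp: last time stamp written to the CML before the crash
    exists:     callable telling whether a path is present on disk
    """
    kept = []
    for entry in entries:
        recent = entry["time"] > last_stamp - window
        is_removal = entry["op"] in ("rmdir", "unlink")
        if recent and not is_removal and not exists(entry["path"]):
            continue  # refers to an object that never reached the disk
        kept.append(entry)
    return kept
```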

b) There may be objects on the disk which were modified, but CML
entries were not generated. 

- if such objects are new, they have disk {mc}times during the last 30
seconds.   Their parents may not have 

- if such objects are not new, we must force their {mc}times to the
disk before modifying them 


2) 

a) Fetches should always store data in new files or directories.  In
this way, it is not possible for data blocks to be present after ext2
recovery without the new inode having been written. 

b) While fetches are in progress and for 30 seconds afterwards (or
until sync is called) new objects should have a flag set in the inode
(EXT2_FLAGS_SETTLING).  30 seconds after creation of a new object,
Lento can clear the flag.

This flag applies to changes to the FS made by Lento.

c) ifsck should remove all objects with this flag set. 
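
The life cycle of the settling flag in b) and c) can be modelled as
below.  This is a toy model: the class and function names are
invented, and times are plain numbers rather than inode timestamps.

```python
SETTLE_SECONDS = 30


class FetchedObject:
    """An object created by a fetch; starts with the settling flag set."""

    def __init__(self, path, created):
        self.path = path
        self.created = created
        self.settling = True    # stands in for EXT2_FLAGS_SETTLING


def clear_settled(objects, now):
    """Lento's part: clear the flag 30 seconds after creation."""
    for obj in objects:
        if obj.settling and now - obj.created >= SETTLE_SECONDS:
            obj.settling = False


def ifsck(objects):
    """Recovery: keep only objects whose flag has been cleared."""
    return [obj for obj in objects if not obj.settling]
```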
