





     UUsseerrffss -- FFiilleessyysstteemmss IImmpplleemmeenntteedd aass UUsseerr PPrroocceesssseess
           _J_e_r_e_m_y _F_i_t_z_h_a_r_d_i_n_g_e _<_j_e_r_e_m_y_@_s_w_._o_z_._a_u_>
                     Softway Pty. Ltd.



_1_.  _I_n_t_r_o_d_u_c_t_i_o_n

Userfs is a mechanism by which normal user processes can act
as a Linux filesystem.  There are many uses for this,
including:

Prototype filesystems

     Prototype  new  block  allocation  algorithms in a user
     process and  debug  with  gdb  before  going  into  the
     compile-crash-reboot cycle of kernel development.

Infrequent use filesystems

     You  want to mount "FooBaz 0X" filesystems under Linux,
     but you don't want it that often, and you don't need it
     to  be  maximum  speed.   Rather than trying to get the
     kernel  itself  to  understand,  or  write  specialised
     tools, write a filesystem program.

Add capabilities to existing filesystems

     Want  compression, encryption, ACLs?  Have a process to
     mirror  an  existing  file  tree,  but  with  your  own
     extensions and semantics.

Completely virtual filesystems and new interfaces

     Add   a   filesystem-type   interface  to  an  existing
     mechanism, or a filesystem interface as a  new  way  of
     representing data.  Sick of FTP?  How about

          $ mkdir /ftp/tsx-11.mit.edu
          $ cd /ftp/tsx-11.mit.edu/pub/linux
          $ cp README $HOME

     Or mail?

          $ cd /mail
          $ ls
          001.sbg@socs.uts.edu.au
          002.Leroy
          003.tlukka@vinkku.hut.fi
          004.Davor_Jadrijevic
          $ cat */From
          From: sbg@socs.uts.edu.au
          From: leroy@socs.uts.edu.au (Leroy)
          From: tlukka@vinkku.hut.fi
          From: davor%emard.uucp@ds5000.irb.hr (Davor Jadrijevic)
          $ cat */Subject
          Subject: More things
          Subject: (none)
          Subject: That userfs thing
          Subject: mailfs again
          $

You get the idea.


_2_.  _I_n_s_t_a_l_l_a_t_i_o_n

_2_._1  _K_e_r_n_e_l

First of all, if you have installed userfs before, remove
traces of previous versions of userfs: make sure there are
no userfs header files in linux/include/linux and no userfs
patches to any of the kernel source.

Otherwise, the kernel module should just  compile  with  the
rest  of  the  build process.  Userfs is currently supported
for 2.0.x (tested up to 2.0.0), and for  1.3.x.   It  should
work  out  the  kernel  version  you're  compiling  for  and
configure itself appropriately.  There are no kernel patches
and compiling userfs into the kernel is not supported.

To install the module you need the modutils package, which
should be available from your local Linux ftp archive.  It
should be clear from its documentation what you need to do
with userfs.o to get it into the kernel.  If you get some
warnings about multiply defined symbols, ignore them: only
undefined symbols are a problem.  You can compile the kernel
and module with either ELF or a.out compilers; I used ELF.

The kernel module supports modversions, so it generally won't
be necessary to recompile it for every kernel version;
insmod will let you know if you need to recompile the userfs
module (with luck - otherwise Linux will helpfully let you
know with obscure behaviour or kernel panics...).

_2_._2  _N_o_n_-_k_e_r_n_e_l _C_o_d_e

Building the rest of the code should be a matter of typing
"make" in the top userfs directory.  This will generate
dependencies and build the utilities needed (genser), the
library, the clients using the library and the kernel
module.  There will be some warnings; ignore them.

I  used  gcc  2.7.0;  you  probably  need  to use the latest
compiler and libraries (libg++ 2.7.0.1) for the C++  (though
I've  avoided  templates  and  exceptions;  g++  has  enough
problems with simple things).

_2_._3  _M_a_i_l_i_n_g _l_i_s_t

The old USERFS channel on the Linux Activists list server is
now defunct.  There is a new list:
linux-userfs@vger.rutgers.edu.  To subscribe, send mail to
majordomo@vger.rutgers.edu with the contents

     subscribe linux-userfs [your email address]

_2_._4  _B_u_g_s_, _c_o_m_m_e_n_t_s_, _e_t_c

When  you  find  a  bug,  tell  me.  Please send me the code
you're using, the kernel version,  whatever  changes  you've
made  to userfs kernel code, and instructions or a script to
reproduce the bug.  Don't just tell me "it broke."

If you've made changes to the kernel code, please send them
to me rather than sending them out to the world.  Please send me
comments, ideas for new kernel features, or things that  you
think  would  make  good  filesystems but you can't do right
now.  Also feel free  to  ask  questions  about  either  the
implementation  of  my  code,  how  to write your own userfs
clients, or even just to tell me you got it working.

Send   mail    to    either    me    (Jeremy    Fitzhardinge
<jeremy@sw.oz.au>) or to the mailing list (see above).


_3_.  _U_s_i_n_g _c_l_i_e_n_t_s

Clients are generally mounted with the muserfs command.
It's quite simple - it's a program which makes sure the
mount point is legal for the user to mount on, and mounts
the given process with the user's permissions.  Note that
any user can mount a process, so more checking is done on
the mount point than for a normal mount.  Unless the user is
root, the mount point must be owned by the user and
writable.  muserfs has a man page, which is even up to date.

There are a few useful or semi-useful clients: homer, ftpfs,
mailfs and arcfs.

Homer is written in C++, and uses the C++ library in the lib
     directory to do most of its work.  All it does is set
     up a single directory under its mount point which
     contains symbolic links named after each user name in
     the password file, each pointing to the associated home
     directory.  Mounted on /u it makes a passable
     replacement for ~ expansion in a shell (but it works
     for any program).

Ftpfs is an experimental filesystem which allows read-only
     access to FTP sites, maintaining a long-term disk
     cache.  It's intended primarily for anonymous FTP, but
     can also be used for authenticated FTP sessions.

Mailfs is by Davor Jadrijevic.  It is for reading mail.
     Currently it's read-only and does not track mailbox
     changes, and is no longer being actively developed.
     Pester Davor (or fix it yourself).

Arcfs was  written by David Gymer.  It allows you to mount a
     compressed tar file  as  a  read-only  filesystem,  and
     inspect  it  with  normal tools.  It's pretty neat, but
     not recommended for heavy "production" use, or for very
     large files.


_4_.  _T_h_e_o_r_y _o_f _o_p_e_r_a_t_i_o_n

The kernel module registers a new filesystem type with the
kernel ("userfs").  The filesystem itself is very simple;
all it does is take the normal kernel filesystem requests,
wrap them up into well-defined packets, squirt them
down a file descriptor (presumably connected to a process)
and wait for the reply on another file descriptor.

If the filesystem process is on the same machine,  then  the
file  descriptors  are  probably  ordinary  pipes.  However,
userfs just reads and writes on  the  file  descriptors,  so
they  could  be  anything;  files, sockets, devices - userfs
doesn't care.

The following is not a  comprehensive  tutorial  on  writing
filesystems,  or  a detailed "how it works" or specification
of the existing code.  It is intended to give some  idea  of
what  I  was  thinking,  and  basic concepts to bear in mind
while poking about in my kernel or user code.

_4_._1  _P_r_i_o_r_i_t_i_e_s

I had a number of goals which I  wanted  satisfied  by  this
thing (from most to least important):

Flexibility

     I want the process to have as much of the power of a
     kernel-resident filesystem as possible.  I wanted to
     keep the interfaces as generic and flexible as I
     could.  This has been mostly achieved.

Robustness

     Since I see prototyping and development as a major use for
     userfs, it seems important to make sure that the kernel
     code can't (at worst) crash or lock up if the user code
     fails.   As  it  stands,  it should be impossible for a
     user process to crash the kernel, but  it  is  possible
     for  a  bad user process to lock up processes trying to
     use the filesystem.

     It is also possible for a process to go  strange  while
     it is being mounted, leaving a half-mounted filesystem.
     The mountpoint becomes a  nulled  out  inode,  but  the
     kernel   refuses   to  unmount  it  (because  it  isn't
     mounted), and refuses to  mount  on  it  (because  it's
     busy).   This  happens much less often than it used to,
     because muserfs does a  simple  check  to  see  if  the
     filesystem process is at all viable.

Availability to users and Security

     I'd  like  any  user  to  be able to write a filesystem
     process.  Traditionally, filesystems  are  things  that
     embody  the  security  of  Unix, and are therefore very
     much superuser-only things.  However, there are only  a
     couple  of  really sensitive features that shouldn't be
     able to be controlled by any user: suid executables and
     device  nodes.   Since  a  trusted superuser process is
     still required to call the mount system call, and  that
     process  can  set  the no-suid and no-device flags, the
     filesystem code can't use these as security  holes.   I
     can't  think  of  anything else that needs special care
     from a security point  of  view.   However,  since  the
     filesystem  is  completely  under  the  control  of the
     process,  one  can  make  no  assumptions   about   its
     contents.  For example "." and ".." may not do expected
     things, symlinks may point to places  other  than  what
     readlink   returns.    This   makes   navigating   such
     filesystems a new and interesting experience.

Efficiency

     Efficiency is my  lowest  priority,  but  it  is  still
     important.   Unfortunately  the  other requirements (as
     usual)  make   things   less   efficient.    The   most
     significant   inefficiency   is  the  context  switches
     between the kernel and the process.  I think  the  most
     benefits can be gained by reducing the number of these.
     There are several approaches to this:

        o If the process wants a default behaviour for an
          operation, then it can be done in the kernel.  The
          best example of this is permission checking  -  if
          the  process wants normal unix permission checking
          then it doesn't need to do it  itself.   Otherwise
          it  can  take all the permission requests from the
          kernel, and implement other  permission  policies.
          This   is   currently   implemented.    When   the
          filesystem is first mounted, the kernel  asks  the
          process  what  requests it will accept.  From that
          point the kernel will do sensible default  actions
          for  requests  that  the  process  doesn't want to
          handle  rather  than   sending   them   down   the
          connection.

        o Group requests commonly issued together into one.
          This is hard, since  the  main  kernel  tells  the
          filesystem  code  very  little  about  what  it is
          doing, so it is hard to  know  what  to  do  next.
          However,  there  are  a  couple  of  single kernel
          requests that are implemented in the  protocol  as
          two  or more transactions.  This could be fixed in
          future.

        o Data can be cached in the kernel.  This is the
          most  tricky,  since  kernel caching or read-ahead
          limits the amount of control the process can  have
          over  the  data  once read.  I think this could be
          optionally implemented, depending on  whether  the
          process  says  it  is  OK to do caching, and if so
          what kinds.

          Currently directory readahead is implemented with
          the upp_multireaddir operation.  This allows the
          filesystem process to  return  as  many  directory
          entries  as  it  likes,  which are then saved in a
          readahead buffer.  If  there's  a  future  request
          which  can  be  satisfied  by  this  buffer it is,
          rather  than  sending  another  message   to   the
          filesystem.  In 1.3.x kernels, userfs can transfer
          the whole lot to the process reading the directory
          in  one  syscall  (assuming the process has enough
          space allocated for it).  This is a win  if  there
          are  lots  of  linear  directory searches (such as
          shell globbing, ls or pwd).

        o A larger than 4k maximum packet size can be used,
          now that the kernel memory allocator allows larger
          than 4k memory allocations.  However, since pipes
          are the most common connection between filesystems
          and kernels, and pipes can hold at most 4k of
          data, there would still be a context switch
          between filesystem code and kernel every 4k, so
          there wouldn't be much gain.

     A  number of people have suggested adding shared memory
     between the kernel and the filesystem process.  This
     would be quite limiting, and is the option least likely
     to improve things.  At the moment, the filesystem makes no
     assumptions  about  the  nature of the file descriptors
     for talking to the process.  To implement shared memory
     between  the  kernel and the process would require some
     way of finding the process on the other end of the file
     descriptors  (if  any),  and playing around with memory
     maps.  This still wouldn't cut down on the number of
     context switches at all (it may even increase the
     number of switches because of synchronisation).

_4_._2  _P_r_o_t_o_c_o_l

The protocol used is machine independent, using network byte
order   and   defined  type  sizes.   The  code  to  do  the
packetisation and depacketisation is generated automatically
by a program, given the description of each packet.  This is
not fully portable, but it avoids byte order  and  structure
alignment problems.

A  packet to or from the kernel has two parts.  The first is
a header that contains a sequence number, an operation type,
a  packet  type,  size of the following data, and a protocol
version number.  The packet type can either be a request,  a
reply or an enquiry.  Requests and enquiries are always from
the kernel to the process, and the process only  ever  sends
replies to the kernel.  A reply's header has one extra field
- an error  field,  containing  an  error  number.   Replies
always  have the same sequence number as their corresponding
request or enquiry.  If there was an  error  performing  the
operation  the  error  field  is set to the error number and
there is no additional data returned.  If there is no  error
the error field is set to 0.
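
As a rough illustration of the header described above, the
sketch below packs and unpacks a request header byte by byte
in network byte order.  The struct name, the field names,
their 32 bit widths and their order are all assumptions made
for the example; the real layout is whatever the generated
packetisation code defines.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical request header.  Field names, widths and order are
// assumptions for illustration, not the real userfs wire format.
struct SketchHeader {
    std::uint32_t seq;      // sequence number
    std::uint32_t op;       // operation type
    std::uint32_t type;     // packet type: request, reply or enquiry
    std::uint32_t size;     // size of the following data in bytes
    std::uint32_t version;  // protocol version number
};

const std::size_t kSketchHeaderBytes = 5 * 4;

// Store a 32 bit value most significant byte first (network byte order).
inline void put32(unsigned char *p, std::uint32_t v)
{
    p[0] = static_cast<unsigned char>(v >> 24);
    p[1] = static_cast<unsigned char>(v >> 16);
    p[2] = static_cast<unsigned char>(v >> 8);
    p[3] = static_cast<unsigned char>(v);
}

// Load a 32 bit value that was stored most significant byte first.
inline std::uint32_t get32(const unsigned char *p)
{
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
            static_cast<std::uint32_t>(p[3]);
}

// Encode the header into buf, which must hold kSketchHeaderBytes bytes.
void encode_header(const SketchHeader &h, unsigned char *buf)
{
    put32(buf,      h.seq);
    put32(buf + 4,  h.op);
    put32(buf + 8,  h.type);
    put32(buf + 12, h.size);
    put32(buf + 16, h.version);
}

// Decode a header previously written by encode_header.
SketchHeader decode_header(const unsigned char *buf)
{
    SketchHeader h;
    h.seq     = get32(buf);
    h.op      = get32(buf + 4);
    h.type    = get32(buf + 8);
    h.size    = get32(buf + 12);
    h.version = get32(buf + 16);
    return h;
}
```

Writing each field explicitly, rather than copying the struct
as a block of memory, is what keeps the encoding independent
of the host's byte order and structure padding.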

Following   a  request  or  reply  packet  is  the  optional
operation-specific  data.   This  is  passed   through   the
protocol  for  interpretation  by  the operation routines at
each end.

The kernel may have multiple outstanding requests.  In other
words,  the kernel may send a new request before receiving a
reply to a previous one.   This  allows  the  filesystem  to
block one process for a slow operation while other processes
can  use  the  filesystem  for  shorter  operations.    This
improves  performance  on,  for  example, an ftp filesystem,
where one process may  be  using  a  fast  local  link,  and
another  may be using a slow international one, and each has
to wait for its own requests to  be  satisfied.   Of  course
this requires the filesystem process to be written with some
form of  multi-threading.   If  the  filesystem  just  reads
requests,  acts  on  them  and replies then it can do so and
ignore any kernel requests until it is ready  to  deal  with
them.
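
One way a filesystem process might keep track of requests it
has not yet answered is a table keyed by sequence number,
since a reply must carry the sequence number of its request.
The sketch below is only that bookkeeping; the class and
field names are hypothetical and it is not taken from the
userfs library.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical record of a request whose reply is still outstanding.
struct PendingRequest {
    std::uint32_t op;      // operation type of the request
    std::uint32_t handle;  // file the request refers to
};

// Remembers unanswered requests by sequence number so that replies can
// be sent in whatever order the work finishes.
class PendingTable {
public:
    // Record a request.  Returns false if that sequence number is
    // already outstanding.
    bool defer(std::uint32_t seq, const PendingRequest &req)
    {
        return pending_.insert(std::make_pair(seq, req)).second;
    }

    // Remove and return the request with this sequence number, ready
    // for its reply to be built.  Returns false if it is unknown.
    bool complete(std::uint32_t seq, PendingRequest &out)
    {
        std::map<std::uint32_t, PendingRequest>::iterator it =
            pending_.find(seq);
        if (it == pending_.end())
            return false;
        out = it->second;
        pending_.erase(it);
        return true;
    }

    std::size_t outstanding() const { return pending_.size(); }

private:
    std::map<std::uint32_t, PendingRequest> pending_;
};
```

With something like this, a slow transfer can stay in the
table while later, quicker requests are completed first.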

_4_._3  _H_a_n_d_l_e_s

The base element of a filesystem is an inode.  There is an
exact one to one relationship between inodes and files
(where a file in this case can be any filesystem object,
like a normal file, a directory and so on).  The kernel
needs to be able to uniquely identify inodes.  Inodes are
uniquely numbered within a filesystem, but each mounted
filesystem has its own numbering.  Therefore an inode is
completely identified by an inode number and a filesystem
identifier (or device, though it doesn't mean much for a
filesystem which has no physical hardware associated with
it).

A  device is what distinguishes mounted filesystems from one
another, and an inode is what distinguishes files  within  a
filesystem  from each other.  Inode numbers are generated by
each filesystem, and are used by  the  kernel  to  refer  to
specific  files  to  the  filesystem  specific  code.   User
process filesystems are no exception: between the kernel and
the filesystem process, files are referred to by using
handles, which are essentially 32 bit unsigned numbers.
When a filesystem first mentions a file to the kernel, it
gives it a handle, which the kernel uses for all later
operations on the file.  It is the handle which identifies
the file, rather than the name, so it is important to use
distinct handles for distinct files, and never change the
handle of a file once it has been given to the kernel.

_4_._4  _R_a_n_d_o_m _o_p_e_r_a_t_i_o_n _s_p_e_c_i_f_i_c _a_d_v_i_c_e _a_n_d _b_l_u_r_b

This may eventually accurately describe the whole protocol,
but for now it's a list of interesting points and things that
have bitten me.

Normally when writing a filesystem you should use the
library libuserfs (see below), and use the advice in this
section as a guide on what kind of things should be put in
your userfs operation functions, or for idle curiosity.

_4_._4_._1  _M_o_u_n_t_i_n_g

The  mount  is initiated by a user process calling the mount
system call, with the  "userfs"  filesystem  type.   In  the
filesystem  specific  data,  the  process  passes  two  file
descriptor numbers for the kernel  to  read  and  write  to.
These can be any kind of file descriptor at all.  Most
commonly they would be pipes or sockets, but there is no
restriction.  All the kernel requires is that the one it
talks to the process with is writable, and the one it gets
replies from is readable.

The  most  important  request  is mounting.  Most important,
because it is one of the two requests that the  process  has
to  implement  (of  course,  not  implementing anything else
would be completely useless).  The  request  itself  is  not
that  complex.   All it does is return a handle of the inode
at the root of the filesystem.  Most commonly, this will  be
a  directory.   Userfs does not enforce this, but the kernel
itself may.

After the process returns the root handle, the  kernel  will
probe  the  process  to see what operations it is willing to
support.  This is  done  by  sending  a  series  of  enquire
packets  to  the  process.   The  process  should reply with
normal reply packets, with the errno field either set  to  0
if it is supported or ENOSYS if it isn't.  No real operation
should be done, and no additional information should be sent
in the reply.  If the process replies ENOSYS to an
operation, it will never receive it again, and the kernel
will use a sensible default for it (typically what the
kernel would normally do for an in-kernel filesystem  if  it
doesn't   support   the   operation).   Conversely,  if  the
filesystem process doesn't get an enquiry about a particular
operation  from the kernel, it will never see that operation
from the kernel.  The filesystem process should send  0  for
the  operations  it  explicitly  supports,  and  ENOSYS  for
everything else, so the protocol  can  be  extended  without
having to modify existing clients.
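
The enquiry rule above (0 for what you support, ENOSYS for
everything else, including operations you have never heard
of) can be sketched as a simple set lookup.  The class name
and the use of plain integers as operation numbers are
assumptions for the example; a real client would use the
operation codes defined by the protocol.

```cpp
#include <cerrno>
#include <set>

// Hypothetical record of which operations a filesystem process
// implements.  Operation numbers are plain ints here for illustration.
class SupportedOps {
public:
    // Declare that the filesystem implements this operation.
    void support(int op) { ops_.insert(op); }

    // Value for the error field of the reply to an enquiry:
    // 0 if the operation is supported, ENOSYS otherwise.  Unknown
    // operation numbers fall through to ENOSYS, so a newer kernel
    // asking about a newer operation gets a safe answer.
    int enquire(int op) const
    {
        return ops_.count(op) ? 0 : ENOSYS;
    }

private:
    std::set<int> ops_;
};
```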


_4_._4_._2  _R_e_a_d_i_n_g _I_n_o_d_e_s

The  most common thing for a filesystem to be asked to do is
to read inodes.  For the process, this involves filling  out
a  structure  much like the kernel's inode structure and the
stat structure.  It is important to make sure the nlinks
field  is  non-zero.   This field is the number of names the
inode has, that is, the number of directory entries  in  the
filesystem  which  refer to this inode.  In theory, this can
never be 0 when the kernel asks for the inode, because  that
means  that  the  kernel  asked  for  the inode without ever
seeing a name referring to it, implying that the  filesystem
never  told  the kernel about the file.  If it is 0 then the
kernel will never "put" the inode,  and  it  will  make  the
filesystem un-umountable.

When the kernel wants an inode from the filesystem, it uses
the upp_iread protocol request to fetch it.  This happens if
something  in  the  kernel  asks for the inode, but it isn't
already in the kernel  inode  table.   Therefore,  once  the
kernel  has  asked  the filesystem for an inode, it will not
ask for it again while anything in the kernel is using it.

Once nothing in the kernel is using the inode, the kernel
will issue an upp_iput operation, which may be preceded by
an upp_iwrite if the inode was modified in use.  A
filesystem need not implement these operations if there is
no need to do so.

_4_._4_._3  _O_p_e_n _a_n_d _C_l_o_s_e

Reading and putting inodes are the basic operations:
regardless of what an inode is being used for it will be
read and put.  The upp_open and upp_close operations
specifically correspond to the open(2) and close(2) system
calls.  Normally a filesystem doesn't need to perform any
special handling for these operations, and would not
normally implement them, except if it wants to know the
identity of the process doing the operations.  When a
program issues an open system call for a file on the user
filesystem, the kernel will send a upp_open operation for
the file, which includes complete identification for the
process which issued the open.  When the filesystem replies
it returns a credentials token.  From then on, that
credentials token is sent to the filesystem in all
operations which correspond to a system call which takes a
file descriptor as an argument, such as read, write,
readdir, lseek and so on.

This may seem a bit complex: why not just send the uid of
the process with the operations?  Well, the credentials of a
process are quite complex, since they include the real,
saved and effective uids and gids of the process, and all
the auxiliary groups.  Sending this with each request would
be quite an overhead.  The idea is that all the info is sent
on an open, and the filesystem process can associate it with
a token internally, and only use the token in correspondence
with the kernel.
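
A filesystem process that cares about credentials might keep
them in a table indexed by the token it hands back on open.
The following is a minimal sketch of that idea; the
Credentials struct, its field names and the CredTable class
are hypothetical and do not come from the userfs sources.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical copy of what an open might tell the filesystem about
// the opening process.
struct Credentials {
    std::uint32_t uid, euid, suid;      // real, effective, saved uid
    std::uint32_t gid, egid, sgid;      // real, effective, saved gid
    std::vector<std::uint32_t> groups;  // auxiliary groups
};

// Maps small tokens to full credential sets, so later operations only
// need to carry the token.
class CredTable {
public:
    CredTable() : next_(1) {}

    // Called when handling an open: store the credentials and return
    // the token to send back to the kernel.
    std::uint32_t remember(const Credentials &c)
    {
        std::uint32_t token = next_++;
        creds_[token] = c;
        return token;
    }

    // Called for read, write and so on: find the credentials that
    // belong to the token, or a null pointer if it is unknown.
    const Credentials *lookup(std::uint32_t token) const
    {
        std::map<std::uint32_t, Credentials>::const_iterator it =
            creds_.find(token);
        return it == creds_.end() ? nullptr : &it->second;
    }

    // Called when handling a close: the token is no longer needed.
    void forget(std::uint32_t token) { creds_.erase(token); }

private:
    std::map<std::uint32_t, Credentials> creds_;
    std::uint32_t next_;
};
```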

Also note that the credentials are associated with an open
file descriptor, not the process performing the operation.
Mostly a process will deal with file descriptors it has
created itself, but it's quite possible that it can inherit
file descriptors from another process with a different set
of credentials.  In this case the filesystem knows the
original process's credentials, but not those of the process
which is performing the operation.

_4_._4_._4  _H_a_n_d_l_e _M_a_n_a_g_e_m_e_n_t

The handle of an inode is the only way the kernel and the
filesystem can talk about a file.  An inode  may  have  more
than  one  name, or no names at all, so file names are not a
good way of keeping track of a file.   Use  inodes  in  your
filesystem  code  to keep track of files, even if you have a
simple 1:1 name to file mapping.

Handles must also be consistent.  Of course you must always
keep the handles of files currently in use consistent, but
you must also keep them consistent between uses.  If a
process opens a file once, closes it and then reopens it,
then it will expect it to have the same inode number if it
is supposed to be the same file (which is how processes
using a user filesystem will see the file handles).

Also, if you ever refer to a handle  in  communication  with
the kernel, you must be prepared for the kernel to ask about
it.  For example, if the kernel reads a directory with the
upp_readdir or upp_multireaddir operations, each entry in
the reply will have a name and  a  handle.   Each  of  those
handles  must  be the handle of the file if the kernel looks
at the file more closely.  If you make them  all  the  same,
for  example,  then  a  program would be entitled to believe
that all the names in the  directory  refer  to  one  actual
file.
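
A simple way to meet these rules is to hand out handles from
a table that never forgets an assignment.  The sketch below
assumes each underlying object has some stable identity that
can be written as a string (a message id, a backing file's
device and inode, or similar; not necessarily its name).  The
class and that choice of key are assumptions for the example.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hands out 32 bit handles.  The same object identity always gets the
// same handle, and distinct identities get distinct handles.
class HandleTable {
public:
    HandleTable() : next_(1) {}

    // Return the handle for an object, allocating one the first time
    // the object is mentioned.  'identity' is a hypothetical stable
    // key for the underlying object.
    std::uint32_t handle_for(const std::string &identity)
    {
        std::map<std::string, std::uint32_t>::iterator it =
            by_identity_.find(identity);
        if (it != by_identity_.end())
            return it->second;
        std::uint32_t h = next_++;
        by_identity_[identity] = h;
        return h;
    }

private:
    std::map<std::string, std::uint32_t> by_identity_;
    std::uint32_t next_;
};
```

Using the same table when building directory replies and when
answering later requests keeps the handle in a directory
entry equal to the handle the kernel will ask about.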

_4_._4_._5  _D_e_a_l_i_n_g _w_i_t_h _m_u_s_e_r_f_s

Writing  a  client  which  can be handled by muserfs is very
easy.   The  important  thing  to  remember  is   that   the
filesystem  process can basically ignore muserfs, and ignore
issues like how to quit and so on.

A userfs filesystem process should only terminate under one
condition: it gets an EOF (a read of 0 bytes) from the
kernel on the file descriptor it's reading operation requests
from.  Muserfs will execute the filesystem process with most
signals ignored, so that muserfs can handle them itself.
When the muserfs process is sent a SIGINT or SIGTERM it
unmounts the filesystem mount point with the umount(8)
command (used so that /etc/mtab is updated properly).  This
causes the kernel to send the filesystem process a
upp_umount operation.  The kernel will close its end of the
file descriptors, and the process is expected to do the
same, even if only by exiting.  Therefore, when trying to
unmount a userfs filesystem, do not kill the filesystem
process directly, and do not kill muserfs with SIGKILL.
Either way you should be able to unmount with umount as
root.
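
The "only stop on EOF" rule can be sketched as a read loop
like the one below.  It is deliberately simplified: the
function name and the callback are hypothetical, and each
chunk returned by read(2) is handed straight to the callback,
whereas a real client would first assemble complete packets
(libuserfs does this for you).

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Read requests from the kernel until it closes its end.
// Returns 0 on EOF (the normal reason to exit) and -1 on a read error.
// 'handle' is a hypothetical callback given each chunk of bytes read.
int serve(int from_kernel,
          void (*handle)(const unsigned char *data, std::size_t len))
{
    unsigned char buf[4096];
    for (;;) {
        ssize_t n = read(from_kernel, buf, sizeof buf);
        if (n == 0)
            return 0;           // EOF: the kernel has closed its end
        if (n < 0) {
            if (errno == EINTR)
                continue;       // interrupted by a signal; try again
            return -1;
        }
        handle(buf, static_cast<std::size_t>(n));
    }
}
```

A process built around such a loop simply returns from it and
exits when the filesystem is unmounted, without needing any
signal handling of its own.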


_5_.  _U_s_i_n_g _l_i_b_u_s_e_r_f_s

_l_i_b_u_s_e_r_f_s  is  a  C++  library  designed  to  make   writing
filesystem  clients  easier.  It is designed so all the work
common to almost all filesystems is encapsulated into a  few
generic  classes,  which  can  be  used  as base classes for
specific filesystem functions.

_5_._1  _B_a_s_i_c _C_l_a_s_s_e_s

The most basic classes, Comm, Filesystem and Inode, implement
the basic communication with the kernel and stub methods for
each operation.

The Comm class reads from the kernel and decodes the headers
of  the  operation  packets, and passes the remainder to the
Filesystem class.  The Filesystem performs the operation and
returns  an  unencoded return header and the encoded body of
the reply, if any.  All this is  not  exposed  to  the  code
using the library.

Filesystem  takes  each  operation  and dispatches it to the
appropriate place.  The Filesystem  class  directly  handles
the  operations  which  are  global to the whole filesystem,
such as mounting or unmounting.  For operations which pertain
to a particular Inode (such as reading, or looking up a name
in a directory), Filesystem looks up the Inode in its table
and dispatches the operation to it.

The  Inode  class  has  all its methods implemented as stubs
which fail with the "not implemented" error code.   It  also
has members for the standard inode properties of mode, type,
size, ownership, links, timestamps and so on.

These classes are completely useless on their own, so they
must be used as base classes for other classes which actually
do something.  libuserfs has more specific, but still
generally useful classes.

SSiimmpplleeIInnooddee  implements  a  simple  inode with some normally
expected behaviour.  It has a constructor which  initializes
the  inode  properties to sensible values, and methods which
implement  simple  defaults  for   the   open,   close   and
permissions check operations.

DDiirrIInnooddee,,  derived  from  SimpleInode,  implements  all  the
operations needed for a  directory,  including  linking  and
unlinking   inodes  to/from  names,  rename,  and  directory
scanning and lookup.  It takes very  little  extra  code  to
implemement simple directory behaviour.

_5_._2  _W_r_i_t_i_n_g _y_o_u_r _o_w_n _f_i_l_e_s_y_s_t_e_m _c_l_a_s_s_e_s

A complete filesystem has two parts: a collection of inodes,
one for each file,  and  the  filesystem  structure  itself,
which  holds all the inodes together.  Each inode represents
a file in the filesystem, regardless of type.  There is only
one  inode  per  file  in  the  filesystem, even if the file
appears multiple times under different names.

_5_._2_._1  _A_r_g_u_m_e_n_t_s _a_n_d _r_e_t_u_r_n _v_a_l_u_e_s _o_f _o_p_e_r_a_t_i_o_n _m_e_t_h_o_d_s

Each method with the name do_something in the Filesystem and
Inode classes corresponds to an operation in the userfs
protocol.  As a result, they all have similar argument
structures.  All such methods have const up_preamble &pre
and upp_repl &repl arguments, which are references to the
operation request and reply packet headers.  Mostly there is
no reason for operation methods to use them, because their
contents are dealt with in lower levels of the library, but
they are there if you want them.

Each userfs protocol operation may have arguments, return
values, both or neither, and the method for that operation
will have corresponding arguments.  For an operation named x
the method argument with the operation arguments will have
the type const upp_x_s, and the return values argument will
have the type upp_x_r.  For example, the upp_read operation
will correspond to the Inode method

     int Inode::do_read(const up_preamble &pre, upp_repl &repl,
                        const upp_read_s &args, upp_read_r &ret);

The contents of the structures, along with encoding and
decoding functions, are machine generated, and therefore
have a consistent set of rules.  Mostly it's quite simple,
with normal base types directly corresponding to C and C++
types.  However, variable sized types need to have both a
pointer to the data and the size of the data encoded into
them.  Memory for the data is allocated with the C++ new and
delete operators, through the alloc method of a variable
sized object.  The memory is automatically freed by the
method's caller.  For example, if a return value of a method
contains a member called name representing a filename, it
would be set with the following sequence (assuming ourname
is a normal 0 terminated string):

     int namelen = strlen(ourname);
     ret.name.alloc(namelen);                   // Allocate memory
     ret.name.nelem = namelen;                  // Set name length
     memcpy(ret.name.elems, ourname, namelen);  // Set name contents
     // ...

(Alternatively, you could just point ret.name.elems directly
at ourname, because the caller won't try to free the string
if it was never allocated.)

Note that strings are never zero terminated; the  length  of
the  returned  string is exactly the number of characters in
the string.
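
The allocation sequence above can be tried outside the
library with a stand-in type.  The sketch below is only an
assumption about how a machine generated variable sized
field behaves: VarBytes, its owned flag and set_name are
hypothetical, and alloc, nelem and elems are modelled on the
description in this section rather than copied from the
userfs headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a generated variable sized field.
struct VarBytes {
    char       *elems = nullptr;  // pointer to the data
    std::size_t nelem = 0;        // number of bytes in use
    bool        owned = false;    // true only if alloc() obtained the memory

    VarBytes() = default;
    VarBytes(const VarBytes &) = delete;             // avoid double frees
    VarBytes &operator=(const VarBytes &) = delete;

    // Obtain n bytes with operator new; the owner frees them later.
    void alloc(std::size_t n) { elems = new char[n]; owned = true; }

    // Free only what alloc() obtained; borrowed pointers are left alone.
    ~VarBytes() { if (owned) delete[] elems; }
};

// Copy a 0 terminated C string into the field WITHOUT its terminator,
// so that nelem is exactly the number of characters in the name.
inline void set_name(VarBytes &field, const char *ourname)
{
    std::size_t namelen = std::strlen(ourname);
    field.alloc(namelen);
    field.nelem = namelen;
    std::memcpy(field.elems, ourname, namelen);
}
```

Because the terminator is not copied, anything that prints
the field must use nelem rather than rely on a trailing 0.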

If the operation the method is performing fails,  it  should
return  the  appropriate  error  code,  or 0 if it succeeds.
Don't return -1 unless you mean to - it has special  meaning
(see below, in "Deferring Replies").

_5_._2_._2  _D_e_r_i_v_i_n_g _f_r_o_m _F_i_l_e_s_y_s_t_e_m

A class derived from Filesystem must implement a number of
methods to make the filesystem viable:

_E_n_q_u_i_r_e is  called  when  the  kernel  wants  to  find  what
     operations  your  filesystem  supports.   For  all  the
     operations that any inode will implement, return 0  and
     return ENOSYS for the rest.

_d_o___m_o_u_n_t takes  no  arguments and returns the handle for the
     inode  for  the  root  directory  (that  is,  the   top
     directory  of your filesystem).  The kernel immediately
     does a ddoo__iirreeaadd operation using this handle.
You can also implement _d_o___s_t_a_t_f_s which allows the kernel  to
get  space  and inode usage statistics, such as when "df" is
executed,  and  _d_o___u_m_o_u_n_t  so  the  filesystem  is  formally
informed  when it is unmounted (normally it just gets an EOF
from the kernel, and Comm::Run returns).
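
The rule for Enquire (0 for supported operations, ENOSYS for
the rest) can be sketched as a simple switch.  The operation
codes and the free function below are hypothetical; the real
codes come from the userfs protocol headers, and the real
method has the argument structure described in 5.2.1.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical operation codes, for illustration only.
enum ToyOp { OP_IREAD, OP_READ, OP_LOOKUP, OP_READDIR, OP_WRITE, OP_SYMLINK };

// Answer an enquiry for a read-only filesystem: 0 if some inode
// implements the operation, ENOSYS otherwise.
inline int toy_enquire(ToyOp op)
{
    switch (op) {
    case OP_IREAD:
    case OP_READ:
    case OP_LOOKUP:
    case OP_READDIR:
        return 0;
    default:
        return ENOSYS;
    }
}
```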

_5_._2_._3  _D_e_r_i_v_i_n_g _f_r_o_m _I_n_o_d_e

Most of the work of the filesystem is done  in  the  inodes.
All  inode classes must be derived from Inode, and generally
there will be a number of different Inode based classes.

It is probably better to use SimpleInode as  a  base  rather
than  plain Inode, because it implements simple defaults for
some   methods,   which   would    otherwise    fail.     If
Filesystem::Enquire  says  that  the  filesystem  supports a
particular operation, then any inode should be  prepared  to
get that operation from the kernel.

Similarly,  unless you are doing something special, deriving
directories from DirInode saves a lot of work.

Only  _d_o___i_r_e_a_d  need  be  implemented,  but  obviously   the
filesystem   will   do   nothing  interesting  unless  other
operations are implemented.  do_iread returns the details of
the  inode.   Note  that  the  Filesystem  class  calls  the
do_iread of the Inode when  the  operation  comes  from  the
kernel,  so the inode must exist by the time the kernel asks
for it.  The constructor for Inode  automatically  registers
the  inode  in the Filesystem's inode table; conversely, the
destructor removes it.
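
The registration behaviour described above (the constructor
adds the inode to the table, the destructor removes it) can
be sketched as follows.  ToyFilesystem, ToyInode and the use
of an unsigned long as the handle are assumptions made for
illustration; they are not the library's own declarations.

```cpp
#include <cassert>
#include <map>

class ToyInode;

// Hypothetical inode table, keyed by inode handle.
class ToyFilesystem {
    std::map<unsigned long, ToyInode *> table;
public:
    void add(unsigned long h, ToyInode *i) { table[h] = i; }
    void remove(unsigned long h)           { table.erase(h); }

    // Find the inode for a handle, or nullptr if it is not registered.
    ToyInode *find(unsigned long h) const {
        auto it = table.find(h);
        return it == table.end() ? nullptr : it->second;
    }
};

// An inode registers itself on construction and deregisters on
// destruction, so a lookup by handle only ever sees live inodes.
class ToyInode {
    ToyFilesystem &fs;
    unsigned long  handle;
public:
    ToyInode(ToyFilesystem &f, unsigned long h) : fs(f), handle(h) {
        fs.add(handle, this);
    }
    virtual ~ToyInode() { fs.remove(handle); }

    ToyInode(const ToyInode &) = delete;
    ToyInode &operator=(const ToyInode &) = delete;
};
```

This is why an inode must be constructed before its handle
is given to the kernel: a request for an unregistered handle
would find nothing to dispatch to.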

Here are some other useful methods for an Inode.  The
descriptions are brief and general, and don't necessarily
mention all the arguments and return values; those not
mentioned can be ignored.

_d_o___i_w_r_i_t_e is,  obviously,  the  opposite  of  do_iread.   It
     simply sets the various Inode values.

_d_o___i_p_u_t is called when the kernel is  no  longer  using  the
     inode.   That  is,  the  inode  is  no longer open, the
     current or root directory of a process, being  executed
     from or being mapped from.  If an inode is iput and has
     no  names  (has  no  name  to  inode  mapping  in   any
     directory) it can be destroyed.

_d_o___r_e_a_d allows data to be read from the file.  The arguments
     are the offset in the file to start reading  from,  and
     the number of bytes desiried.  The method may return as
     many bytes up to that number as it likes, including  0,
     which means EOF.

_d_o___w_r_i_t_e does the converse; a block of data and an offset is
     passed in, and the method returns the number  of  bytes
     actually written.

_d_o___l_o_o_k_u_p translates  a  name into an inode reference.  This
     is typically implemented for directories; if  the  name
     exists  in  the  directory the method should return the
     handle of the inode, or fail with ENOENT.

_d_o___d_i_r_r_e_a_d returns the next directory entry  at  the  passed
     offset.  It returns the name and inode of the next file
     in the directory, and the size of the  entry  returned.
     This  is  added  by the kernel to the current offset in
     the directory to form the offset of the next  directory
     entry  for  the next call.  Since the directory entries
     don't correspond to real file storage as in other, more
     conventional  filesystems,  a  directory  entry  can be
     regarded as having an offset of 1.

     If the end of the directory has been reached, it should
     return a new offset of 0.

_d_o___m_u_l_t_i_r_e_a_d_d_i_r is similar to do_readdir, but can return any
     number of directory entries,  which  are  cached  in  a
     readahead  buffer in the kernel.  If a program asks for
     a directory entry for  an  inode  which  has  a  cached
     directory  entry  then  the entry will come from within
     the kernel rather than asking the  filesystem  process.
     This  operation  can  return  only one entry (and so is
     like do_readdir), or as many as will fit  in  a  return
     packet  (up  to  4k  or  so  of entries).  Returning no
     entries  means  the  end  of  the  directory  has  been
     reached.    Returning  multiple  entries  improves  the
     performance of directory scans, most frequently done by
     ls, pwd and shell globbing.

     Look at the implementation of DirInode::do_multireaddir
     for details of how this should be dealt with.

_d_o___c_r_e_a_t_e does all file creation, whether  it  be  a  normal
     file,  a  directory, a fifo file or a device node.  The
     mode contains the type of the file in same way  as  the
     stat structure member sstt__mmooddee..

_d_o___u_n_l_i_n_k is   the  opposite,  and  is  used  for  unlinking
     (removing  a  name  to   inode   mapping)   files   and
     directories.   If  an  inode  is  not in use and has no
     links then it can be destroyed and its  handle  can  be
     reused.

_d_o___s_y_m_l_i_n_k is used to create new symlink inodes.  It returns
     the handle of the new inode.

_d_o___r_e_a_d_l_i_n_k returns the pathname which a  symbolic  link  is
     pointing to.

_d_o___f_o_l_l_o_w_l_i_n_k returns  the  pathname  of the file a symbolic
     link is really referring  to.   If  Filesystem::Enquire
     says  the  filesystem  does not support this operation,
     the readlink operation is used instead.

_d_o___o_p_e_n is called when a file is  actually  opened.   It  is
     only  necessary to implement this if it is important to
     know whether a file is being opened as opposed to being
     used  in  any  other  way.   This  operation passes the
     filesystem the complete authentication  credentials  of
     the  process doing the open, so that the filesystem can
     do extended security checking or change  the  behaviour
     of the file depending on the user.

     This method can return a credentials token, which is a
     magic number used by the filesystem process to refer to
     the set of credentials passed by the kernel.  The
     kernel attaches this credentials token to each
     operation generated by system calls on the file
     descriptor generated by the open (read(), write(),
     readdir() and close()).  The credentials token is part
     of the file descriptor, so it is inherited unchanged if
     the descriptor is passed to another process, even if
     that process has different credentials.

     When a file is opened, a new file table entry  for  the
     inode  is  created.  That file table entry has a single
     file descriptor referring to it.  More file descriptors
     can  be  made to refer to the file table entry with the
     dduupp(2) system call, and can be removed with cclloossee(2).

_d_o___c_l_o_s_e is called when the last file descriptor for a  file
     table  entry  is closed.  The only argument for this is
     the credentials token for that  file  table  entry,  so
     that the filesystem can free all references to it.

_d_o___p_e_r_m_i_s_s_i_o_n is called when the filesystem says it wants to
     do permissions checking.  This is called a lot, and can
     cause  many  more operations to pass between the kernel
     and filesystem process.  If  the  filesystem  does  not
     implement it the normal unix user/group/others checking
     is performed.

_d_o___r_e_n_a_m_e moves a file from  one  directory  to  a  new  one
     (though it may be the same).
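
The do_read and do_write rules above (return at most the
number of bytes asked for, 0 at EOF, and report how many
bytes were written) can be sketched against an in-memory
file.  ToyFile, toy_read and toy_write are hypothetical
helpers; in the library the data travels in the generated
argument and return structures instead of plain buffers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical in-memory file standing in for an inode's data.
struct ToyFile { std::vector<char> data; };

// Read up to 'want' bytes starting at 'off' into 'out'.
// Returns the number of bytes copied; 0 means EOF.
inline std::size_t toy_read(const ToyFile &f, std::size_t off,
                            std::size_t want, char *out)
{
    if (off >= f.data.size()) return 0;            // at or past the end
    std::size_t n = std::min(want, f.data.size() - off);
    std::memcpy(out, f.data.data() + off, n);
    return n;
}

// Write 'len' bytes at 'off', growing the file if needed (any gap is
// zero filled).  Returns the number of bytes actually written.
inline std::size_t toy_write(ToyFile &f, std::size_t off,
                             const char *buf, std::size_t len)
{
    if (len == 0) return 0;
    if (off + len > f.data.size()) f.data.resize(off + len);
    std::memcpy(f.data.data() + off, buf, len);
    return len;
}
```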

_5_._2_._4  _D_e_r_i_v_i_n_g _f_r_o_m _D_i_r_I_n_o_d_e

DirInode implements a number of userfs operation methods for
directories, such as readdir, multireaddir and  lookup.   It
also  automatically constructs directories with "." and ".."
entries pointing to the appropriate places.

DirInode deals with strings a lot, and rather than using the
normal char * it uses the libg++ String class for all string
arguments to its own methods (but not, of course, for the
userfs protocol operation methods).

DirInode expects a pointer to the parent directory, which is
also a class derived from DirInode.  If the directory is at
the top of the filesystem's tree, the pointer passed should
be NULL.  The protected member parent points to the parent
inode, or to this for the top one.  It should never be NULL.

DirInode  keeps  a  list of files in the directory, but does
not allow that  list  to  be  directly  visible.   The  only
operations  for  manipulating  the  directory contents for a
derived class are:

iinntt lliinnkk((ccoonnsstt SSttrriinngg nnaammee,, IInnooddee **)) which links a new  name
     into the directory, updating all the reference and link
     counts;

iinntt uunnlliinnkk((ccoonnsstt SSttrriinngg nnaammee)) which does the opposite;

DDiirrEEnnttrryy **llooookkuupp((ccoonnsstt SSttrriinngg nnaammee)) which     returns      a
     directory   entry   if  it  finds  the  file,  or  NULL
     otherwise; and

DDiirrEEnnttrryy **ssccaann((DDiirrEEnnttrryy ** &&ppooss)) which returns the  directory
     entry  at  _p_o_s_,  updating it in the process, or NULL if
     there are no more entries.

DDiirrEEnnttrryy **ssccaann((iinntt &&ppooss)) is the  same,  except  it  uses  an
     integer offset, which is less efficient.
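
The behaviour of these operations can be modelled with a
small stand-in class.  ToyDir and ToyEntry below are
hypothetical: they use std::string and a plain integer
handle instead of the libg++ String and Inode *, and the
EEXIST and ENOENT return codes are this sketch's own choice,
not documented behaviour of DirInode.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>

// Hypothetical directory entry: a name and an inode handle.
struct ToyEntry { std::string name; int handle; };

class ToyDir {
    std::map<std::string, ToyEntry> entries;   // hidden from users
public:
    // Link a new name into the directory; EEXIST if already present.
    int link(const std::string &name, int handle) {
        return entries.emplace(name, ToyEntry{name, handle}).second
                   ? 0 : EEXIST;
    }
    // Remove a name; ENOENT if it was not there.
    int unlink(const std::string &name) {
        return entries.erase(name) == 1 ? 0 : ENOENT;
    }
    // Return the entry for a name, or nullptr if it is not found.
    ToyEntry *lookup(const std::string &name) {
        auto it = entries.find(name);
        return it == entries.end() ? nullptr : &it->second;
    }
    // Return the entry at integer offset pos and advance pos, or
    // nullptr when there are no more entries.
    ToyEntry *scan(int &pos) {
        if (pos < 0 || static_cast<std::size_t>(pos) >= entries.size())
            return nullptr;
        auto it = entries.begin();
        std::advance(it, pos);   // walks from the start on every call
        ++pos;
        return &it->second;
    }
};
```

The walk from the start of the map in scan illustrates why
an integer offset is the less efficient of the two forms.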

_5_._3  _C_o_m_m_u_n_i_c_a_t_i_o_n_s _c_l_a_s_s_e_s

There are a number of communications classes in the library,
which provide different ways of multiplexing replies.

The simplest is the Comm class, which simply takes each
request, passes it to the filesystem and sends back the
reply.  There are more complex comms classes though.

_5_._3_._1  _F_i_l_e _D_e_s_c_r_i_p_t_o_r _D_i_s_p_a_t_c_h_e_r

The CCoommmmBBaassee class (base of all comms  classes)  provides  a
dispatcher  which  allows  classes  to  register interest in
activity on file descriptors.  This is  used  internally  to
get  input  from the kernel, but can be used by a filesystem
to monitor any file descriptor for any reason.   To  do  it,
simply  derive  a  dispatcher class from DDiissppaattcchhFFDD and call
ssttrruucctt ddiisspp__ffdd CCoommmmBBaassee::::aaddddDDiissppaattcchh((iinntt ffdd,,  DDiissppaattcchhFFDD  **,,
iinntt  wwhhaatt)),  where what can be one or more of _D_I_S_P___R, _D_I_S_P___W
or _D_I_S_P___E, for  interest  in  read  ready,  write  ready  or
exceptions.       When      an     event     occurs,     the
DDiissppaattcchhFFDD::::ddiissppaattcchh((iinntt ffdd,, iinntt wwhhaatt)) method is  called  of
the  registered  class.   If it returns 0 then it is removed
from the dispatch list.  If it returns -1  it  indicates  an
error;   it   is   removed,   and  CCoommmmBBaassee::::RRuunn(())  returns.
Returning 1 is a normal return.

CCoommmmBBaassee::::RRuunn(()) returns normally  when  there  are  no  more
entries on the dispatch list.
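
The dispatch rules above can be sketched with select(2).
This is not the library's implementation: ToyDispatchFD,
ToyDispatcher, ByteCounter and the numeric values given to
DISP_R, DISP_W and DISP_E are all assumptions, chosen only
to show the 1 / 0 / -1 return convention and the way the run
loop ends when the list is empty.

```cpp
#include <cassert>
#include <map>
#include <sys/select.h>
#include <unistd.h>

enum { DISP_R = 1, DISP_W = 2, DISP_E = 4 };   // assumed values

struct ToyDispatchFD {
    virtual ~ToyDispatchFD() {}
    // Return 1 to stay registered, 0 to be removed, -1 for an error.
    virtual int dispatch(int fd, int what) = 0;
};

class ToyDispatcher {
    struct Reg { ToyDispatchFD *handler; int what; };
    std::map<int, Reg> regs;
public:
    void addDispatch(int fd, ToyDispatchFD *h, int what) {
        regs[fd] = Reg{h, what};
    }
    // Returns 0 once the dispatch list is empty, -1 on an error.
    int run() {
        while (!regs.empty()) {
            fd_set r, w, e;
            FD_ZERO(&r); FD_ZERO(&w); FD_ZERO(&e);
            int maxfd = -1;
            for (auto &p : regs) {
                if (p.second.what & DISP_R) FD_SET(p.first, &r);
                if (p.second.what & DISP_W) FD_SET(p.first, &w);
                if (p.second.what & DISP_E) FD_SET(p.first, &e);
                if (p.first > maxfd) maxfd = p.first;
            }
            if (select(maxfd + 1, &r, &w, &e, nullptr) < 0) return -1;
            for (auto it = regs.begin(); it != regs.end(); ) {
                int fd = it->first, what = 0;
                if (FD_ISSET(fd, &r)) what |= DISP_R;
                if (FD_ISSET(fd, &w)) what |= DISP_W;
                if (FD_ISSET(fd, &e)) what |= DISP_E;
                int rc = what ? it->second.handler->dispatch(fd, what) : 1;
                if (rc == 1) { ++it; continue; }   // normal return
                it = regs.erase(it);               // 0 or -1: remove
                if (rc < 0) return -1;             // -1: stop running
            }
        }
        return 0;
    }
};

// Example handler: reads one byte per event, leaves after 'limit' bytes.
struct ByteCounter : ToyDispatchFD {
    int seen = 0, limit;
    explicit ByteCounter(int n) : limit(n) {}
    int dispatch(int fd, int) override {
        char c;
        if (read(fd, &c, 1) != 1) return -1;
        return ++seen < limit ? 1 : 0;
    }
};
```

Registering a ByteCounter on the read end of a pipe and
calling run() returns once the counter has removed itself.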

_5_._3_._2  _D_e_f_e_r_r_i_n_g _R_e_p_l_i_e_s

In normal operation, the filesystem processes one request at
a time, so each operation is replied to before the next is
looked at.  This is a convention of the way the user code
works, and not something the kernel enforces.  The kernel
just sends requests as processes using the filesystem need
them, and each process blocks until its particular request
is replied to.  Therefore, it is possible for multiple
processes to use the filesystem at once.

The DDeeffeerrCCoommmm and DDeeffeerrFFiilleessyyss classes have a method  called
DDeeffeerrRReeppllyy  (the  DeferFilesys once just calls the DeferComm
one to make it accessable to things within the  filesystem).
DeferReply  forks  the  filesystem;  on  the  child  side it
returns 0 and in the parent it returns the pid of the child.
If  the operation method returns -1 then the Filesystem just
goes on to processing the  next  request  from  the  kernel.
When  the child is ready to reply, it can just return in the
normal way.  The call to DeferReply sets  up  the  DeferComm
class in the child process to reply though the parent rather
than going straight to the kernel, in order to make sure the
replies  from multiple processes don't get jumbled up.  When
the reply has been sent back, the child process just  exits.

Because  the child is really a child process, you have to do
all  the  changes  in  filesystem   state   before   calling
DeferReply,  or  arrange  for  some  other mechanism for the
parent and children to talk.
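
The fork-and-return pattern can be sketched with plain
fork(2) and a pipe.  Everything here is a stand-in:
toy_defer_reply, slow_operation and collect_reply are
hypothetical, and the child writes its reply to the pipe
itself, whereas in the library the child simply returns and
DeferComm routes the reply through the parent.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Toy DeferReply: 0 in the child, the child's pid in the parent
// (or a negative value if the fork itself failed).
inline pid_t toy_defer_reply() { return fork(); }

// Hypothetical slow operation method.  The parent returns -1, meaning
// "no reply yet, go on to the next request".  The child does the work,
// sends the reply through the pipe and exits.
inline int slow_operation(int reply_fd, int request)
{
    pid_t pid = toy_defer_reply();
    if (pid < 0) return EAGAIN;       // fork failed: an ordinary error
    if (pid > 0) return -1;           // parent: reply is deferred

    int reply = request * 2;          // child: stands in for slow work
    ssize_t n = write(reply_fd, &reply, sizeof reply);
    _exit(n == static_cast<ssize_t>(sizeof reply) ? 0 : 1);
}

// Parent side: read one deferred reply and reap the child.
inline int collect_reply(int read_fd)
{
    int reply = 0;
    ssize_t n = read(read_fd, &reply, sizeof reply);
    wait(nullptr);
    return n == static_cast<ssize_t>(sizeof reply) ? reply : -1;
}
```

Note that the child's copy of the filesystem state is a
snapshot taken at the fork, which is why state changes have
to be made before the fork or sent back some other way.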

_5_._3_._3  _M_u_l_t_i_-_t_h_r_e_a_d_e_d _f_i_l_e_s_y_s_t_e_m_s

The TThhrreeaaddCCoommmm class creates a new  lightweight  thread  for
each  request,  using  the  Rex  lwp  library  (in  the  lwp
directory).  This allows multiple  requests  to  be  handled
within the one process, so long as one thread does not block
the whole process in a system  call.   The  file  descriptor
dispatcher  in  CommBase  is useful for preventing this: see
ffttppffss for a complete example of a multithreaded  filesystem.
