1 Introduction
This document summarizes the scientific and technical
discussion that took place during MoSAIC's kick-off meeting on
October 13th, 2004 at LAAS-CNRS in Toulouse. Attendees were:
- Yves Roudier from Eurecom;
- Michel Banâtre and Paul Couderc from IRISA;
- Laurent Blain, Ludovic Courtès, Yves Deswarte, Marc-Olivier Killijian, David Powell, Matthieu Roy and Isabelle Silvain from LAAS-CNRS.
The following sections describe short-term research tracks for the
project and various ideas that came up during the brainstorming
session. These ideas are transcribed here in raw form. Some
of them fall under several themes and hence contain pointers to the
other places where they appear.
In the following, the terms data owner and backup client are used
interchangeably, as are the terms data saver and backup server.
2 Distributed Backup
MoSAIC will have to build on current archival, backup, and
especially cooperative backup techniques.
The following ideas were brought up and are to be further
investigated (the first items may actually be considered as preliminary
work):
- Typical use scenarios need to be defined so as to
determine a spectrum of realistic internet connectivity and mobility schemes; see also item 1 of section 3.
- An interaction model between peers should be
defined based on a set of scenarios (e.g. frequency and duration of
internet accesses? density of ad-hoc nodes in the absence of an
internet access? usefulness of ad-hoc routing protocols and
reachability of distant nodes?).
- Representing use cases and peer interactions as a
finite state machine (FSM) is a necessary first step to get an idea
of how the overall system could work (e.g. Alice's PDA meets a peer
PDA, introduces itself, negotiates the storage of a 1k block for one
week, etc.) and it could even provide a more detailed representation
of the interactions between peers (see also item 8 of section 3).
- Peer-to-peer file sharing and backup systems are of
particular interest as they provide approaches to data
dissemination, retrieval of scattered data and privacy
and anonymity enforcement techniques (among other things);
peer-to-peer systems also implement various economic models that
aim to give peers incentives to collaborate with the community.
- Fragmentation-Redundancy-Dissemination (FRD)
techniques are to be considered with respect to several scenarios;
besides being crucial in enforcing data privacy
(see item 2 of section 4), fragmentation is well suited for
short peer interactions with frequent disconnections (one may not be
able to send a whole file to a given peer); redundancy and
dissemination help increase data availability.
- Data backup operations such as incremental
backups, evaluation of the differences between two versions of a
file, and block-level backup need to be reviewed and compared with
data-semantic-aware backup techniques; the advantages of content-based block addressing should also be assessed; see also item 10 of section 4.
- Data recovery techniques are to be considered in the
framework of a "push" model (i.e. data savers push backed-up
data to their client data owners) versus a "pull" model
(i.e. data owners query data savers as in peer-to-peer file sharing
systems); see also item 9 of section 4.
- Data revision and obsolescence management raises
a number of questions: dealing with the fact that only partial backups
may be done while the data keeps changing; making sure that chunks
coming from different revisions of a given file cannot be merged
together when restoring the file; taking care to back up a
given file revision entirely before starting to back up chunks of later
revisions; defining a mechanism allowing servers to determine which
chunks of data are obsolete and can be deleted; a look at revision
control systems such as CVS and GNU Arch would seem relevant here.
- Resource discovery protocols may be used by a data
owner to discover available storage space on surrounding peers.
- Data transfer between mobile nodes needs to be studied, in
particular protocols addressing ephemeral connection issues
(bootstrapping, abortion, etc.).
- Data ownership and sharing is a major concern: while some
data are definitely private and should not be made available to other
peers, the backup system could benefit from knowing that some
data were created collectively (e.g. during a meeting using
collaborative applications); moreover such collectively created data
should be made available to all the peers which participated in their
creation; this may require users to explicitly attach semantic
information to their data; see also item 6 of section 4.
- Data criticality levels may be used to provide
the backup system with hints on which data should be backed up first;
again, the user would have to explicitly provide the system with this
information; managing criticality levels may require a "smart backup
scheduling" technique: if the most critical data are stored in very
active files it may turn out that the system keeps backing up the
different versions of those files while completely forgetting about
less critical data.
- Useful online services may improve the backup
functionality (e.g. whenever a data owner can access its trusted desktop
computer, it might want to send its most critical data there; similarly,
data savers might push data back to the data owner's trusted
desktop computer whenever the latter becomes available); see also
item 7 of section 3.
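Items 6 and 8 above (content-based block addressing, and keeping chunks of different revisions apart) can be illustrated with a minimal sketch. It is not a design decided at the meeting: the fixed 1 KB block size, the choice of SHA-256, and the names Saver, backup, restore and collect_garbage are all assumptions made for illustration only.

```python
import hashlib

BLOCK_SIZE = 1024  # assumed block size, echoing the "1k block" example


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def block_id(block: bytes) -> str:
    """Content-based address: the SHA-256 digest of the block."""
    return hashlib.sha256(block).hexdigest()


class Saver:
    """Hypothetical data saver storing blocks under their content address."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def store(self, block: bytes) -> str:
        bid = block_id(block)
        self.blocks[bid] = block  # identical blocks are stored only once
        return bid

    def collect_garbage(self, live_ids: set[str]) -> int:
        """Delete blocks that no live manifest references; return the count."""
        obsolete = set(self.blocks) - live_ids
        for bid in obsolete:
            del self.blocks[bid]
        return len(obsolete)


def backup(saver: Saver, data: bytes) -> list[str]:
    """Back up one revision; the manifest lists its block ids in order."""
    return [saver.store(block) for block in split_blocks(data)]


def restore(saver: Saver, manifest: list[str]) -> bytes:
    """Rebuild a revision strictly from its own manifest, checking each block."""
    parts = []
    for bid in manifest:
        block = saver.blocks[bid]  # raises KeyError if the block is missing
        if block_id(block) != bid:
            raise ValueError("corrupted block " + bid)
        parts.append(block)
    return b"".join(parts)
```

Because a revision is restored only from its own manifest, blocks from different revisions cannot be mixed by accident, and blocks shared between revisions are stored once. In this sketch the data owner decides which manifests are still live; how a data saver would learn this in practice is one of the open questions listed above.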
3 Negotiation and Collaboration
This section focuses on ideas related to mechanisms (economic
models) that aim to provide peers with collaboration incentives
while protecting the system against selfishness "attacks".
- Typical use scenarios will be helpful in finding out
how peers could interact with each other; see also
item 1 of section 2.
- Tamper-proof hardware and smartcards may be used as a
means of providing secure and reliable user identification,
which might help enforce data privacy (see item 1 of section 4) and may help implement electronic-cash based
incentives (which require that the human owner, rather than his device,
be reliably identified and that identities be impossible to forge).
- Identifying devices and/or users is needed in order to
implement an economic model; see also item 5 of section 4.
- Identification through a trusted certification
authority may be used; however, using a trusted authority breaks
the peer-to-peer model and may be impossible for disconnected mobile
nodes which do not have internet access.
- A hybrid model based on ecash and a
trust-based economic model may be considered given that mobile nodes
may not always be able to connect to trusted third-parties such as the
"bank" that issues cash and validates transactions.
- Trust establishment is an issue as nodes have no prior
mutual trust (if a central certification authority is to be involved
in the process, then peers may want to ask it to validate a peer's
identity as soon as they can; if a trust-based decentralized approach
is considered, trust has to be bootstrapped somehow so that peers can
start collaborating together); see also item 7 of section 4.
- Useful online services that may help the backup service
(e.g. trusted third parties such as a bank or a certification authority)
need to be identified; see also item 13 of section 2.
- Representing use cases and peer interactions as a finite state
machine, as mentioned in item 3 of section 2, may help describe
peer interactions at various levels of abstraction.
- Cooperation, negotiation: peers will have to
negotiate storage space and possibly duration; see also item 8 of section 4.
- ID bootstrapping and recovery: in order to start using
the system, users may have to acquire a unique ID from a central
authority or devices may identify themselves using a vendor-defined
identifier (e.g. MAC address); after loss, theft or crash of a mobile
device, it may be desirable for the user/device to re-enter the system and
automatically benefit from the trust peers had in him/it previously
and/or use the electronic cash he/it had previously accumulated; see
also item 11 of section 4.
- Quality of Service, in particular making it
possible to guarantee that data will be kept for some time and that its
criticality level (see item 12 of section 2) can be accounted for.
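The finite state machine representation mentioned in item 8 can be sketched for the example scenario (a PDA meets a peer, introduces itself, negotiates storage, then transfers a block). The states, event names and transition table below are hypothetical; they only show what such a model could look like, not an agreed protocol.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    INTRODUCED = auto()
    NEGOTIATING = auto()
    TRANSFERRING = auto()
    DONE = auto()
    ABORTED = auto()


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.IDLE, "peer_discovered"): State.INTRODUCED,
    (State.INTRODUCED, "request_storage"): State.NEGOTIATING,
    (State.NEGOTIATING, "offer_accepted"): State.TRANSFERRING,
    (State.NEGOTIATING, "offer_rejected"): State.IDLE,
    (State.TRANSFERRING, "block_stored"): State.DONE,
}

# States in which a session is under way and can be interrupted.
IN_SESSION = {State.INTRODUCED, State.NEGOTIATING, State.TRANSFERRING}


def step(state: State, event: str) -> State:
    """Advance the FSM by one event.

    A lost connection aborts any session in progress; an event that is
    not valid in the current state raises ValueError.
    """
    if event == "connection_lost" and state in IN_SESSION:
        return State.ABORTED
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} not allowed in state {state.name}"
        ) from None


def run(events: list[str], start: State = State.IDLE) -> State:
    """Feed a sequence of events to the FSM and return the final state."""
    state = start
    for event in events:
        state = step(state, event)
    return state
```

Making the "connection_lost" event explicit is a deliberate choice here: ephemeral connections are expected to be the common case between mobile nodes, so the model treats an aborted session as a regular outcome rather than an error.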
4 Privacy
This section deals with techniques useful for guaranteeing
the privacy of backed-up data. Some of the ideas presented in previous
sections are relevant to this goal, most notably data fragmentation
and dissemination, data ownership and sharing management, as well as
user/device identification. Below is a list of more specific ideas.
- Tamper-proof hardware and smartcards may be used to enforce
data privacy (e.g. the user's private key is stored on a
smartcard which can be reused after a device failure, therefore making it
possible for the user to retrieve previously backed-up data);
see also item 2 of section 3.
- Data fragmentation, dissemination and encryption,
mentioned earlier (item 5 of section 2), are a means of ensuring
data privacy since no single data saver has enough information to
reconstruct a backed-up file, nor is it practically feasible for a
malicious data saver to decipher data blocks.
- Allowing users to have several IDs or roles might
help in providing anonymity; however, even though a client
system could use several application-level identities when connecting to
a backup server, the server may still be able to retrieve the client's
actual MAC (hardware) address or a similar identifier; anonymity seems a lesser
requirement in a backup system than in a document publishing system.
- Security policies may define who or which peers
can access certain data; this is similar to the data ownership issue
mentioned earlier.
- Identifying devices and/or users may be a crucial point
in order to make private data only available to authorized users
(provided snapshots are encrypted and disseminated, identifying snapshots rather than devices and/or users might be sufficient since
access control can be left to the user who can decide whether to
disclose a snapshot ID); see also item 3 of section 3.
- Data ownership and sharing semantics must be defined in
order to guarantee data privacy (see also item 11 of section 2).
- Trust establishment is an issue with respect to privacy: it should
be impossible to forge new identities or to use another
device/user's identity; see also item 6 of section 3.
- Cooperation and negotiation should not reveal private
data; see also item 9 of section 3.
- Data recovery techniques are a crucial point for data
privacy enforcement: it should be practically infeasible for someone
(including backup server owners) to retrieve and decrypt backed up data,
unless entitled to do so; see also item 7 of section 2.
- Data backup operations should be made opaque
(using ciphering techniques) so that external observers cannot
tell what data are being backed up; see also item 6 of section 2.
- ID bootstrapping and recovery is a potential issue
with respect to privacy: it should be impossible for someone to pass
himself/herself off as another person so as to perform backup operations on
his/her behalf (thus potentially altering the trust other people have in
him/her) or to retrieve his/her data; see also item 10 of section 3.
- Worms such as those developed at Xerox PARC in the early
1980s may have raised concerns similar to those we have now.
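The claim in item 2 that no single data saver should hold enough information to reconstruct a file can be illustrated with a simple XOR-based split. This is a stand-in chosen for brevity, not the FRD scheme the project would use: it provides no redundancy (every fragment is needed to recover the data), and the function names are assumptions.

```python
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split(data: bytes, n: int) -> list[bytes]:
    """Split data into n fragments, one per data saver.

    The first n-1 fragments are random; the last is the data XORed with
    all of them, so any subset of fewer than n fragments looks random.
    """
    if n < 2:
        raise ValueError("at least two fragments are required")
    fragments = [secrets.token_bytes(len(data)) for _ in range(n - 1)]
    last = reduce(xor_bytes, fragments, data)
    return fragments + [last]


def combine(fragments: list[bytes]) -> bytes:
    """Recover the data by XORing all n fragments together."""
    return reduce(xor_bytes, fragments)
```

A real scheme would combine fragmentation with redundancy (so that the loss of some data savers can be tolerated) and with encryption, as discussed in item 5 of section 2.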
5 Experimentation
Design proposals and actual prototypes shall be evaluated
keeping the following items in mind:
- Performance and efficiency metrics need to be
defined in order to evaluate the backup system; we may also define
a set of benchmarking scenarios.
- Validation using proofs is a prerequisite for
confidence in a given design, protocol and interaction model.
- Safety and liveness properties for the backup
system must be defined (e.g. under what circumstances should we expect
the system to be able/unable to retrieve backed up data; etc.).
- Technology issues such as choosing the most
convenient software platform for building prototypes have to be solved
(e.g. whether using a Java environment induces too many limitations on
what can be done, whether current Java implementations are available for
the chosen OS and architecture, etc.).