Storage Tradeoffs in a Collaborative Backup Service for Mobile Devices
Ludovic Courtès, Marc-Olivier Killijian, David Powell

1 / 18 -- Context
The MoSAIC Project
  • 3-year project started in Sept. 2004: IRISA, Eurecom and LAAS-CNRS
  • supported by the French national program for Security and Informatics (ACI S&I)
  • communicating mobile devices (laptops, PDAs, cell phones)
  • mobile ad-hoc networks, spontaneous, peer-to-peer-like interactions
Dependability Goals
  • improving data availability
  • guarantee data integrity & confidentiality

Goals and Issues

2 / 18 -- Fault Tolerance for Mobile Devices
Costly and Complex Backup
  • only intermittent access to one's desktop machine
  • potentially costly communications (e.g., GPRS, UMTS)
Our Approach: Cooperative Backup
  • leverage encounters, opportunistically
  • high throughput, low energy cost (Wi-Fi, Bluetooth, etc.)
  • leverage excess resources
  • variety of independent failure modes
  • hopefully self-managed mechanism
Travelling and disseminating data

3 / 18 -- Challenges
Secure Cooperation
  • participants have no a priori trust relationship
  • protect against DoS attacks: data retention, selfishness, flooding
  • ideas from P2P: reputation mechanism, cooperation incentives, etc.
Trustworthy Data Storage
  • ensure data confidentiality
  • data integrity
  • data authenticity
  • more requirements...

Storage Mechanisms

4 / 18 -- Constraints Imposed on the Storage Layer
Scarce Resources
(energy, storage, CPU)
  • maximize storage efficiency
  • but avoid CPU-intensive techniques (compression, encryption)
Short-lived and Unpredictable Encounters
  • fragment data into small blocks & disseminate them among contributors
  • yet, retain transactional semantics of the backup (ACID)
Lack of Trust Among Participants
  • replicate data fragments
  • enforce data confidentiality, verify integrity & authenticity

5 / 18 -- Maximizing Storage Efficiency
Single-Instance Storage
  • reduce redundancy across files/file blocks
  • idea: store any given datum only once
  • used in: peer-to-peer file sharing, version control, etc.
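The idea can be sketched as a content-addressed store: the hash of a block serves as its key, so inserting the same datum twice costs nothing extra. A minimal sketch (class and method names are illustrative, not any particular system's API):

```python
import hashlib

class SingleInstanceStore:
    """Content-addressed store: identical blocks are stored only once."""

    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> str:
        # The block ID is the hash of its content, so a duplicate
        # block maps to the same key and occupies no extra space.
        key = hashlib.sha1(data).hexdigest()
        self.blocks.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self.blocks[key]

store = SingleInstanceStore()
k1 = store.put(b"hello world")
k2 = store.put(b"hello world")   # duplicate: stored once
assert k1 == k2 and len(store.blocks) == 1
```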
Generic Lossless Compression
  • well-known benefits (e.g., gzip, bzip2, etc.)
  • unclear resource requirements
Techniques Not Considered
  • differential compression: CPU- and memory-intensive, weakens data availability
  • lossy compression: too specific (image, sound, etc.)
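The "unclear resource requirements" of generic lossless compression can be probed directly: a small sketch that measures zlib's compression ratio and throughput on a given input (the helper name is illustrative):

```python
import time
import zlib

def compression_stats(data: bytes, level: int = 6):
    """Return (compression ratio, throughput in MiB/s) for zlib
    on `data`; lower ratio means better compression."""
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    dt = time.perf_counter() - t0
    return len(out) / len(data), len(data) / dt / 2**20
```

On low-entropy input (e.g., a mailbox file) the ratio is small; on high-entropy input (e.g., Ogg Vorbis) it hovers around 1, with CPU time spent for nothing.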

6 / 18 -- Chopping Data Into Small Blocks
Natural Solution: Fixed-Size Blocks
  • simple and efficient
  • similar data streams might yield common blocks
Finding More Similarities Using Content-Based Chopping
  • see Udi Manber, Finding Similar Files in a Large File System, USENIX, 1994
  • identifies identical sub-blocks among different data streams
  • to be coupled with single-instance storage
  • ⇒ improves storage efficiency? under what circumstances?
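Content-based chopping can be sketched as follows: cut the stream wherever a fingerprint of the last few bytes matches a fixed pattern, so boundaries depend on content rather than position, and an insertion early in a stream shifts block boundaries only locally. Manber's scheme uses Rabin fingerprints; a rolling byte sum stands in here only to keep the sketch short. With the default 10-bit mask, the expected block size is roughly 1 KiB.

```python
def chop(data: bytes, window: int = 16, mask: int = 0x3FF) -> list:
    """Content-defined chopping: cut wherever the fingerprint of the
    last `window` bytes has all bits of `mask` set.  (A real
    implementation would use Rabin fingerprints, as in Manber 1994.)"""
    blocks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h += byte
        if i >= window:
            h -= data[i - window]                # slide the window
        if i + 1 - start >= window and (h & mask) == mask:
            blocks.append(data[start:i + 1])     # cut point found
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])
    return blocks
```

Coupled with single-instance storage, two streams sharing long runs of bytes yield many common blocks, even when the shared data is not aligned on fixed-size boundaries.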

7 / 18 -- Providing a Suitable Meta-Data Format
Design Principle: Separation of Concerns
  • separate data from meta-data
  • separate stream meta-data from file meta-data
Indexing Individual Blocks
  • avoid block name clashes
  • block IDs must remain valid in time and space
Indexing Sequences of Blocks
  • produce a vector of block IDs
  • recursively chop it and index it
Stream meta-data recursive structure
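The recursive indexing scheme above can be sketched as follows: each block is stored under its 20-byte SHA-1 ID, the IDs are concatenated into ~1 KiB meta-data blocks (51 IDs each, an illustrative fanout), and the process recurses until a single root ID designates the whole stream. Function names are illustrative.

```python
import hashlib

def index_stream(blocks, store, fanout=51):
    """Index a sequence of blocks; return the root block ID."""
    if not blocks:
        raise ValueError("empty stream")
    ids = []
    for block in blocks:
        key = hashlib.sha1(block).digest()   # 20-byte, content-based ID
        store[key] = block
        ids.append(key)
    if len(ids) == 1:
        return ids[0]
    # Group the IDs into meta-data blocks and index those recursively.
    meta = [b"".join(ids[i:i + fanout]) for i in range(0, len(ids), fanout)]
    return index_stream(meta, store, fanout)
```

Because block IDs are content-based hashes, they remain valid regardless of where or when a block is stored, avoiding name clashes among contributors.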

8 / 18 -- Providing Data Confidentiality, Integrity, and Authenticity
Enforcing Confidentiality
  • encrypt both data & meta-data
  • use energy-efficient algorithms (e.g., symmetric encryption)
Allowing For Integrity Checks
  • protect against both accidental and malicious modifications
  • ⇒ store cryptographic hashes of (meta-)data blocks (e.g., SHA-1, RIPEMD-160)
  • ⇒ use hashes as a block naming scheme (content-based indexing)
  • ⇒ eases implementation of single-instance storage
Allowing For Authenticity Checks
  • cryptographically sign (part of) the meta-data
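Since a block's name *is* its hash under content-based indexing, corruption by a contributor is detected at retrieval time with no extra meta-data. A minimal sketch (the helper name is illustrative):

```python
import hashlib

def fetch_and_verify(store, key: str) -> bytes:
    """Fetch a block and check it against its content-based name."""
    data = store[key]
    if hashlib.sha1(data).hexdigest() != key:
        raise IOError("block %s failed integrity check" % key)
    return data
```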

9 / 18 -- Enforcing Backup Atomicity
Comparison With Distributed and Mobile File Systems
  • backup: only a single writer and reader
  • thus, no consistency issues due to parallel accesses
Using Write-Once Semantics
  • data is always appended, not modified
  • previous versions are kept
  • allows for atomic insertion of new data
  • used in: peer-to-peer file sharing, version control
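Write-once semantics can be sketched as an append-only keyed store: blocks are never modified in place, so a new backup version becomes visible atomically when its root ID is recorded, and earlier versions stay intact. Names are illustrative.

```python
class WriteOnceStore:
    """Append-only keyed store with atomic version commits."""

    def __init__(self):
        self.blocks = {}
        self.roots = []          # one root ID per committed version

    def put(self, key, data):
        # Re-inserting identical data is harmless; rewriting is not.
        if key in self.blocks and self.blocks[key] != data:
            raise ValueError("write-once violation for %s" % key)
        self.blocks[key] = data

    def commit(self, root_key):
        # A version exists only once its root is recorded: atomic insertion.
        assert root_key in self.blocks
        self.roots.append(root_key)
```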

10 / 18 -- Replication Using Erasure Codes
Erasure Codes at a Glance
  • b-block message → S × b coded blocks
  • any m coded blocks suffice to recover the message, b < m < S × b
  • S ∈ ℝ: stretch factor, i.e., storage overhead
  • failures tolerated: (S × b) − m
  • more storage-efficient than simple replication
  • but what is the impact on data availability, compared to simple replication?
An illustration
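Plugging in toy numbers (b = 4, S = 2, m = 5; all illustrative) makes the tradeoff concrete:

```python
def erasure_params(b: int, S: float, m: int):
    """Erasure-coded backup: a b-block message yields S*b coded
    blocks, any m of which suffice to recover it (b <= m <= S*b).
    Returns (number of coded blocks, failures tolerated)."""
    n = int(S * b)
    assert b <= m <= n
    return n, n - m

n, f = erasure_params(b=4, S=2, m=5)
# 8 coded blocks; any 3 losses are tolerated at 2x the storage.
# Simple replication with the same 2x storage (two copies of each
# block) tolerates only 1 loss in the worst case, when both copies
# of the same block vanish.
```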

Preliminary Evaluation of Storage Mechanisms

11 / 18 -- Our Storage Layer Implementation: libchop
Key Components
  • chopper, block & stream indexers, keyed block store
  • provides several implementations of each component
Strong Focus on Compression Techniques
  • single-instance storage (SHA-1-based block indexing)
  • content-based chopping (Manber's algorithm)
  • zlib compression filter (similar to gzip)
An illustration
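The composition of these components can be sketched as follows; all function and parameter names are illustrative, not libchop's actual API. A chopper splits the stream, each block is (optionally) compressed and stored under its SHA-1 (single-instance storage), and the index of block IDs names the stream.

```python
import hashlib
import zlib

def backup_stream(data: bytes, store: dict,
                  chopper, zip_blocks: bool = False) -> str:
    """Chop, (optionally) compress, and index a data stream;
    return the ID naming the whole stream."""
    ids = []
    for block in chopper(data):
        if zip_blocks:
            block = zlib.compress(block)
        key = hashlib.sha1(block).hexdigest()
        store.setdefault(key, block)        # single-instance storage
        ids.append(key)
    index = "\n".join(ids).encode()
    root = hashlib.sha1(index).hexdigest()
    store.setdefault(root, index)
    return root
```

For example, with a fixed-size 1024 B chopper, a stream made of four identical blocks stores only one data block plus one index block.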

12 / 18 -- Experimental Setup
  • storage efficiency
  • computational cost (throughput)
  • ... for different combinations of algorithms
File Sets
  • a single mailbox file (low entropy)
  • C program, several versions (low entropy, high redundancy)
  • Ogg Vorbis files (high entropy, hardly compressible)

13 / 18 -- Algorithmic Combinations
Config. | Single Instance? | Chopping Algo. | Expected Block Size | Input Zipped? | Blocks Zipped?
B1      | yes              | Manber's       | 1024 B              | no            | no
B2      | yes              | Manber's       | 1024 B              | no            | yes
B3      | yes              | fixed-size     | 1024 B              | no            | yes
C       | yes              | fixed-size     | 1024 B              | yes           | no

14 / 18 -- Storage Efficiency & Computational Cost Assessment
Config. | Summary                    | Resulting Data Size       | Throughput (MiB/s)
        |                            | C files | Ogg  | mbox     | C files | Ogg | mbox
A1      | (without single instance)  | 26%     | 100% | 55%      | 21      | 15  | 18
A2      | (with single instance)     | 13%     | 100% | 55%      | 22      | 15  | 17
B2      | Manber + zipped blocks     | 11%     | 103% | 58%      | 7       | 5   | 10
B3      | fixed-size + zipped blocks | 18%     | 103% | 71%      | 11      | 5   | 18
C       | fixed-size + zipped input  | 13%     | 102% | 57%      | 22      | 5   | 21

15 / 18 -- Storage Efficiency & Computational Cost Assessment
Single-Instance Storage
  • mostly beneficial in the multiple version case (50% improvement)
  • computationally inexpensive
Content-Defined Blocks
  • mostly beneficial in the multiple version case
  • computationally costly
Lossless Compression
  • inefficient on high-entropy data (Ogg files)
  • otherwise, always beneficial (block-level or whole-stream-level)

16 / 18 -- Conclusions
Implementation of a Flexible Prototype
  • allows the combination of various storage techniques
Assessment of Compression Techniques
  • tradeoff between storage efficiency & computational cost
  • most suitable: lossless input compression + fixed-size chopping + single-instance storage
Six Essential Storage Requirements
  • storage efficiency
  • small data blocks
  • backup atomicity
  • error detection
  • encryption
  • backup redundancy

17 / 18 -- On-Going & Future Work
Improved Energetic Cost Assessment
  • build on the computational cost measurements (execution time ≈ energy)
  • see Barr et al. Energy-Aware Lossless Data Compression, ACM Trans. on Comp. Sys., Aug. 2006
Algorithmic Evaluation
  • identify tradeoffs in the replication/dissemination processes (Markov chain analysis)
  • develop algorithms to dynamically adapt to the environment (?)
Design & Implementation
  • finalize the overall architecture
  • integrate required technologies: service discovery, authentication, etc.
  • interface with trust management mechanisms

18 / 18 -- Thank you!

