Distributed File Systems and Object Storage: Understanding the Differences

By Aron Brand, CTERA Networks

With enterprises set to triple the amount of unstructured data they store over the next four years, according to Gartner, organisations are looking for efficient ways to manage and analyse that data. This trend has sparked a massive shift toward distributed file systems and object storage, technologies that enable enterprises to scale out linearly and cost-effectively to meet their performance and capacity needs.

While the two technologies are both essential for managing unstructured data, each is a discrete technology with a distinct set of attributes. This article outlines some of the basic differences between object storage and distributed file system storage for enterprises currently evaluating next-generation data storage and management options.

We’ll also dive a bit deeper into a comparison of two flavours of distributed file system – the Clustered Distributed File System (Clustered DFS) and the Federated Distributed File System (Federated DFS).

Gartner defines distributed file systems as follows:

“Distributed file system storage uses a single parallel file system to cluster multiple storage nodes together, presenting a single namespace and storage pool to provide high bandwidth for multiple hosts in parallel. Data is distributed over multiple nodes in the cluster to handle availability and data protection in a self-healing manner, and to scale both capacity and throughput in a linear manner.”

Like distributed file systems, object storage also distributes data over multiple nodes in order to provide self-healing and linear scaling in capacity and throughput.

But this is where the similarities end.

From a technical standpoint, object storage differs from file systems in three main areas:

  1. In a file system, files are arranged in a hierarchy of folders, while object storage systems are more like a “key-value store,” where objects are arranged in flat buckets.
  2. File systems are designed to allow random writes anywhere in a file, while object storage systems only allow atomic replacement of entire objects (illustrated in the code sketch after the comparison table below).
  3. Object storage systems provide eventual consistency, while distributed file systems can support strong consistency or eventual consistency (depending on the vendor). More about that later.

Here’s a side-by-side comparison:

Distributed File System           | Object Storage
----------------------------------|---------------------------------
Files in Hierarchical Directories | Objects in Flat Buckets
POSIX File Operations             | REST API
Random writes anywhere in file    | Atomically replace full objects
Strong or Eventual Consistency    | Eventual Consistency
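
To make the update semantics row concrete, here is a minimal Python sketch contrasting a POSIX random write with whole-object replacement. The file name, bucket name, and offsets are hypothetical, and the object storage side assumes the AWS boto3 SDK purely for illustration:

```python
import os

import boto3

# POSIX semantics: overwrite four bytes in the middle of an existing
# file in place; the rest of the file is untouched.
fd = os.open("data.bin", os.O_RDWR)
os.pwrite(fd, b"WXYZ", 1024)          # random write at byte offset 1024
os.close(fd)

# Object semantics: there is no partial update. To change the same four
# bytes, you read the whole object, patch it in memory, and atomically
# replace the entire object with a new version.
s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-bucket", Key="data.bin")["Body"].read()
body = body[:1024] + b"WXYZ" + body[1028:]
s3.put_object(Bucket="my-bucket", Key="data.bin", Body=body)
```

For a multi-gigabyte object, that read-modify-write cycle is precisely why random-write-heavy workloads favour a file system.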

Putting Theory into Practice

As noted, object storage and distributed file systems are both well suited for storing large amounts of unstructured data. Object storage exposes a REST API and is therefore limited to applications specifically designed to support this type of storage. In contrast, distributed file systems expose a traditional file system API, which makes them suitable for any application, including legacy applications that were designed to work over a hierarchical file system.
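
The API gap is easy to see side by side. The short sketch below, again in Python with a hypothetical bucket and the boto3 client used purely for illustration, reads the same logical data through a recursive directory walk and through a flat bucket listing:

```python
import os

import boto3

# File system API: a real directory hierarchy, walked recursively.
for dirpath, _dirnames, filenames in os.walk("/data/projects"):
    for name in filenames:
        print(os.path.join(dirpath, name))

# Object storage API: a flat bucket. "projects/" is not a directory,
# only a shared key prefix; the delimiter merely emulates folders.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="projects/", Delimiter="/")
for obj in resp.get("Contents", []):
    print(obj["Key"])
```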

Distributed file systems offer a richer and more general-purpose (but more complex) interface to applications, which enables them to perform operations for which object storage is not suited. Examples include acting as the backend for a database, or handling workloads that are heavy on random reads and writes.

Object storage, on the other hand, is more suitable for acting as a repository or archive of massive volumes of large files and comes at a significantly lower price per gigabyte than a distributed filesystem.

The CAP Theorem and Distributed File Systems

Not all distributed file systems are created equal – and the reason for this is firmly rooted in computer science theory. The CAP Theorem states that a distributed data store can have no more than two out of the following three properties:

  • Consistency: Every read receives the most recent write or an error
  • Availability: Every request receives a (non-error) response – without the guarantee that it contains the most recent write
  • Partition tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
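
As a toy illustration (not a real protocol), the following Python sketch caricatures a replica’s read path during a network partition – the choice is between returning an error (consistency) and returning possibly stale data (availability):

```python
class CPReplica:
    """Favours consistency: refuses reads it cannot verify as current."""

    def __init__(self):
        self.value = None
        self.partitioned = False  # True while cut off from peer nodes

    def read(self):
        if self.partitioned:
            raise RuntimeError("partition: latest write cannot be guaranteed")
        return self.value


class APReplica:
    """Favours availability: always answers, possibly with stale data."""

    def __init__(self):
        self.value = None
        self.partitioned = False

    def read(self):
        return self.value  # may be stale during a partition
```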

From this trade-off, it follows that there are two flavours of distributed file system on the market today:

Clustered Distributed File System

Clustered Distributed File Systems (Clustered DFS) consist of a strongly coupled cluster of nodes. They are geared towards strict data consistency and are especially suitable for large-scale computing use cases (e.g., big data analytics) at the enterprise core.

Clustered DFS focuses on the Consistency and Availability properties of the CAP theorem. Strong consistency guarantees do not come without a price – they create fundamental limitations on system operation and performance, particularly when the nodes are separated by high latency or unreliable links. Examples of Clustered DFS include products like Dell EMC Isilon and IBM Spectrum Scale.
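
A back-of-envelope calculation shows why. If every strongly consistent write must be synchronously acknowledged across the cluster, sequential write throughput for a single client is capped by the round-trip time between nodes; the latencies below are illustrative assumptions, not measurements:

```python
# One synchronous acknowledgement round trip per consistent write.
rtt_lan_ms = 0.5   # same-datacenter round trip (assumed)
rtt_wan_ms = 60.0  # cross-continent round trip (assumed)

print(1000 / rtt_lan_ms)  # ~2000 sequential consistent writes per second
print(1000 / rtt_wan_ms)  # ~16 writes per second over a high-latency link
```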

Federated Distributed File System

Federated Distributed File Systems focus on making data available over long distances with partition tolerance. As such, Federated DFS is well suited for weakly coupled edge-to-cloud use cases such as unstructured data storage and management for remote offices. Federated DFS focuses on the Availability and Partition tolerance properties of the CAP theorem and trades away the strict consistency guarantee.

In a Federated DFS, read and write operations on an open file are directed to a locally cached copy. When a modified file is closed, the changed portions are copied back from the edge to a central file service. In this process, update conflicts may occur and must be resolved automatically. It could be argued that Federated DFS combines the semantics of a file system with the eventual-consistency model of object storage.
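
A minimal sketch of this write-back flow, in Python, might look like the following. The central service, version counter, and last-writer-wins resolution are hypothetical simplifications; real federated file systems use considerably more sophisticated conflict handling:

```python
class CentralFileService:
    """Hypothetical central service: versioned file contents keyed by path."""

    def __init__(self):
        self.files = {}  # path -> (version, data)

    def fetch(self, path):
        return self.files.get(path, (0, b""))

    def store(self, path, data, base_version):
        current, _ = self.files.get(path, (0, b""))
        if base_version != current:
            # Another edge updated the file since we cached it: an update
            # conflict. This sketch resolves it automatically with
            # last-writer-wins; real systems may merge or fork copies.
            pass
        self.files[path] = (current + 1, data)


class EdgeCache:
    """Edge node: reads and writes hit a locally cached copy."""

    def __init__(self, central):
        self.central = central
        self.local = {}  # path -> (version, data)

    def open(self, path):
        if path not in self.local:
            self.local[path] = self.central.fetch(path)

    def write(self, path, data):
        self.open(path)
        version, _ = self.local[path]
        self.local[path] = (version, data)

    def close(self, path):
        # Only on close are changes copied back to the central service.
        version, data = self.local.pop(path)
        self.central.store(path, data, base_version=version)
```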

Examples of Federated DFS include the CTERA Global File System, as well as the venerable Andrew File System and Coda, developed at Carnegie Mellon University in the 1980s.

The following comparison table sums it all up:

Clustered DFS                                                    | Federated DFS
-----------------------------------------------------------------|------------------------------------------
Strongly consistent                                              | Partition tolerant, eventually consistent
Deployed in the core                                             | Deployed at edge and core
Strongly coupled nodes                                           | Weakly coupled edge nodes
Ideal for high performance computing (HPC); databases; analytics | Ideal for archiving; backup; media libraries; mobile data access; content distribution to edge locations; content ingestion from edge to cloud; ROBO storage; hybrid cloud storage

Clustered DFS and Federated DFS both have their places in the enterprise. To maximize benefits from a distributed file system, enterprises need to understand the differences between the two flavours and choose the option that best meets their application needs.

Aron Brand is CTO at CTERA Networks