Comparing the Hadoop Distributed File System (HDFS) with the Cassandra File System (CFS)

WHITE PAPER

By DataStax Corporation

September 2012

© 2012 DataStax. All rights reserved.

Contents
Introduction
Overview of HDFS
The Benefits of HDFS
  Built-In Redundancy and Failover
  Big Data Capable
  Portability
  Cost-Effective
What Is Apache Cassandra?
What Is the Cassandra File System (CFS)?
  How Does CFS Work?
  Hadoop Compatibility and Command Management
Benefits of CFS
  Simpler Deployment
  More Scalability
  Better Availability
  Multi-Data Center Support
  No Shared Storage Requirement for Failover
  Full Data Integration
  Commodity Hardware Support
  What About Performance?
Managing and Monitoring CFS Deployments
Other Benefits of DataStax Enterprise
Who Is Using CFS?
Conclusion
About DataStax

Figure 2: A Simple DataStax Enterprise Cluster

CFS stores metadata about Hadoop data in a Cassandra keyspace, which is analogous to a
database in the relational database management system (RDBMS) world. Two Cassandra
column families (like tables in an RDBMS) in the keyspace contain the actual data. The data in
these column families is replicated across the cluster to ensure data protection and fault
tolerance.

The column families mirror the two primary HDFS services. The inode column family replaces
the HDFS NameNode service, which tracks each data file's metadata and block locations on the
participating Hadoop nodes. Information captured in this column family includes the filename,
parent path, user, group, permissions, file type, and a list of block ids that make up the file. Block
ids are TimeUUIDs, so blocks are naturally ordered in sequence. This makes it easy to support
HDFS functions like append().
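
As a rough illustration of why time-ordered block ids help, here is a minimal Python sketch of an inode row. The Inode class, its field types, and the time_uuid helper are hypothetical names chosen for this example and are not the CFS implementation; version-1 UUIDs from Python's uuid module stand in for Cassandra's TimeUUID type.

```python
import uuid
from dataclasses import dataclass, field

# Offset between the UUID epoch (1582-10-15) and the Unix epoch, in 100 ns units.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def time_uuid(unix_ns: int, node: int = 1, clock_seq: int = 0) -> uuid.UUID:
    """Build a version-1 (time-based) UUID for a given Unix time (hypothetical helper)."""
    ts = unix_ns // 100 + UUID_EPOCH_OFFSET
    return uuid.UUID(fields=(
        ts & 0xFFFFFFFF,                    # time_low
        (ts >> 32) & 0xFFFF,                # time_mid
        ((ts >> 48) & 0x0FFF) | 0x1000,     # time_hi with version 1
        ((clock_seq >> 8) & 0x3F) | 0x80,   # clock_seq_hi with RFC 4122 variant
        clock_seq & 0xFF,                   # clock_seq_low
        node,                               # 48-bit node id
    ))


@dataclass
class Inode:
    """Illustrative stand-in for one row of the inode column family."""
    filename: str
    parent_path: str
    user: str
    group: str
    permissions: int
    filetype: str
    block_ids: list = field(default_factory=list)

    def append_block(self, block_id: uuid.UUID) -> None:
        # The list may hold ids in any order; the embedded timestamps
        # define the order of blocks within the file.
        self.block_ids.append(block_id)

    def blocks_in_order(self) -> list:
        # Sort by the timestamp embedded in each time-based id.
        return sorted(self.block_ids, key=lambda b: b.time)


base = 1_346_457_600_000_000_000  # an arbitrary instant (2012-09-01 UTC), in ns
first, second, third = (time_uuid(base + i * 1_000_000) for i in range(3))

inode = Inode("foo.txt", "/myfiles", "hadoop", "users", 0o644, "file")
for block_id in (second, first):      # stored out of order on purpose
    inode.append_block(block_id)
inode.append_block(third)             # an append() adds one more block id

assert inode.blocks_in_order() == [first, second, third]
```

Because a block id minted later always carries a later timestamp, an append only has to add one id to the row; no existing entry needs to be renumbered.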

The sblocks column family supplants the HDFS DataNode service that stores file blocks. This
column family stores the actual contents of any file that is added to a Hadoop node.

Each row in sblocks represents a block of data associated with a row in the inode column
family. Each row key is a block TimeUUID from an inode row. The columns are time-ordered
compressed sub-blocks that, when decompressed and combined, equal one HDFS block.
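
The relationship between a row's columns and the block they represent can be sketched as follows. This is an illustration only: the dict-based row, the integer column names, and the function names are assumptions made for the example, and zlib from the Python standard library stands in for the compression codec CFS uses.

```python
import zlib


def store_block(block: bytes, subblock_size: int) -> dict:
    """Return one illustrative sblocks row: column name -> compressed sub-block."""
    row = {}
    for seq, start in enumerate(range(0, len(block), subblock_size)):
        # Column names here are plain sequence numbers; the paper describes
        # the real columns as time-ordered.
        row[seq] = zlib.compress(block[start:start + subblock_size])
    return row


def load_block(row: dict) -> bytes:
    """Decompress the columns in column order and join them into one block."""
    return b"".join(zlib.decompress(row[name]) for name in sorted(row))


block = bytes(range(256)) * 40           # 10,240 bytes of sample data
row = store_block(block, subblock_size=4096)

assert len(row) == 3                     # 4096 + 4096 + 2048 bytes
assert load_block(row) == block
```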

Figure 3: CFS Column Families

When data is added to a Hadoop node, CFS writes the static metadata attributes to the inode
column family. It then allocates a new sblocks row object, reads a chunk of that data (controlled
via the Hadoop parameter fs.local.block.size), splits it into sub-blocks (controlled via the
parameter cfs.local.subblock.size), and compresses them using Google's Snappy
compression.

Once a specific block is complete, its block id is written to the inode column family row and the
sub-blocks are written to Cassandra with the block id as the row key and the sub-block ids as the
columns.
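
The write sequence described above can be sketched in Python under some loud assumptions: plain dicts stand in for the two column families, uuid.uuid1() stands in for TimeUUID generation, zlib stands in for Snappy, and the function name, dict layout, and tiny sizes are hypothetical. The block_size and subblock_size arguments play the roles of the two parameters named in the text.

```python
import uuid
import zlib


def write_file(path, data, block_size, subblock_size, inode_cf, sblocks_cf):
    """Illustrative write path using dicts as the inode and sblocks column families."""
    # 1. Static metadata goes to the inode row first; no blocks are listed yet.
    inode_cf[path] = {"filename": path.rsplit("/", 1)[-1], "block_ids": []}

    for b_start in range(0, len(data), block_size):
        # 2. Read one block-sized chunk and allocate an id for its sblocks row.
        chunk = data[b_start:b_start + block_size]
        block_id = uuid.uuid1()
        columns = {}
        for s_start in range(0, len(chunk), subblock_size):
            # 3. Split into sub-blocks and compress each one into its own column.
            sub_id = uuid.uuid1()
            columns[sub_id] = zlib.compress(chunk[s_start:s_start + subblock_size])
        # 4. Once the block is complete, record its id in the inode row and
        #    store the sub-blocks under that id as the row key.
        inode_cf[path]["block_ids"].append(block_id)
        sblocks_cf[block_id] = columns


inode_cf, sblocks_cf = {}, {}
write_file("/myfiles/foo.txt", b"x" * 10_000, 4096, 1024, inode_cf, sblocks_cf)

block_ids = inode_cf["/myfiles/foo.txt"]["block_ids"]
assert len(block_ids) == 3                                   # 4096 + 4096 + 1808 bytes
assert [len(sblocks_cf[b]) for b in block_ids] == [4, 4, 2]  # sub-blocks per block
```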

Reads are handled in a straightforward manner. When a query request comes into a Hadoop
node, CFS reads the inode information and finds the block and sub-block(s) needed to satisfy
the request.
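
To make the lookup step concrete, the following Python sketch computes which blocks and sub-blocks cover a requested byte range. It assumes fixed block and sub-block sizes (every block except the last is full) and that block index i corresponds to the i-th block id in the file's inode row; the locate function is a hypothetical helper, not part of CFS.

```python
def locate(offset: int, length: int, block_size: int, subblock_size: int) -> list:
    """Return (block_index, subblock_index) pairs covering bytes [offset, offset + length)."""
    if length <= 0:
        return []
    needed = []
    first_block = offset // block_size
    last_block = (offset + length - 1) // block_size
    for b in range(first_block, last_block + 1):
        block_start = b * block_size
        # Clip the requested range to this block, relative to the block start.
        lo = max(offset, block_start) - block_start
        hi = min(offset + length, block_start + block_size) - 1 - block_start
        for s in range(lo // subblock_size, hi // subblock_size + 1):
            needed.append((b, s))
    return needed


# A 200-byte read that straddles the boundary between the first two blocks.
assert locate(4000, 200, block_size=4096, subblock_size=1024) == [(0, 3), (1, 0)]
# A read that stays inside one block but spans two sub-blocks.
assert locate(1024, 2048, block_size=4096, subblock_size=1024) == [(0, 1), (0, 2)]
```

Only the listed sub-blocks would need to be fetched and decompressed, which is why a small read does not require pulling a whole block.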

Hadoop Compatibility and Command Management
CFS implements the Hadoop File System API so it is compatible with all layers of the Hadoop
stack and third-party tools.

CFS provides commands that parallel the standard Hadoop commands, so there is little to no
learning curve. For instance, the following Hadoop command:

bin/hadoop fs -cat /myfiles/foo.txt

would be run inside of CFS/DataStax Enterprise as:

bin/dse hadoop fs -cat /myfiles/foo.txt

Who Is Using CFS?
Many modern businesses and organizations are using Cassandra for critical applications today.
Here are just some examples:

Figure 6: A sample of companies and organizations using Cassandra in production

Some DataStax customers using CFS include:

• eBay – Uses DataStax Enterprise across multiple data centers, with one data center being
devoted to CFS and Hadoop analytics.

• HealthCare Anytime – Employs DataStax Enterprise with CFS and Hadoop for their online
patient portals, with analytics being needed to produce proper billing for Medicare/Medicaid.

• Next Big Sound – Uses DataStax Enterprise and CFS with Hadoop to analyze large amounts
of social media information pertaining to music artist popularity on the Web.

• ReachLocal – Uses DataStax Enterprise and CFS in six different data centers across the
world to support their global online advertising business, with Hadoop analytics being part of
their infrastructure.

• SimpleReach – Deploys DataStax Enterprise and CFS to provide clients with Google
analytics ability for their websites, which allows them to know how all their content is being
referenced socially.

• SourceNinja – Utilizes CFS and DataStax Enterprise with Hadoop to provide a single source
of all information about open source software and updates to that software deployed by their
subscribers.

Conclusion
While HDFS is a good solution for providing cost-effective storage for Hadoop analytic systems,
CFS supplies all the same features as HDFS and delivers a number of other compelling benefits
to those who are looking for a proven and trusted platform for their big data applications.

To find out more about Cassandra and DataStax, and to obtain downloads of Cassandra
and DataStax Enterprise software, please visit www.datastax.com or send an email to
[email protected]. Note that DataStax Enterprise Edition is completely free to use in
development environments, while production deployments require the purchase of a
software subscription.

About DataStax
DataStax, the commercial leader in Apache Cassandra™, offers products and services that make
it easy for customers to build, deploy, and operate big data applications. Over 200 customers use
DataStax today, including leaders such as Netflix, Cisco, Rackspace, and Constant Contact, with
industries served including web, financial services, telecommunications, logistics, and
government.

DataStax is backed by industry-leading investors, including Lightspeed Venture and Crosslink,
and is based in San Mateo, CA, with offices also in Austin, TX. For more information, visit
www.datastax.com.
