
Understanding Hadoop’s Distributed File System

Introduction

The Hadoop Distributed File System (HDFS) is a descendant of the Google File System, which was developed to solve the problem of processing big data at scale. At its core, HDFS is a distributed file system: a single large dataset can be split up and stored across several storage nodes within a compute cluster. This is how Hadoop offers scalable, reliable storage for large datasets in a distributed fashion.

Motivation for a Distributed Approach

Why would a company choose a distributed architecture over the usual single-node architecture?

  • Fault Tolerance: Because blocks are replicated across multiple nodes, a distributed architecture survives the failure of individual machines without losing data.

  • Scalability: HDFS scales horizontally; capacity grows simply by adding more nodes to the cluster, which lets a business scale easily with demand.

  • Access speeds: Because a dataset is spread across many nodes, large files can be read in parallel, giving faster bulk retrieval than a traditional relational database.

  • Cost effectiveness: At big data scale, a distributed approach is cheaper for storage and retrieval, since the extra processing required to read and write the same volume of data into a relational database would cost far more.

For all its advantages, HDFS and Hadoop work best for use cases with very large amounts of data. If your data is relatively small and manageable, HDFS is not the ideal choice.


HDFS Architecture

The file system follows a master-slave architecture.

Figure: Hadoop Distributed File System architecture

Components include:

NameNode: The master node. It maintains the file system namespace and metadata, and it regulates client access to files on the system.

DataNode: The slave node. It performs read/write operations on the file system as well as block operations such as creation, deletion, and replication.

Block: The unit of storage. A file is broken up into fixed-size blocks (128 MB by default in Hadoop 2 and later), and different blocks are stored on different DataNodes as directed by the NameNode.

Several DataNodes can be grouped into a single unit referred to as a rack. The rack design depends on the big data architect and what they wish to achieve; a typical architecture keeps replicas on more than one rack for redundancy in case an entire rack fails.
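
On a running cluster, you can see exactly how a file has been split into blocks and where its replicas live, which makes the block, replication, and rack ideas concrete. A minimal sketch, assuming Hadoop is installed and a file exists at the hypothetical path /user/hadoop/sample.txt:

hdfs fsck /user/hadoop/sample.txt -files -blocks -locations -racks

The report lists each block of the file along with the DataNodes (and, if rack awareness is configured, the racks) holding its replicas.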

Basic Usage Commands

To use the Hadoop file system, some background in Linux or general command-line usage helps, since most of the commands are similar to their Linux counterparts.


To start up HDFS, assuming you have Hadoop installed on your machine, run the command

start-dfs.sh
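
To confirm that the daemons came up, you can ask HDFS for a cluster report, which prints the live DataNodes and their capacity:

hdfs dfsadmin -report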

The general pattern is to take a basic Linux file system command and precede it with hadoop fs, so that it runs against HDFS rather than the local file system.
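
For example, listing the contents of the HDFS root directory looks just like the familiar Linux ls, with the hadoop fs prefix added:

hadoop fs -ls /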

To create a new directory within HDFS, use the command

hadoop fs -mkdir <directory_name>
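
As a concrete example (the path is hypothetical), the -p flag creates any missing parent directories, much like mkdir -p in Linux:

hadoop fs -mkdir -p /user/hadoop/input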

To copy a file from your host machine onto HDFS, use the command

hadoop fs -put ~/path/to/localfile /path/to/hadoop/directory
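
For instance, assuming a local file named sales.csv (a hypothetical name) and the directory created above, the upload and a quick check would look like:

hadoop fs -put ~/sales.csv /user/hadoop/input
hadoop fs -ls /user/hadoop/input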

The reverse of put is get, which copies files from HDFS back to the host file system.

hadoop fs -get /hadoop/path/to/file ~/path/to/local/storage
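
If you only want to view a file rather than copy it back, -cat streams its contents to your terminal; continuing the hypothetical example:

hadoop fs -cat /user/hadoop/input/sales.csv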

More Linux file system commands, such as chown, rm, mv, cp, ls, and others, are also available in HDFS.
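
A few of these in action, again using the hypothetical paths (and a hypothetical hadoop user and group) from above:

hadoop fs -cp /user/hadoop/input/sales.csv /user/hadoop/sales-copy.csv
hadoop fs -rm /user/hadoop/sales-copy.csv
hadoop fs -chown hadoop:hadoop /user/hadoop/input/sales.csv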

