The Hadoop ecosystem contains various tools that provide different services for solving and analyzing big data problems. It also includes multiple components that support the different stages of big data processing.
Components of Hadoop ecosystem
HDFS: Firstly, the Hadoop Distributed File System (HDFS) is the core storage component of the Hadoop ecosystem. It stores different types of data sets. HDFS has two core components.
- First is the Name Node. The Name Node stores only metadata (such as file names and block locations), so it requires relatively little storage.
- Second is the Data Node. Data Nodes store the actual data blocks and therefore require more storage resources.
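The division of labour between the two node types can be sketched in a few lines of Python. This is an illustrative toy model, not the real HDFS API: the file name, block size, and round-robin placement are simplified assumptions for the example.

```python
# Toy model of HDFS storage (not the real API): a file is split into
# fixed-size blocks held by Data Nodes, while the Name Node keeps only
# small metadata records describing where each block lives.

BLOCK_SIZE = 8  # real HDFS defaults to 128 MB; tiny here for illustration

def store_file(name, data, data_nodes):
    """Split data into blocks, place them round-robin across data nodes,
    and return the metadata the Name Node would keep."""
    metadata = []
    for i in range(0, len(data), BLOCK_SIZE):
        block_id = i // BLOCK_SIZE
        node = block_id % len(data_nodes)
        data_nodes[node].append(data[i:i + BLOCK_SIZE])  # Data Node holds bytes
        metadata.append((name, block_id, node))          # Name Node holds location only
    return metadata

nodes = [[], [], []]                                     # three empty Data Nodes
meta = store_file("report.txt", b"hello big data world!", nodes)
print(meta)  # [('report.txt', 0, 0), ('report.txt', 1, 1), ('report.txt', 2, 2)]
```

Note how the metadata is just a handful of small tuples, which is why the Name Node needs far less storage than the Data Nodes holding the actual bytes.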
YARN: Secondly, YARN is like the brain of the ecosystem: it coordinates all processing activity by allocating resources and scheduling tasks. Following are the components of YARN:
- Resource Manager: The Resource Manager is the main (master) node. It receives processing requests and passes them on to the appropriate Node Managers.
- Node Manager: It is responsible for the execution of tasks on every single Data Node.
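The request flow between the two components can be sketched as follows. The class and method names are hypothetical, not the real YARN API; the point is only that the Resource Manager picks a node with spare capacity and the Node Manager actually runs the task.

```python
# Illustrative sketch of YARN-style scheduling (names are made up, not
# the real YARN API): the Resource Manager receives processing requests
# and hands each task to a Node Manager with enough free capacity.

class NodeManager:
    def __init__(self, name, capacity):
        self.name, self.capacity, self.tasks = name, capacity, []

    def run(self, task, cost):
        # The Node Manager executes the task on its data node
        self.tasks.append(task)
        self.capacity -= cost

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit(self, task, cost):
        # Pick the node with the most free capacity that can fit the task
        node = max(self.node_managers, key=lambda n: n.capacity)
        if node.capacity < cost:
            raise RuntimeError("no capacity for task %s" % task)
        node.run(task, cost)
        return node.name

rm = ResourceManager([NodeManager("node-1", 4), NodeManager("node-2", 2)])
print(rm.submit("map-task-0", 2))  # -> node-1 (most free capacity)
print(rm.submit("map-task-1", 2))  # -> node-1 again (tie broken by list order)
```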
MapReduce: Thirdly, MapReduce is a framework that helps in writing applications that process large datasets using distributed and parallel algorithms. Following are the two main functions of MapReduce:
- Map: This function applies filtering, grouping and sorting techniques to the input and emits intermediate key-value pairs.
- Reduce: This function summarizes the results it receives from the Map function.
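The two functions are easiest to see in the classic word-count example. This is a minimal sketch of the MapReduce model in plain Python, not the Hadoop Java API: map emits (word, 1) pairs, a shuffle step groups them by key, and reduce sums each group.

```python
# Word count in the MapReduce style (plain Python, not the Hadoop API).

from collections import defaultdict

def map_phase(line):
    # Map: split/filter the input line and emit intermediate (key, value) pairs
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce: summarise all the values received for one key
    return word, sum(counts)

lines = ["big data big ideas", "data drives ideas"]
shuffled = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):  # map step
        shuffled[word].append(one)     # shuffle/sort step: group values by key
result = dict(reduce_phase(w, c) for w, c in shuffled.items())
print(result)  # {'big': 2, 'data': 2, 'ideas': 2, 'drives': 1}
```

In real Hadoop the map and reduce calls run in parallel on many Data Nodes, with the framework handling the shuffle between them.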
Apache Pig: It has two main parts. First is Pig Latin, the scripting language. Second is the Pig runtime, the execution environment. Pig performs various functions such as grouping, joining and sorting to analyze large data sets. In addition, we can either dump the result on screen or store it back in HDFS.
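To make the grouping and joining operations concrete, here is a rough Python analogue (not Pig Latin syntax; the sales and cities relations are invented sample data) of what a short Pig script does.

```python
# Rough Python analogue of Pig-style GROUP and JOIN over two relations.

from collections import defaultdict

sales = [("london", 120), ("paris", 80), ("london", 40)]   # (city, amount)
cities = [("london", "UK"), ("paris", "France")]           # (city, country)

# GROUP sales BY city, then SUM the amounts within each group
totals = defaultdict(int)
for city, amount in sales:
    totals[city] += amount

# JOIN the grouped totals with the cities relation on the city key
joined = [(city, country, totals[city]) for city, country in cities]
print(joined)  # [('london', 'UK', 160), ('paris', 'France', 80)]
```

In Pig the same pipeline would be a few lines of Pig Latin, with the result either dumped to the screen or stored back in HDFS.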
Apache Hive: Hive is a component that performs read and write operations and manages large data sets in a distributed environment using an SQL-like interface. Moreover, Hive has two main components: the command line and the JDBC/ODBC driver. Hive is highly scalable and supports all primitive SQL data types.
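Hive itself runs over data in HDFS, but the flavour of its SQL-like interface can be shown with Python's built-in sqlite3 as a stand-in that runs anywhere. The table and column names below are made up for illustration; HiveQL statements for the same query would look very similar.

```python
# Stand-in for a HiveQL session (sqlite3, not Hive): create a table,
# write some rows, and read them back with a SQL aggregate query.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", 300), ("docs", 120), ("home", 50)])
rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('docs', 120), ('home', 350)]
```

In Hive, the same kind of statement would be issued from the command line or through the JDBC/ODBC driver, and the query would be translated into jobs over the distributed data.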
Modes in Hadoop
Local or Standalone Mode: This is the default mode in which Hadoop runs. It is used for debugging, where we can use input and output on the local file system. Moreover, it is the fastest mode as it uses the local system.
Pseudo-distributed Mode: In this mode, the Data Node and Name Node reside on the same machine. It is mainly used for testing purposes.
Fully distributed Mode: In this mode, Data Nodes and the Name Node reside on different machines. It offers distributed computing capability and reliability.
Advantages of Hadoop ecosystem
- Scalability: Hadoop is a highly scalable storage platform, as it can store and distribute large datasets across hundreds of inexpensive servers.
- Reduces costs: Hadoop is cost effective, as it can affordably store all of a company’s data for later use. Moreover, it offers storage and computing capabilities for hundreds of pounds per terabyte.
- Flexibility: It is flexible, as it enables businesses to easily access new data sources and tap into different types of data.
In conclusion, we have learnt about the Hadoop ecosystem in brief. We have seen the various components of Hadoop and learnt about its advantages.