Introduction to Big Data and Hadoop (Ecosystem)

What is Big Data?

Big data refers to the large and complex sets of data generated by various sources in today's digital world. Any data that can be characterised by the three V's is considered big data:

  1. Volume: The amount of data is so huge that traditional systems cannot handle it.

  2. Variety: Variety refers to the different forms of data, such as

    (I) Structured Data:

    Data that follows a specific format and organisation, like rows in a database.

    (II) Semi-structured data:

    Spreadsheets, JSON, and XML files are examples of textual data files with an obvious pattern that allows for analysis.

    (III) Quasi-structured data:

    With work and the right software tools, text data in irregular formats, like click-stream data, can be formatted.

    (IV) Unstructured data:

    Data that lacks a natural organisation and is typically saved in various file types, including PDFs, photos, weblogs, etc.

  3. Velocity: The speed at which the data is generated and arrives.

A fourth "V" stands for Veracity, which refers to the quality and trustworthiness of the data (poor-quality, unclean data).
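The structured vs. semi-structured distinction above can be made concrete with a short sketch. The record and field names here are made up purely for illustration:

```python
import json

# Hypothetical example record shown two ways: structured (a fixed-schema
# row, as in a database table) and semi-structured (JSON, which is
# self-describing and can nest values like lists).

structured_row = ("u123", "Asha", 29)  # fixed schema: (id, name, age)

semi_structured = json.loads("""
{
  "id": "u123",
  "name": "Asha",
  "age": 29,
  "devices": ["phone", "laptop"]
}
""")

print(structured_row[2])               # the age column, by position
print(semi_structured["devices"][0])   # a nested field a fixed row cannot hold
```

A structured row is accessed by position against a known schema, while the JSON record carries its own field names and can hold nested values such as the `devices` list.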

Why Big Data, and what business objectives are we seeking to achieve?

To process huge amounts of data that traditional systems are not capable of processing. Before we can process the data, we first need to store it. This gives three objectives:

  1. Store - Store massive amounts of data

  2. Process - Process it as soon as possible.

  3. Scale - Scalable system which scales easily as data grows.

    There are two ways to build such a system:

    (I) Monolithic system: One powerful machine with a lot of resources (memory, storage, compute).

    • The disadvantages of a monolithic system

      (i) It is hard to add resources.

      (ii) It does not scale linearly: twice the resources does not mean twice the performance.

      This approach is vertical scaling, which is not true scaling.

(II) Distributed system: Many smaller systems come together to act as a single resource; each individual system is known as a node.

A group of nodes working together is known as a cluster.

  • Advantage of a distributed system

    It is scalable. Horizontal scaling is true scaling: to grow capacity, all we need to do is add more nodes.

Hadoop

  1. It is a framework (a collection of tools) for solving big data problems.

  2. In 2003, Google published a white paper (a design shared publicly as a theoretical notion, without an implementation) outlining a distributed architecture for storing huge datasets. This paper described the Google File System (GFS).

    E.g. 50 TB of data would be broken into blocks and stored across different nodes.

  3. In 2004, Google released another paper describing how to process large datasets. Since the data no longer resides on a single machine, the traditional programming model does not work; the computation has to move to the data. This paper was known as MapReduce.

  4. In 2006, Yahoo implemented the GFS design and named it the Hadoop Distributed File System (HDFS); the implementation of MapReduce kept the name MapReduce (unchanged).

    • Hadoop 1.0

      HDFS - Distributed storage

      MapReduce - Distributed processing; its code is written in Java.

  5. In 2009, Hadoop came under the Apache Software Foundation and became open source (its source code is publicly available).

  6. In 2013, Hadoop 2.0 arrived with major changes. MapReduce had become a bulky component and a bottleneck, so "YARN" was introduced; it took over some of MapReduce's responsibilities and added capabilities of its own.

    • YARN (Yet Another Resource Negotiator)

      It is mainly responsible for resource management. YARN is a large-scale distributed operating system for big data applications.
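The GFS/HDFS-plus-MapReduce flow described above can be sketched on a single machine: the input is split into blocks (HDFS's real default block size is 128 MB; a tiny size is used here so the demo splits), mappers emit key-value pairs from each block, and a reduce step aggregates by key. This is a conceptual Python simulation, not Hadoop's actual Java API:

```python
from collections import defaultdict

# Conceptual simulation of HDFS-style block splitting followed by the
# classic MapReduce word-count pattern. Block size and data are toy values.

BLOCK_SIZE = 20  # HDFS defaults to 128 MB blocks; tiny here for the demo

def split_into_blocks(text, size=BLOCK_SIZE):
    """Split the input into roughly fixed-size blocks on word boundaries."""
    blocks, current, length = [], [], 0
    for word in text.split():
        current.append(word)
        length += len(word) + 1
        if length >= size:
            blocks.append(current)
            current, length = [], 0
    if current:
        blocks.append(current)
    return blocks

def map_phase(block):
    # In Hadoop, each mapper runs on the node holding its block;
    # here it just emits (word, 1) pairs for one block.
    return [(word.lower(), 1) for word in block]

def reduce_phase(pairs):
    # The shuffle groups pairs by key; the reducer sums counts per word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

text = "big data needs big systems and big ideas"
pairs = [p for block in split_into_blocks(text) for p in map_phase(block)]
word_counts = reduce_phase(pairs)
print(word_counts["big"])  # → 3
```

The key point the MapReduce paper makes is visible even in this toy: each map call only sees its own block, so the work can run in parallel on the nodes that already hold the data, and only the small (word, count) pairs move across the network.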

Hadoop Ecosystem

  1. Sqoop (Data ingestion/migration)

    • Transfers data from traditional relational databases into the Hadoop environment and vice versa.

    • Internally it runs as a special MapReduce job in which only mappers run (there are no reducers).

  2. Apache Spark

    • A distributed general-purpose in-memory computing engine

    • Spark is an alternative to MapReduce; it is also a computation engine.

    • We can plug it into any storage system

      eg. Local storage, HDFS, S3

    • Plug it into any resource manager

      eg. YARN, Kubernetes

    • Spark is written in Scala, and Scala can be used for coding against it. It also supports Java, Python, and R.

  3. Hive

    • Hive was developed by Facebook (now Meta); its query language is known as Hive Query Language (HQL).

    • A data warehouse tool built on top of Apache Hadoop that provides data querying and analysis.

    • HQL is SQL-like; Hive internally converts a query into Java (MapReduce) code and triggers that code on the cluster.

  4. HBase

    It is a NoSQL database that runs on top of HDFS.

  5. PIG

    It was used for two purposes:

    • To clean the data using the scripting language "Pig Latin" (the alternative was to use a separate ETL tool).

    • To convert unstructured data into structured data.

Both of these tasks are now typically done with Apache Spark.

  6. Oozie

    A workflow scheduler system to manage Apache Hadoop jobs.
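The Spark engine described in item 2 above exposes operations such as flatMap, map, and reduceByKey on distributed collections. Their data flow can be imitated with plain Python to show what a Spark-style word count does; this is a single-machine simulation, not real pyspark code:

```python
from functools import reduce
from itertools import groupby

# Simulation of Spark's RDD-style word count: flatMap -> map -> reduceByKey.
# In real Spark each stage runs in parallel across the cluster, in memory.

lines = ["spark is fast", "spark is general purpose"]

# flatMap: split every line into individual words
words = [word for line in lines for word in line.split()]

# map: pair each word with a count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: group the pairs by word and sum the counts per group
pairs.sort(key=lambda kv: kv[0])
counts = {
    key: reduce(lambda a, b: a + b, (n for _, n in group))
    for key, group in groupby(pairs, key=lambda kv: kv[0])
}
print(counts["spark"])  # → 2
```

In actual PySpark the same pipeline would chain the real RDD methods (flatMap, map, reduceByKey) on data loaded from a storage system such as HDFS or S3, and the intermediate results would stay in memory across the cluster rather than in local lists.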