Big Data Interview Questions and Answers

1. What do you understand by the term big data?

Big data is a simple term for large and complex datasets. Relational databases cannot handle data at this scale, so special tools and methods are used to perform operations on these vast collections of data.

Big data enables companies to understand their business better and to derive meaningful information from the unstructured, raw data they collect on a regular basis. It also allows companies to make better business decisions that are backed by data.

2. Can you explain Hadoop?

Data analysis deals with massive amounts of structured, semi-structured, and unstructured data. Analyzing unstructured data is quite difficult, and this is where Hadoop plays a major role with its capabilities in the following areas.

Storage.
Processing.
Data collection.

Hadoop is open source and runs on commodity hardware, which makes it a cost-effective solution for businesses.

3. Can you explain the 5 V’s of big data?

The five V’s of big data are as follows.

Volume: Volume represents the amount of data, which is growing at a high rate.
Velocity: Velocity represents the rate at which the data grows. Social media plays a major role in the velocity of growing data.
Variety: Variety represents the different data types, that is, the various data formats such as text, audio, and video.
Veracity: Veracity represents the uncertainty of the available data. It arises because a high volume of data brings incompleteness and inconsistency.
Value: Value represents turning the data into value. By turning the big data they have accessed into value, businesses can generate revenue.

4. How are big data and Hadoop related?

Big data and Hadoop are often treated as nearly synonymous. With the rise of big data, the Hadoop framework, which specializes in big data operations, became very popular. Professionals use the framework to analyze big data and to help businesses make decisions.

5. How is big data useful to businesses?

Big data helps companies understand their customers. It allows them to draw conclusions from large data sets collected over the years, which helps them make better decisions.

6. Which Hadoop tools do you consider essential for big data to work effectively?

The following Hadoop tools enhance the performance of big data.

HBase.
Ambari.
Hive.
HDFS.
Sqoop.
ZooKeeper.
Pig.
Lucene/Solr.
NoSQL.
Oozie.
Mahout.
GIS Tools.
Avro.
Clouds.
Flume.
SQL on Hadoop.

7. Can you define HDFS and YARN and explain their respective components?

HDFS is Hadoop’s default storage unit. It is responsible for storing various types of data in a distributed environment.

HDFS has two components. They are as follows.

NameNode – This is the master node. It holds the metadata for all data blocks in HDFS.
DataNode – These nodes act as slave nodes. They are responsible for storing the data.

YARN stands for Yet Another Resource Negotiator. It is responsible for resource management and for providing execution environments.

YARN has two components. They are as follows.

ResourceManager – It is responsible for allocating resources to the NodeManagers based on their needs.
NodeManager – It executes the tasks on every DataNode.

8. Can you explain commodity hardware?

Commodity hardware refers to inexpensive, widely available hardware that meets the minimum requirements for running the Apache Hadoop framework. Any hardware that supports Hadoop’s minimal requirements is called ‘commodity hardware’.

9. Can you explain FSCK?

FSCK stands for Filesystem Check. It is the command used to run a Hadoop summary report that describes the state of HDFS. It only checks for errors and does not correct them. The command can be run on the whole file system or on a subset of files.

10. What is the purpose of the JPS command in Hadoop?

The JPS command is used to check whether the Hadoop daemons are working. It lists all the daemons that are running, such as NameNode, DataNode, ResourceManager, and NodeManager.

11. HDFS is meant for applications with large data sets. Why is it not the right tool when there are many small files?

HDFS is usually not a good fit for small amounts of data spread across many files. The reason is the NameNode, which is a costly, high-performance system. The NameNode keeps metadata for every file, so its space is better spent on the metadata of a single large file than on the metadata of many small files.

When a large quantity of data is stored in a single file, the NameNode uses less space and performs better. We can therefore conclude that HDFS is better suited to a small number of huge files than to many files that each hold little data.
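
The effect can be illustrated with a rough estimate. The sketch below is only an approximation: the figure of about 150 bytes of NameNode memory per file or block entry is a commonly quoted rule of thumb rather than a value from this article, and the 128 MB block size is an assumed default.

```python
import math

BYTES_PER_OBJECT = 150          # assumed rule of thumb per file or block entry
BLOCK_SIZE = 128 * 1024 ** 2    # assumed HDFS block size (128 MB)

def namenode_bytes(num_files, file_size):
    """Estimate NameNode memory for num_files files of file_size bytes each."""
    blocks_per_file = max(1, math.ceil(file_size / BLOCK_SIZE))
    objects = num_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT

ONE_GB = 1024 ** 3
# The same 1 GB of data, stored as one large file or as 1 KB files
big = namenode_bytes(1, ONE_GB)
small = namenode_bytes(ONE_GB // 1024, 1024)
print(big, small)
```

Under these assumptions the single large file needs a few kilobytes of metadata, while the same data split into 1 KB files needs hundreds of megabytes.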

12. Can you tell me the port number for the NameNode?

The default port number for the NameNode web interface is 50070 in Hadoop 2.x. In Hadoop 3.x it is 9870.

13. Can you explain the steps involved in deploying a big data solution?

There are three steps to follow when deploying a big data solution. They are as follows.

Data Ingestion: The first step in deploying a big data solution is data ingestion. The data is extracted from different sources, such as a CRM like Salesforce, an Enterprise Resource Planning system like SAP, an RDBMS like MySQL, or log files, documents, social media feeds, and so on. Data is ingested either through batch jobs or through real-time streaming. The extracted data is then stored in HDFS.

Data Storage: Once data ingestion is complete, the next step is storing the extracted data. Data is stored either in HDFS or in a NoSQL database such as HBase. HDFS storage works well for sequential access, whereas HBase suits random read and write access.

Data Processing: The last step in the deployment is data processing. The data is processed with one of the processing frameworks, such as MapReduce, Pig, or Spark.

14. What is the difference between NAS and HDFS?

The main differences between NAS and HDFS are as follows.

HDFS runs on a cluster of machines, whereas NAS runs on a single machine. Because HDFS replicates data blocks, data redundancy is common in HDFS. NAS uses a different replication protocol, so the chance of data redundancy is much lower.
In HDFS, data is stored as data blocks on the local drives of the machines. In NAS, it is stored on dedicated hardware.

15. Can you give me the command to format the NameNode?

$ hdfs namenode -format

16. What is the function of the JPS command?

JPS is used to test whether all of the Hadoop daemons are running properly.

17. How would you start all the Hadoop daemons together?

Start all the daemons with “start-all.sh”. This shell script is in the sbin directory under the Hadoop root directory.

18. How would you stop all the Hadoop daemons together?

Stop all the daemons with “stop-all.sh”. This shell script is in the sbin directory under the Hadoop root directory.

19. Can you list the features of Hadoop?

Some of the most useful features of Hadoop are as follows.

Open-Source: Hadoop is an open-source platform. Its code can be rewritten or modified according to user and analytics requirements.
Scalability: Hadoop scales by adding hardware resources in the form of new nodes.
Data Recovery: Hadoop replicates data, which allows the data to be recovered in case of a failure.
Data Locality: Hadoop moves the computation to the data and not the other way around. This speeds up the whole process.

20. What is the port number for the TaskTracker?

The default port number for the TaskTracker is 50060.

21. What is the port number for the JobTracker?

The default port number for the JobTracker is 50030.

22. What is meant by indexing in HDFS?

HDFS indexes data blocks based on their size. The end of a data block points to the address where the next chunk of data blocks is stored. The DataNodes store the data as blocks, while the NameNode stores the metadata and manages these blocks.

23. Can you explain edge nodes in Hadoop?

Edge nodes are gateway nodes. They act as an interface between the Hadoop cluster and the external network. These nodes run client applications and cluster management tools, and they can also be used as staging areas. Edge nodes require enterprise-class storage capabilities. A single edge node generally suffices for multiple Hadoop clusters.

24. Name a few data management tools that are used with edge nodes in Hadoop?

The most common data management tools that work with edge nodes are as follows.

Ambari.
Oozie.
Flume.
Pig.

25. What are the core methods of a Reducer?

There are three core methods of a reducer. They are as follows.

setup() – This method configures parameters such as heap size, distributed cache, and input data.
reduce() – This method is called once per key with the associated reduce task.
cleanup() – This method clears all the temporary files. It is called only once, at the end of the reduce task.
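
The order in which these methods are called can be sketched in plain Python. This is only a toy model of the lifecycle, not the Hadoop Java API; the class SumReducer and the driver run_reducer are hypothetical names used for illustration.

```python
from itertools import groupby

class SumReducer:
    """Toy stand-in for a Hadoop Reducer (hypothetical, not the real API)."""

    def setup(self):
        # Called once before any key is processed: prepare state.
        self.output = []
        self.log = ["setup"]

    def reduce(self, key, values):
        # Called once per key with all the values collected for that key.
        self.log.append("reduce:" + key)
        self.output.append((key, sum(values)))

    def cleanup(self):
        # Called once after the last key: release temporary resources.
        self.log.append("cleanup")

def run_reducer(reducer, pairs):
    """Drive the reducer in the order the framework would."""
    reducer.setup()
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        reducer.reduce(key, [value for _, value in group])
    reducer.cleanup()
    return reducer.output
```

Calling run_reducer(SumReducer(), [("b", 1), ("a", 2), ("b", 3)]) runs setup() once, reduce() once for each of the keys "a" and "b", and cleanup() once at the end.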

26. What are the tombstone markers used for deletion in HBase?

There are three main tombstone markers used for deletion in HBase.

Family Delete Marker: It marks all the columns of a column family.
Version Delete Marker: It marks a single version of a single column.
Column Delete Marker: It marks all the versions of a single column.

27. Can you tell me in how many modes you can run Hadoop?

We can run Hadoop in three modes. They are as follows.

Standalone mode.
Pseudo-distributed mode.
Fully-distributed mode.

28. Tell me some of the real-world applications of Hadoop?

Some of the real-world applications of Hadoop are as follows.

Content management.
Financial agencies.
Defense and cybersecurity.
Managing posts on social media.

29. What hardware do you need to run the Apache Hadoop framework?

Commodity hardware, which is simply the basic hardware resources needed to run the Apache Hadoop framework.

30. List some of the most common input formats in Hadoop?

The most common input formats used in Hadoop are as follows.

Text Input Format.
Key Value Input Format.
Sequence File Input Format.

31. Can you list some of the companies that use Hadoop?

Some of the companies that use Hadoop are as follows.

Yahoo.
Facebook.
Netflix.
Amazon.
Twitter.

32. What is the default mode for Hadoop?

Standalone mode is the default mode in Hadoop. It is primarily used for debugging.

33. Can you state the role of Hadoop in big data analytics?

Hadoop supports big data analytics by providing storage and by helping with the collection and processing of data.

34. Can you tell me which hardware configuration is most beneficial for Hadoop jobs?

Dual-processor or dual-core machines with 4 to 8 GB of RAM and ECC memory are best for Hadoop operations. ECC memory cannot be considered low-end, but it is helpful for Hadoop users because it avoids checksum errors. The hardware configuration for different Hadoop jobs also depends on the process and workflow needs of the specific project and may have to be customized accordingly.

35. Can you explain TaskInstance?

A TaskInstance is a specific Hadoop MapReduce work process that runs on a given slave node. By default, each TaskInstance is created with its own JVM process to aid performance.

36. How are counters useful in Hadoop?

Counters are an integral part of any Hadoop job and are very useful for gathering relevant statistics. Consider a job that runs on a 150-node cluster with 150 mappers running at any given time. Tracking and consolidating the invalid records found in the log entries by hand would be cumbersome and time-consuming. Counters can instead keep a count of all such records and present a single final total.
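
The idea can be sketched as follows. This is a plain Python model rather than the Hadoop counter API: each mapper keeps local counters, and the job merges them into one total. The counter names and the rule that a blank line counts as an invalid record are assumptions made for the example.

```python
from collections import Counter

def map_task(records):
    """Process one mapper's records and return its local counters."""
    counters = Counter()
    for record in records:
        if record.strip():
            counters["VALID_RECORDS"] += 1
        else:
            # Hypothetical rule: a blank line is an invalid record
            counters["INVALID_RECORDS"] += 1
    return counters

def aggregate(per_task_counters):
    """Merge the counters of all tasks into job-level totals."""
    total = Counter()
    for counters in per_task_counters:
        total.update(counters)
    return total
```

With 150 mappers, aggregate() would receive 150 small counter sets and still return a single total per counter name.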

37. How will you check the file system in HDFS?

The “fsck” command is used to run file system checks in both Linux and HDFS. It helps to report block names and locations, and to ascertain the overall health of a given file system.

38. What is meant by “speculative execution” in the context of Hadoop?

When a specific node slows down the performance of a given task, the master node can redundantly execute another instance of the same task on a separate node. The task instance that completes first is accepted, while the other is killed. This entire process is referred to as “speculative execution”.
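
The selection rule can be sketched in a few lines. This is a simplified model with made-up node names and run times, not how Hadoop schedules tasks internally: it only shows that the attempt finishing first wins and the rest are killed.

```python
def speculative_winner(attempts):
    """attempts maps a node name to the seconds it needs to finish the task.

    Returns (winner, killed): the attempt that finishes first and the
    attempts that would be killed once it completes.
    """
    winner = min(attempts, key=attempts.get)
    killed = sorted(node for node in attempts if node != winner)
    return winner, killed

# Hypothetical run times for the original attempt and its backup
print(speculative_winner({"slow-node": 90, "backup-node": 30}))
```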

39. Where will Hive store table data by default?

The default location for the storage of table data by Hive is:

hdfs://namenode/user/hive/warehouse

40. Name some of the important relational operations in Pig Latin?

Some of the important relational operations in Pig Latin are as follows.

Group.
Distinct.
Join.
Foreach.
Order by.
Filter.
Limit.

41. Can you tell me what happens when two users try to access the same file in HDFS?

The HDFS NameNode supports exclusive writes only. When two users try to write to the same file, only the first user is granted access. The second user is rejected.

42. How will you recover a NameNode when it is down?

Follow these steps to get the Hadoop cluster up and running again:

Use the FsImage, which is the file system metadata replica, to start a new NameNode.
Configure the DataNodes and the clients so that they acknowledge the newly started NameNode.
Once the new NameNode has finished loading the last checkpoint FsImage and has received enough block reports from the DataNodes, it starts to serve clients.

In large Hadoop clusters, the NameNode recovery process consumes a lot of time, which becomes an even more significant challenge during routine maintenance.

43. What is your understanding of Rack Awareness in Hadoop?

Rack Awareness is an algorithm that the NameNode applies to decide how blocks and their replicas are placed. Based on the rack definitions, it minimizes network traffic by keeping traffic between DataNodes within the same rack where possible.

44. Can you give an example of Rack Awareness in Hadoop?

Consider a replication factor of 3. Two copies are placed on one rack, and the third copy is placed on a separate rack.
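
A minimal sketch of such a placement is shown below. It assumes the commonly described policy in which the first replica stays on the writer's node and the other two go to two nodes of one other rack. The rack and node names are hypothetical, and the real policy handles many more cases (full nodes, unavailable racks, and so on).

```python
def place_replicas(racks, writer_node):
    """Choose three nodes for a block with replication factor 3.

    racks maps a rack name to its list of node names (hypothetical topology).
    """
    # Find the rack that holds the writer's node.
    local_rack = next(r for r, nodes in racks.items() if writer_node in nodes)
    # Pick the first other rack that has at least two nodes.
    remote_rack = next(r for r, nodes in racks.items()
                       if r != local_rack and len(nodes) >= 2)
    second, third = racks[remote_rack][:2]
    return [writer_node, second, third]

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
print(place_replicas(racks, "n1"))
```

Two of the three replicas end up on the same rack and one on a different rack, so the loss of a whole rack still leaves at least one copy.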

45. Can you state the difference between “HDFS Block” and “Input Split”?

HDFS physically divides the input data into blocks for storage. Each such unit is known as an HDFS Block.

An Input Split is a logical division of the data, which a mapper uses for the map operation.
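
The difference can be illustrated with a small sketch. It is a simplification under stated assumptions: records are newline-terminated lines, and the split is extended to the end of the record in progress. The function names are hypothetical and a tiny block size is used so the effect is visible.

```python
def physical_blocks(data, block_size):
    """Cut bytes into fixed-size blocks, ignoring record boundaries."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def logical_splits(data, block_size):
    """Cut near block boundaries, but only at line ends, so no record is torn."""
    splits, start = [], 0
    while start < len(data):
        end = min(start + block_size, len(data))
        newline = data.find(b"\n", end - 1)   # finish the record in progress
        end = len(data) if newline == -1 else newline + 1
        splits.append(data[start:end])
        start = end
    return splits

data = b"alpha\nbravo\ncharlie\n"
print(physical_blocks(data, 8))
print(logical_splits(data, 8))
```

The physical blocks cut the records "bravo" and "charlie" in the middle, while every logical split ends on a complete record.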

46. State the difference between Hadoop and RDBMS?

S.NO	CRITERIA	HADOOP	RDBMS
1	DATA TYPE	Structured, semi-structured and unstructured data.	Structured data.
2	SCHEMA	Based on schema on read.	Based on schema on write.
3	SPEED	Writes are fast.	Reads are fast.
4	APPLICATIONS	Data discovery, storage and processing of unstructured data.	OLTP and complex ACID transactions.
5	COST	Open source framework, free of cost.	Licensed software, paid.

47. Can you explain the core components of Hadoop?

Hadoop is an open source framework meant for storing and processing big data in a distributed manner. The core components of Hadoop are as follows.

HDFS (Hadoop Distributed File System): HDFS is the basic storage system of Hadoop. Large data files are stored in HDFS, which runs on a cluster of commodity hardware. It stores data reliably even when hardware fails.
Hadoop MapReduce: MapReduce is the Hadoop layer responsible for data processing. Applications written with it process the structured and unstructured data stored in HDFS. It handles the parallel processing of high volumes of data by dividing the work into independent tasks. The processing is done in two phases, Map and Reduce. Map is the first phase, which specifies the complex logic, and Reduce is the second phase, which specifies light-weight operations.
YARN: YARN is the resource management framework in Hadoop. It manages cluster resources and supports multiple data processing engines, for example for data science, real-time streaming, and batch processing.

48. Can you tell me the configuration parameters in a “MapReduce” program?

The main configuration parameters in the “MapReduce” framework are as follows.

Input location of jobs in the distributed file system.
Output location of Jobs in the distributed file system.
The input format of data.
The output format of data.
The class which contains the map function.
The class which contains the reduce function.
JAR file which contains the mapper, reducer and the driver classes.

49. How will you achieve security in Hadoop?

Kerberos is used to achieve security in Hadoop. At a high level, there are three steps to access a service, and each step involves a message exchange with a server.

Authentication: In the first step, the client authenticates itself to the authentication server, which then provides a time-stamped TGT (Ticket-Granting Ticket) to the client.
Authorization: In this step, the client uses the received TGT to request a service ticket from the TGS (Ticket-Granting Server).
Service Request: In the final step, the client uses the service ticket to authenticate itself to the server.

50. Can you explain how Hadoop MapReduce works?

There are two phases in a MapReduce operation. They are as follows.

Map phase: In this phase, the input data is split and handed to map tasks. The map tasks run in parallel, and each one analyzes its own split.
Reduce phase: In this phase, the similar split data is aggregated across the entire collection to produce the result.
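
The two phases can be sketched with the classic word count. This is a single-process Python illustration of the model, not a Hadoop job; the function names are hypothetical, and the shuffle step stands in for the grouping the framework performs between the phases.

```python
from collections import defaultdict

def map_phase(split):
    """Emit (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(mapped):
    """Group all emitted values by key, as happens between the two phases."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate the values of each key into the final result."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(splits):
    # In a real cluster each split would be mapped in parallel.
    mapped = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle(mapped))

print(word_count(["big data", "Big Hadoop data"]))
```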

51. What do you mean by MapReduce?

MapReduce is a parallel programming model in Hadoop. It is used for processing large data sets over a cluster of computers, with the data typically stored in HDFS.

52. Can you give the syntax you use to run a MapReduce program?

The syntax to run a MapReduce program is:

hadoop jar hadoop_jar_file.jar /input_path /output_path

53. Can you state the different file permissions in HDFS for files or directory levels?

HDFS uses a specific permissions model for files and directories. The following user levels are used in HDFS.

Owner.
Group.
Others.

Each of the user levels above has the following permissions.

read (r).
write (w).
execute (x).

These permissions work differently for files and directories.

For files:

The r permission is for reading a file.
The w permission is for writing a file.

For directories:

The r permission is for listing the contents of a directory.
The w permission is for creating or deleting files or directories within it.
The x permission is for accessing a child of the directory.
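
As an illustration of how the three user levels and three permissions combine, the sketch below turns an octal mode into the familiar nine-character string. The helper describe_mode is a hypothetical name, and the octal notation is assumed to follow the usual POSIX-style convention.

```python
def describe_mode(mode):
    """Turn an octal mode such as 0o754 into 'rwxr-xr--' (owner, group, others)."""
    text = ""
    for shift in (6, 3, 0):                 # owner, group, others
        bits = (mode >> shift) & 0b111
        text += "r" if bits & 4 else "-"
        text += "w" if bits & 2 else "-"
        text += "x" if bits & 1 else "-"
    return text

print(describe_mode(0o754))
```

Here the owner gets read, write, and execute, the group gets read and execute, and others get read only.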

54. Can you tell me the basic parameters of a Mapper?

The basic parameters of a Mapper are as follows.

LongWritable and Text (the input key and value types).
Text and IntWritable (the output key and value types).

55. What happens when a NameNode doesn’t have any data?

A NameNode without any data does not exist in Hadoop. If there is a NameNode, it will definitely contain some data. Otherwise it won’t exist.

56. Can you explain why the Hadoop CLASSPATH is essential to start or stop the Hadoop daemons?

CLASSPATH includes the directories that contain the jar files needed to start or stop the Hadoop daemons. Setting CLASSPATH is therefore essential for starting or stopping the daemons.

Setting CLASSPATH manually every time is not standard practice. CLASSPATH is usually written in the /etc/hadoop/hadoop-env.sh file, so it is loaded automatically whenever Hadoop runs.

57. A DFS can handle a large volume of data, so why do we need the Hadoop framework?

Hadoop is used not only for storing large data but also for processing it. A DFS (Distributed File System) can store the data too, but it lacks the following features.

It is not fault tolerant.
Data movement over a network depends on bandwidth.

58. What is meant by SequenceFileInputFormat?

Hadoop uses a specific file format known as a sequence file, which stores data as serialized key-value pairs. SequenceFileInputFormat is the input format for reading sequence files.

59. Tell me about your experience in big data?

This is one of the most frequently asked interview questions. It helps the interviewer gauge your interest in the field and your dedication. The interviewer wants to evaluate whether you are a fit for the project requirements.

Let’s see how we must approach the question: First start with the roles in your past position and slowly add details to the conversation. Tell them about your contributions that made the project successful. The later questions are based on this question, so answer it carefully. You should also take care not to go overboard with a single aspect of your previous job. Keep it simple and to the point.

60. Tell me about your approach to data preparation?

Data preparation is one of the crucial steps in big data projects, so you may face at least one question about it. When the interviewer asks this question, they want to know what steps or precautions you take during data preparation.

Data preparation is required to obtain the data that is then used for modeling, and you should convey this to the interviewer. You should also emphasize the type of model you are going to use and the reasons for choosing it. Finally, discuss important data preparation topics such as transforming variables, outlier values, unstructured data, identifying gaps, and others.

January 30, 2020