hadoop: In which format data is stored in HDFS - apache-spark

I am loading data into HDFS using Spark. How is the data stored in HDFS? Is it encrypted? Is it possible for someone to read the HDFS data without authorization? What about security for the existing data?
I want to understand the details of how the system behaves.

HDFS is a distributed file system that supports various formats: plain-text formats such as CSV and TSV, as well as Parquet, ORC, JSON, etc.
When saving data to HDFS from Spark, you need to specify the format.
Parquet files are binary, so you can't read them directly without Parquet tooling, but Spark can read them natively.
Security in HDFS is governed by Kerberos authentication, which you need to set up explicitly.
Note that Spark's default format for both reading and writing data is Parquet.
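As an illustration, here is a minimal Spark (Scala) sketch of writing to HDFS with an explicit format; the paths and the DataFrame df are hypothetical:

```scala
// Assumes an existing SparkSession `spark` and DataFrame `df`;
// all HDFS paths below are placeholders.
df.write.format("csv").option("header", "true").save("hdfs:///user/demo/out_csv")
df.write.format("parquet").save("hdfs:///user/demo/out_parquet")
df.write.save("hdfs:///user/demo/out_default") // no format given: defaults to Parquet
```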

HDFS can store data in many formats, and Spark can read them (CSV, JSON, Parquet, etc.). When writing back, specify the format you wish to save the file in.
Reading up on the following commands will help:
hadoop fs -ls /user/hive/warehouse
hadoop fs -get (copies files from HDFS to your local file system)
hadoop fs -put (copies files from your local file system to HDFS)

Related

Transform CSV into Parquet using Apache Flume?

I have a question: is it possible to perform ETL on data using Flume?
To be more specific, I have Flume configured with a spoolDir that contains CSV files, and I want to convert those files into Parquet files before storing them in Hadoop. Is that possible?
If not, would you recommend transforming them before storing them in Hadoop, or transforming them with Spark on Hadoop?
I'd probably suggest using NiFi to move the files around; there is a specific tutorial on how to do that with Parquet. I see NiFi as the replacement for Apache Flume.
A partial answer with Flume (not Parquet):
If you are flexible on format, you can use an Avro sink. You can also use a Hive sink, which will create a table in ORC format. (You can check whether the sink definition also allows Parquet, but I have heard that ORC is the only supported format.)
You could then use a simple Hive script to move the data from the ORC table into a Parquet table, converting the files into the Parquet files you asked for.
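That last ORC-to-Parquet step can be sketched with Spark SQL as well (a hedged example; the table names are placeholders, and it assumes a Hive-enabled SparkSession `spark`):

```scala
// Copy everything from the ORC-backed table into a new Parquet table.
spark.sql(
  """CREATE TABLE parquet_table STORED AS PARQUET
    |AS SELECT * FROM orc_table""".stripMargin)
```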

Loading local file into HDFS using hdfs put vs spark

The use case is to load a local file into HDFS. Below are two approaches to do the same; please suggest which one is more efficient.
Approach1: Using hdfs put command
hadoop fs -put /local/filepath/file.parquet /user/table_nm/
Approach2: Using Spark .
spark.read.parquet("/local/filepath/file.parquet").createOrReplaceTempView("temp")
spark.sql(s"insert into table table_nm select * from temp")
Note:
The source file can be in any format.
No transformations are needed for the file load.
table_nm is a Hive external table pointing to /user/table_nm/.
Assuming the local .parquet files are already built, using -put will be faster, as there is no overhead of starting a Spark application.
If there are many files, -put still simply involves less work.

Read compressed JSON in Spark

I have data stored in S3 as utf-8 encoded json files, and compressed using either snappy/lz4.
I'd like to use Spark to read/process this data, but Spark seems to require the filename suffix (.lz4, .snappy) to determine the compression scheme.
The issue is that I have no control over how the files are named; they will not be written with this suffix, and it is too expensive to rename them all to include such a suffix.
Is there any way for spark to read these JSON files properly?
For Parquet-encoded files there is 'parquet.compression' = 'snappy' in the Hive Metastore, which seems to solve this problem for Parquet files. Is there something similar for text files?
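One workaround sometimes suggested for the suffix problem (an untested sketch, not an official API guarantee; the class and package names are hypothetical) is to subclass the codec so it claims the empty file extension, then register that codec with Hadoop so suffixless input files are still decompressed:

```scala
import org.apache.hadoop.io.compress.SnappyCodec

// Hadoop picks a codec by matching the filename suffix; a codec whose
// default extension is empty can match files with no recognised suffix.
class SuffixlessSnappyCodec extends SnappyCodec {
  override def getDefaultExtension: String = ""
}

// Register it when submitting the job, e.g.:
//   --conf spark.hadoop.io.compression.codecs=com.example.SuffixlessSnappyCodec
// and then read as usual:
//   spark.read.json("s3a://bucket/path/")
```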

how to move hdfs files as ORC files in S3 using distcp?

I have a requirement to move text files from HDFS to AWS S3. The files in HDFS are plain text and non-partitioned. After migration, the S3 output should be in ORC format and partitioned on a specific column. Finally, a Hive table is created on top of this data.
One way to achieve this is with Spark, but I would like to know: is it possible to use DistCp to copy the files as ORC?
I would also like to know whether any other, better option is available to accomplish this task.
Thanks in Advance.
DistCp is just a copy command; it doesn't do any conversion. You are trying to execute a query that generates ORC-formatted output, so you will have to use a tool like Hive, Spark, or Hadoop MapReduce to do it.
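As a hedged Spark (Scala) sketch of that route (the paths, column names, and table name are all placeholders):

```scala
// Assumes an existing SparkSession `spark`.
// Read the non-partitioned text files from HDFS, naming the columns.
val df = spark.read
  .option("delimiter", "\t")
  .csv("hdfs:///data/input")
  .toDF("id", "value", "part_col") // hypothetical schema

// Write ORC to S3, partitioned on the required column.
df.write.partitionBy("part_col").format("orc").save("s3a://my-bucket/orc_output")

// Point an external Hive table at the S3 location.
spark.sql(
  """CREATE EXTERNAL TABLE my_table (id STRING, value STRING)
    |PARTITIONED BY (part_col STRING)
    |STORED AS ORC
    |LOCATION 's3a://my-bucket/orc_output'""".stripMargin)
```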

Using spark dataFrame to load data from HDFS

Can we use a DataFrame while reading data from HDFS?
I have tab-separated data in HDFS.
I googled, but only saw examples using it with NoSQL data.
DataFrames are certainly not limited to NoSQL data sources. Parquet, ORC, and JSON support is provided natively in Spark 1.4 to 1.6.1; text-delimited files are supported using the spark-csv package.
If you have your TSV file in HDFS at /demo/data, then the following code will read the file into a DataFrame:
sqlContext.read.
  format("com.databricks.spark.csv").
  option("delimiter", "\t").
  option("header", "true").
  load("hdfs:///demo/data/tsvtest.tsv").show
To run the code from spark-shell use the following:
--packages com.databricks:spark-csv_2.10:1.4.0
In Spark 2.0 csv is natively supported so you should be able to do something like this:
spark.read.
  option("delimiter", "\t").
  option("header", "true").
  csv("hdfs:///demo/data/tsvtest.tsv").show
If I am understanding correctly, you essentially want to read data from HDFS and have it automatically converted to a DataFrame.
If that is the case, I would recommend the spark-csv library. Check it out; it has very good documentation.
