Loading local file into HDFS using hdfs put vs spark - apache-spark

The use case is to load a local file into HDFS. Below are two approaches to doing the same; please suggest which one is more efficient.
Approach 1: Using the hdfs put command
hadoop fs -put /local/filepath/file.parquet /user/table_nm/
Approach 2: Using Spark
spark.read.parquet("/local/filepath/file.parquet").createOrReplaceTempView("temp")
spark.sql(s"insert into table table_nm select * from temp")
Note:
The source file can be in any format.
No transformations are needed for loading the file.
table_nm is a Hive external table pointing to /user/table_nm/.

Assuming the local .parquet files are already built, using -put will be faster, as there is no overhead of starting a Spark application.
Even if there are many files, there is still simply less work to do via -put.

Related

Spark write Dataframes directly from Hive to local file system

This question is almost a replica of the requirement here: Writing files to local system with Spark in Cluster mode
but my query has a twist. The page above writes files from HDFS directly to the local filesystem using Spark, but only after converting the data to an RDD.
I'm looking for options that work with just the DataFrame; converting huge data to an RDD takes a toll on resource utilisation.
You can use the syntax below to write a DataFrame directly to the HDFS filesystem.
df.write.format("csv").save("path in hdfs")
Refer to the Spark docs for more details: https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#generic-loadsave-functions
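For instance, a slightly fuller version of that call might look like the sketch below, assuming a SparkSession named spark, a hypothetical Hive table db.some_table, and a placeholder HDFS output path:
// read a Hive table into a DataFrame and write it straight to HDFS as CSV,
// without ever converting it to an RDD
val df = spark.table("db.some_table")            // placeholder table name
df.write
  .format("csv")
  .option("header", "true")                      // include a header row
  .mode("overwrite")                             // replace any previous output
  .save("hdfs:///user/output/some_table_csv")    // placeholder HDFS path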

hadoop: In which format data is stored in HDFS

I am loading data into HDFS using Spark. How is the data stored in HDFS? Is it encrypted? Is it possible to crack the HDFS data? What about security for the existing data?
I want to know the details of how the system behaves.
HDFS is a distributed file system that supports various formats: plain-text files such as CSV and TSV, as well as other formats like Parquet, ORC, JSON, etc.
While saving data to HDFS from Spark, you need to specify the format.
You can't read Parquet files without Parquet tools, but Spark can read them.
The security of HDFS is governed by Kerberos authentication; you need to set up the authentication explicitly.
The default format Spark uses to read and write data is Parquet.
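For example, here is a minimal sketch, assuming a SparkSession named spark and placeholder HDFS paths, that reads a plain-text CSV file and writes it back as Parquet:
// read a CSV text file from HDFS, asking Spark to infer the schema
val csvDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/input.csv")                 // placeholder path
// write it back in Parquet, Spark's default format
csvDF.write.mode("overwrite").parquet("hdfs:///data/input_parquet")
// load() without an explicit format falls back to the default (Parquet)
val parquetDF = spark.read.load("hdfs:///data/input_parquet")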
HDFS can store data in many formats, and Spark can read them (CSV, JSON, Parquet, etc.). While writing back, specify the format that you wish to save the file in.
Reading up on the commands below will help with this:
hadoop fs -ls /user/hive/warehouse
hadoop fs -get (this will get files from HDFS to your local file system)
hadoop fs -put (this will put files from your local file system into HDFS)

how to move hdfs files as ORC files in S3 using distcp?

I have a requirement to move text files from HDFS to AWS S3. The files in HDFS are plain text and non-partitioned. After migration, the output in S3 should be in ORC format and partitioned on a specific column. Finally, a Hive table is created on top of this data.
One way to achieve this is using Spark, but I would like to know whether it is possible to use DistCp to copy the files as ORC.
I would also like to know whether any better option is available to accomplish the above task.
Thanks in advance.
DistCp is just a copy command; it doesn't do any conversion. You are trying to execute a query that generates ORC-formatted output, so you will have to use a tool like Hive, Spark or Hadoop MapReduce to do it.
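If you go the Spark route, a minimal sketch could look like the following, assuming a SparkSession named spark, CSV-style text input, and a partition column named part_col (all placeholders):
// read the non-partitioned text files from HDFS
val textDF = spark.read
  .option("header", "true")
  .csv("hdfs:///source/text_files/")             // placeholder input path
// write to S3 as ORC, partitioned on the chosen column
textDF.write
  .mode("overwrite")
  .partitionBy("part_col")                       // placeholder partition column
  .orc("s3a://your-bucket/target/orc_data/")     // placeholder bucket and path
A Hive external table can then be created over the S3 location and its partitions registered, for example with MSCK REPAIR TABLE.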

Show how a parquet file is replicated and stored on HDFS

Data stored in parquet format results in a folder with many small files on HDFS.
Is there a way to view how those files are replicated in HDFS (on which nodes)?
Thanks in advance.
If I understand your question correctly, you actually want to track which data blocks are on which data node, and that's not apache-spark specific.
You can use the hadoop fsck command as follows:
hadoop fsck <path> -files -blocks -locations
This will print out locations for every block in the specified path.

Local file and cluster mode

I'm just getting started with Apache Spark. I'm using cluster mode and I want to process a big file. I am using the textFile method from SparkContext; it will read a local file system available on all nodes.
Because my file is really big, it is a pain to copy it to each cluster node. My question is: is there any way to keep this file in a unique location, like a shared folder?
Thanks a lot
You can keep the file in HDFS or S3 and then give that path to the textFile method itself.
For S3:
val data = sc.textFile("s3n://yourAccessKey:yourSecretKey@yourBucket/path/")
For HDFS:
val hdfsRDD = sc.textFile("hdfs://...")
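For example, a minimal sketch of the HDFS route, assuming the big file was first copied into HDFS (e.g. with hadoop fs -put) and that the namenode address and path are placeholders:
// each executor reads its own HDFS blocks, so the file never has to be
// copied to every node by hand
val lines = sc.textFile("hdfs://namenode:8020/data/bigfile.txt")
println(lines.count())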
