Copy data from one HDFS directory to another continuously - Linux

I have a directory in HDFS that gets new files every 2 days. I want to copy all the files in this directory to another one, in such a way that if a new file arrives today, it also gets copied to the duplicate directory.
How can we do that in HDFS?
I know we can do this on Linux using rsync. Is there a similar method in HDFS?

No, there are no file-sync methods available in HDFS. You have to run hdfs dfs -cp or hadoop distcp manually or through a scheduler (cron).
If the number of files is large, distcp is preferred.
hadoop distcp -update <src_dir> <dest_dir>
The -update flag overwrites the destination file only if it differs from the source in size, block size, or checksum.
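For example, a crontab entry along the following lines would run the sync once a day at 1 AM (the schedule and both paths are placeholders, adapt them to your setup):
0 1 * * * hadoop distcp -update /data/incoming /data/duplicate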

Related

Loading a local file into HDFS using hdfs put vs Spark

The use case is to load a local file into HDFS. Below are two approaches to do the same; please suggest which one is more efficient.
Approach 1: Using the hdfs put command
hadoop fs -put /local/filepath/file.parquet /user/table_nm/
Approach 2: Using Spark
spark.read.parquet("/local/filepath/file.parquet").createOrReplaceTempView("temp")
spark.sql("insert into table table_nm select * from temp")
Note:
The source file can be in any format.
No transformations are needed for the file loading.
table_nm is a Hive external table pointing to /user/table_nm/.
Assuming the local .parquet files are already built, using -put will be faster, as there is no overhead of starting a Spark application.
If there are many files, there is still simply less work to do via -put.
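With many files, a single -put with a glob can load them all in one go; for example, reusing the paths from the question:
hadoop fs -put /local/filepath/*.parquet /user/table_nm/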

Why does _spark_metadata list all parquet partition files inside 0, though the cluster has 2 workers?

I have a small Spark cluster with one master and two workers. I have a Kafka streaming app which streams data from Kafka and writes it to a directory in parquet format, in append mode.
So far I am able to read from the Kafka stream and write it to parquet files using the following key line:
val streamingQuery = mydf.writeStream.format("parquet").option("path", "/root/Desktop/sampleDir/myParquet").outputMode(OutputMode.Append).option("checkpointLocation", "/root/Desktop/sampleDir/myCheckPoint").start()
I have checked on both of the workers. There are 3-4 snappy parquet files created, with file names prefixed part-00006-XXX.snappy.parquet.
But when I try to read this parquet data using the following command:
val dfP = sqlContext.read.parquet("/root/Desktop/sampleDir/myParquet")
it shows file-not-found exceptions for some of the parquet part files. The strange thing is that those files are present on one of the worker nodes.
On further checking the logs, I observed that Spark tries to fetch all the parquet files from only ONE worker node, and since not all parquet files are present on that one worker, it hits the exception that those files were not found in the given parquet path.
Am I missing some critical step in the streaming query or while reading the data?
NOTE: I don't have a Hadoop infrastructure. I want to use the local filesystem only.
You need a shared file system.
Spark assumes the same file system is visible from all nodes (driver and workers).
If you are using the plain local file system, then each node sees its own file system, which is different from the file systems of the other nodes.
HDFS is one way of getting a common, shared file system; another would be a common NFS mount (i.e. mount the same remote file system from all nodes at the same path). Other shared file systems exist as well.
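As a sketch, assuming a hypothetical NFS server nfs-server exporting /export/spark-data, every node (driver and workers) would mount it at the same local path, e.g. the one used in the question:
sudo mount -t nfs nfs-server:/export/spark-data /root/Desktop/sampleDir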

How to load lots of files into one RDD in Spark

I use the saveAsTextFile method to save an RDD, but the result is not a single file; instead, the output directory contains many part files, as in the following picture.
So, my question is how to reload these files into one RDD.
My guess is that you are using Spark locally rather than in a distributed manner. When you use saveAsTextFile, Spark saves the data using Hadoop's file writer, which creates one file per RDD partition. If you want a single file, you can coalesce the RDD to 1 partition before writing. But if you go up one folder, you will find that the folder's name is the path you saved to, so you can simply call sc.textFile with that same path and it will pull everything back into partitions once again.
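A minimal sketch of both directions (rdd and the /output path are placeholders):
// write the RDD into a single part file under /output
rdd.coalesce(1).saveAsTextFile("/output")
// later, reload every part file under /output into one RDD again
val reloaded = sc.textFile("/output")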
You know what? I just found a very elegant way:
say your files are all in the /output directory, just use the following command to merge them into one, and then you can easily reload it as one RDD:
hadoop fs -getmerge /output /local/file/path
Not a big deal, I'm Leifeng.

Show how a parquet file is replicated and stored on HDFS

Data stored in parquet format results in a folder with many small files on HDFS.
Is there a way to view how those files are replicated in HDFS (on which nodes)?
Thanks in advance.
If I understand your question correctly, you actually want to track which data blocks are on which data nodes, and that is not Spark-specific.
You can use the hadoop fsck command as follows:
hadoop fsck <path> -files -blocks -locations
This will print out locations for every block in the specified path.
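For example, for a parquet output folder (the path here is hypothetical):
hadoop fsck /user/hive/warehouse/my_table -files -blocks -locations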

Move/Copy files in Spark/Hadoop

I have an input folder that contains many files. I would like to do a batch operation on them, like copying or moving them to a new path.
I would like to do this using Spark.
Please suggest how to proceed with this.
You can read them using val myfile = sc.textFile("file:///file-path") if it is a local directory, and save them using myfile.saveAsTextFile("new-location"). It's also possible to save with compression (Link to ScalaDoc); see the sketch below.
Spark will read all the files, batch them together, and save them to the new location (HDFS or local).
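A minimal sketch of that approach in Scala (the paths and the codec choice are assumptions, not from the question):
import org.apache.hadoop.io.compress.GzipCodec
// read every file in the local input folder into one RDD
val myfile = sc.textFile("file:///input-folder")
// write them back out, gzip-compressed, to the new location
myfile.saveAsTextFile("hdfs://namenode:8020/new-location", classOf[GzipCodec])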
Make sure the same directory is available on each worker node of your Spark cluster.
In the case above, the local files' path must exist on every worker node.
If you want to get rid of that requirement, you can use a distributed filesystem like the Hadoop filesystem (HDFS).
In that case you have to give the path like this:
hdfs://namenode-or-ip:port/path-to-directory
