I have an input folder that contains many files. I would like to run a batch operation on them, such as copying or moving them to a new path.
I would like to do this using Spark.
Please suggest how to proceed.
If the files are in a local directory, you can read them with val myfile = sc.textFile("file://file-path") and save them with myfile.saveAsTextFile("new-location"). It is also possible to save with compression by passing a codec class to saveAsTextFile (see the ScalaDoc).
Spark reads all the files and writes their contents to the new location (HDFS or local) as a set of part files, one per partition. Note that this copies the text line by line and does not preserve the original file names.
Make sure the same directory is available on every worker node of your Spark cluster.
In the case above, the local files' path has to exist on each worker node.
To avoid that requirement, you can use a distributed file system such as the Hadoop file system (HDFS).
In that case, give the path like this:
hdfs://nodename-or-ip:port/path-to-directory
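As a rough sketch of the read-then-save approach described above, something like the following could work. The object name CopyTextFiles, the helper copyAsText and the use of args(0)/args(1) for the source and destination URIs are made up for illustration; it assumes Spark is on the classpath and that the job is launched with spark-submit. It only suits line-oriented text, since the output is written as part files rather than under the original file names.

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{SparkConf, SparkContext}

object CopyTextFiles {
  // Reads every file under `src` as lines of text and writes them back out
  // under `dst` as part files, one per partition.
  def copyAsText(sc: SparkContext, src: String, dst: String, compress: Boolean): Unit = {
    val lines = sc.textFile(src)
    if (compress) lines.saveAsTextFile(dst, classOf[GzipCodec]) // gzip each part file
    else lines.saveAsTextFile(dst)
  }

  def main(args: Array[String]): Unit = {
    // args(0) and args(1) are hypothetical source and destination URIs,
    // e.g. a file:// path and an hdfs:// path. The master is left to spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("copy-text-files"))
    try copyAsText(sc, args(0), args(1), compress = true)
    finally sc.stop()
  }
}
```

The destination directory must not exist yet, because saveAsTextFile refuses to overwrite an existing path.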
The use case is to load a local file into HDFS. Below are two approaches for doing this; please suggest which one is more efficient.
Approach1: Using hdfs put command
hadoop fs -put /local/filepath/file.parquet /user/table_nm/
Approach2: Using Spark
spark.read.parquet("/local/filepath/file.parquet").createOrReplaceTempView("temp")
spark.sql(s"insert into table table_nm select * from temp")
Note:
The source file can be in any format.
No transformations are needed when loading the file.
table_nm is a Hive external table pointing to /user/table_nm/
Assuming the local .parquet files are already built, using -put will be faster because there is no overhead of starting a Spark application.
Even if there are many files, -put simply has less work to do.
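If the upload has to happen from JVM code rather than from the shell, the Hadoop FileSystem API offers a plain copy that likewise avoids starting a Spark application. The sketch below is only an illustration: the object PutFile and the helper put are invented names, the paths are the ones from the question, and it assumes the Hadoop client libraries are on the classpath with fs.defaultFS pointing at the cluster.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object PutFile {
  // Copies a local file into `dstDir` on whatever file system `dstDir` resolves to.
  // No Spark application is started; the bytes are copied as they are.
  def put(conf: Configuration, localFile: String, dstDir: String): Unit = {
    val dst = new Path(dstDir)
    val fs = dst.getFileSystem(conf)
    fs.mkdirs(dst) // make sure the target directory exists
    fs.copyFromLocalFile(new Path(localFile), dst)
  }

  def main(args: Array[String]): Unit = {
    // Paths taken from the question; the cluster address comes from the
    // Hadoop configuration found on the classpath (an assumption here).
    put(new Configuration(), "/local/filepath/file.parquet", "/user/table_nm/")
  }
}
```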
I use the saveAsTextFile method to save an RDD, but the result is not a single file; instead, the output directory contains many part files.
So my question is how to reload these files into one RDD.
My guess is that you are running Spark locally rather than in a distributed manner. When you use saveAsTextFile, Spark writes the data with Hadoop's file writer and creates one file per RDD partition. If you want a single file, you can coalesce the RDD to 1 partition before writing. Also, if you go up one folder, you will find that the folder's name is the path you saved to. So you can just call sc.textFile with that same path, and it will pull everything back into partitions.
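Both suggestions above can be sketched in a few lines. The object ReloadParts and its two helpers are illustrative names, not anything from Spark itself, and the sketch assumes an existing SparkContext.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object ReloadParts {
  // Passing the output directory itself to textFile reads every part file in it
  // and returns them as a single RDD.
  def reload(sc: SparkContext, outputDir: String): RDD[String] =
    sc.textFile(outputDir)

  // coalesce(1) merges all partitions into one, so only one part file is written.
  // All data then flows through a single task, which can be slow for large RDDs.
  def saveAsSingleFile(rdd: RDD[String], outputDir: String): Unit =
    rdd.coalesce(1).saveAsTextFile(outputDir)
}
```

Even with coalesce(1), the result is still a directory that contains one part file, so it is reloaded the same way, by passing the directory path to sc.textFile.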
You know what? I just found a very elegant way:
Say your files are all in the /output directory. Just use the following command to merge them into one local file, which you can then easily reload as one RDD:
hadoop fs -getmerge /output /local/file/path
Not a big deal, I'm Leifeng.
I have a directory in HDFS that gets populated with files every 2 days. I want to copy all the files in this directory to another one, so that whenever a new file arrives, it is also copied to the duplicate directory.
How can we do that in HDFS?
I know we can do that in Linux using rsync. Is there a similar method in HDFS?
No, HDFS has no built-in file sync method. You have to run either hdfs dfs -cp or hadoop distcp, manually or through a scheduler (cron).
If there are many files, distcp is preferred.
hadoop distcp -update <src_dir> <dest_dir>
The -update flag overwrites a destination file if source and destination differ in size, blocksize, or checksum.
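For small directories, the same idea can be approximated in code and run from a scheduler. The following is a simplified stand-in, not a replacement for distcp: it is not recursive, runs in a single process, and compares only file name and length rather than blocksize or checksum. SyncDir and syncNewFiles are hypothetical names, and the Hadoop client libraries are assumed to be on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileUtil, Path}

object SyncDir {
  // Copies files from `srcDir` that are missing in `dstDir` or differ in length.
  // Returns the names of the files that were copied.
  def syncNewFiles(conf: Configuration, srcDir: String, dstDir: String): Seq[String] = {
    val src = new Path(srcDir)
    val dst = new Path(dstDir)
    val srcFs = src.getFileSystem(conf)
    val dstFs = dst.getFileSystem(conf)
    dstFs.mkdirs(dst)
    for {
      status <- srcFs.listStatus(src).toSeq if status.isFile
      target = new Path(dst, status.getPath.getName)
      if !dstFs.exists(target) || dstFs.getFileStatus(target).getLen != status.getLen
    } yield {
      // deleteSource = false, overwrite = true
      FileUtil.copy(srcFs, status.getPath, dstFs, target, false, true, conf)
      status.getPath.getName
    }
  }
}
```

Running it twice in a row copies nothing the second time, which is the behaviour one would want from a job triggered by cron.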
Data stored in parquet format results in a folder with many small files on HDFS.
Is there a way to view how those files are replicated in HDFS (on which nodes)?
Thanks in advance.
If I understand your question correctly, you want to track which data blocks are on which data nodes, and that is not specific to apache-spark.
You can use the hadoop fsck command as follows (on newer Hadoop versions the same command is invoked as hdfs fsck):
hadoop fsck <path> -files -blocks -locations
This will print out locations for every block in the specified path.
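The same information is also reachable programmatically through the Hadoop FileSystem API, which may be handy inside a Spark job. This is a minimal sketch under the assumption that the Hadoop client libraries are on the classpath; BlockHosts and blockHosts are made-up names, and only files directly under the given directory are inspected.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object BlockHosts {
  // For each file directly under `dir`, returns one list of host names per block,
  // i.e. the nodes that hold a replica of that block.
  def blockHosts(conf: Configuration, dir: String): Map[String, Seq[Seq[String]]] = {
    val path = new Path(dir)
    val fs = path.getFileSystem(conf)
    fs.listStatus(path).toSeq.filter(_.isFile).map { status =>
      val blocks = fs.getFileBlockLocations(status, 0L, status.getLen)
      status.getPath.getName -> blocks.toSeq.map(_.getHosts.toSeq)
    }.toMap
  }
}
```

For a parquet folder with many small files, each file usually fits in one block, so the result tends to be one host list per file.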
I'm just getting started with Apache Spark. I'm using cluster mode and I want to process a big file. I am using the textFile method from SparkContext, which reads from a local file system path that must be available on all nodes.
Because my file is really big, it is a pain to copy it to each cluster node. My question is: is there any way to keep this file in a single location, such as a shared folder?
Thanks a lot
You can keep the file in HDFS or S3.
Then you can pass the path of the file directly to the textFile method.
For S3:
val data = sc.textFile("s3n://yourAccessKey:yourSecretKey@yourBucket/path/")
For Hadoop:
val hdfsRDD = sc.textFile("hdfs://...")
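As an alternative to embedding the keys in the S3 URL, the credentials can be set on the Hadoop configuration of the SparkContext. The sketch below assumes the s3n connector is available in your Hadoop build; S3nCredentials and configure are illustrative names, and yourBucket in the usage comment is a hypothetical bucket name.

```scala
import org.apache.spark.SparkContext

object S3nCredentials {
  // Sets the s3n credentials on the Hadoop configuration used by this SparkContext,
  // so they do not have to appear in the path passed to textFile.
  def configure(sc: SparkContext, accessKey: String, secretKey: String): Unit = {
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", accessKey)
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", secretKey)
  }
}

// Usage (yourBucket is a hypothetical bucket name; the keys are read from the
// environment here only as an example):
//   S3nCredentials.configure(sc, sys.env("AWS_ACCESS_KEY_ID"), sys.env("AWS_SECRET_ACCESS_KEY"))
//   val data = sc.textFile("s3n://yourBucket/path/")
```

Either way, every node reads from the shared store, so nothing has to be copied to the workers by hand.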