Show how a parquet file is replicated and stored on HDFS - apache-spark

Data stored in parquet format results in a folder with many small files on HDFS.
Is there a way to view how those files are replicated in HDFS (on which nodes)?
Thanks in advance.

If I understand your question correctly, you actually want to track which data blocks are on which data nodes, and that is not apache-spark specific.
You can use the hadoop fsck command as follows:
hadoop fsck <path> -files -blocks -locations
This will print out locations for every block in the specified path.
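If you'd rather do the same check programmatically from a Spark (Scala) shell, the Hadoop FileSystem API exposes the same block-location information. Here is a minimal sketch, assuming a SparkSession named spark and a hypothetical Parquet output directory path:

import org.apache.hadoop.fs.{FileSystem, Path}

// List each file under the Parquet output directory and print the datanodes
// holding each of its HDFS blocks. The directory path is a placeholder.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val parquetDir = new Path("/user/me/mytable.parquet")
fs.listStatus(parquetDir).filter(_.isFile).foreach { status =>
  // A file may span several blocks; each block reports the hosts of its replicas.
  val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
  println(status.getPath.getName)
  blocks.foreach(b => println(s"  offset=${b.getOffset} hosts=${b.getHosts.mkString(", ")}"))
}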

Related

Putting many small files to HDFS to train/evaluate model

I want to extract the contents of some large tar.gz archives, which contain millions of small files, to HDFS. After the data has been uploaded, it should be possible to access individual files in the archive by their paths, and to list them. The most straightforward solution would be to write a small script that extracts these archives to some HDFS base folder. However, since HDFS is known not to deal particularly well with small files, I'm wondering how this solution can be improved. These are the potential approaches I have found so far:
Sequence Files
Hadoop Archives
HBase
Ideally, I want the solution to play well with Spark, meaning that accessing the data with Spark should not be more complicated than it was, if the data was extracted to HDFS directly. What are your suggestions and experiences in this domain?
You can land the files into a landing zone and then process them into something useful.
zcat <infile> | hdfs dfs -put - /LandingData/
Then build a table on top of that 'landed' data, using Hive or Spark.
Then write out a new table (in a new folder) in Parquet or ORC format.
Whenever you need to run analytics on the data, use this new table; it will perform well and avoid the small-file problem, confining it to the one-time load.
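As a rough sketch of that compaction step in Spark, assuming the landed files are plain text under /LandingData/ (the output path and partition count below are placeholders, not part of the original answer):

// Read the many small landed text files and rewrite them as a few larger Parquet files.
val landed = spark.read.text("/LandingData/")
landed
  .coalesce(16) // fewer, larger output files; tune to your data volume
  .write
  .mode("overwrite")
  .parquet("/warehouse/mytable_parquet") // hypothetical folder for the new table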
Sequence files are also a good way to handle the Hadoop small-files problem.

Loading local file into HDFS using hdfs put vs spark

The use case is to load a local file into HDFS. Below are two approaches to do the same; please suggest which one is more efficient.
Approach1: Using hdfs put command
hadoop fs -put /local/filepath/file.parquet /user/table_nm/
Approach2: Using Spark.
spark.read.parquet("/local/filepath/file.parquet").createOrReplaceTempView("temp")
spark.sql(s"insert into table table_nm select * from temp")
Note:
The source file can be in any format.
No transformations are needed during file loading.
table_nm is a Hive external table pointing to /user/table_nm/.
Assuming the local .parquet files are already built, using -put will be faster, as there is no overhead of starting a Spark application.
Even if there are many files, there is simply less work to do via -put.
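If the upload has to happen from code rather than from a shell, one way to keep the same low overhead is to call the Hadoop FileSystem API directly instead of going through a Spark read/write. A hedged sketch, reusing the paths from the question:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Equivalent of `hadoop fs -put`: a plain byte copy into HDFS, no Spark job involved.
val fs = FileSystem.get(new Configuration())
fs.copyFromLocalFile(new Path("/local/filepath/file.parquet"), new Path("/user/table_nm/"))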

why does _spark_metadata have all parquet partitioned files inside 0 when the cluster has 2 workers?

I have a small spark cluster with one master and two workers. I have a Kafka streaming app which streams data from Kafka and writes to a directory in parquet format and in append mode.
So far I am able to read from the Kafka stream and write it to parquet files using the following key line:
val streamingQuery = mydf.writeStream.format("parquet").option("path", "/root/Desktop/sampleDir/myParquet").outputMode(OutputMode.Append).option("checkpointLocation", "/root/Desktop/sampleDir/myCheckPoint").start()
I have checked on both of the workers. There are 3-4 snappy parquet files created, with file names like part-00006-XXX.snappy.parquet.
But when I try to read these parquet files using the following command:
val dfP = sqlContext.read.parquet("/root/Desktop/sampleDir/myParquet")
it throws file-not-found exceptions for some of the parquet part files. The strange thing is that those files are present on one of the worker nodes.
Checking the logs further, I observed that Spark tries to fetch all the parquet files from only ONE worker node, and since not all of the parquet files are present on that one worker, it hits the exception that those files were not found at the given parquet path.
Am I missing some critical step in the streaming query or while reading data?
NOTE: I don't have a Hadoop infrastructure. I want to use the local filesystem only.
You need a shared file system.
Spark assumes the same file system is visible from all nodes (driver and workers).
If you are using the local file system, then each node sees its own file system, which is different from the file systems of the other nodes.
HDFS is one way of getting a common, shared file system, another would be to use a common NFS mount (i.e. mount the same remote file system from all nodes to the same path). Other shared file systems also exist.
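For example, if the same NFS export were mounted at /mnt/shared on the driver and both workers (a hypothetical mount point), the streaming query and the subsequent read could both go through that common path:

// Data path and checkpoint must both live on storage visible from every node.
val streamingQuery = mydf.writeStream
  .format("parquet")
  .option("path", "file:///mnt/shared/myParquet")
  .option("checkpointLocation", "file:///mnt/shared/myCheckPoint")
  .outputMode(OutputMode.Append)
  .start()

// Reading back works from any node because the path resolves to the same files everywhere.
val dfP = spark.read.parquet("file:///mnt/shared/myParquet")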

how to move hdfs files as ORC files in S3 using distcp?

I have a requirement to move text files from HDFS to AWS S3. The files in HDFS are text files and non-partitioned. The output files in S3 after migration should be in ORC format and partitioned on a specific column. Finally, a Hive table is created on top of this data.
One way to achieve this is using Spark. But I would like to know whether it is possible to use DistCp to copy the files as ORC.
I would also like to know whether any better option is available to accomplish the above task.
Thanks in Advance.
DistCp is just a copy command; it doesn't convert anything. What you are trying to do is execute a query that generates ORC-formatted output, so you will have to use a tool like Hive, Spark, or Hadoop MapReduce to do it.
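A hedged Spark (Scala) sketch of that conversion, assuming the HDFS text files are delimited; the delimiter, column names, and S3 bucket below are placeholders for your actual schema:

// Read the non-partitioned text files from HDFS, then write partitioned ORC to S3.
val df = spark.read
  .option("delimiter", "\t")            // assumed delimiter
  .csv("hdfs:///data/input_text/")      // hypothetical source directory
  .toDF("id", "event_date", "payload")  // hypothetical column names

df.write
  .partitionBy("event_date")            // the column you need to partition on
  .orc("s3a://my-bucket/my-table/")     // hypothetical bucket/prefix

// A Hive external table can then be created over s3a://my-bucket/my-table/ and its
// partitions registered (e.g. with MSCK REPAIR TABLE).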

copy data from one HDFS directory to another continuously

I have a directory in HDFS into which files arrive every 2 days. I want to copy all the files in this directory to another one, in such a way that when a new file arrives today, it also gets copied to the duplicate directory.
How can we do that in HDFS?
I know we can do that in Linux using rsync. Is there a similar method in HDFS as well?
No, there is no file-sync mechanism available in HDFS. You have to run hdfs dfs -cp or hadoop distcp either manually or through a scheduler (cron).
If the number of files is large, distcp is preferred.
hadoop distcp -update <src_dir> <dest_dir>
The -update flag would overwrite if source and destination differ in size, blocksize, or checksum.
