I have a tool written in Scala that uses Spark's DataFrames API to write data to HDFS. This is the line that writes the data:
temp_table.write.mode(SaveMode.Overwrite).insertInto(tableName)
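For context, my understanding is that insertInto writes into an existing metastore table and follows that table's storage definition, so the output file names come from the table's output format rather than from anything set on the writer. By contrast, a direct Parquet write (sketched below with a purely hypothetical output path) produces part files that do carry the .parquet extension:

// Hypothetical contrast: writing the same DataFrame straight to a Parquet path;
// the resulting part files are named with a .parquet extension.
temp_table.write.mode(SaveMode.Overwrite).parquet("hdfs:///tmp/example_output")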
One of our internal teams is using the tool on their Hadoop/Spark cluster, and when it writes files to HDFS it does so without a .parquet extension on the files, which (for reasons I won't go into) creates downstream problems for them.
Here is a screenshot provided by that team showing the files that don't have the .parquet extension:
Note that we have verified that they ARE Parquet files (i.e., they can be read using spark.read.parquet(filename)).
I have been unable to reproduce this problem in my test environment; when I run the same code there, the files are written with a .parquet extension.
Does anyone know what might cause Parquet files to be written without a .parquet extension?
Related
I have a use case where a Spark application runs on one Spark version, the event data is published to S3, and the history server is started from the same S3 path but with a different Spark version. Will this cause any problems?
No, it will not cause any problems as long as you can read from the S3 bucket using that specific format. Spark versions are mostly compatible; as long as you can figure out how to work in a specific version, you're good.
EDIT:
Spark will write to the S3 bucket in the data format that you specify. For example, on a PC, if you create a .txt file, any computer can open that file. Similarly on S3, once you've created a Parquet file, any Spark version can open it; just the API may be different.
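As a concrete sketch of the setup (the bucket and prefix below are hypothetical), the application writes its event logs to S3 via the standard event-log settings, and the history server running the other Spark version points spark.history.fs.logDirectory at the same location:

import org.apache.spark.sql.SparkSession

// Application side: enable event logging and write the logs to S3
// (bucket/prefix are hypothetical placeholders).
val spark = SparkSession.builder()
  .appName("event-log-to-s3")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "s3a://my-bucket/spark-events")
  .getOrCreate()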
I have custom C++ binaries which read a raw data file and write a derived data file. The files are on the order of 100 GB each. Moreover, I would like to process multiple 100 GB files in parallel and generate a materialized view of the derived metadata. Hence, the map-reduce paradigm seems more scalable.
I am a newbie in the Hadoop ecosystem. I have used Ambari to set up a Hadoop cluster on AWS. I have built my custom C++ binaries on every data node and loaded the raw data files onto HDFS. What are my options for executing these binaries against the HDFS files?
Hadoop Streaming is the simplest way to run non-Java applications as MapReduce jobs.
Refer to the Hadoop Streaming documentation for more details.
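To illustrate the contract Hadoop Streaming expects, here is a minimal Scala stand-in for a mapper: any executable, including your custom C++ binary, just reads records from stdin and writes key<TAB>value lines to stdout. The key/value choice below is purely hypothetical.

import scala.io.Source

// Minimal streaming-mapper sketch: read records from stdin, emit "key\tvalue" to stdout.
object StreamingMapperSketch {
  def main(args: Array[String]): Unit = {
    for (line <- Source.stdin.getLines()) {
      // Hypothetical derivation: first tab-separated field as the key, record length as the value.
      val key = line.split('\t').headOption.getOrElse("")
      println(s"$key\t${line.length}")
    }
  }
}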
We have a stream implemented with Spark Structured Streaming that writes to an HDFS folder and thus creates the _spark_metadata subfolder, in order to achieve the exactly-once guarantee when writing to a file system.
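For reference, the write looks roughly like the following sketch (the source, output path, and checkpoint location are hypothetical placeholders); the Parquet file sink maintains _spark_metadata under the output path as part of its exactly-once bookkeeping.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("normal-mode-stream").getOrCreate()

// Hypothetical streaming source; any streaming DataFrame would do here.
val df = spark.readStream.format("rate").load()

// Parquet file sink: Spark writes the _spark_metadata log under the output path.
val query = df.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/normal-mode")
  .option("checkpointLocation", "hdfs:///checkpoints/normal-mode")
  .start()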
We additionally have a mode in which we re-generate the results of the stream for historical data in a separate folder. After re-processing has finished, we copy the re-generated subfolders under the "normal-mode" folder. You can imagine that the _spark_metadata of the "normal-mode" folder is then no longer up to date, and this causes incorrect reads of this data in Zeppelin.
Is there a way to disable the use of the _spark_metadata folder when reading from an HDFS folder with Spark?
I have a huge file stored in S3 that I am loading into my Spark cluster, and I want to invoke a custom Java library which takes an input file location, processes the data, and writes to a given output location. However, I cannot rewrite that custom logic in Spark.
I am trying to see whether I can load the file from S3, save each partition to local disk, give that location to the custom Java app, and once it has been processed, load all the partitions and save the result back to S3.
Is this possible? From what I have read so far, it looks like I need to use the RDD API, but I couldn't find more information on how to save each partition to local disk.
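Roughly, what I have in mind is something like the following sketch, where all paths and the stand-in for the custom app are hypothetical placeholders:

import java.nio.file.{Files, Paths, StandardCopyOption}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-to-local-disk").getOrCreate()
val sc = spark.sparkContext

// Hypothetical stand-in for the custom Java library: takes an input file location,
// processes the data, and writes to the given output location. Here it just copies the file.
def runCustomApp(inputPath: String, outputPath: String): Unit =
  Files.copy(Paths.get(inputPath), Paths.get(outputPath), StandardCopyOption.REPLACE_EXISTING)

// Load the S3 object as text (bucket/key are hypothetical).
val input = sc.textFile("s3a://my-bucket/huge-input-file.txt")

// Per partition: spill the records to the executor's local disk, run the custom app
// on that local file, then read its output back and emit it as records.
// (For very large partitions, the records should be streamed to disk rather than
// materialized with mkString.)
val processed = input.mapPartitionsWithIndex { (idx, records) =>
  val localIn  = Files.createTempFile(s"partition-$idx-in-", ".txt")
  val localOut = Files.createTempFile(s"partition-$idx-out-", ".txt")
  Files.write(localIn, records.mkString("\n").getBytes("UTF-8"))

  runCustomApp(localIn.toString, localOut.toString)

  scala.io.Source.fromFile(localOut.toFile).getLines().toList.iterator
}

// Write the processed partitions back to S3 (output prefix is hypothetical).
processed.saveAsTextFile("s3a://my-bucket/processed-output")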
Appreciate any inputs.
Can S3DistCp merge multiple files stored as .snappy.parquet output by a Spark app into one file and have the resulting file be readable by Hive?
I was also trying to merge smaller snappy parquet files into larger snappy parquet files.
I used
aws emr add-steps --cluster-id {clusterID} --steps file://filename.json
and
aws emr wait step-complete --cluster-id {clusterID} --step-id {stepID}
The command runs fine, but when I try to read the merged file back using parquet-tools, the read fails with java.io.EOFException.
I reached out to the AWS support team. They said they have a known issue when using s3distcp on Parquet files; they are working on a fix but don't have an ETA for it.
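In the meantime, a possible workaround (not S3DistCp, just compacting with Spark itself) is sketched below; the bucket names and the coalesce factor are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-small-parquet").getOrCreate()

// Read the many small snappy-parquet files and rewrite them as a single file
// (use a larger coalesce factor for bigger datasets).
spark.read
  .parquet("s3a://my-bucket/small-parquet-files/")
  .coalesce(1)
  .write
  .mode("overwrite")
  .option("compression", "snappy")
  .parquet("s3a://my-bucket/merged-parquet/")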