I am trying to convert a Parquet file into a CSV file on HDFS with the following code.
df = spark.read.format("parquet").load('hdfs://hadoop1:9001/data1.parquet')
a = df.write.format("csv").mode("overwrite").save('hdfs://hadoop1:9001/data1.csv')
But I find that the locality level for my task is always ANY when writing the CSV file. What is the problem?
[Screenshot of the Spark UI]
Related
In a pyspark session when I do this:
df = spark.read.parquet(file)
df.write.csv('output')
it creates a directory called output containing a number of files, one of which is the target CSV file with an unpredictable name, for example: part-00006-80ba8022-33cb-4478-aab3-29f08efc160a-c000.csv
Is there a way to know what the output file name is after the .csv() call?
When you read a Parquet file into a DataFrame, it has some number of partitions, because the data lives on distributed storage. Similarly, when you save that DataFrame as a CSV file, it is saved in a distributed manner based on the number of partitions the DataFrame had.
The path that you provide when writing the CSV actually becomes a folder with that name, and inside that folder you get multiple partition files. Each file holds a portion of the data, and when you combine all of the partition files you get the entire content of the CSV.
If you read that folder path back, you can see the entire content of the CSV. This is the default behaviour of Spark and of distributed computing in general.
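If you need the actual part-file name after the write, one option is to list the output directory afterwards. A minimal PySpark sketch, assuming a local SparkSession and an output directory called output (the input path is hypothetical, and spark._jsc / spark._jvm are internal accessors used to reach the Hadoop FileSystem API):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("data.parquet")  # hypothetical input path
df.write.mode("overwrite").csv("output")

# List the part files Spark produced via the Hadoop FileSystem API.
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = Path("output").getFileSystem(spark._jsc.hadoopConfiguration())
for status in fs.listStatus(Path("output")):
    print(status.getPath().getName())  # e.g. part-00000-<uuid>-c000.csv

# Reading the folder path returns the full content regardless of file names.
full = spark.read.csv("output")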
For most of my files, when I read in delimited files and write them out to snappy Parquet, Spark executes as I expect and creates multiple partitioned snappy Parquet files.
That said, I have some large .out files that are pipe-separated (25GB+), and when I read them in:
inputFile = spark.read.load(s3PathIn, format='csv', sep=fileSeparator, quote=fileQuote, escape=fileEscape, inferSchema='true', header='true', multiline='true')
Then output the results to S3:
inputFile.write.parquet(pathOut, mode="overwrite")
I am getting large single snappy Parquet files (20GB+). Is there a reason for this? All my other Spark pipelines generate nicely split files that make queries in Athena more performant, but in these specific cases I am only getting single large files. I am NOT executing any repartition or coalesce commands.
Check how many partitions the inputFile DataFrame has; it seems to have a single partition.
It looks like you are just reading a CSV file and then writing it out as a Parquet file. Check the size of your CSV file; it is probably very large.
inputFile.rdd.getNumPartitions()
If it is 1, try repartitioning the DataFrame:
inputFile.repartition(10)
# or
inputFile.repartition("col_name")
I have a DataFrame with a similar schema, and I need to append the data to an Avro file. I don't want the Avro file written into a folder as a part file; for your information, my existing Avro file is not stored inside a folder as a part file. Can you please help me solve this task?
You can write the data by using mode("overwrite") when writing the DataFrame.
But part files are still created, because Spark does distributed processing and each executor writes out its own files based on the amount of data it holds.
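For illustration, a minimal sketch (assuming the external spark-avro package is on the classpath and the output path is hypothetical): even coalescing to a single partition still produces a directory containing one part file, because that is how Spark's writers work:

df.coalesce(1).write.format("avro").mode("append").save("/data/events_avro")
# Result: a directory /data/events_avro/ containing a single part-*.avro file.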
I have an external ORC table with a large number of small files, which arrive from the source on a daily basis. I need to merge these files into larger files.
I tried to load the ORC files into Spark and save them with the overwrite method:
val fileName = "/user/db/table_data/" // This table contains multiple partitions on the date column, with small data files.
val df = hiveContext.read.format("orc").load(fileName)
df.repartition(1).write.mode(SaveMode.Overwrite).partitionBy("date").orc("/user/db/table_data/")
But mode(SaveMode.Overwrite) deletes all the existing data from HDFS. When I tried without mode(SaveMode.Overwrite), it threw an error that the file already exists.
Can anyone help me to proceed?
As suggested by @Avseiytsev, I stored my merged ORC files in a different folder than the source in HDFS and moved the data to the table path after the completion of the job.
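A minimal PySpark sketch of that approach (the original code is Scala; this is just an equivalent outline, and the staging path is hypothetical): write the compacted output to a staging directory first, then move it into the table path, so the overwrite never touches the folder being read:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

source_path = "/user/db/table_data/"
staging_path = "/user/db/table_data_merged/"  # hypothetical staging folder

df = spark.read.format("orc").load(source_path)
df.repartition(1).write.mode("overwrite").partitionBy("date").orc(staging_path)

# After the job succeeds, move the merged files into the table path
# (e.g. with `hdfs dfs -mv`), or repoint the external table to staging_path.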
I have a compressed file in .gz format. Is it possible to read the file directly using a Spark DataFrame/Dataset?
Details: the file is a CSV with tab delimiters.
Reading a compressed csv is done in the same way as reading an uncompressed csv file. For Spark version 2.0+ it can be done as follows using Scala (note the extra option for the tab delimiter):
val df = spark.read.option("sep", "\t").csv("file.csv.gz")
PySpark:
df = spark.read.csv("file.csv.gz", sep='\t')
The only extra consideration to take into account is that a gz file is not splittable, so Spark needs to read the whole file using a single core, which will slow things down. After the read is done, the data can be shuffled to increase parallelism.
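For example, a minimal PySpark sketch (the file name and the partition count are assumptions): read the gzipped, tab-delimited file on a single core, then repartition to spread the data across the cluster before any heavy transformations:

df = spark.read.csv("file.csv.gz", sep="\t")
df = df.repartition(64)  # pick a count suited to your cluster; 64 is an arbitrary example
# Subsequent transformations and actions now run with more parallelism.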