This question already has answers here:
Spark dataframe write method writing many small files
(6 answers)
Closed 4 years ago.
I have a DataFrame df that I want to partition by date (a column in df).
I have the code below:
df.write.partitionBy('date').mode('overwrite').orc('path')
Under the path above there are then a bunch of folders, e.g. date=2018-10-08 etc.
But the folder date=2018-10-08 contains 5 files, and I want only one file inside it. How can I do that? I still want the output partitioned by date.
Thank you in advance!
To get 1 file per partition folder, you need to repartition the data by the partition column before writing. This shuffles the data so that all rows for a given date end up in the same DataFrame/RDD partition:
df.repartition('date').write.partitionBy('date').mode('overwrite').orc('path')
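Why this works can be seen in a toy model. The sketch below is plain Python, not Spark: it assumes each in-memory partition writes one file per distinct date it holds, and the helper names (`files_per_date`, `round_robin`, `by_date`) are made up for illustration.

```python
from collections import defaultdict

def files_per_date(partitions):
    """Count files per date, assuming each partition writes one file per date it holds."""
    counts = defaultdict(int)
    for rows in partitions:
        for d in {row["date"] for row in rows}:
            counts[d] += 1
    return dict(counts)

def round_robin(rows, n):
    """Spread rows over n partitions without regard to their date."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def by_date(rows, n):
    """Put all rows with the same date in the same partition (stand-in for repartition('date'))."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row["date"]) % n].append(row)
    return parts

rows = [{"date": d, "value": i} for i in range(10) for d in ("2018-10-08", "2018-10-09")]
before = files_per_date(round_robin(rows, 5))  # every date is spread over all 5 partitions
after = files_per_date(by_date(rows, 5))       # every date lives in exactly one partition
```

In this model each date folder goes from 5 files to 1, which mirrors the behaviour described in the question and the answer.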
How can I include the column used for partitionBy in each part file in PySpark? If this is a valid question, please help me with it.
partitionBy creates multiple folders, but that column does not appear in any part file. How can I get it?
I have a directory containing folders organized by date, where the running date is part of the folder name. I have a daily Spark job that needs to load the last 7 days of files on any given day.
Unfortunately, the directory contains other files as well, so I cannot rely on partition discovery.
The folders are named in the format below.
prefix-yyyyMMdd/
How can I load the folders from the last 7 days in one shot?
Since the name is based on the running date, I cannot use a predefined regex to load the data, as I have to account for month and year changes.
I have a couple of brute-force solutions:
Load the data into 7 DataFrames and unionAll them into a single DataFrame. This looks inefficient, but it is not an entirely bad option.
Load the entire directory and apply a where condition on the column that holds the date.
This looks storage heavy, as the directory contains years' worth of data.
Neither looks efficient, and since each file is itself huge, I would like to know if there are any better solutions.
Is there a better way to do it?
DataFrameReader methods can take multiple paths, e.g.
spark.read.parquet("prefix-20190704", "prefix-20190703", ...)
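The path list does not have to be typed by hand. A minimal sketch, assuming the folder names follow the prefix-yyyyMMdd format from the question and that the run date itself counts as one of the 7 days; `last_n_day_paths` is a hypothetical helper, and the commented-out read assumes an existing SparkSession named `spark`:

```python
from datetime import date, timedelta

def last_n_day_paths(run_date, n=7, prefix="prefix-"):
    """Return folder names prefix-yyyyMMdd for the n days ending on run_date."""
    return [prefix + (run_date - timedelta(days=i)).strftime("%Y%m%d") for i in range(n)]

paths = last_n_day_paths(date(2019, 7, 4))

# With PySpark, the list can be unpacked into the reader:
# df = spark.read.parquet(*paths)
```

Because the dates come from date arithmetic rather than a pattern, month and year boundaries are handled automatically.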
This question already has answers here:
How to perform union on two DataFrames with different amounts of columns in Spark?
(22 answers)
Closed 4 years ago.
I have ‘n’ delimited data sets, possibly CSVs, but one of them might have a few extra columns. I am trying to read all of them as DataFrames and combine them into one. How can I merge them with unionAll into a single DataFrame?
P.S.: I can do this when I know what ‘n’ is, and it’s a simple unionAll when the column counts are equal.
There is another approach besides the solutions mentioned in the first two comments.
Read all CSV files to a single RDD producing RDD[String].
Map to create an RDD[Row] of the appropriate length, filling missing values with null or any other suitable value.
Create dataFrame schema.
Create DataFrame from RDD[Row] using created Schema.
This may not be a good approach if the CSVs have a large number of columns.
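The row-padding step can be sketched in plain Python, without Spark. This assumes every file has a header line; `read_rows` and `union_with_nulls` are hypothetical helper names, and None plays the role of null:

```python
import csv
import io

def read_rows(text, delimiter=","):
    """Parse delimited text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))

def union_with_nulls(datasets):
    """Pad every row to the union of all columns, using None for missing ones."""
    columns = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    padded = [tuple(row.get(name) for name in columns)
              for rows in datasets for row in rows]
    return columns, padded

a = read_rows("id,name\n1,x\n")
b = read_rows("id,name,extra\n2,y,z\n")
columns, rows = union_with_nulls([a, b])
```

The resulting `columns` list corresponds to the schema and `rows` to the padded rows that would be passed to the DataFrame constructor.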
Hope this helps
This question already has answers here:
Overwrite specific partitions in spark dataframe write method
(14 answers)
Overwrite only some partitions in a partitioned spark Dataset
(3 answers)
Closed 4 years ago.
I have a large table in which I would like to overwrite certain top-level partitions. For example, I have a table partitioned by year and month, and I would like to overwrite the partitions for, say, years 2000 to 2018.
How can I do that?
Note: I do not want to delete the previous table and overwrite the entire table with new data.
This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 4 years ago.
I have a Hive table with the schema:
id bigint
name string
updated_dt bigint
There are many records with the same id but different name and updated_dt. For each id, I want to return the record (whole row) with the largest updated_dt.
My current approach is:
After reading the data from Hive, I can use a case class to convert it to an RDD, then use groupBy() to group all the records with the same id together, and finally pick the one with the largest updated_dt. Something like:
dataRdd.groupBy(_.id).map(x => x._2.toSeq.maxBy(_.updated_dt))
However, since I use Spark 2.1, the data is first converted to a Dataset using a case class, and the approach above then converts it to an RDD in order to use groupBy(). There may be some overhead in converting a Dataset to an RDD, so I was wondering whether I can achieve this at the Dataset level without converting to an RDD.
Thanks a lot
Here is how you can do it using a Dataset. Note that this returns only id and the maximum updated_dt for each id, not the whole row:
data.groupBy($"id").agg(max($"updated_dt") as "Max")
There is not much overhead in converting to an RDD. If you choose the RDD route, it can be optimized further by using .reduceByKey() instead of .groupBy():
dataRdd.keyBy(_.id).reduceByKey((a,b) => if(a.updated_dt > b.updated_dt) a else b).values
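The reduce step keeps only one candidate row per id at any time instead of collecting the whole group first. A plain-Python sketch of that logic, not Spark code, where `Record` and `latest_per_id` are illustrative names that mirror the schema in the question:

```python
from collections import namedtuple

Record = namedtuple("Record", ["id", "name", "updated_dt"])

def latest_per_id(records):
    """Keep, for each id, the whole record with the largest updated_dt."""
    latest = {}
    for rec in records:
        current = latest.get(rec.id)
        # Same comparison as the reduceByKey function: keep the newer of the two.
        if current is None or rec.updated_dt > current.updated_dt:
            latest[rec.id] = rec
    return list(latest.values())

records = [
    Record(1, "a", 100),
    Record(1, "b", 300),
    Record(2, "c", 50),
    Record(1, "d", 200),
]
result = sorted(latest_per_id(records))
```

Unlike the aggregation on id alone, this keeps the name that belongs to the latest updated_dt.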