How to read certain parquet file partitions using spark? - apache-spark

Is it possible to read certain partitions from a folder using spark?
I only know this way:
df = spark.read.parquet("/mnt/Staging/file_Name/")
Is there any way to read only those partitions where the date is not less than today minus 3 months?

If your dataframe is partitioned by date, you can just use a filter; Spark will read only the partitions matching that date:
from pyspark.sql.functions import col

df = spark.read.parquet("/mnt/Staging/file_Name/").filter(col("your_date_col") == "2022-02-03")

Related

How to check size of each partition of a dataframe in Databricks using pyspark?

df = spark.read.parquet('path')
df.rdd.getNumPartitions()
This gives the number of partitions, but I would like to know information like the size of each partition of this dataframe.
How do I get it?
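There is no direct DataFrame API for per-partition sizes, but one common workaround is to count the records in each partition via the underlying RDD. A minimal sketch (row counts rather than bytes), reusing the same 'path':

df = spark.read.parquet('path')

# One count per partition, computed on the executors; only the
# small list of counts is collected to the driver.
counts = df.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()
for i, n in enumerate(counts):
    print(f"partition {i}: {n} rows")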

Does using multiple columns in partitioning Spark DataFrame makes read slower?

I wonder whether using multiple columns when writing a Spark DataFrame makes future reads slower.
I know partitioning with critical columns for future filtering improves read performance, but what would be the effect of having multiple columns, even the ones not usable for filtering?
A sample would be:
(ordersDF
.write
.format("parquet")
.mode("overwrite")
.partitionBy("CustomerId", "OrderDate", .....) # <----------- add many columns
.save("/storage/Orders_parquet"))
Yes, because Spark has to shuffle and sort the data to create that many partitions, since there will be many combinations of the partition keys.
For example:
suppose CustomerId has 10 unique values
suppose OrderDate has 10 unique values
suppose a third partition column has 10 unique values
The number of partitions will then be 10 * 10 * 10.
Even in this small scenario, 1000 partition directories need to be created, which means a lot of shuffling and sorting, and therefore more time.
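To sanity-check how many partition directories a given column combination would create before writing, you can count the distinct key combinations. A minimal sketch, reusing ordersDF from the question:

# Each distinct (CustomerId, OrderDate) pair becomes its own
# sub-directory under /storage/Orders_parquet, so this count is the
# number of folders partitionBy would create.
n_dirs = ordersDF.select("CustomerId", "OrderDate").distinct().count()
print(f"partitionBy would create about {n_dirs} directories")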

Spark DataFrame Repartition and Parquet Partition

1) I am using repartition on columns to store the data in parquet. But I see that the no. of parquet partitioned files is not the same as the no. of RDD partitions. Is there no correlation between RDD partitions and parquet partitions?
2) When I write the data to a parquet partition using RDD repartition and then read the data back from that parquet partition, is there any condition under which the RDD partition numbers will be the same during read and write?
3) How is bucketing a dataframe using a column id different from repartitioning a dataframe via the same column id?
4) While considering the performance of joins in Spark, should we be looking at bucketing or repartitioning (or maybe both)?
There are a couple of things you're asking about here: partitioning, bucketing, and balancing of data.
Partitioning:
Partitioning data is often used for distributing load horizontally; it has performance benefits and helps organize data in a logical fashion.
Partitioning tables changes how persisted data is structured and will now create subdirectories reflecting this partitioning structure.
This can dramatically improve query performance, but only if the partitioning scheme reflects common filtering.
In Spark, this is done by df.write.partitionBy(column*), which groups rows with the same partition-column values into the same sub-directory.
Bucketing:
Bucketing is another technique for decomposing data sets into more manageable parts. Based on columns provided, the entire data is hashed into a user-defined number of buckets (files).
Comparable to Hive bucketing (CLUSTERED BY ... INTO n BUCKETS).
In Spark, this is done by df.write.bucketBy(n, column*) (persisted via saveAsTable), which groups rows by the hash of the bucketing columns into the same file; the number of buckets (files) is controlled by n.
Repartition:
It returns a new DataFrame balanced evenly across the given number of internal (in-memory) partitions based on the given partitioning expressions. The resulting DataFrame is hash partitioned.
Spark manages data across these partitions, which helps parallelize distributed data processing with minimal network traffic for sending data between executors.
In Spark, this is done by df.repartition(n, column*), which groups rows with the same values of the partitioning columns into the same internal partition. Note that no data is persisted to storage; this is just an internal balancing of the data, based on constraints similar to bucketBy.
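Putting the three side by side, here is a minimal PySpark sketch; df, the column "id", the count 16, and the output path/table name are placeholders, and note that bucketBy has to be written out with saveAsTable:

# partitionBy: one sub-directory on disk per distinct value of "id"
df.write.mode("overwrite").partitionBy("id").parquet("/tmp/partitioned")

# bucketBy: hash "id" into 16 buckets (files), persisted as a table
(df.write.mode("overwrite")
   .bucketBy(16, "id")
   .sortBy("id")
   .saveAsTable("bucketed_table"))

# repartition: 16 in-memory partitions hashed on "id"; nothing is
# written to storage until a later save or action
df_repartitioned = df.repartition(16, "id")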
Tl;dr
1) I am using repartition on columns to store the data in parquet. But I see that the no. of parquet partitioned files are not same with the no. of Rdd partitions. Is there no correlation between rdd partitions and parquet partitions?
repartition correlates with bucketBy, not partitionBy. The number of partitioned files written is governed by other settings such as spark.sql.shuffle.partitions and spark.default.parallelism.
2) When I write the data to parquet partition and I use Rdd repartition and then I read the data from parquet partition , is there any condition when the rdd partition numbers will be same during read / write?
At read time, the number of partitions is governed by settings such as spark.default.parallelism.
3) How is bucketing a dataframe using a column id and repartitioning a dataframe via the same column id different?
They work similarly, except that bucketing is a write operation and is used for persistence.
4) While considering the performance of joins in Spark should we be looking at bucketing or repartitioning (or maybe both)
repartition of both datasets happens in memory; if one or both of the datasets are persisted, then look into bucketBy as well.
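For the join case in 4), if both sides of a frequently repeated join are persisted, bucketing them on the join key lets Spark avoid the shuffle at join time. A rough sketch with hypothetical DataFrame and table names:

# Write both sides bucketed (and sorted) on the join key.
(ordersDF.write.mode("overwrite")
    .bucketBy(32, "CustomerId").sortBy("CustomerId")
    .saveAsTable("orders_bucketed"))
(customersDF.write.mode("overwrite")
    .bucketBy(32, "CustomerId").sortBy("CustomerId")
    .saveAsTable("customers_bucketed"))

# Joining the two bucketed tables on CustomerId can skip the shuffle,
# since matching keys land in the same bucket on both sides.
joined = (spark.table("orders_bucketed")
          .join(spark.table("customers_bucketed"), "CustomerId"))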

Apache Spark Counts of Events into Timestamp Buckets

I have loaded my data into a Spark dataframe and am using Spark SQL to further process it.
My Question is simple:
I have data like:
Event_ID Time_Stamp
1 2018-04-11T20:20..
2 2018-04-11T20:20..+1
and so on.
I want to get the number of events that happened every 2 minutes.
So,
My output will be:
Timestamp No_of_events
2018-04-11T20:20.. 2
2018-04-11T20:20..+2 3
In Pandas it was quite easy but I don't know how to do it in Spark SQL.
The output must have the timestamp as one column and the number of events that happened within that time bucket (i.e. between timestamp and timestamp + 2 minutes) as another column.
Any help is very much appreciated.
Thanks.
You may try to use a window function:
from pyspark.sql.functions import window

(df.groupBy(window(df["Time_Stamp"], "2 minutes"))
   .count()
   .show())
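A slightly fuller, self-contained version of the same idea with a couple of hypothetical rows, casting the string column to a timestamp first so the 2-minute buckets line up:

from pyspark.sql.functions import col, window

data = [(1, "2018-04-11 20:20:00"),
        (2, "2018-04-11 20:21:10"),
        (3, "2018-04-11 20:22:30")]
df = (spark.createDataFrame(data, ["Event_ID", "Time_Stamp"])
      .withColumn("Time_Stamp", col("Time_Stamp").cast("timestamp")))

# One row per 2-minute window: window start/end plus the event count.
(df.groupBy(window(col("Time_Stamp"), "2 minutes"))
   .count()
   .orderBy("window")
   .show(truncate=False))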

Join Spark dataframe with Cassandra table [duplicate]

Dataframe A (millions of records): among its columns are create_date and modified_date.
Dataframe B (500 records) has start_date and end_date.
Current approach:
Select a.*,b.* from a join b on a.create_date between start_date and end_date
The above job takes half an hour or more to run.
How can I improve the performance?
The DataFrame API currently doesn't have a way to do a direct join like that; it will fully read both tables before performing the join.
https://issues.apache.org/jira/browse/SPARK-16614
You can use the RDD API to take advantage of the joinWithCassandraTable function
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
As others suggested, one approach is to broadcast the smaller dataframe. This can also happen automatically by configuring the parameter below.
spark.sql.autoBroadcastJoinThreshold
If the dataframe's estimated size is smaller than the value specified here, Spark automatically broadcasts the smaller dataframe and performs a broadcast join instead of a shuffle join. You can read more about this here.
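A minimal PySpark sketch of both options; the threshold value and the dataframe names dfA/dfB are illustrative:

from pyspark.sql.functions import broadcast

# Option 1: raise the automatic threshold (in bytes); any table whose
# estimated size is below it gets broadcast automatically.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# Option 2: hint the broadcast explicitly for the small dataframe.
joined = dfA.join(
    broadcast(dfB),
    (dfA["create_date"] >= dfB["start_date"]) &
    (dfA["create_date"] <= dfB["end_date"]),
)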
