I have a DataFrame of orders (contactid, orderdate, orderamount) and I want a new column that contains, for each order, the sum of all order amounts for that contact over the 12 months prior to the order. I am thinking the best way is to use window functions and the new INTERVAL support in Spark 1.5+.
But I'm having difficulty making this work or finding documentation. My best guess is:
val dfOrdersPlus = dfOrders
.withColumn("ORDERAMOUNT12MONTH",
expr("sum(ORDERAMOUNT) OVER (PARTITION BY CONTACTID ORDER BY ORDERDATE RANGE BETWEEN INTERVAL 12 months preceding and INTERVAL 1 day preceding)"));
But I get a RuntimeException: 'end of input expected'. Any idea what I am doing wrong with this expr, and where I can find documentation on the new INTERVAL literals?
As of now:
Window functions are not supported inside expr. To use window functions you'll have to use either the DataFrame DSL or raw SQL on a registered table (Spark 1.5 and 1.6 only).
Window functions support range intervals only for numeric types. You cannot use DateType / TimestampType columns with date INTERVAL expressions (Spark 1.5, 1.6, 2.0.0-preview).
If you want to use window functions with date or time columns you can convert these to Unix timestamps first. You'll find a full example in Spark Window Functions - rangeBetween dates.
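A minimal sketch of that workaround, assuming ORDERDATE is a date/timestamp column and approximating "12 months" as 365 days (column names and the day-based thresholds are illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Seconds per day, used to express the range frame numerically.
val day = 86400L

// Frame: from roughly 12 months before the current order up to 1 day before it.
val byContact = Window
  .partitionBy("CONTACTID")
  .orderBy(col("ORDERDATE_TS"))
  .rangeBetween(-365 * day, -1 * day)

val dfOrdersPlus = dfOrders
  .withColumn("ORDERDATE_TS", unix_timestamp(col("ORDERDATE")))  // numeric event time in seconds
  .withColumn("ORDERAMOUNT12MONTH", sum(col("ORDERAMOUNT")).over(byContact))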
In many data lakes I see that data is partitioned by year, then month, then day, for example:
year=2019 / month=05 / day=15
What is the advantage of doing this vs. simply partitioning by date? e.g.:
date=20190515
The only advantage I can think of is if, for example, analysts want to query all data for a particular month or year. If partitioning just on date, they would have to write a query with a calculation on the partition key, such as the pseudocode below:
SELECT * FROM myTable WHERE LEFT(date,4) = 2019
Would Spark still be able to do partition pruning for queries like the one above?
Are there any other advantages I haven't considered to the more nested partition structure?
Thank you
I would argue it's a disadvantage, because splitting out the date parts makes date filtering much harder. For example, say you want to query the last 10 days of data, which may cross month boundaries. With a single date partition column you can just run simple queries like
...where date >= current_date() - interval 10 days
and Spark will figure out the right partitions for you. Spark can also handle other date functions, like year(date) = 2019 or month(date) = 2, and again it will do the partition pruning properly for you.
I always encourage using a single date column for partitioning. Let Spark do the work.
Also, an important thing to keep in mind is that date format should be yyyy-MM-dd.
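A minimal sketch of that layout, assuming a Parquet table under an illustrative path (/data/orders) with hypothetical DataFrame and column names:

import org.apache.spark.sql.functions._

// Write with a single partition column holding the full date (yyyy-MM-dd).
ordersDf
  .withColumn("date", to_date(col("order_ts")))
  .write
  .partitionBy("date")
  .parquet("/data/orders")

// Filters on the partition column, including date arithmetic, are pruned to
// just the matching directories.
val lastTenDays = spark.read.parquet("/data/orders")
  .where(col("date") >= date_sub(current_date(), 10))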
Let's say I have the following join (modified from Spark documentation):
impressionsWithWatermark.join(
clicksWithWatermark,
expr("""
clickAdId = impressionAdId AND
clickTime >= cast(impressionTime as date) AND
clickTime <= cast(impressionTime as date) + interval 1 day
""")
)
Assume that both tables have trillions of rows covering 2 years of data. I think that joining everything from both tables is unnecessary. What I want to do is create subsets, something like this: create 365 * 2 * 2 smaller dataframes so that there is one dataframe per day per table for the 2 years, then run 365 * 2 join queries and take the union of them. But that is inefficient, and I am not sure how to do it properly. I think I should add table.repartition(factor/multiple of 365 * 2) for both tables, add write.partitionBy(cast(impressionTime as date), cast(impressionTime as date)) to the stream writer, and set the number of executors times cores to a factor or multiple of 365 * 2.
What is a proper way to do this? Does Spark analyze the query and optimize it so that the entries from a single day are automatically put in the same partition? What if I am not joining all records from the same day but rather from the same hour, and there are very few records from 11pm to 1am? Does Spark know that it is most efficient to partition by day, or can it do even better?
First, let me restate what I have understood from your question: you have two tables, each with around a trillion records covering two years of data. You want to join them efficiently for a given timeframe, for example a specific month of a specific year or any custom date range, and the job should read only that much data rather than the whole tables.
Now, to answer your question, you can do something like the below:
First of all, when you write the data to create the tables, you should partition both tables by a day column so that each day's data lands in a separate directory/partition. Spark won't do that for you by default; you have to decide on the partitioning based on your dataset.
Second, when you read the data and perform the join, it should not be done on the whole table. Read only the specific partitions by applying a filter condition on the DataFrame, so that Spark applies partition pruning and reads only the partitions that satisfy the condition in the filter clause.
Once you have filtered the data at read time and stored it in DataFrames, join those DataFrames on the key relationship; that is the most efficient and performant way of doing it at the first shot.
If that is still not fast enough, you can look at bucketing your data in addition to partitioning, but in most cases it is not required.
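A minimal sketch of that write-filter-join pattern, assuming batch DataFrames named impressions and clicks, with illustrative paths and date ranges:

import org.apache.spark.sql.functions._

// Write each table partitioned by a day column (one directory per day).
impressions.withColumn("day", to_date(col("impressionTime")))
  .write.partitionBy("day").parquet("/data/impressions")
clicks.withColumn("day", to_date(col("clickTime")))
  .write.partitionBy("day").parquet("/data/clicks")

// Read back only the days of interest; the filter on the partition column
// lets Spark prune to just those directories.
val imp = spark.read.parquet("/data/impressions")
  .where(col("day").between("2021-03-01", "2021-03-31"))
val clk = spark.read.parquet("/data/clicks")
  .where(col("day").between("2021-03-01", "2021-04-01"))

// Join only the filtered subsets on the key and time relationship.
val joined = imp.join(clk,
  clk("clickAdId") === imp("impressionAdId") &&
  clk("clickTime") >= imp("impressionTime") &&
  clk("clickTime") <= imp("impressionTime") + expr("interval 1 day"))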
I have loaded my data into a Spark dataframe and am using Spark SQL to further process it.
My Question is simple:
I have data like:
Event_ID Time_Stamp
1 2018-04-11T20:20..
2 2018-04-11T20:20..+1
and so on.
I want to get the number of events that happened every 2 minutes.
So,
My output will be:
Timestamp No_of_events
2018-04-11T20:20.. 2
2018-04-11T20:20..+2 3
In Pandas it was quite easy but I don't know how to do it in Spark SQL.
The output must have the timestamp as one column and the number of events that happened within that time bucket (i.e. between timestamp and timestamp + 2 minutes) as the other column.
Any help is very much appreciated.
Thanks.
You may try to use a window function:
from pyspark.sql.functions import window

# Bucket Time_Stamp into fixed, non-overlapping 2-minute windows and count the events in each.
(df.groupBy(window(df["Time_Stamp"], "2 minutes"))
   .count()
   .show())
I can see that the Spark Streaming windowing functions do the grouping only based on when the data was received. I would like to do the grouping based on the timestamp field available in the data itself. Is that possible?
For example, the data creation timestamp available as part of the data is 1 PM, but Spark Streaming received the data at 1:05 PM. The grouping should be done based on the timestamp (1 PM) available in the data.
I would like to do the grouping based on the timestamp field available in the data itself. Is it possible?
No. Spark Streaming does not offer such a feature.
You should instead use Structured Streaming, which does offer window functions to group by.
Quoting Window Operations on Event Time:
Aggregations over a sliding event-time window are straightforward with Structured Streaming and are very similar to grouped aggregations. In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into.
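A minimal sketch of an event-time window aggregation in Structured Streaming, assuming a streaming DataFrame named events with an event-time column creationTime carried inside the data (both names are illustrative):

import org.apache.spark.sql.functions._

// Group by the timestamp inside the data (event time), not by arrival time.
val countsPerHour = events
  .withWatermark("creationTime", "10 minutes")      // tolerate data arriving up to 10 minutes late
  .groupBy(window(col("creationTime"), "1 hour"))   // fixed 1-hour event-time windows
  .count()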
DataFrame A (millions of records) has, among its columns, create_date and modified_date.
DataFrame B (500 records) has start_date and end_date.
Current approach:
SELECT a.*, b.* FROM a JOIN b ON a.create_date BETWEEN start_date AND end_date
The above job takes half an hour or more to run. How can I improve the performance?
The DataFrame API currently doesn't have a way to do a direct join like that; it will fully read both tables before performing the join.
https://issues.apache.org/jira/browse/SPARK-16614
You can use the RDD API to take advantage of the joinWithCassandraTable function
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
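A rough sketch of that approach, with illustrative keyspace, table, and key names; joinWithCassandraTable fetches only the Cassandra rows whose keys appear in the driving RDD instead of scanning the whole table:

import com.datastax.spark.connector._

// Keys to look up, e.g. derived from the small DataFrame.
// Assumes my_table's partition key column is named id (illustrative).
case class Key(id: String)
val keys = sc.parallelize(Seq(Key("a1"), Key("b2")))

// Pulls only the matching rows from my_keyspace.my_table.
val joined = keys.joinWithCassandraTable("my_keyspace", "my_table")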
As others have suggested, one approach is to broadcast the smaller DataFrame. This can also happen automatically by configuring the parameter below.
spark.sql.autoBroadcastJoinThreshold
If a DataFrame is smaller than the value specified here, Spark automatically broadcasts it and performs a broadcast hash join instead of a shuffle join. You can read more about this here.
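A minimal sketch of forcing the broadcast explicitly, where dfA and dfB are illustrative names for the large and small DataFrames:

import org.apache.spark.sql.functions.broadcast

// Ship the 500-row DataFrame to every executor so the millions-of-rows side
// is never shuffled; the range condition is then evaluated locally.
val joined = dfA.join(
  broadcast(dfB),
  dfA("create_date").between(dfB("start_date"), dfB("end_date"))
)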