I have loaded my data into a Spark dataframe and am using Spark SQL to further process it.
My Question is simple:
I have data like:
Event_ID    Time_Stamp
1           2018-04-11T20:20..
2           2018-04-11T20:20..+1
and so on.
I want to get the number of events that happened every 2 minutes.
So,
My output will be:
Timestamp               No_of_events
2018-04-11T20:20..      2
2018-04-11T20:20..+2    3
In Pandas this was quite easy, but I don't know how to do it in Spark SQL.
The output must have the timestamp as one column and the number of events that happened within that time bucket (i.e. between the timestamp and timestamp + 2 minutes) as another column.
Any help is very much appreciated.
Thanks.
You may try to use a window function:
df.groupBy(window(df["Time_Stamp"], "2 minutes"))
.count()
.show()
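To get output in the exact shape shown above (a timestamp column plus a count), you can pull the window start out of the window struct. A minimal PySpark sketch, assuming Time_Stamp is already a timestamp column:

from pyspark.sql.functions import window, col

(df.groupBy(window(col("Time_Stamp"), "2 minutes"))
   .count()
   .select(col("window.start").alias("Timestamp"),
           col("count").alias("No_of_events"))
   .orderBy("Timestamp")
   .show(truncate=False))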
Let's say I have the following join (modified from Spark documentation):
impressionsWithWatermark.join(
clicksWithWatermark,
expr("""
clickAdId = impressionAdId AND
clickTime >= cast(impressionTime as date) AND
clickTime <= cast(impressionTime as date) + interval 1 day
""")
)
Assume that both tables have trillions of rows covering 2 years of data. I think that joining everything from both tables is unnecessary. What I want to do is create subsets, roughly like this: create 365 * 2 * 2 smaller dataframes so that there is one dataframe per day per table for the 2 years, then create 365 * 2 join queries and take a union of them. But that is inefficient. I am not sure how to do it properly. I think I should add table.repartition(factor/multiple of 365 * 2) for both tables, add write.partitionBy(cast(impressionTime as date), cast(impressionTime as date)) to the stream writer, and set the number of executors times cores to a factor or multiple of 365 * 2.
What is the proper way to do this? Does Spark analyze the query and optimize it so that entries from a single day are automatically put in the same partition? What if I am joining records not from the same day but from the same hour, and there are very few records between 11pm and 1am? Does Spark know that partitioning by day is most efficient, or would a finer partitioning be even more efficient?
First, let me restate what I have understood from your question: you have two tables with two years' worth of data, each with around a trillion records, and you want to join them efficiently for a given timeframe (for example a specific month, or a custom date range), reading only that slice of data rather than the whole tables.
To answer your question, you can do something like the following:
First, when you write the data to create the table, partition both tables by a day column so that each day's data ends up in its own directory/partition. Spark won't do that by default; you have to decide on it based on your dataset (see the sketch below).
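A minimal PySpark sketch of the write side, with hypothetical dataframe, column, and path names:

from pyspark.sql.functions import to_date, col

# Derive a day column and write one directory per day.
(impressions
    .withColumn("impression_date", to_date(col("impressionTime")))
    .write
    .partitionBy("impression_date")
    .mode("overwrite")
    .parquet("/data/impressions"))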
Second, when you read the data and perform the join, do not run it over the whole tables. Read only the relevant partitions by applying a filter condition on the dataframe; Spark will then apply partition pruning and read only the partitions that satisfy the filter.
Once you have filtered the data at read time and stored it in dataframes, join those dataframes on the key relationship. That is the most efficient and performant way to do it at the first shot (see the sketch below).
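A minimal PySpark sketch of the read-and-join side, again with hypothetical paths and partition-column names; the filters on the partition columns are what trigger partition pruning:

from pyspark.sql.functions import col, expr

impressions = (spark.read.parquet("/data/impressions")
               .filter(col("impression_date") == "2018-04-11"))
clicks = (spark.read.parquet("/data/clicks")
          .filter(col("click_date").between("2018-04-11", "2018-04-12")))

joined = impressions.join(
    clicks,
    expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 day
    """))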
If it is still not fast enough, you can look at bucketing your data in addition to partitioning, but in most cases it is not required.
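If you do go down that route, here is a minimal bucketing sketch; it assumes you write to a managed table (bucketBy requires saveAsTable) and uses the ad id from the question as a hypothetical bucketing key:

(impressions
    .write
    .bucketBy(32, "impressionAdId")
    .sortBy("impressionAdId")
    .mode("overwrite")
    .saveAsTable("impressions_bucketed"))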
I wonder whether partitioning by multiple columns when writing a Spark DataFrame makes future reads slower.
I know that partitioning by columns that are critical for future filtering improves read performance, but what is the effect of partitioning by many columns, even ones that are not useful for filtering?
A sample would be:
(ordersDF
.write
.format("parquet")
.mode("overwrite")
.partitionBy("CustomerId", "OrderDate", .....) # <----------- add many columns
.save("/storage/Orders_parquet"))
Yes, because Spark has to shuffle and sort the data to create that many partitions.
There will be many combinations of the partition keys.
For example:
suppose CustomerId has 10 unique values
suppose OrderDate has 10 unique values
suppose a third partition column also has 10 unique values
The number of partition directories will be 10 * 10 * 10.
Even in this small scenario, 1000 partition directories need to be created, so there is a lot of shuffling and sorting, which means more time.
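A minimal PySpark sketch for checking how many partition directories a given column combination would produce before committing to it (ordersDF and the column names are taken from the question above):

# Number of distinct (CustomerId, OrderDate) combinations = number of output partition directories.
n_partitions = (ordersDF
                .select("CustomerId", "OrderDate")
                .distinct()
                .count())
print(n_partitions)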
I use pyspark to process a fixed set of data records on a daily basis and store them as 16 parquet files in a Hive table, using the date as the partition. In theory, the number of records every day should be on the same order of magnitude, about 1.2 billion rows, and it is indeed on the same order.
When I look at the parquet files, the size of each parquet file per day is around 86 MB, as on 2019-09-04 for example.
But one thing I noticed to be very strange is 2019-08-03: the file size is 10x larger than the files for other dates, yet the number of records seems to be more or less the same. I am confused and cannot come up with a reason for it. If you have any idea as to why, please share it. Thank you.
I've just realised that the way I saved the data for 2019-08-03 is as follows:
cols = sparkSession \
.sql("SELECT * FROM {} LIMIT 1".format(table_name)).columns
df.select(cols).write.insertInto(table_name, overwrite=True)
For other days:
insertSQL = """
INSERT OVERWRITE TABLE {}
PARTITION(crawled_at_ds = '{}')
SELECT column1, column2, column3, column4
FROM calendarCrawlsDF
"""
sparkSession.sql(
insertSQL.format(table_name,
calendarCrawlsDF.take(1)[0]["crawled_at_ds"]))
For 2019-08-03, I used the DataFrame insertInto method. For other days, I used sparkSession.sql to execute INSERT OVERWRITE TABLE.
Could this be the reason?
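If it helps to narrow it down, here is a minimal diagnostic sketch (partition values taken from the dates mentioned above) that pulls a few rows from the oversized partition and from a normal one, so you can check whether the rows written by the two code paths look different:

big = sparkSession.sql(
    "SELECT * FROM {} WHERE crawled_at_ds = '2019-08-03' LIMIT 5".format(table_name))
normal = sparkSession.sql(
    "SELECT * FROM {} WHERE crawled_at_ds = '2019-09-04' LIMIT 5".format(table_name))
big.show(truncate=False)
normal.show(truncate=False)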
Dataframe A (millions of records); among its columns are create_date and modified_date.
Dataframe B (500 records) has start_date and end_date.
Current approach:
SELECT a.*, b.* FROM a JOIN b ON a.create_date BETWEEN start_date AND end_date
The above job takes half an hour or more to run.
How can I improve the performance?
The DataFrame API currently doesn't have an approach for direct joins like that; it will fully read both tables before performing the join.
https://issues.apache.org/jira/browse/SPARK-16614
You can use the RDD API to take advantage of the joinWithCassandraTable function:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
As others suggested, one approach is to broadcast the smaller dataframe. This can also happen automatically, controlled by the parameter below.
spark.sql.autoBroadcastJoinThreshold
If a table's size is below the value specified here, Spark automatically broadcasts it and performs a broadcast join instead of a shuffle join. You can read more about this in the Spark SQL configuration documentation.
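A minimal PySpark sketch, assuming dfA and dfB are the two dataframes from the question above and spark is the active session:

from pyspark.sql.functions import broadcast

# Raise the automatic broadcast threshold (value in bytes)...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# ...or broadcast the small dataframe explicitly with a hint.
result = dfA.join(
    broadcast(dfB),
    (dfA["create_date"] >= dfB["start_date"]) &
    (dfA["create_date"] <= dfB["end_date"]))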
I have a DataFrame of orders (contactid, orderdate, orderamount) and I want a new column that contains, for each order, the sum of all order amounts for that contact for the 12 months prior to the order. I am thinking the best way is to use the window functions and the new INTERVAL ability in Spark 1.5+.
But I'm having difficulty making this work or finding documentation. My best guess is:
val dfOrdersPlus = dfOrders
.withColumn("ORDERAMOUNT12MONTH",
expr("sum(ORDERAMOUNT) OVER (PARTITION BY CONTACTID ORDER BY ORDERDATE RANGE BETWEEN INTERVAL 12 months preceding and INTERVAL 1 day preceding)"));
But I get a RuntimeException: 'end of input expected'. Any ideas of what I am doing wrong with this 'expr' and where I could find documentation on the new INTERVAL literals?
As of now:
Window functions are not supported in expr. To use window functions you have to use either the DataFrame DSL or raw SQL on a registered table (Spark 1.5 and 1.6 only).
Window functions support range intervals only for numeric types. You cannot use DateType / TimestampType with date INTERVAL expressions (Spark 1.5, 1.6, 2.0.0-preview).
If you want to use window functions with date or time columns you can convert these to Unix timestamps first. You'll find a full example in Spark Window Functions - rangeBetween dates.
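A minimal PySpark sketch of that workaround, using the column names from the question and approximating the 12-month window as 365 days:

from pyspark.sql import Window
from pyspark.sql.functions import col, unix_timestamp, sum as sum_

DAY = 86400  # seconds in a day

# Order by the order date as a Unix timestamp so rangeBetween can work in seconds.
w = (Window.partitionBy("CONTACTID")
     .orderBy(unix_timestamp(col("ORDERDATE")))
     .rangeBetween(-365 * DAY, -1 * DAY))

dfOrdersPlus = dfOrders.withColumn(
    "ORDERAMOUNT12MONTH", sum_(col("ORDERAMOUNT")).over(w))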