I currently have data in a Spark DataFrame formatted as follows:
Timestamp Number
......... ......
M-D-Y 3
M-D-Y 4900
The timestamp data is in no way uniform or consistent (i.e., one value could be for March 1, 2015 and the next value in the table could be for September 1, 2015; there can also be multiple entries per date).
So I want to do two things:
1. Calculate the number of entries per week. I essentially want a new table with, for each week, the number of rows whose timestamp falls in that week. If multiple years are present, I would ideally average the values across years to get a single value per week.
2. Average the Number column for each week. So for every week of the year, I would have a value that represents the average of the Number column (0 if there is no entry within that week).
Parsing the date is relatively easy using built-in functions, by combining unix_timestamp with a simple type cast:
sqlContext.sql(
"SELECT CAST(UNIX_TIMESTAMP('March 1, 2015', 'MMM d, yyyy') AS TIMESTAMP)"
).show(false)
// +---------------------+
// |_c0 |
// +---------------------+
// |2015-03-01 00:00:00.0|
// +---------------------+
The equivalent code with the DataFrame DSL would be something like this:
import org.apache.spark.sql.functions.unix_timestamp
unix_timestamp($"date", "MMM d, yyyy").cast("timestamp")
To fill missing entries you can use different tricks. The simplest approach is to use the same parsing logic as above. First let's create a few helpers:
def leap(year: Int) = {
  ((year % 4 == 0) && (year % 100 != 0)) || (year % 400 == 0)
}

def weeksForYear(year: Int) = (1 to 52).map(w => s"$year $w")

def daysForYear(year: Int) = (1 to { if (leap(year)) 366 else 365 }).map(
  d => s"$year $d"
)
and example reference data (here for weeks but you can do the same thing for days):
import org.apache.spark.sql.functions.{year, weekofyear}
val exprs = Seq(year($"date").alias("year"), weekofyear($"date").alias("week"))
val weeks2015 = Seq(2015)
.flatMap(weeksForYear _)
.map(Tuple1.apply)
.toDF("date")
.withColumn("date", unix_timestamp($"date", "yyyy w").cast("timestamp"))
.select(exprs: _*)
Finally you can transform the original data:
val df = Seq(
("March 1, 2015", 3), ("September 1, 2015", 4900)).toDF("Timestamp", "Number")
val dfParsed = df
.withColumn("date", unix_timestamp($"timestamp", "MMM d, yyyy").cast("timestamp"))
.select(exprs :+ $"Number": _*)
merge and aggregate:
import org.apache.spark.sql.functions.{avg, count}

weeks2015.join(dfParsed, Seq("year", "week"), "left")
  .groupBy($"year", $"week")
  .agg(count($"Number"), avg($"Number"))
  .na.fill(0)
Suppose we have a very large table for which we'd like to compute statistics incrementally.
Date         Amount   Customer
----------   ------   --------
2022-12-20   30       Mary
2022-12-21   12       Mary
2022-12-20   12       Bob
2022-12-21   15       Bob
2022-12-22   15       Alice
We'd like to be able to calculate incrementally how much we made per distinct customer for a date range. So from 12-20 to 12-22 (inclusive), we'd have 3 distinct customers, but 12-20 to 12-21 there are 2 distinct customers.
If we want to run this pipeline once a day and there are many customers, how can we keep a rolling count of distinct customers for an arbitrary date range? Is there a way to do this without storing a huge list of customer names for each day?
We'd like to support a frontend that has a date range filter and can quickly calculate results for that date range. For example:
Start Date   End Date     Average Income Per Customer
----------   ----------   ---------------------------
2022-12-20   2022-12-21   (30+12+12+15)/2 = 34.5
2022-12-20   2022-12-22   (30+12+12+15+15)/3 = 28
The only approach I can think of is to store a set of customer names for each day, and when viewing the results calculate the size of the union of those sets to get the number of distinct customers. This seems inefficient. In this case we'd store the following table, with the Customers column being extremely large.
Date         Total Income   Customers
----------   ------------   --------------
2022-12-20   42             set(Mary, Bob)
2022-12-21   27             set(Mary, Bob)
2022-12-22   15             set(Alice)
For me the best solution is to do some pre-calculation on the existing data, then, for the new data that comes in every day, do the calculation only on that new data and add the results to the previously calculated data. Also partition on the date column, since we filter on dates; this triggers Spark's filter push-down and speeds up the queries.
There are two parts: one to get the total amount between two dates, and another for the distinct customers between two dates.
For the amount, use a prefix sum: add the sum of all previous days to each day. Then, to get the total between two dates, you only subtract the values stored for those two days instead of looping over all the dates in between. For the example data the daily totals are 42, 27 and 15, the running sums are 42, 69 and 84, so the amount from 2022-12-20 to 2022-12-22 is 84 - 0 = 84.
For the distinct customers, the best approach I can think of is to save the Date and Customer columns in a new file partitioned by date (which helps optimize the queries), and then use the fast approx_count_distinct.
Here's some code:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, sum, lag, lit, approx_count_distinct

spark = SparkSession.builder.master("local[*]").getOrCreate()
data = [
    ["2022-12-20", 30, "Mary"],
    ["2022-12-21", 12, "Mary"],
    ["2022-12-20", 12, "Bob"],
    ["2022-12-21", 15, "Bob"],
    ["2022-12-22", 15, "Alice"],
]
df = spark.createDataFrame(data).toDF("Date", "Amount", "Customer")

def init_amount_data(df):
    w = Window.orderBy(col("Date"))
    amount_sum_df = df.groupby("Date").agg(sum("Amount").alias("Amount")) \
        .withColumn("amount_sum", sum(col("Amount")).over(w)) \
        .withColumn("prev_amount_sum", lag("amount_sum", 1, 0).over(w)) \
        .select("Date", "amount_sum", "prev_amount_sum")
    amount_sum_df.write.mode("overwrite").partitionBy("Date").parquet("./path/amount_data_df")
    amount_sum_df.show(truncate=False)

# keep only the customer data to avoid unnecessary columns when querying; partitioning by Date makes the queries faster thanks to Spark's filter push-down mechanism
def init_customers_data(df):
    df.select("Date", "Customer").write.mode("overwrite").partitionBy("Date").parquet("./path/customers_data_df")

# each day (for example at midnight) update the amount data with only yesterday's data: take the last amount_sum and add the amount of the last day to it
def update_amount_data(last_partition):
    amountDataDf = spark.read.parquet("./path/amount_data_df")
    maxDate = getMaxDate("./path/amount_data_df")  # implement a Hadoop method to get the last partition date
    lastMaxPartition = amountDataDf.filter(col("Date") == maxDate)
    lastPartitionAmountSum = lastMaxPartition.select("amount_sum").first()[0]
    yesterday_amount_sum = last_partition.groupby("Date").agg(sum("Amount").alias("amount_sum"))
    newPartition = yesterday_amount_sum.withColumn("amount_sum", col("amount_sum") + lastPartitionAmountSum) \
        .withColumn("prev_amount_sum", lit(lastPartitionAmountSum))
    newPartition.write.mode("append").partitionBy("Date").parquet("./path/amount_data_df")

def update_customers_data(last_partition):
    last_partition.write.mode("append").partitionBy("Date").parquet("./path/customers_data_df")

def query_amount_date(beginDate, endDate):
    amountDataDf = spark.read.parquet("./path/amount_data_df")
    endDateAmount = amountDataDf.filter(col("Date") == endDate).select("amount_sum").first()[0]
    beginDateAmount = amountDataDf.filter(col("Date") == beginDate).select("prev_amount_sum").first()[0]
    diff_amount = endDateAmount - beginDateAmount
    return diff_amount

def query_customers_date(beginDate, endDate):
    customersDataDf = spark.read.parquet("./path/customers_data_df")
    distinct_customers_nb = customersDataDf.filter(col("Date").between(lit(beginDate), lit(endDate))) \
        .agg(approx_count_distinct("Customer").alias("distinct_customers")).first()[0]
    return distinct_customers_nb

# This should be executed the first time only
init_amount_data(df)
init_customers_data(df)

# This should be executed every day at midnight with the data of the last day only
last_day_partition = df.filter(col("Date") == yesterday_date)
update_amount_data(last_day_partition)
update_customers_data(last_day_partition)

# Optimized queries that should be executed with the desired date range
beginDate = "2022-12-20"
endDate = "2022-12-22"
answer = query_amount_date(beginDate, endDate) / query_customers_date(beginDate, endDate)
print(answer)
If calculating the distinct customers is not fast enough, there's another approach: keep a running count of all distinct customers in one table and the set of already-seen customers in a second table. Each day, if there are new customers, increment the count in the first table and add those customers to the second table; if not, do nothing.
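Here's a minimal sketch of that alternative (not part of the pipeline above); the paths ./path/known_customers_df and ./path/customers_count_df and the table layouts are assumptions for illustration:

def update_distinct_customers(last_partition, day):
    known = spark.read.parquet("./path/known_customers_df")    # one row per customer ever seen
    counts = spark.read.parquet("./path/customers_count_df")   # one row per day: (Date, total)
    # customers of the day that have never been seen before
    new_customers = last_partition.select("Customer").distinct() \
        .join(known, on="Customer", how="left_anti")
    new_count = new_customers.count()
    if new_count > 0:
        prev_total = counts.agg({"total": "max"}).first()[0] or 0
        # add the new customers to the known-customers table
        new_customers.write.mode("append").parquet("./path/known_customers_df")
        # increment the running distinct-customer total
        spark.createDataFrame([(day, prev_total + new_count)], ["Date", "total"]) \
            .write.mode("append").parquet("./path/customers_count_df")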
Finally, there are some tricks for optimizing the groupBy or window functions, such as salting or extended partitioning.
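To illustrate the salting idea (a generic sketch, not tied to the tables above): add a random salt column, pre-aggregate by (key, salt), then aggregate the partial results by the key alone, so a skewed key gets spread over several tasks.

from pyspark.sql.functions import floor, rand, sum

salted = df.withColumn("salt", floor(rand() * 10))  # spread each Date over 10 buckets
partial = salted.groupBy("Date", "salt").agg(sum("Amount").alias("partial_sum"))
totals = partial.groupBy("Date").agg(sum("partial_sum").alias("Amount"))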
You can achieve this by filtering the rows with dates between start_date and end_date, grouping by customer, calculating the sum of amounts per customer, and then taking the average of those sums. This approach works for only one start_date and end_date, so you have to run the code again with different parameters for each date range; see the helper sketched after the snippet below.
from pyspark.sql import functions as F

start_date = '2022-12-20'
end_date = '2022-12-21'
(
df
.withColumn('isInRange', F.col('date').between(start_date, end_date))
.filter(F.col('isInRange'))
.groupby('customer')
.agg(F.sum('amount').alias('sum'))
.agg(F.avg('sum').alias('avg income'))
).show()
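If you need the result for several ranges (like the two in the question), one option is to wrap the snippet above in a small helper and call it once per range. This is just a sketch, reusing the column names from the snippet:

def avg_income_per_customer(df, start_date, end_date):
    return (
        df
        .filter(F.col('date').between(start_date, end_date))
        .groupby('customer')
        .agg(F.sum('amount').alias('sum'))
        .agg(F.avg('sum').alias('avg income'))
        .first()[0]
    )

print(avg_income_per_customer(df, '2022-12-20', '2022-12-21'))  # 34.5, as in the question's table
print(avg_income_per_customer(df, '2022-12-20', '2022-12-22'))  # 28.0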
I have a dataframe which has 5 columns: ColA, ColB, country, start_time, end_time. I need to form a new df from the existing df after doing the processing below:
If df.country == US then we have to do df.filter(start_time < today's date)
For the remaining countries we have to do df.filter(end_time < today's date)
You can do this by combining both conditions in a single filter, checking the country first and then the relevant date column:
import datetime
from pyspark.sql import functions as f

df.filter(((f.col('country') == 'US') & (f.col('start_time') < datetime.datetime.now())) |
          ((f.col('country') != 'US') & (f.col('end_time') < datetime.datetime.now())))
I have a data frame which contains a huge number of records. In that DF a record can be repeated multiple times, and every time it gets updated the last-updated field holds the date on which the record was modified.
We have a group of columns on which we want to compare rows with the same id. During this comparison we want to capture which fields/columns have changed from the previous record to the current record, and capture that in an "updated_columns" column of the updated record. Then compare the second record to the third record, identify the updated columns, and capture that in the "updated_columns" field of the third record; continue the same until the last record for that id, and do the same for every id that has more than one entry.
Initially we grouped the columns, created a hash of that group of columns, and compared it against the hash of the next row. This helps me identify records that have updates, but I also want the columns that got updated.
Here is some sample data showing the expected outcome, i.e., how the final data should look after adding the updated columns (use columns Col1, Col2, Col3, Col4 and Col5 for the comparison between two rows):
I want to do this in an efficient way. Has anyone tried something like this?
Looking for help!
~Krish.
A window can be used.
The idea is to group the data by ID, sort it by LAST-UPDATED, copy the values of the previous row (if it exists) into the current row and then compare the copied data with the current values.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array, col, lag}

val data = ... //the dataframe has the columns ID,Col1,Col2,Col3,Col4,Col5,LAST_UPDATED,IS_DELETED
val fieldNames = data.schema.fieldNames.dropRight(1) //1
val columns = fieldNames.map(f => col(f))
val windowspec = Window.partitionBy("ID").orderBy("LAST_UPDATED") //2

def compareArrayUdf() = ... //3

val result = data
  .withColumn("cur", array(columns: _*)) //4
  .withColumn("prev", lag($"cur", 1).over(windowspec)) //5
  .withColumn("updated_columns", compareArrayUdf()($"cur", $"prev")) //6
  .drop("cur", "prev") //7
  .orderBy("LAST_UPDATED")
Remarks:
1. Create a list of all fields to compare. All fields but the last one (LAST_UPDATED) are used.
2. Create a window that is partitioned by ID, with each partition sorted by LAST_UPDATED.
3. Create a udf that compares two arrays and maps the discovered differences to the field names; code see below.
4. Create a new column that contains all values that should be compared.
5. Create a new column that contains all values of the previous row (by using the lag function) that should be compared. The previous row is the row with the same ID and the biggest LAST_UPDATED that is smaller than the current one. This field can be null.
6. Compare the two new columns and put the result into updated_columns.
7. Drop the two intermediate columns created in steps 4 and 5.
The compareArrayUdf is:
import scala.collection.mutable
import org.apache.spark.sql.functions.udf

def compareArray(cur: mutable.WrappedArray[String], prev: mutable.WrappedArray[String]): String = {
  if (prev == null || cur == null) return ""
  val res = new StringBuilder
  for (i <- cur.indices) {
    if (!cur(i).contentEquals(prev(i))) {
      if (res.nonEmpty) res.append(",")
      res.append(fieldNames(i))
    }
  }
  res.toString()
}

def compareArrayUdf() = udf[String, mutable.WrappedArray[String], mutable.WrappedArray[String]](compareArray)
You can join your DataFrame or DataSet to itself, joining the rows where the id is the same in both rows and where the version of the left row is i and the version of the right row is i+1. Here's an example
case class T(id: String, version: Int, data: String)
val data = Seq(T("1", 1, "d1-1"), T("1", 2, "d1-2"), T("2", 1, "d2-1"), T("2", 2, "d2-2"), T("2", 3, "d2-3"), T("3", 1, "d3-1"))
data: Seq[T] = List(T(1,1,d1-1), T(1,2,d1-2), T(2,1,d2-1), T(2,2,d2-2), T(2,3,d2-3), T(3,1,d3-1))
val ds = data.toDS
val joined = ds.as("ds1").join(ds.as("ds2"), $"ds1.id" === $"ds2.id" && (($"ds1.version"+1) === $"ds2.version"))
And then you can reference the columns in the new DataFrame/Dataset like $"ds1.data" and $"ds2.data", etc.
To find the rows where the data changed from one version to another, you can do
joined.filter($"ds1.data" =!= $"ds2.data")
I'm using pySpark to read and calculate statistics for a dataframe.
The dataframe looks like:
TRANSACTION_URL START_TIME END_TIME SIZE FLAG COL6 COL7 ...
www.google.com 20170113093210 20170113093210 150 1 ... ...
www.cnet.com 20170113114510 20170113093210 150 2 ... ...
I'm adding a new timePeriod column to the dataframe, and after adding it, I would like to save the first 50K records with timePeriod matching some pre-defined value.
My intention is to save those lines to CSV with the dataframe header.
I know this should be a combination of col and write.csv but I'm not sure how to properly use those for my intentions.
My current code is:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

encodeUDF = udf(encode_time, StringType())
log_df = log_df.withColumn('timePeriod', encodeUDF(col('START_TIME')))
And after the column has been added, I'm guessing I should use something like:
log_df.select(col('timePeriod') == 'Weekday').write.csv(....)
Can someone please help me fill the blanks here, to match my intentions?
unix_timestamp and date_format are useful methods here, as START_TIME is not of timestamp type.
import org.apache.spark.sql.functions.{col, date_format, unix_timestamp}
import org.apache.spark.sql.types.TimestampType

val dfWithDayNum = log_df.withColumn("timePeriod", date_format(
  unix_timestamp(col("START_TIME"), "yyyyMMddHHmmss").cast(TimestampType), "u")
)
timePeriod will have the day number of week (1 = Monday, ..., 7 = Sunday)
dfWithDayNum
  .filter(col("timePeriod") < 6) // keep only weekdays (1 = Monday ... 5 = Friday)
  .limit(50000)                  // first 50K lines
  .write
  .option("header", "true")
  .csv("location/to/save/df")
Solved using filter() and limit() methods in the following way:
new_log_df.filter(col('timePeriod') == '20161206, Morning').limit(50).write.\
format('csv').option("header", "true").save("..Path..")
I have a data frame with 2 columns: timestamp, value
timestamp is a time since the epoch and value is a float value.
I want to merge rows to average the values by minute.
That means that I want to take all rows where timestamp is from the same round minute (60 seconds intervals since the epoch) and merge them into a single row, where the value column will be the mean of all the values.
To give an example, lets assume that my dataframe looks like this:
timestamp value
--------- -----
1441637160 10.0
1441637170 20.0
1441637180 30.0
1441637210 40.0
1441637220 10.0
1441637230 0.0
The first 4 rows are part of the same min (1441637160 % 60 == 0, 1441637160 + 60 == 1441637220)
The last 2 rows are part of another min.
I would like to merge all rows of the same min. to get a result that looks like:
timestamp value
--------- -----
1441637160 25.0 (since (10+20+30+40)/4 = 25)
1441637220 5.0 (since (10+0)/2 = 5)
What's the best way to do that?
You can simply group and aggregate. With data as:
val df = sc.parallelize(Seq(
(1441637160, 10.0),
(1441637170, 20.0),
(1441637180, 30.0),
(1441637210, 40.0),
(1441637220, 10.0),
(1441637230, 0.0))).toDF("timestamp", "value")
import required functions and classes:
import org.apache.spark.sql.functions.{floor, lit, mean}
import org.apache.spark.sql.types.IntegerType
create interval column:
val tsGroup = (floor($"timestamp" / lit(60)) * lit(60))
.cast(IntegerType)
.alias("timestamp")
and use it to perform aggregation:
df.groupBy(tsGroup).agg(mean($"value").alias("value")).show
// +----------+-----+
// | timestamp|value|
// +----------+-----+
// |1441637160| 25.0|
// |1441637220| 5.0|
// +----------+-----+
First map the timestamp to the minute bucket, then use groupByKey to calculate the averages. For example:
rdd.map { x => val round = x._1 % 60; (x._1 - round, x._2) }
  .groupByKey
  .map(x => (x._1, x._2.sum.toDouble / x._2.size))
  .collect()