Merging multiple rows in a spark dataframe into a single row - apache-spark

I have a data frame with 2 columns: timestamp, value
timestamp is a time since the epoch and value is a float value.
I want to merge rows so that the values are averaged per minute.
That means I want to take all rows whose timestamps fall within the same round minute (60-second intervals since the epoch) and merge them into a single row, where the value column is the mean of all the values.
To give an example, let's assume my dataframe looks like this:
timestamp value
--------- -----
1441637160 10.0
1441637170 20.0
1441637180 30.0
1441637210 40.0
1441637220 10.0
1441637230 0.0
The first 4 rows are part of the same minute (1441637160 % 60 == 0 and 1441637160 + 60 == 1441637220).
The last 2 rows are part of another minute.
I would like to merge all rows of the same minute to get a result that looks like:
timestamp value
--------- -----
1441637160 25.0 (since (10+20+30+40)/4 = 25)
1441637220 5.0 (since (10+0)/2 = 5)
What's the best way to do that?

You can simply group and aggregate. With data as:
val df = sc.parallelize(Seq(
  (1441637160, 10.0),
  (1441637170, 20.0),
  (1441637180, 30.0),
  (1441637210, 40.0),
  (1441637220, 10.0),
  (1441637230, 0.0)
)).toDF("timestamp", "value")
import required functions and classes:
import org.apache.spark.sql.functions.{lit, floor, mean}
import org.apache.spark.sql.types.IntegerType
create interval column:
val tsGroup = (floor($"timestamp" / lit(60)) * lit(60))
.cast(IntegerType)
.alias("timestamp")
and use it to perform aggregation:
df.groupBy(tsGroup).agg(mean($"value").alias("value")).show
// +----------+-----+
// | timestamp|value|
// +----------+-----+
// |1441637160| 25.0|
// |1441637220| 5.0|
// +----------+-----+

First map the timestamp to the minute bucket, then use groupByKey to calculate the averages. For example:
rdd.map(x => { val round = x._1 % 60; (x._1 - round, x._2) })
  .groupByKey
  .map(x => (x._1, x._2.sum.toDouble / x._2.size))
  .collect()
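For comparison, here is a minimal PySpark sketch of the same minute-bucket average using the DataFrame API (my own addition, assuming a dataframe df with timestamp and value columns as in the question):
from pyspark.sql import functions as F
# Truncate each epoch timestamp down to the start of its minute, then average per bucket.
df.groupBy(
    (F.floor(F.col("timestamp") / 60) * 60).cast("int").alias("timestamp")
).agg(F.mean("value").alias("value")).show()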

Related

Pyspark Dataframe: filter by row if it does not meet condition dependent on next column

Sorry if the title does not explain this clearly; I could not think of a better way to phrase it. So I have a data frame that is organized as follows:
ID Depart Arrive Time
****************************
A 1 2 1pm
A 2 3 2pm
A 4 1 5pm
So what I'm trying to do is find all the Times where one row's Depart does not match the next row's Arrive.
For example, the second row here has 3 as its Arrive but the third row has 4 as its Depart (as opposed to 3).
What I'm hoping to get is a new data frame with all these mismatches. For this data frame it would look like this:
ID From To Time
********************************
A 3 4 [2pm,5pm]
I'm struggling to figure out how to do this with spark as opposed to converting the data frame into a different data structure and iterating over it. Apologies if I missed anything, I'm new to spark.
You can check the value of Depart in the next row using lead and compare it to the value of Arrive in the current row. If they are different, collect all the necessary information into a struct and expand it later. Note that this solution only works for your time format (ha: hour followed by am/pm).
from pyspark.sql import functions as F, Window
w = Window.partitionBy('ID').orderBy('Time2')
df2 = df.withColumn(
    'Time2',
    F.to_timestamp('Time', 'ha')
).withColumn(
    'unmatched_arrive',
    F.when(
        F.lead('Depart').over(w) != F.col('Arrive'),
        F.struct(
            F.col('Arrive').alias('From'),
            F.lead('Depart').over(w).alias('To'),
            F.array('Time', F.lead('Time').over(w)).alias('Time')
        )
    )
).dropna(subset=['unmatched_arrive']).dropDuplicates().select('ID', 'unmatched_arrive.*')
df2.show()
+---+----+---+----------+
|ID |From|To |Time |
+---+----+---+----------+
|A |3 |4 |[2pm, 5pm]|
+---+----+---+----------+

Generating monthly timestamps between two dates in pyspark dataframe

I have some DataFrame with "date" column and I'm trying to generate a new DataFrame with all monthly timestamps between the min and max date from the "date" column.
One possible solution is below:
month_step = 31*60*60*24
min_date, max_date = df.select(min_("date").cast("long"), max_("date").cast("long")).first()
df_ts = spark.range(
    (min_date / month_step) * month_step,
    ((max_date / month_step) + 1) * month_step,
    month_step
).select(col("id").cast("timestamp").alias("yearmonth"))
df_formatted_ts = df_ts.withColumn(
    "yearmonth",
    f.concat(f.year("yearmonth"), f.lit('-'), format_string("%02d", f.month("yearmonth")))
).select('yearmonth')
df_formatted_ts.orderBy(asc('yearmonth')).show(150, False)
The problem is that I used 31 days as the month_step, which is not really correct because some months have 30 days or even 28. Is it possible to make this more precise?
Just as a note: later I only need the year and month values, so I will ignore day and time. But because I'm generating timestamps over quite a big date range (between 2001 and 2018), the timestamps shift.
That's why some months end up being skipped. For example, this snapshot is missing 2010-02:
|2010-01 |
|2010-03 |
|2010-04 |
|2010-05 |
|2010-06 |
|2010-07 |
I checked and there are just 3 months which were skipped from 2001 through 2018.
Suppose you had the following DataFrame:
data = [("2000-01-01","2002-12-01")]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()
#+----------+----------+
#| minDate| maxDate|
#+----------+----------+
#|2000-01-01|2002-12-01|
#+----------+----------+
You can add a column date with all of the months in between minDate and maxDate, by following the same approach as my answer to this question.
Just replace pyspark.sql.functions.datediff with pyspark.sql.functions.months_between, and use add_months instead of date_add:
import pyspark.sql.functions as f
df.withColumn("monthsDiff", f.months_between("maxDate", "minDate"))\
.withColumn("repeat", f.expr("split(repeat(',', monthsDiff), ',')"))\
.select("*", f.posexplode("repeat").alias("date", "val"))\
.withColumn("date", f.expr("add_months(minDate, date)"))\
.select('date')\
.show(n=50)
#+----------+
#| date|
#+----------+
#|2000-01-01|
#|2000-02-01|
#|2000-03-01|
#|2000-04-01|
# ...skipping some rows...
#|2002-10-01|
#|2002-11-01|
#|2002-12-01|
#+----------+
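If you are on Spark 2.4 or later, an alternative worth noting (a sketch of my own, not part of the original answer) is to build the month list directly with the built-in sequence function and explode it:
import pyspark.sql.functions as f
# One row per month between minDate and maxDate, inclusive.
df.select(
    f.explode(
        f.expr("sequence(to_date(minDate), to_date(maxDate), interval 1 month)")
    ).alias("date")
).show(n=50)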

spark - Calculating average of values in 2 or more columns and putting in new column in every row [duplicate]

This question already has answers here:
Spark DataFrame: Computing row-wise mean (or any aggregate operation)
(2 answers)
Closed 4 years ago.
Suppose I have a Dataset/Dataframe with the following contents:
name, marks1, marks2
Alice, 10, 20
Bob, 20, 30
I want to add a new column which should have the average of the marks1 and marks2 columns.
Expected Result:-
name, marks1, marks2, Result(Avg)
Alice, 10, 20, 15
Bob, 20, 30, 25
For summing or any other arithmetic operation I use df.withColumn("xyz", $"marks1" + $"marks2"). I cannot find a similar way to take the average. Please help.
Additionally, the number of columns is not fixed: sometimes it might be the average of 2 columns, sometimes 3 or even more, so I want generic code that works in all cases.
One of the easiest and most optimized ways is to create a list of the marks columns and use it with withColumn, as shown below.
pyspark
from pyspark.sql.functions import col
marksColumns = [col('marks1'), col('marks2')]
averageFunc = sum(x for x in marksColumns)/len(marksColumns)
df.withColumn('Result(Avg)', averageFunc).show(truncate=False)
and you should get
+-----+------+------+-----------+
|name |marks1|marks2|Result(Avg)|
+-----+------+------+-----------+
|Alice|10 |20 |15.0 |
|Bob |20 |30 |25.0 |
+-----+------+------+-----------+
scala-spark
The process is almost the same in Scala as in Python above:
import org.apache.spark.sql.functions.{col, lit}
val marksColumns = Array(col("marks1"), col("marks2"))
val averageFunc = marksColumns.foldLeft(lit(0)){(x, y) => x+y}/marksColumns.length
df.withColumn("Result(Avg)", averageFunc).show(false)
which should give you the same output as in PySpark. I hope the answer is helpful.
It's as easy as using User Defined Functions. By creating a specific UDF to deal with the average of many columns, you will be able to reuse it as many times as you want.
Python
In this snippet, I'm creating a UDF that takes an array of columns, and calculates the average of it.
from pyspark.sql.functions import udf, array
from pyspark.sql.types import DoubleType
avg_cols = udf(lambda array: sum(array)/len(array), DoubleType())
df.withColumn("average", avg_cols(array("marks1", "marks2"))).show()
Output:
+-----+------+------+--------+
| name|marks1|marks2| average|
+-----+------+------+--------+
|Alice| 10| 20| 15.0|
| Bob| 20| 30| 25.0|
+-----+------+------+--------+
Scala
With the Scala API, you must process the selected columns as a Row. You just have to select the columns using the Spark struct function.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._
import scala.util.Try
def average = udf((row: Row) => {
  val values = row.toSeq.map(x => Try(x.toString.toDouble).toOption).filter(_.isDefined).map(_.get)
  if (values.nonEmpty) values.sum / values.length else 0.0
})
df.withColumn("average", average(struct($"marks1", $"marks2"))).show()
As you can see, I'm casting any value to Double with Try, so that if a value cannot be cast, no exception is thrown and the average is computed only over the values that could be converted.
And that's all :)
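Since the question also asks for generic code that works for an arbitrary number of columns, here is a minimal PySpark sketch of my own (assuming the column names are supplied as a list) that builds the average from native column arithmetic and avoids a UDF:
from functools import reduce
from pyspark.sql import functions as F

def avg_of(col_names):
    # (c1 + c2 + ... + cn) / n, built from plain column expressions
    cols = [F.col(c) for c in col_names]
    return reduce(lambda a, b: a + b, cols) / len(cols)

df.withColumn("Result(Avg)", avg_of(["marks1", "marks2"])).show(truncate=False)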

filter and save first X lines of a dataframe

I'm using pySpark to read and calculate statistics for a dataframe.
The dataframe looks like:
TRANSACTION_URL START_TIME END_TIME SIZE FLAG COL6 COL7 ...
www.google.com 20170113093210 20170113093210 150 1 ... ...
www.cnet.com 20170113114510 20170113093210 150 2 ... ...
I'm adding a new timePeriod column to the dataframe, and after adding it, I would like to save the first 50K records with timePeriod matching some pre-defined value.
My intention is to save those lines to CSV with the dataframe header.
I know this should be a combination of col and write.csv but I'm not sure how to properly use those for my intentions.
My current code is:
encodeUDF = udf(encode_time, StringType())
log_df = log_df.withColumn('timePeriod', encodeUDF(col('START_TIME')))
And after the column has been added, I'm guessing I should use something like:
log_df.select(col('timePeriod') == 'Weekday').write.csv(....)
Can someone please help me fill the blanks here, to match my intentions?
unix_timestamp and date_format are useful methods here, as START_TIME is not of timestamp type. (The snippet below is Scala; a PySpark sketch of the same idea appears further below.)
val dfWithDayNum = log_df.withColumn(
  "timePeriod",
  date_format(unix_timestamp(col("START_TIME"), "yyyyMMddHHmmss").cast(TimestampType), "u")
)
timePeriod will have the day number of week (1 = Monday, ..., 7 = Sunday)
dfWithDayNum
  .filter(col("timePeriod") < 6) // keep only weekdays (1 = Monday ... 5 = Friday)
  .limit(50000) // first X lines
  .write
  .option("header", "true")
  .csv("location/to/save/df")
Solved using filter() and limit() methods in the following way:
new_log_df.filter(col('timePeriod') == '20161206, Morning').limit(50).write.\
format('csv').option("header", "true").save("..Path..")
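For reference, here is a PySpark sketch of the day-of-week filtering from the first answer (my own adaptation; dayofweek numbers Sunday as 1 through Saturday as 7):
from pyspark.sql import functions as F

weekday_df = (log_df
    .withColumn("ts", F.to_timestamp(F.col("START_TIME").cast("string"), "yyyyMMddHHmmss"))
    .filter(F.dayofweek("ts").between(2, 6))  # 2 = Monday ... 6 = Friday
    .limit(50000))
weekday_df.write.option("header", "true").csv("location/to/save/df")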

Getting weekly and daily averages of timestamp data

I currently have data on a Spark data frame that is formatted as such:
Timestamp Number
......... ......
M-D-Y 3
M-D-Y 4900
The timestamp data is in no way uniform or consistent (i.e., I could have one value that is present on March 1, 2015, and the next value in the table could be for September 1, 2015... also, I could have multiple entries per date).
So I want to do two things:
1. Calculate the number of entries per week. So I would essentially want a new table that records, for each week, the number of rows whose timestamp falls in that week. If multiple years are present, I would ideally want to average the values across years to get a single value.
2. Average the Number column for each week. So for every week of the year, I would have a value that represents the average of the Number column (0 if there is no entry within that week).
Parsing dates is relatively easy using built-in functions, by combining unix_timestamp with simple type casting:
sqlContext.sql(
"SELECT CAST(UNIX_TIMESTAMP('March 1, 2015', 'MMM d, yyyy') AS TIMESTAMP)"
).show(false)
// +---------------------+
// |_c0 |
// +---------------------+
// |2015-03-01 00:00:00.0|
// +---------------------+
With the DataFrame DSL, equivalent code would be something like this:
import org.apache.spark.sql.functions.unix_timestamp
unix_timestamp($"date", "MMM d, yyyy").cast("timestamp")
To fill missing entries you can use different tricks. The simplest approach is to use the same parsing logic as above. First let's create a few helpers:
def leap(year: Int) = {
  ((year % 4 == 0) && (year % 100 != 0)) || (year % 400 == 0)
}
def weeksForYear(year: Int) = (1 to 52).map(w => s"$year $w")
def daysForYear(year: Int) = (1 to { if (leap(year)) 366 else 365 }).map(
  d => s"$year $d"
)
and example reference data (here for weeks but you can do the same thing for days):
import org.apache.spark.sql.functions.{year, weekofyear, count, avg}
val exprs = Seq(year($"date").alias("year"), weekofyear($"date").alias("week"))
val weeks2015 = Seq(2015)
  .flatMap(weeksForYear _)
  .map(Tuple1.apply)
  .toDF("date")
  .withColumn("date", unix_timestamp($"date", "yyyy w").cast("timestamp"))
  .select(exprs: _*)
Finally you can transform the original data:
val df = Seq(
  ("March 1, 2015", 3), ("September 1, 2015", 4900)
).toDF("Timestamp", "Number")
val dfParsed = df
  .withColumn("date", unix_timestamp($"Timestamp", "MMM d, yyyy").cast("timestamp"))
  .select(exprs :+ $"Number": _*)
merge and aggregate:
weeks2015.join(dfParsed, Seq("year", "week"), "left")
  .groupBy($"year", $"week")
  .agg(count($"Number"), avg($"Number"))
  .na.fill(0)
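The same idea translates to PySpark; here is a hedged sketch of my own (it assumes the "MMMM d, yyyy" pattern for full month names and a reference table covering the years of interest):
from pyspark.sql import functions as F

# Parse the dates and extract (year, week) along with the Number column.
parsed = (df
    .withColumn("date", F.to_timestamp("Timestamp", "MMMM d, yyyy"))
    .select(F.year("date").alias("year"), F.weekofyear("date").alias("week"), "Number"))

# Reference table with every week of the years of interest, so empty weeks show up as 0.
weeks = (spark.createDataFrame([(2015,)], ["year"])
    .crossJoin(spark.range(1, 53).withColumnRenamed("id", "week")))

(weeks.join(parsed, ["year", "week"], "left")
    .groupBy("year", "week")
    .agg(F.count("Number").alias("entries"), F.avg("Number").alias("avg_number"))
    .na.fill(0)
    .orderBy("year", "week")
    .show())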
