I have this dataset:
+---------+------+------------------+--------------------+-------------+
| LCLid|season| sum(KWH/hh)| avg(KWH/hh)|Acorn_grouped|
+---------+------+------------------+--------------------+-------------+
|MAC000023|autumn|4067.4269999000007| 0.31550007755972703| 4|
|MAC000128|spring| 961.2639999999982| 0.10876487893188484| 2|
|MAC000012|summer| 121.7360000000022|0.027548314098212765| 0|
|MAC000053|autumn| 2289.498000000006| 0.17883908764255632| 2|
|MAC000121|spring| 1893.635999900008| 0.21543071671217384| 1|
For every consumer ID we have the sum and average consumption for every season; Acorn_grouped is fixed for each consumer.
I want to aggregate by ID and, at the same time, extract these new features with rounded numbers, to finally get this data:
+---------+-------------+-------------------+------------------+------------------+------------------
| LCLid|Acorn_grouped|autumn_avg(KWH/hh) |autumn_sum(KWH/hh)|autumn_max(KWH/hh)|spring_avg(KWH/hh)
+---------+-------------+-------------------+------------------+------------------+-----------------
|MAC000023| 4| | | |
|MAC000128| 2| | | |
|MAC000012| 0| | | |
|MAC000053| 2| | | |
|MAC000121| 1| | | |
You can do a pivot:
import pyspark.sql.functions as F

result = df.groupBy('LCLid', 'Acorn_grouped') \
    .pivot('season') \
    .agg(
        F.round(F.first('sum(KWH/hh)')).alias('sum(KWH/hh)'),
        F.round(F.first('avg(KWH/hh)')).alias('avg(KWH/hh)')
    ).fillna(0)  # replace nulls with zero -
                 # you can skip this if you want to keep nulls
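The desired output also has per-season max columns such as autumn_max(KWH/hh). If the input can hold more than one row per (LCLid, season) pair, a sketch under that assumption would aggregate instead of taking the first value (F.sum, F.avg and F.max are all in pyspark.sql.functions):

import pyspark.sql.functions as F

# Sketch: assumes several rows per (LCLid, season) are possible,
# so we aggregate rather than take the first value.
result = df.groupBy('LCLid', 'Acorn_grouped') \
    .pivot('season') \
    .agg(
        F.round(F.sum('sum(KWH/hh)')).alias('sum(KWH/hh)'),
        F.round(F.avg('avg(KWH/hh)')).alias('avg(KWH/hh)'),
        F.round(F.max('avg(KWH/hh)')).alias('max(KWH/hh)')
    ).fillna(0)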
I'm working on Apache Spark 2.3.0 (Cloudera 4) and I have an issue processing a DataFrame.
I've got this input dataframe:
+---+---+----+
| id| d1| d2 |
+---+---+----+
| 1| | 2.0|
| 2| |-4.0|
| 3| | 6.0|
| 4|3.0| |
+---+---+----+
And I need this output:
+---+---+----+----+
| id| d1| d2 | r |
+---+---+----+----+
| 1| | 2.0| 7.0|
| 2| |-4.0| 5.0|
| 3| | 6.0| 9.0|
| 4|3.0| | 3.0|
+---+---+----+----+
That is, from an iterative perspective: take the row with the biggest id (4) and put its d1 value in the r column, then take the next row (3) and put r[4] + d2[3] in its r column, and so on.
Is it possible to do something like that in Spark? Because I will need a computed value from one row to calculate the value for another row.
How about this? The important bit is sum($"r1").over(Window.orderBy($"id".desc)), which calculates a cumulative sum of a column. Other than that, I'm creating a couple of helper columns to get the max id and get the ordering right.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val result = df
  .withColumn("max_id", max($"id").over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
  .withColumn("r1", when($"id" === $"max_id", $"d1").otherwise($"d2"))
  .withColumn("r", sum($"r1").over(Window.orderBy($"id".desc)))
  .drop($"max_id").drop($"r1")
  .orderBy($"id")
result.show
+---+----+----+---+
| id| d1| d2| r|
+---+----+----+---+
| 1|null| 2.0|7.0|
| 2|null|-4.0|5.0|
| 3|null| 6.0|9.0|
| 4| 3.0|null|3.0|
+---+----+----+---+
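If you need this in PySpark rather than Scala, a minimal sketch of the same approach (helper max_id and r1 columns plus a cumulative sum ordered by id descending) could look like this:

import pyspark.sql.functions as F
from pyspark.sql import Window

# Window covering the whole frame, used only to find the max id
w_all = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

result = (df
    .withColumn('max_id', F.max('id').over(w_all))
    .withColumn('r1', F.when(F.col('id') == F.col('max_id'), F.col('d1')).otherwise(F.col('d2')))
    .withColumn('r', F.sum('r1').over(Window.orderBy(F.col('id').desc())))
    .drop('max_id', 'r1')
    .orderBy('id'))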
Given a CSV file, I converted it to a DataFrame using code like the following.
raw_df = spark.read.csv(input_data, header=True)
That creates a DataFrame that looks something like this:
| Name |
========
| 23 |
| hi2 |
| me3 |
| do |
I want to convert this column to only contain numbers. The final result should be like this, where hi and me are removed:
| Name |
========
| 23 |
| 2 |
| 3 |
| do |
I want to sanitize the values and make sure they only contain numbers. But I'm not sure if that's possible in Spark.
Yes, it's possible. You can use regexp_replace from pyspark.sql.functions.
Please check this:
import pyspark.sql.functions as f
df = spark.sparkContext.parallelize([('12',), ('hi2',), ('me3',)]).toDF(["name"])
df.show()
+----+
|name|
+----+
| 12|
| hi2|
| me3|
+----+
final_df = df.withColumn('sanitize', f.regexp_replace('name', '[a-zA-Z]', ''))
final_df.show()
+----+--------+
|name|sanitize|
+----+--------+
| 12| 12|
| hi2| 2|
| me3| 3|
+----+--------+
final_df.withColumn('len', f.length('sanitize')).show()
+----+--------+---+
|name|sanitize|len|
+----+--------+---+
| 12| 12| 2|
| hi2| 2| 1|
| me3| 3| 1|
+----+--------+---+
You can adjust the regex as needed.
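For example, if the values can also contain punctuation or whitespace, a small variation strips everything that is not a digit instead of only letters:

# Strip every character that is not a digit (instead of only letters)
final_df = df.withColumn('sanitize', f.regexp_replace('name', '[^0-9]', ''))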
Here is another way of doing the same thing with a UDF. It's just an alternative; it's better to use Spark's built-in functions when available, as shown above.
from pyspark.sql.functions import udf
import re

# Extract the first run of digits from each value
user_func = udf(lambda x: re.findall(r"\d+", x)[0])
newdf = df.withColumn('new_column', user_func(df.Name))
>>> newdf.show()
+----+----------+
|Name|new_column|
+----+----------+
| 23| 23|
| hi2| 2|
| me3| 3|
+----+----------+
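Note that this UDF assumes every value contains at least one digit; for a value like do from the question it would raise an IndexError on the executor. A sketch of a safer variant that keeps the original value when no digits are found (which matches the expected output for do):

from pyspark.sql.functions import udf
import re

# Return the first run of digits, or the original value if there are none
safe_func = udf(lambda x: (re.findall(r"\d+", x) or [x])[0])
newdf = df.withColumn('new_column', safe_func(df.Name))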
Assume we have a spark DataFrame that looks like the following (ordered by time):
+------+-------+
| time | value |
+------+-------+
| 1 | A |
| 2 | A |
| 3 | A |
| 4 | B |
| 5 | B |
| 6 | A |
+------+-------+
I'd like to calculate the start/end times of each sequence of uninterrupted values. The expected output from the above DataFrame would be:
+-------+-------+-----+
| value | start | end |
+-------+-------+-----+
| A | 1 | 3 |
| B | 4 | 5 |
| A | 6 | 6 |
+-------+-------+-----+
(The end value for the final row could also be null.)
Doing this with a simple group aggregation:
.groupBy("value")
.agg(
F.min("time").alias("start"),
F.max("time").alias("end")
)
doesn't take into account the fact that the same value can appear in multiple different intervals.
The idea is to create an identifier for each consecutive group and use it to group by and compute your min and max time.
Assuming df is your DataFrame:
from pyspark.sql import functions as F, Window

df = df.withColumn(
    "fg",
    F.when(
        F.lag('value').over(Window.orderBy("time")) == F.col("value"),
        0
    ).otherwise(1)
)

df = df.withColumn(
    "rn",
    F.sum("fg").over(
        Window
        .orderBy("time")
        .rowsBetween(Window.unboundedPreceding, Window.currentRow)
    )
)
From that point, you have your dataframe with an identifier for each consecutive group.
df.show()
+----+-----+---+---+
|time|value| rn| fg|
+----+-----+---+---+
| 1| A| 1| 1|
| 2| A| 1| 0|
| 3| A| 1| 0|
| 4| B| 2| 1|
| 5| B| 2| 0|
| 6| A| 3| 1|
+----+-----+---+---+
Then you just have to do the aggregation:
df.groupBy(
    'value',
    "rn"
).agg(
    F.min('time').alias("start"),
    F.max('time').alias("end")
).drop("rn").show()
+-----+-----+---+
|value|start|end|
+-----+-----+---+
| A| 1| 3|
| B| 4| 5|
| A| 6| 6|
+-----+-----+---+
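For convenience, here is the same logic chained into a single expression (a sketch; note that a window without partitionBy moves all rows to a single partition, so it is best suited to data of moderate size):

from pyspark.sql import functions as F, Window

w = Window.orderBy("time")

result = (df
    .withColumn("fg", F.when(F.lag("value").over(w) == F.col("value"), 0).otherwise(1))
    .withColumn("rn", F.sum("fg").over(w.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
    .groupBy("value", "rn")
    .agg(F.min("time").alias("start"), F.max("time").alias("end"))
    .drop("rn"))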
A crossJoin can be done as follows:
import pandas as pd
from datetime import timedelta

date_today = pd.Timestamp.today().normalize()  # any reference date will do

df1 = pd.DataFrame({'subgroup': ['A', 'B', 'C', 'D']})
df2 = pd.DataFrame({'dates': pd.date_range(date_today, date_today + timedelta(3), freq='D')})
sdf1 = spark.createDataFrame(df1)
sdf2 = spark.createDataFrame(df2)
sdf1.crossJoin(sdf2).toPandas()
In this example there are two dataframes, each containing 4 rows; in the end, I get 16 rows.
However, for my problem, I would like to do a cross join per user, and the user is another column in the two dataframes, e.g.:
import numpy as np

df1 = pd.DataFrame({'user': [1, 1, 1, 1, 2, 2, 2, 2], 'subgroup': ['A', 'B', 'C', 'D', 'A', 'B', 'D', 'E']})
df2 = pd.DataFrame({'user': [1, 1, 1, 1, 2, 2, 2, 2],
                    'dates': np.hstack([np.array(pd.date_range(date_today, date_today + timedelta(3), freq='D')),
                                        np.array(pd.date_range(date_today + timedelta(1), date_today + timedelta(4), freq='D'))])})
The result of applying the per-user crossJoin should be a dataframe with 32 rows. Is this possible in pyspark and how can this be done?
A cross join is a join that generates a multiplication of rows because the joining key does not identify rows uniquely (in our case the joining key is trivial, or there is no joining key at all).
Let's start with sample data frames:
import numpy as np
import pyspark.sql.functions as psf
import pyspark.sql.types as pst

df1 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value1']]))
df2 = spark.createDataFrame(
    [[user, value] for user, value in zip(5 * list(range(2)), np.random.randint(0, 100, 10).tolist())],
    schema=pst.StructType([pst.StructField(c, pst.IntegerType()) for c in ['user', 'value2']]))
+----+------+
|user|value1|
+----+------+
| 0| 76|
| 1| 59|
| 0| 14|
| 1| 71|
| 0| 66|
| 1| 61|
| 0| 2|
| 1| 22|
| 0| 16|
| 1| 83|
+----+------+
+----+------+
|user|value2|
+----+------+
| 0| 65|
| 1| 81|
| 0| 60|
| 1| 69|
| 0| 21|
| 1| 61|
| 0| 98|
| 1| 76|
| 0| 40|
| 1| 21|
+----+------+
Let's try joining the data frames on a constant column to see the equivalence between a cross join and regular join on a constant (trivial) column:
df = df1.withColumn('key', psf.lit(1)) \
    .join(df2.withColumn('key', psf.lit(1)), on=['key'])
With Spark > 2 we get an error, because it realizes we're trying to do a cross join (Cartesian product):
Py4JJavaError: An error occurred while calling o1865.showString.
: org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LogicalRDD [user#1538, value1#1539], false
and
LogicalRDD [user#1542, value2#1543], false
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
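As the error message itself suggests, you can either use the explicit crossJoin syntax or enable implicit Cartesian products; a quick sketch of both options:

# Option 1: be explicit about the Cartesian product
# (both frames have a 'user' column, so the result will carry two of them)
df = df1.crossJoin(df2)

# Option 2: allow implicit Cartesian products for trivial join keys
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df = df1.withColumn('key', psf.lit(1)) \
    .join(df2.withColumn('key', psf.lit(1)), on=['key'])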
If your joining key (user here) is not a column that uniquely identifies rows, you'll get a multiplication of rows as well, but within each user group:
df = df1.join(df2, on='user')
print("Number of rows : \tdf1: {} \tdf2: {} \tdf: {}".format(df1.count(), df2.count(), df.count()))
Number of rows : df1: 10 df2: 10 df: 50
+----+------+------+
|user|value1|value2|
+----+------+------+
| 1| 59| 81|
| 1| 59| 69|
| 1| 59| 61|
| 1| 59| 76|
| 1| 59| 21|
| 1| 71| 81|
| 1| 71| 69|
| 1| 71| 61|
| 1| 71| 76|
| 1| 71| 21|
| 1| 61| 81|
| 1| 61| 69|
| 1| 61| 61|
| 1| 61| 76|
| 1| 61| 21|
| 1| 22| 81|
| 1| 22| 69|
| 1| 22| 61|
| 1| 22| 76|
| 1| 22| 21|
+----+------+------+
5 * 5 rows for user 0 + 5 * 5 rows for user 1, hence 50
Note: Using a self join followed by a filter usually means you should be using window functions instead.
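Applied to the pandas frames from the question (4 rows per user in each frame), the same join on user gives the expected 4 * 4 + 4 * 4 = 32 rows; a minimal sketch:

sdf1 = spark.createDataFrame(df1)  # columns: user, subgroup
sdf2 = spark.createDataFrame(df2)  # columns: user, dates

per_user_cross = sdf1.join(sdf2, on='user')
print(per_user_cross.count())  # 32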
I need to write a user defined aggregate function that captures the number of days between the previous DISCHARGE_DATE and the following ADMIT_DATE for each pair of consecutive visits.
I will also need to pivot on the "PERSON_ID" values.
I have the following input_df :
input_df :
+---------+----------+--------------+
|PERSON_ID|ADMIT_DATE|DISCHARGE_DATE|
+---------+----------+--------------+
| 111|2018-03-15| 2018-03-16|
| 333|2018-06-10| 2018-06-11|
| 111|2018-03-01| 2018-03-02|
| 222|2018-12-01| 2018-12-02|
| 222|2018-12-05| 2018-12-06|
| 111|2018-03-30| 2018-03-31|
| 333|2018-06-01| 2018-06-02|
| 333|2018-06-20| 2018-06-21|
| 111|2018-01-01| 2018-01-02|
+---------+----------+--------------+
First, I need to group by each person and sort the corresponding rows by ADMIT_DATE. That would yield "input_df2".
input_df2:
+---------+----------+--------------+
|PERSON_ID|ADMIT_DATE|DISCHARGE_DATE|
+---------+----------+--------------+
| 111|2018-01-01| 2018-01-03|
| 111|2018-03-01| 2018-03-02|
| 111|2018-03-15| 2018-03-16|
| 111|2018-03-30| 2018-03-31|
| 222|2018-12-01| 2018-12-02|
| 222|2018-12-05| 2018-12-06|
| 333|2018-06-01| 2018-06-02|
| 333|2018-06-10| 2018-06-11|
| 333|2018-06-20| 2018-06-21|
+---------+----------+--------------+
The desired output_df :
+------------------+-----------------+-----------------+----------------+
|PERSON_ID_DISTINCT| FIRST_DIFFERENCE|SECOND_DIFFERENCE|THIRD_DIFFERENCE|
+------------------+-----------------+-----------------+----------------+
| 111| 1 month 26 days | 13 days| 14 days|
| 222| 3 days| NAN| NAN|
| 333| 8 days| 9 days| NAN|
+------------------+-----------------+-----------------+----------------+
I know the maximum number of times a person appears in my input_df, so I know how many columns should be created, by:
print input_df.groupBy('PERSON_ID').count().sort('count', ascending=False).show(5)
Thanks a lot in advance,
You can use pyspark.sql.functions.datediff() to compute the difference between two dates in days. In this case, you just need to compute the difference between the current row's ADMIT_DATE and the previous row's DISCHARGE_DATE. You can do this by using pyspark.sql.functions.lag() over a Window.
For example, we can compute the duration between visits in days as a new column DURATION.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy('PERSON_ID').orderBy('ADMIT_DATE')
input_df.withColumn(
        'DURATION',
        f.datediff(f.col('ADMIT_DATE'), f.lag('DISCHARGE_DATE').over(w))
    )\
    .withColumn('INDEX', f.row_number().over(w) - 1)\
    .sort('PERSON_ID', 'INDEX')\
    .show()
#+---------+----------+--------------+--------+-----+
#|PERSON_ID|ADMIT_DATE|DISCHARGE_DATE|DURATION|INDEX|
#+---------+----------+--------------+--------+-----+
#| 111|2018-01-01| 2018-01-02| null| 0|
#| 111|2018-03-01| 2018-03-02| 58| 1|
#| 111|2018-03-15| 2018-03-16| 13| 2|
#| 111|2018-03-30| 2018-03-31| 14| 3|
#| 222|2018-12-01| 2018-12-02| null| 0|
#| 222|2018-12-05| 2018-12-06| 3| 1|
#| 333|2018-06-01| 2018-06-02| null| 0|
#| 333|2018-06-10| 2018-06-11| 8| 1|
#| 333|2018-06-20| 2018-06-21| 9| 2|
#+---------+----------+--------------+--------+-----+
Notice that I also added an INDEX column using pyspark.sql.functions.row_number(). We can just filter for INDEX > 0 (because the first value will always be null) and then pivot the DataFrame:
input_df.withColumn(
        'DURATION',
        f.datediff(f.col('ADMIT_DATE'), f.lag('DISCHARGE_DATE').over(w))
    )\
    .withColumn('INDEX', f.row_number().over(w) - 1)\
    .where('INDEX > 0')\
    .groupBy('PERSON_ID').pivot('INDEX').agg(f.first('DURATION'))\
    .sort('PERSON_ID')\
    .show()
#+---------+---+----+----+
#|PERSON_ID| 1| 2| 3|
#+---------+---+----+----+
#| 111| 58| 13| 14|
#| 222| 3|null|null|
#| 333| 8| 9|null|
#+---------+---+----+----+
Now you can rename the columns to whatever you desire.
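For example, assuming the pivoted result from above is stored in a DataFrame called pivoted_df (a hypothetical name), the columns could be renamed to match the desired output like this:

# Rename the pivoted columns (named '1', '2', '3' by the pivot) to the desired names
result = pivoted_df.withColumnRenamed('PERSON_ID', 'PERSON_ID_DISTINCT')\
    .withColumnRenamed('1', 'FIRST_DIFFERENCE')\
    .withColumnRenamed('2', 'SECOND_DIFFERENCE')\
    .withColumnRenamed('3', 'THIRD_DIFFERENCE')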
Note: This assumes that ADMIT_DATE and DISCHARGE_DATE are of type date.
input_df.printSchema()
#root
# |-- PERSON_ID: long (nullable = true)
# |-- ADMIT_DATE: date (nullable = true)
# |-- DISCHARGE_DATE: date (nullable = true)
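If those columns are strings instead, a minimal sketch for converting them before computing the differences (to_date is available in pyspark.sql.functions):

# Convert string columns to DateType first
input_df = input_df.withColumn('ADMIT_DATE', f.to_date('ADMIT_DATE', 'yyyy-MM-dd'))\
    .withColumn('DISCHARGE_DATE', f.to_date('DISCHARGE_DATE', 'yyyy-MM-dd'))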