Store aggregate value of a PySpark dataframe column into a variable - apache-spark

I am working with PySpark dataframes here. "test1" is my PySpark dataframe and event_date is a TimestampType. So when I try to get a distinct count of event_date, the result is a integer variable but when I try to get max of the same column the result is a dataframe. I would like to understand what operations result in a dataframe and variable. I would also like to know how to store the max of the event date as a variable
Code that results in an integer type:
loop_cnt=test1.select('event_date').distinct().count()
type(loop_cnt)
Code that results in dataframe type:
last_processed_dt=test1.select([max('event_date')])
type(last_processed_dt)
Edited to add a reproducible example:
schema = StructType([StructField("event_date", TimestampType(), True)])
df = sqlContext.createDataFrame([(datetime(2015, 8, 10, 2, 44, 15),),(datetime(2015, 8, 10, 3, 44, 15),)], schema)
Code that returns a dataframe:
last_processed_dt=df.select([max('event_date')])
type(last_processed_dt)
Code that returns a varible:
loop_cnt=df.select('event_date').distinct().count()
type(loop_cnt)

You cannot directly access the values in a dataframe. Dataframe returns a Row Object. Instead Dataframe gives you a option to convert it into a python dictionary. Go through the following example where I will calculate average wordcount:
wordsDF = sqlContext.createDataFrame([('cat',), ('elephant',), ('rat',), ('rat',), ('cat', )], ['word'])
wordCountsDF = wordsDF.groupBy(wordsDF['word']).count()
wordCountsDF.show()
Here are the word count results:
+--------+-----+
| word|count|
+--------+-----+
| cat| 2|
| rat| 2|
|elephant| 1|
+--------+-----+
Now I calculate the average of count column apply collect() operation on it. Remember collect() returns a list.Here the list contains one element only.
averageCount = wordCountsDF.groupBy().avg('count').collect()
Result looks something like this.
[Row(avg(count)=1.6666666666666667)]
You cannot access directly the average value using some python variable. You have to convert it into a dictionary to access it.
results={}
for i in averageCount:
results.update(i.asDict())
print results
Our final results look like these:
{'avg(count)': 1.6666666666666667}
Finally you can access average value using:
print results['avg(count)']
1.66666666667

I'm pretty sure df.select([max('event_date')]) returns a DataFrame because there could be more than one row that has the max value in that column. In your particular use case no two rows may have the same value in that column, but it is easy to imagine a case where more than one row can have the same max event_date.
df.select('event_date').distinct().count() returns an integer because it is telling you how many distinct values there are in that particular column. It does NOT tell you which value is the largest.
If you want code to get the max event_date and store it as a variable, try the following max_date = df.select([max('event_date')]).distinct().collect()

Using collect()
import pyspark.sql.functions as sf
distinct_count = df.agg(sf.countDistinct('column_name')).collect()[0][0]
Using first()
import pyspark.sql.functions as sf
distinct_count = df.agg(sf.countDistinct('column_name')).first()[0]

last_processed_dt=df.select([max('event_date')])
to get the max of date, we should try something like
last_processed_dt=df.select([max('event_date').alias("max_date")]).collect()[0]
last_processed_dt["max_date"]
Based on sujit's example.We can actually print the value without iterating/looping by
[Row(avg(count)=1.6666666666666667)] by providing averageCount[0][0].
Note: we are not going through the loop, because it's going to return only one value.

try this
loop_cnt=test1.select('event_date').distinct().count()
var = loop_cnt.collect()[0]
Hope this helps

trainDF.fillna({'Age':trainDF.select('Age').agg(avg('Age')).collect()[0][0]})

What you can try is accessing the collect() function.
As of spark 3.0, you can do the following:
loop_cnt=test1.select('event_date').distinct().count().collect()[0][0]
print(loop_cnt)

Related

How to get the size of a list returned by column in pyspark

name
contact
address
"max"
[{"email": "watson#commerce.gov", "phone": "650-333-3456"}, {"email": "emily#gmail.com", "phone": "238-111-7689"}]
{"city": "Baltimore", "state": "MD"}
"kyle"
[{"email": "johnsmith#yahoo.com", "phone": "425-231-8754"}]
{"city": "Barton", "state": "TN"}
I am working with a dataframe in Pyspark that has a few columns including the two mentioned above. I need to create columns dynamically based on the contact fields.
When I use the "." operator on contact as contact.email I get a list of emails. I need to create separate column for each of the emails.
contact.email0, contact.email1, etc.
I found this code online, which partially does what I want, but I don't completely understand it.
employee_data.select(
'name', *[col('contact.email')[i].alias(f'contact.email{i}') for i in range(2)]).show(truncate=False)
The range is static in this case, but my range could be dynamic. How can I get the size of list to loop through it? I tried size(col('contact.email')) or len(col('contact.email')) but got an error saying the col('column name') object is not iterable.
Desired output something like -
name
contact.email0
contact.email1
max
watson#commerce.gov
emily#gmail.com
kyle
johnsmith#yahoo.com
null
You can get desired output by using pivot function,
# convert contact struct to array of emails by using transform function
# explode the array
# perform pivot
df.select("name", posexplode_outer(expr("transform(contact, c-> c.email)"))) \
.withColumn("email", concat(lit("contact.email"), col("pos"))) \
.groupBy("name").pivot("email").agg(first("col")) \
.show(truncate=False)
+----+-------------------+---------------+
|name|contact.email0 |contact.email1 |
+----+-------------------+---------------+
|kyle|johnsmith#yahoo.com|null |
|max |watson#commerce.gov|emily#gmail.com|
+----+-------------------+---------------+
To understand what the solution you found does, we can print the expression in a shell:
>>> [F.col('contact.email')[i].alias(f'contact.email{i}') for i in range(2)]
[Column<'contact.email[0] AS `contact.email0`'>, Column<'contact.email[1] AS `contact.email1`'>]
Basically, it creates two columns, one for the first element of the array contact.email and one for the second element. That's all there is to it.
SOLUTION 1
Keep this solution. But you need to find the max size of your array first:
max_size = df.select(F.max(F.size("contact"))).first()[0]
df.select('name',
*[F.col('contact')[i]['email'].alias(f'contact.email{i}') for i in range(max_size)])\
.show(truncate=False)
SOLUTION 2
Use posexplode to generate one row per element of the array + a pos column containing the index of the email in the array. Then use a pivot to create the columns you want.
df.select('name', F.posexplode('contact.email').alias('pos', 'email'))\
.withColumn('pos', F.concat(F.lit('contact.email'), 'pos'))\
.groupBy('name')\
.pivot('pos')\
.agg(F.first('email'))\
.show()
Both solutions yield:
+----+-------------------+---------------+
|name|contact.email0 |contact.email1 |
+----+-------------------+---------------+
|max |watson#commerce.gov|emily#gmail.com|
|kyle|johnsmith#yahoo.com|null |
+----+-------------------+---------------+
You can use size or array_length functions to get the length of the list in the contact column, and then use that in the range function to dynamically create columns for each email. Here's an example:
from pyspark.sql.functions import size, array_length
contact_size = size(col('contact'))
employee_data.select(
'name', *[col('contact')[i]['email'].alias(f'contact.email{i}') for i in range(contact_size)]).show(truncate=False)
Or, using array_length:
from pyspark.sql.functions import size, array_length
contact_size = array_length(col('contact'))
employee_data.select(
'name', *[col('contact')[i]['email'].alias(f'contact.email{i}') for i in range(contact_size)]).show(truncate=False)

max aggregation on grouped spark dataframe returns wrong value

I have a spark dataframe containing 2 columns(CPID and PluginDuration). I need to find maximum pluginDuration and average pluginDuration for each CPID in the dataframe.
Rows returned for a CPID AN04773 dataframe returned below rows:
df.filter('CPID = "AN04773"').show(10)
Result:
+-------+--------------+
| CPID|PluginDuration|
+-------+--------------+
|AN04773| 1.933333333|
|AN04773| 13.03444444|
|AN04773| 9.2875|
|AN04773| 20.50027778|
+-------+--------------+
when I did groupBy on PID column of the dataframe to find max and avg plugin duration as below, I found the max value returned for some PIDs is not as expected. For example, for PID AN04773 (same PID which I used to show rows from original df). The max pluginDuration should be 20.50027778 but from the result from the below code, the max value is 9.2875 which is not right.
from pyspark.sql import functions as F
fdf = df.groupBy('CPID').agg(F.max('PluginDuration').alias('max_duration'),F.avg('PluginDuration').alias('avg_duration'))
fdf.filter('CPID = "AN04773"').show()
Result:
+-------+------------+--------------+
| CPID|max_duration| avg_duration|
+-------+------------+--------------+
|AN04773| 9.2875|11.18888888825|
+-------+------------+--------------+
want to know why it's not functioning as expected.
The wrong calculation happens because PluginDuration is not defined as a numeric datatype but as a string column. All you have to do is to cast PluginDuration column to be of a numeric type (double, float, etc).
Here is your issue (reproduced in scala but works the same in PySpark):
val data = Seq(("AN04773", "1.933333333"), ("AN04773", "13.03444444"), ("AN04773", "9.2875"), ("AN04773", "20.50027778")).toDF("id", "value")
data.groupBy("id").agg(functions.max("value"), avg("value")).show
// output:
+-------+----------+--------------+
| id|max(value)| avg(value)|
+-------+----------+--------------+
|AN04773| 9.2875|11.18888888825|
+-------+----------+--------------+
but after casting the value column to Double datatype, we get correct calculation values:
data.withColumn("value",col("value").cast("double")).groupBy("id").agg(functions.max("value"), avg("value")).show
// output:
+-------+-----------+--------------+
| id| max(value)| avg(value)|
+-------+-----------+--------------+
|AN04773|20.50027778|11.18888888825|
+-------+-----------+--------------+
As there is no datatype defined in the columns so Scala treated it as string and as a string 9 is greater than 2 , so the maximum is 9.285
Note: If you convert the datatype in pyspark to string you will get the same result as you are getting in Scala.
def transform(self, df):
from pyspark.sql.functions import sum, avg, max
from pyspark.sql.functions import col
#cast PluginDuration column from string to double to perform aggregation
df1=df.withColumn('PluginDuration',
col('PluginDuration').cast('double'))
df2=df1.groupBy("CPID") \
.agg(max("PluginDuration").alias("max_duration"),avg("PluginDuration").alias("avg_duration"))
#df2.show(truncate=False)
#rename and round the column values
df3 = df2.select(col("CPID").alias("chargepoint_id"),
func.round(df2["max_duration"], 2).alias("max_duration"),
func.round(df2["avg_duration"], 2).alias("avg_duration"))
#df3.show(truncate=False)
return df3

Use data in Spark Dataframe column as condition or input in another column expression

I have an operation that I want to perform within PySpark 2.0 that would be easy to perform as a df.rdd.map, but since I would prefer to stay inside the Dataframe execution engine for performance reasons, I want to find a way to do this using Dataframe operations only.
The operation, in RDD-style, is something like this:
def precision_formatter(row):
formatter = "%.{}f".format(row.precision)
return row + [formatter % row.amount_raw / 10 ** row.precision]
df = df.rdd.map(precision_formatter)
Basically, I have a column that tells me, for each row, what the precision for my string formatting operation should be, and I want to selectively format the 'amount_raw' column as a string depending on that precision.
I don't know of a way to use the contents of one or more columns as input to another Column operation. The closest I can come is suggesting the use of Column.when with an externally-defined set of boolean operations that correspond to the set of possible boolean conditions/cases within the column or columns.
In this specific case, for instance, if you can obtain (or better yet, already have) all possible values of row.precision, then you can iterate over that set and apply a Column.when operation for each value in the set. I believe this set can be obtained with df.select('precision').distinct().collect().
Because the pyspark.sql.functions.when and Column.when operations themselves return a Column object, you can iterate over the items in the set (however it was obtained) and keep 'appending' when operations to each other programmatically until you have exhausted the set:
import pyspark.sql.functions as PSF
def format_amounts_with_precision(df, all_precisions_set):
amt_col = PSF.when(df['precision'] == 0, df['amount_raw'].cast(StringType()))
for precision in all_precisions_set:
if precision != 0: # this is a messy way of having a base case above
fmt_str = '%.{}f'.format(precision)
amt_col = amt_col.when(df['precision'] == precision,
PSF.format_string(fmt_str, df['amount_raw'] / 10 ** precision)
return df.withColumn('amount', amt_col)
You can do it with a python UDF. They can take as many input values (values from columns of a Row) and spit out a single output value. It would look something like this:
from pyspark.sql import types as T, functions as F
from pyspark.sql.function import udf, col
# Create example data frame
schema = T.StructType([
T.StructField('precision', T.IntegerType(), False),
T.StructField('value', T.FloatType(), False)
])
data = [
(1, 0.123456),
(2, 0.123456),
(3, 0.123456)
]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)
# Define UDF and apply it
def format_func(precision, value):
format_str = "{:." + str(precision) + "f}"
return format_str.format(value)
format_udf = F.udf(format_func, T.StringType())
new_df = df.withColumn('formatted', format_udf('precision', 'value'))
new_df.show()
Also, if instead of the column precision value you wanted to use a global one, you could use the lit(..) function when you call it like this:
new_df = df.withColumn('formatted', format_udf(F.lit(2), 'value'))

How to split column of vectors into two columns?

I use PySpark.
Spark ML's Random Forest output DataFrame has a column "probability" which is a vector with two values. I just want to add two columns to the output DataFrame, "prob1" and "prob2", which correspond to the first and second values in the vector.
I've tried the following:
output2 = output.withColumn('prob1', output.map(lambda r: r['probability'][0]))
but I get the error that 'col should be Column'.
Any suggestions on how to transform a column of vectors into columns of its values?
I figured out the problem with the suggestion above. In pyspark, "dense vectors are simply represented as NumPy array objects", so the issue is with python and numpy types. Need to add .item() to cast a numpy.float64 to a python float.
The following code works:
split1_udf = udf(lambda value: value[0].item(), FloatType())
split2_udf = udf(lambda value: value[1].item(), FloatType())
output2 = randomforestoutput.select(split1_udf('probability').alias('c1'), split2_udf('probability').alias('c2'))
Or to append these columns to the original dataframe:
randomforestoutput.withColumn('c1', split1_udf('probability')).withColumn('c2', split2_udf('probability'))
Got the same problem, below is the code adjusted for the situation when you have n-length vector.
splits = [udf(lambda value: value[i].item(), FloatType()) for i in range(n)]
out = tstDF.select(*[s('features').alias("Column"+str(i)) for i, s in enumerate(splits)])
You may want to use one UDF to extract the first value and another to extract the second. You can then use the UDF with a select call on the output of the random forrest data frame. Example:
from pyspark.sql.functions import udf, col
split1_udf = udf(lambda value: value[0], FloatType())
split2_udf = udf(lambda value: value[1], FloatType())
output2 = randomForrestOutput.select(split1_udf(col("probability")).alias("c1"),
split2_udf(col("probability")).alias("c2"))
This should give you a dataframe output2 which has columns c1 and c2 corresponding to the first and second values in the list stored in the column probability.
I tried #Rookie Boy 's loop but it seems the splits udf loop doesn't work for me.
I modified a bit.
out = df
for i in range(len(n)):
splits_i = udf(lambda x: x[i].item(), FloatType())
out = out.withColumn('{col_}'.format(i), splits_i('probability'))
out.select(*['col_{}'.format(i) for i in range(3)]).show()

How to loop through each row of dataFrame in pyspark

E.g
sqlContext = SQLContext(sc)
sample=sqlContext.sql("select Name ,age ,city from user")
sample.show()
The above statement prints theentire table on terminal. But I want to access each row in that table using for or while to perform further calculations.
You simply cannot. DataFrames, same as other distributed data structures, are not iterable and can be accessed using only dedicated higher order function and / or SQL methods.
You can of course collect
for row in df.rdd.collect():
do_something(row)
or convert toLocalIterator
for row in df.rdd.toLocalIterator():
do_something(row)
and iterate locally as shown above, but it beats all purpose of using Spark.
To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.
def customFunction(row):
return (row.name, row.age, row.city)
sample2 = sample.rdd.map(customFunction)
or
sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
The custom function would then be applied to every row of the dataframe. Note that sample2 will be a RDD, not a dataframe.
Map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use the withColumn, with returns a dataframe.
sample3 = sample.withColumn('age2', sample.age + 2)
Using list comprehensions in python, you can collect an entire column of values into a list using just two lines:
df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]
In the above example, we return a list of tables in database 'default', but the same can be adapted by replacing the query used in sql().
Or more abbreviated:
tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]
And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop.
sql_text = "select name, age, city from user"
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
print("{} is a {} year old from {}".format(
row["name"],
row["age"],
row["city"]))
It might not be the best practice, but you can simply target a specific column using collect(), export it as a list of Rows, and loop through the list.
Assume this is your df:
+----------+----------+-------------------+-----------+-----------+------------------+
| Date| New_Date| New_Timestamp|date_sub_10|date_add_10|time_diff_from_now|
+----------+----------+-------------------+-----------+-----------+------------------+
|2020-09-23|2020-09-23|2020-09-23 00:00:00| 2020-09-13| 2020-10-03| 51148 |
|2020-09-24|2020-09-24|2020-09-24 00:00:00| 2020-09-14| 2020-10-04| -35252 |
|2020-01-25|2020-01-25|2020-01-25 00:00:00| 2020-01-15| 2020-02-04| 20963548 |
|2020-01-11|2020-01-11|2020-01-11 00:00:00| 2020-01-01| 2020-01-21| 22173148 |
+----------+----------+-------------------+-----------+-----------+------------------+
to loop through rows in Date column:
rows = df3.select('Date').collect()
final_list = []
for i in rows:
final_list.append(i[0])
print(final_list)
Give A Try Like this
result = spark.createDataFrame([('SpeciesId','int'), ('SpeciesName','string')],["col_name", "data_type"]);
for f in result.collect():
print (f.col_name)
If you want to do something to each row in a DataFrame object, use map. This will allow you to perform further calculations on each row. It's the equivalent of looping across the entire dataset from 0 to len(dataset)-1.
Note that this will return a PipelinedRDD, not a DataFrame.
above
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
should be
tupleList = [{'name':x["name"], 'age':x["age"], 'city':x["city"]}
for name, age, and city are not variables but simply keys of the dictionary.

Resources