subtract mean from pyspark dataframe - apache-spark

I'm trying to calculate the average for each column in a dataframe and subtract from each element in the column. I've created a function that attempts to do that, but when I try to implement it using a UDF, I get an error: 'float' object has no attribute 'map'. Any ideas on how I can create such a function? Thanks!
def normalize(data):
average=data.map(lambda x: x[0]).sum()/data.count()
out=data.map(lambda x: (x-average))
return out
mapSTD=udf(normalize,IntegerType())
dats = data.withColumn('Normalized', mapSTD('Fare'))

In your example there is problem with UDF function which can not be applied to row and whole DataFrame. UDF can be applied only to single row, but Spark also enables implementing UDAF (User Defined Aggregate Functions) working on whole DataFrame.
To solve your problem you can use below function:
from pyspark.sql.functions import mean
def normalize(df, column):
average = df.agg(mean(df[column]).alias("mean")).collect()[0]["mean"]
return df.select(df[column] - average)
Use it like this:
normalize(df, "Fare")
Please note that above only works on single column, but it is possible to implement something more generic:
def normalize(df, columns):
selectExpr = []
for column in columns:
average = df.agg(mean(df[column]).alias("mean")).collect()[0]["mean"]
selectExpr.append(df[column] - average)
return df.select(selectExpr)
use it like:
normalize(df, ["col1", "col2"])
This works, but you need to run aggregation for each column, so with many columns performance could be issue, but it is possible to generate only one aggregate expression:
def normalize(df, columns):
aggExpr = []
for column in columns:
aggExpr.append(mean(df[column]).alias(column))
averages = df.agg(*aggExpr).collect()[0]
selectExpr = []
for column in columns:
selectExpr.append(df[column] - averages[column])
return df.select(selectExpr)

Adding onto Piotr's answer. If you need to keep the existing dataframe and add normalized columns with aliases, the function can be modified as:
def normalize(df, columns):
aggExpr = []
for column in columns:
aggExpr.append(mean(df[column]).alias(column))
averages = df.agg(*aggExpr).collect()[0]
selectExpr = ['*']
for column in columns:
selectExpr.append((df[column] - averages[column]).alias('normalized_'+column))
return df.select(selectExpr)

Related

Spark keeping words in column that match a list

I currently have a list and a Spark dataframe:
['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
I am having a tough time figuring out a way to create a new column in the dataframe that takes the first matching word from the tags column for each row and puts it in the newly created column for that row.
For example, lets say the first row in the tags column has only "murder" in it, I would want that to show in the new column. Then, if the next row had "boring", "silly" and "cult" in it, I would want it to show cult in the new column since it matches the list. If the last row in tags column had "revenge", "cult" in it, I would want it to only show revenge, since its the first word that matches the list.
from pyspark.sql import functions as F
df = spark.createDataFrame([('murder',), ('boring silly cult',), ('revenge cult',)], ['tags'])
mylist = ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
pattern = '|'.join([f'({x})' for x in mylist])
df = df.withColumn('first_from_list', F.regexp_extract('tags', pattern, 0))
df.show()
# +-----------------+---------------+
# | tags|first_from_list|
# +-----------------+---------------+
# | murder| murder|
# |boring silly cult| cult|
# | revenge cult| revenge|
# +-----------------+---------------+
You could use a PySpark UDF (User Defined Function).
First, let's write a python function to find a first match between a list (in this case the list you provided) and a string, that is, the value of the tags column:
def find_first_match(tags):
first_match = ''
genres= ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
for tag in tags.split():
for genre in genres:
if tag==genre:
first_match=tag
return first_match
Then, we need to convert this function into a PySpark udf so that we can use it in combination with the .withColumn() operation:
find_first_matchUDF = udf(lambda z:find_first_match(z))
Now we can apply the udf function to generate a new column. Assuming df is the name of your DataFrame:
from pyspark.sql.functions import col
new_df = df.withColumn("first_match", find_first_matchUDF(col("tags")))
This approach only works if all tag in your tags column are separated by white spaces.
P.S
You can avoid the second step using annotation:
from pyspark.sql.functions import col
#udf(returnType=StringType())
def find_first_match(tags):
first_match = ''
genres= ['murder', 'violence', 'flashback', 'romantic', 'cult', 'revenge', 'psychedelic', 'comedy', 'suspenseful', 'good versus evil']
for tag in tags.split():
for genre in genres:
if tag==genre:
first_match=tag
return first_match
new_df = df.withColumn("first_match", find_first_match(col("tags")))

Dynamic Columns .withColumn Python DataFrame

I wanted to apply .withColumn dynamically on my Spark DataFrame with column names in list
from pyspark.sql.functions import col
from pyspark.sql.types import BooleanType
def get_dtype(dataframe,colname):
return [dtype for name, dtype in dataframe.dtypes if name == colname][0]
def get_matches(dataframe):
return [x for x in dataframe.columns if get_dtype(dataframe,x)=='tinyint']
matches = get_matches(srcpartyaddressDF)
matches
Above code give me list of columns where column datatype is 'tinyint'
Result:
Out[67]: ['verified_flag', 'standard_flag', 'overseas_flag', 'active']
Now I want to do below for each column from list matches dynamically
partyaddressDF = srcpartyaddressDF.withColumn("verified_flag", col("verified_flag").cast(BooleanType())).withColumn("standard_flag", col("standard_flag").cast(BooleanType())).withColumn("overseas_flag", col("overseas_flag").cast(BooleanType())).withColumn("active", col("active").cast(BooleanType()))
How can this be acheived in Python3
you can do something like this:
# import is necessary only for python 3
from functools import reduce
def do_cast(df, cl):
return df.withColumn(cl, col(cl).cast(BooleanType()))
matches = ['verified_flag', 'standard_flag', 'overseas_flag', 'active']
partyaddressDF = reduce(do_cast, matches, srcpartyaddressDF)
basically, it takes initial value (srcpartyaddressDF), and apply first item from list (column name), then takes 2nd value from list, and use it with result that was obtained on first execution, then 3rd value, ...

How to multiply two columns in a spark dataframe

Say I am having a dataframe named "orderitems" with below schema
DataFrame[order_item_id: int, order_item_order_id: int, order_item_product_id: int, order_item_quantity: int, order_item_subtotal: float, order_item_product_price: float]
So As a part of checking the data quality , I need to ensure all rows satisfies the formula : order_item_subtotal = (order_item_quantity*order_item_product_price).
For this I need to add a seperate column named "valid" which should have 'Y' as value for all those rows which satisfy the above formula and for all other rows it should have 'N' as value.
I have decided to use when() and otherwise() along with withColumn() method as below.
orderitems.withColumn("valid",when(orderitems.order_item_subtotal != (orderitems.order_item_product_price * orderitems.order_item_quantity),'N').otherwise("Y"))
But it returns me below Error:
TypeError: 'Column' object is not callable
I know this happened because I have tried to multiply two column objects. But I am not sure how to resolve this since I am still on a learnig proccess in spark.
I would like to know , how to fix this. I am using Spark 2.3.0 with Python
Try something like this:
from pyspark.sql.functions import col,when
orderitems.withColumn("valid",
when(col("order_item_subtotal") != (col("order_item_product_price") * col("order_item_quantity")),"N")
.otherwise("Y")).show()
This can be implemented through spark UDF functions which are very efficient in performing row operartions.
Before running this code make sure the comparison you are doing should have the same datatype.
def check(subtotal, item_quantity, item_product_price):
if subtotal == (item_quantity * item_product_price):
return "Y"
else:
return "N"
validate = udf(check)
orderitems = orderitems.withColumn("valid", validate("order_item_subtotal", "order_item_quantity", "order_item_product_price"))

Use data in Spark Dataframe column as condition or input in another column expression

I have an operation that I want to perform within PySpark 2.0 that would be easy to perform as a df.rdd.map, but since I would prefer to stay inside the Dataframe execution engine for performance reasons, I want to find a way to do this using Dataframe operations only.
The operation, in RDD-style, is something like this:
def precision_formatter(row):
formatter = "%.{}f".format(row.precision)
return row + [formatter % row.amount_raw / 10 ** row.precision]
df = df.rdd.map(precision_formatter)
Basically, I have a column that tells me, for each row, what the precision for my string formatting operation should be, and I want to selectively format the 'amount_raw' column as a string depending on that precision.
I don't know of a way to use the contents of one or more columns as input to another Column operation. The closest I can come is suggesting the use of Column.when with an externally-defined set of boolean operations that correspond to the set of possible boolean conditions/cases within the column or columns.
In this specific case, for instance, if you can obtain (or better yet, already have) all possible values of row.precision, then you can iterate over that set and apply a Column.when operation for each value in the set. I believe this set can be obtained with df.select('precision').distinct().collect().
Because the pyspark.sql.functions.when and Column.when operations themselves return a Column object, you can iterate over the items in the set (however it was obtained) and keep 'appending' when operations to each other programmatically until you have exhausted the set:
import pyspark.sql.functions as PSF
def format_amounts_with_precision(df, all_precisions_set):
amt_col = PSF.when(df['precision'] == 0, df['amount_raw'].cast(StringType()))
for precision in all_precisions_set:
if precision != 0: # this is a messy way of having a base case above
fmt_str = '%.{}f'.format(precision)
amt_col = amt_col.when(df['precision'] == precision,
PSF.format_string(fmt_str, df['amount_raw'] / 10 ** precision)
return df.withColumn('amount', amt_col)
You can do it with a python UDF. They can take as many input values (values from columns of a Row) and spit out a single output value. It would look something like this:
from pyspark.sql import types as T, functions as F
from pyspark.sql.function import udf, col
# Create example data frame
schema = T.StructType([
T.StructField('precision', T.IntegerType(), False),
T.StructField('value', T.FloatType(), False)
])
data = [
(1, 0.123456),
(2, 0.123456),
(3, 0.123456)
]
rdd = sc.parallelize(data)
df = sqlContext.createDataFrame(rdd, schema)
# Define UDF and apply it
def format_func(precision, value):
format_str = "{:." + str(precision) + "f}"
return format_str.format(value)
format_udf = F.udf(format_func, T.StringType())
new_df = df.withColumn('formatted', format_udf('precision', 'value'))
new_df.show()
Also, if instead of the column precision value you wanted to use a global one, you could use the lit(..) function when you call it like this:
new_df = df.withColumn('formatted', format_udf(F.lit(2), 'value'))

How to loop through each row of dataFrame in pyspark

E.g
sqlContext = SQLContext(sc)
sample=sqlContext.sql("select Name ,age ,city from user")
sample.show()
The above statement prints theentire table on terminal. But I want to access each row in that table using for or while to perform further calculations.
You simply cannot. DataFrames, same as other distributed data structures, are not iterable and can be accessed using only dedicated higher order function and / or SQL methods.
You can of course collect
for row in df.rdd.collect():
do_something(row)
or convert toLocalIterator
for row in df.rdd.toLocalIterator():
do_something(row)
and iterate locally as shown above, but it beats all purpose of using Spark.
To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.
def customFunction(row):
return (row.name, row.age, row.city)
sample2 = sample.rdd.map(customFunction)
or
sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city))
The custom function would then be applied to every row of the dataframe. Note that sample2 will be a RDD, not a dataframe.
Map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use the withColumn, with returns a dataframe.
sample3 = sample.withColumn('age2', sample.age + 2)
Using list comprehensions in python, you can collect an entire column of values into a list using just two lines:
df = sqlContext.sql("show tables in default")
tableList = [x["tableName"] for x in df.rdd.collect()]
In the above example, we return a list of tables in database 'default', but the same can be adapted by replacing the query used in sql().
Or more abbreviated:
tableList = [x["tableName"] for x in sqlContext.sql("show tables in default").rdd.collect()]
And for your example of three columns, we can create a list of dictionaries, and then iterate through them in a for loop.
sql_text = "select name, age, city from user"
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
for x in sqlContext.sql(sql_text).rdd.collect()]
for row in tupleList:
print("{} is a {} year old from {}".format(
row["name"],
row["age"],
row["city"]))
It might not be the best practice, but you can simply target a specific column using collect(), export it as a list of Rows, and loop through the list.
Assume this is your df:
+----------+----------+-------------------+-----------+-----------+------------------+
| Date| New_Date| New_Timestamp|date_sub_10|date_add_10|time_diff_from_now|
+----------+----------+-------------------+-----------+-----------+------------------+
|2020-09-23|2020-09-23|2020-09-23 00:00:00| 2020-09-13| 2020-10-03| 51148 |
|2020-09-24|2020-09-24|2020-09-24 00:00:00| 2020-09-14| 2020-10-04| -35252 |
|2020-01-25|2020-01-25|2020-01-25 00:00:00| 2020-01-15| 2020-02-04| 20963548 |
|2020-01-11|2020-01-11|2020-01-11 00:00:00| 2020-01-01| 2020-01-21| 22173148 |
+----------+----------+-------------------+-----------+-----------+------------------+
to loop through rows in Date column:
rows = df3.select('Date').collect()
final_list = []
for i in rows:
final_list.append(i[0])
print(final_list)
Give A Try Like this
result = spark.createDataFrame([('SpeciesId','int'), ('SpeciesName','string')],["col_name", "data_type"]);
for f in result.collect():
print (f.col_name)
If you want to do something to each row in a DataFrame object, use map. This will allow you to perform further calculations on each row. It's the equivalent of looping across the entire dataset from 0 to len(dataset)-1.
Note that this will return a PipelinedRDD, not a DataFrame.
above
tupleList = [{name:x["name"], age:x["age"], city:x["city"]}
should be
tupleList = [{'name':x["name"], 'age':x["age"], 'city':x["city"]}
for name, age, and city are not variables but simply keys of the dictionary.

Resources