How to select an aggregated column from a dataframe - apache-spark

df = df.groupby(F.upper(F.col('count'))).agg({'totalvis.age':'avg'}).show() creates an avg(totalvis.age AS age) column.
I want to use another aggregate function to select the max of the newly created column, but there is an issue with the column name not being resolvable.

You can use the syntax below to assign an alias to the aggregated column:
import pyspark.sql.functions as F
df2 = df.groupby(F.upper(F.col('county'))).agg(F.avg('totalvisitor.age').alias('age_avg'))
Then you can select the maximum as df2.select(F.max('age_avg')).
P.S. Note that in the code you provided in the question, you overwrite df with None by calling
df = df.(...).show()
because df.show() returns None.
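Putting the two pieces together, a minimal sketch (reusing the column names from the answer above and calling show() only at the very end, so the DataFrame is not lost):
import pyspark.sql.functions as F

# average age per upper-cased county, with an explicit alias for the aggregate
df2 = df.groupby(F.upper(F.col('county')).alias('county_upper')) \
        .agg(F.avg('totalvisitor.age').alias('age_avg'))

# maximum of the newly created column; show() only here, so df2 stays a DataFrame
df2.select(F.max('age_avg').alias('max_age_avg')).show()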

Related

split row value by separator and create new columns

I have a PySpark dataset with a column “channels” that looks like this:
channels
name1,name2,name3,name4
happy1,happy2
entity1,entity2,entity3,entity4,entity5
I want to create 5 new columns, i.e. channel1, channel2, channel3, channel4 and channel5.
Then, I want to split the contents of the “channels” column on the comma separator and put each separated value from a row into a different column.
For example for the first row, the columns should look like this:
channel1 channel2 channel3 channel4 channel5
name1 name2 name3 name4 ~
When an element is not found, I want to use ~ as the column value. For example, the first row has only 4 values instead of 5, so the channel5 column gets ~.
I only want to use ~, not None or NULL.
How can I achieve this result in pyspark?
I tried this:
df = df.withColumn("channels_split", split(df["channels"], ","))
df = df.withColumn("channel1", coalesce(df["channels_split"][0], "~"))
df = df.withColumn("channel2", coalesce(df["channels_split"][1], "~"))
df = df.withColumn("channel3", coalesce(df["channels_split"][2], "~"))
df = df.withColumn("channel4", coalesce(df["channels_split"][3], "~"))
df = df.withColumn("channel5", coalesce(df["channels_split"][4], "~"))
df = df.drop("channels_split")
but it gives me an error that:
`~` is missing
You're referencing the column `~`, but it is missing from the schema. Please check your code.
Note that I am using pyspark within Foundry
coalesce expects columns as arguments and you are providing a string; I think you should use lit("~"):
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.coalesce.html
This is more of a Spark problem than a Foundry problem, though.
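For reference, a minimal sketch of the original attempt with that fix applied (wrapping ~ in lit so that coalesce receives a column):
from pyspark.sql.functions import split, coalesce, lit

df = df.withColumn("channels_split", split(df["channels"], ","))
# lit("~") turns the literal into a Column, which is what coalesce expects
df = df.withColumn("channel1", coalesce(df["channels_split"][0], lit("~")))
df = df.withColumn("channel2", coalesce(df["channels_split"][1], lit("~")))
df = df.withColumn("channel3", coalesce(df["channels_split"][2], lit("~")))
df = df.withColumn("channel4", coalesce(df["channels_split"][3], lit("~")))
df = df.withColumn("channel5", coalesce(df["channels_split"][4], lit("~")))
df = df.drop("channels_split")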
What #M_S said is correct: as the error message states, you need a column, so you should use the lit function.
But be careful: if spark.sql.ansi.enabled is set to true, this code will throw an ArrayIndexOutOfBoundsException when there are fewer than 5 items in your array.
Another way to do this is to make sure the array has at least 5 elements on each row by appending some ~ values and then taking the first 5 items, or by adding a temporary column with the length of the array and using a when condition (a sketch of that variant follows the code below).
By the way, you don't need to repeat df = on every line:
from pyspark.sql.functions import split, array_repeat, lit, concat, col

df = (
    df
    .withColumn("channels_split", split(col("channels"), ","))
    # pad with five "~" values so every row has at least 5 elements
    .withColumn("default_values", array_repeat(lit("~"), 5))
    .withColumn("channels_split", concat(col("channels_split"), col("default_values")))
    .withColumn("channel1", col("channels_split")[0])
    .withColumn("channel2", col("channels_split")[1])
    .withColumn("channel3", col("channels_split")[2])
    .withColumn("channel4", col("channels_split")[3])
    .withColumn("channel5", col("channels_split")[4])
    .drop("channels_split", "default_values")
)
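And a minimal sketch of the when variant mentioned above (it checks the array length inline with size rather than storing it in a temporary column):
from pyspark.sql.functions import split, col, when, size, lit

df = df.withColumn("channels_split", split(col("channels"), ","))
for i in range(5):
    # use the element when it exists, otherwise fall back to "~"
    df = df.withColumn(
        f"channel{i + 1}",
        when(size(col("channels_split")) > i, col("channels_split")[i]).otherwise(lit("~")),
    )
df = df.drop("channels_split")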

Get only rows of dataframe where a subset of columns exist in another dataframe

I want to get all rows of a dataframe (df2) where the city column value and postcode column value also exist in another dataframe (df1).
What matters here is that I want to match the combination of both columns, not each column individually.
My approach was this:
# 1. Get all combinations
df_combinations = np.array(df1.select("Ort", "Postleitzahl").dropDuplicates().collect())
sc.broadcast(df_combinations)

# 2. Define udf
def combination_in_vx(ort, plz):
    for arr_el in df_combinations:
        if str(arr_el[0]) == ort and int(arr_el[1]) == plz:
            return True
    return False

combination_in_vx = udf(combination_in_vx, BooleanType())

# 3. Flag matching rows and filter on the flag
df_tmp = df_2.withColumn("Combination_Exists", combination_in_vx('city', 'postcode'))
df_result = df_tmp.filter(df_tmp.Combination_Exists)
Although this should theoretically work, it takes forever!
Does anybody know about a better solution here? Thank you very much!
You can do a left semi join using the two columns. This will include the rows in df2 where the values in both of the two specified columns exist in df1:
import pyspark.sql.functions as F
df_result = df2.join(df1, ["Ort", "Postleitzahl"], 'left_semi')
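A small self-contained demo of the semi join (hypothetical data; it assumes spark is an existing SparkSession and that df2 uses the same column names as df1):
df1 = spark.createDataFrame([("Berlin", 10115), ("Hamburg", 20095)], ["Ort", "Postleitzahl"])
df2 = spark.createDataFrame(
    [("Berlin", 10115, "a"), ("Berlin", 99999, "b"), ("Hamburg", 20095, "c")],
    ["Ort", "Postleitzahl", "payload"],
)

# keeps only the rows of df2 whose (Ort, Postleitzahl) pair appears in df1
df2.join(df1, ["Ort", "Postleitzahl"], "left_semi").show()
Unlike the UDF approach, the semi join stays entirely inside the JVM and lets Spark pick a broadcast or shuffle strategy, which is why it is so much faster.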

Get n rows based on column filter in a Dataframe pandas

I have a dataframe df as below.
I want the final dataframe to be as follows, i.e. for each unique Name only the last 2 rows must be present in the final output.
I tried the following snippet but it's not working.
df = df[df['Name']].tail(2)
Use GroupBy.tail:
df1 = df.groupby('Name').tail(2)
Just one more way to solve this using GroupBy.nth:
df1 = df.groupby('Name').nth([-1,-2]) ## this will pick the last 2 rows
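A small self-contained demo (hypothetical data, since the original dataframe is not shown here):
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B"],
    "Value": [1, 2, 3, 4, 5],
})

# last 2 rows per Name, keeping the original row order
print(df.groupby("Name").tail(2))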

How to drop rows with nulls in one column pyspark

I have a dataframe and I would like to drop all rows with NULL value in one of the columns (string). I can easily get the count of that:
df.filter(df.col_X.isNull()).count()
I have tried dropping it using the following command. It executes, but the count still comes back positive:
df.filter(df.col_X.isNull()).drop()
I tried different variations, but they return an 'object is not callable' error.
Use either drop with subset:
df.na.drop(subset=["col_X"])
or isNotNull()
df.filter(df.col_X.isNotNull())
DataFrames are immutable, so applying a filter that removes the null values creates a new dataframe that doesn't contain those records.
df = df.filter(df.col_X.isNotNull())
if you want to drop any row in which any value is null, use
df.na.drop()  # same as df.na.drop("any"); the default is "any"
to drop only the rows where all values are null, use
df.na.drop("all")
to drop based on a list of columns, use
df.na.drop("all", subset=["col1", "col2", "col3"])
another variation is:
from pyspark.sql.functions import col
df = df.where(col("columnName").isNotNull())
you can also add an empty-string condition when needed:
df = df.filter((df.col_X.isNotNull()) & (df.col_X != ""))
You can use the expr() function, which accepts SQL-like syntax:
from pyspark.sql.functions import expr
filteredDF = rawDF.filter(expr("col_X is not null")).filter("col_Y is not null")

How to modify a column value in a row of a spark dataframe?

I am working with a data frame with the following structure.
I need to modify each record so that if a column is listed in post_event_list, that column is populated with the corresponding post_ column value. So in the above example, for both records I need to populate col4 and col5 with the post_col4 and post_col5 values. Can someone please help me do this in pyspark?
Maybe this is what you want in pyspark 2. Suppose df is the DataFrame:
import pyspark.sql.types

row = df.rdd.first()        # take one Row as an example
d = row.asDict()            # convert it to a mutable dict
d['col4'] = d['post_col4']  # overwrite the value
new_row = pyspark.sql.types.Row(**d)
Now we have a new Row object; putting this logic in a map over df.rdd lets you change every row of the DataFrame.
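A minimal sketch of that map approach (hypothetical; it assumes post_event_list is a string column and only handles col4, but the same pattern extends to col5):
from pyspark.sql import Row

def fix_row(row):
    d = row.asDict()
    # copy the post_ value when the column is named in post_event_list
    if 'col4' in (d.get('post_event_list') or ''):
        d['col4'] = d['post_col4']
    return Row(**d)

df_fixed = df.rdd.map(fix_row).toDF(df.schema)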
You can use when/otherwise from pyspark.sql.functions. Something like:
import pyspark.sql.functions as sf
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

contains_col4_udf = udf(lambda x: 'col4' in x, BooleanType())
df.select(sf.when(contains_col4_udf('post_event_list'), sf.col('post_col4')).otherwise(sf.col('col4')).alias('col4'))
Here is the doc: https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.otherwise
