Pass the column names from a list - apache-spark

I have a list of column names that varies every time. I need to pass the column names from the list (in the example below, id and programid) to the when clause and check whether both columns hold null values. Please help me with the solution.
Pyspark Code:
ColumnList = ['id', 'programid']
joinSrcTgt.withColumn(
    'action',
    when(joinSrcTgt.id.isNull() & joinSrcTgt.programid.isNull(), 'insert')
)

You can use a list comprehension to check if each column is null:
[col(c).isNull() for c in ColumnList]
Then you can use functools.reduce to bitwise-and (&) these together:
from functools import reduce
from pyspark.sql.functions import col, when
ColumnList = ['id', 'programid']
joinSrcTgt.withColumn(
    'action',
    when(
        reduce(lambda a, b: a & b, [col(c).isNull() for c in ColumnList]),
        'insert'
    )
)
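If rows where the condition is false should get a label as well, the same reduced condition can be paired with .otherwise; a small sketch, assuming a hypothetical 'update' fallback label:

from functools import reduce
from pyspark.sql.functions import col, when

# combine the per-column null checks into one condition with a logical AND
all_null = reduce(lambda a, b: a & b, [col(c).isNull() for c in ColumnList])

# 'update' is only an example fallback label, not from the original question
joinSrcTgt = joinSrcTgt.withColumn('action', when(all_null, 'insert').otherwise('update'))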

Related

split row value by separator and create new columns

I have a PySpark dataset with a column “channels” that looks like this:
channels
name1,name2,name3,name4
happy1,happy2
entity1,entity2,entity3,entity4,entity5
I want to create 5 new columns, i.e. “channel1, channel2, channel3, channel4, channel5”.
Then, I want to split the contents of the “channels” column using the comma separator. After splitting the values of each row, I want to put each separated value in a different column.
For example, for the first row the columns should look like this:
channel1 channel2 channel3 channel4 channel5
name1 name2 name3 name4 ~
When an element is not found, I want to use ~ as the column value. For example, the first row has only 4 values instead of 5, so the channel5 column gets ~.
I only want to use ~, not None or NULL.
How can I achieve this result in pyspark?
I tried this:
df = df.withColumn("channels_split", split(df["channels"], ","))
df = df.withColumn("channel1", coalesce(df["channels_split"][0], "~"))
df = df.withColumn("channel2", coalesce(df["channels_split"][1], "~"))
df = df.withColumn("channel3", coalesce(df["channels_split"][2], "~"))
df = df.withColumn("channel4", coalesce(df["channels_split"][3], "~"))
df = df.withColumn("channel5", coalesce(df["channels_split"][4], "~"))
df = df.drop("channels_split")
but it gives me an error that:
`~` is missing
You're referencing the column `~`, but it is missing from the schema. Please check your code.
Note that I am using pyspark within Foundry
coalesce expects columns as arguments and you are providing a string; I think you should use lit("~").
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.coalesce.html
This is more of a spark problem than a Foundry problem though.
What #M_S said is correct: as the error message states, you need a column, so use the lit function.
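For example, a minimal fix for the question's snippet could look like this (a sketch reusing the same column names; only the plain string default is swapped for a column literal):

from pyspark.sql.functions import coalesce, lit, split

df = df.withColumn("channels_split", split(df["channels"], ","))
# lit("~") turns the default into a column, which is what coalesce expects
df = df.withColumn("channel1", coalesce(df["channels_split"][0], lit("~")))
# ...and likewise for channel2 through channel5, before dropping "channels_split"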
But be careful: if spark.sql.ansi.enabled is set to True, this code will throw an ArrayIndexOutOfBoundsException when there are fewer than 5 items in the array.
Another way to do this is to make sure the array has at least 5 elements on each row by appending some ~ values and then taking the first 5 items, or by adding a temporary column with the length of the array and using a when condition.
By the way, you don't need to repeat df = every time
from pyspark.sql.functions import array_repeat, col, concat, lit, split

df = (
    df
    .withColumn("channels_split", split(df["channels"], ","))
    .withColumn("default_values", array_repeat(lit("~"), 5))
    .withColumn("channels_split", concat(col("channels_split"), col("default_values")))
    .withColumn("channel1", col("channels_split")[0])
    .withColumn("channel2", col("channels_split")[1])
    .withColumn("channel3", col("channels_split")[2])
    .withColumn("channel4", col("channels_split")[3])
    .withColumn("channel5", col("channels_split")[4])
    .drop("channels_split", "default_values")
)
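For completeness, a sketch of the when-plus-array-length variant mentioned above (hypothetical, assuming the same 5 target columns):

from pyspark.sql.functions import col, lit, size, split, when

df = df.withColumn("channels_split", split(col("channels"), ","))
for i in range(5):
    df = df.withColumn(
        f"channel{i + 1}",
        # only index the array when it is long enough, otherwise fall back to ~
        when(size(col("channels_split")) > i, col("channels_split")[i]).otherwise(lit("~")),
    )
df = df.drop("channels_split")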

Get only rows of dataframe where a subset of columns exist in another dataframe

I want to get all rows of a dataframe (df2) where the city column value and postcode column value also exist in another dataframe (df1).
The important point here is that I want the combination of both columns, not each column individually.
My approach was this:
# 1. Get all combinations
df_combinations = np.array(df1.select("Ort", "Postleitzahl").dropDuplicates().collect())
sc.broadcast(df_combinations)

# 2. Define udf
def combination_in_vx(ort, plz):
    for arr_el in df_combinations:
        if str(arr_el[0]) == ort and int(arr_el[1]) == plz:
            return True
    return False

combination_in_vx = udf(combination_in_vx, BooleanType())

# 3. Filter on the flag column
df_tmp = df_2.withColumn("Combination_Exists", combination_in_vx('city', 'postcode'))
df_result = df_tmp.filter(df_tmp.Combination_Exists)
Although this should theoretically work, it takes forever!
Does anybody know of a better solution here? Thank you very much!
You can do a left semi join using the two columns. This keeps the rows of df2 whose combination of values in the two specified columns also appears in df1:
import pyspark.sql.functions as F
df_result = df2.join(df1, ["Ort", "Postleitzahl"], 'left_semi')
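If the column names differ between the two frames (the question mixes city/postcode and Ort/Postleitzahl), the same semi join can be written with an explicit condition; a sketch, assuming df2 uses the English names:

# keep only df2 rows whose (city, postcode) pair appears as (Ort, Postleitzahl) in df1
df_result = df2.join(
    df1,
    (df2["city"] == df1["Ort"]) & (df2["postcode"] == df1["Postleitzahl"]),
    "left_semi",
)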

How to select an aggregated column from a dataframe

df = df.groupby(F.upper(F.col('count'))).agg({'totalvis.age':'avg'}).show() creates a column named avg(totalvis.age AS age).
I want to use another aggregate function to select the max of this newly created column, but the column name cannot be resolved.
You can use the syntax below to assign an alias to the aggregated column:
import pyspark.sql.functions as F
df2 = df.groupby(F.upper(F.col('county'))).agg(F.avg('totalvisitor.age').alias('age_avg'))
Then you can select the maximum as df2.select(F.max('age_avg')).
P.S. Note that the code in the question overwrites df with None, because it assigns the result of df.(...).show() back to df and df.show() returns None.
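Putting both steps together (a sketch reusing the column names assumed in the answer above):

import pyspark.sql.functions as F

max_age_avg = (
    df.groupby(F.upper(F.col('county')))
      .agg(F.avg('totalvisitor.age').alias('age_avg'))   # average per group
      .agg(F.max('age_avg').alias('max_age_avg'))        # maximum of the group averages
)
max_age_avg.show()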

How do I filter a dataframe column using regex?

Here is my regex
date_regex = r'\d{1,2}\/\d{1,2}\/\d{4}$'
Here is my dates_as_first_row Dataframe
I am trying to filter for the date column but get an empty (377, 0) Dataframe.
date_column = dates_as_first_row.filter(regex=date_regex, axis='columns')
You can do this using .str.match.
If your column is named '0', it looks like this:
indexer = df['0'].str.match(r'\d{1,2}\/\d{1,2}\/\d{4}$')
df[indexer]
If you want to select all rows which contain the pattern in any of the string columns, you can do:
# select_dtypes('O') takes all object (string) columns;
# apply runs str.match on each selected column, giving True where the row matches the pattern;
# any(axis='columns') marks a row True if any of those boolean columns is True
indexer = df.select_dtypes('O').apply(
    lambda ser: ser.str.match(r'\d{1,2}\/\d{1,2}\/\d{4}$'), axis='index'
).any(axis='columns')
df[indexer]
But note that this only works if all your object columns actually store strings, which is usually the case if you let pandas infer the column types when the dataframe is created. If that is not the case, you need to add a type check to avoid a runtime error:
import re

def filter_dates(ser):
    date_re = re.compile(r'\d{1,2}\/\d{1,2}\/\d{4}$')
    return ser.map(lambda val: type(val) == str and bool(date_re.match(val)))

indexer = df.select_dtypes('O').apply(filter_dates, axis='index').any(axis='columns')
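The rows can then be selected with the resulting boolean index, just as before:

df[indexer]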

How to insert values from numpy into sql database with given columns?

I need to insert some columns into a table in my MariaDB. The table name is Customer and it has 6 columns: A, B, C, D, E, F. The primary keys are in the first column, column B has an address, C, D, and E contain None values, and F holds the zip code.
I have a pandas dataframe that follows a similar format. I converted it to a numpy array as follows:
data = df.iloc[:,1:4].values
Hence data is a numpy array containing 3 columns, and I need it inserted into C, D and E. I tried:
query = """
Insert Into Customer (C,D,E) VALUES (?,?,?)
"""
cur.executemany(query,data)
cur.commit()
But I get an error:
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I solved it, although it is very slow...
query = """
Alter Customer SET
C = %s
D = %s
E = %s
where A = %s
"""
for row in data:
cur.execute(query,args=(row[1],row[2],row[3],row[0])
con.commit()
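The original executemany error usually comes from passing numpy rows straight to the driver, which then evaluates an array in a boolean context; converting each row to a plain Python tuple typically avoids it and is much faster than updating row by row. A sketch, assuming the same cursor, connection, and data array (the placeholder style, ? or %s, depends on the driver):

# tolist() converts numpy scalars to native Python types the driver understands
params = [tuple(row) for row in data.tolist()]
query = "INSERT INTO Customer (C, D, E) VALUES (?, ?, ?)"
cur.executemany(query, params)
con.commit()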
