PySpark: how to convert repeated Row elements to a dictionary list - python-3.x

I have the joined DataFrame below, and I want to convert each Row's values into two dictionaries.
df=
+--------+----------+-----+----+-----+-----------+-------+---+
|name |department|state|id |name | department| state | id|
+--------+----------+-----+----+-----+-----------+-------+---+
|James |Sales |NY |101 |James| Sales1 |null |101|
|Maria |Finance |CA |102 |Maria| Finance | |102|
When I collect it into Rows:
df.collect()
[Row(name=James,department=Sales,state=NY,id=101,name=James,department=Sales1,state=None,id=101),Row(name=Maria,department=Finance,state=CA,id=102,name=Maria,
department=Finance,state='',id=102)]
I need to create two dictionaries from each Row. That is, each Row repeats the keys 'name, department, state, id' but with different values, so I need two dictionaries per Row in order to compare their differences.
#Expected:
[({name:James,department:Sales,state:NY,id:101}, {name:James,department:Sales1,state:None,id:101}),
({name:Maria,department:Finance,state:CA,id:102}, {name:Maria,department:Finance,state:'',id:102})]
Is there any other solution to this?
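One way to get there (a minimal sketch, not from the original post: it assumes the four columns of the left table come first in the joined output and that the key names repeat in the same order) is to read the Row values positionally, since asDict() cannot tell duplicate field names apart:
keys = ['name', 'department', 'state', 'id']

pairs = []
for r in df.collect():
    values = list(r)                               # positional access sidesteps the duplicate field names
    left = dict(zip(keys, values[:len(keys)]))     # values from the left side of the join
    right = dict(zip(keys, values[len(keys):]))    # values from the right side of the join
    pairs.append((left, right))

print(pairs)
# each element is a (left_dict, right_dict) pair that can then be diffed, e.g.
# {k for k in keys if pairs[0][0][k] != pairs[0][1][k]}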

Related

Unable to create a new column from a list using spark concat method?

I have the DataFrame below, in which I am trying to create a new column by concatenating the columns named in a list.
df=
+------+-----------+------+---+-----+
| name | department| state| id| hash|
+------+-----------+------+---+-----+
| James| Sales1    | null |101| 4df2|
| Maria| Finance   |      |102| 5rfg|
| Jen  |           | NY2  |103| 234 |
+------+-----------+------+---+-----+
key_list=['name','state','id']
df.withColumn('prim_key', concat(*key_list))
df.show()
but the above returns the same result:
+------+-----------+------+---+-----+
| name | department| state| id| hash|
+------+-----------+------+---+-----+
| James| Sales1    | null |101| 4df2|
| Maria| Finance   |      |102| 5rfg|
| Jen  |           | NY2  |103| 234 |
+------+-----------+------+---+-----+
I suspected it might be due to spaces in the column names in the DataFrame, so I used trim to remove all spaces in the column names, but no luck; it returns the same result.
Any solution to this?
I found it... the issue was that I was not assigning the result back to a new or existing DataFrame:
df = df.withColumn('prim_key', concat(*key_list))
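For reference, a minimal self-contained version of the fix (a sketch, assuming a SparkSession and a df that already has the columns listed in key_list):
from pyspark.sql.functions import concat

key_list = ['name', 'state', 'id']

# withColumn returns a new DataFrame; it does not modify df in place,
# so the result has to be assigned back
df = df.withColumn('prim_key', concat(*key_list))
df.show()
Also note that concat returns null as soon as any of its inputs is null (as with the null state value above); concat_ws can be used instead if nulls should simply be skipped.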

How to create a combined data frame from each columns?

I am trying to concatenate the values of matching columns from two DataFrames into a single DataFrame.
For example:
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen | |NY2 |103|234
Since both have the same column names, I renamed the columns of df1:
new_col = [c + '_r' for c in df1.columns]
df1 = df1.toDF(*new_col)
joined_df = df1.join(df2, df1.id_r == df2.id, "inner")
+-------+------------+-------+----+------+-----+-----------+-----+---+----+
|name_r |department_r|state_r|id_r|hash_r|name | department|state| id|hash|
+-------+------------+-------+----+------+-----+-----------+-----+---+----+
|James  |Sales       |NY     |101 |c123  |James| Sales1    |null |101|4df2|
|Maria  |Finance     |CA     |102 |d234  |Maria| Finance   |     |102|5rfg|
|Jen    |Marketing   |NY     |103 |df34  |Jen  |           |NY2  |103|2f34|
+-------+------------+-------+----+------+-----+-----------+-----+---+----+
So now I am trying to concatenate the values of the matching columns and create a single DataFrame:
from pyspark.sql.functions import col, concat, lit
from pyspark.sql.types import StructType

combined_df = spark.createDataFrame([], StructType([]))
for col1 in df1.columns:
    for col2 in df2.columns:
        if col1[:-2] == col2:
            joindf = joined_df.select(concat(lit('['), col(col1), lit(','), col(col2), lit(']')).alias("arraycol" + col2))
            col_to_select = "arraycol" + col2
            filtered_df = joindf.select(col_to_select)
            renamed_df = filtered_df.withColumnRenamed(col_to_select, col2)
            renamed_df.show()
            if combined_df.count() == 0:
                combined_df = renamed_df
            else:
                combined_df = combined_df.rdd.zip(renamed_df.rdd).map(lambda x: x[0] + x[1])
new_combined_df = spark.createDataFrame(combined_df, df2.schema)
new_combined_df.show()
but it returns an error saying:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob: Can only zip RDDs with same number of elements in each partition
Inside the loop I can see that renamed_df.show() produces the expected column with values, e.g.:
renamed_df.show()
+-----------------+
|name             |
+-----------------+
|['James','James']|
|['Maria','Maria']|
|['Jen','Jen']    |
+-----------------+
but I am expecting to build a combined DataFrame like the one below:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales1']   |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['103','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
Any solution to this?
You actually want to use collect_list to do this: gather all the data into one DataFrame, then group it so that collect_list can be applied.
import pyspark.sql.functions as f
from pyspark.sql.functions import col

union_all = df1.unionByName(df2, allowMissingColumns=True)
myArray = []
for myCol in union_all.columns:
    myArray += [f.collect_list(myCol)]

(union_all
    .withColumn("temp_name", col("id"))   # extra column to use for grouping
    .groupBy("temp_name")
    .agg(*myArray)
    .drop("temp_name"))                   # clean up the extra column used for grouping
If you only want unique values you can use collect_set instead.
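For reference, here is a self-contained sketch of this approach on the sample data from the question (assuming a SparkSession named spark; it groups directly on id instead of a temporary column and aliases the aggregates for readability):
import pyspark.sql.functions as f

df1 = spark.createDataFrame(
    [("James", "Sales", "NY", "101", "c123"),
     ("Maria", "Finance", "CA", "102", "d234"),
     ("Jen", "Marketing", "NY", "103", "df34")],
    ["name", "department", "state", "id", "hash"])

df2 = spark.createDataFrame(
    [("James", "Sales1", None, "101", "4df2"),
     ("Maria", "Finance", "", "102", "5rfg"),
     ("Jen", "", "NY2", "103", "2f34")],
    ["name", "department", "state", "id", "hash"])

union_all = df1.unionByName(df2, allowMissingColumns=True)

# one collect_list per column, keeping id as the grouping key
aggs = [f.collect_list(c).alias(c) for c in union_all.columns if c != "id"]
union_all.groupBy("id").agg(*aggs).show(truncate=False)
Bear in mind that collect_list ignores nulls (so the null state from df2 will not show up in the array) and does not guarantee the order of the collected values after a shuffle.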

How to extract all column and value from pyspark row?

I am trying to extract every column and its value from the DataFrame row below.
+--------+----------+-----+----+-----+-----------+-------+---+
|name |department|state|id |name | department| state | id|
+--------+----------+-----+----+-----+-----------+-------+---+
|James |Sales |NY |101 |James| Sales1 |null |101|
row=
[Row(name=James,department=Sales,state=NY,id=101,name=James,department=Sales1,state=None,id=101)]
tem_dict = {}
for index, value in enumerate(row):
    if row[index] in tem_dict:
        tem_dict[row[index]] = [row[index], value]
    else:
        tem_dict[row[index]] = value
But this is not giving the expected result. Since the Row has repeated elements, I want to combine the values of matching columns and print them in an array, as seen below:
#expected
[{name:[james,james],departement:[Sales,Sales1],state:[NY,none],id:[101,101]}]
Or is there any way to do this with an RDD?
Any solution to this?
Let's say:
>>> row
[Row(name=James,department=Sales,state=NY,id=101,name=James,department=Sales1,state=None,id=101), Row(...), Row(...), etc]
In that case to get what you expect I'd do:
tem_dict = {}
for inner_row in row:
    for key, val in inner_row.asDict().items():
        if tem_dict.get(key, None) is None:
            tem_dict[key] = [val]
        else:
            tem_dict[key].append(val)
print(tem_dict)
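The same accumulation can also be written a bit more compactly with collections.defaultdict (a sketch, assuming row is the list of Row objects shown above):
from collections import defaultdict

tem_dict = defaultdict(list)
for inner_row in row:
    for key, val in inner_row.asDict().items():
        tem_dict[key].append(val)    # every value ends up in a list, including the first one
print(dict(tem_dict))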

How to split rows in a dataframe to multiple rows based on delimiter

Input
Column1    Column2      Column3
(1,2)      (xyz,abc)    (123,456)
Output should be
Column1    Column2    Column3
1          xyz        123
2          abc        456
I need to split the data in the DataFrame so that the 1st element of every column becomes one row, the 2nd element of every column becomes the next row, and so on.
If you are using a recent version of Spark, arrays_zip will help you do what you want:
// define test dataset
val df = spark.createDataset(List(("(1,2)","(xyz,abc)","(123,456)")))
.toDF("Column1","Column2","Column3")
df.show
+-------+---------+---------+
|Column1| Column2| Column3|
+-------+---------+---------+
| (1,2)|(xyz,abc)|(123,456)|
+-------+---------+---------+
With this dataset, you can split all delimited text values into arrays:
val reshape_cols = df.columns
.map(c => split(regexp_replace(col(c),"[()]",""),",").as(c))
val reshaped_df = df.select(reshape_cols:_*)
reshaped_df.show
+-------+----------+----------+
|Column1| Column2| Column3|
+-------+----------+----------+
| [1, 2]|[xyz, abc]|[123, 456]|
+-------+----------+----------+
Now that you have arrays, you can use arrays_zip to generate a single column of type array of struct
val zipped_df = reshaped_df
.select(arrays_zip(reshaped_df.columns.map(col):_*).as("value"))
zipped_df.show(false)
+------------------------------+
|value |
+------------------------------+
|[[1, xyz, 123], [2, abc, 456]]|
+------------------------------+
Now that you have an array of struct, you can use explode to transform your single row into multiple rows:
val final_df = zipped_df
.select(explode('value).as("s"))
.select(df.columns.map(c => 's(c).as(c)):_*)
final_df.show
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
| 1| xyz| 123|
| 2| abc| 456|
+-------+-------+-------+
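Since the question is tagged python-3.x, here is a rough PySpark equivalent of the same approach (a sketch, assuming a SparkSession named spark and Spark 2.4+ for arrays_zip):
from pyspark.sql.functions import arrays_zip, col, explode, regexp_replace, split

df = spark.createDataFrame([("(1,2)", "(xyz,abc)", "(123,456)")],
                           ["Column1", "Column2", "Column3"])

# strip the parentheses and split each delimited string into an array
reshaped_df = df.select(*[split(regexp_replace(col(c), "[()]", ""), ",").alias(c)
                          for c in df.columns])

# zip the arrays element-wise into an array of structs, then explode into one row per position
final_df = (reshaped_df
            .select(explode(arrays_zip(*[col(c) for c in reshaped_df.columns])).alias("s"))
            .select(*[col("s")[c].alias(c) for c in df.columns]))
final_df.show()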

PySpark: How to check if a column contains a number using isnan [duplicate]

This question already has answers here:
Count number of non-NaN entries in each column of Spark dataframe in PySpark
I have a dataframe which looks like this:
+------------------------+----------+
|Postal code |PostalCode|
+------------------------+----------+
|Muxía |null |
|Fuensanta |null |
|Salobre |null |
|Bolulla |null |
|33004 |null |
|Santa Eulàlia de Ronçana|null |
|Cabañes de Esgueva |null |
|Vallarta de Bureba |null |
|Villaverde del Monte |null |
|Villaluenga del Rosario |null |
+------------------------+----------+
If the Postal code value contains only numbers, I want to create a new column where only numerical postal codes are stored. If the Postal code value contains only text, I want to create a new column called 'Municipality'.
I tried to use isnan, as my understanding was that it checks whether a value is not a number, but this does not seem to work. Does the column type need to be string for this to work?
So far my attempt is:
df2 = df.withColumn('PostalCode', when(isnan(df['Postal code']), df['Postal code']))
Looking at the DataFrame example posted above, you can see that null is returned for every value of the new column, even for postal code '33004'.
Any ideas will be much appreciated
isnan only returns true if the column contains a mathematically invalid floating-point number, for example the result of 0/0. In any other case, including strings, it returns false. If you want to check whether a column contains a numerical value, you can define your own udf, for example as shown below:
from pyspark.sql.functions import when, udf
from pyspark.sql.types import BooleanType

df = spark.createDataFrame([('33004', ''), ('Muxia', None), ('Fuensanta', None)],
                           ("Postal code", "PostalCode"))

def is_digit(value):
    if value:
        return value.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())

df = df.withColumn('PostalCode', when(is_digit_udf(df['Postal code']), df['Postal code']))
df = df.withColumn('Municipality', when(~is_digit_udf(df['Postal code']), df['Postal code']))
df.show()
This gives as output:
+-----------+----------+------------+
|Postal code|PostalCode|Municipality|
+-----------+----------+------------+
| 33004| 33004| null|
| Muxia| null| Muxia|
| Fuensanta| null| Fuensanta|
+-----------+----------+------------+
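As a side note, a UDF is not strictly required here: a regular-expression match via Column.rlike does the same check natively (a sketch, reusing the df defined in the answer above):
from pyspark.sql.functions import when

is_number = df['Postal code'].rlike('^[0-9]+$')
df = df.withColumn('PostalCode', when(is_number, df['Postal code']))
df = df.withColumn('Municipality', when(~is_number, df['Postal code']))
df.show()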
