How to extract all columns and values from a PySpark Row? - python-3.x

I am trying to extract all columns and their respective values from the data frame row below.
+-----+----------+-----+---+-----+-----------+-----+---+
|name |department|state|id |name | department|state| id|
+-----+----------+-----+---+-----+-----------+-----+---+
|James|Sales     |NY   |101|James| Sales1    |null |101|
+-----+----------+-----+---+-----+-----------+-----+---+
row=
[Row(name='James', department='Sales', state='NY', id=101, name='James', department='Sales1', state=None, id=101)]
tem_dict = {}
for index, value in enumerate(row):
    if row[index] in tem_dict:
        tem_dict[row[index]] = [row[index], value]
    else:
        tem_dict[row[index]] = value
But this is not giving the expected result. Since the row has repeated column names, I want to combine the values of matching columns and print them as arrays, as seen below.
#expected
[{name: [James, James], department: [Sales, Sales1], state: [NY, None], id: [101, 101]}]
Or is there any way to do this with an RDD?
Any solution to this?

Let's say:
>>> row
[Row(name='James', department='Sales', state='NY', id=101, name='James', department='Sales1', state=None, id=101), Row(...), Row(...), etc.]
In that case to get what you expect I'd do:
tem_dict = {}
for inner_row in row:
    # asDict() turns each Row into a plain {column: value} dict
    for key, val in inner_row.asDict().items():
        if tem_dict.get(key, None) is None:
            tem_dict[key] = [val]
        else:
            tem_dict[key].append(val)
print(tem_dict)
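As a side note, the same accumulation can be written a bit more compactly with collections.defaultdict; this is just a sketch equivalent to the loop above:
from collections import defaultdict

tem_dict = defaultdict(list)
for inner_row in row:
    for key, val in inner_row.asDict().items():
        # every value gets appended to the list for its column name
        tem_dict[key].append(val)
print(dict(tem_dict))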

Related

How to create a combined data frame from each column?

I am trying to concatenate matching column values from two data frames into a single data frame.
For eg:
df1=
name | department| state | id|hash
-----+-----------+-------+---+---
James|Sales |NY |101| c123
Maria|Finance |CA |102| d234
Jen |Marketing |NY |103| df34
df2=
name | department| state | id|hash
-----+-----------+-------+---+----
James| Sales1 |null |101|4df2
Maria| Finance | |102|5rfg
Jen  |           |NY2    |103|2f34
Since both have the same column names, I renamed the columns of df1:
new_col = [c + '_r' for c in df1.columns]
df1 = df1.toDF(*new_col)
joined_df = df1.join(df2, df1.id_r == df2.id, "inner")
+-------+------------+-------+----+------+-----+-----------+-----+---+----+
|name_r |department_r|state_r|id_r|hash_r|name | department|state| id|hash|
+-------+------------+-------+----+------+-----+-----------+-----+---+----+
|James  |Sales       |NY     |101 |c123  |James| Sales1    |null |101|4df2|
|Maria  |Finance     |CA     |102 |d234  |Maria| Finance   |     |102|5rfg|
|Jen    |Marketing   |NY     |103 |df34  |Jen  |           |NY2  |103|2f34|
+-------+------------+-------+----+------+-----+-----------+-----+---+----+
So now I am trying to concatenate the values of matching columns and create a single data frame:
combined_df = spark.createDataFrame([], StructType([]))
for col1 in df1.columns:
    for col2 in df2.columns:
        if col1[:-2] == col2:
            joindf = joined_df.select(concat(lit('['), col(col1), lit(','), col(col2), lit(']')).alias("arraycol" + col2))
            col_to_select = "arraycol" + col2
            filtered_df = joindf.select(col_to_select)
            renamed_df = filtered_df.withColumnRenamed(col_to_select, col2)
            renamed_df.show()
            if combined_df.count() < 0:
                combined_df = renamed_df
            else:
                combined_df = combined_df.rdd.zip(renamed_df.rdd).map(lambda x: x[0] + x[1])
new_combined_df = spark.createDataFrame(combined_df, df2.schema)
new_combined_df.show()
but it returns an error that says:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. Can only zip RDDs with same number of elements in each partition
Inside the loop I can see that renamed_df.show() produces the expected column with values, e.g.:
renamed_df.show()
+-----------------+
|name             |
+-----------------+
|['James','James']|
|['Maria','Maria']|
|['Jen','Jen']    |
+-----------------+
but I am expecting to create a combined df as seen below:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales1']   |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['103','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
Any solution to this?
You actually want to use collect_list to do this. Gather all the data into one data frame, then group it so that collect_list can be applied:
from pyspark.sql import functions as f
from pyspark.sql.functions import col

union_all = df1.unionByName(df2, allowMissingColumns=True)
myArray = []
for myCol in union_all.columns:
    myArray += [f.collect_list(myCol)]

# "temp_name" is an extra column used only for grouping; it is dropped afterwards.
union_all.withColumn("temp_name", col("id")) \
    .groupBy("temp_name") \
    .agg(*myArray) \
    .drop("temp_name")
If you only want unique values you can use collect_set instead.
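For reference, here is a fuller sketch under the assumption that df1 and df2 still share the original column names (i.e. before any _r renaming) and that Spark 3.1+ is available for allowMissingColumns; the aliases keep the original column names instead of Spark's default collect_list(...) names:
from pyspark.sql import functions as f

# Stack the two frames on top of each other (assumed names: df1, df2)
union_all = df1.unionByName(df2, allowMissingColumns=True)

# Collect every non-key column into a list per id, keeping the original column names
agg_exprs = [f.collect_list(c).alias(c) for c in union_all.columns if c != "id"]

combined_df = union_all.groupBy("id").agg(*agg_exprs)
combined_df.show(truncate=False)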

PySpark: how to convert repeated row elements to a dictionary list

I have the joined data frame below and I want to convert each Row's values into two dictionaries:
df=
+-----+----------+-----+---+-----+-----------+-----+---+
|name |department|state|id |name | department|state| id|
+-----+----------+-----+---+-----+-----------+-----+---+
|James|Sales     |NY   |101|James| Sales1    |null |101|
|Maria|Finance   |CA   |102|Maria| Finance   |     |102|
+-----+----------+-----+---+-----+-----------+-----+---+
When I convert it to rows:
df.collect()
[Row(name='James', department='Sales', state='NY', id=101, name='James', department='Sales1', state=None, id=101),
 Row(name='Maria', department='Finance', state='CA', id=102, name='Maria', department='Finance', state='', id=102)]
I need to create two dictionaries from each row. That is, each Row has repeated keys ('name', 'department', 'state', 'id') but different values, so I need two dictionaries per Row in order to compare their differences.
#Expected:
[({name: James, department: Sales, state: NY, id: 101}, {name: James, department: Sales1, state: None, id: 101}),
 ({name: Maria, department: Finance, state: CA, id: 102}, {name: Maria, department: Finance, state: '', id: 102})]
Is there any other solution to this?
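A minimal sketch of one way to do this, assuming the first four fields of each collected Row come from the left-hand data frame and the remaining four from the right-hand one:
def split_row(row, n=4):
    # Collected Rows keep their duplicate field names in order,
    # so they can be split positionally into a left and a right half.
    keys = row.__fields__      # column names, duplicates preserved
    values = list(row)         # values in the same order
    left = dict(zip(keys[:n], values[:n]))
    right = dict(zip(keys[n:], values[n:]))
    return left, right

pairs = [split_row(r) for r in df.collect()]
# pairs is a list of (left_dict, right_dict) tuples, one per row,
# which can then be compared key by key.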

How to split rows in a dataframe to multiple rows based on delimiter

Input
Column1  column2    column3
(1,2)    (xyz,abc)  (123,456)
Output should be
Column1  column2  column3
1        xyz      123
2        abc      456
I need to split the data in the data frame so that the 1st element of every column forms one row, the 2nd element forms the next row, and so on.
If you are using a recent version of Spark, arrays_zip will help you do what you want:
// define test dataset
val df = spark.createDataset(List(("(1,2)","(xyz,abc)","(123,456)")))
.toDF("Column1","Column2","Column3")
df.show
+-------+---------+---------+
|Column1| Column2| Column3|
+-------+---------+---------+
| (1,2)|(xyz,abc)|(123,456)|
+-------+---------+---------+
With this dataset, you can split all delimited text values into arrays:
val reshape_cols = df.columns
.map(c => split(regexp_replace(col(c),"[()]",""),",").as(c))
val reshaped_df = df.select(reshape_cols:_*)
reshaped_df.show
+-------+----------+----------+
|Column1| Column2| Column3|
+-------+----------+----------+
| [1, 2]|[xyz, abc]|[123, 456]|
+-------+----------+----------+
Now that you have arrays, you can use arrays_zip to generate a single column of type array of struct
val zipped_df = reshaped_df
.select(arrays_zip(reshaped_df.columns.map(col):_*).as("value"))
zipped_df.show(false)
+------------------------------+
|value |
+------------------------------+
|[[1, xyz, 123], [2, abc, 456]]|
+------------------------------+
Now that you have an array of struct, you can use explode to transform your single row into multiple rows:
val final_df = zipped_df
.select(explode('value).as("s"))
.select(df.columns.map(c => 's(c).as(c)):_*)
final_df.show
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
| 1| xyz| 123|
| 2| abc| 456|
+-------+-------+-------+
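Since the rest of this page is in PySpark, here is a rough Python equivalent of the Scala answer above (a sketch, assuming the same test data and a Spark version where arrays_zip keeps the column names as struct field names):
from pyspark.sql import functions as F

# Test data mirroring the Scala example above
df = spark.createDataFrame(
    [("(1,2)", "(xyz,abc)", "(123,456)")],
    ["Column1", "Column2", "Column3"],
)

# Strip the parentheses and split each delimited string into an array
reshaped_df = df.select(
    *[F.split(F.regexp_replace(F.col(c), "[()]", ""), ",").alias(c) for c in df.columns]
)

# Zip the arrays element-wise, explode, and pull the struct fields back out
final_df = (
    reshaped_df
    .select(F.arrays_zip(*[F.col(c) for c in df.columns]).alias("value"))
    .select(F.explode("value").alias("s"))
    .select(*[F.col("s")[c].alias(c) for c in df.columns])
)
final_df.show()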

How to map each column to other columns in a PySpark dataframe?

I have created a dataframe by executing the code below.
from pyspark.sql import Row
l = [('Ankit',25,'Ankit','Ankit'),('Jalfaizy',22,'Jalfaizy',"aa"),('saurabh',20,'saurabh',"bb"),('Bala',26,"aa","bb")]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1]),lname=x[2],mname=x[3]))
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.show()
After executing the above code, my result is like below:
+---+--------+-----+--------+
|age| lname|mname| name|
+---+--------+-----+--------+
| 25| Ankit|Ankit| Ankit|
| 22|Jalfaizy| aa|Jalfaizy|
| 20| saurabh| bb| saurabh|
| 26| aa| bb| Bala|
+---+--------+-----+--------+
But for each row (keyed by the age column) I want to find, for every column, which other columns hold the same value. My expected result is like below:
+---+----------------+-------------------+------------------+
|age| lname_map_same | mname_map_same | name_map_same |
+---+----------------+-------------------+------------------+
| 25| mname,name | lname,name | lname,mname |
| 22| name | none | lname |
| 20| name | none | lname |
| 26| none | none | none |
+---+----------------+-------------------+------------------+
You can solve your problem with a map function. Have a look at the following code:
df_new = spark.createDataFrame([
    (25, "Ankit", "Ankit", "Ankit"), (22, "Jalfaizy", "aa", "Jalfaizy"), (26, "aa", "bb", "Bala")
], ("age", "lname", "mname", "name"))
# only 3 records added to the dataset

def find_identical(row):
    labels = ["lname", "mname", "name"]
    result = [row[0]]  # save the age for the final result
    row = row[1:]      # drop the age from the row
    for i in range(3):
        s = []
        field = row[i]
        if field == row[(i+1) % 3]:  # check whether field is identical with the next field
            s.append(labels[(i+1) % 3])
        if field == row[(i-1) % 3]:  # check whether field is identical with the previous field
            s.append(labels[(i-1) % 3])
        if not s:  # if no identical values were found, return None
            s = None
        result.append(s)
    return result

df_new.rdd.map(find_identical).toDF(["age", "lname_map_same", "mname_map_same", "name_map_same"]).show()
Output:
+---+--------------+--------------+--------------+
|age|lname_map_same|mname_map_same| name_map_same|
+---+--------------+--------------+--------------+
| 25| [mname, name]| [name, lname]|[lname, mname]|
| 22| [name]| null| [lname]|
| 26| null| null| null|
+---+--------------+--------------+--------------+
If you want 5 columns to be considered, you can follow the instructions in the comments: modify the labels list and add additional if statements. Furthermore, all modulo operations have to be adjusted to 5, and the for loop should iterate over 5 elements. Then you end up with code like this:
df_new = spark.createDataFrame([
    (25, "Ankit", "Ankit", "Ankit", "Ankit", "Ankit"), (22, "Jalfaizy", "aa", "Jalfaizy", "Jalfaizy", "aa"), (26, "aa", "bb", "Bala", "cc", "dd")
], ("age", "lname", "mname", "name", "n1", "n2"))

def find_identical(row):
    labels = ["lname", "mname", "name", "n1", "n2"]
    result = [row[0]]
    row = row[1:]
    for i in range(5):
        s = []
        field = row[i]
        if field == row[(i+1) % 5]:
            s.append(labels[(i+1) % 5])
        if field == row[(i-1) % 5]:
            s.append(labels[(i-1) % 5])
        if field == row[(i+2) % 5]:
            s.append(labels[(i+2) % 5])
        if field == row[(i+3) % 5]:
            s.append(labels[(i+3) % 5])
        if not s:
            s = None
        result.append(s)
    return result

df_new.rdd.map(find_identical).toDF(["age", "lname_map_same", "mname_map_same", "name_map_same", "n1_map_same", "n2_map_same"]).show(truncate=False)
Output:
+---+---------------------+---------------------+----------------------+------------------------+------------------------+
|age|lname_map_same |mname_map_same |name_map_same |n1_map_same |n2_map_same |
+---+---------------------+---------------------+----------------------+------------------------+------------------------+
|25 |[mname, n2, name, n1]|[name, lname, n1, n2]|[n1, mname, n2, lname]|[n2, name, lname, mname]|[lname, n1, mname, name]|
|22 |[name, n1] |[n2] |[n1, lname] |[name, lname] |[mname] |
|26 |null |null |null |null |null |
+---+---------------------+---------------------+----------------------+------------------------+------------------------+
The dynamic approach takes the number of columns as a parameter. But in my case the number should be between 1 and 5, since the dataset was created with a maximum of 5 attributes. It could look like this:
df_new = spark.createDataFrame([
    (25, "Ankit", "Ankit", "Ankit", "Ankit", "Ankit"), (22, "Jalfaizy", "aa", "Jalfaizy", "Jalfaizy", "aa"), (26, "aa", "bb", "Bala", "cc", "dd")
], ("age", "n1", "n2", "n3", "n4", "n5"))

def find_identical(row, number):
    labels = []
    for n in range(1, number + 1):
        labels.append("n" + str(n))  # create labels dynamically
    result = [row[0]]
    row = row[1:]
    for i in range(number):
        s = []
        field = row[i]
        for x in range(1, number):
            if field == row[(i+x) % number]:
                s.append(labels[(i+x) % number])  # check for similarity in all the other fields
        if not s:
            s = None
        result.append(s)
    return result

number = 4
colNames = ["age"]
for x in range(1, number + 1):
    colNames.append("n" + str(x) + "_same")  # create the 'nX_same' column names

df_new.rdd.map(lambda r: find_identical(r, number)).toDF(colNames).show(truncate=False)
Depending on the number parameter the output varies, and I kept the age column statically as the first column.
Output:
+---+------------+------------+------------+------------+
|age|n1_same |n2_same |n3_same |n4_same |
+---+------------+------------+------------+------------+
|25 |[n2, n3, n4]|[n3, n4, n1]|[n4, n1, n2]|[n1, n2, n3]|
|22 |[n3, n4] |null |[n4, n1] |[n1, n3] |
|26 |null |null |null |null |
+---+------------+------------+------------+------------+

PySpark: How to check if a column contains a number using isnan [duplicate]

This question already has answers here:
Count number of non-NaN entries in each column of Spark dataframe in PySpark
I have a dataframe which looks like this:
+------------------------+----------+
|Postal code |PostalCode|
+------------------------+----------+
|Muxía |null |
|Fuensanta |null |
|Salobre |null |
|Bolulla |null |
|33004 |null |
|Santa Eulàlia de Ronçana|null |
|Cabañes de Esgueva |null |
|Vallarta de Bureba |null |
|Villaverde del Monte |null |
|Villaluenga del Rosario |null |
+------------------------+----------+
If the Postal code column contains only numbers, I want to create a new column where only the numerical postal codes are stored. If the Postal code column contains only text, I want to create a new column called 'Municipality'.
I tried to use isnan, as my understanding is that it checks whether a value is not a number, but this does not seem to work. Should the column type be string for this to work?
So far my attempt is:
df2 = df.withColumn('PostalCode', when(isnan(df['Postal code']), df['Postal code']))
Looking at the dataframe example posted above, you can see that 'null' is returned for every value of the new column, including for postal code '33004'.
Any ideas will be much appreciated
isnan only returns true if the column contains a mathematically invalid number, for example 5/0. In any other case, including strings, it will return false. If you want to check whether a column contains a numerical value, you need to define your own UDF, for example as shown below:
from pyspark.sql.functions import when, udf
from pyspark.sql.types import BooleanType

df = spark.createDataFrame([('33004', ''), ('Muxia', None), ('Fuensanta', None)], ("Postal code", "PostalCode"))

def is_digit(value):
    if value:
        return value.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())

df = df.withColumn('PostalCode', when(is_digit_udf(df['Postal code']), df['Postal code']))
df = df.withColumn('Municipality', when(~is_digit_udf(df['Postal code']), df['Postal code']))
df.show()
This gives as output:
+-----------+----------+------------+
|Postal code|PostalCode|Municipality|
+-----------+----------+------------+
| 33004| 33004| null|
| Muxia| null| Muxia|
| Fuensanta| null| Fuensanta|
+-----------+----------+------------+
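As a side note, the same check can also be done without a UDF by matching the column against a regular expression; a sketch assuming the same df as above:
from pyspark.sql.functions import when, col

# True only when the whole value consists of digits
is_numeric = col('Postal code').rlike('^[0-9]+$')

df = df.withColumn('PostalCode', when(is_numeric, col('Postal code')))
df = df.withColumn('Municipality', when(~is_numeric, col('Postal code')))
df.show()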
