How to pivot 2 columns in PySpark

I have a dataframe as below, and I need to pivot it so that two new columns are created: "var3", which takes the values "var1" and "var2" from the column headers, and "var4", which holds the amount associated with each, grouped by id. There are other columns in the dataframe I am working with, but they are all at the id level.
id   var1  var2
465  1000   200
455  2000   400
The resulting output would be:
id   var3  var4
465  var1  1000
465  var2   200
455  var1  2000
455  var2   400

Use unpivot (available since Spark 3.4):
df.unpivot(['id'], ['var1', 'var2'], 'var3', 'var4').show()
Or stack, which also works on older Spark versions:
df.selectExpr("id", "stack(2, 'var1', var1, 'var2', var2) as (var3, var4)").show()
Or melt, an alias of unpivot (also Spark 3.4+):
df.melt(ids=['id'], values=['var1', 'var2'], variableColumnName="var3", valueColumnName="var4").show()
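Putting one of these together end to end, a minimal self-contained sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# recreate the sample input from the question
df = spark.createDataFrame([(465, 1000, 200), (455, 2000, 400)],
                           ['id', 'var1', 'var2'])

df.unpivot(['id'], ['var1', 'var2'], 'var3', 'var4').show()
# +---+----+----+
# | id|var3|var4|
# +---+----+----+
# |465|var1|1000|
# |465|var2| 200|
# |455|var1|2000|
# |455|var2| 400|
# +---+----+----+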

Related

Split corresponding column values in pyspark

The table below is the input dataframe:
col1  col2      col3
1     12;34;56  Aus;SL;NZ
2     31;54;81  Ind;US;UK
3     null      Ban
4     Ned       null
Expected output dataframe (values of col2 and col3 should be split by ";" correspondingly):
col1  col2  col3
1     12    Aus
1     34    SL
1     56    NZ
2     31    Ind
2     54    US
2     81    UK
3     null  Ban
4     Ned   null
You can use the PySpark function split() to convert the column with multiple values into an array, and then the function explode() to make multiple rows out of the different values.
It may look like this:
from pyspark.sql.functions import explode, split

df = df.withColumn("<columnName>", explode(split(df.<columnName>, ";")))
If you want the values of multiple exploded arrays to match up row-wise, you can work with posexplode() and then filter() to the rows where the positions correspond.
If you want to keep NULL values, use explode_outer().
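A minimal, hypothetical illustration of the NULL behaviour:

from pyspark.sql.functions import explode_outer, split

df = spark.createDataFrame([(1, 'a;b'), (2, None)], ['id', 'vals'])
df.withColumn('vals', explode_outer(split('vals', ';'))).show()
# id 2 survives with vals = null; plain explode() would drop that row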
The code below works fine:
from pyspark.sql.functions import posexplode_outer, split

data = [(1, '12;34;56', 'Aus;SL;NZ'),
        (2, '31;54;81', 'Ind;US;UK'),
        (3, None, 'Ban'),
        (4, 'Ned', None)]
columns = ['Id', 'Score', 'Countries']
df = spark.createDataFrame(data, columns)

# explode each column together with the position of each value
df2 = df.select("*", posexplode_outer(split("Countries", ";")).alias("pos1", "value1"))
df3 = df2.select("*", posexplode_outer(split("Score", ";")).alias("pos2", "value2"))

# keep only the rows where the positions correspond (or one side is null)
df4 = df3.filter((df3.pos1 == df3.pos2) | df3.pos1.isNull() | df3.pos2.isNull())
df4 = df4.select("Id", "value2", "value1")
df4.show()  # final output
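For reference, with the sample data above, df4.show() should print something like this (newer Spark versions render null as NULL):

+---+------+------+
| Id|value2|value1|
+---+------+------+
|  1|    12|   Aus|
|  1|    34|    SL|
|  1|    56|    NZ|
|  2|    31|   Ind|
|  2|    54|    US|
|  2|    81|    UK|
|  3|  null|   Ban|
|  4|   Ned|  null|
+---+------+------+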

pandas help: map and match tab delimited strings in a column and print into new column

I have a dataframe data whose last column contains a bunch of strings and digits, and another dataframe info that explains what those strings and digits mean. I want to map user input (items) against info, find the matches in the last column of data, print them, count how many are present, and prioritize (sort) data based on the number of matches.
import pandas as pd

# data
data = {'id': [123, 456, 789, 1122, 3344],
        'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
        'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890', 'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)

# info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'],
        'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
User input example:
run.py apple mango
Desired output:
id    Name  MP-ID                   MP-ID match    count
3344  nop   MP:08597|MP:001|MP:005  MP:001|MP:005  2
123   abc   MP:001|MP:0085|MP:0985  MP:001         1
456   def   MP:005|MP:0258          MP:005         1
789   hij   MP:025|MP:5890                         0
1122  klm   MP:0589|MP:02546                       0
Thank you for your help in advance
First get all command-line arguments into the variable vals, filter MP-ID by Series.isin with DataFrame.loc, extract the matches with Series.str.findall and Series.str.join, and last use Series.str.count with DataFrame.sort_values:
import sys
vals = sys.argv[1:]
#vals = ['apple','mango']
s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print (test_data)
id Name MP-ID MP-ID match count
0 3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
1 123 abc MP:001|MP:0085|MP:0985 MP:001 1
2 456 def MP:005|MP:0258 MP:005 1
3 789 hij MP:025|MP:5890 0
4 1122 klm MP:0589|MP:02546 0
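One caveat with this approach: '|'.join(s) is used as a regular expression, so an ID that happens to be a prefix of a longer ID (a hypothetical MP:001 vs MP:0012) would over-match. A defensive variant escapes each ID and requires a following separator or end of string:

import re

pattern = '|'.join(rf'{re.escape(mp)}(?=\||$)' for mp in s)
test_data['MP-ID match'] = test_data['MP-ID'].str.findall(pattern).str.join('|')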

Compare two files based on column and output should contain both matched and non matched entries

I am new to Python. I have written code that gives the desired output. Would it be possible for you to help me rewrite it in pandas/NumPy?
Here are the file contents:
File1 contains the following information:
ID1 USA 18 200
ID1 IND 1 100
ID1 BEL 186 150
ID2 UK 185 200
ID3 UK 200 130
File2 contains:
mgk:ID1:brs 1-20 5000
vho:ID1:gld 30-40 4000
sun:ID3:slv 198-400 5500
My code:
with open(r"x.txt", "r") as X, open(r"y.txt", "r") as Y:
    datax = []
    datay = []
    for eachx in X:
        datax.append(eachx.strip())
    for eachy in Y:
        datay.append(eachy.strip())
    # for each row of x.txt, scan y.txt for a matching ID and test the range
    for eachofx in datax:
        dataxsplitted = eachofx.split("\t")
        xid = dataxsplitted[0]
        locx = int(dataxsplitted[2])
        for eachofy in datay:
            dataysplitted = eachofy.split("\t")
            Yends = dataysplitted[1].split("-")
            ySTART = int(Yends[0])
            ySTOP = int(Yends[1])
            yIDdetails = dataysplitted[0].split(":")
            yid = yIDdetails[1]
            if xid == yid:
                if ySTART <= locx <= ySTOP:
                    print(xid, locx, "exists", ySTART, ySTOP, locx - ySTART, ySTOP - locx)
Output:
ID1 18 exists 1 20 17 2
ID1 1 exists 1 20 0 19
ID3 200 exists 198 400 2 200
Explanation:
The files are compared based on the IDs (ID1, ID2, etc.). In file2 (y.txt) the ID is part of a string separated by ":". Once I find a match, I need to check whether the value in the third column of file1 lies between the two values in the second column of file2 (separated by "-"). If yes, I print "exists", along with the differences between the file1 value and each of the two range endpoints. Thank you all.
If you can use pandas (assuming the files are tab-separated, as your split("\t") suggests):
import pandas as pd

df1 = pd.read_csv('x.txt', sep='\t', names=['id', 'country', 'loc', 'num'])
df2 = pd.read_csv('y.txt', sep='\t', names=['ref', 'range', 'val'])
# pull "ID1" out of "mgk:ID1:brs" and split the "start-stop" range into numbers
df2['id'] = df2['ref'].str.split(':').str[1]
df2[['start', 'stop']] = df2['range'].str.split('-', expand=True).astype(int)
final_df = df1.merge(df2[['id', 'start', 'stop']], on='id', how='outer')
final_df['status'] = final_df.apply(
    lambda x: 'exists' if pd.notnull(x['start']) and x['start'] <= x['loc'] <= x['stop'] else 'no',
    axis=1)
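The outer merge keeps rows that have no counterpart on the other side (ID2 here), so both matched and non-matched entries survive. With the sample files, the result should look roughly like:

print(final_df[['id', 'loc', 'start', 'stop', 'status']])
#     id  loc  start   stop  status
# 0  ID1   18    1.0   20.0  exists
# 1  ID1   18   30.0   40.0      no
# 2  ID1    1    1.0   20.0  exists
# 3  ID1    1   30.0   40.0      no
# 4  ID1  186    1.0   20.0      no
# 5  ID1  186   30.0   40.0      no
# 6  ID2  185    NaN    NaN      no
# 7  ID3  200  198.0  400.0  exists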

Show One to Many in One Row

I'd like to reformat a cross-reference table before merging it into my data.
Certain parts have a one-to-many relationship, and I want to collapse those cases into a single row so I capture all the info when I later merge/vlookup this table against my data. Most of the data is a one-to-one relationship, so the solution has to be selective.
Currently:
Marketing Number SKU
0 XXX 111
1 XXX 222
2 XXX 333
What I want:
Marketing Number SKU
0 XXX 111; 222; 333
Use groupby to get the SKU values into a list, then join the list values. Since the values in the list are of int type, they must be converted to strings before joining.
import pandas as pd

# data and dataframe
data = {'Marketing Number': ['XXX', 'XXX', 'XXX', 'y', 'z', 'a'],
        'SKU': [111, 222, 333, 444, 555, 666]}
df = pd.DataFrame(data)
Marketing Number SKU
0 XXX 111
1 XXX 222
2 XXX 333
3 y 444
4 z 555
5 a 666
# groupby with agg list
dfg = df.groupby('Marketing Number', as_index=False).agg({'SKU': list})
# join into string
dfg.SKU = dfg.SKU.apply(lambda x: '; '.join([str(i) for i in x]))
Marketing Number SKU
0 XXX 111; 222; 333
1 a 666
2 y 444
3 z 555
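Equivalently, the list step and the join can be folded into a single aggregation; a matter of taste:

dfg = (df.groupby('Marketing Number', as_index=False)
         .agg({'SKU': lambda x: '; '.join(map(str, x))}))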

Updating multiple columns of df from another df

I have two dataframes, df1 and df2. I want to update some columns (not all) of df1 from the values in df2's columns (the common columns have the same names in both dataframes), based on a key column. df1 can have multiple entries for a key, but in df2 each key has only one entry.
df2:
party_id age person_name col2
0 1 12 abdjc abc
1 2 35 fAgBS sfd
2 3 65 Afdc shd
3 5 34 Afazbf qfwjk
4 6 78 asgsdb fdgd
5 7 35 sdgsd dsfbds
df1:
party_id account_id product_type age dob status col2
0 1 1 Current 25 28-01-1994 active sdag
1 2 2 Savings 31 14-07-1988 pending asdg
2 3 3 Loans 65 22-07-1954 frozen sgsdf
3 3 4 Over Draft Facility 93 29-01-1927 active dsfhgd
4 4 5 Mortgage 93 01-03-1926 pending sdggsd
In this example I want to update age and col2 in df1 based on the values present in df2; the key column here is party_id.
I tried mapping df2 into a dict by key (column-wise, one column at a time). Here key_name = party_id and column_name = age:
dict_key = df2[key_name]
dict_value = df2[column_name]
temp_dict = dict(zip(dict_key, dict_value))
and then mapped it onto df1:
df1[column_name].map(temp_dict).fillna(df1[column_name])
But the issue is that it only maps one entry per key, not all of them. In this example party_id == 3 has multiple entries in df1.
Keys which are not in df2 should keep their original values for that column.
Can anyone help me with an efficient solution, since my df1 is large (more than 500k rows), so that all columns can be updated at the same time? df2 is of moderate size, around 3k rows.
Thanks
The idea is to use DataFrame.merge with a left join first, then get the columns which are the same in both DataFrames into cols, and replace missing values with the original values by DataFrame.fillna:
df = df1.merge(df2.drop_duplicates('party_id'), on='party_id', suffixes=('','_'), how='left')
cols = df2.columns.intersection(df1.columns).difference(['party_id'])
df[cols] = df[cols + '_'].rename(columns=lambda x: x.strip('_')).fillna(df[cols])
df = df[df1.columns]
print (df)
   party_id  account_id         product_type   age         dob   status    col2
0         1           1              Current  12.0  28-01-1994   active     abc
1         2           2              Savings  35.0  14-07-1988  pending     sfd
2         3           3                Loans  65.0  22-07-1954   frozen     shd
3         3           4  Over Draft Facility  65.0  29-01-1927   active     shd
4         4           5             Mortgage  93.0  01-03-1926  pending  sdggsd
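If mutating df1 in place is acceptable, a sketch of an alternative using DataFrame.update, which aligns on the index: both rows with party_id == 3 get updated, and keys missing from df2 (party_id == 4) are left untouched. Note that update works in place and may upcast dtypes:

df1 = df1.set_index('party_id')
df1.update(df2.set_index('party_id')[['age', 'col2']])
df1 = df1.reset_index()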
