I have the following dictionary toy example:
d = {"id":666,"dp":[{"Value":"11","Key":"abc"},
{"Value":"88","Key":"kuku"},
{"Value":"99","Key":"lulu"},
{"Value":"John","Key":"name"}]}
I want to convert it to the following dataframe:
id key value
666 abc 11
666 kuku 88
666 lulu 99
666 name John
import pandas as pd
I have tried pd.DataFrame.from_dict(d), but I get id --> dp with the dicts left unexpanded.
Please advise: is there a quick method/best practice to attack this kind of format?
I know I can do it in a few steps (create the id column and add it to the key-value pairs), but I am looking for something more direct.
You can use json_normalize, but the repeated metadata values end up in the last column(s):
df = pd.json_normalize(d, 'dp', 'id')
print(df)
Value Key id
0 11 abc 666
1 88 kuku 666
2 99 lulu 666
3 John name 666
For correct ordering use:
# create the list of columns dynamically - all column names except dp
cols = [c for c in d.keys() if c != 'dp']
print(cols)
['id']
df = pd.json_normalize(d, 'dp', 'id')
# reorder: metadata columns first, then the remaining columns
df = df[cols + df.columns.difference(cols, sort=False).tolist()]
print(df)
id Value Key
0 666 11 abc
1 666 88 kuku
2 666 99 lulu
3 666 John name
You can also try:
df = pd.DataFrame(d)
df[['Value', 'Key']] = df.dp.apply(pd.Series)
df = df.drop('dp', axis=1)
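Since dp is already a list of flat dicts, a minimal alternative sketch (assuming the only scalar key is id, as in the example) is to build the frame straight from the list and attach the scalar with assign:

```python
import pandas as pd

d = {"id": 666, "dp": [{"Value": "11", "Key": "abc"},
                       {"Value": "88", "Key": "kuku"},
                       {"Value": "99", "Key": "lulu"},
                       {"Value": "John", "Key": "name"}]}

# build the frame from the list of records, then attach the scalar id
df = pd.DataFrame(d["dp"]).assign(id=d["id"])[["id", "Key", "Value"]]
print(df)
```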
I have two data frames. One dataframe (dfA) looks like:
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Hugo M 4500 6000 2
Jennie F 300 700 3
The other dataframe (dfB) looks like
Name position string
Peter 89 aa
Jennie 568 bb
Jennie 90 cc
I want to filter dfA so that the position from dfB falls within the interval of dfA (start_coordinate to end_coordinate) and the names match as well. For example, the position in row 1 of dfB falls in the interval given by row 1 of dfA and the names are the same, so I want that row. In contrast, row 3 of dfB also falls in the interval of row 1 of dfA, but the name is different, so I don't want that record.
The expected output therefore becomes:
##new_dfA
Name gender start_coordinate end_coordinate ID
Peter M 30 150 1
Jennie F 300 700 3
##new_dfB
Name position string
Peter 89 aa
Jennie 568 bb
In reality, dfB has shape (443068765, 10) and dfA has shape (100000, 3), so I don't want to use NumPy broadcasting because I run into a memory error. Is there a way to deal with this problem within the pandas framework? Insights will be appreciated.
If you have that many rows, pandas might not be well suited for your application.
That said, if there aren't many rows with identical "Name", you could merge on "Name" and then filter the rows matching your condition:
dfC = dfA.merge(dfB, on='Name')
dfC = dfC[dfC['position'].between(dfC['start_coordinate'], dfC['end_coordinate'])]
dfA_new = dfC[dfA.columns].drop_duplicates()
dfB_new = dfC[dfB.columns]
output:
>>> dfA_new
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
>>> dfB_new
Name position string
0 Peter 89 aa
1 Jennie 568 bb
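Given the size of dfB, one way to keep the merge within memory is to stream dfB in chunks, apply the same merge-and-filter per chunk, and concatenate the survivors. A sketch (here the chunking is faked by slicing dfB; with a file you would use pd.read_csv('dfB.csv', chunksize=...) instead):

```python
import pandas as pd

dfA = pd.DataFrame({'Name': ['Peter', 'Hugo', 'Jennie'],
                    'gender': ['M', 'M', 'F'],
                    'start_coordinate': [30, 4500, 300],
                    'end_coordinate': [150, 6000, 700],
                    'ID': [1, 2, 3]})
dfB = pd.DataFrame({'Name': ['Peter', 'Jennie', 'Jennie'],
                    'position': [89, 568, 90],
                    'string': ['aa', 'bb', 'cc']})

def filter_chunk(chunk, dfA):
    # merge one chunk of dfB against dfA and keep rows whose
    # position lies inside the matching interval
    m = chunk.merge(dfA, on='Name')
    return m[m['position'].between(m['start_coordinate'], m['end_coordinate'])]

# fake chunking by slicing; replace with pd.read_csv(..., chunksize=1_000_000)
chunks = [dfB.iloc[:2], dfB.iloc[2:]]
result = pd.concat([filter_chunk(c, dfA) for c in chunks], ignore_index=True)
print(result[dfB.columns])
```

Only the matching rows of each chunk are kept in memory, so the peak footprint is bounded by the chunk size times the fan-out of the merge.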
Use pandasql (its entry point is the sqldf function, not a pandas method):
from pandasql import sqldf
sqldf("select dfA.* from dfA inner join dfB on dfB.Name=dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())
Name gender start_coordinate end_coordinate ID
0 Peter M 30 150 1
1 Jennie F 300 700 3
sqldf("select dfB.* from dfA inner join dfB on dfB.Name=dfA.Name and dfB.position between dfA.start_coordinate and dfA.end_coordinate", globals())
Name position string
0 Peter 89 aa
1 Jennie 568 bb
I have a dataframe data whose last column contains a bunch of strings and digits, and another dataframe info that describes what those strings and digits mean. I want to map the user input (items) against info, find and count the matches in the last column of data, and prioritize data based on the number of matches.
import pandas as pd
#data
data = {'id': [123, 456, 789, 1122, 3344],
'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890', 'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)
#info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'], 'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
User input example:
run.py apple mango
desired output:
id Name MP-ID match count
3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
123 abc MP:001|MP:0085|MP:0985 MP:001 1
456 def MP:005|MP:0258 MP:005 1
789 hij MP:025|MP:5890 0
1122 klm MP:0589|MP:02546 0
Thank you for your help in advance
First get all the arguments into the variable vals, filter MP-ID by Series.isin with DataFrame.loc, extract the matches by Series.str.findall with Series.str.join, and finally use Series.str.count with DataFrame.sort_values:
import sys
vals = sys.argv[1:]
#vals = ['apple','mango']
s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print (test_data)
id Name MP-ID MP-ID match count
0 3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
1 123 abc MP:001|MP:0085|MP:0985 MP:001 1
2 456 def MP:005|MP:0258 MP:005 1
3 789 hij MP:025|MP:5890 0
4 1122 klm MP:0589|MP:02546 0
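One caveat with '|'.join(s) as a regex: an ID such as MP:001 would also match inside a longer ID like MP:0010. A safer variant (a sketch, reusing the same test_data/test_info as above) splits each cell on '|' and compares whole IDs against a set:

```python
import pandas as pd

test_data = pd.DataFrame({'id': [123, 456, 789, 1122, 3344],
                          'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
                          'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258',
                                    'MP:025|MP:5890', 'MP:0589|MP:02546',
                                    'MP:08597|MP:001|MP:005']})
test_info = pd.DataFrame({'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'],
                          'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']})

vals = ['apple', 'mango']
wanted = set(test_info.loc[test_info['Item'].isin(vals), 'MP-ID'])

# split each cell into whole IDs and keep only exact members of wanted
matches = test_data['MP-ID'].str.split('|').apply(
    lambda ids: [i for i in ids if i in wanted])
test_data['MP-ID match'] = matches.str.join('|')
test_data['count'] = matches.str.len()
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print(test_data)
```

Counting list lengths also avoids str.count('MP') over-counting if a matched ID ever contains 'MP' more than once.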
I am new to Python. I have written code that gives the desired solution. Would it be possible for you to help me rewrite it in pandas/NumPy?
Here are the file content:
File1 contains the following information:
ID1 USA 18 200
ID1 IND 1 100
ID1 BEL 186 150
ID2 UK 185 200
ID3 UK 200 130
file2 contains:
mgk:ID1:brs 1-20 5000
vho:ID1:gld 30-40 4000
sun:ID3:slv 198-400 5500
My code:
with open(r"x.txt", "r") as X, open(r"y.txt", "r") as Y:
    datax = [eachx.strip() for eachx in X]
    datay = [eachy.strip() for eachy in Y]

for eachofx in datax:
    dataxsplitted = eachofx.split("\t")
    xid = dataxsplitted[0]
    locx = int(dataxsplitted[2])
    for eachofy in datay:
        dataysplitted = eachofy.split("\t")
        yends = dataysplitted[1].split("-")
        ystart = int(yends[0])
        ystop = int(yends[1])
        yid = dataysplitted[0].split(":")[1]
        if xid == yid and ystart <= locx <= ystop:
            print(xid, locx, "exists", ystart, ystop, locx - ystart, ystop - locx)
Output:
ID1 18 exists 1 20 17 2
ID1 1 exists 1 20 0 19
ID3 200 exists 198 400 2 200
Explanation:
The files are to be compared based on ID1, ID2, etc. In file2 (y.txt) the ID is part of a string separated by ":". Once I find a match, I need to check whether the value in the third column of file1 lies between the two values in the second column of file2 (separated by "-"). If yes, I print "exists". I would also like to print the differences between the file1 value and each of the two range endpoints. Thank you all.
If you can use pandas, extract the ID and the range endpoints from file2, merge on the ID and filter with between:
import pandas as pd

df1 = pd.read_csv('x.txt', sep='\t', names=['id', 'country', 'loc', 'num'])
df2 = pd.read_csv('y.txt', sep='\t', names=['key', 'range', 'num2'])
# pull the id out of the ':'-separated key and split the 'start-stop' range
df2['id'] = df2['key'].str.split(':').str[1]
df2[['start', 'stop']] = df2['range'].str.split('-', expand=True).astype(int)
m = df1.merge(df2, on='id')
m = m[m['loc'].between(m['start'], m['stop'])]
print(m.assign(diff1=m['loc'] - m['start'], diff2=m['stop'] - m['loc']))
I'd like to reformat a cross-reference table I am using before merging it to my data.
Certain parts have a one-to-many relationship, and I want to reformat these cases into a single row so I capture all the info when I later merge/vlookup this table to my data. Most of the data is a one-to-one relationship, so the solution has to be selective.
Currently:
Marketing Number SKU
0 XXX 111
1 XXX 222
2 XXX 333
What I want:
Marketing Number SKU
0 XXX 111; 222; 333
Use groupby to get the SKU values into a list, then join the list values. Since the values in the list are ints, they must be converted to strings before joining.
import pandas as pd
# data and dataframe
data = {'Marketing Number': ['XXX', 'XXX', 'XXX', 'y', 'z', 'a'],
'SKU': [111, 222, 333, 444, 555, 666]}
df = pd.DataFrame(data)
Marketing Number SKU
0 XXX 111
1 XXX 222
2 XXX 333
3 y 444
4 z 555
5 a 666
# groupby with agg list
dfg = df.groupby('Marketing Number', as_index=False).agg({'SKU': list})
# join into string
dfg.SKU = dfg.SKU.apply(lambda x: '; '.join([str(i) for i in x]))
Marketing Number SKU
0 XXX 111; 222; 333
1 a 666
2 y 444
3 z 555
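The two steps above can also be collapsed into a single agg call (same df as above), converting with astype(str) before joining:

```python
import pandas as pd

df = pd.DataFrame({'Marketing Number': ['XXX', 'XXX', 'XXX', 'y', 'z', 'a'],
                   'SKU': [111, 222, 333, 444, 555, 666]})

# aggregate each group's SKUs directly into the joined string
dfg = (df.groupby('Marketing Number', as_index=False)
         .agg({'SKU': lambda s: '; '.join(s.astype(str))}))
print(dfg)
```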
I have two dataframes, df1 and df2. I want to update some columns (not all) of df1 from the values in df2's columns (the common columns have the same names in both dataframes), based on a key column. df1 can have multiple entries for a key, but in df2 each key has only one entry.
df2 :
party_id age person_name col2
0 1 12 abdjc abc
1 2 35 fAgBS sfd
2 3 65 Afdc shd
3 5 34 Afazbf qfwjk
4 6 78 asgsdb fdgd
5 7 35 sdgsd dsfbds
df1:
party_id account_id product_type age dob status col2
0 1 1 Current 25 28-01-1994 active sdag
1 2 2 Savings 31 14-07-1988 pending asdg
2 3 3 Loans 65 22-07-1954 frozen sgsdf
3 3 4 Over Draft Facility 93 29-01-1927 active dsfhgd
4 4 5 Mortgage 93 01-03-1926 pending sdggsd
In this example I want to update age, col2 in df1 based on the value present in df2. And key column here is party_id.
I tried mapping df2 into a dict by its key (column-wise, one column at a time). Here key_name = party_id and column_name = age:
dict_key = df2[key_name]
dict_value = df2[column_name]
temp_dict = dict(zip(dict_key, dict_value))
and then map it to df1
df1[column_name].map(temp_dict).fillna(df1[column_name])
But the issue is that this only maps one entry, not all entries for that key value. In this example, party_id == 3 has multiple entries in df1.
For keys that are not in df2, the respective column values should stay unchanged.
Can anyone help me with an efficient solution, given that my df1 is big (more than 500k rows)? Ideally all columns would update at the same time.
df2 is of moderate size, around 3k rows.
Thanks
The idea is to use DataFrame.merge with a left join first, then collect the columns that exist in both DataFrames into cols, and replace missing values with the original ones via DataFrame.fillna:
df = df1.merge(df2.drop_duplicates('party_id'), on='party_id', suffixes=('','_'), how='left')
cols = df2.columns.intersection(df1.columns).difference(['party_id'])
df[cols] = df[cols + '_'].rename(columns=lambda x: x.strip('_')).fillna(df[cols])
df = df[df1.columns]
print (df)
party_id account_id product_type age dob status col2
0 1 1 Current 12.0 28-01-1994 active abc
1 2 2 Savings 35.0 14-07-1988 pending sfd
2 3 3 Loans 65.0 22-07-1954 frozen shd
3 3 4 Over Draft Facility 65.0 29-01-1927 active shd
4 4 5 Mortgage 93.0 01-03-1926 pending sdggsd
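An alternative sketch (assuming party_id is unique in df2, as stated; extra df1 columns trimmed here for brevity) is to index both frames by party_id and use DataFrame.update, which aligns on the index and only overwrites where df2 actually has a value:

```python
import pandas as pd

df1 = pd.DataFrame({'party_id': [1, 2, 3, 3, 4],
                    'account_id': [1, 2, 3, 4, 5],
                    'age': [25, 31, 65, 93, 93],
                    'col2': ['sdag', 'asdg', 'sgsdf', 'dsfhgd', 'sdggsd']})
df2 = pd.DataFrame({'party_id': [1, 2, 3, 5, 6, 7],
                    'age': [12, 35, 65, 34, 78, 35],
                    'col2': ['abc', 'sfd', 'shd', 'qfwjk', 'fdgd', 'dsfbds']})

df1 = df1.set_index('party_id')
# update only the shared columns we care about; keys absent from df2
# (party_id 4 here) are left untouched
df1.update(df2.set_index('party_id')[['age', 'col2']])
df1 = df1.reset_index()
print(df1)
```

Because df2's index is unique, its rows are broadcast to every matching row of df1, so duplicated keys such as party_id == 3 are all updated.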