I'd like to reformat a cross reference table I am using before merging it to my data.
Certain parts have a one-to-many relationship, and I want to reformat these cases into a single row so that I capture all the info when I later merge/vlookup this table with my data. Most of the data is a one-to-one relationship, so the solution has to be selective.
Currently:
Marketing Number SKU
0 XXX 111
1 XXX 222
2 XXX 333
What I want:
Marketing Number SKU
0 XXX 111; 222; 333
Use groupby to get the SKU values into a list
Then join the list values.
Since the values in the list are int type, they must be converted to strings, to join them.
import pandas as pd
# data and dataframe
data = {'Marketing Number': ['XXX', 'XXX', 'XXX', 'y', 'z', 'a'],
'SKU': [111, 222, 333, 444, 555, 666]}
df = pd.DataFrame(data)
Marketing Number SKU
0 XXX 111
1 XXX 222
2 XXX 333
3 y 444
4 z 555
5 a 666
# groupby with agg list
dfg = df.groupby('Marketing Number', as_index=False).agg({'SKU': list})
# join into string
dfg.SKU = dfg.SKU.apply(lambda x: '; '.join([str(i) for i in x]))
Marketing Number SKU
0 XXX 111; 222; 333
1 a 666
2 y 444
3 z 555
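A slightly shorter variant, offered as a sketch rather than a replacement: cast SKU to strings first, so that '; '.join can be handed to agg directly. The sort=False argument is my own addition; it keeps the groups in order of first appearance instead of sorting them by key.

```python
import pandas as pd

df = pd.DataFrame({
    'Marketing Number': ['XXX', 'XXX', 'XXX', 'y', 'z', 'a'],
    'SKU': [111, 222, 333, 444, 555, 666],
})

# Cast SKU to str once, then let agg join each group's values directly.
# sort=False keeps groups in order of first appearance (an optional choice).
dfg = (
    df.astype({'SKU': str})
      .groupby('Marketing Number', as_index=False, sort=False)
      .agg({'SKU': '; '.join})
)
print(dfg)
```

Rows with a single SKU pass through unchanged (as strings), so the one-to-one cases need no special handling.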
I have the following dictionary toy example:
d = {"id":666,"dp":[{"Value":"11","Key":"abc"},
{"Value":"88","Key":"kuku"},
{"Value":"99","Key":"lulu"},
{"Value":"John","Key":"name"}]}
I want to convert it to the following dataframe:
id key value
666 abc 11
666 kuku 88
666 lulu 99
666 name John
import pandas as pd
I have tried to use pd.DataFrame.from_dict(d) but I am getting id --> dp dicts.
Please advise, is there any quick method/best practice to attack this kind of format?
I know I can do it in a few steps (create the id column and add it to the key-value pairs).
You can use json_normalize, but the repeated values end up in the last column(s):
df = pd.json_normalize(d, 'dp', 'id')
print(df)
Value Key id
0 11 abc 666
1 88 kuku 666
2 99 lulu 666
3 John name 666
For correct ordering use:
# build the list of columns dynamically: all top-level keys except dp
cols = [c for c in d.keys() if c != 'dp']
print(cols)
['id']
df = pd.json_normalize(d, 'dp', 'id')
#change ordering by joined lists
df = df[cols + df.columns.difference(cols, sort=False).tolist()]
print(df)
id Value Key
0 666 11 abc
1 666 88 kuku
2 666 99 lulu
3 666 John name
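If the exact layout from the question is wanted (lowercase names, with key before value), a rename and an explicit column selection can follow json_normalize. This is a minimal sketch that assumes the same d as in the question; the target names id, key, value are taken from the desired output there.

```python
import pandas as pd

d = {"id": 666, "dp": [{"Value": "11", "Key": "abc"},
                       {"Value": "88", "Key": "kuku"},
                       {"Value": "99", "Key": "lulu"},
                       {"Value": "John", "Key": "name"}]}

# Flatten the records under 'dp' and carry 'id' along as metadata.
df = pd.json_normalize(d, record_path='dp', meta='id')

# Rename to the lowercase names from the question and pick the column order.
df = df.rename(columns={'Key': 'key', 'Value': 'value'})[['id', 'key', 'value']]
print(df)
```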
You can also try:
df = pd.DataFrame(d)
df[['Value','Key']]=df.dp.apply(pd.Series)
df = df.drop('dp', axis=1)
I am working on a large dataset. The problem I am facing is that some columns should hold only integer values, but because the dataset is uncleaned, a few rows contain characters mixed with the digits. Here I illustrate the problem with a small pandas dataframe example.
I have the following dataframe:
Index l1 l2 l3
0 1 123 23
1 2 Z3V 343
2 3 321 21
3 4 AZ34 345
4 5 432 3
With dataframe code :
l1,l2,l3 = [1,2,3,4,5], [123, 'Z3V', 321, 'AZ34', 432], [23,343,21,345,3]
data = pd.DataFrame(zip(l1,l2,l3), columns=['l1', 'l2', 'l3'])
print(data)
As you can see, column 'l2' contains characters mixed with digits at row indexes 1 and 3. I want to find such rows in this column and print them. After that, I want to replace each such value with an integer such as 100. Each distinct value gets its own replacement: for example, every instance of 'Z3V' becomes 100 and every instance of 'AZ34' becomes 101. If 'Z3V' occurs again in column 'l2', it is replaced with 100 there too.
Expected output :
Index l1 l2 l3
0 1 123 23
1 2 100 343
2 3 321 21
3 4 101 345
4 5 432 3
As you can see, the two values that contained characters have been replaced with 100 and 101 respectively.
How can I get this expected output?
You could do:
import pandas as pd
import numpy as np
# setup
l1, l2, l3 = [1, 2, 3, 4, 5, 6], [123, 'Z3V', 321, 'AZ34', 432, 'Z3V'], [23, 343, 21, 345, 3, 3]
data = pd.DataFrame(zip(l1, l2, l3), columns=['l1', 'l2', 'l3'])
# find all non-numeric values across the whole DataFrame
# (pandas 2.1+ deprecates applymap in favour of DataFrame.map)
mask = data.applymap(np.isreal)
rows, cols = np.where(~mask)
# create the replacement dictionary
replacements = {k: i for i, k in enumerate(np.unique(data.values[rows, cols]), 100)}
# apply the replacements
res = data.replace(replacements)
print(res)
Output
l1 l2 l3
0 1 123 23
1 2 101 343
2 3 321 21
3 4 100 345
4 5 432 3
5 6 101 3
Note that I added an extra row to verify the desired behaviour, so the data DataFrame now looks like:
l1 l2 l3
0 1 123 23
1 2 Z3V 343
2 3 321 21
3 4 AZ34 345
4 5 432 3
5 6 Z3V 3
By changing this line:
# create the replacement dictionary
replacements = {k: i for i, k in enumerate(np.unique(data.values[rows, cols]), 100)}
you can change the replacement values as you see fit.
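As one possible variation, here is a sketch that restricts the search to column l2, prints the offending rows as the question asks, and numbers the values in order of first appearance, so that 'Z3V' maps to 100 and 'AZ34' to 101 as in the expected output. It assumes that "non-numeric" means "cannot be parsed by pd.to_numeric"; note that a numeric-looking string such as '123' would therefore count as numeric. The names bad and codes are my own.

```python
import pandas as pd

data = pd.DataFrame({
    'l1': [1, 2, 3, 4, 5, 6],
    'l2': [123, 'Z3V', 321, 'AZ34', 432, 'Z3V'],
    'l3': [23, 343, 21, 345, 3, 3],
})

# Values that cannot be parsed as numbers become NaN, which marks the bad rows.
bad = pd.to_numeric(data['l2'], errors='coerce').isna()
print(data[bad])

# Number the distinct bad values from 100 in order of first appearance.
codes = {v: i for i, v in enumerate(data.loc[bad, 'l2'].unique(), start=100)}

# Swap in the codes, leave other values untouched, and convert to integers.
data['l2'] = data['l2'].map(lambda v: codes.get(v, v)).astype(int)
print(data)
```

Series.unique returns values in order of appearance, whereas np.unique sorts them, which is why the numbering differs from the output shown above.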
I have a dataframe data whose last column contains strings of IDs (letters and digits), and a second dataframe info that says what each ID means. I want to map the user input (item names) to IDs via info, then find, print, and count how many of those IDs appear in the last column of data, and sort data by the number of matches.
import pandas as pd
#data
data = {'id': [123, 456, 789, 1122, 3344],
'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890', 'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005']}
test_data = pd.DataFrame(data)
#info
info = {'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'], 'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango']}
test_info = pd.DataFrame(info)
user input example:
run.py apple mango
desired output:
id Name MP-ID match count
3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
123 abc MP:001|MP:0085|MP:0985 MP:001 1
456 def MP:005|MP:0258 MP:005 1
789 hij MP:025|MP:5890 0
1122 klm MP:0589|MP:02546 0
Thank you for your help in advance
First read all the arguments into the variable vals, then select the matching MP-ID values with Series.isin and DataFrame.loc, extract them with Series.str.findall and Series.str.join, and finally count them with Series.str.count and sort with DataFrame.sort_values:
import sys
vals = sys.argv[1:]
#vals = ['apple','mango']
s = test_info.loc[test_info['Item'].isin(vals), 'MP-ID']
test_data['MP-ID match'] = test_data['MP-ID'].str.findall('|'.join(s)).str.join('|')
test_data['count'] = test_data['MP-ID match'].str.count('MP')
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print (test_data)
id Name MP-ID MP-ID match count
0 3344 nop MP:08597|MP:001|MP:005 MP:001|MP:005 2
1 123 abc MP:001|MP:0085|MP:0985 MP:001 1
2 456 def MP:005|MP:0258 MP:005 1
3 789 hij MP:025|MP:5890 0
4 1122 klm MP:0589|MP:02546 0
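One caveat with the regex approach: str.findall matches substrings, so a pattern like MP:001 would also match inside a longer ID such as MP:0012 (a hypothetical value, not in the sample data). If that matters, a sketch of an exact-match variant is to split each cell on '|' and test membership in a set. The names wanted and matches are my own, and vals is hard-coded here as a stand-in for sys.argv[1:].

```python
import pandas as pd

test_data = pd.DataFrame({
    'id': [123, 456, 789, 1122, 3344],
    'Name': ['abc', 'def', 'hij', 'klm', 'nop'],
    'MP-ID': ['MP:001|MP:0085|MP:0985', 'MP:005|MP:0258', 'MP:025|MP:5890',
              'MP:0589|MP:02546', 'MP:08597|MP:001|MP:005'],
})
test_info = pd.DataFrame({
    'MP-ID': ['MP:001', 'MP:002', 'MP:003', 'MP:004', 'MP:005'],
    'Item': ['apple', 'orange', 'grapes', 'bannan', 'mango'],
})

vals = ['apple', 'mango']  # stand-in for sys.argv[1:]
wanted = set(test_info.loc[test_info['Item'].isin(vals), 'MP-ID'])

# Split each cell on the literal '|' and keep only IDs that match exactly.
matches = test_data['MP-ID'].str.split('|').apply(
    lambda ids: [i for i in ids if i in wanted]
)
test_data['MP-ID match'] = matches.str.join('|')
test_data['count'] = matches.str.len()
test_data = test_data.sort_values('count', ascending=False, ignore_index=True)
print(test_data)
```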
I am new to Python. I have written some code and it gives the desired result. Could you help me rewrite it in pandas/NumPy?
Here are the file content:
File1 contains the following information:
ID1 USA 18 200
ID1 IND 1 100
ID1 BEL 186 150
ID2 UK 185 200
ID3 UK 200 130
file2 contains:
mgk:ID1:brs 1-20 5000
vho:ID1:gld 30-40 4000
sun:ID3:slv 198-400 5500
My code:
`with open(r"x.txt", "r") as X, open(r"y.txt", "r") as Y:
    datax = []
    datay = []
    for eachx in X:
        datax.append(eachx.strip())
    for eachy in Y:
        datay.append(eachy.strip())
for eachofx in datax:
    dataxplitted = eachofx.split("\t")
    xid = dataxplitted[0]
    locx = int(dataxplitted[2])
    # print(xid, locx)
    for eachofy in datay:
        dataysplitted = eachofy.split("\t")
        Yends = dataysplitted[1].split("-")
        ySTART = int(Yends[0])
        ySTOP = int(Yends[1])
        yIDdetails = dataysplitted[0].split(":")
        yid = yIDdetails[1]
        # print(yIDdetails, yid)
        if xid == yid:
            if ySTART <= locx <= ySTOP:
                print(xid, locx, "exists", ySTART, ySTOP, locx - ySTART, ySTOP - locx)
`
Output:
ID1 18 exists 1 20 17 2
ID1 1 exists 1 20 0 19
ID3 200 exists 198 400 2 200
Explanation:
The files are to be compared based on the IDs (ID1, ID2, etc.). In file2 (y.txt) the ID is part of a string separated by ":". Once I find a match, I need to check whether the value in the third column of file1 lies between the two values in the second column of file2 (which are separated by "-"). If it does, I need to print "exists". I would also like to print the difference between the file1 value and each of the two range values alongside it. Thank you all.
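If you can use pandas:
import pandas as pd
# the question splits on "\t", so read both files as tab-separated
df1 = pd.read_csv('x.txt', sep='\t', names=['id', 'country', 'pos', 'num2'])
df2 = pd.read_csv('y.txt', sep='\t', names=['label', 'range', 'num3'])
# extract the ID from "mgk:ID1:brs" and split "1-20" into integer bounds
df2['id'] = df2['label'].str.split(':').str[1]
df2[['start', 'stop']] = df2['range'].str.split('-', expand=True).astype(int)
# pair rows by ID and keep those whose position lies within [start, stop]
final_df = df1.merge(df2, on='id')
final_df = final_df[final_df['pos'].between(final_df['start'], final_df['stop'])].assign(status='exists')
print(final_df[['id', 'pos', 'status', 'start', 'stop']])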
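The question also asks for the distance from the matched value to each end of the range. Below is a self-contained sketch with the sample rows inlined through io.StringIO, so that it runs without the two files. It assumes the files are tab-separated, as the split("\t") calls in the question suggest; the column names pos, start, stop, from_start, and to_stop are my own labels.

```python
import io
import pandas as pd

# Sample rows from the question, inlined so the sketch runs without files.
x_txt = ("ID1\tUSA\t18\t200\nID1\tIND\t1\t100\nID1\tBEL\t186\t150\n"
         "ID2\tUK\t185\t200\nID3\tUK\t200\t130\n")
y_txt = ("mgk:ID1:brs\t1-20\t5000\nvho:ID1:gld\t30-40\t4000\n"
         "sun:ID3:slv\t198-400\t5500\n")

df1 = pd.read_csv(io.StringIO(x_txt), sep='\t', names=['id', 'country', 'pos', 'num2'])
df2 = pd.read_csv(io.StringIO(y_txt), sep='\t', names=['label', 'range', 'num3'])

# "mgk:ID1:brs" -> "ID1"; "1-20" -> start=1, stop=20
df2['id'] = df2['label'].str.split(':').str[1]
df2[['start', 'stop']] = df2['range'].str.split('-', expand=True).astype(int)

# Pair every file1 row with every file2 row sharing its ID, then filter.
hits = df1.merge(df2[['id', 'start', 'stop']], on='id')
hits = hits[hits['pos'].between(hits['start'], hits['stop'])].copy()

hits['status'] = 'exists'
hits['from_start'] = hits['pos'] - hits['start']
hits['to_stop'] = hits['stop'] - hits['pos']
print(hits[['id', 'pos', 'status', 'start', 'stop', 'from_start', 'to_stop']])
```

The merge pairs every row of one file with every row of the other that shares its ID, which mirrors the nested loops in the question; for very large files with many repeats per ID, that intermediate table can grow quickly.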
Input_pyspark_dataframe:
id name collection student.1.price student.2.price student.3.price
111 aaa 1 100 999 232
222 bbb 2 200 888 656
333 ccc 1 300 777 454
444 ddd 3 400 666 787
output_pyspark_dataframe
id name collection price
111 aaa 1 100
222 bbb 2 888
333 ccc 1 300
444 ddd 3 787
The correct price for each id is the one selected by the value in the collection column.
Question
Using PySpark, how can I find the correct price for each id by building the column name student.{collection}.price dynamically?
Please let me know.
It is a bit complicated, but you can do it this way.
fields holds the field names of the struct column student. You could also supply these manually; either way you end up with 1, 2, 3.
The first withColumn then builds an array of the columns student.{i}.price for i in range(1, 4). Similarly, the second withColumn builds an array of the literals {i}.
Now zip these two arrays into one array such as
[('1', col('student.1.price')), ...]
and explode the array, so that each element becomes its own row:
('1', col('student.1.price'))
('2', col('student.2.price'))
('3', col('student.3.price'))
Since arrays_zip gives you an array of structs, each exploded element is a struct. Get each value by using the struct key as the column name, that is, zip.index for the index and zip.array for the price.
Finally, compare collection with index (which is really the field name inside the student struct column) and keep only the matching rows.
import pyspark.sql.functions as f
fields = [field.name for field in next(field for field in df.schema.fields if field.name == 'student').dataType.fields]
df.withColumn('array', f.array(*map(lambda x: 'student.' + x + '.price', fields))) \
.withColumn('index', f.array(*map(lambda x: f.lit(x), fields))) \
.withColumn('zip', f.arrays_zip('index', 'array')) \
.withColumn('zip', f.explode('zip')) \
.withColumn('index', f.col('zip.index')) \
.withColumn('price', f.col('zip.array')) \
.filter('collection = index') \
.select('id', 'name', 'collection', 'price') \
.show(10, False)
+---+----+----------+-----+
|id |name|collection|price|
+---+----+----------+-----+
|111|aaa |1 |100 |
|222|bbb |2 |888 |
|333|ccc |1 |300 |
|444|ddd |3 |787 |
+---+----+----------+-----+