My input df is as below:
ID item1 item2 item3
1 a,b b,c b
2 a,c,f b,c b,c,f
3 g,h,i i h,i
4 j,k j,k l
The columns item1, item2 and item3 are of string type.
I would like to add a new column, Final_item; the required transformation is as below:
Pseudocode:
Final_item = set(col(item1) + col(item2)) - set(col(item3))
Basically, the last column combines item1 and item2, applies a set to remove duplicates, and then removes every value that also appears in the item3 column.
Desired output as below:
ID item1 item2 item3 Final_item
1 a,b b,c b a,c
2 a,c,f b,c b,c,f a
3 g,h,i i h,i g
4 j,k j,k l j,k
First split item3 on commas, then join item1 and item2 with a comma and split the result, and finally take the set difference for each row in a list comprehension over the zipped Series:
i3 = df['item3'].str.split(',')
i12 = (df['item1'] + ',' + df['item2']).str.split(',')
df['Final_item'] = [','.join(set(b) - set(a)) for a, b in zip(i3, i12)]
print (df)
ID item1 item2 item3 Final_item
0 1 a,b b,c b c,a
1 2 a,c,f b,c b,c,f a
2 3 g,h,i i h,i g
3 4 j,k j,k l j,k
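Because a Python set is unordered, the order of the values inside each Final_item cell is not guaranteed (the first row above came out as c,a rather than a,c). A minimal sketch that keeps the order of first appearance instead, assuming that ordering is what is wanted; the helper name ordered_difference is hypothetical and not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'item1': ['a,b', 'a,c,f', 'g,h,i', 'j,k'],
    'item2': ['b,c', 'b,c', 'i', 'j,k'],
    'item3': ['b', 'b,c,f', 'h,i', 'l'],
})

def ordered_difference(combined, excluded):
    """Return the values of `combined` that are not in `excluded`,
    deduplicated and in order of first appearance."""
    excluded = set(excluded)
    # dict.fromkeys removes duplicates while keeping insertion order
    return [x for x in dict.fromkeys(combined) if x not in excluded]

i3 = df['item3'].str.split(',')
i12 = (df['item1'] + ',' + df['item2']).str.split(',')
df['Final_item'] = [','.join(ordered_difference(b, a)) for a, b in zip(i3, i12)]
print(df)
```

With the sample data this yields a,c / a / g / j,k, which matches the desired output in the question.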
This is the available data:
Column A  Column B  Column C  Column D  Column E
item1     traitA    traitB    traitC    traitD
item2     traitE    traitF    traitG    traitH
item3     traitI    traitJ    traitK
item4     traitL    traitM    traitN
item5     traitO    traitP
I have a column of 5,000+ items. They all have different traits (some 2, some up to 20). Those traits are in the same row, in the columns next to the item. I already have the trait count per item and stacked the items for the right amount. Resulting in:
Column Q  Column R
item1     4
item2     4
item3     3
item4     3
item5     2
and:
Column Y  Column Z
item1
item1
item1
item1
item2
item2
item2
item2
item3
item3
item3
item4
item4
item4
item5
item5
The result I need is the following:
Column Y  Column Z
item1     traitA
item1     traitB
item1     traitC
item1     traitD
item2     traitE
item2     traitF
item2     traitG
item2     traitH
item3     traitI
item3     traitJ
item3     traitK
item4     traitL
item4     traitM
item4     traitN
item5     traitO
item5     traitP
I put this in cell Z2:
=VLOOKUP(Y2,$A:$E,2,FALSE)
This works but only for traitA, traitE, traitI, and so on (column B).
So what I need is a dynamic column index number. It needs to determine how many times 'item1' occurs in column Y in total, and which of those occurrences the current row is.
When the next item starts, the column index number has to go back to 2, since that is what makes the VLOOKUP work.
The column index numbers need to be as follows:
Column Y  Column Z
item1     2
item1     3
item1     4
item1     5
item2     2
item2     3
item2     4
item2     5
item3     2
item3     3
item3     4
item4     2
item4     3
item4     4
item5     2
item5     3
I don't have much experience with ROW and ROWS, and I cannot get it to work. Maybe VBA offers the best solution, or does there also need to be a COUNTA function?
Any help would be truly appreciated. Thanks!
You can use this formula, which is based on a table called data:
=LET(traits,data[[Column B]:[Column E]],
items,MAP(traits,LAMBDA(i,INDEX(data[Column A],ROW(i)-1))),
FILTER(HSTACK(TOCOL(items),TOCOL(traits)),TOCOL(traits)<>0))
The basic idea is to use TOCOL. For the traits, that is easy.
For the items, the trick is to fill a matrix of the same shape as the traits with the corresponding item, and then apply TOCOL to that as well.
(Important: you have to be on the Current Channel of Excel 365 to have TOCOL and HSTACK.)
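For comparison, the same "repeat the item next to every trait, then flatten and drop the blanks" idea can be sketched in pandas. This is only an illustration of the technique outside Excel, not a replacement for the formula; the column names follow the question and None stands in for an empty cell:

```python
import pandas as pd

# Wide layout from the question; None marks an empty trait cell
data = pd.DataFrame(
    [
        ['item1', 'traitA', 'traitB', 'traitC', 'traitD'],
        ['item2', 'traitE', 'traitF', 'traitG', 'traitH'],
        ['item3', 'traitI', 'traitJ', 'traitK', None],
        ['item4', 'traitL', 'traitM', 'traitN', None],
        ['item5', 'traitO', 'traitP', None, None],
    ],
    columns=['Column A', 'Column B', 'Column C', 'Column D', 'Column E'],
)

# melt repeats the item next to every trait cell (the "fill the matrix" step);
# ignore_index=False keeps the original row number for re-sorting afterwards
long = data.melt(id_vars='Column A', value_name='trait', ignore_index=False)

# Drop the empty cells (the FILTER step) and restore row-by-row order
long = long.dropna(subset=['trait']).sort_index(kind='stable')

result = long[['Column A', 'trait']].reset_index(drop=True)
result.columns = ['Column Y', 'Column Z']
print(result)
```

The stable sort matters here: melt emits the cells column by column, and sorting by the original row number with a stable algorithm keeps the traits of each item in their left-to-right order.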
I'm trying to extract text from a column using a pattern in Python so that I can move it to another column, but the pattern misses some rows. At the same time, I need to keep the unextracted strings as they are.
My code is:
import pandas as pd
df = pd.DataFrame({
'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)' ]
})
pattern = r'(\d+(\,[0-9]+)?\-\d+(\,[a-zA-Z])?\d+)'
df['result'] = df['col'].str.extract(pattern)[0]
print(df)
My output is:
col result
0 item1 (30-10) 30-10
1 item2 (200-100) 200-100
2 item3 (100 FS) NaN
3 item4 (100+) NaN
4 item1 (1000-2000) 1000-2000
My output should be:
col result newcolumn
0 item1 (30-10)
1 item2 (200-100)
2 item3 (100 FS)
3 item4 (100+)
4 item1 (1000-2000)
You can use this:
df['newcolumn'] = df.col.str.extract(r'(\(.+\))')
df['result'] = df['col'].str.extract(r'(\w+)')
Output:
col newcolumn result
0 item1 (30-10) (30-10) item1
1 item2 (200-100) (200-100) item2
2 item3 (100 FS) (100 FS) item3
3 item4 (100+) (100+) item4
4 item1 (1000-2000) (1000-2000) item1
Explanation:
The first expression gets the content within the parentheses (including the parentheses themselves). The second gets the first word.
You can also do this with .str.split in a single line:
df[['result', 'newcolumn']] = df['col'].str.split(' ', n=1, expand=True)
Output:
col result newcolumn
0 item1 (30-10) item1 (30-10)
1 item2 (200-100) item2 (200-100)
2 item3 (100 FS) item3 (100 FS)
3 item4 (100+) item4 (100+)
4 item1 (1000-2000) item1 (1000-2000)
You must use expand=True if your strings have a non-uniform number of splits (see also How to split a dataframe string column into two columns?).
EDIT: If you want to 'drop' the old column, you can also overwrite it and rename it:
df[['col', 'newcolumn']] = df['col'].str.split(' ', n=1, expand=True)
df = df.rename(columns={"col": "result"})
which gives exactly the result you specified:
result newcolumn
0 item1 (30-10)
1 item2 (200-100)
2 item3 (100 FS)
3 item4 (100+)
4 item1 (1000-2000)
You can extract the parts of interest by grouping them within one regular expression. The pattern matches item\d* as the first group and the parenthesized text, via \(.*\), as the second.
import pandas as pd
df = pd.DataFrame({
'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)' ]
})
# Raw string so that \d, \s and \( reach the regex engine unchanged
pattern = r"(item\d*)\s(\(.*\))"
# One extract call returns both capture groups as columns 0 and 1
df[['items', 'result']] = df['col'].str.extract(pattern)
print(df)
Output:
col items result
0 item1 (30-10) item1 (30-10)
1 item2 (200-100) item2 (200-100)
2 item3 (100 FS) item3 (100 FS)
3 item4 (100+) item4 (100+)
4 item1 (1000-2000) item1 (1000-2000)
I have 2 pandas dataframe as below
df1:-
col1 col2 col3
aa b c
aa d c
bb d t
bb b g
cc e c
dd g c
and 2nd dataframe:-
col1 col2
aa b
cc e
bb d
I want to change the value of col3 in the first dataframe to 'cc' wherever its (col1, col2) pair appears in the second dataframe, like below:
col1 col2 col3
aa b cc
aa d c
bb d cc
bb b g
cc e cc
dd g c
In short, I want to match columns (col1, col2) of the second dataframe against columns (col1, col2) of the first dataframe and change col3 of the first dataframe where they match.
Use DataFrame.merge with a left join and the indicator parameter to create a helper column, compare it with 'both' using Series.eq (equivalent to ==), and finally set the values with DataFrame.loc:
m = df1.merge(df2, on=['col1','col2'],indicator=True, how='left')['_merge'].eq('both')
df1.loc[m, 'col3'] = 'cc'
print (df1)
col1 col2 col3
0 aa b cc
1 aa d c
2 bb d cc
3 bb b g
4 cc e cc
5 dd g c
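A self-contained sketch of the merge approach using the sample data from the question. The mask comes from the merged frame, whose index is a fresh 0..n-1 range, so this sketch converts it to a NumPy array before indexing; that conversion is an addition here, meant for the case where df1 does not have a default index. It also assumes df2 contains no duplicate (col1, col2) pairs, since duplicates would add rows to the left join and the mask would no longer line up with df1.

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['aa', 'aa', 'bb', 'bb', 'cc', 'dd'],
    'col2': ['b', 'd', 'd', 'b', 'e', 'g'],
    'col3': ['c', 'c', 't', 'g', 'c', 'c'],
})
df2 = pd.DataFrame({'col1': ['aa', 'cc', 'bb'], 'col2': ['b', 'e', 'd']})

# indicator=True adds a '_merge' column: 'both' where the pair exists in df2
merged = df1.merge(df2, on=['col1', 'col2'], how='left', indicator=True)

# Positional boolean mask, so df1's own index labels do not matter
mask = merged['_merge'].eq('both').to_numpy()
df1.loc[mask, 'col3'] = 'cc'
print(df1)
```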
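You can also assign 'cc' as col3 on df2, concatenate the result with df1 using pd.concat, and keep the first occurrence of each (col1, col2) pair with drop_duplicates. Note that the row order then differs from df1: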
df = pd.concat([df2.assign(col3='cc'), df1]).drop_duplicates(['col1','col2']).reset_index(drop=True)
df
Output:
col1 col2 col3
0 aa b cc
1 cc e cc
2 bb d cc
3 aa d c
4 bb b g
5 dd g c
I'm quite new to VBA and trying to combine multiple row records in a large data dump into a single row record with multiple headers.
The data is exported into an excel file from another program and takes the form of:
Order Item Qty
1 Item1 2
1 Item2 5
1 Item4 1
2 Item1 1
2 Item2 2
2 Item3 5
3 Item1 4
3 Item2 5
3 Item3 1
4 Item2 2
4 Item3 1
5 Item1 1
5 Item2 1
5 Item3 1
6 Item1 4
6 Item2 4
6 Item4 2
Which would then be sorted into:
Order  Item1  Item2  Item3  Item4
1      2      5             1
2      1      2      5
3      4      5      1
4             2      1
5      1      1      1
6      4      4             2
I'm not expecting anyone to write my code but any pointers as to an overall approach would be much appreciated. Thanks!
I created a helper column and a formula to get the counts. Hope this helps.
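In terms of overall approach, this is a pivot: one row per Order, one column per Item, with Qty as the cell value. The sketch below illustrates that reshaping in pandas rather than VBA, on the assumption that working outside Excel is acceptable; inside Excel, a PivotTable performs the same kind of reshaping.

```python
import pandas as pd

# Long-format export from the question
orders = pd.DataFrame({
    'Order': [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6],
    'Item': ['Item1', 'Item2', 'Item4', 'Item1', 'Item2', 'Item3',
             'Item1', 'Item2', 'Item3', 'Item2', 'Item3',
             'Item1', 'Item2', 'Item3', 'Item1', 'Item2', 'Item4'],
    'Qty': [2, 5, 1, 1, 2, 5, 4, 5, 1, 2, 1, 1, 1, 1, 4, 4, 2],
})

# One row per Order, one column per Item; missing combinations become NaN
wide = orders.pivot(index='Order', columns='Item', values='Qty')
print(wide)
```

pivot raises an error if the same Order and Item combination occurs twice; if the export can contain such repeats, pivot_table with aggfunc='sum' is one way to add them up instead.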
I am working on getting pricing based on the number of units we are ordering using Excel sumifs. The data I have looks something like this:
A B C D
Item1 Comp1 1 4.99
Item1 Comp1 10 3.99
Item1 Comp1 100 2.99
Item1 Comp2 1 13.99
Item1 Comp2 100 10.99
Item1 Comp3 1 2.99
Item1 Comp3 10 2.59
Item1 Comp3 50 2.19
Item1 Comp3 100 1.99
... ... ... ...
Where column A is the main item, column B is the individual components of the item in column A, and column C is the number we need to order in order to get the price listed in column D.
In a separate sheet, I have the following table:
A B C
Item1 10 FORMULA
Item2 5 FORMULA
Item3 20 FORMULA
... ... ...
The point of this sheet is to have the Item name as seen in Column A of the first table, column B holds the number we need to order, and column C (hopefully) lists the total price by adding all the components at their respective price breaks.
In this example, the sum for Item1 I am looking for is 3.99 + 13.99 + 2.59 = 20.57 because 10 items gets the 10 price break for component 1, the 1 price break for component 2, and the 10 price break for component 3.
So far I am able to sum the cost based on the item name in column C:
=SUMIFS(Table1[D], Table1[A], A2)
I am having trouble with the second part: for each component, summing only the price at the highest price break where Table1[C] <= B2.
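To pin down the rule being asked for, here is a minimal sketch of the logic in pandas. It illustrates the calculation rather than giving an Excel formula, and the column names item, component, min_qty and price are hypothetical stand-ins for columns A to D of Table1:

```python
import pandas as pd

# Sample rows from the question (columns A-D of Table1)
table1 = pd.DataFrame({
    'item': ['Item1'] * 9,
    'component': ['Comp1'] * 3 + ['Comp2'] * 2 + ['Comp3'] * 4,
    'min_qty': [1, 10, 100, 1, 100, 1, 10, 50, 100],
    'price': [4.99, 3.99, 2.99, 13.99, 10.99, 2.99, 2.59, 2.19, 1.99],
})

def total_price(table, item, qty):
    """Sum, over the item's components, the price at the highest break <= qty."""
    eligible = table[(table['item'] == item) & (table['min_qty'] <= qty)]
    # Within each component, keep the row with the largest qualifying break
    best = eligible.sort_values('min_qty').groupby('component').tail(1)
    return round(float(best['price'].sum()), 2)

print(total_price(table1, 'Item1', 10))
```

For Item1 with a quantity of 10 this selects 3.99, 13.99 and 2.59, giving the 20.57 worked out in the question.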