I'm trying to extract text from a column using a pattern in Python so I can move it to another column, but I miss some results. At the same time, I need to keep the unextracted strings as they are.
My code is:
import pandas as pd
df = pd.DataFrame({
'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)' ]
})
pattern = r'(\d+(\,[0-9]+)?\-\d+(\,[a-zA-Z])?\d+)'
df['result'] = df['col'].str.extract(pattern)[0]
print(df)
My output is:
col result
0 item1 (30-10) 30-10
1 item2 (200-100) 200-100
2 item3 (100 FS) NaN
3 item4 (100+) NaN
4 item1 (1000-2000) 1000-2000
My output should be:
col result newcolumn
0 item1 (30-10)
1 item2 (200-100)
2 item3 (100 FS)
3 item4 (100+)
4 item1 (1000-2000)
You can use this:
df['newcolumn'] = df.col.str.extract(r'(\(.+\))')
df['result'] = df['col'].str.extract(r'(\w+)')
Output:
col newcolumn result
0 item1 (30-10) (30-10) item1
1 item2 (200-100) (200-100) item2
2 item3 (100 FS) (100 FS) item3
3 item4 (100+) (100+) item4
4 item1 (1000-2000) (1000-2000) item1
Explanation:
The first expression gets the content within the parentheses (including the parentheses themselves). The second gets the first word.
You can also do this with .str.split in a single line:
df[['result', 'newcolumn']] = df['col'].str.split(' ', n=1, expand=True)
Output:
col result newcolumn
0 item1 (30-10) item1 (30-10)
1 item2 (200-100) item2 (200-100)
2 item3 (100 FS) item3 (100 FS)
3 item4 (100+) item4 (100+)
4 item1 (1000-2000) item1 (1000-2000)
You must use expand=True if your strings have a non-uniform number of splits (see also How to split a dataframe string column into two columns?).
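To illustrate what expand=True changes, here is a minimal sketch. It assumes a pandas version that accepts n as a keyword argument, and it uses a shortened copy of the sample data; the names as_lists and as_frame are made up for this example.

```python
import pandas as pd

s = pd.Series(['item1 (30-10)', 'item3 (100 FS)'])

# Without expand, each row holds a Python list of the split parts.
as_lists = s.str.split(' ', n=1)

# With expand=True, the parts are spread over the columns of a DataFrame,
# which is what allows assigning them to two new columns at once.
as_frame = s.str.split(' ', n=1, expand=True)

print(as_lists.tolist())
print(as_frame.shape)
```

Limiting the split to n=1 keeps values such as '(100 FS)' in one piece even though they contain a space.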
EDIT: If you want to 'drop' the old column, you can also overwrite it and rename it:
df[['col', 'newcolumn']] = df['col'].str.split(' ', n=1, expand=True)
df = df.rename(columns={"col": "result"})
which gives exactly the result you specified:
result newcolumn
0 item1 (30-10)
1 item2 (200-100)
2 item3 (100 FS)
3 item4 (100+)
4 item1 (1000-2000)
You can extract the parts of interest by grouping them within one regular expression. The pattern below matches item\d* as the first group and the parentheses with everything inside them, via \(.*\), as the second one.
import pandas as pd
df = pd.DataFrame({
'col': ['item1 (30-10)', 'item2 (200-100)', 'item3 (100 FS)', 'item4 (100+)', 'item1 (1000-2000)' ]
})
pattern = r"(item\d*)\s(\(.*\))"  # raw string so the backslashes reach the regex engine unchanged
df['items'] = df['col'].str.extract(pattern)[0]
df['result'] = df['col'].str.extract(pattern)[1]
print(df)
Output:
col items result
0 item1 (30-10) item1 (30-10)
1 item2 (200-100) item2 (200-100)
2 item3 (100 FS) item3 (100 FS)
3 item4 (100+) item4 (100+)
4 item1 (1000-2000) item1 (1000-2000)
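The question also asks to keep strings that cannot be extracted as they are. One possible way to do that, sketched under the assumption that some rows may not match the pattern at all, is to fall back to the original column with fillna. The last row of the data below is a hypothetical entry added only for illustration.

```python
import pandas as pd

# The last row is a hypothetical entry without parentheses, added for illustration.
df = pd.DataFrame({'col': ['item1 (30-10)', 'item3 (100 FS)', 'item5']})

parts = df['col'].str.extract(r'(item\d*)\s(\(.*\))')

# Rows that do not match yield NaN in every group; fall back to the original text.
df['items'] = parts[0].fillna(df['col'])
df['result'] = parts[1]

print(df)
```

Here the unmatched row keeps its full text in items, while result stays NaN for it.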
Related
This is the available data:
Column A  Column B  Column C  Column D  Column E
item1     traitA    traitB    traitC    traitD
item2     traitE    traitF    traitG    traitH
item3     traitI    traitJ    traitK
item4     traitL    traitM    traitN
item5     traitO    traitP
I have a column of 5,000+ items. They all have different numbers of traits (some 2, some up to 20). Those traits are in the same row, in the columns next to the item. I already have the trait count per item and have stacked each item the right number of times, resulting in:
Column Q  Column R
item1     4
item2     4
item3     3
item4     3
item5     2
and:
Column Y  Column Z
item1
item1
item1
item1
item2
item2
item2
item2
item3
item3
item3
item4
item4
item4
item5
item5
The result I need is the following:
Column Y  Column Z
item1     traitA
item1     traitB
item1     traitC
item1     traitD
item2     traitE
item2     traitF
item2     traitG
item2     traitH
item3     traitI
item3     traitJ
item3     traitK
item4     traitL
item4     traitM
item4     traitN
item5     traitO
item5     traitP
I put this in cell Z2:
=VLOOKUP(Y2,$A:$E,2,FALSE)
This works but only for traitA, traitE, traitI, and so on (column B).
So what I need is a dynamic column index number. It needs to determine how many times 'item1' occurs in Column Y in total, and which of those occurrences the current row is.
Also when you go to the next item, the column index number has to go back to '2', since that will make the VLOOKUP work.
The column index numbers need to be as follows:
Column Y  Column Z
item1     2
item1     3
item1     4
item1     5
item2     2
item2     3
item2     4
item2     5
item3     2
item3     3
item3     4
item4     2
item4     3
item4     4
item5     2
item5     3
I don't have much experience with ROW and ROWS, and I cannot get it to work. Maybe VBA offers the best solution. Or does there also need to be a COUNTA function?
Any help would be truly appreciated. Thanks!
You can use this formula - based on a table called data
=LET(traits,data[[Column B]:[Column E]],
items,MAP(traits,LAMBDA(i,INDEX(data[Column A],ROW(i)-1))),
FILTER(HSTACK(TOCOL(items),TOCOL(traits)),TOCOL(traits)<>0))
The basic idea is to use TOCOL - for the traits that's easy.
Regarding the items, the trick is to fill a matrix of the same shape as the traits with the corresponding item for each cell, and then apply TOCOL to it as well.
(Important: you have to be on the current channel of Excel 365 to have TOCOL and HSTACK)
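For readers who would rather do this reshaping outside Excel, the same idea (pair every non-empty trait cell with the item of its row) can be sketched in pandas. This is only an illustration, assuming the sheet has already been loaded into a DataFrame; the names data, long, and col_index are made up here, and the data is a shortened copy of the sample.

```python
import pandas as pd

# Shortened copy of the sample sheet; None marks an empty cell.
data = pd.DataFrame({
    'Column A': ['item1', 'item3', 'item5'],
    'Column B': ['traitA', 'traitI', 'traitO'],
    'Column C': ['traitB', 'traitJ', 'traitP'],
    'Column D': ['traitC', 'traitK', None],
    'Column E': ['traitD', None, None],
})

# Walk the sheet row by row and emit one (item, trait) pair per filled cell.
rows = [
    (item, trait)
    for item, *traits in data.itertuples(index=False)
    for trait in traits
    if pd.notna(trait)
]
long = pd.DataFrame(rows, columns=['item', 'trait'])

# Running count per item, shifted so the first trait maps to column index 2.
long['col_index'] = long.groupby('item').cumcount() + 2

print(long)
```

The col_index column corresponds to the dynamic column index described in the question: it restarts at 2 whenever a new item begins.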
My input df is as below:
ID item1 item2 item3
1 a,b b,c b
2 a,c,f b,c b,c,f
3 g,h,i i h,i
4 j,k j,k l
df datatypes for item1, item2 and item3 are string type.
I would like to add a 4th column and transformation required is as below:
pseudo code:
Final_item = set[col(item1) + col(item2)] - item3
Basically, the last column combines item1 and item2, applies set to remove duplicates, and then subtracts the values of the item3 column.
Desired output as below:
ID item1 item2 item3 Final_item
1 a,b b,c b a,c
2 a,c,f b,c b,c,f a
3 g,h,i i h,i g
4 j,k j,k l j,k
First split item3 by commas, then join item1 and item2 with a comma and split the result as well. Finally, take the set difference for each row in a list comprehension over the zipped Series:
i3 = df['item3'].str.split(',')
i12 = (df['item1'] + ',' + df['item2']).str.split(',')
df['Final_item'] = [','.join(set(b) - set(a)) for a, b in zip(i3, i12)]
print (df)
ID item1 item2 item3 Final_item
0 1 a,b b,c b c,a
1 2 a,c,f b,c b,c,f a
2 3 g,h,i i h,i g
3 4 j,k j,k l j,k
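Because sets are unordered, the order of the letters in Final_item can vary (c,a instead of a,c in the first row). A small variation, sketched here with the sample data rebuilt by hand, sorts each difference before joining so the output is deterministic:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'item1': ['a,b', 'a,c,f', 'g,h,i', 'j,k'],
    'item2': ['b,c', 'b,c', 'i', 'j,k'],
    'item3': ['b', 'b,c,f', 'h,i', 'l'],
})

i3 = df['item3'].str.split(',')
i12 = (df['item1'] + ',' + df['item2']).str.split(',')

# sorted() turns each set into an alphabetically ordered list before joining.
df['Final_item'] = [','.join(sorted(set(b) - set(a))) for a, b in zip(i3, i12)]

print(df)
```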
I have a dataframe like this
import pandas as pd
sample = pd.DataFrame({'Col1': ['1','0','1','0'],'Col2':['0','0','1','1'],'Col3':['0','0','1','0'],'Class':['A','B','A','B']},index=['Item1','Item2','Item3','Item4'])
In [32]: print(sample)
Out [32]:
Col1 Col2 Col3 Class
Item1 1 0 0 A
Item2 0 0 0 B
Item3 1 1 1 A
Item4 0 1 0 B
I want to calculate row distances within and between the classes. First of all, I would like to calculate the distances between the rows of class A:
Item1 Item3
Item1 0 0.67
Item3 0.67 0
Secondly, distances between rows from class B
Item2 Item4
Item2 0 1
Item4 1 0
And lastly distance between different classes.
Item2 Item4
Item1 1 1
Item3 1 0.67
I have tried calculating distances with DistanceMetric one by one
from sklearn.neighbors import DistanceMetric
dist = DistanceMetric.get_metric('jaccard')
But I don't know how to do this by iterating over the rows of a large dataframe and creating these 3 different distance matrices.
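To find distances within Class A and Class B, you can use DataFrame.groupby (the distance used here is Euclidean):
dist = DistanceMetric.get_metric('euclidean')
def find_distance(group):
    # pairwise distances between the feature rows of one class
    return pd.DataFrame(dist.pairwise(group[['Col1', 'Col2', 'Col3']].astype(float).values))
df.groupby('Class').apply(find_distance)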
0 1
Class
A 0 0.000000 1.414214
1 1.414214 0.000000
B 0 0.000000 1.000000
1 1.000000 0.000000
If you only have two classes, you can separate them into two dataframes and then calculate the distances between them:
dist_cols = ['Col1', 'Col2','Col3']
df_a = df[df['Class']=='A']
df_b = df[df['Class']=='B']
distances = dist.pairwise(df_a[dist_cols].values, df_b[dist_cols].values)
distances
> array([[1. , 1.41421356],
[1.73205081, 1.41421356]])
pd.DataFrame(distances, columns = df_b.index, index = df_a.index)
Item2 Item4
Item1 1.000000 1.414214
Item3 1.732051 1.414214
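The answer above uses the Euclidean metric, while the question asks for Jaccard distances. As a rough sketch of how all three matrices could be built with Jaccard, the following computes the distance by hand with NumPy rather than through DistanceMetric. The helper jaccard_matrix and the dictionary matrices are names invented for this example, and treating two all-zero rows as distance 0 is an assumption.

```python
from itertools import combinations_with_replacement

import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {'Col1': ['1', '0', '1', '0'],
     'Col2': ['0', '0', '1', '1'],
     'Col3': ['0', '0', '1', '0'],
     'Class': ['A', 'B', 'A', 'B']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])

def jaccard_matrix(left, right):
    """Pairwise Jaccard distances between the rows of two boolean frames."""
    a = left.to_numpy()[:, None, :]
    b = right.to_numpy()[None, :, :]
    inter = (a & b).sum(axis=2)
    union = (a | b).sum(axis=2)
    # Assumption: two all-zero rows are treated as distance 0.
    dist = np.where(union == 0, 0.0, 1 - inter / np.maximum(union, 1))
    return pd.DataFrame(dist, index=left.index, columns=right.index)

# The sample stores '0'/'1' as strings, so convert to int first, then to bool.
features = sample.drop(columns='Class').astype(int).astype(bool)
groups = {name: rows for name, rows in features.groupby(sample['Class'])}

# One matrix per pair of classes, including each class with itself.
matrices = {
    (c1, c2): jaccard_matrix(groups[c1], groups[c2])
    for c1, c2 in combinations_with_replacement(sorted(groups), 2)
}

print(matrices[('A', 'B')].round(2))
```

Looping over combinations_with_replacement means the same code covers more than two classes; for a large dataframe the broadcasted arrays can get big, so a chunked or library-based computation may be preferable.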
I'm quite new to VBA and am trying to combine multiple row records in a large data dump into a single row record with multiple column headers.
The data is exported into an Excel file from another program and takes the form of:
Order Item Qty
1 Item1 2
1 Item2 5
1 Item4 1
2 Item1 1
2 Item2 2
2 Item3 5
3 Item1 4
3 Item2 5
3 Item3 1
4 Item2 2
4 Item3 1
5 Item1 1
5 Item2 1
5 Item3 1
6 Item1 4
6 Item2 4
6 Item4 2
Which would then be sorted into:
Order  Item1  Item2  Item3  Item4
1      1      4             1
2      1      2      5
3      4      5      1
4             2      1
5      1      1      1
6      4      4             2
I'm not expecting anyone to write my code but any pointers as to an overall approach would be much appreciated. Thanks!
I created a helper column and a formula to get the count. I hope this helps.
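As a pointer to the overall approach: the transformation being asked for is a pivot, with one row per order and one column per item. The sketch below shows the shape of that operation in Python with pandas rather than VBA, using only the first two orders of the sample; the names raw and wide are made up for this example.

```python
import pandas as pd

# First two orders of the sample export.
raw = pd.DataFrame({
    'Order': [1, 1, 1, 2, 2, 2],
    'Item': ['Item1', 'Item2', 'Item4', 'Item1', 'Item2', 'Item3'],
    'Qty': [2, 5, 1, 1, 2, 5],
})

# One row per order, one column per item; combinations that never occur stay NaN.
wide = raw.pivot(index='Order', columns='Item', values='Qty')

print(wide)
```

In VBA the same idea could be implemented by looping over the source rows and writing each quantity into the cell at the intersection of its order row and item column, or without code by using a PivotTable.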
I have one rawList with data, and I need to distribute its items among 3 lists. For example:
oneList twoList threeList
item1
item2
item3
item4
item5
I move item2 element to twoList
oneList twoList threeList
item1 item2
item3
item4
item5
I move item4 element to threeList
oneList twoList threeList
item1 item2 item4
item3
item5
The items could be moved by drag and drop, with ">", "<", ">>" buttons, with checkboxes, etc.
I found this solution, Basic PickList, but I need 3 lists.
Does anybody know how I can implement this?