There have been a lot of posts concerning splitting a single column into multiples, but I couldn't find an answer to a slight modification to the idea of splitting.
When you use str.split, it splits the string independent of order. You can modify it to be slightly more complex, such as ordering it by sorting alphabetically
e.x. dataframe (df)
row
0 a, e, c, b
1 b, d, a
2 a, b, c, d, e
3 d, f
foo = df['row'].str.split(',')
will split based on the comma and return:
0 1 2 3
0 a e c b
....
However that doesn't align the results by their unique value. Even if you use a sort on the split string, it will still only result in this:
0 1 2 3 4 5
0 a b c e
1 a b d
...
whereas I want it to look like this:
0 1 2 3 4 5
0 a b c e
1 a b d
2 a b c d e
...
I know I'm missing something. Do I need to add the columns first and then map the split values to the correct column? What if you don't know all of the unique values? Still learning pandas syntax so any pointers in the right direction would be appreciated.
Using get_dummies
s=df.row.str.get_dummies(sep=' ,')
s.mul(s.columns)
Out[239]:
a b c d e f
0 a b c e
1 a b d
2 a b c d e
3 d f
Related
If I had data in rows A to E as seen below in the table. Some of the values can be NA. IN column F if i wanted to input data from columns A to E in a way that if data in A exists use that otherwise if data in B exists use that otherwise until column E. If none of them have any values return NA. I would like to automate this where somewhere I just specify the order for example A, B, C, D and E OR A, C, E, D, B and the values in F update according to the reference table
Reference : C - B - A - E - D
a
b
c
d
e
f
3
4
3
2
2
7
1
7
NA
1
4
2
4
2
2
4
2
2
Use FILTER() with # operator.
=#FILTER(A2:E2,A2:E2<>"","NA")
For dynamic array approach (spill results automatically), try-
=BYROW(A2:E7,LAMBDA(x,INDEX(FILTER(x,x<>"","NA"),1,1)))
This is the original series. I'm trying to replace values of the non top 2 in the series with 'Other'.
Original Series(ser3):
b 8
c 6
a 5
h 4
g 2
d 2
f 2
e 1
This is my extracted top 2.
Top 2:
t2 = ((ser3.value_counts().head(2)))
b 8
c 6
Expected Output:
b 8
c 6
a Other
h Other
g Other
d Other
f Other
e Other
How can I do that? I do not want to convert to dictionary and replace the values by indexing. I prefer to do it by Series. I tried using .isin, but my code gives me an error.
a[a[~a.isin(t2)].index]='Other'
The above gives me an error.
You are close, need select t2.index and remove outer ser3[]:
ser3[~ser3.isin(t2.index)]='Other'
I'd like to do a search on all columns (except the first column !) of a DataFrame and add a new column (like 'Column_Match') with the name of the matching column.
I tried something like this:
df.apply(lambda row: row.astype(str).str.contains('my_keyword').any(), axis=1)
But it's not excluding the first column and I don't know how to return and add the column name.
Any help much appreciated !
If want columns name of first matched value per rows add new column for match not exist values by DataFrame.assign and DataFrame.idxmax for column name:
df = pd.DataFrame({
'B':[4,5,4,5,5,4],
'A':list('abcdef'),
'C':list('akabbe'),
'F':list('eakbbb')
})
f = lambda row: row.astype(str).str.contains('e')
df['new'] = df.iloc[:,1:].apply(f, axis=1).assign(missing=True).idxmax(axis=1)
print (df)
B A C F new
0 4 a a e F
1 5 b k a missing
2 4 c a k missing
3 5 d b b missing
4 5 e b b A
5 4 f e b C
If need all columns names of all matched values create boolean DataFrame and use dot product with columns names by DataFrame.dot and Series.str.rstrip:
f = lambda row: row.astype(str).str.contains('a')
df1 = df.iloc[:,1:].apply(f, axis=1)
df['new'] = df1.dot(df.columns[1:] + ', ').str.rstrip(', ').replace('', 'missing')
print (df)
B A C F new
0 4 a a e A, C
1 5 b k a F
2 4 c a k C
3 5 d b b missing
4 5 e b b missing
5 4 f e b missing
I was wondering if there was a way to count the number of values by category. Example:
A 3
A 3
A 3
B 4
B 4
B 4
B 4
C 5
C 5
C 5
C 5
C 5
D 2
D 2
What is happening there is that there are 5 categories "A, B, C, D" and there are different counts of it. Duplicate values. I would like to create a new column and output the number of times it occurs in a different column as shown above. Please no VBA as i don't know it.
Try this...
=IF(A2<>A1,COUNTIF(A:A,A2),"")
Below is a sample of the data I have. I want to match the data in Column A and B. If column B is not matching column A, I want to add a row and copy the data from Column A to B. For example, "4" is missing in column B, so I want to add a space and add "4" to column B so it will match column A. I have a large set of data, so I am trying to find a different way instead of checking for duplicate values in the two columns and manually adding one row at a time. Thanks!
A B C D
3 3 Y B
4 5 G B
5 6 B G
6 8 P G
7 9 Y P
8 11 G Y
9 12 B Y
10
11
12
11
12
I would move col B,C,D to a separate columns, say E,F,G, then using index matches against col A and col B identify which records are missing.
For col C: =IFERROR(INDEX(F:F,Match(A1,E:E,0)),"N/A")
For col D: =IFERROR(INDEX(G:G,Match(A1,E:E,0)),"N/A")
Following this you can filter for C="N/A" to identify cases where a B value is missing for an A value, and manually edit. Since you want A & B to be matching here col B is unnecessary, final result w/ removing col B and C->B, D->C:
A B C
3 Y B
4 N/A N/A
5 G B
6 B G
7 N/A N/A
Hope this helps!