How to split a value into rows? [duplicate] - python-3.x

This question already has answers here: Unnest (explode) a Pandas Series (8 answers)
import pandas as pd

data = {'user_id': {2: 3, 3: 4, 4: 5},
        'Q1': {2: '1', 3: '8', 4: '2'},
        'Q2': {2: '4', 3: '3', 4: '2, 3, 4'},
        'Total Traffic(MB)': {2: 261.1186, 3: 179.18564, 4: 351.99208}}
df = pd.DataFrame(data)
How do I split the last row into 3 rows with the same data but only one 'Q2' value per row?

You can split and explode:
(df.assign(Q2=df['Q2'].str.split(', '))
.explode('Q2')
)
Output:
user_id Q1 Q2 Total Traffic(MB)
2 3 1 4 261.11860
3 4 8 3 179.18564
4 5 2 2 351.99208
4 5 2 3 351.99208
4 5 2 4 351.99208
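If you want a tidy result afterwards, here is a small follow-up sketch (assuming every Q2 entry is an integer once split): reset the duplicated index that explode produces and cast Q2 back to a numeric dtype:
out = (df.assign(Q2=df['Q2'].str.split(', '))
         .explode('Q2')            # one row per Q2 value; the index repeats
         .reset_index(drop=True)   # restore a unique RangeIndex
         .astype({'Q2': int}))     # assumption: all Q2 values are integers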

Related

Removing rows from a DataFrame based on different conditions applied to a subset of the data

Here is the DataFrame I am working with. You can create it using this snippet:
import pandas as pd

my_dict = {'id': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 3, 1, 3, 3],
           'category': ['a', 'a', 'b', 'b', 'b', 'b', 'a', 'a', 'b', 'b', 'b', 'a', 'a', 'a'],
           'value': [1, 12, 34, 12, 12, 34, 12, 35, 34, 45, 65, 55, 34, 25]}
x = pd.DataFrame(my_dict)
x
I want to filter IDs based on the condition: for category a, the count of values should be 2 and for category b, the count of values should be 3. Therefore, I would remove id 1 from category a and id 3 from category b from my original dataset x.
I can write the code for each category individually and remove ids manually:
x.query('category == "a"').groupby('id').value.count().loc[lambda x: x != 2]
x.query('category == "b"').groupby('id').value.count().loc[lambda x: x != 3]
But I don't want to do this manually since there are multiple categories. Is there a better way to consider all the categories at once and remove ids based on conditions given in a list/dictionary?
If you need to filter the MultiIndex Series s by the dictionary, use Index.get_level_values with Series.map and keep the groups whose counts match via boolean indexing:
s = x.groupby(['category','id']).value.count()
d = {'a': 2, 'b': 3}
print(s[s.eq(s.index.get_level_values(0).map(d))])
category id
a 2 2
3 2
b 1 3
2 3
Name: value, dtype: int64
If you need to filter the original DataFrame:
s = x.groupby(['category','id'])['value'].transform('count')
print(s)
0 3
1 2
2 3
3 3
4 3
5 3
6 3
7 2
8 3
9 3
10 1
11 3
12 2
13 2
Name: value, dtype: int64
d = {'a': 2, 'b': 3}
print(x[s.eq(x['category'].map(d))])
id category value
1 2 a 12
2 1 b 34
3 2 b 12
4 1 b 12
5 2 b 34
7 2 a 35
8 1 b 34
9 2 b 45
12 3 a 34
13 3 a 25
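As an alternative sketch, DataFrame.groupby(...).filter expresses the same condition directly on the original frame: each group holds a single category, so its first category value looks up the required count in the dictionary, and a group is kept only when its row count matches:
d = {'a': 2, 'b': 3}
print(x.groupby(['category', 'id'])
       .filter(lambda g: len(g) == d[g['category'].iat[0]]))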

Create key value pair dict for each new line

I have a data frame with 4 columns. I need to create a {key: value} dictionary from 2 of those columns, where one {key: value} pair should be created for each separate line of the data frame. Please refer to the example below:
df>>
a b c d
0 1 2 3 4
1 9 8 7 6
Expected output>>
a b c d new-column
0 1 2 3 4 {a:1, b:2}
1 9 8 7 6 {a:9, b:8}
You can use to_dict, craft a Series, and join it to the original DataFrame:
df2 = df.join(pd.Series(df[['a','b']].to_dict('index'), name='new_column'))
Output:
a b c d new_column
0 1 2 3 4 {'a': 1, 'b': 2}
1 9 8 7 6 {'a': 9, 'b': 8}
Alternatively, you can assign the values of the dictionary returned by to_dict('index') to a new column:
df['new-column'] = df[['a','b']].to_dict('index').values()
print(df)
a b c d new-column
0 1 2 3 4 {'a': 1, 'b': 2}
1 9 8 7 6 {'a': 9, 'b': 8}
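A slightly shorter variant of the same idea is to_dict('records'), which returns one dict per row in row order, so it lines up with a default RangeIndex (a sketch under that assumption):
df = pd.DataFrame({'a': [1, 9], 'b': [2, 8], 'c': [3, 7], 'd': [4, 6]})
df['new-column'] = df[['a', 'b']].to_dict('records')   # one dict per row, in row order
print(df)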

Looking up a value in an employee/manager hierarchy using pandas

I have a situation where I would like to look up a value that is currently attached to a manager and propagate it to their respective direct reports. The data set at the manager level looks like this:
MGR_ID MGR_Value
1 Complete
2 InComplete
The employee/manager hierarchy looks like this:
EE_ID MGR_ID
3 1
4 1
5 1
6 2
7 2
The following dataset has the manager/employee hierarchy, and there I want to create a new column that stores the same value for each employee as their manager's. So the output data should look something like this:
EE_ID MGR_ID MGR_Value EE_Value
3 1 Complete Complete
4 1 Complete Complete
5 1 Complete Complete
6 2 InComplete InComplete
7 2 InComplete InComplete
How should I achieve this in pandas?
Try a merge first, then duplicate the column:
import pandas as pd
df_manage = pd.DataFrame({
    'MGR_ID': {0: 1, 1: 2},
    'MGR_Value': {0: 'Complete', 1: 'InComplete'}
})
df_hierarchy = pd.DataFrame({
    'EE_ID': {0: 3, 1: 4, 2: 5, 3: 6, 4: 7},
    'MGR_ID': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2}
})
# Merge DataFrames Together
new_df = df_hierarchy.merge(df_manage, on='MGR_ID')
# Duplicate Column
new_df["EE_Value"] = new_df['MGR_Value']
# For Display
print(new_df.to_string())
EE_ID MGR_ID MGR_Value EE_Value
0 3 1 Complete Complete
1 4 1 Complete Complete
2 5 1 Complete Complete
3 6 2 InComplete InComplete
4 7 2 InComplete InComplete
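When only the one value is needed, a merge-free sketch with Series.map does the same lookup (assuming every MGR_ID in the hierarchy exists in df_manage):
mgr_map = df_manage.set_index('MGR_ID')['MGR_Value']   # MGR_ID -> MGR_Value
df_hierarchy['MGR_Value'] = df_hierarchy['MGR_ID'].map(mgr_map)
df_hierarchy['EE_Value'] = df_hierarchy['MGR_Value']
print(df_hierarchy)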

Increment Count column by 1 based on another column

I have this data frame where I need to create a count column based on my distance column. I grouped the result by the model column. What I anticipate is an increment of 1 on the next count row each time the distance is 100. Here is what I have so far, but no success yet with the increment:
import pandas as pd

df = pd.DataFrame(
    [['A', '34', 3], ['A', '55', 5], ['A', '100', 7], ['A', '0', 1], ['A', '55', 5],
     ['B', '90', 3], ['B', '0', 1], ['B', '1', 3], ['B', '21', 1], ['B', '0', 1],
     ['C', '9', 7], ['C', '100', 4], ['C', '50', 1], ['C', '100', 6], ['C', '22', 4]],
    columns=['Model', 'Distance', 'v1'])
# my non-working attempt (callback is not defined anywhere):
df = df.groupby(['Model']).apply(lambda row: callback(row) if row['Distance'] is not None else callback(row) + 1)
print(df)
import numpy as np

(
    df.groupby('Model')
      .apply(lambda x: x.assign(Count=x.Count + x.Distance.shift()
                                                 .eq('100').replace(False, np.nan)
                                                 .ffill().fillna(0)))
    .reset_index(level=0, drop=True)
)
The result with your code solution:
Model Distance v1 Count
A 34 3 1.0
A 55 5 2.0
A 100 7 3.0
A 0 1 5.0
A 55 5 6.0
B 90 3 1.0
B 0 1 2.0
B 1 3 3.0
B 21 1 4.0
B 0 1 5.0
C 9 7 1.0
C 100 4 2.0
C 50 1 4.0
C 100 6 5.0
C 22 4 6.0
My expected result is:
Model Distance v1 Count
A 34 3 1.0
A 55 5 2.0
A 100 7 3.0
A 0 1 5.0
A 55 5 6.0
B 90 3 1.0
B 0 1 2.0
B 1 3 3.0
B 21 1 4.0
B 0 1 5.0
C 9 7 1.0
C 100 4 2.0
C 50 1 4.0
C 100 6 5.0
C 22 4 7.0
Take a look at the C group: there are two distances equal to 100.
Setup:
df = pd.DataFrame({'Model': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'B', 6: 'B', 7: 'B', 8: 'B', 9: 'B', 10: 'C', 11: 'C', 12: 'C', 13: 'C', 14: 'C'},
'Distance': {0: '34', 1: '55', 2: '100', 3: '0', 4: '55', 5: '90', 6: '0', 7: '1', 8: '21', 9: '0', 10: '9', 11: '23', 12: '100', 13: '33', 14: '23'},
'v1': {0: 3, 1: 5, 2: 7, 3: 1, 4: 5, 5: 3, 6: 1, 7: 3, 8: 1, 9: 1, 10: 7, 11: 4, 12: 1, 13: 6, 14: 4},
'Count': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 1, 6: 2, 7: 3, 8: 4, 9: 5, 10: 1, 11: 2, 12: 3, 13: 4, 14: 5}})
If the logic needs to be applied without regard to the Model groups, you can shift, compare, and add 1 for the eligible rows:
df.loc[df.Distance.shift().eq('100'), 'Count'] += 1
If the logic needs to be applied per Model group, then you can use a groupby:
(
    df.groupby('Model')
      .apply(lambda x: x.assign(Count=x.Distance.shift().eq('100') + x.Count))
    .reset_index(level=0, drop=True)
)
Based on @StringZ's updates, below is the updated solution:
(
    df.groupby('Model')
      .apply(lambda x: x.assign(Count=x.Count + x.Distance.shift()
                                                 .eq('100').replace(False, np.nan)
                                                 .ffill().fillna(0)))
    .reset_index(level=0, drop=True)
)
Model Distance v1 Count
0 A 34 3 1.0
1 A 55 5 2.0
2 A 100 7 3.0
3 A 0 1 5.0
4 A 55 5 6.0
5 B 90 3 1.0
6 B 0 1 2.0
7 B 1 3 3.0
8 B 21 1 4.0
9 B 0 1 5.0
10 C 9 7 1.0
11 C 23 4 2.0
12 C 100 1 3.0
13 C 33 6 5.0
14 C 23 4 6.0
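If the increments are meant to stack, so the second Distance of 100 in group C pushes its last Count to 7 as in the expected result, here is a sketch using a shifted cumulative sum per Model group (assuming the question's original df, where Distance is a string column and Count is rebuilt from scratch):
hit = df['Distance'].eq('100').astype(int)
# shift within each group so the row *after* a 100 gets the bump,
# then accumulate so repeated 100s keep adding up
inc = hit.groupby(df['Model']).shift(fill_value=0).groupby(df['Model']).cumsum()
df['Count'] = df.groupby('Model').cumcount() + 1 + inc
print(df)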

Trouble pivoting in pandas (spread in R)

I'm having some issues with the pd.pivot() or pivot_table() functions in pandas.
I have this:
import pandas as pd

df = pd.DataFrame({'site_id': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5: 'c',
                               6: 'a', 7: 'a', 8: 'b', 9: 'b', 10: 'c', 11: 'c'},
                   'dt': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2, 11: 2},
                   'eu': {0: 'FGE', 1: 'WSH', 2: 'FGE', 3: 'WSH', 4: 'FGE', 5: 'WSH',
                          6: 'FGE', 7: 'WSH', 8: 'FGE', 9: 'WSH', 10: 'FGE', 11: 'WSH'},
                   'kw': {0: '8', 1: '5', 2: '3', 3: '7', 4: '1', 5: '5',
                          6: '2', 7: '3', 8: '5', 9: '7', 10: '2', 11: '5'}})
df
Out[140]:
dt eu kw site_id
0 1 FGE 8 a
1 1 WSH 5 a
2 1 FGE 3 b
3 1 WSH 7 b
4 1 FGE 1 c
5 1 WSH 5 c
6 2 FGE 2 a
7 2 WSH 3 a
8 2 FGE 5 b
9 2 WSH 7 b
10 2 FGE 2 c
11 2 WSH 5 c
I want this:
dt site_id FGE WSH
1 a 8 5
1 b 3 7
1 c 1 5
2 a 2 3
2 b 5 7
2 c 2 5
I've tried everything!
df.pivot_table(index = ['site_id','dt'], values = 'kw', columns = 'eu')
or
df.pivot(index = ['site_id','dt'], values = 'kw', columns = 'eu')
should have worked. I also tried unstack():
df.set_index(['dt','site_id','eu']).unstack(level = -1)
Your last try (with unstack) works fine for me; I'm not sure why it gave you a problem. FWIW, I think it's more readable to use the index names rather than the levels, so I did it like this:
>>> df.set_index(['dt','site_id','eu']).unstack('eu')
kw
eu FGE WSH
dt site_id
1 a 8 5
b 3 7
c 1 5
2 a 2 3
b 5 7
c 2 5
But again, your way looks fine to me and is pretty much the same as what @piRSquared did (except their answer adds some more code to get rid of the multi-index).
I think the problem with pivot is that you can only pass a single column, not a list. Anyway, this works for me:
>>> df.set_index(['dt','site_id']).pivot(columns='eu')
For pivot_table, the main issue is that 'kw' is an object/character and pivot_table will attempt to aggregate with numpy.mean by default. You probably got the error message: "DataError: No numeric types to aggregate".
But there are a couple of workarounds. First, you could just convert to a numeric type and then use your same pivot_table command
>>> df['kw'] = df['kw'].astype(int)
>>> df.pivot_table(index = ['dt','site_id'], values = 'kw', columns = 'eu')
Alternatively you could change the aggregation function:
>>> df.pivot_table(index = ['dt','site_id'], values = 'kw', columns = 'eu',
...                aggfunc=sum)
That's using the fact that strings can be summed (concatenated) even though you can't take a mean of them. Really, you can use most functions here (including lambdas) that operate on strings.
Note, however, that pivot_table's aggfunc requires some sort of reduction operation even though you only have a single value per cell here, so there actually isn't anything to reduce; there is a check in the code that demands a reduction, so you have to supply one.
df.set_index(['dt', 'site_id', 'eu']).kw \
  .unstack().rename_axis(None, axis=1).reset_index()
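Putting the pieces together, here is a sketch that returns exactly the flat frame the question asks for; aggfunc='first' simply takes the single value in each cell (assuming each (dt, site_id, eu) combination is unique), so no numeric cast is needed:
out = (df.pivot_table(index=['dt', 'site_id'], columns='eu', values='kw', aggfunc='first')
         .reset_index()
         .rename_axis(columns=None))   # drop the leftover 'eu' columns name
print(out)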
