Looking up a value in an employee/manager hierarchy using pandas - python-3.x

I have a situation where I would like to look up a value that is currently attached to a manager and assign it to each of that manager's direct reports. The data set at the manager level looks like this:
1 Complete
2 InComplete
The employee/manager hierarchy looks like this:
3 1
4 1
5 1
6 2
7 2
In the dataset that holds the manager/employee hierarchy, I want to create a new column that stores, for each employee, the same value as their manager's. So the output data should look something like this:
3 1 Complete Complete
4 1 Complete Complete
5 1 Complete Complete
6 2 InComplete InComplete
7 2 InComplete InComplete
How should I achieve this in pandas?

Try merge first, then duplicate the column:
import pandas as pd
df_manage = pd.DataFrame({
    'MGR_ID': {0: 1, 1: 2},
    'MGR_Value': {0: 'Complete', 1: 'InComplete'}
})
df_hierarchy = pd.DataFrame({
    'EE_ID': {0: 3, 1: 4, 2: 5, 3: 6, 4: 7},
    'MGR_ID': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2}
})
# Merge DataFrames Together
new_df = df_hierarchy.merge(df_manage, on='MGR_ID')
# Duplicate Column
new_df["EE_Value"] = new_df['MGR_Value']
# For Display
print(new_df)
EE_ID MGR_ID MGR_Value EE_Value
0 3 1 Complete Complete
1 4 1 Complete Complete
2 5 1 Complete Complete
3 6 2 InComplete InComplete
4 7 2 InComplete InComplete
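As an alternative to merging, the same lookup could be sketched with Series.map. This is a swapped-in technique rather than the approach of the answer above; it reuses that answer's column names, and the name lookup is hypothetical. It assumes each MGR_ID appears only once in df_manage.

```python
import pandas as pd

df_manage = pd.DataFrame({
    'MGR_ID': [1, 2],
    'MGR_Value': ['Complete', 'InComplete'],
})
df_hierarchy = pd.DataFrame({
    'EE_ID': [3, 4, 5, 6, 7],
    'MGR_ID': [1, 1, 1, 2, 2],
})

# Build a lookup Series indexed by manager ID (assumes MGR_ID is unique here).
lookup = df_manage.set_index('MGR_ID')['MGR_Value']

# Map every employee's manager ID to that manager's value.
# Employees whose manager is missing from the lookup would get NaN.
df_hierarchy['EE_Value'] = df_hierarchy['MGR_ID'].map(lookup)
print(df_hierarchy)
```

Unlike the merge, this keeps df_hierarchy's row order and length untouched and adds only the one new column.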


How to create a separate df after applying groupby?

I have a df as follows:
Product Step
1 1
1 3
1 6
1 6
1 8
1 1
1 4
2 2
2 4
2 8
2 8
2 3
2 1
3 1
3 3
3 6
3 6
3 8
3 1
3 4
What I would like to do is to:
For each Product, every Step must be grabbed and the order must not be changed, that is, if we look at Product 1, after Step 8, there is a 1 coming and that 1 must be after 8 only. So, the expected output for product 1 and product 3 should be of the order: 1, 3, 6, 8, 1, 4; for the product 2 it must be: 2, 4, 8, 3, 1.
Here, I only want one value of 6 for products 1 and 3, since in the main df the two 6s are next to each other, but both values of 1 must be present since they are not next to each other.
Once the first step is done, the products with the same Steps must be grouped together into a new df (in the below example: Product 1 and 3 have same Steps, so they must be grouped together)
What I have done:
import pandas as pd
sid = pd.DataFrame(data.groupby('Product').apply(lambda x: x['Step'].unique())).reset_index()
But it is yielding a result like:
Product 0
0 1 [1 3 6 8 4]
1 2 [2 4 8 3 1]
2 3 [1 3 6 8 4]
which is not the result I want. I would like the value for the first and third product to be [1 3 6 8 1 4].
IIUC, create the Newkey by using diff and cumsum (transform keeps the result aligned with the original rows):
df['Newkey'] = df.groupby('Product').Step.transform(lambda x: x.diff().ne(0).cumsum())
(1, 3, 6, 8, 1, 4) [1, 3]
(2, 4, 8, 3, 1) [2]
Name: Product, dtype: object
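The Newkey idea can be sketched end to end as follows. This assumes the question's data is held in df; the drop_duplicates step and the names dedup and sequences are assumptions added for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': [1] * 7 + [2] * 6 + [3] * 7,
    'Step': [1, 3, 6, 6, 8, 1, 4,
             2, 4, 8, 8, 3, 1,
             1, 3, 6, 6, 8, 1, 4],
})

# A new key starts whenever Step differs from the previous row of the same Product.
df['Newkey'] = df.groupby('Product')['Step'].transform(
    lambda x: x.diff().ne(0).cumsum()
)

# Keep one row per run of equal consecutive steps.
dedup = df.drop_duplicates(['Product', 'Newkey'])

# Collect the ordered steps of each product as a hashable tuple.
sequences = {p: tuple(g.tolist()) for p, g in dedup.groupby('Product')['Step']}
print(sequences)
```

Because the key only changes when the step changes, the repeated 6 collapses into one row while the two separate 1s keep distinct keys.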
groupby preserves the order of rows within a group, so there isn't much need to worry about the rows shifting.
A straightforward, but not very performant, solution would be to apply(tuple): tuples are hashable, which allows you to group on them to see which Products are identical. form_seq makes sure that consecutive repeated values appear only once in the list of steps before the tuple is formed.
def form_seq(x):
    # keep a step only if it differs from the step directly before it
    x = x[x != x.shift()]
    return tuple(x)
s = df.groupby('Product').Step.apply(form_seq)
#{(1, 3, 6, 8, 1, 4): Int64Index([1, 3], dtype='int64', name='Product'),
# (2, 4, 8, 3, 1): Int64Index([2], dtype='int64', name='Product')}
Or if you'd like a DataFrame:
#(1, 3, 6, 8, 1, 4) [1, 3]
#(2, 4, 8, 3, 1) [2]
#Name: Product, dtype: object
The values of that dictionary are the groupings of products that share the step sequence (given by the dictionary keys). Products 1 and 3 are grouped together by the step sequence 1, 3, 6, 8, 1, 4.
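To get from those groupings to separate DataFrames, one possible sketch is shown below. It assumes the question's data is held in df and builds the groupings with a plain dictionary instead of a second groupby; the names groups and frames are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': [1] * 7 + [2] * 6 + [3] * 7,
    'Step': [1, 3, 6, 6, 8, 1, 4,
             2, 4, 8, 8, 3, 1,
             1, 3, 6, 6, 8, 1, 4],
})

def form_seq(x):
    # Drop a step when it equals the step directly before it.
    x = x[x != x.shift()]
    return tuple(x.tolist())

# Map each step sequence to the list of products that follow it.
groups = {}
for product, steps in df.groupby('Product')['Step']:
    groups.setdefault(form_seq(steps), []).append(product)

# One DataFrame per shared step sequence, holding all rows of its products.
frames = {seq: df[df['Product'].isin(products)] for seq, products in groups.items()}
for seq, frame in frames.items():
    print(seq, frame['Product'].unique())
```

Each value in frames is an ordinary DataFrame, so products 1 and 3 end up together in one frame and product 2 in another.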
Another very similar way:

Pandas aggregate column and keep header

I have code which works but gives me data without a header. Is there a way I can write this code so the header is not removed? I know one way would be to add the header back, but is there a better way?
My code:
df = pd.read_csv("_data.csv", skiprows=[0], header=None)
df = df.groupby([2])[10].sum().astype(float)
1 2
1 1
2 3
2 4
I have data like the above and am trying to get this result:
1 3
2 7
Try to use the function reset_index after the sum:
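import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 1, 'b': 1}, {'a': 2, 'b': 3}, {'a': 2, 'b': 4}]
df = pd.DataFrame(data)
print(df)
a b
0 1 2
1 1 1
2 2 3
3 2 4
# sum per group, then turn the group key back into a regular column
df = df.groupby('a').sum().reset_index()
print(df)
a b
0 1 3
1 2 7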
You should specify the separator (several spaces in your case) and that the header is the first row (=0, with Python indexing), then group by the column you want.
df = pd.read_csv("_data.csv", sep=r'\s+', header=0)
0 1 2
1 1 1
2 2 3
3 2 4
df = df.groupby(['A']).sum()
1 3
2 7
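A self-contained sketch of keeping the header through the aggregation. The column names A and B and the inline CSV text are assumptions standing in for the real _data.csv; as_index=False is shown as an alternative to calling reset_index afterwards.

```python
import io
import pandas as pd

# Stand-in for _data.csv: a header row followed by whitespace-separated values.
csv_text = "A B\n1 2\n1 1\n2 3\n2 4\n"
df = pd.read_csv(io.StringIO(csv_text), sep=r'\s+', header=0)

# as_index=False keeps the group key as a regular column with its header.
result = df.groupby('A', as_index=False)['B'].sum()
print(result)
```

The result is a DataFrame with both column names intact, so no header has to be added back by hand.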

Filter simultaneously by different values of rows Pandas

I have a huge dataframe with product_id values and their property_id values. Note that each property is stored on its own row. I need to filter simultaneously by different property_id values for each product_id. Is there any way to do it fast?
product_id property_id
0 3588 1
1 3588 2
2 3588 5
3 3589 1
4 3589 3
5 3589 5
6 3590 1
7 3590 2
8 3590 5
For example, I want to filter each product_id by two properties that are assigned on different rows, something like out_df.loc[(out_df['property_id'] == 1) & (out_df['property_id'] == 2)], except that this condition is evaluated per row and therefore never matches.
I need something like that but working at the same time for all rows of each product_id column.
I know that it can be done via groupby into lists
3587 [2, 1, 5]
3588 [1, 3, 5]
3590 [1, 2, 5]
and finding intersections inside lists.
gp_df.apply(lambda r: {1, 2} < (set(r['property_id'])), axis=1)
But it takes time, whereas ordinary Pandas filtering is heavily optimized for speed (I believe it uses some clever direct and inverted indexes internally, as search engines like ElasticSearch, Sphinx etc. do).
Expected output: the products that have both properties 1 and 2.
3587 [2, 1, 5]
3590 [1, 2, 5]
Since this is just as much a performance as a functional question, I would go with an intersection approach like this:
from time import time
import pandas as pd
df = pd.DataFrame({'product_id': [3588, 3588, 3588, 3589, 3589, 3589, 3590, 3590, 3590],
                   'property_id': [1, 2, 5, 1, 3, 5, 1, 2, 5]})
# index by property so each property's products can be looked up directly
df = df.set_index(['property_id'])
print("The full DataFrame:")
print(df)
start = time()
for i in range(1000):
    s1 = df.loc[(1), 'product_id']
    s2 = df.loc[(2), 'product_id']
    s_done = pd.Series(list(set(s1).intersection(set(s2))))
print("Elapsed seconds:", time() - start)
print("Overlapping product_id's")
print(s_done)
Iterating the lookup 1000 times takes 0.93 seconds on my ThinkPad T450s. I took the liberty of testing @jezrael's two suggestions, and they come in at 2.11 and 2.00 seconds. The groupby approach is, software engineering wise, more elegant though.
Depending on the size of your data set and the importance of performance, you can also switch to more simple datatypes, like classic dictionaries and gain further speed.
Jupyter Notebook can be found here: pandas_fast_lookup_using_intersection.ipynb
Do you mean something like this?
result = out_df.loc[out_df['property_id'].isin([1,2]), :]
If you want you can then drop duplicates based on product_id...
The simplest is to use GroupBy.transform and compare sets:
s = {1, 2}
a = df[df.groupby('product_id')['property_id'].transform(lambda r: s < set(r))]
print (a)
product_id property_id
0 3588 1
1 3588 2
2 3588 5
6 3590 1
7 3590 2
8 3590 5
Another solution is to keep only the rows whose values are in the set, removing duplicates first:
df1 = df[df['property_id'].isin(s) & ~df.duplicated(['product_id', 'property_id'])]
With this solution it is then necessary to check whether the length of each group is the same as the length of the set:
import numpy as np
f, u = df1['product_id'].factorize()
ids = df1.loc[np.bincount(f)[f] == len(s), 'product_id'].unique()
Last, filter all rows whose product_id meets the condition:
a = df[df['product_id'].isin(ids)]
print (a)
product_id property_id
0 3588 1
1 3588 2
2 3588 5
6 3590 1
7 3590 2
8 3590 5
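A variant of the same idea without a per-group lambda, counting the distinct wanted properties per product. This is a sketch on the question's sample data; wanted, hits, and ids are names chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'product_id': [3588] * 3 + [3589] * 3 + [3590] * 3,
    'property_id': [1, 2, 5, 1, 3, 5, 1, 2, 5],
})
wanted = {1, 2}

# Keep only rows with a wanted property, then count distinct ones per product.
hits = (
    df[df['property_id'].isin(wanted)]
    .groupby('product_id')['property_id']
    .nunique()
)

# Products that carry every wanted property.
ids = hits[hits == len(wanted)].index

result = df[df['product_id'].isin(ids)]
print(result)
```

Because nunique counts distinct values, repeated (product_id, property_id) pairs do not need to be removed first.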

Why is order of data items reversed while creating a pandas series?

I am new to python and pandas so please bear with me. I tried searching the answer everywhere but couldn't find it. Here's my question:
This is my input code:
list = [1, 2, 3, 1, 2, 3]
s = pd.Series([1, 2, 3, 10, 20, 30], list)
The output is:
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
Now, my question is why the "list" is coming before the first list specified while creating the series? I tried running the same code multiple times to check if the series creation is orderless. Any help would be highly appreciated.
Python Version:
Python 3.6.0
Pandas Version:
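I think you overlooked the index parameter: the second positional argument of Series is the index, which is displayed as the first column. So your Series construction is equivalent to:
# don't use list as a variable name, because it shadows the Python built-in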
L = [1, 2, 3, 1, 2, 3]
s = pd.Series(data=[1, 2, 3, 10, 20, 30], index=L)
print (s)
1 1
2 2
3 3
1 10
2 20
3 30
dtype: int64
You can also check Series documentation.
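A short sketch of the point above: the first positional argument of pd.Series is the data and the second is the index, so the labels are printed to the left of the values. The name labels is hypothetical.

```python
import pandas as pd

labels = [1, 2, 3, 1, 2, 3]
# Passing labels positionally is the same as writing index=labels.
s = pd.Series([1, 2, 3, 10, 20, 30], labels)

print(s.index.tolist())   # the labels, shown as the left-hand column
print(s.tolist())         # the data, shown as the right-hand column
print(s.loc[1].tolist())  # every value stored under the label 1
```

Nothing is reordered: each value stays paired with the label at the same position, and duplicate labels are allowed.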

Trouble pivoting in pandas (spread in R)

I'm having some issues with the pd.pivot() or pivot_table() functions in pandas.
I have this:
df = pd.DataFrame({'site_id': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c', 5:
'c',6: 'a', 7: 'a', 8: 'b', 9: 'b', 10: 'c', 11: 'c'},
'dt': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1,6: 2, 7: 2, 8: 2, 9: 2, 10: 2, 11: 2},
'eu': {0: 'FGE', 1: 'WSH', 2: 'FGE', 3: 'WSH', 4: 'FGE', 5: 'WSH',6: 'FGE', 7: 'WSH', 8: 'FGE', 9: 'WSH', 10: 'FGE', 11: 'WSH'},
'kw': {0: '8', 1: '5', 2: '3', 3: '7', 4: '1', 5: '5',6: '2', 7: '3', 8: '5', 9: '7', 10: '2', 11: '5'}})
dt eu kw site_id
0 1 FGE 8 a
1 1 WSH 5 a
2 1 FGE 3 b
3 1 WSH 7 b
4 1 FGE 1 c
5 1 WSH 5 c
6 2 FGE 2 a
7 2 WSH 3 a
8 2 FGE 5 b
9 2 WSH 7 b
10 2 FGE 2 c
11 2 WSH 5 c
I want this:
dt site_id FGE WSH
1 a 8 5
1 b 3 7
1 c 1 5
2 a 2 3
2 b 5 7
2 c 2 5
I've tried everything!
df.pivot_table(index = ['site_id','dt'], values = 'kw', columns = 'eu')
df.pivot(index = ['site_id','dt'], values = 'kw', columns = 'eu')
should have worked. I also tried unstack():
df.set_index(['dt','site_id','eu']).unstack(level = -1)
Your last try (with unstack) works fine for me, I'm not sure why it gave you a problem. FWIW, I think it's more readable to use the index names rather than levels, so I did it like this:
>>> df.set_index(['dt','site_id','eu']).unstack('eu')
dt site_id
1 a 8 5
b 3 7
c 1 5
2 a 2 3
b 5 7
c 2 5
But again, your way looks fine to me and is pretty much the same as what @piRSquared did (except their answer adds some more code to get rid of the multi-index).
I think the problem with pivot is that you can only pass a single variable, not a list? Anyway, this works for me:
>>> df.set_index(['dt','site_id']).pivot(columns='eu')
For pivot_table, the main issue is that 'kw' is an object/character and pivot_table will attempt to aggregate with numpy.mean by default. You probably got the error message: "DataError: No numeric types to aggregate".
But there are a couple of workarounds. First, you could just convert to a numeric type and then use your same pivot_table command:
>>> df['kw'] = df['kw'].astype(int)
>>> df.pivot_table(index = ['dt','site_id'], values = 'kw', columns = 'eu')
Alternatively you could change the aggregation function:
>>> df.pivot_table(index = ['dt','site_id'], values = 'kw', columns = 'eu',
...                aggfunc=sum)
That's using the fact that strings can be summed (concatenated) even though you can't take a mean of them. Really, you can use most functions here (including lambdas) that operate on strings.
Note, however, that pivot_table's aggfunc requires some sort of reduction operation here even though you only have a single value per cell, so there actually isn't anything to reduce! But there is a check in the code that requires a reduction operation, so you have to do one.
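Since each cell holds exactly one value, a reduction that simply takes that value also satisfies the check. A minimal sketch, assuming aggfunc='first' is acceptable here (it is not one of the options named above), with the question's data written out in row order and wide as a hypothetical name:

```python
import pandas as pd

df = pd.DataFrame({
    'site_id': ['a', 'a', 'b', 'b', 'c', 'c'] * 2,
    'dt': [1] * 6 + [2] * 6,
    'eu': ['FGE', 'WSH'] * 6,
    'kw': ['8', '5', '3', '7', '1', '5', '2', '3', '5', '7', '2', '5'],
})

wide = (
    # 'first' just picks the single value in each (dt, site_id, eu) cell.
    df.pivot_table(index=['dt', 'site_id'], columns='eu', values='kw', aggfunc='first')
      .reset_index()              # turn dt and site_id back into columns
      .rename_axis(None, axis=1)  # drop the leftover 'eu' label on the columns
)
print(wide)
```

The kw values stay strings, so no numeric conversion is needed for this route.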
df.set_index(['dt', 'site_id', 'eu']).kw \
    .unstack().rename_axis(None, axis=1).reset_index()
