python pandas recursive lookup between manager and employee IDs - python-3.x

I have a dataframe like below
import pandas as pd
import numpy as np
raw_data = {'Emp_ID':[144,220,155,200],
'Mgr_ID': [200, 144,200,500],
'Type': ['O','I','I','I'],
'Location' : ['India','UK','UK','US']
}
df2 = pd.DataFrame(raw_data, columns = ['Emp_ID','Mgr_ID', 'Type','Location'])
print(df2)
I want to get each manager ID together with every employee ID that reports to him directly or indirectly. For example, manager 200 directly manages 144 and 155 and indirectly manages employee 220, so I want 3 separate records for manager 200 as in the output below, and the same for all other manager IDs.
Wanted output like below

Finding parents/children or relationships between IDs is a graph-theory problem, so you are better off using the networkx package (install it through pip and import it). Create a directed graph g using networkx's from_pandas_edgelist. Each manager has multiple employees directly under him/her, but each employee is assumed to have only one direct manager, so we start from Emp_ID. Call nx.ancestors for each employee using a generator expression (or a list comprehension if you prefer) and pass it to create dataframe df3. Finally, explode the Mgr_ID column of lists and join back to df2 to get the final output.
import pandas as pd
import networkx as nx
g = nx.from_pandas_edgelist(df2, source='Mgr_ID', target='Emp_ID', create_using=nx.DiGraph)
df3 = pd.DataFrame(([list(nx.ancestors(g, x)), x] for x in df2.Emp_ID),
                   index=df2.index, columns=['Mgr_ID', 'Emp_ID'])
df_final = df3.explode('Mgr_ID').join(df2[['Type', 'Location']])
Out[23]:
Mgr_ID Emp_ID Type Location
0 200 144 O India
0 500 144 O India
1 144 220 I UK
1 500 220 I UK
1 200 220 I UK
2 200 155 I UK
2 500 155 I UK
3 500 200 I US
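If you prefer to read the relationships from the manager's side instead, nx.descendants gives the same pairs from the other direction. A minimal sketch reusing g and df2 from above (the variable names df_by_mgr and rows are just illustrative):
# every (manager, direct or indirect report) pair
rows = [(m, e) for m in df2.Mgr_ID.unique() for e in nx.descendants(g, m)]
df_by_mgr = pd.DataFrame(rows, columns=['Mgr_ID', 'Emp_ID'])
# attach the employee's Type/Location, as in df_final above
df_by_mgr = df_by_mgr.merge(df2[['Emp_ID', 'Type', 'Location']], on='Emp_ID')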

Related

Using FuzzyWuzzy with pandas

I am trying to calculate the similarity between the cities in my dataframe and one static city name. (Eventually I want to iterate through a dataframe and choose the best matching city name from that dataframe, but I am testing my code on this simplified scenario.)
I am using fuzzywuzzy's token set ratio.
For some reason it calculates the first row correctly, but then seems to assign that same value to all rows.
code:
import pandas as pd
from fuzzywuzzy import fuzz
test_df= pd.DataFrame( {"City" : ["Amsterdam","Amsterdam","Rotterdam","Zurich","Vienna","Prague"]})
test_df = test_df.assign(Score = lambda d: fuzz.token_set_ratio("amsterdam",test_df["City"]))
print (test_df.shape)
test_df.head()
Result:
City Score
0 Amsterdam 100
1 Amsterdam 100
2 Rotterdam 100
3 Zurich 100
4 Vienna 100
If I do the comparison one by one it works:
print (fuzz.token_set_ratio("amsterdam","Amsterdam"))
print (fuzz.token_set_ratio("amsterdam","Rotterdam"))
print (fuzz.token_set_ratio("amsterdam","Zurich"))
print (fuzz.token_set_ratio("amsterdam","Vienna"))
Results:
100
67
13
13
Thank you in advance!
I managed to solve it by iterating through the rows:
for index, row in test_df.iterrows():
    test_df.loc[index, "Score"] = fuzz.token_set_ratio("amsterdam", test_df.loc[index, "City"])
The result is:
City Country Code Score
0 Amsterdam NL 100
1 Amsterdam NL 100
2 Rotterdam NL 67
3 Zurich NL 13
4 Vienna NL 13
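Iterating works, but the same result can be had without the explicit loop by applying the scorer to each value of the City column. A minimal sketch, assuming the same test_df as above:
test_df["Score"] = test_df["City"].apply(lambda city: fuzz.token_set_ratio("amsterdam", city))
The original assign call handed the whole City Series to fuzz.token_set_ratio at once, so the comparison was effectively against the Series' string form rather than each city, which is why every row got the same score.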

Flatten Pandas DataFrame adding a suffix

I have the following Pandas DataFrame:
POS Price Cost (...)
10122 100 20
10123 500 5
(...)
I would like to pivot rows and columns, obtaining a single line, adding a suffix as:
Price_POS10122 Cost_POS10122 Price_POS10123 Cost_POS10123 (...)
100 20 500 5
(...)
How can I achieve that?
Let's unstack:
df = df.set_index('POS').unstack().to_frame().T
df.columns = [f"{x}_POS{y}" for x, y in df.columns]
output of df:
Price_POS10122 Price_POS10123 Cost_POS10122 Cost_POS10123
0 100 500 20 5
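For reference, a self-contained sketch of the same approach, rebuilding the two-row example from the question first:
import pandas as pd
df = pd.DataFrame({'POS': [10122, 10123], 'Price': [100, 500], 'Cost': [20, 5]})
# unstack turns each (column, POS) pair into one entry, then T makes it a single row
df = df.set_index('POS').unstack().to_frame().T
df.columns = [f"{col}_POS{pos}" for col, pos in df.columns]
print(df)  # Price_POS10122 Price_POS10123 Cost_POS10122 Cost_POS10123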

Automate plotting horizontal bar chart in python grouped by a column in dataframe

I have the following dataframe:
ProgramID ProgramStartDate ProgramEndDate ProjectID ProjectStartDate ProjectEndDate
1113 8.794 9.345 101 8.7 8.98
1113 8.794 9.345 102 9.1 9.3
1114 23.3 34.5 103 25.3 37
I want to automate plotting individual horizontal bar charts grouped by ProgramID. So basically, from the above data I want two different bar charts, one for 1113 and one for 1114. Not sure how to pull that off.
Explanation: ProgramID and ProjectID have a parent-child relationship. A specific project can start or end beyond the program's start and end date range.
You can use this:
import pandas as pd
df = pd.DataFrame({
    'ProgramID': ['1113', '1113', '1114'],
    'ProgramStartDate': [8.794, 8.794, 23.3],
    'ProgramEndDate': [9.345, 9.345, 34.5],
})
df1 = df.tail(2)
df1
output:
ProgramID ProgramStartDate ProgramEndDate
1113 8.794 9.345
1114 23.300 34.500
and then:
import matplotlib.pyplot as plt
import pandas as pd
# a bar plot with ProgramID on the x-axis
df1.plot(kind='bar',x='ProgramID')
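To get one chart per ProgramID, as the question asks, you can also loop over a groupby. A minimal sketch, assuming the full dataframe from the question (including the Project columns) is available as df:
import matplotlib.pyplot as plt
# one horizontal bar chart per program, one pair of bars per project
for program_id, group in df.groupby('ProgramID'):
    group.plot(kind='barh', x='ProjectID',
               y=['ProjectStartDate', 'ProjectEndDate'],
               title=f'Program {program_id}')
    plt.show()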

Sum of all the column values in the given dataframe and display the output in a new data frame

I have tried the below code:
import pandas as pd
dataframe = pd(C1,columns=['School-A','School-B','School-C','School-D','School-E'])
sum_column = daeframet.sum(axis=0)
print (sum_column)
I am getting the below error
TypeError: 'module' object is not callable
Data:
Output:
The error is coming from calling the module pd as a function. It's difficult to know which function you should be calling from pandas without knowing what C1 is, but if it is a dictionary or a pandas data frame, try:
import pandas as pd
# common to abbreviate dataframe as df
df = pd.DataFrame(C1, columns=['School-A','School-B','School-C','School-D','School-E'])
sum_column = df.sum(axis=0)
print(sum_column)
Using sum will only return a series and not a dataframe; there are many ways you can do this. Let's try using select_dtypes and the to_frame() method:
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame({'class': ['first', 'second', 'third', 'fourth', 'fifth'],
                   'School A': np.random.randint(1, 50, 5),
                   'School B': np.random.randint(1, 50, 5),
                   'School C': np.random.randint(1, 50, 5),
                   'School D': np.random.randint(1, 50, 5),
                   'School E': np.random.randint(1, 50, 5)})
print(df)
class School A School B School C School D School E
0 first 36 10 49 16 14
1 second 15 9 31 40 12
2 third 48 37 17 17 2
3 fourth 39 40 8 28 48
4 fifth 17 28 13 45 31
new_df = (df.select_dtypes(include='int').sum(axis=0).to_frame()
          .reset_index().rename(columns={0: 'Total', 'index': 'School'}))
print(new_df)
School Total
0 School A 155
1 School B 124
2 School C 118
3 School D 146
4 School E 107
Edit
It seems like there are some typos in your code:
import pandas as pd
dataframe = pd.DataFrame(C1,columns=['School-A','School-B','School-C','School-D','School-E'])
sum_column = dataframe.sum(axis=0)
print (sum_column)
This will return the sum as a series, and will also sum the text columns by way of string concatenation:
class firstsecondthirdfourthfifth
School A 155
School B 124
School C 118
School D 146
School E 107
dtype: object
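If the concatenated text in the class row is unwanted, restricting the sum to numeric columns avoids it. A minimal sketch using the df built above:
sum_column = df.sum(axis=0, numeric_only=True)
print(sum_column)
# School A    155
# School B    124
# ...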

When I download data and convert it into a DataFrame, I lose the first column with dates

I use quandl to download stock prices. I have a list of company names, and I download all the information. After that, I convert it into a data frame. When I do it for only one company everything works well, but when I try to do it for all of them at the same time something goes wrong. The first column with dates is converted into an index with values from 0 to 3 instead of the dates.
My code looks like below:
import quandl
import pandas as pd
names_of_company = ['11BIT', 'ABCDATA', 'ALCHEMIA']
for names in names_of_company:
    x = quandl.get('WSE/%s' % names, start_date='2018-11-29',
                   end_date='2018-11-29',
                   paginate=True)
    x['company'] = names
    results = results.append(x).reset_index(drop=True)
Actual results looks like below:
Index Open High Low Close %Change Volume # of Trades Turnover (1000) company
0 204.5 208.5 204.5 206.0 0.73 3461.0 105.0 717.31 11BIT
1 205.0 215.0 202.5 214.0 3.88 10812.0 392.0 2254.83 ABCDATA
2 215.0 215.0 203.5 213.0 -0.47 12651.0 401.0 2656.15 ALCHEMIA
But I expected:
Date Open High Low Close %Change Volume # of Trades Turnover (1000) company
2018-11-29 204.5 208.5 204.5 206.0 0.73 3461.0 105.0 717.31 11BIT
2018-11-29 205.0 215.0 202.5 214.0 3.88 10812.0 392.0 2254.83 ABCDATA
2018-11-29 215.0 215.0 203.5 213.0 -0.47 12651.0 401.0 2656.15 ALCHEMIA
So as you can see, there is an issue with the dates because they are not converted the correct way. But as I said, if I do it for only one company, it works. Below is the code:
x = quandl.get('WSE/11BIT', start_date='2019-01-01', end_date='2019-01-03')
df = pd.DataFrame(x)
I will be very grateful for any help ! Thanks All
When you store it to a dataframe, the date is your index. You lose it because when you use .reset_index() you overwrite the old index (the date), and instead of the date being added as a column, you tell pandas to drop it with .reset_index(drop=True).
So I'd still append, but once the whole results dataframe is populated, I'd reset the index WITHOUT dropping it, by doing either results = results.reset_index(drop=False) or results = results.reset_index(), since the default is False.
import quandl
import pandas as pd
names_of_company = ['11BIT', 'ABCDATA', 'ALCHEMIA']
results = pd.DataFrame()
for names in names_of_company:
    x = quandl.get('WSE/%s' % names, start_date='2018-11-29',
                   end_date='2018-11-29',
                   paginate=True)
    x['company'] = names
    results = results.append(x)
results = results.reset_index(drop=False)
Output:
print (results)
Date Open High ... # of Trades Turnover (1000) company
0 2018-11-29 269.50 271.00 ... 280.0 1822.02 11BIT
1 2018-11-29 0.82 0.92 ... 309.0 1027.14 ABCDATA
2 2018-11-29 4.55 4.55 ... 1.0 0.11 ALCHEMIA
[3 rows x 10 columns]
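Note that DataFrame.append has since been deprecated and removed in newer pandas versions; collecting the frames in a list and concatenating once at the end gives the same result:
frames = []
for names in names_of_company:
    x = quandl.get('WSE/%s' % names, start_date='2018-11-29',
                   end_date='2018-11-29',
                   paginate=True)
    x['company'] = names
    frames.append(x)
results = pd.concat(frames).reset_index(drop=False)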
