I have no idea how to start but I am using information in a file called ('emp_details') which I keep opening when coding can anyone help me?
def total_salary():
file=open('emp_details.txt','r')
tsalary = 0
for line in file:
line.split('\n')
item = line.split(',')
salary: List(str) = item[1:4]
tsalary = tsalary + str(salary)
print('The total salary for all employees in the system is: ',tsalary)
anykey = input("Enter any key to return main menu")
mainMenu()
I have a file called 'emp_details' this contain all the information including the salary of each employee. I am trying to calculate the total salary of all employees.
this is the file that has been read in.
#EMP_NO, EMP_NAME, AGE, POSITION, SALARY, YRS_EMP
001, Peter Smyth, 26, Developer, 29000, 4
002, Samuel Jones, 23, Developer, 24000, 1
003, Laura Stewart, 41, DevOps, 42000, 15
004, Paul Jones, 24, Analyst, 21000, 2
005, Simon Brown, 52, Developer, 53000, 18
006, George Staples, 42, Tester, 42000, 12
007, Greg Throne, 57, DevOps, 50000, 23
008, Aston Bently, 27, Tester, 33000, 5
009, Ben Evans, 32, DevOps, 38000, 2
010, Emma Samson, 23, DevOps, 22000, 1
011, Stephanie Beggs, 43, Tester, 19000, 9
012, Sarah McQuillin, 47, DevOps, 23000, 5
013, Grace Corrigan, 48, Analyst, 44000, 16
014, Simone Mills, 32, DevOps, 32000, 11
015, Martin Montgomery, 28, Analyst, 28000, 3
it seems you are trying to reinvent a wheel of sorts(even if you don't know there is a wheel). If you utilize pandas and import as a csv, which this seems to be, this can be done in 3 lines;
import pandas as pd
f = pd.read_csv('emp_details.txt', sep=',', names = ['EMP_NO', 'EMP_NAME', 'AGE', 'POSITION', 'SALARY', 'YRS_EMP'])
df = pd.DataFrame(f)
Total = df['SALARY'].sum()
edit: after some testing I realize the txt extension makes it necessary to skip the first line, so it turns into;
import pandas as pd
f = pd.read_csv('emp_details.txt', sep=',',skiprows = [0], names = ['EMP_NO', 'EMP_NAME', 'AGE', 'POSITION', 'SALARY', 'YRS_EMP'])
df = pd.DataFrame(f)
Total = df['SALARY'].sum()
Related
Suppose I have a DataFrame
import pandas as pd
df = pd.DataFrame(columns=['Year', 'Player', 'Team', 'TeamName', 'Games', 'Pts', 'Assist', 'Rebound'],data=[[2015, 'Curry', 'GSW', 'Warriors', 79, 30.1, 6.7, 5.4],
[2016, 'Curry', 'GSW', 'Warriors', 79, 25.3, 6.6, 4.5],
[2017, 'Curry', 'GSW', 'Warriors', 51, 26.4, 6.1, 5.1],
[2015, 'Durant', 'OKC', 'Thunder', 72, 28.2, 5.0, 8.2],
[2016, 'Durant', 'GSW', 'Warriors', 62, 25.1, 4.8, 8.3],
[2017, 'Durant', 'GSW', 'Warriors', 68, 26.4, 5.4, 6.8],
[2015, 'Ibaka', 'OKC', 'Thunder', 78, 12.6, 0.8, 6.8],
[2016, 'Ibaka', 'ORL', 'Magic', 56, 15.1, 1.1, 6.8],
[2016, 'Ibaka', 'TOR', 'Raptors', 23, 14.2, 0.7, 6.8]])
If I use
df.melt(id_vars=['Year', 'Player','Team','TeamName'])
I got a melted version of this df. I am trying to use
df.stack(), df.unstack(), df.set_index(), df.reset_index() to get the same output as the melted version, but I could not get it done.
Any suggestion on how to generate the same output from stack, unstack, set_index, reset_index as the melted version? (Any other method beside these 4 cannot be used.)
Here is the most recent attempt. I don't care the column names. But the value should be aligned. I almost got it but the values are still swapped.
df.set_index(['Year', 'Player', 'Team', 'TeamName']).stack().reset_index()
Thanks.
You can set_index + stack + reset_index to reproduce what a melt does. melt also renames the columns so we can take care of that too. And to deal with the ordering we'll need to do some modulus division.
This doesn't have all the bells and whistles of melt, but gets the basic job done for specifying only id_vars
import numpy as np
id_vars = ['Year', 'Player', 'Team', 'TeamName']
N = len(id_vars) # For renaming
N_ord = len(df.columns)-N # For reordering
df = (df.set_index(id_vars)
.stack(dropna=False) # Melt keeps `NaN`
.reset_index()
.rename(columns={0: 'value', f'level_{N}': 'variable'}))
# Reorder to match `melt`
df = df.iloc[np.argsort(df.index%N_ord, kind='mergesort')].reset_index(drop=True)
Year Player Team TeamName variable value
0 2015 Curry GSW Warriors Games 79.0
1 2016 Curry GSW Warriors Games 79.0
2 2017 Curry GSW Warriors Games 51.0
3 2015 Durant OKC Thunder Games 72.0
...
34 2016 Ibaka ORL Magic Rebound 6.8
35 2016 Ibaka TOR Raptors Rebound 6.8
df.set_index(['Year', 'Player','Team','TeamName'], inplace=True)
df = df.stack().rename('value').reset_index()
df = df.rename({'level_4': 'variable'})
Now df is of the same shape as with the melt operation. With melt you get a column named 'variable' which in this case holds the labels 'games', 'pts', and the rest. With stack.reset that column ends up being called 'level_4'. The rename fixes that.
I'm trying to extract pages from a PDF that is 1000 pages long but I only need pages in the pattern of [9,10,17,18,25,26,33,34,...etc]. These numbers can be represented in the formula: pg = 1/2 (7 - 3 (-1)^n + 8*n).
I tried to define the formula and plug into tabula.read_pdf but I'm not sure how to define the 'n' variable where 'n' ranges from 0 up to 25. Right now I defined it as a list which I think is the problem...
n = list(range(25+1))
pg = 1/2 (7 - 3 (-1)^n + 8*n)
df = tabula.read_pdf(path, pages = 'pg',index_col=0, multiple_tables=False)
When trying to execute, I get a TypeError: 'int' object is not callable on line pg = 1/2 (7 - 3 (-1)^n + 8*n). How would I define the variables so that tabula extracts pages that fit the condition of the formula?
Formula is x = 1/2(8n - 3(-1)^n + 7)
Step1:
pg = [] #Empty list to store the pages numbers calculated by formula
for i in range(1, 25+1): # For 1000 pages pdf use 1000 instead of 25
k = int(1/2*((8*n[i])-3*((-1)**n[i])+7))
pg.append(k)
print(pg, end = '') # This will give you list of page numbers
#[9, 10, 17, 18, 25, 26, 33, 34, 41, 42, 49, 50, 57, 58, 65, 66, 73, 74, 81, 82, 89, 90, 97, 98, 105]
Step 2:
# Now run the loop through each of the pages with the table
df=pd.DataFrame([])
df_combine=pd.DataFrame([])
for pageiter in range(pg):
df = tabula.read_pdf(path, pages=pageiter+1 ,index_col=0, multiple_tables=False, guess=False) #modify it as per your requirement
df_combine=pd.concat([df,df_combine]) #you can choose between merge or concat as per your need
OR
df_data = []
for pageiter in range(pg):
df = tabula.read_pdf(path, pages=pageiter+1 ,index_col=0, multiple_tables=False, guess=False) #modify it as per your requirement
df_data.append(df)
df_combine= pd.concat(df_data, axis=1)
Reference link to create formula
https://www.wolframalpha.com/widgets/view.jsp?id=a3af2e675c3bfae0f2ecce820c2bef43
I've got a pandas dataframe with two datetime columns and I would like to calculate the timedelta between the columns in "business minutes". It's easy to add business timedeltas using the offsets method, but I can't seem to find something built in that returns a timedelta in business days, hours, minutes, seconds. I'm very new to Python so it's very likely I'm missing something.
Thanks,
Nick
I don't think there's a way in numpy/pandas, but you can do it with python lib businesstime:
>>> datetime(2013, 12, 26, 5) - datetime(2013, 12, 23, 12)
datetime.timedelta(2, 61200)
>>> bt = businesstime.BusinessTime(holidays=businesstime.holidays.usa.USFederalHolidays())
>>> bt.businesstimedelta(datetime(2013, 12, 23, 12), datetime(2013, 12, 26, 5))
datetime.timedelta(1, 18000)
I have daily (day) data on calories intake for one person (cal2), which I get from a Stata dta file.
I run the code below:
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from pandas import read_csv
from matplotlib.pylab import rcParams
d = pd.read_stata('time_series_calories.dta', preserve_dtypes=True,
index = 'day', convert_dates=True)
print(d.dtypes)
print(d.shape)
print(d.index)
print(d.head)
plt.plot(d)
This is how the data looks like:
0 2002-01-10 3668.433350
1 2002-01-11 3652.249756
2 2002-01-12 3647.866211
3 2002-01-13 3646.684326
4 2002-01-14 3661.941406
5 2002-01-15 3656.951660
The prints reveal the following:
day datetime64[ns]
cal2 float32
dtype: object
(251, 2)
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
...
241, 242, 243, 244, 245, 246, 247, 248, 249, 250],
dtype='int64', length=251)
And here is the problem - the data should identify as dtype='datatime64[ns]'.
However, it clearly does not. Why not?
There is a discrepancy between the code provided, the data and the types shown.
This is because irrespective of the type of cal2, the index = 'day' argument
in pd.read_stata() should always render day the index, albeit not as the
desired type.
With that said, the problem can be reproduce as follows.
First, create the dataset in Stata:
clear
input double day float cal2
15350 3668.433
15351 3652.25
15352 3647.866
15353 3646.684
15354 3661.9414
15355 3656.952
end
format %td day
save time_series_calories
describe
Contains data from time_series_calories.dta
obs: 6
vars: 2
size: 72
----------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------
day double %td
cal2 float %9.0g
----------------------------------------------------------------------------------------------------
Sorted by:
Second, load the data in Pandas:
import pandas as pd
d = pd.read_stata('time_series_calories.dta', preserve_dtypes=True, convert_dates=True)
print(d.head)
day cal2
0 2002-01-10 3668.433350
1 2002-01-11 3652.249756
2 2002-01-12 3647.866211
3 2002-01-13 3646.684326
4 2002-01-14 3661.941406
5 2002-01-15 3656.951660
print(d.dtypes)
day datetime64[ns]
cal2 float32
dtype: object
print(d.shape)
(6, 2)
print(d.index)
Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
In order to change the index as desired, you can use pd.set_index():
d = d.set_index('day')
print(d.head)
cal2
day
2002-01-10 3668.433350
2002-01-11 3652.249756
2002-01-12 3647.866211
2002-01-13 3646.684326
2002-01-14 3661.941406
2002-01-15 3656.951660
print(d.index)
DatetimeIndex(['2002-01-10', '2002-01-11', '2002-01-12', '2002-01-13',
'2002-01-14', '2002-01-15'],
dtype='datetime64[ns]', name='day', freq=None)
If day is a string in the Stata dataset, then you can do the following:
d['day'] = pd.to_datetime(d.day)
d = d.set_index('day')
I am trying to find the equation of a line within a DF
Here is a fake data set to explain:
Clicks Sales
5 10
5 11
10 16
10 20
10 18
15 28
15 26
... ...
100 200
What I am trying to do:
Calculate the equation of the line between so that I am able to input a number of clicks and have an output of sales at any predicted level. The thing I am trying to wrap my brain around is that I have many different line functions (e.g. there are multiple sales for each amount of clicks). How can I iterate through my DF to just to calculate one aggregate line function?
Here's what I have but it only accept ONE input at a time, I would like to create an average or aggregate...
def slope(self, target):
return slope(target.x - self.x, target.y - self.y)
def y_int(self, target): # <= here's the magic
return self.y - self.slope(target)*self.x
def line_function(self, target):
slope = self.slope(target)
y_int = self.y_int(target)
def fn(x):
return slope*x + y_int
return fn
a = Point(5, 10) # I am stuck here since - what to input!?
b = Point(10, 16) # I am stuck here since - what to input!?
line = a.line_function(b)
print(line(x=10))
Use the scipy function scipy.stats.linregress to fit your data.
Maybe also check https://en.wikipedia.org/wiki/Linear_regression to better understand linear regression.
You could group by Clicks and take the average of the Sales per group:
In [307]: sales = df.groupby('Clicks')['Sales'].mean(); sales
Out[307]:
Clicks
5 10.5
10 18.0
15 27.0
100 200.0
Name: Sales, dtype: float64
Then form the piecewise linear interpolating function based on
the groupwise-averaged data above using interpolate.interp1d:
from scipy import interpolate
fn = interpolate.interp1d(sales.index, sales.values, kind='linear')
For example,
import numpy as np
import pandas as pd
from scipy import interpolate
import matplotlib.pyplot as plt
df = pd.DataFrame({'Clicks': [5, 5, 10, 10, 10, 15, 15, 100],
'Sales': [10, 11, 16, 20, 18, 28, 26, 200]})
sales = df.groupby('Clicks')['Sales'].mean()
Once you have the groupwise-averaged sales, you can compute the interpolated sales
a number of ways. One way is to use np.interp:
newx = [10]
print(np.interp(newx, sales.index, sales.values))
# [ 18.] <-- The interpolated sales when the number of clicks is 10 (newx)
The problem with np.interp is that you are passing sales.index and sales.values to np.interp every time you call it -- it has no memory of the interpolating function. It is re-computing the interpolating function every time you call it.
If you have scipy, then you could create the interpolating function once and then use it as many times as you like later:
fn = interpolate.interp1d(sales.index, sales.values, kind='linear')
print(fn(newx))
# [ 18.]
For example, you could evaluate the interpolating function at a whole bunch of points (and plot the result) like this:
newx = np.linspace(5, 100, 100)
plt.plot(newx, fn(newx))
plt.plot(df['Clicks'], df['Sales'], 'o')
plt.show()
Pandas Series (and DataFrames) have an iterpolate method too. To use it, you reindex the Series to include the points where you wish to interpolate:
In [308]: sales.reindex(sales.index.union([14]))
Out[308]:
5 10.5
10 18.0
14 NaN
15 27.0
100 200.0
Name: Sales, dtype: float64
and then interpolate fills in the interpolated values where the Series is NaN:
In [295]: sales.reindex(sales.index.union([14])).interpolate('values')
Out[295]:
5 10.5
10 18.0
14 25.2 # <-- interpolated value
15 27.0
100 200.0
Name: Sales, dtype: float64
But I think it is perhaps not appropriate for your problem since it does not
return just the interpolated values you are looking for; it returns a whole
Series.