Merging data frames by date in Python

I have 8 dataframes that contain a date and a stock return. They are of different lengths (some cover 5 years, some 8, etc.). Some of them are randomly undefined (it depends on what the user does beforehand, namely which countries she chooses). What I want to do is merge by date only those dataframes that have data, and then find the correlation between the columns. Please advise me on this issue. I use the command below to merge the data frames, but because some of them may not exist (this is random), I cannot merge them.
merged = pd.concat([germany_return, france_return, usa_return, hongkong_return, india_return, japan_return, england_return, china_return], axis=1)
Below I present the last part of the code:
#Define Stock Price column
stock_germany = data_germany['Settle']
stock_france = data_france['Settle']
stock_usa = data_usa['Value']
stock_hongkong = data_hongkong['Last Traded']
stock_india = data_india['Close']
stock_japan = data_japan['Close']
stock_england = data_england['Settle']
stock_china = data_china['Value']
#Calculate monthly returns (pct_change over 21 trading days, roughly one month)
germany_return = stock_germany.pct_change(21)
france_return = stock_france.pct_change(21)
usa_return = stock_usa.pct_change(21)
hongkong_return = stock_hongkong.pct_change(21)
india_return = stock_india.pct_change(21)
japan_return = stock_japan.pct_change(21)
england_return = stock_england.pct_change(21)
china_return = stock_china.pct_change(21)
merged = pd.concat([germany_return, france_return, usa_return, hongkong_return, india_return, japan_return, england_return, china_return], axis=1)
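One hedged way around this, assuming the return series for unselected countries are simply left as None (the dict pattern below is an illustration, not part of the original code): collect the series in a dict, drop the undefined entries, then concat and correlate.
import pandas as pd

# Collect whichever monthly-return series were actually created; entries for
# countries the user did not choose are assumed to be None here.
candidates = {
    'germany': germany_return, 'france': france_return,
    'usa': usa_return, 'hongkong': hongkong_return,
    'india': india_return, 'japan': japan_return,
    'england': england_return, 'china': china_return,
}
available = {name: series for name, series in candidates.items()
             if series is not None}
# concat aligns the series on their date index; corr() then uses
# pairwise-complete observations, so differing lengths are handled.
merged = pd.concat(available, axis=1)
print(merged.corr())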

Related

Calculate percentage change in pandas with rows that contain the same values

I am using pandas to calculate the percentage change between values that occur more than once in the column of interest.
I want to compare the values from last week's workout, provided they are the same exercise type, to get the percentage change of weight used and reps accomplished.
I am able to get the percentages for all the rows, which is halfway to what I want, but the conditional part is missing: I only want the percentages when exercise_name has the same value, since the goal is to track how we improve on a weekly or bi-weekly basis.
ids = self.user_data["exercise"].fillna(0)
dups = self.user_data[ids.isin(ids[ids.duplicated()])].sort_values("exercise")
dups['exercise'] = dups['exercise'].astype(str)
dups['set_one_weight'] = pd.to_numeric(dups['set_one_weight'])
dups['set_two_weight'] = pd.to_numeric(dups['set_two_weight'])
dups['set_three_weight'] = pd.to_numeric(dups['set_three_weight'])
dups['set_four_weight'] = pd.to_numeric(dups['set_four_weight'])
dups['set_one'] = pd.to_numeric(dups['set_one'])
dups['set_two'] = pd.to_numeric(dups['set_two'])
dups['set_three'] = pd.to_numeric(dups['set_three'])
dups['set_four'] = pd.to_numeric(dups['set_four'])
percent_change = dups[['set_three_weight']].pct_change()
The last line gets the percentage change of set_three_weight for all the rows, but it cannot do what I described above: find the rows with the same name and compute the percentage change within them.
UPDATE
Using a groupby solution
ids = self.user_data["exercise"].fillna(0)
dups = self.user_data[ids.isin(ids[ids.duplicated()])].sort_values("exercise")
dups['exercise'] = dups['exercise'].astype(str)
dups['set_one_weight'] = pd.to_numeric(dups['set_one_weight'])
dups['set_two_weight'] = pd.to_numeric(dups['set_two_weight'])
dups['set_three_weight'] = pd.to_numeric(dups['set_three_weight'])
dups['set_four_weight'] = pd.to_numeric(dups['set_four_weight'])
dups['set_one'] = pd.to_numeric(dups['set_one'])
dups['set_two'] = pd.to_numeric(dups['set_two'])
dups['set_three'] = pd.to_numeric(dups['set_three'])
dups['set_four'] = pd.to_numeric(dups['set_four'])
dups['routine_upload_date'] = pd.to_datetime(dups['routine_upload_date'])
# percent_change = dups[['set_three_weight']].pct_change()
# Group the exercises together and create new columns with the percentage delta
dups.sort_values(['exercise', 'routine_upload_date'], inplace=True, ascending=[True, False])
dups['set_one_weight_delta'] = dups.groupby('exercise')['set_one_weight'].pct_change() + 1
dups['set_two_weight_delta'] = dups.groupby('exercise')['set_two_weight'].pct_change() + 1
dups['set_three_weight_delta'] = dups.groupby('exercise')['set_three_weight'].pct_change() + 1
dups['set_four_weight_delta'] = dups.groupby('exercise')['set_four_weight'].pct_change() + 1
dups['set_one_reps_delta'] = dups.groupby('exercise')['set_one'].pct_change() + 1
dups['set_two_reps_delta'] = dups.groupby('exercise')['set_two'].pct_change() + 1
dups['set_three_reps_delta'] = dups.groupby('exercise')['set_three'].pct_change() + 1
dups['set_four_reps_delta'] = dups.groupby('exercise')['set_four'].pct_change() + 1
print(dups.head())
I think this gets me the result I want; I would like someone to confirm.
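For what it's worth, the groupby approach does compute pct_change within each exercise group only; here is a small self-contained check with made-up names and weights (all values are hypothetical):

import pandas as pd

# Two exercises logged over consecutive weeks
df = pd.DataFrame({
    'exercise': ['bench', 'bench', 'bench', 'squat', 'squat'],
    'set_three_weight': [100.0, 105.0, 110.0, 200.0, 210.0],
})
# pct_change is computed within each group; the first row of each group is NaN
df['delta'] = df.groupby('exercise')['set_three_weight'].pct_change() + 1
print(df)
# bench deltas: NaN, 1.05, ~1.048; squat deltas: NaN, 1.05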

How to calculate active hours of an employee using face_recognition for attendance tracking

I am working on a face recognition system for my academic project. I want to record the first time an employee is recognized as their first active time, record each subsequent recognition as their last active time, and then calculate the total active hours from the first and last active times.
I tried the following code, but I am getting only the current system time as the start time. Can someone help me figure out what I am doing wrong?
Code:
data = pickle.loads(open(args["encodings"], "rb").read())
vs = VideoStream(src=0).start()
writers = None
time.sleep(2.0)
while True:
    frame = vs.read()
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    rgb = imutils.resize(frame, width=750)
    r = frame.shape[1] / float(rgb.shape[1])
    boxes = face_recognition.face_locations(rgb)
    encodings = face_recognition.face_encodings(rgb, boxes)
    names = []
    face_names = []
    for encoding in encodings:
        matches = face_recognition.compare_faces(data["encodings"], encoding)
        name = "Unknown"
        if True in matches:
            matchedIdxs = [i for (i, b) in enumerate(matches) if b]
            counts = {}
            for i in matchedIdxs:
                name = data["names"][i]
                counts[name] = counts.get(name, 0) + 1
            name = max(counts, key=counts.get)
        names.append(name)
    if names != []:
        for i in names:
            first_active_time = datetime.now().strftime('%H:%M')
            last_active_time = datetime.now().strftime('%H:%M')
            difference = datetime.strptime(first_active_time, '%H:%M') - datetime.strptime(last_active_time, '%H:%M')
            difference = difference.total_seconds()
            total_hours = time.strftime("%H:%M", time.gmtime(difference))
            face_names.append([i, first_active_time, last_active_time, total_hours])
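A sketch of one possible fix (the dict and helper names below are hypothetical, not from the original code): keep per-person state outside the frame loop, set the first-seen timestamp only once, and refresh the last-seen timestamp on every recognition. The posted code sets both times to datetime.now() on every frame, which is why they are always the current system time.

from datetime import datetime

active_times = {}  # name -> {'first_seen': datetime, 'last_seen': datetime}

def record_sighting(name):
    # First recognition sets both timestamps; later ones only move last_seen
    now = datetime.now()
    if name not in active_times:
        active_times[name] = {'first_seen': now, 'last_seen': now}
    else:
        active_times[name]['last_seen'] = now

def active_hours(name):
    # Total active time in hours between first and last recognition
    times = active_times[name]
    return (times['last_seen'] - times['first_seen']).total_seconds() / 3600.0

Inside the recognition loop you would call record_sighting(name) for every recognized name instead of recomputing both times from datetime.now().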

Value error in assigning to dataframe

I am assigning different data to one dataframe, and I got the following error:
ValueError: If using all scalar values, you must pass an index
I followed the question posted by others here, but it did not work out.
The following is my code. All you have to do is copy and paste it into an IDE.
import pandas as pd
import numpy as np
#Loading Team performance Data (ExpG (Home away)) For and against
epl_1718 = pd.read_csv("http://www.football-data.co.uk/mmz4281/1718/E0.csv")
epl_1718 = epl_1718[['HomeTeam','AwayTeam','FTHG','FTAG']]
epl_1718 = epl_1718.rename(columns={'FTHG': 'HomeGoals', 'FTAG': 'AwayGoals'})
Home_goal_avg = epl_1718['HomeGoals'].mean()
Away_goal_avg = epl_1718['AwayGoals'].mean()
Home_team_goals = epl_1718.groupby(['HomeTeam'])['HomeGoals'].sum()
Home_count = epl_1718.groupby(['HomeTeam'])['HomeTeam'].count()
Home_team_avg_goal = Home_team_goals/Home_count
Home_team_concede = epl_1718.groupby(['HomeTeam'])['AwayGoals'].sum()
EPL_Home_average_score = epl_1718['HomeGoals'].mean()
EPL_Home_average_conc = epl_1718['HomeGoals'].mean()
Home_team_avg_conc = Home_team_concede/Home_count
Away_team_goals = epl_1718.groupby(['AwayTeam'])['AwayGoals'].sum()
Away_count = epl_1718.groupby(['AwayTeam'])['AwayTeam'].count()
Away_team_avg_goal = Away_team_goals/Away_count
Away_team_concede = epl_1718.groupby(['AwayTeam'])['HomeGoals'].sum()
EPL_Away_average_score = epl_1718['AwayGoals'].mean()
EPL_Away_average_conc = epl_1718['HomeGoals'].mean()
Away_team_avg_conc = Away_team_concede/Away_count
Home_attk_sth = Home_team_avg_goal/EPL_Home_average_score
Home_attk_sth = Home_attk_sth.sort_index().reset_index()
Home_def_sth = Home_team_avg_conc/EPL_Home_average_conc
Home_def_sth = Home_def_sth.sort_index().reset_index()
Away_attk_sth = Away_team_avg_goal/EPL_Away_average_score
Away_attk_sth = Away_attk_sth.sort_index().reset_index()
Away_def_sth = Away_team_avg_conc/EPL_Away_average_conc
Away_def_sth = Away_def_sth.sort_index().reset_index()
Home_def_sth
HomeTeam = epl_1718['HomeTeam'].drop_duplicates().sort_index().reset_index().set_index('HomeTeam')
AwayTeam = epl_1718['AwayTeam'].drop_duplicates().sort_index().reset_index().sort_values(['AwayTeam']).set_index(['AwayTeam'])
#HomeTeam = HomeTeam.sort_index().reset_index()
Team = HomeTeam.append(AwayTeam).drop_duplicates()
Data = pd.DataFrame({"Team":Team,
                     "Home_attkacking":Home_attk_sth,
                     "Home_def": Home_def_sth,
                     "Away_Attacking":Away_attk_sth,
                     "Away_def":Away_def_sth,
                     "EPL_Home_avg_score":EPL_Home_average_score,
                     "EPL_Home_average_conc":EPL_Home_average_conc,
                     "EPL_Away_average_score":EPL_Away_average_score,
                     "EPL_Away_average_conc":EPL_Away_average_conc},
                    columns =['Team','Home_attacking','Home_def','Away_attacking','Away_def',
                              'EPL_Home_avg_score','EPL_Home_avg_conc','EPL_Away_avg_score','EPL_Away_average_conc'])
In this code, what I am trying to do is get the average goals scored per team per game and the average goals conceded per team per game.
Then I calculate other performance factors, such as attacking strength, defensive strength, etc.
I had to paste the whole code because with a small example the data frame creation would work.
Thanks for understanding, and thanks in advance for the advice.
The final data frame will have the following columns:
Team, Home attacking, Home defensive, Away attacking, Away defensive
and so on, as mentioned in the data frame constructor.
That means there will be only 20 teams under the Team column.
The shape of the dataframe will be (20, 9).
Regards,
Zep
The main idea here is to remove reset_index, so the Series keep their index of teams; the Team variable is then unnecessary and is created as the last step by reset_index. Also be careful with the column names in the DataFrame constructor: if a name differs between the dictionary and the columns list, like EPL_Home_average_conc in the dictionary versus EPL_Home_avg_conc in columns, you get NaN columns:
Home_team_goals = epl_1718.groupby(['HomeTeam'])['HomeGoals'].sum()
Home_count = epl_1718.groupby(['HomeTeam'])['HomeTeam'].count()
Home_team_avg_goal = Home_team_goals/Home_count
Home_team_concede = epl_1718.groupby(['HomeTeam'])['AwayGoals'].sum()
EPL_Home_average_score = epl_1718['HomeGoals'].mean()
EPL_Home_average_conc = epl_1718['HomeGoals'].mean()
Home_team_avg_conc = Home_team_concede/Home_count
Away_team_goals = epl_1718.groupby(['AwayTeam'])['AwayGoals'].sum()
Away_count = epl_1718.groupby(['AwayTeam'])['AwayTeam'].count()
Away_team_avg_goal = Away_team_goals/Away_count
Away_team_concede = epl_1718.groupby(['AwayTeam'])['HomeGoals'].sum()
EPL_Away_average_score = epl_1718['AwayGoals'].mean()
EPL_Away_average_conc = epl_1718['HomeGoals'].mean()
Away_team_avg_conc = Away_team_concede/Away_count
#removed reset_index
Home_attk_sth = Home_team_avg_goal/EPL_Home_average_score
Home_attk_sth = Home_attk_sth.sort_index()
Home_def_sth = Home_team_avg_conc/EPL_Home_average_conc
Home_def_sth = Home_def_sth.sort_index()
Away_attk_sth = Away_team_avg_goal/EPL_Away_average_score
Away_attk_sth = Away_attk_sth.sort_index()
Away_def_sth = Away_team_avg_conc/EPL_Away_average_conc
Away_def_sth = Away_def_sth.sort_index()
Data = pd.DataFrame({"Home_attacking":Home_attk_sth,
                     "Home_def": Home_def_sth,
                     "Away_attacking":Away_attk_sth,
                     "Away_def":Away_def_sth,
                     "EPL_Home_average_score":EPL_Home_average_score,
                     "EPL_Home_average_conc":EPL_Home_average_conc,
                     "EPL_Away_average_score":EPL_Away_average_score,
                     "EPL_Away_average_conc":EPL_Away_average_conc},
                    columns =['Home_attacking','Home_def','Away_attacking','Away_def',
                              'EPL_Home_average_score','EPL_Home_average_conc',
                              'EPL_Away_average_score','EPL_Away_average_conc'])
#column from index
Data = Data.rename_axis('Team').reset_index()
print (Data)
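To see the name-mismatch pitfall in isolation, here is a tiny hypothetical illustration (not from the original post):

import pandas as pd

# The dict key is 'EPL_Home_average_conc', but columns asks for
# 'EPL_Home_avg_conc'; pandas finds no matching key and fills with NaN.
df = pd.DataFrame({'EPL_Home_average_conc': [1.53, 1.53]},
                  columns=['EPL_Home_avg_conc'])
print(df)
#    EPL_Home_avg_conc
# 0                NaN
# 1                NaN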

Updating multiple line plots dynamically in callback in bokeh

I have a use case where I have multiple line plots (with legends), and I need to update the line plots based on a column condition. Below is an example with two data sets; based on the country, the column data source changes. But the issue I am facing is that the number of columns is not fixed for the data source, and even the types can vary. So, when I update the data source in a callback after a new country is selected, I get this error:
Error: attempted to retrieve property array for nonexistent field 'pay_conv_7d.content'.
I am guessing this is because the pay_conv_7d.content column doesn't exist in the new data source, but those lines were already in my plot. I have been trying to fix this in various ways (making common columns for every country selection, adding the missing columns to the data source in the callback), but I still get issues.
Is there any clean way to have multiple line plots update via a callback, without a lot of hackish workarounds? Any insights or help would be really appreciated. Thanks much in advance! :)
def setup_multiline_plots(x_axis, y_axis, title_text, data_source, plot):
    num_categories = len(data_source.data['categories'])
    legends_list = list(data_source.data['categories'])
    colors_list = Spectral11[0:num_categories]
    # xs = [data_source.data['%s.'%x_axis].values] * num_categories
    # ys = [data_source.data[('%s.%s')%(y_axis,column)] for column in data_source.data['categories']]
    # data_source.data['x_series'] = xs
    # data_source.data['y_series'] = ys
    # plot.multi_line('x_series', 'y_series', line_color=colors_list, legend='categories', line_width=3, source=data_source)
    plot_list = []
    for (colr, leg, column) in zip(colors_list, legends_list, data_source.data['categories']):
        xs, ys = '%s.' % x_axis, '%s.%s' % (y_axis, column)
        plot.line(xs, ys, source=data_source, color=colr, legend=leg, line_width=3, name=ys)
        plot_list.append(ys)
    data_source.data['plot_names'] = data_source.data.get('plot_names', []) + plot_list
    plot.title.text = title_text

def update_plot(country, timeseries_df, timeseries_source,
                aggregate_df, aggregate_source, category,
                plot_pay_7d, plot_r_pay_90d):
    aggregate_metrics = aggregate_df.loc[aggregate_df.country == country]
    aggregate_metrics = aggregate_metrics.nlargest(10, 'cost')
    category_types = list(aggregate_metrics[category].unique())
    timeseries_df = timeseries_df[timeseries_df[category].isin(category_types)]
    timeseries_multi_line_metrics = get_multiline_column_datasource(timeseries_df, category, country)
    # len_series = len(timeseries_multi_line_metrics.data['time.'])
    # previous_legends = timeseries_source.data['plot_names']
    # current_legends = timeseries_multi_line_metrics.data.keys()
    # common_legends = list(set(previous_legends) & set(current_legends))
    # additional_legends_list = list(set(previous_legends) - set(current_legends))
    # for legend in additional_legends_list:
    #     zeros = pd.Series(np.array([0] * len_series), name=legend)
    #     timeseries_multi_line_metrics.add(zeros, legend)
    # timeseries_multi_line_metrics.data['plot_names'] = previous_legends
    timeseries_source.data = timeseries_multi_line_metrics.data
    aggregate_source.data = aggregate_source.from_df(aggregate_metrics)

def get_multiline_column_datasource(df, category, country):
    df_country = df[df.country == country]
    df_pivoted = pd.DataFrame(df_country.pivot_table(index='time', columns=category, aggfunc=np.sum).reset_index())
    df_pivoted.columns = df_pivoted.columns.to_series().str.join('.')
    categories = list(set([column.split('.')[1] for column in list(df_pivoted.columns)]))[1:]
    data_source = ColumnDataSource(df_pivoted)
    data_source.data['categories'] = categories
    return data_source  # return appears to have been lost in the paste; update_plot uses the result
Recently I had to update data on a multi-line glyph; check my question if you want to take a look at my algorithm.
I think you can update a ColumnDataSource in at least three ways:
You can create a dataframe to instantiate a new CDS
cds = ColumnDataSource(df_pivoted)
data_source.data = cds.data
You can create a dictionary and assign it to the data attribute directly
d = {
'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
'xs1': [[17.0, 166.0], [17.0, 116.0], [17.0, 126.0]],
'ys1': [[179.0, 169.0], [179.0, 1169.0], [1729.0, 169.0]],
'xs2': [[27.0, 276.0], [27.0, 216.0], [27.0, 226.0]],
'ys2': [[279.0, 269.0], [279.0, 2619.0], [2579.0, 2569.0]]
}
data_source.data = d
Here, if you need columns of different sizes or empty columns, you can fill the gaps with NaN values in order to keep the column sizes equal. And I think this is the solution to your question:
import numpy as np
d = {
'xs0': [[7.0, 986.0], [17.0, 6.0], [7.0, 67.0]],
'ys0': [[79.0, 69.0], [179.0, 169.0], [729.0, 69.0]],
'xs1': [[17.0, 166.0], [np.nan], [np.nan]],
'ys1': [[179.0, 169.0], [np.nan], [np.nan]],
'xs2': [[np.nan], [np.nan], [np.nan]],
'ys2': [[np.nan], [np.nan], [np.nan]]
}
data_source.data = d
Or, if you only need to modify a few values, you can use the patch method; check the documentation here.
The following example shows how to patch column elements with (index, value) and (slice, values) pairs:
source = ColumnDataSource(data=dict(foo=[10, 20, 30], bar=[100, 200, 300]))
patches = {
'foo' : [ (slice(2), [11, 12]) ],
'bar' : [ (0, 101), (2, 301) ],
}
source.patch(patches)
After this operation, the value of the source.data will be:
dict(foo=[11, 12, 30], bar=[101, 200, 301])
NOTE: It is important to make the update in one go to avoid performance issues
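As a small illustration of that note (hypothetical column names): replace source.data with a complete new dict in one assignment instead of mutating columns one at a time.

# One atomic update: all columns are replaced together and stay consistent
new_data = {'x': [1, 2, 3], 'y': [10, 20, 30]}
source.data = new_data
# Avoid per-column mutation such as source.data['x'] = [...]; it fires a
# change event per column and can briefly leave columns with unequal lengths.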

Melting a List in column to multiple rows in dataframe

What I am trying to do is this: I have a list of star_cast and a list of genres for each movie entity. I want to melt these lists down into repeated rows in the data frame so that I can store it in a database system.
director_name = ['chris','guy','bryan']
genre = [['mystery','thriller'],['comedy','crime'],['action','adventure','sci-fi']]
gross_value = [2544,236544,265888]
imdb_ratings = [8.5,5.4,3.2]
metascores = [80.0,55.0,64.0]
movie_names = ['memento','snatch','x-men']
runtime = [113.0,102.0,104.0]
star_cast = [['abc','ced','gef'],['aaa','abc'],['act','cst','gst','hhs']]
votes = [200,2150,2350]
sample_data = pd.DataFrame({"movie_names":movie_names,
                            "imdb_ratings":imdb_ratings,
                            "metscores":metascores,
                            "votes":votes,
                            "runtime":runtime,
                            "genre":genre,
                            "director_name": director_name,
                            "star_cast": star_cast,
                            "gross_value":gross_value
                            })
The above generates a sample of the data frame I have.
director_name = ['chris','chris','chris','chris','chris','chris','guy','guy','guy','guy','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan','bryan']
genre = ['mystery','thriller','mystery','thriller','mystery','thriller','comedy','crime','comedy','crime','action','adventure','sci-fi','action','adventure','sci-fi','action','adventure','sci-fi','action','adventure','sci-fi']
gross_value = [2544,2544,2544,2544,2544,2544,236544,236544,236544,236544,265888,265888,265888,265888,265888,265888,265888,265888,265888,265888,265888,265888]
imdb_ratings = [8.5,8.5,8.5,8.5,8.5,8.5,5.4,5.4,5.4,5.4,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2,3.2]
metascores = [80.0,80.0,80.0,80.0,80.0,80.0,55.0,55.0,55.0,55.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0,64.0]
movie_names = ['memento','memento','memento','memento','memento','memento','snatch','snatch','snatch','snatch','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men','x-men']
runtime = [113.0,113.0,113.0,113.0,113.0,113.0,102.0,102.0,102.0,102.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0,104.0]
star_cast = ['abc','ced','gef','abc','ced','gef','aaa','abc','aaa','abc','act','cst','gst','hhs','act','cst','gst','hhs','act','cst','gst','hhs']
votes = [200,200,200,200,200,200,2150,2150,2150,2150,2350,2350,2350,2350,2350,2350,2350,2350,2350,2350,2350,2350]
sample_result = pd.DataFrame({"movie_names":movie_names,
                              "imdb_ratings":imdb_ratings,
                              "metscores":metascores,
                              "votes":votes,
                              "runtime":runtime,
                              "genre":genre,
                              "director_name": director_name,
                              "star_cast": star_cast,
                              "gross_value":gross_value
                              })
This generates the format I want to convert my data into.
I tried using melt(), but no luck there. Please advise on how this can be achieved effectively. My dataset is fairly large, so using for loops would be very slow. Is there another way to solve this?
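One possible approach, assuming pandas >= 0.25 (where DataFrame.explode was introduced): explode each list column in turn, which yields exactly the genre x star_cast cross-product per movie shown above.

# Each explode turns one list column into repeated rows; chaining two
# explodes produces every (genre, star_cast) combination per movie.
sample_result = (sample_data
                 .explode('genre')
                 .explode('star_cast')
                 .reset_index(drop=True))
print(sample_result.shape)  # (22, 9): 2*3 + 2*2 + 3*4 rows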
