How to create a dataframe from multiple dictionaries? - python-3.x

How can I create a data frame from multiple dictionaries?
Suppose the following:
import numpy as np
import pandas as pd
Open = {'Open': np.array([86.34, 84.04, 79.06, 78.46, 75.85, 80.78, 79.66, 80.67, 82.32,80.1 , 77.63, 77. , 79.15, 76.32, 77. , 77.11, 77.04, 79.74,79.92, 79.09])}
High = {'High': np.array([86.45, 84.24, 80.29, 79.11, 79.98, 80.98, 80.57, 82.18, 83.25,81.25, 78.28, 79.2 , 79.19, 77.55, 79. , 77.5 , 81.93, 81.04,82.48, 86.74])}
Low = {'Low': np.array([83.15, 79.07, 75.59, 76.99, 74.78, 77.45, 78.48, 80.11, 80.35, 77. , 71.96, 76.15, 76.73, 75.83, 76.11, 73.46, 76.55, 78.7 ,77.65, 78.47])}
Close = {'Close': np.array([84.02, 79.17, 77.28, 77.56, 79.24, 79.86, 79.91, 82.03, 81.83,77.63, 76.19, 79.13, 76.85, 76.98, 78.31, 77.49, 81.65, 80.57,77.92, 85.51])}
index = pd.date_range('2021-1-1',periods=20)
I'm able to create a dataframe from one dictionary, as shown below:
df = pd.DataFrame(Open, index = index)
However, when I try to extend this syntax to a list of dictionaries, I get an error:
df = pd.DataFrame([Open, High, Low, Close], index = index)
ValueError: Shape of passed values is (4, 4), indices imply (20, 4)
How can I construct a dataframe from multiple dictionaries where each column is a dictionary?

You can merge the multiple dicts into one:
df = pd.DataFrame(dict(Open, **High, **Low, **Close), index = index)
df
Open High Low Close
2021-01-01 86.34 86.45 83.15 84.02
2021-01-02 84.04 84.24 79.07 79.17
2021-01-03 79.06 80.29 75.59 77.28
2021-01-04 78.46 79.11 76.99 77.56
2021-01-05 75.85 79.98 74.78 79.24
2021-01-06 80.78 80.98 77.45 79.86
2021-01-07 79.66 80.57 78.48 79.91
2021-01-08 80.67 82.18 80.11 82.03
2021-01-09 82.32 83.25 80.35 81.83
2021-01-10 80.10 81.25 77.00 77.63
2021-01-11 77.63 78.28 71.96 76.19
2021-01-12 77.00 79.20 76.15 79.13
2021-01-13 79.15 79.19 76.73 76.85
2021-01-14 76.32 77.55 75.83 76.98
2021-01-15 77.00 79.00 76.11 78.31
2021-01-16 77.11 77.50 73.46 77.49
2021-01-17 77.04 81.93 76.55 81.65
2021-01-18 79.74 81.04 78.70 80.57
2021-01-19 79.92 82.48 77.65 77.92
2021-01-20 79.09 86.74 78.47 85.51
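Equivalently, since the four dicts have distinct keys, plain dict unpacking or a column-wise concat does the same merge. A minimal sketch using the Open/High/Low/Close dicts and index defined above:
# merge the four single-key dicts via dict unpacking (Python 3.5+)
df = pd.DataFrame({**Open, **High, **Low, **Close}, index=index)
# or build one frame per column and concatenate along axis 1
df = pd.concat([pd.DataFrame(d, index=index) for d in (Open, High, Low, Close)], axis=1)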

Related

month starting date and ending date between a range of date in python

The input is a range of dates for which we need to find the starting date and ending date of each month in the interval, clipped to the given start and end dates. An example is given below:
input:
start date: 2018-6-15
end date: 2019-3-20
desired output:
[
["month starting date","month ending date"],
["2018-6-15","2018-6-30"],
["2018-7-1","2018-7-31"],
["2018-8-1","2018-8-31"],
["2018-9-1","2018-9-30"],
["2018-10-1","2018-10-31"],
["2018-11-1","2018-11-30"],
["2018-12-1","2018-12-31"],
["2019-1-1","2019-1-31"],
["2019-2-1","2019-2-28"],
["2019-3-1","2019-3-20"]
]
An option using pandas: create a date_range from start to end date, extract the month numbers from it as a pandas.Series, then compare it against itself shifted 1 element forward and 1 element backward to get boolean masks marking where the month changes (ne, i.e. !=). Now you can create a DataFrame to work with, or a list of lists if you like.
Ex:
import pandas as pd
start_date, end_date = '2018-6-15', '2019-3-20'
dtrange = pd.date_range(start=start_date, end=end_date, freq='d')
months = pd.Series(dtrange.month)
starts, ends = months.ne(months.shift(1)), months.ne(months.shift(-1))
df = pd.DataFrame({'month_starting_date': dtrange[starts].strftime('%Y-%m-%d'),
'month_ending_date': dtrange[ends].strftime('%Y-%m-%d')})
# df
# month_starting_date month_ending_date
# 0 2018-06-15 2018-06-30
# 1 2018-07-01 2018-07-31
# 2 2018-08-01 2018-08-31
# 3 2018-09-01 2018-09-30
# 4 2018-10-01 2018-10-31
# 5 2018-11-01 2018-11-30
# 6 2018-12-01 2018-12-31
# 7 2019-01-01 2019-01-31
# 8 2019-02-01 2019-02-28
# 9 2019-03-01 2019-03-20
# as a list of lists:
l = [df.columns.values.tolist()] + df.values.tolist()
# l
# [['month_starting_date', 'month_ending_date'],
# ['2018-06-15', '2018-06-30'],
# ['2018-07-01', '2018-07-31'],
# ['2018-08-01', '2018-08-31'],
# ['2018-09-01', '2018-09-30'],
# ['2018-10-01', '2018-10-31'],
# ['2018-11-01', '2018-11-30'],
# ['2018-12-01', '2018-12-31'],
# ['2019-01-01', '2019-01-31'],
# ['2019-02-01', '2019-02-28'],
# ['2019-03-01', '2019-03-20']]
Note that I use strftime when I create the DataFrame. Do this if you want the output to be of dtype string. If you want to continue to work with datetime objects (timestamps), don't apply strftime.
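If you prefer to keep timestamps, the same frame without strftime is simply (a minimal variant of the code above):
# variant keeping Timestamp dtype: omit strftime
df_ts = pd.DataFrame({'month_starting_date': dtrange[starts],
                      'month_ending_date': dtrange[ends]})
# both columns now have dtype datetime64[ns]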
This code is simple and uses only standard-library packages.
import calendar
from datetime import datetime, timedelta
def get_time_range_list(start_date, end_date):
    date_range_list = []
    while True:
        # last day of the month containing start_date
        month_end = start_date.replace(day=calendar.monthrange(start_date.year, start_date.month)[1])
        next_month_start = month_end + timedelta(days=1)
        if next_month_start <= end_date:
            date_range_list.append((start_date, month_end))
            start_date = next_month_start
        else:
            date_range_list.append((start_date, end_date))
            return date_range_list
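A quick usage sketch with the dates from the question (the function accepts datetime.date or datetime.datetime objects, as long as start and end are the same type):
from datetime import date

ranges = get_time_range_list(date(2018, 6, 15), date(2019, 3, 20))
# [(datetime.date(2018, 6, 15), datetime.date(2018, 6, 30)),
#  (datetime.date(2018, 7, 1), datetime.date(2018, 7, 31)),
#  ...
#  (datetime.date(2019, 3, 1), datetime.date(2019, 3, 20))]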

How to expand the dataframe based on the column values?

I have this dataframe:
utc arc_time_s tec_tecu elevation_deg lat_e_deg lon_e_deg
01.01.2018 01:19 54 3.856 17.35 57.44 25.02
01.01.2018 01:19 53 4.021 17.29 57.47 25.03
01.01.2018 01:19 52 4.029 17.22 57.51 25.05
01.01.2018 01:19 51 4.015 17.15 57.54 25.07
01.01.2018 01:19 50 3.997 17.08 57.57 25.09
What I want is to expand the dataframe based on the lat_e_deg column so that it contains every value at a decimal scale of 2, i.e. in steps of 0.01.
I found the resample method, but it seems it can only be used on a datetime column.
So as output I want the table above with all intermediate lat_e_deg values filled in. How can I do this?
import pandas as pd
import numpy as np
# reconstruct part of your DataFrame for testing purposes:
df = pd.DataFrame([[17.35, 57.44], [17.29, 57.47], [17.22, 57.51]],
                  columns = ['elevation_deg', 'lat_e_deg'])
# create a Series of the desired stepwise values; round to 2 decimals so the
# float keys match exactly on merge, and push stop past the max so it is included:
lat_e_deg_expanded = pd.Series(np.arange(start = min(df['lat_e_deg']),
                                         stop = max(df['lat_e_deg']) + 0.01,
                                         step = 0.01).round(2),
                               name = 'lat_e_deg')
# merge the expanded series with the original DataFrame and sort:
df_expanded = pd.merge(df, lat_e_deg_expanded,
                       on = 'lat_e_deg',
                       how = 'outer')
df_expanded.sort_values(by = 'lat_e_deg', inplace = True)
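If you also want elevation_deg values for the newly inserted rows instead of NaN, a hedged follow-up sketch (linear interpolation is an assumption about how those gaps should be filled):
# fill the NaN elevation_deg of the inserted rows (assumption: linear interpolation)
df_expanded = df_expanded.reset_index(drop=True)
df_expanded['elevation_deg'] = df_expanded['elevation_deg'].interpolate()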
You can create a pd.Series with step = 0.01 and then join it to the original dataframe.
Example code, assuming df is a dataframe with missing decimal values:
ts = pd.Series(np.arange(start = 57.44, stop = 57.58, step = 0.01).round(2), name = "t")  # stop past 57.57 so it is included; round so the float keys merge cleanly
df = pd.DataFrame({'t': [57.44, 57.47, 57.57]})
df2 = pd.merge(ts, df, how = "left").sort_values("t")
Result:
t
0 57.44
1 57.45
2 57.46
3 57.47
4 57.48
5 57.49
6 57.50
7 57.51
8 57.52
9 57.53
10 57.54
11 57.55
12 57.56
13 57.57

How to transform a dataframe based on if,else conditions?

I am trying to build a function which transforms a dataframe based on certain conditions, but I am getting a SyntaxError. I am not sure what I am doing wrong. Any help will be appreciated. Thank you!
import pandas as pd
from datetime import datetime
from datetime import timedelta
df=pd.read_csv('example1.csv')
df.columns = ['dtime', 'kW']
df['dtime'] = pd.to_datetime(df['dtime'])
df.head(5)
dtime kW
0 2019-08-27 23:30:00 0.016
1 2019-08-27 23:00:00 0
2 2019-08-27 22:30:00 0.016
3 2019-08-27 22:00:00 0.016
4 2019-08-27 21:30:00 0
def transdf(df):
    a = df.loc[0, 'dtime']
    b = df.loc[1, 'dtime']
    c = a - b
    minutes = c.total_seconds() / 60
    d = int(minutes)
    # d can only be 15, 30 or 60
    if d == 15:
        return df=df.set_index('dtime').asfreq('-15T',fill_value='Missing')
    elif d == 30:
        return df=df.set_index('dtime').asfreq('-30T',fill_value='Missing')
    elif d == 60:
        return df=df.set_index('dtime').asfreq('-60T',fill_value='Missing')
    else:
        return None
First: it is cleaner to have a single return statement after the if/elif/else block at the end of your function, and to just update the value of df inside each of the cases. More importantly, return df=... is invalid syntax: return takes an expression, not an assignment, which is why you are getting the SyntaxError.
def transform(df):
    a = df.loc[0, 'dtime']
    b = df.loc[1, 'dtime']
    c = a - b
    minutes = c.total_seconds() / 60
    d = int(minutes)
    # d can only be 15, 30 or 60
    if d == 15:
        df = df.set_index('dtime').asfreq('-15T', fill_value='Missing')
    elif d == 30:
        df = df.set_index('dtime').asfreq('-30T', fill_value='Missing')
    elif d == 60:
        df = df.set_index('dtime').asfreq('-60T', fill_value='Missing')
    else:
        df = None
    return df
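Since the three branches differ only in the number of minutes, a hedged refactor is to build the frequency string from d directly and drop the if/elif chain entirely (this reuses the answer's own '-{d}T' asfreq pattern; only the string construction is new):
def transform(df):
    # minutes between the first two (descending) timestamps
    minutes = (df.loc[0, 'dtime'] - df.loc[1, 'dtime']).total_seconds() / 60
    d = int(minutes)
    if d not in (15, 30, 60):
        return None
    # same asfreq call as above, with the interval baked into the string
    return df.set_index('dtime').asfreq(f'-{d}T', fill_value='Missing')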

Identifying groups of two rows that satisfy three conditions in a dataframe

I have the df below and want to identify any two orders that satisfy all the following conditions:
Distance between pickups less than X miles
Distance between dropoffs less than Y miles
Difference between order creation times less than Z minutes
I would use haversine (from haversine import haversine) to calculate the distance between the pickups and between the dropoffs of each pair of orders.
The df I currently have looks like the following:
DAY  Order pickup_lat pickup_long dropoff_lat dropoff_long created_time
1/3/19 234e 32.69 -117.1 32.63 -117.08 3/1/19 19:00
1/3/19 235d 40.73 -73.98 40.73 -73.99 3/1/19 23:21
1/3/19 253w 40.76 -73.99 40.76 -73.99 3/1/19 15:26
2/3/19 231y 36.08 -94.2 36.07 -94.21 3/2/19 0:14
3/3/19 305g 36.01 -78.92 36.01 -78.95 3/2/19 0:09
3/3/19 328s 36.76 -119.83 36.74 -119.79 3/2/19 4:33
3/3/19 286n 35.76 -78.78 35.78 -78.74 3/2/19 0:43
I want my output df to contain any 2 orders (rows) that satisfy the above conditions. What I am not sure of is how to run that calculation for each pair of rows in the dataframe and return the pairs that satisfy those conditions.
I hope I am explaining my desired output correctly. Thanks for looking!
I don't know if it is an optimal solution, but I didn't come up with anything better. What I have done:
created a dataframe with all possible order combinations,
computed all needed measures for all of the combinations and added those measure columns to the dataframe,
found the indices of the rows which fulfill the mentioned conditions.
The code:
# create a dataframe with all combinations
from itertools import combinations
index_comb = list(combinations(trips.index, 2))  # trips is your original dataframe
col_names = trips.columns
orders1= pd.DataFrame([trips.loc[c[0],:].values for c in index_comb],columns=trips.columns,index = index_comb)
orders2= pd.DataFrame([trips.loc[c[1],:].values for c in index_comb],columns=trips.columns,index = index_comb)
orders2 = orders2.add_suffix('_1')
combined = pd.concat([orders1,orders2],axis=1)
from haversine import haversine
def distance(row):
    loc_0 = (row[0], row[1])  # (lat, lon)
    loc_1 = (row[2], row[3])
    return haversine(loc_0, loc_1, unit='mi')
#pickup diff
pickup_cols = ["pickup_long","pickup_lat","pickup_long_1","pickup_lat_1"]
combined[pickup_cols] = combined[pickup_cols].astype(float)
combined["pickup_dist_mi"] = combined[pickup_cols].apply(distance,axis=1)
#dropoff diff
dropoff_cols = ["dropoff_lat","dropoff_long","dropoff_lat_1","dropoff_long_1"]
combined[dropoff_cols] = combined[dropoff_cols].astype(float)
combined["dropoff_dist_mi"] = combined[dropoff_cols].apply(distance,axis=1)
#creation time diff
combined["time_diff_min"] = abs(pd.to_datetime(combined["created_time"])-pd.to_datetime(combined["created_time_1"])).astype('timedelta64[m]')
#Thresholds
Z = 600
Y = 400
X = 400
#find orders with below conditions
diff_time_Z = combined["time_diff_min"] < Z
pickup_dist_X = combined["pickup_dist_mi"]<X
dropoff_dist_Y = combined["dropoff_dist_mi"]<Y
conditions_idx = diff_time_Z & pickup_dist_X & dropoff_dist_Y
out = combined.loc[conditions_idx,["Order","Order_1","time_diff_min","dropoff_dist_mi","pickup_dist_mi"]]
The output for your data:
Order Order_1 time_diff_min dropoff_dist_mi pickup_dist_mi
(0, 5) 234e 328s 573.0 322.988195 231.300179
(1, 2) 235d 253w 475.0 2.072803 0.896893
(4, 6) 305g 286n 34.0 19.766096 10.233550
Hope I understood you correctly and that this helps.
Using your dataframe as above, drop the index. I'm presuming your created_time column is in datetime format.
import pandas as pd
from geopy.distance import geodesic
Cross merge the dataframe to get all possible combinations of 'Order'.
df_all = pd.merge(df.assign(key=0), df.assign(key=0), on='key').drop('key', axis=1)
Remove all the rows where the orders are equal.
df_all = df_all[df_all['Order_x'] != df_all['Order_y']].copy()
Drop the duplicate rows where (Order_x, Order_y) appears as both [a, b] and [b, a].
# drop duplicate rows
# first combine Order_x and Order_y into a sorted list, and combine into a string
df_all['dup_order'] = df_all[['Order_x', 'Order_y']].values.tolist()
df_all['dup_order'] = df_all['dup_order'].apply(lambda x: "".join(sorted(x)))
# drop the duplicates and reset the index
df_all = df_all.drop_duplicates(subset=['dup_order'], keep='first')
df_all = df_all.reset_index(drop=True)
Create a column and calculate the time difference in minutes.
df_all['time'] = (df_all['created_time_x'] - df_all['created_time_y']).abs().astype('timedelta64[m]')
Create a column and calculate the distance between drop offs.
df_all['dropoff'] = df_all.apply(
    (lambda row: geodesic(
        (row['dropoff_lat_x'], row['dropoff_long_x']),
        (row['dropoff_lat_y'], row['dropoff_long_y'])
    ).miles),
    axis=1
)
Create a column and calculate the distance between pickups.
df_all['pickup'] = df_all.apply(
    (lambda row: geodesic(
        (row['pickup_lat_x'], row['pickup_long_x']),
        (row['pickup_lat_y'], row['pickup_long_y'])
    ).miles),
    axis=1
)
Filter the results as desired.
X = 1500
Y = 2000
Z = 100
mask_pickups = df_all['pickup'] < X
mask_dropoff = df_all['dropoff'] < Y
mask_time = df_all['time'] < Z
print(df_all[mask_pickups & mask_dropoff & mask_time][['Order_x', 'Order_y', 'time', 'dropoff', 'pickup']])
Order_x Order_y time dropoff pickup
10 235d 231y 53.0 1059.026620 1059.026620
11 235d 305g 48.0 260.325370 259.275948
13 235d 286n 82.0 249.306279 251.929905
25 231y 305g 5.0 853.308110 854.315567
27 231y 286n 29.0 865.026077 862.126593
34 305g 286n 34.0 11.763787 7.842526
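Both answers enumerate all O(n^2) order pairs, which is fine for a handful of rows but grows quickly. For larger tables, a hedged sketch using scikit-learn's BallTree (BallTree and query_radius are standard scikit-learn APIs; the 3959-mile Earth radius is an approximation) can prefilter pickup pairs within X miles without materializing the full cross join, leaving only those candidates for the dropoff and time checks:
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_MI = 3959  # approximate mean Earth radius in miles

# the haversine metric expects (lat, lon) in radians
pickups = np.radians(df[['pickup_lat', 'pickup_long']].to_numpy())
tree = BallTree(pickups, metric='haversine')

X = 400  # pickup distance threshold in miles
# for each order, indices of all orders whose pickup lies within X miles
neighbors = tree.query_radius(pickups, r=X / EARTH_RADIUS_MI)
# candidate pairs (i < j); check dropoff distance and time difference on these only
pairs = [(i, j) for i, idx in enumerate(neighbors) for j in idx if i < j]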

Pandas: create multiple aggregate columns and merge multiple data frames in an elegant way

I am using the following code to create a few new aggregated columns based on the column version, then merging the 4 new data frames.
new_df = df[['version','duration']].groupby('version').mean().rename(columns=lambda x: ('mean_' + x)).reset_index().fillna(0)
new_df1 = df[['version','duration']].groupby('version').std().rename(columns=lambda x: ('std_' + x)).reset_index().fillna(0)
new_df2 = df[['version','ts']].groupby('version').min().rename(columns=lambda x: ('min_' + x)).reset_index().fillna(0)
new_df3 = df[['version','ts']].groupby('version').max().rename(columns=lambda x: ('max_' + x)).reset_index().fillna(0)
new_df3
import pandas
df_a = pandas.merge(new_df,new_df1, on = 'version')
df_b = pandas.merge(df_a,new_df2, on = 'version')
df_c = pandas.merge(df_b,new_df3, on = 'version')
df_c
The output looks like below:
version mean_duration std_duration min_ts max_ts
0 1400422 451 1 2018-02-28 09:42:15 2018-02-28 09:42:15
1 7626065 426 601 2018-01-25 11:01:58 2018-01-25 11:15:22
2 7689209 658 473 2018-01-30 11:09:31 2018-02-01 05:19:23
3 7702304 711 80 2018-01-30 17:49:18 2018-01-31 12:27:20
The code works fine, but I am wondering is there a more elegant/clean way to do this? Thank you!
Use functools.reduce to simplify your merges:
import functools
l = [new_df, new_df1, new_df2, new_df3]
functools.reduce(lambda left, right: pd.merge(left, right, on=['version']), l)
Or use agg to recreate what you need:
s=df.groupby('version').agg({'duration':['mean','std'],'ts':['min','max']}).reset_index()
s.columns = s.columns.map('_'.join).str.strip('_')  # strip the trailing '_' left on the version column
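Alternatively, on pandas 0.25+ named aggregation yields the exact column names of the original output in a single step, with no joining or suffix cleanup (this is standard groupby.agg keyword syntax):
s = df.groupby('version').agg(
    mean_duration=('duration', 'mean'),
    std_duration=('duration', 'std'),
    min_ts=('ts', 'min'),
    max_ts=('ts', 'max'),
).reset_index()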
