How do I iterate through data in a pandas DataFrame? - python-3.x

I have created a 5-day moving average for 5 years' worth of data. How do I iterate through this to show whether the moving average is rising or falling for every single day? My code simply gives me one integer answer rather than a rising (+1) or falling (-1) answer for every day. Thank you!
import pandas as pd

df = pd.read_csv('file:///C:/Users/James Brodie/Desktop/USDJPY.csv', header=1, index_col=0)

ma5 = df['PX_LAST'].rolling(window=5).mean()
ma8 = df['PX_LAST'].rolling(window=8).mean()
ma21 = df['PX_LAST'].rolling(window=21).mean()

ma5x = []
for i in ma5:
    if i > i-1:
        ma5x = 1
    elif i < i-1:
        ma5x = -1
    else:
        ma5x = 0
print(ma5x)

ma5 = [5,2,2,3,3,2,5]
ma5x = []
lastItem = ma5[0]
for currItem in ma5[1:]:
    if currItem > lastItem:
        ma5x.append(1)
    elif currItem < lastItem:
        ma5x.append(-1)
    else:
        ma5x.append(0)
    lastItem = currItem
print(ma5x)
gives:
[-1, 0, 1, 0, -1, 1]
In pandas, the elements of a list are best represented as a Series object (to use your own row index, replace list(range(len(ma5x))) below with anything you need):
print('-----------')
import pandas as pd
pd_ma5x = pd.Series(ma5x, index=list(range(len(ma5x))))
print(pd_ma5x)
gives:
-----------
0 -1
1 0
2 1
3 0
4 -1
5 1
dtype: int64
When using your own index, be aware that pd_ma5x is one element shorter than ma5.
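As a side note, here is a vectorized sketch that avoids the explicit loop entirely (it assumes ma5 computed as in the question's code, and is an alternative rather than the answer above): the sign of each day's change in the moving average gives +1 (rising), -1 (falling) or 0 (flat).

import numpy as np

# +1 = rising, -1 = falling, 0 = flat; the first value is NaN because
# there is no previous day to compare against.
ma5x = np.sign(ma5.diff())
print(ma5x)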

Related

How to speed up a nested loop with hundreds of thousands of combination records in Python?

Consider two data frames, GL_df and RE_df:

GL_df:

GL_ID       GL_ACCNO       GL_DATE     GL_CRDR  GL_AMOUNT   ISMATCHED  MATCH_TABLE  MATCH_ID
1175595887  0004366490004  2022-03-14  C          17482.12  0          NULL         NULL
1175595893  0004366490004  2022-03-14  D            -91.22  0          NULL         NULL
1175595897  0004366490004  2022-03-14  D            -18.24  0          NULL         NULL
1179466130  0004366490004  2022-03-22  D        -400000.00  0          NULL         NULL
1179466158  0004366490004  2022-03-22  D        -500000.00  0          NULL         NULL

RE_df:

RE_ID    RE_ACCNO       RE_DATE     RE_CRDR  RE_AMOUNT   ISMATCHED
1261337  0004366490004  2022-03-22  C         500000.00  0
1261342  0004366490004  2022-03-22  D         -44707.99  0
1261343  0004366490004  2022-03-22  D         -16226.15  0
1261346  0004366490004  2022-03-22  D         -17338.43  0
1261348  0004366490004  2022-03-22  C         500000.00  0
From the above, I have to find all possible combinations of IDs across the two data frames where the ISMATCHED column value equals zero:
from itertools import combinations
import pandas as pd
import numpy as np
Max_per_com = 4
dict_data = GL_df.loc[GL_df['ISMATCHED']==0].set_index('GL_ID')['GL_AMOUNT'].to_dict()
GL_per_com_list = [i for j in range(Max_per_com) for i in combinations(dict_data, j) if sum(map(dict_data.get, i))]
dict_data = RE_df.loc[RE_df['ISMATCHED']==0].set_index('RE_ID')['RE_AMOUNT'].to_dict()
RE_per_com_list = [i for j in range(Max_per_com) for i in combinations(dict_data, j) if sum(map(dict_data.get, i))]
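As an aside, here is a tiny illustration (with a made-up ID-to-amount mapping) of what those comprehensions build: every ID tuple of size 1 up to Max_per_com - 1 whose amounts do not sum to exactly zero. Note that range(Max_per_com) caps the tuple size at Max_per_com - 1.

from itertools import combinations

dict_data = {1: 10.0, 2: -10.0, 3: 5.0}   # hypothetical ID -> amount mapping
Max_per_com = 3
per_com_list = [i for j in range(Max_per_com)
                for i in combinations(dict_data, j)
                if sum(map(dict_data.get, i))]
print(per_com_list)   # [(1,), (2,), (3,), (1, 3), (2, 3)] -- (1, 2) sums to 0 and is dropped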
Then filter and sum all possible combination amounts from the GL_df IDs with the RE_df IDs; when the combined amount is within the variance level, mark those records as matched and skip them for further combinations.
Skip the already matched records from both tables.
Within a combination, the number of distinct GL_df/RE_df dates (format '%m/%Y') should be 1,
and the GL_df and RE_df dates (format '%m/%Y') should be the same.
Variance = 1
for i in range(0, len(GL_per_com_list)):
    if 1 in (GL_df[GL_df['GL_ID'].isin(list(GL_per_com_list[i]))]['ISMATCHED'].values):
        continue
    if len(GL_df[GL_df['GL_ID'].isin(list(GL_per_com_list[i]))]['GL_DATE'].dt.strftime('%m/%Y').unique()) > 1:
        continue
    for j in range(0, len(RE_per_com_list)):
        if 1 in (RE_df[RE_df['RE_ID'].isin(list(RE_per_com_list[j]))]['ISMATCHED'].values):
            continue
        if len(RE_df[RE_df['RE_ID'].isin(list(RE_per_com_list[j]))]['RE_DATE'].dt.strftime('%m/%Y').unique()) > 1:
            continue
        if ((GL_df[GL_df['GL_ID'].isin(list(GL_per_com_list[i]))]['GL_DATE'].dt.strftime('%m/%Y').unique()[0]) != (RE_df[RE_df['RE_ID'].isin(list(RE_per_com_list[j]))]['RE_DATE'].dt.strftime('%m/%Y').unique()[0])):
            continue
        amount = abs((GL_df[GL_df['GL_ID'].isin(list(GL_per_com_list[i]))]['GL_AMOUNT'].sum()) + (RE_df[RE_df['RE_ID'].isin(list(RE_per_com_list[j]))]['RE_AMOUNT'].sum()))
        if amount <= Variance:
            GL_df.loc[GL_df['GL_ID'].isin(list(GL_per_com_list[i])), 'ISMATCHED'] = 1
            RE_df.loc[RE_df['RE_ID'].isin(list(RE_per_com_list[j])), 'ISMATCHED'] = 1
            GL_df.loc[GL_df['GL_ID'].isin(list(GL_per_com_list[i])), 'MATCH_TABLE'] = 'tbl$matched$entry'
            GL_df.loc[GL_df['GL_ID'].isin(list(GL_per_com_list[i])), 'MATCH_ID'] = str(list(RE_per_com_list[j]))
            break
In the above cases it works as expected, but it takes hours to run. I just want to speed it up to seconds or minutes.
Thank God, Big thanks to Myself
I went through the internet and got an idea: to process millions of records efficiently, convert the data into Python dictionaries or JSON records. Here I converted the data frames to JSON and ran the above logic on that data, which cut my processing time to seconds.
GL_df['G_Date'] = GL_df['GL_DATE'].dt.strftime('%Y-%m')
RE_df['R_Date'] = RE_df['RE_DATE'].dt.strftime('%Y-%m')
GL_dict = GL_df.to_json(orient='records')
RE_dict = RE_df.to_json(orient='records')

# Transform json input to python objects
import json
GL_dict = json.loads(GL_dict)
RE_dict = json.loads(RE_dict)

for i in range(0, len(GL_per_com_list)):
    if (len([k for k in [dt for dt in GL_dict if dt['GL_ID'] in list(GL_per_com_list[i])] if k['ISMATCHED'] == 1]) > 0):
        continue
    if (len(set(k['G_Date'] for k in [dt for dt in GL_dict if dt['GL_ID'] in list(GL_per_com_list[i])])) > 1):
        continue
    for j in range(0, len(RE_per_com_list)):
        if (len([k for k in [dt for dt in RE_dict if dt['RE_ID'] in list(RE_per_com_list[j])] if k['ISMATCHED'] == 1]) > 0):
            continue
        if (len(set(k['R_Date'] for k in [dt for dt in RE_dict if dt['RE_ID'] in list(RE_per_com_list[j])])) > 1):
            continue
        if (set(k['G_Date'] for k in [dt for dt in GL_dict if dt['GL_ID'] in list(GL_per_com_list[i])]) != set(k['R_Date'] for k in [dt for dt in RE_dict if dt['RE_ID'] in list(RE_per_com_list[j])])):
            continue
        amount = abs(sum(k['GL_AMOUNT'] for k in [dt for dt in GL_dict if dt['GL_ID'] in list(GL_per_com_list[i])]) + sum(k['RE_AMOUNT'] for k in [dt for dt in RE_dict if dt['RE_ID'] in list(RE_per_com_list[j])]))
        if amount <= Variance:
            for k in [dt for dt in GL_dict if dt['GL_ID'] in list(GL_per_com_list[i])]:
                k['ISMATCHED'] = 1
                k['MATCH_TABLE'] = 'tbl$matched$entry'
                k['MATCH_ID'] = str(list(RE_per_com_list[j]))
            for k in [dt for dt in RE_dict if dt['RE_ID'] in list(RE_per_com_list[j])]:
                k['ISMATCHED'] = 1
            break
With the above, the execution time is about four and a half seconds for 200 thousand records, whereas the earlier pandas data frame filtering took around 13 minutes.
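A rough sketch of why the rewrite helps (hypothetical data; absolute numbers will vary by machine): filtering a DataFrame with isin inside a loop pays a large per-call overhead, while a precomputed dict is a cheap per-ID lookup.

import timeit
import pandas as pd

df = pd.DataFrame({'GL_ID': range(10_000), 'GL_AMOUNT': [1.0] * 10_000})
amounts = df.set_index('GL_ID')['GL_AMOUNT'].to_dict()   # precomputed lookup
ids = list(range(0, 10_000, 97))

slow = timeit.timeit(lambda: df[df['GL_ID'].isin(ids)]['GL_AMOUNT'].sum(), number=100)
fast = timeit.timeit(lambda: sum(amounts[i] for i in ids), number=100)
print(f'DataFrame filter: {slow:.3f}s  dict lookup: {fast:.3f}s')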

Looping over pandas DataFrame

I have a weird issue: the result doesn't change from one iteration to the next. The code is the following:
import pandas as pd
import numpy as np

X = np.arange(10, 100)
Y = X[::-1]
Z = np.array([X, Y]).T
df = pd.DataFrame(Z, columns=['col1', 'col2'])
dif = df['col1'] - df['col2']

for gap in range(100):
    Up = dif > gap
    Down = dif < -gap
    df.loc[Up, 'predict'] = 'Up'
    df.loc[Down, 'predict'] = 'Down'
    df_result = df.dropna()
    Total = df.shape[0]
    count = df_result.shape[0]
    ratio = count/Total
    print(f'Total: {Total}; count: {count}; ratio: {ratio}')
The result is always
Total: 90; count: 90; ratio: 1.0
when it shouldn't be.
Found the root of the problem 5 minutes after posting this question: the 'predict' values written in earlier iterations stay in df, so df.dropna() no longer drops anything in later iterations. I just needed to reset the DataFrame to the original on each iteration to fix the problem.
import pandas as pd
import numpy as np

X = np.arange(10, 100)
Y = X[::-1]
Z = np.array([X, Y]).T
df = pd.DataFrame(Z, columns=['col1', 'col2'])
df2 = df.copy()  # added this line to preserve the original df
dif = df['col1'] - df['col2']

for gap in range(100):
    df = df2.copy()  # reset the altered df back to the original
    Up = dif > gap
    Down = dif < -gap
    df.loc[Up, 'predict'] = 'Up'
    df.loc[Down, 'predict'] = 'Down'
    df_result = df.dropna()
    Total = df.shape[0]
    count = df_result.shape[0]
    ratio = count/Total
    print(f'Total: {Total}; count: {count}; ratio: {ratio}')
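An alternative sketch (not the poster's fix, just an illustration of the same counts): since dropna() here only counts the rows that received a 'predict' label, the ratio can be computed directly from a Boolean mask, so nothing is ever written to df and nothing needs resetting.

import pandas as pd
import numpy as np

X = np.arange(10, 100)
df = pd.DataFrame({'col1': X, 'col2': X[::-1]})
dif = df['col1'] - df['col2']

for gap in range(100):
    count = int((dif.abs() > gap).sum())   # rows that would be labelled 'Up' or 'Down'
    Total = len(df)
    print(f'Total: {Total}; count: {count}; ratio: {count / Total}')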

Optimizing using Pandas Data Frame

I have the following function that loads a CSV into a data frame and then does some calculations. It takes about 4-5 minutes to run on a CSV with a little over 100,000 lines. I was hoping there is a faster way.
def calculate_adeck_errors(in_file):
    print(f'Starting Data Calculations: {datetime.datetime.now().strftime("%I:%M%p on %B %d, %Y")}')
    pd.set_option('display.max_columns', 12)
    # read in the raw csv
    adeck_df = pd.read_csv(in_file)
    #print(adeck_df)
    # extract only the carq items and remove duplicates
    carq_data = adeck_df[(adeck_df.MODEL == 'CARQ') & (adeck_df.TAU == 0)].drop_duplicates(keep='last')
    #print(carq_data)
    # remove carq items from original
    final_df = adeck_df[adeck_df.MODEL != 'CARQ']
    #print(final_df)
    row_list = []
    for index, row in carq_data.iterrows():
        position_time = row['POSDATETIME']
        for index, arow in final_df.iterrows():
            if arow['POSDATETIME'] == position_time:
                # match, so do calculations
                storm_id = arow['STORMID']
                model_base_time = arow['MODELDATETIME']
                the_hour = arow['TAU']
                the_model = arow['MODEL']
                point1 = float(row['LAT']), float(row['LON'])
                point2 = float(arow['LAT']), float(arow['LON'])
                if arow['LAT'] == 0.0:
                    dist_error = None
                else:
                    dist_error = int(round(haversine(point1, point2, miles=True)))
                if arow['WIND'] != 0:
                    wind_error = int(abs(int(row['WIND']) - int(arow['WIND'])))
                else:
                    wind_error = None
                if arow['PRES'] != 0:
                    pressure_error = int(abs(int(row['PRES']) - int(arow['PRES'])))
                else:
                    pressure_error = None
                lat_carq = row['LAT']
                lon_carq = row['LON']
                lat_model = arow['LAT']
                lon_model = arow['LON']
                wind_carq = row['WIND']
                wind_model = arow['WIND']
                pres_carq = row['PRES']
                pres_model = arow['PRES']
                row_list.append([storm_id, model_base_time, the_model, the_hour, lat_carq, lon_carq, lat_model, lon_model, dist_error,
                                 wind_carq, wind_model, wind_error, pres_carq, pres_model, pressure_error])
    result_df = pd.DataFrame(row_list)
    result_df = result_df.where((pd.notnull(result_df)), None)
    result_cols = ['StormID', 'ModelBasetime', 'Model', 'Tau',
                   'LatCARQ', 'LonCARQ', 'LatModel', 'LonModel', 'DistError',
                   'WindCARQ', 'WindModel', 'WindError',
                   'PresCARQ', 'PresModel', 'PresError']
    result_df.columns = result_cols

calculate_adeck_errors(infile)
To clarify what I'm doing:
1. The CARQ entries are the control (actual).
2. The other models are the guesses.
3. I'm comparing the control (CARQ) to the guesses to see what their errors are.
4. The basis of the comparison is MODELBASETIME = POSBASETIME.
5. A sample file I'm processing is here: http://vortexweather.com/downloads/adeck/aal062018.csv
I was hoping there is a faster way than what I'm doing, or another pandas method besides iterrows.
Many thanks for any suggestions.
Bryan
This code takes about 10 seconds to run on your entire dataset!
The code looks very similar to what you have written, except that all of the operations within main_function have been vectorized. See Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects
2018-09-13_adeck_error_calculations.ipynb
import pandas as pd
import numpy as np
import datetime
from haversine import haversine

def main_function(df, row):
    """
    The main difference here is that everything is vectorized
    Returns: DataFrame
    """
    df_new = pd.DataFrame()
    df_storage = pd.DataFrame()
    pos_datetime = df.POSDATETIME.isin([row['POSDATETIME']])  # creates a Boolean map
    array_len = len(pos_datetime)
    new_index = pos_datetime.index
    df_new['StormID'] = df.loc[pos_datetime, 'STORMID']
    df_new['ModelBaseTime'] = df.loc[pos_datetime, 'MODELDATETIME']
    df_new['Model'] = df.loc[pos_datetime, 'MODEL']
    df_new['Tau'] = df.loc[pos_datetime, 'TAU']

    # Distance
    df_new['LatCARQ'] = pd.DataFrame(np.full((array_len, 1), row['LAT']), index=new_index).loc[pos_datetime, 0]
    df_new['LonCARQ'] = pd.DataFrame(np.full((array_len, 1), row['LON']), index=new_index).loc[pos_datetime, 0]
    df_new['LatModel'] = df.loc[pos_datetime, 'LAT']
    df_new['LonModel'] = df.loc[pos_datetime, 'LON']

    def calc_dist_error(row):
        return round(haversine((row['LatCARQ'], row['LonCARQ']), (row['LatModel'], row['LonModel']), miles=True)) if row['LatModel'] != 0.0 else None

    df_new['DistError'] = df_new.apply(calc_dist_error, axis=1)

    # Wind
    df_new['WindCARQ'] = pd.DataFrame(np.full((array_len, 1), row['WIND']), index=new_index).loc[pos_datetime, 0]
    df_new['WindModel'] = df.loc[pos_datetime, 'WIND']
    df_storage['row_WIND'] = pd.DataFrame(np.full((array_len, 1), row['WIND']), index=new_index).loc[pos_datetime, 0]
    df_storage['df_WIND'] = df.loc[pos_datetime, 'WIND']

    def wind_error_calc(row):
        return (row['row_WIND'] - row['df_WIND']) if row['df_WIND'] != 0 else None

    df_new['WindError'] = df_storage.apply(wind_error_calc, axis=1)

    # Air Pressure
    df_new['PresCARQ'] = pd.DataFrame(np.full((array_len, 1), row['PRES']), index=new_index).loc[pos_datetime, 0]
    df_new['PresModel'] = df.loc[pos_datetime, 'PRES']
    df_storage['row_PRES'] = pd.DataFrame(np.full((array_len, 1), row['PRES']), index=new_index).loc[pos_datetime, 0]
    df_storage['df_PRES'] = df.loc[pos_datetime, 'PRES']

    def pres_error_calc(row):
        return abs(row['row_PRES'] - row['df_PRES']) if row['df_PRES'] != 0 else None

    df_new['PresError'] = df_storage.apply(pres_error_calc, axis=1)
    del(df_storage)
    return df_new

def calculate_adeck_errors(in_file):
    """
    Returns: DataFrame
    """
    print(f'Starting Data Calculations: {datetime.datetime.now().strftime("%I:%M:%S%p on %B %d, %Y")}')
    pd.set_option('max_columns', 20)
    pd.set_option('max_rows', 300)
    # read in the raw csv
    adeck_df = pd.read_csv(in_file)
    adeck_df['MODELDATETIME'] = pd.to_datetime(adeck_df['MODELDATETIME'], format='%Y-%m-%d %H:%M')
    adeck_df['POSDATETIME'] = pd.to_datetime(adeck_df['POSDATETIME'], format='%Y-%m-%d %H:%M')
    # extract only the carq items and remove duplicates
    carq_data = adeck_df[(adeck_df.MODEL == 'CARQ') & (adeck_df.TAU == 0)].drop_duplicates(keep='last')
    print('Len carq_data: ', len(carq_data))
    # remove carq items from original
    final_df = adeck_df[adeck_df.MODEL != 'CARQ']
    print('Len final_df: ', len(final_df))
    df_out_new = pd.DataFrame()
    for index, row in carq_data.iterrows():
        test_df = main_function(final_df, row)  # function call
        df_out_new = df_out_new.append(test_df, sort=False)
    df_out_new = df_out_new.reset_index(drop=True)
    df_out_new = df_out_new.where((pd.notnull(df_out_new)), None)
    print(f'Finishing Data Calculations: {datetime.datetime.now().strftime("%I:%M:%S%p on %B %d, %Y")}')
    return df_out_new

in_file = 'aal062018.csv'
df = calculate_adeck_errors(in_file)
>>>Starting Data Calculations: 02:18:30AM on September 13, 2018
>>>Len carq_data: 56
>>>Len final_df: 137999
>>>Finishing Data Calculations: 02:18:39AM on September 13, 2018
print(len(df))
>>>95630
print(df.head(20))
Please don't forget to check the accepted solution. Enjoy!
Looks like you are creating two dataframes out of the same dataframe, and then processing them. Two things that may cut your time.
First, you are iterating over both dataframes and checking for a condition:
for _, row in carq_data.iterrows():
    for _, arow in final_df.iterrows():
        if arow['POSDATETIME'] == row['POSDATETIME']:
            # do something by using both tables
This is essentially an implementation of a join. You are joining carq_data with final_df on 'POSDATETIME'.
As a first step, you should merge the tables:
merged = carq_data.merge(final_df, on=['POSDATETIME'])
At this point you will get multiple rows for each matching 'POSDATETIME'. In the example below, let's assume column a plays the role of POSDATETIME:
>>> a
a b
0 1 11
1 1 33
>>> b
a b
0 1 2
1 1 3
2 1 4
>>> merged = a.merge(b, on=['a'])
>>> merged
a b_x b_y
0 1 11 2
1 1 11 3
2 1 11 4
3 1 33 2
4 1 33 3
5 1 33 4
Now, to do your conditional calculations, you can use the apply() function.
First, define a function:
def calc_dist_error(row):
    return int(round(haversine(row['b_x'], row['b_y'], miles=True))) if row['a'] != 0.0 else None
Then apply it to every row:
merged['dist_error'] = merged.apply(calc_dist_error, axis=1)
Continuing my small example:
>>> merged['c'] = [1, 0, 0, 0, 2, 3]
>>> merged
a b_x b_y c
0 1 11 2 1
1 1 11 3 0
2 1 11 4 0
3 1 33 2 0
4 1 33 3 2
5 1 33 4 3
>>> def foo(row):
...     return row['b_x'] - row['b_y'] if row['c'] != 0 else None
...
>>> merged['dist_error'] = merged.apply(foo, axis=1)
>>> merged
a b_x b_y c dist_error
0 1 11 2 1 9.0
1 1 11 3 0 NaN
2 1 11 4 0 NaN
3 1 33 2 0 NaN
4 1 33 3 2 30.0
5 1 33 4 3 29.0
This should help you reduce run time (see also this for how to check using %timeit). Hope this helps!
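As an untested sketch of how the merge idea might look with the question's actual column names (the suffix names and the assumption that carq_data and final_df are built exactly as in the question are mine, not the answer's):

from haversine import haversine  # same library the question already uses

# overlapping columns get '_carq' / '_model' suffixes after the merge
merged = carq_data.merge(final_df, on='POSDATETIME', suffixes=('_carq', '_model'))

def dist_error(row):
    if row['LAT_model'] == 0.0:
        return None
    return int(round(haversine((row['LAT_carq'], row['LON_carq']),
                               (row['LAT_model'], row['LON_model']), miles=True)))

merged['DistError'] = merged.apply(dist_error, axis=1)
# Boolean .where keeps the difference only where the model value is non-zero, else NaN
merged['WindError'] = (merged['WIND_carq'] - merged['WIND_model']).abs().where(merged['WIND_model'] != 0)
merged['PresError'] = (merged['PRES_carq'] - merged['PRES_model']).abs().where(merged['PRES_model'] != 0)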

TypeError: ("Cannot compare type 'Timestamp' with type 'str'", 'occurred at index 262224')

I am trying to create a date flag from a datetime column, but I am getting an error after applying the function below.
def f(r):
    if r['balance_dt'] <= '2016-11-30':
        return 0
    else:
        return 1

df_obctohdfc['balance_dt_flag'] = df_obctohdfc.apply(f, axis=1)
The error you are getting is because you are comparing a string object to a datetime object. You can convert the string to datetime.
Ex:
import datetime

def f(r):
    if r['balance_dt'] <= datetime.datetime.strptime('2016-11-30', '%Y-%m-%d'):
        return 0
    else:
        return 1

df_obctohdfc['balance_dt_flag'] = df_obctohdfc.apply(f, axis=1)
Note: It is better to do it the way jezrael has mentioned. That is the right way to do it.
In pandas it is best to avoid loops; see how apply works under the hood.
I think you need to convert the string to a datetime and then cast the Boolean mask to integer (True to 1 and False to 0), changing <= to >:
timestamp = pd.to_datetime('2016-11-30')
df_obctohdfc['balance_dt_flag'] = (df_obctohdfc['balance_dt'] > timestamp).astype(int)
Sample:
rng = pd.date_range('2016-11-27', periods=10)
df_obctohdfc = pd.DataFrame({'balance_dt': rng})
#print (df_obctohdfc)
timestamp = pd.to_datetime('2016-11-30')
df_obctohdfc['balance_dt_flag'] = (df_obctohdfc['balance_dt'] > timestamp).astype(int)
print (df_obctohdfc)
balance_dt balance_dt_flag
0 2016-11-27 0
1 2016-11-28 0
2 2016-11-29 0
3 2016-11-30 0
4 2016-12-01 1
5 2016-12-02 1
6 2016-12-03 1
7 2016-12-04 1
8 2016-12-05 1
9 2016-12-06 1
Comparing in 1000 rows DataFrame:
In [140]: %timeit df_obctohdfc['balance_dt_flag1'] = (df_obctohdfc['balance_dt'] > timestamp).astype(int)
1000 loops, best of 3: 368 µs per loop
In [141]: %timeit df_obctohdfc['balance_dt_flag2'] = df_obctohdfc.apply(f,axis=1)
10 loops, best of 3: 91.2 ms per loop
Setup:
rng = pd.date_range('2015-11-01', periods=1000)
df_obctohdfc = pd.DataFrame({'balance_dt': rng})
#print (df_obctohdfc)
timestamp = pd.to_datetime('2016-11-30')
import datetime

def f(r):
    if r['balance_dt'] <= datetime.datetime.strptime('2016-11-30', '%Y-%m-%d'):
        return 0
    else:
        return 1

How to find the element in a list with the most occurrences using a loop

Assume the list is
[1,2,4,4,3,3,3]
and upper_limit is 8 (no number is bigger than upper_limit).
How do I produce 3 with the help of upper_limit?
It must run in O(n + upper_limit) overall time.
a = [1,1,2,2,2,3,3,4]
cnt = {}
most = a[0]
for x in a:
    if x not in cnt:
        cnt[x] = 0
    cnt[x] += 1
    if cnt[x] > cnt[most]:
        most = x
print(most)
You can use collections.Counter
from collections import Counter
lst = [1,2,4,4,3,3,3]
counter = Counter(lst)
most_freq, freq = counter.most_common()[0]
An alternative to Counter using a dictionary:
from collections import defaultdict

d = defaultdict(int)
lst = [1,2,4,4,3,3,3]
for val in lst:
    d[val] = d[val] + 1
most_freq, freq = max(d.items(), key=lambda t: t[1])
This example keeps track of the most frequent element as you iterate, circumventing the need for the max function.
from collections import defaultdict

d = defaultdict(int)
lst = [1,2,4,4,3,3,3]
most_freq, freq = None, 0
for val in lst:
    d[val] = d[val] + 1
    if d[val] > freq:
        most_freq = val
        freq = d[val]
print(most_freq, freq)
If you don't want to use a defaultdict then you can just use a normal dictionary and use dict.setdefault instead.
d = {} # set d to dict
d[val] = d.setdefault(val, 0) + 1
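For completeness, here is a small sketch of the O(n + upper_limit) approach the question actually asks for, using upper_limit as the size of a counting array (it assumes non-negative integers no larger than upper_limit):

lst = [1, 2, 4, 4, 3, 3, 3]
upper_limit = 8

counts = [0] * (upper_limit + 1)        # O(upper_limit) space
for x in lst:                           # O(n) counting pass
    counts[x] += 1

most = max(range(upper_limit + 1), key=lambda v: counts[v])   # O(upper_limit) scan
print(most)                             # 3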
