Joining a temporary DataFrame to permanent DataFrame - python-3.x

I am trying to join a temporary DataFrame, built inside a loop, to a predefined DataFrame. My code template is as follows:
df1_date = pd.date_range('2008-01-01', '2017-01-01')
df1 = pd.DataFrame(index=df1_date)
# A loop:
dates = pd.date_range('2008-01-01', '2009-01-01')
df2 = pd.DataFrame(data=data, index=dates)
df1 = df1.join(df2, how='left')
When the loop runs for the first time, the DataFrame calculated in the loop joins with the DataFrame defined at the beginning. But when the loop runs the next time, it gives the following error:
ValueError: columns overlap but no suffix specified: Index(['Valuation'], dtype='object')
On the next iteration the loop calculates values for the next time period, and I want those to join with the permanent DataFrame rather than raise the above error.
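One way to avoid the overlap error is combine_first, which aligns on the index and fills in only the missing values, so a column that already exists never collides with itself. A minimal sketch, assuming each iteration produces a 'Valuation' column over a different date window (the data values here are made up):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=pd.date_range('2008-01-01', '2017-01-01'))

# Each pass of the loop covers a different, non-overlapping window
for start, end in [('2008-01-01', '2009-01-01'), ('2009-01-02', '2010-01-01')]:
    dates = pd.date_range(start, end)
    df2 = pd.DataFrame({'Valuation': np.arange(len(dates), dtype=float)}, index=dates)
    # combine_first keeps df1's existing values and fills gaps from df2,
    # so the repeated 'Valuation' column never raises an overlap error
    df1 = df1.combine_first(df2)
```

Alternatively, join accepts lsuffix/rsuffix to disambiguate overlapping column names, but that creates a new column per iteration rather than filling one column across periods.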

Related

Pandas discard items in set using a different set

I have two columns in a pandas dataframe; parents and cte. Both columns are made up of sets. I want to use the cte column to discard overlapping items in the parents column. The dataframe is made up of over 6K rows. Some of the cte rows have empty sets.
Below is a sample:
data = {'parents': [{'loan_agreement', 'select', 'opportunity', 'sales.dim_date', 'sales.flat_opportunity_detail', 'dets', 'dets2', 'channel_partner'}
,{'seed', 'dw_salesforce.sf_dw_partner_application'}],
'cte': [{'dets', 'dets2'}, {'seed'}]}
df = pd.DataFrame(data)
df
I've used .discard(cte) previously but I can't figure out how to get it to work.
I would like the output to look like the following:
data = {'parents': [{'loan_agreement', 'select', 'opportunity', 'sales.dim_date', 'sales.flat_opportunity_detail', 'channel_partner'}
,{'dw_salesforce.sf_dw_partner_application'}],
'cte': [{'dets', 'dets2'}, {'seed'}]}
df = pd.DataFrame(data)
df
NOTE: dets, dets2 and seed have been removed from the corresponding parents cell.
Once the cte is compared to the parents, I don't need data from that row again. The next row will only compare data on that row and so on.
You need to use a loop here.
A list comprehension will likely be the fastest:
df['parents'] = [P.difference(C) for P,C in zip(df['parents'], df['cte'])]
output:
parents cte
0 {channel_partner, select, opportunity, loan_ag... {dets, dets2}
1 {dw_salesforce.sf_dw_partner_application} {seed}
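Put together with the sample data from the question, the full runnable version of that answer looks like this (`-` is the operator form of set.difference):

```python
import pandas as pd

data = {'parents': [{'loan_agreement', 'select', 'opportunity', 'sales.dim_date',
                     'sales.flat_opportunity_detail', 'dets', 'dets2', 'channel_partner'},
                    {'seed', 'dw_salesforce.sf_dw_partner_application'}],
        'cte': [{'dets', 'dets2'}, {'seed'}]}
df = pd.DataFrame(data)

# Row-wise set difference: subtract each cte set from its parents set
df['parents'] = [p - c for p, c in zip(df['parents'], df['cte'])]
```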

How to calculate the diff between values of 2 adjacent values across every column in a pandas dataframe?

I made a dataset of shape (252,60) by concatenating the ['Close'] columns of every stock in the Sensex-30 index and adding columns made by shifting each ['Close'] column down by one row. I wanted to compute the difference between the shifted price and the current price for every day and every stock. I tried to do so in a colab notebook, but I get the error IndexError: single positional indexer is out-of-bounds
The dataset and code are too long to be shown, so you can look at them in this colab notebook
Reducing your code, I find the below works:
import io

import pandas as pd
import requests

df = pd.DataFrame()
for stock in ['RELIANCE','INFY','HCLTECH','TCS','BAJAJ-AUTO',
              'TITAN','LT','NESTLEIND','TECHM','ASIANPAINT',
              'M&M','ICICIBANK','POWERGRID','HINDUNILVR','SUNPHARMA',
              'TATASTEEL','AXISBANK','SBIN','ULTRACEMCO','BAJAJFINSV',
              'ITC','NTPC','BAJFINANCE','BHARTIARTL','MARUTI',
              'KOTAKBANK','HDFC','HDFCBANK','ONGC','INDUSINDBK']:
    url = "https://query1.finance.yahoo.com/v7/finance/download/"+stock+".BO?period1=1577110559&period2=1608732959&interval=1d&events=history&includeAdjustedClose=true"
    df = pd.concat([df, pd.read_csv(io.BytesIO(requests.get(url).content), index_col="Date")
                          .loc[:,"Close"]
                          .to_frame().rename(columns={"Close":stock})], axis=1)

# Bind c at definition time (c=c); otherwise every lambda would close over
# the loop variable and use only the last column name
profit = {f"{c}_profit": lambda dfa, c=c: dfa[c]-dfa[c].shift(periods=1) for c in df.columns}
df = df.assign(**profit)
df.shape
output
(252, 60)
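The same per-column difference can also be computed without building lambdas at all: DataFrame.diff() subtracts the previous row from each row, column by column. A minimal sketch with made-up prices standing in for the close-price frame (column names assumed):

```python
import pandas as pd

# Toy stand-in for the close-price frame; real data would have 30 columns
df = pd.DataFrame({'RELIANCE': [100.0, 102.0, 101.0],
                   'INFY': [50.0, 49.0, 52.0]})

# diff() computes row - previous row for every column; the first row is NaN
profit = df.diff().add_suffix('_profit')
df = pd.concat([df, profit], axis=1)
```

This matches dfa[c] - dfa[c].shift(1) exactly, and doubles the column count just as in the question's (252, 60) shape.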

How to store a df.column in a list without index in a loop?

df.shape: (15, 4)
I want to store the 4th column of df within the loop in a list. What I'm trying is:
l = []
n = 1000  # no. of iterations
for i in range(0, n):
    # df expressions and result-calculation equations
    l.append(df.iloc[:,2])  # This is storing values with the index. I want to store them without indices while keeping inplace=True.
df_new = pd.DataFrame(np.array(l), columns=df.index)
I want the l list to append only the values from the df column, not a Series object of the pandas.core.series module in each cell.
Use df.iloc[:,2].tolist() inside append to get the desired result.
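A minimal runnable sketch of that fix, with a small made-up frame and a few iterations in place of the question's 1000:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(60).reshape(15, 4))

l = []
n = 3  # a few iterations for illustration; the question uses 1000
for i in range(n):
    # .tolist() strips the index, leaving plain values instead of a Series
    l.append(df.iloc[:, 2].tolist())

df_new = pd.DataFrame(np.array(l), columns=df.index)
```

Each element of l is now a plain Python list, so np.array(l) stacks cleanly into an (n, 15) array.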

Python: DataFrame Index shifting

I have several dataframes that I have concatenated with pandas in the line:
xspc = pd.concat([df1,df2,df3], axis = 1, join_axes = [df3.index])
In df2 the index values read one day later than those of df1 and df3. So, for instance, when the most current date is 7/1/19, the index values for df1 and df3 read '7/1/19' while df2 reads '7/2/19'. I would like to concatenate the dataframes so that each is joined on its most recent date; in other words, I would like the values from df1 index '7/1/19' to be concatenated with df2 index '7/2/19' and df3 index '7/1/19'. What methods can I use to shift the data around to join on these non-matching index values?
You can reset the index of the data frame and then concat the dataframes
df1=df1.reset_index()
df2=df2.reset_index()
df3=df3.reset_index()
df_final = pd.concat([df1,df2,df3],axis=1, join_axes=[df3.index])
This should work, since you mentioned that the date in df2 will be one day after df1 or df3.
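Note that join_axes was removed from pd.concat in pandas 1.0, so on current pandas the same effect needs reindex. An alternative sketch, with made-up single-column frames: since df2's dates are known to run exactly one day late, shifting its index back one day lets concat align the rows directly:

```python
import pandas as pd

idx = pd.date_range('2019-06-29', '2019-07-01')
df1 = pd.DataFrame({'a': [1, 2, 3]}, index=idx)
df3 = pd.DataFrame({'c': [7, 8, 9]}, index=idx)
# df2's dates run one day later than df1/df3
df2 = pd.DataFrame({'b': [4, 5, 6]}, index=idx + pd.Timedelta(days=1))

# Shift df2's index back one day so its rows line up with df1/df3,
# then restrict the result to df3's index (the modern join_axes replacement)
df2_aligned = df2.copy()
df2_aligned.index = df2_aligned.index - pd.Timedelta(days=1)
xspc = pd.concat([df1, df2_aligned, df3], axis=1).reindex(df3.index)
```

Unlike reset_index, this keeps the dates as the index, which matters if anything downstream selects by date.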

pandas read_csv create new column and usecols at the same time

I'm trying to load multiple csv files into a single dataframe df while:
adding column names
adding and populating a new column (Station)
excluding one of the columns (QD)
All of this works fine until I attempt to exclude a column with usecols, which throws the error Too many columns specified: expected 5 and found 4.
Is it possible to create a new column and pass usecols at the same time?
The reason I'm creating and populating a new 'Station' column during read_csv is that my dataframe will contain data from multiple stations. I can work around the error by doing read_csv in one statement and dropping the QD column in the next with df.drop('QD', axis=1, inplace=True), but I want to make sure I understand how to do this in the most pandas-idiomatic way possible.
Here's the code that throws the error:
df = pd.concat(pd.read_csv("http://lgdc.uml.edu/common/DIDBGetValues?ursiCode=" + row['StationCode'] + "&charName=MUFD&DMUF=3000",
                           skiprows=17,
                           delim_whitespace=True,
                           parse_dates=[0],
                           usecols=['Time','CS','MUFD','Station'],
                           names=['Time','CS','MUFD','QD','Station']
                           ).fillna(row['StationCode']
                           ).set_index(['Time', 'Station'])
               for index, row in stationdf.iterrows())
Example StationCode from stationdf: BC840.
Data sample: 2016-09-19T00:00:05.000Z 100 19.34 //
You can create a new column using operator chaining with assign:
df = pd.read_csv(...).assign(StationCode=row['StationCode'])
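A fuller sketch of that approach: names lists only the columns that actually exist in the file, usecols drops QD at read time, and assign adds Station afterwards, so the column counts never disagree. The inline data below stands in for one station's download, and the station code is just an illustrative value from the question:

```python
import io

import pandas as pd

# Inline stand-in for one station's download (real data starts after 17 header rows)
raw = ("2016-09-19T00:00:05.000Z 100 19.34 //\n"
       "2016-09-19T00:05:05.000Z 100 19.10 //\n")
station = 'BC840'  # would come from row['StationCode'] in the real loop

df = (pd.read_csv(io.StringIO(raw),
                  sep=r'\s+',
                  parse_dates=['Time'],
                  names=['Time', 'CS', 'MUFD', 'QD'],  # only the 4 real columns
                  usecols=['Time', 'CS', 'MUFD'])      # drop QD at read time
        .assign(Station=station)                       # add the new column afterwards
        .set_index(['Time', 'Station']))
```

The same chain can sit inside the generator expression passed to pd.concat, one frame per row of stationdf.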
