I have this dataframe.
df = pd.DataFrame({'date': np.array(['2021-04-11', '2021-04-12', '2021-04-13', '2021-04-14',
'2021-04-15', '2021-04-16', '2021-04-17', '2021-04-18',
'2021-04-19', '2021-04-20', '2021-04-21', '2021-04-22',
'2021-04-23', '2021-04-24', '2021-04-25', '2021-04-26',
'2021-04-27', '2021-04-28', '2021-04-29', '2021-04-30',
'2021-05-01' ,'2021-05-02', '2021-05-03', '2021-05-04',
'2021-05-05', '2021-05-06', '2021-05-07']),
'value': np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,
24,25,26,27])})
I want to split it into 3 parts (train, val and test).
For example:
split=0.7 # split perc
last=7 # keep 7 last days for test data
split_idx = int(df.shape[0] * split)
train_df = df[:split_idx]
val_df = df[split_idx:-last]
test_df = df[(train_df.shape[0] + val_df.shape[0]):]
So, now I have:
len(train_df), len(val_df), len(test_df) = 18, 2, 7
I want the lengths to be divisible by 7, so:
if len(train_df) % 7 != 0:
    # move those rows to the beginning of val_df
    val_df.loc[0] =
    # drop those rows from train_df
    train_df.drop(train_df.tail(len(train_df) % 7).index, inplace=True)
If the length of train_df is not divisible by 7, then I want to move those last rows of data to the beginning of the val_df data and then drop them from train_df. The same applies to val_df. The test_df will always have at least 7 values, so if it has more I will just drop the extras.
So, I found an answer!
if len(train_df) % 7 != 0:
    # move those rows to the beginning of val_df
    rows = train_df.tail(len(train_df) % 7)
    val_df = pd.concat([rows, val_df]).sort_index()
    # drop those rows from train_df
    train_df.drop(rows.index, inplace=True)
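For completeness, a sketch of the same idea applied to val_df and test_df. This is an assumption about the intended rule, since the question only says the extra test rows are dropped: move val_df's remainder rows to the front of test_df, then keep only the last 7 rows of test_df.

if len(val_df) % 7 != 0:
    # move the remainder rows of val_df to the beginning of test_df
    rows = val_df.tail(len(val_df) % 7)
    test_df = pd.concat([rows, test_df]).sort_index()
    val_df = val_df.drop(rows.index)

# test_df always has at least 7 rows; keep only the last 7 days (assumption)
test_df = test_df.tail(7)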
I have a csv file which has around 58 million cells containing numerical data. I want to extract blocks of 16 cells (4 rows x 4 columns) that are 49 rows apart.
Let me describe it clearly.
The first set of data to be extracted is rows 23 to 26, columns 92 to 95. This data has to be written to another csv file (preferably as a single row).
Then I move down 49 rows (to row 72) and extract the next 4 rows x 4 columns block, and so on.
I need to keep going until I reach the end of the file, extracting thousands of such blocks.
I had written code for this but it's not working, and I don't know where the mistake is. I will also attach it here.
import pandas as pd
import numpy
df = pd.read_csv('TS_trace31.csv')
# print(numpy.shape(df))
arrY = []
ex = 0
for i in range(len(df)):
    if i == 0:
        for j in range(4):
            l = (df.iloc[j+21+i*(49), 91:95]).tolist()
            arrY.append(l)
    else:
        for j in range(4):
            if j+22+i*(49) >= len(df):
                ex = 1
                break
            # print(j)
            l = (df.iloc[j+21+i*(49), 91:95]).tolist()
            arrY.append(l)
    if ex == 1:
        break
# print(arrY)
a = []
for i in range(len(arrY) - 3):
    p = arrY[i]+arrY[i+1]+arrY[i+2]+arrY[i+3]
    a.append(p)
print(numpy.shape(a))
numpy.savetxt('myfile.csv', a, delimiter=',')
Using the above code, I didn't get the result I wanted.
Please help with this and correct where I have gone wrong.
I couldn't attach my csv file here; please use any sample sheet you have, or create a simple one.
Thanks in advance! Have a great day.
I don't know exactly what you are doing in your code, but I wrote my own:
import csv
from itertools import chain
CSV_PATH = 'TS_trace31.csv'
new_data = []
with open(CSV_PATH, 'r') as csvfile:
    reader = csv.reader(csvfile)
    # row_num for storing big jumps e.g. 23, 72, 121 ...
    row_num = 23
    # n for storing the group number 0 - 3
    # with n we can find the 23, 24, 25, 26
    n = 0
    # row_group for storing every 4-row group
    row_group = []
    # looping over every row in the main file
    for row in reader:
        if reader.line_num == row_num + n:
            # for the first time this is going to be 23 + 0
            # then we add one to n
            # so the next cycle will be 24 and so on
            n += 1
            print(reader.line_num)
            # add each row to its group
            row_group.append(row[91:95])
            # check if we are at the end of the group e.g. 26
            if n == 4:
                # reset the group number
                n = 0
                # add the jump to the main row number
                row_num += 49
                # combine the whole row_group into a single row
                new_data.append(list(chain(*row_group)))
                # clear the row_group for the next set of rows
                row_group.clear()
                print('='*50)
        else:
            continue
# and finally write all the rows in a new file
with open('myfile.csv', 'w') as new_csvfile:
    writer = csv.writer(new_csvfile)
    writer.writerows(new_data)
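If you would rather stay in pandas, here is a rough sketch of the same extraction. It assumes the file has no header row and that the blocks start at row 23 and repeat every 49 rows in columns 92-95, as described in the question.

import pandas as pd

df = pd.read_csv('TS_trace31.csv', header=None)

blocks = []
start = 22  # 0-indexed position of row 23
while start + 4 <= len(df):
    block = df.iloc[start:start + 4, 91:95]   # 4 rows x 4 columns
    blocks.append(block.to_numpy().ravel())   # flatten the block into one row
    start += 49                               # jump to the next block

pd.DataFrame(blocks).to_csv('myfile.csv', index=False, header=False)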
I want to create a rolling forecast for the next 12 months; the result for each month and entry must become part of the dataframe as well (later it will be written out to Excel as part of a bigger dataframe).
The entries for the new month columns need to be calculated based on the criterion that the forecasted month lies between start_date and start_date + duration and is also within the forecasted 12-month range. If these conditions are met, the value from duration should be written there (see the expected output image).
To do this I imagine I have to use numpy.where(); however, I cannot wrap my head around it.
I came across Use lambda with pandas to calculate a new column conditional on existing column, but after some trying I came to the conclusion that this cannot be the whole answer for my case.
import numpy as np
import pandas as pd
import datetime as dt
months = ["Jan", "Feb", "Mrz", "Apr", "Mai", "Jun", "Jul", "Aug", "Sep", "Okt", "Nov", "Dez"]
cur_month = dt.date.today().month - 1
cur_year = dt.date.today().year
d = {'start_date': ['2020-12-23', '2021-02-08', '2021-06-11', '2022-01-07'], 'duration': [12, 6, 8, 3],
'effort': [0.3, 0.5, 1.2, 0.1]}
df = pd.DataFrame(data=d)
i = 0
while i < 12:
    # this creates the header rows for the 12 month period
    next_month = months[(cur_month + i) % len(months)]
    # here goes the calculation/condition I am stuck with...
    df[next_month] = np.where(...)
    i += 1
So I came up with this and it seems to work. I also added some logic to weight the cases where a project starts some time during the month, so we get a more accurate effort number.
d = {"id": [1,2,3,4], "start_date": ['2020-12-23', '2021-02-08', '2021-06-11', '2022-01-07'], "duration": [12, 6, 8, 3],
"effort": [0.3, 0.5, 1.2, 0.1]}
df = pd.DataFrame(data=d)
df["EndDates"] = df["start_date"].dt.to_period("M") + df_["duration"]
i = 0
forecast = pd.Series(pd.period_range(today, freq="M", periods=12))
while i < 12:
next_month = months[(cur_month + i) % len(months)]
df[next_month] = ""
for index, row in df.iterrows():
df_tmp = df.loc[df['id'] == int(row['id'])]
if not df_tmp.empty and pd.notna(df_tmp["start_date"].item()):
if df_tmp["start_date"].item().to_period("M") <= forecast[i] <= df_tmp["EndDates"].item():
# For the current month let's calculate with the remaining value
if i == 0:
act_enddate = monthrange(today.year, today.month)[1]
weighter = 1 - (int(today.day) / int(act_enddate))
df.at[index, next_month] = round(df_tmp['effort'].values[0] * weighter,
ndigits=2)
# If it is the first entry for the oppty, how many FTEs will be needed for the first month
# of the assignment
elif df_tmp["start_date"].item().to_period("M") == forecast[i]:
first_day = df_tmp["start_date"].item().day
if first_day != 1:
months_enddate = monthrange(forecast[i].year, forecast[i].month)[1]
weighter = 1 - (int(first_day) / int(months_enddate))
df.at[index, next_month] = round(df_tmp['effort'].values[0] * weighter,
ndigits=2)
else:
df.at[index, next_month] = df_tmp['effort'].values[0]
# How many FTEs are needed for the last month of the assignment
elif df_tmp["EndDates"].item() == forecast[i]:
end_day = df_tmp["start_date"].item().day
if end_day != 1:
months_enddate = monthrange(forecast[i].year, forecast[i].month)[1]
weighter = int(end_day) / int(months_enddate)
df.at[index, next_month] = round(df_tmp['Umrechnung in FTEs'].values[0] * weighter,
ndigits=2)
else:
continue
else:
df.at[index, next_month] = df_tmp['effort'].values[0]
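For reference, a more vectorized sketch of the np.where idea from the original question. It assumes start_date is converted to datetime, uses the Period string as the column label instead of the month names, and skips the partial-month weighting.

import numpy as np
import pandas as pd
import datetime as dt

d = {"start_date": ['2020-12-23', '2021-02-08', '2021-06-11', '2022-01-07'],
     "duration": [12, 6, 8, 3], "effort": [0.3, 0.5, 1.2, 0.1]}
df = pd.DataFrame(d)
df["start_date"] = pd.to_datetime(df["start_date"])

start_period = df["start_date"].dt.to_period("M")
end_period = start_period + df["duration"]

forecast = pd.period_range(dt.date.today(), freq="M", periods=12)
for month in forecast:
    # write the effort wherever the forecast month falls inside the project window
    df[str(month)] = np.where((start_period <= month) & (month <= end_period), df["effort"], 0)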
I am using python3 and pandas to create a script that will:
Be dynamic across different dataset lengths (rows) and unique values - completed
Take unique values from column A and create separate dataframes as variables for each unique entry - completed
Add totals to the bottom of each dataframe - completed
Concatenate the separate dataframes back together - incomplete
The issue is I am unable to formulate a way to create a list of the variables in use and pass them as arguments to pd.concat.
The sample dataset may have more or fewer unique BrandFlavors, which is why the script must be flexible and dynamic.
Script:
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')
excel_file = ('testfile.xlsx')
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)
colarr = df.columns.values
arr = df[colarr[0]].unique()
for i in range(len(arr)):
    globals()['var%s' % i] = df.loc[df[colarr[0]] == arr[i]]

for i in range(len(arr)):
    if globals()['var%s' % i].empty:
        ''
    else:
        globals()['var%s' % i] = globals()['var%s' % i].append({'BrandFlavor': 'Total',
                                                                'This': globals()['var%s' % i]['This'].sum(),
                                                                'Last': globals()['var%s' % i]['Last'].sum(),
                                                                'Diff': globals()['var%s' % i]['Diff'].sum(),
                                                                '% Chg': globals()['var%s' % i]['Diff'].sum() / globals()['var%s' % i]['Last'].sum() * 100}, ignore_index=True)
        globals()['var%s' % i]['% Chg'].fillna(0, inplace=True)
        globals()['var%s' % i].fillna(' ', inplace=True)
I have tried this below; however, the list is a series of strings:
vararr = []
count = 0
for x in range(len(arr)):
    vararr.append('var' + str(count))
    count = count + 1
df = pd.concat([vararr])
pd.concat does not recognize a string. I tried to build a class with an arg defined but had the same issue.
The desired outcome would be a code snippet that generates a list of variables matching the ones created by the globals() loop above, which could then be referenced as pd.concat([list, of, vars, here]). It must be dynamic. Thank you
Just fixing the issue at hand: you shouldn't use globals() to make variables; that is not considered good practice. Your code should work with some minor modifications.
import pandas as pd
import warnings
warnings.simplefilter(action='ignore')
excel_file = ('testfile.xlsx')
df = pd.read_excel(excel_file)
df = df.sort_values(by='This', ascending=False)
def good_dfs(dataframe):
    if dataframe.empty:
        pass
    else:
        this = dataframe.This.sum()
        last = dataframe.Last.sum()
        diff = dataframe.Diff.sum()
        data = {
            'BrandFlavor': 'Total',
            'This': this,
            'Last': last,
            'Diff': diff,
            'Pct Change': diff / last * 100
        }
        dataframe = dataframe.append(data, ignore_index=True)  # append returns a new frame, so reassign
        dataframe['Pct Change'].fillna(0.0, inplace=True)
        dataframe.fillna(' ', inplace=True)
    return dataframe
colarr = df.columns.values
arr = df[colarr[0]].unique()
dfs = []
for i in range(len(arr)):
    temp = df.loc[df[colarr[0]] == arr[i]]
    dfs.append(temp)
final_dfs = [good_dfs(d) for d in dfs]
final_df = pd.concat(final_dfs)
Although I will say there are far easier ways to accomplish what you want without doing all of this; that can be a separate question.
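As a rough sketch of one such shortcut (assuming the BrandFlavor / This / Last / Diff / % Chg columns from the question): group by the first column, build a 'Total' row per group, and concatenate everything in one pass.

import pandas as pd

def with_total(group):
    # sum the numeric columns and label the extra row as the group total
    total = group[['This', 'Last', 'Diff']].sum()
    total['BrandFlavor'] = 'Total'
    total['% Chg'] = total['Diff'] / total['Last'] * 100 if total['Last'] else 0
    return pd.concat([group, total.to_frame().T], ignore_index=True)

final_df = pd.concat(with_total(g) for _, g in df.groupby(df.columns[0], sort=False))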
I'm trying to calculate 33 stock betas and write them to a dataframe.
Unfortunately, I get an error in my code:
cannot concatenate object of type "<class 'float'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
import pandas as pd
import numpy as np
stock1=pd.read_excel(r"C:\Users\Кир\Desktop\Uni\Master\Nasdaq\Financials 11.05\Nasdaq last\clean data\01.xlsx", '1') #read sheet '1' of excel file
stock2=pd.read_excel(r"C:\Users\Кир\Desktop\Uni\Master\Nasdaq\Financials 11.05\Nasdaq last\clean data\01.xlsx", '2') #read second sheet of excel file
stock2['stockreturn']=np.log(stock2.AdjCloseStock / stock2.AdjCloseStock.shift(1)) #stock ln return
stock2['SP500return']=np.log(stock2.AdjCloseSP500 / stock2.AdjCloseSP500.shift(1)) #SP500 ln return
stock2 = stock2.iloc[1:] #delete first row in dataframe
betas = pd.DataFrame()
for i in range(0, (len(stock2.AdjCloseStock)//52)-1):
    betas = betas.append(stock2.stockreturn.iloc[i*52:(i+1)*52].cov(stock2.SP500return.iloc[i*52:(i+1)*52])
                         / stock2.SP500return.iloc[i*52:(i+1)*52].cov(stock2.SP500return.iloc[i*52:(i+1)*52]))
My data is weekly stock and S&P 500 index returns over 33 years, so the output should have 33 betas.
I tried simplifying your code and creating an example. I think the problem is that your calculation returns a float. You want to make it a pd.Series. DataFrame.append takes:
DataFrame or Series/dict-like object, or list of these
np.random.seed(20)
df = pd.DataFrame(np.random.randn(33*53, 2),
columns=['a', 'b'])
betas = pd.DataFrame()
for year in range(len(df['a'])//52 -1):
    # Take some data
    in_slice = pd.IndexSlice[year*52:(year+1)*52]
    numerator = df['a'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    denominator = df['b'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    # Do some calculations and create a pd.Series from the result
    data = pd.Series(numerator / denominator, name=year)
    # Append to the DataFrame
    betas = betas.append(data)
betas.index.name = 'years'
betas.columns = ['beta']
betas.head():

           beta
years
0      0.107669
1     -0.009302
2     -0.063200
3      0.025681
4     -0.000813
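As a side note, DataFrame.append is deprecated and was removed in pandas 2.0. An equivalent sketch (assuming the same 52-week slicing) collects the betas in a plain list and builds the frame once at the end:

betas_list = []
for year in range(len(df['a'])//52 - 1):
    in_slice = pd.IndexSlice[year*52:(year+1)*52]
    numerator = df['a'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    denominator = df['b'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    betas_list.append(numerator / denominator)  # a plain float is fine here

betas = pd.DataFrame({'beta': betas_list})
betas.index.name = 'years'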
I have several .txt files with 140k+ lines each. They all contain three types of rows, which are a mix of strings and floats:
- 7 col
- 14 col
- 18 col
What is the best and fastest way to parse such data?
I tried to use numpy.genfromtxt with usecols=np.arange(0,7), but that obviously cuts out the 14- and 18-column data.
# for 7 col data
load = np.genfromtxt(filename, dtype=None, names=('day', 'tod', 'condition', 'code', 'type', 'state', 'timing'), usecols=np.arange(0,7))
I would like to parse the data as efficiently as possible.
The solution is rather simple and intuitive. We check whether the number of columns in each row equals one of the specified counts and append the row to the matching array. For better analysis/modification of the data, we can then convert each array to a pandas DataFrame or NumPy array as desired; below I show conversion to DataFrame. The numbers of columns in my dataset are 7, 14 and 18. I want my data labeled, so I can pass an array of labels to pandas' columns parameter.
import pandas as pd

filename = "textfile.txt"

labels_array1 = []  # fill with the 7 column labels
labels_array2 = []  # fill with the 14 column labels
labels_array3 = []  # fill with the 18 column labels

array1 = []  # lines with 7 columns
array2 = []  # lines with 14 columns
array3 = []  # lines with 18 columns

with open(filename, "r") as f:
    lines = f.readlines()
    for line in lines:
        num_items = len(line.split())
        if num_items == 7:
            array1.append(line.rstrip())
        elif num_items == 14:
            array2.append(line.rstrip())
        elif num_items == 18:
            array3.append(line.rstrip())
        else:
            print("Detected a line with a different number of columns:", num_items)

df1 = pd.DataFrame([sub.split() for sub in array1], columns=labels_array1)
df2 = pd.DataFrame([sub.split() for sub in array2], columns=labels_array2)
df3 = pd.DataFrame([sub.split() for sub in array3], columns=labels_array3)
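For instance, the 7-column labels could be the field names from the question's genfromtxt call (an assumption that they correspond to the 7-column lines):

labels_array1 = ['day', 'tod', 'condition', 'code', 'type', 'state', 'timing']

# Everything is read in as strings, so numeric columns can be converted afterwards,
# e.g. for a hypothetical numeric 'timing' column:
# df1['timing'] = pd.to_numeric(df1['timing'])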