iterate to perform calculations on a certain column in a pandas dataframe - python-3.x

I am working with a large pandas dataframe where I have created a new empty column. What I want to do is iterate over every value within a specific column of the dataframe, do a Boolean check, and then assign a value to the new column based on the outcome of that check.
I would think I need to use a for loop to check the individual contents of each cell in my specified column. The problem is that I can't seem to figure out the correct syntax for a for loop that checks values in a specific column. This is what I have so far.
call_info['% of Net Capital'] = call_info['Call Amount'] / call_info['Net Capital']
for (ColumnData) in call_info['Call Amount']:
    columnSeriesObj = call_info[ColumnData]
    if columnSeriesObj.any - call_info['Excess Deficit'].any > 0:
        call_info['Sufficient Excess?'][ColumnData] = True
    else:
        call_info['Sufficient Excess?'][ColumnData] = False
I get a KeyError: 38749372
call_info is a pandas dataframe. I am trying to compare call_info['Call Amount'] against call_info['Excess Deficit'] and put a True or False value in call_info['Sufficient Excess?'].
Updated to include an example of my dataframe and the expected output.
This is a snip of a larger csv file:
I have loaded the data from this CSV file using openpyxl load_workbook
From there, I converted the data into a pandas DataFrame using the following code:
from itertools import islice
data = sheet_ranges.values
cols = next(data)[1:]
data = list(data)
idx = [r[0] for r in data]
data = (islice(r, 1, None) for r in data)
df = pd.DataFrame(data, index=idx, columns=cols)
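(As a side note, this openpyxl round trip can usually be replaced by a single pandas call; a minimal sketch, where the file name "calls.xlsx", the sheet name "Sheet1", and using the first column as the index are assumptions about the workbook, not things taken from the question:)
import pandas as pd

# read_excel parses the .xlsx itself; "calls.xlsx", "Sheet1" and index_col=0
# are placeholders for the actual file, sheet name and index column
df = pd.read_excel("calls.xlsx", sheet_name="Sheet1", index_col=0)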
An example of the expected output is a column within the dataframe that looks like this:
I've been able to do this in Excel, but I am looking to automate the process

I made some demo data, which hopefully represents the problem.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1000, size = [20, 2]), columns = ['call_amount', 'excess_deficit'])
Then you can use the following code to get the result you're looking for.
df['sufficient_excess'] = (df['call_amount'] - df['excess_deficit']) > 0
which gives
call_amount excess_deficit sufficient_excess
0 684 559 True
1 629 192 True
2 835 763 True
3 707 359 True
4 9 723 False
5 277 754 False
6 804 599 True
7 70 472 False
8 600 396 True
9 314 705 False
If you need the result changed to have Yes instead of True, let me know.
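For completeness, a minimal sketch of that variant, mapping the booleans to strings (the 'Yes'/'No' wording is just an assumption about the labels you want):
# replace the boolean flags with text labels
df['sufficient_excess'] = df['sufficient_excess'].map({True: 'Yes', False: 'No'})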

Related

pandas read_csv with data and headers in alternate columns

I have a generated CSV file that:
- doesn't have headers
- has header and data occurring alternately in every row (the headers do not change from row to row)
E.g.:
imageId,0,feat1,30,feat2,34,feat,90
imageId,1,feat1,0,feat2,4,feat,89
imageId,2,feat1,3,feat2,3,feat,80
IMO, this format is redundant and cumbersome (I don't see why anyone would generate files in this format). The saner/normal CSV of the same data, which I can read directly using pd.read_csv(), would be:
imageId,feat1,feat2,feat
0,30,34,90
1,0,4,89
2,3,3,80
My question is, how do I read the original data into a pd dataframe? For now, I do a read_csv and then drop all alternate columns:
df = pd.read_csv(file, header=None)
df = df[range(1, len(df.columns), 2)]
Problem with this is I don't get the headers, unless I make it a point to specify them.
Is there a simpler way of telling pandas that the format has data and headers in every row?
Select the data columns by position with DataFrame.iloc and set the new column names from the header values in the first row (assuming the header columns hold the same values in every row, as in the sample data):
# default headers
df = pd.read_csv(file, header=None)

df1 = df.iloc[:, 1::2]
df1.columns = df.iloc[0, ::2].tolist()
print(df1)
imageId feat1 feat2 feat
0 0 30 34 90
1 1 0 4 89
2 2 3 3 80
I didn't measure it, but I would expect reading the entire file (redundant headers plus actual data) before filtering out the interesting columns to be a problem. So I tried to exploit the optional parameters nrows and usecols to (hopefully) limit the amount of memory needed to process the CSV input file.
# --- Utilities for generating test data ---
import random as rd

def write_csv(file, line_count=100):
    with open(file, 'w') as f:
        r = lambda: rd.randrange(100)
        for i in range(line_count):
            line = f"imageId,{i},feat1,{r()},feat2,{r()},feat,{r()}\n"
            f.write(line)
file = 'text.csv'
# Generate a small CSV test file
write_csv(file, 10)
# --- Actual answer ---
import pandas as pd
# Read columns of the first row
dfi = pd.read_csv(file, header=None, nrows=1)
ncols = dfi.size
# Read data columns
dfd = pd.read_csv(file, header=None, usecols=range(1, ncols, 2))
dfd.columns = dfi.iloc[0, ::2].to_list()
print(dfd)

How to expand the dataframe based on the column values?

I have this dataframe:
utc arc_time_s tec_tecu elevation_deg lat_e_deg lon_e_deg
01.01.2018 01:19 54 3.856 17.35 57.44 25.02
01.01.2018 01:19 53 4.021 17.29 57.47 25.03
01.01.2018 01:19 52 4.029 17.22 57.51 25.05
01.01.2018 01:19 51 4.015 17.15 57.54 25.07
01.01.2018 01:19 50 3.997 17.08 57.57 25.09
What I want is to expand the dataframe based on the lat_e_deg column so that it contains every value at a 0.01 step (two decimal places).
I found the resample method, but it seems it can only be used on a datetime column.
So as output I want something like this:
How can I do this?
import pandas as pd
import numpy as np

# reconstruct part of your DataFrame for testing purposes:
df = pd.DataFrame([[17.35, 57.44], [17.29, 57.47], [17.22, 57.51]],
                  columns=['elevation_deg', 'lat_e_deg'])

# create a Series of the desired stepwise values
# (rounded to 2 decimals so the float keys match exactly in the merge):
lat_e_deg_expanded = pd.Series(np.arange(start=min(df['lat_e_deg']),
                                         stop=max(df['lat_e_deg']),
                                         step=0.01).round(2),
                               name='lat_e_deg')

# merge the expanded series with the original DataFrame and sort:
df_expanded = pd.merge(df, lat_e_deg_expanded,
                       on='lat_e_deg',
                       how='outer')
df_expanded.sort_values(by='lat_e_deg', inplace=True)
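The outer merge leaves NaN in elevation_deg for the newly inserted rows; a small follow-up sketch, assuming linear interpolation between the neighbouring known values is acceptable for your data:
# fill the NaN elevation_deg values created by the merge by
# interpolating linearly between the surrounding known rows
df_expanded['elevation_deg'] = df_expanded['elevation_deg'].interpolate()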
You can create a pd.Series with step=0.01 and then join it to the original dataframe.
Example code, assuming df is a dataframe that is missing some of the decimal values:
ts = pd.Series(np.arange(start = 57.44, stop = 57.57, step=0.01), name = "t")
df = pd.DataFrame({'t': [57.44, 57.47, 57.57]})
df2 = pd.merge(ts, df, how = "left").sort_values("t")
Result:
t
0 57.44
1 57.45
2 57.46
3 57.47
4 57.48
5 57.49
6 57.50
7 57.51
8 57.52
9 57.53
10 57.54
11 57.55
12 57.56
13 57.57

How to write from loop to dataframe

I'm trying to calculate 33 stock betas and write them to a dataframe.
Unfortunately, I have an error in my code:
cannot concatenate object of type "<class 'float'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
import pandas as pd
import numpy as np
stock1=pd.read_excel(r"C:\Users\Кир\Desktop\Uni\Master\Nasdaq\Financials 11.05\Nasdaq last\clean data\01.xlsx", '1') #read second sheet of excel file
stock2=pd.read_excel(r"C:\Users\Кир\Desktop\Uni\Master\Nasdaq\Financials 11.05\Nasdaq last\clean data\01.xlsx", '2') #read second sheet of excel file
stock2['stockreturn']=np.log(stock2.AdjCloseStock / stock2.AdjCloseStock.shift(1)) #stock ln return
stock2['SP500return']=np.log(stock2.AdjCloseSP500 / stock2.AdjCloseSP500.shift(1)) #SP500 ln return
stock2 = stock2.iloc[1:] #delete first row in dataframe
betas = pd.DataFrame()
for i in range(0, (len(stock2.AdjCloseStock) // 52) - 1):
    betas = betas.append(
        stock2.stockreturn.iloc[i*52:(i+1)*52].cov(stock2.SP500return.iloc[i*52:(i+1)*52])
        / stock2.SP500return.iloc[i*52:(i+1)*52].cov(stock2.SP500return.iloc[i*52:(i+1)*52])
    )
My data consists of weekly stock and S&P 500 index returns over 33 years, so the output should have 33 betas.
I tried simplifying your code and creating an example. I think the problem is that your calculation returns a float. You want to make it a pd.Series. DataFrame.append takes:
DataFrame or Series/dict-like object, or list of these
import numpy as np
import pandas as pd

np.random.seed(20)
df = pd.DataFrame(np.random.randn(33*53, 2),
                  columns=['a', 'b'])

betas = pd.DataFrame()
for year in range(len(df['a'])//52 - 1):
    # Take some data
    in_slice = pd.IndexSlice[year*52:(year+1)*52]
    numerator = df['a'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    denominator = df['b'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    # Do some calculations and create a pd.Series from the result
    data = pd.Series(numerator / denominator, name=year)
    # Append to the DataFrame
    betas = betas.append(data)

betas.index.name = 'years'
betas.columns = ['beta']
betas.head():
beta
years
0 0.107669
1 -0.009302
2 -0.063200
3 0.025681
4 -0.000813
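On pandas 2.0 and newer, DataFrame.append has been removed, so here is a sketch of the same calculation that collects the ratios in a plain Python list and builds the frame once at the end (using the same demo data as above):
import numpy as np
import pandas as pd

# same demo data as above
np.random.seed(20)
df = pd.DataFrame(np.random.randn(33 * 53, 2), columns=['a', 'b'])

ratios = []
for year in range(len(df['a']) // 52 - 1):
    in_slice = pd.IndexSlice[year * 52:(year + 1) * 52]
    numerator = df['a'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    denominator = df['b'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    ratios.append(numerator / denominator)

# build the DataFrame in one go instead of appending row by row
betas = pd.DataFrame({'beta': ratios})
betas.index.name = 'years'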

Get pandas dataframe headers and create new Access table with them

I have the following issue. I am creating an offline Access tool, and I often have to create an Access CROSSTAB with an undefined number of columns that requires manipulating data in fields beyond what Access can handle (due to the complexity of the operations or the size). So I import the query as a pandas dataframe, manipulate it, and want to pass it back. I have rows of names under the AGR_NAME header and various licence types as the other columns. Therefore the first field/column is TEXT and the remaining columns are INTEGERS. The column headings are e.g. {52, CN, 55, XX, 71, PR}, but the fields are populated with percentages or integers.
How do I import the headers (undefined in number) as in the SQL query and pass them back to an Access database table with the transformed values?
The issue is that the name and number of columns can vary. Only the first column is TEXT and the others are INTEGERS (this describes the fields, not the headers).
Basically I need to create the new table from the headers of the dataframe, which can vary in number, and populate it with the transformed data.
LIC_TYPE CE 55 52 XE
AGR_NAME 1 0 1 2
XY 1 1 0 4
XZ 12 3 1 45
XX 44 5 7 8
ZZ 0 0 1 0
The code I have so far is:
import pyodbc
import pandas
import os
import sys
import struct

print("running as {0}-bit".format(struct.calcsize("P") * 8))

sources = pyodbc.dataSources()
dsns = list(sources.keys())
dsns.sort()
sl = []
for dsn in dsns:
    sl.append('%s [%s]' % (dsn, sources[dsn]))
print('\n'.join(sl))
print(pyodbc.drivers())

try:
    currdir = os.path.abspath(__file__)
except NameError:
    import sys
    currdir = os.path.abspath(os.path.dirname(sys.argv[0]))

DBfile = os.path.join(currdir, 'UNION.accdb')
cnxn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=%s;' % DBfile)

sql = "Select * FROM pivooo"
df = pandas.read_sql(sql, cnxn)
df = df.set_index('AGR_NAME')
res = df.div(df.sum(axis=1), axis=0)
pandas.options.display.float_format = '{:.2f}%'.format
print(res.reset_index())
Any ideas?
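One direction worth sketching (untested against Access; the table name new_pivot and using DOUBLE for every non-text column are assumptions): build the CREATE TABLE statement from the dataframe's columns, then insert the transformed rows with executemany.
# rough sketch: derive the DDL from the (variable) dataframe columns
out = res.reset_index()
col_defs = ['[AGR_NAME] TEXT'] + ['[%s] DOUBLE' % c for c in out.columns[1:]]
cursor = cnxn.cursor()
cursor.execute('CREATE TABLE new_pivot (%s)' % ', '.join(col_defs))

# insert every row of the transformed dataframe
placeholders = ', '.join('?' * len(out.columns))
cursor.executemany('INSERT INTO new_pivot VALUES (%s)' % placeholders,
                   list(out.itertuples(index=False, name=None)))
cnxn.commit()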

Parsing data with variable number of columns

I have several .txt files with 140k+ lines each. They all have three types of data, which are a mix of string and floats:
- 7 col
- 14 col
- 18 col
What is the best and fastest way to parse such data?
I tried to use numpy.genfromtxt with usecols=np.arange(0,7), but that obviously cuts out the 14- and 18-column data.
# for 7 col data
load = np.genfromtxt(filename, dtype=None, names=('day', 'tod', 'condition', 'code', 'type', 'state', 'timing'), usecols=np.arange(0,7))
I would like to parse the data as efficiently as possible.
The solution is rather simple and intuitive. We check whether the number of columns in each row equals one of the specified counts and append the row to the matching array. For better analysis/modification of the data, we can then convert each array to a pandas DataFrame (or a NumPy array) as desired; below I show conversion to a DataFrame. The numbers of columns in my dataset are 7, 14 and 18. I want my data labeled, so I use pandas' columns argument to apply labels from an array.
import pandas as pd

filename = "textfile.txt"

# fill these lists with your column names (7, 14 and 18 labels respectively)
# before building the DataFrames below
labels_array1 = []  # 7 labels
labels_array2 = []  # 14 labels
labels_array3 = []  # 18 labels

# raw lines collected per format
array1, array2, array3 = [], [], []

with open(filename, "r") as f:
    lines = f.readlines()
    for line in lines:
        num_items = len(line.split())
        if num_items == 7:
            array1.append(line.rstrip())
        elif num_items == 14:
            array2.append(line.rstrip())
        elif num_items == 18:
            array3.append(line.rstrip())
        else:
            print("Detected a line with a different number of columns.", num_items)

df1 = pd.DataFrame([sub.split() for sub in array1], columns=labels_array1)
df2 = pd.DataFrame([sub.split() for sub in array2], columns=labels_array2)
df3 = pd.DataFrame([sub.split() for sub in array3], columns=labels_array3)
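Note that because every value came from str.split, all three frames hold strings; a small follow-up sketch for casting the numeric columns, where the column names are placeholders for whichever of your labels actually hold numbers:
# 'code' and 'timing' stand in for whichever of your labels are numeric
numeric_cols = ['code', 'timing']
df1[numeric_cols] = df1[numeric_cols].apply(pd.to_numeric)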
