I have the following issue. I am building an offline Access tool, and I often have to create an Access crosstab query with an undefined number of columns whose fields need manipulation beyond what Access can handle (because of the complexity of the operations or the size of the data). So I import the query as a pandas dataframe, manipulate it, and want to pass it back. The rows are names under the AGR_NAME header and the columns are the various licence types. The first field/column is therefore TEXT and the remaining columns are INTEGERS. The column headings are e.g. {52, CN, 55, XX, 71, PR}, but the fields are populated with percentages or integers.
How can I import the headers (an undefined number of them) as they come from the SQL query and pass them back to an Access database table together with the transformed values?
The issue is that the names and the number of columns can vary. Only the first column is TEXT; the others hold INTEGERS (the values, not the headers).
Basically, I need to create a new table from the headers of the dataframe, which can vary in number, and populate it with the transformed data.
LIC_TYPE CE 55 52 XE
AGR_NAME 1 0 1 2
XY 1 1 0 4
XZ 12 3 1 45
XX 44 5 7 8
ZZ 0 0 1 0
The code I have so far is:
import pyodbc
import pandas
import os
import sys
import struct

print("running as {0}-bit".format(struct.calcsize("P") * 8))

# List the available ODBC data sources and drivers
sources = pyodbc.dataSources()
dsns = list(sources.keys())
dsns.sort()
sl = []
for dsn in dsns:
    sl.append('%s [%s]' % (dsn, sources[dsn]))
print('\n'.join(sl))
print(pyodbc.drivers())

# Locate the database in the same directory as this script
try:
    currdir = os.path.dirname(os.path.abspath(__file__))
except NameError:
    currdir = os.path.abspath(os.path.dirname(sys.argv[0]))
DBfile = os.path.join(currdir, 'UNION.accdb')
cnxn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=%s;' % DBfile)

# Read the crosstab and divide each row by its row total
sql = "Select * FROM pivooo"
df = pandas.read_sql(sql, cnxn)
df = df.set_index('AGR_NAME')
res = df.div(df.sum(axis=1), axis=0)
pandas.options.display.float_format = '{:.2f}%'.format
print(res.reset_index())
Any ideas?
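One possible way to handle the varying headers is to generate the CREATE TABLE and INSERT statements from whatever columns the dataframe happens to have. The sketch below is only an illustration under stated assumptions: the helper name build_table_sql and the target table name pivooo_out are made up here, the Access types TEXT(255) and DOUBLE are assumed to suit the data, and column names are assumed not to contain a closing square bracket. The small sample frame stands in for the result of the query.

```python
import pandas as pd


def build_table_sql(df, table, text_type="TEXT(255)", num_type="DOUBLE"):
    """Build CREATE TABLE / INSERT statements from the frame's own columns.

    Hypothetical helper: numeric columns get num_type, everything else
    text_type. Names are bracket-quoted so headings such as 55 are accepted.
    """
    cols = [str(c) for c in df.columns]
    col_defs = []
    for name, dtype in zip(cols, df.dtypes):
        sql_type = num_type if pd.api.types.is_numeric_dtype(dtype) else text_type
        col_defs.append(f"[{name}] {sql_type}")
    create_sql = f"CREATE TABLE [{table}] ({', '.join(col_defs)})"
    col_list = ", ".join(f"[{c}]" for c in cols)
    placeholders = ", ".join("?" for _ in cols)
    insert_sql = f"INSERT INTO [{table}] ({col_list}) VALUES ({placeholders})"
    return create_sql, insert_sql


# Sample data standing in for the crosstab read from Access
df = pd.DataFrame(
    {"AGR_NAME": ["XY", "XZ"], "CE": [1, 12], "55": [1, 3], "52": [0, 1], "XE": [4, 45]}
).set_index("AGR_NAME")
res = df.div(df.sum(axis=1), axis=0).reset_index()

create_sql, insert_sql = build_table_sql(res, "pivooo_out")
# One plain tuple per row, in column order, ready for parameter binding
rows = list(res.itertuples(index=False, name=None))

print(create_sql)
print(insert_sql)

# With the pyodbc connection from the question, the statements could then
# be run roughly like this (not executed here):
#   crsr = cnxn.cursor()
#   crsr.execute(create_sql)
#   crsr.executemany(insert_sql, rows)
#   cnxn.commit()
```

Because the statements are derived from res.columns, the same code works however many licence-type columns the crosstab returns.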
I have a generated CSV file that:
- doesn't have a header row
- has headers and data alternating in every row (the headers do not change from row to row).
E.g.:
imageId,0,feat1,30,feat2,34,feat,90
imageId,1,feat1,0,feat2,4,feat,89
imageId,2,feat1,3,feat2,3,feat,80
IMO, this format is redundant and cumbersome (I don't see why anyone would generate files in this format). The saner, more usual CSV of the same data (which I can read directly with pd.read_csv()) would be:
imageId,feat1,feat2,feat
0,30,34,90
1,0,4,89
2,3,3,80
My question is, how do I read the original data into a pd dataframe? For now, I do a read_csv and then drop all alternate columns:
df = pd.read_csv(file, header=None)
df = df[range(1, len(df.columns), 2)]
Problem with this is I don't get the headers, unless I make it a point to specify them.
Is there a simpler way of telling pandas that the format has data and headers in every row?
Select the data columns by position with DataFrame.iloc, then set the new column names from the label values in the first row (this assumes each label column holds the same value in every row, as in the sample data):
#default headers
df = pd.read_csv(file, header=None)
df1 = df.iloc[:, 1::2]
df1.columns = df.iloc[0, ::2].tolist()
print (df1)
imageId feat1 feat2 feat
0 0 30 34 90
1 1 0 4 89
2 2 3 3 80
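Since this approach relies on the label columns being identical in every row, it may be worth checking that before discarding them. A minimal sketch, using the sample rows from the question read from an in-memory string (the variable names raw, labels and data are only illustrative):

```python
import io

import pandas as pd

raw = (
    "imageId,0,feat1,30,feat2,34,feat,90\n"
    "imageId,1,feat1,0,feat2,4,feat,89\n"
    "imageId,2,feat1,3,feat2,3,feat,80\n"
)
df = pd.read_csv(io.StringIO(raw), header=None)

# Even positions hold the repeated labels, odd positions hold the data
labels = df.iloc[:, ::2]
data = df.iloc[:, 1::2].copy()

# Every label column should contain exactly one distinct value
labels_constant = bool((labels.nunique() == 1).all())
if not labels_constant:
    raise ValueError("label columns differ between rows")

data.columns = labels.iloc[0].tolist()
print(data)
```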
I didn't measure but I would expect that it could be a problem to read the entire file (redundant headers and actual data) before filtering for the interesting stuff. So I tried to exploit the optional parameters nrows and usecols to (hopefully) limit the amount of memory needed to process the CSV input file.
# --- Utilities for generating test data ---
import random as rd

def write_csv(file, line_count=100):
    with open(file, 'w') as f:
        r = lambda: rd.randrange(100)
        for i in range(line_count):
            line = f"imageId,{i},feat1,{r()},feat2,{r()},feat,{r()}\n"
            f.write(line)

file = 'text.csv'
# Generate a small CSV test file
write_csv(file, 10)

# --- Actual answer ---
import pandas as pd
# Read columns of the first row
dfi = pd.read_csv(file, header=None, nrows=1)
ncols = dfi.size
# Read data columns
dfd = pd.read_csv(file, header=None, usecols=range(1, ncols, 2))
dfd.columns = dfi.iloc[0, ::2].to_list()
print(dfd)
I am working with a large pandas dataframe where I have created a new empty column. What I want to do is iterate over every value within a specific column of the dataframe, do a Boolean check, and then assign a value to the new column based on the result of that check.
I would think I need to use a for loop to check the individual contents of each cell in my specified column. The problem is that I can't figure out the correct syntax for a for loop that checks values in a specific column. This is what I have so far.
call_info['% of Net Capital'] = call_info['Call Amount'] / call_info['Net Capital']
for (ColumnData) in call_info['Call Amount']:
    columnSeriesObj = call_info[ColumnData]
    if columnSeriesObj.any - call_info['Excess Deficit'].any > 0:
        call_info['Sufficient Excess?'][ColumnData] = True
    else:
        call_info['Sufficient Excess?'][ColumnData] = False
I get a KeyError : 38749372
call_info is a pandas dataframe. I am trying to compare call_info['Call Amount'] against call_info['Excess Deficit'] and put a True or False value in call_info['Sufficient Excess?'].
Updated to include an example of my dataframe and the expected output:
This is a snip of a larger csv file:
I have loaded the data from this CSV file using openpyxl load_workbook
From there, I converted the data into a Pandas Dataframe using the following code :
from itertools import islice
data = sheet_ranges.values
cols = next(data)[1:]
data = list(data)
idx = [r[0] for r in data]
data = (islice(r, 1, None) for r in data)
df = pd.DataFrame(data, index=idx, columns=cols)
An example of the expected output is a column within the dataframe that looks like this:
I've been able to do this in Excel, but I am looking to automate the process
I made some demo data, which hopefully represents the problem.
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randint(1000, size = [20, 2]), columns = ['call_amount', 'excess_deficit'])
Then you can use the following code to get the result you're looking for.
df['sufficient_excess'] = (df['call_amount'] - df['excess_deficit']) > 0
which gives
call_amount excess_deficit sufficient_excess
0 684 559 True
1 629 192 True
2 835 763 True
3 707 359 True
4 9 723 False
5 277 754 False
6 804 599 True
7 70 472 False
8 600 396 True
9 314 705 False
If you need the result to show Yes instead of True, let me know.
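For the Yes/No variant, one option is numpy.where on the same comparison. This is a small sketch on a few of the demo rows above, keeping the direction of the comparison (call amount minus excess deficit greater than zero) exactly as in the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"call_amount": [684, 9, 600], "excess_deficit": [559, 723, 396]}
)

# Same test as (call_amount - excess_deficit) > 0, mapped to labels
df["sufficient_excess"] = np.where(
    df["call_amount"] - df["excess_deficit"] > 0, "Yes", "No"
)
print(df)
```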
I wrote some code, but it is very slow. The goal is to look for matches; they don't have to be one-to-one matches.
I have a data frame with about 3,600,000 entries --> "SingleDff"
I have a data frame with about 110,000 entries --> "dfnumbers"
The code tries to find out which of these 110,000 entries appear among the 3,600,000 entries.
I added a counter to see how "fast" it is. After 24 hours it had only processed 11,000 entries, about 10% of the total.
I am now looking for ways and/or ideas to improve the performance of the code.
The Code:
import os
import glob
import numpy as np
import pandas as pd
#Preparation
pathfiles = 'C:\\Python\\Data\\Input\\'
df_Files = glob.glob(pathfiles + "*.csv")
df_Files = [pd.read_csv(f, encoding='utf-8', sep=';', low_memory=False) for f in df_Files]
SingleDff = pd.concat(df_Files, ignore_index=True, sort=True)
dfnumbers = pd.read_excel('C:\\Python\\Data\\Input\\UniqueNumbers.xlsx')
#Output
outputDf = pd.DataFrame()
SingleDff['isRelevant'] = np.nan
count = 0
max = len(dfnumbers['Korrigierter Wert'])
arrayVal = dfnumbers['Korrigierter Wert']
for txt in arrayVal:
    outputDf = outputDf.append(SingleDff[SingleDff['name'].str.contains(txt)], ignore_index = True)
    outputDf['isRelevant'] = np.where(outputDf['isRelevant'].isnull(),txt,outputDf['isRelevant'])
    count += 1
outputDf.to_csv('output_match.csv')
Edit:
Example of Data
In the 110'000 Data Frame I have something like this:
ABCD-12345-1245-T1
ACDB-98765-001 AHHX800.0-3
In the huge data frame I have entries like:
AHSG200-B0097小样图.dwg
MUDI-070097-0-05-00.dwg
ABCD-12345-1245.xlsx
ABCD-12345-1245.pdf
ABCD-12345.xlsx
Now I am trying to find matches, i.e. for which numbers we can find documents.
Thank you for your input.
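Two costs in the loop above can be reduced without changing what it matches. Each call to append copies the whole output frame, and str.contains interprets the search text as a regular expression by default. The sketch below, on made-up sample data, collects the hits in a list, concatenates once, and uses plain substring matching (regex=False). It is an assumption that plain substring matching is what is wanted, and the sketch still makes one pass over the names per search term, so it narrows the problem rather than solving it.

```python
import pandas as pd

# Small stand-ins for SingleDff and dfnumbers['Korrigierter Wert']
big = pd.DataFrame(
    {"name": ["ABCD-12345-1245.xlsx", "MUDI-070097-0-05-00.dwg", "ABCD-12345.xlsx"]}
)
terms = ["ABCD-12345-1245", "MUDI-070097", "ZZZ-000"]

pieces = []
for txt in terms:
    # regex=False: treat txt as literal text, not as a pattern
    hit = big.loc[big["name"].str.contains(txt, regex=False)].copy()
    if not hit.empty:
        hit["isRelevant"] = txt  # remember which term matched
        pieces.append(hit)

# Concatenate once at the end instead of appending inside the loop
out = pd.concat(pieces, ignore_index=True) if pieces else big.iloc[0:0]
print(out)
```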
I am trying to calculate 33 stock betas and write them to a dataframe.
Unfortunately, my code raises an error:
cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
import pandas as pd
import numpy as np
stock1=pd.read_excel(r"C:\Users\Кир\Desktop\Uni\Master\Nasdaq\Financials 11.05\Nasdaq last\clean data\01.xlsx", '1') #read the sheet named '1' of the excel file
stock2=pd.read_excel(r"C:\Users\Кир\Desktop\Uni\Master\Nasdaq\Financials 11.05\Nasdaq last\clean data\01.xlsx", '2') #read the sheet named '2' of the excel file
stock2['stockreturn']=np.log(stock2.AdjCloseStock / stock2.AdjCloseStock.shift(1)) #stock ln return
stock2['SP500return']=np.log(stock2.AdjCloseSP500 / stock2.AdjCloseSP500.shift(1)) #SP500 ln return
stock2 = stock2.iloc[1:] #delete first row in dataframe
betas = pd.DataFrame()
for i in range(0,(len(stock2.AdjCloseStock)//52)-1):
    betas = betas.append(stock2.stockreturn.iloc[i*52:(i+1)*52].cov(stock2.SP500return.iloc[i*52:(i+1)*52])/stock2.SP500return.iloc[i*52:(i+1)*52].cov(stock2.SP500return.iloc[i*52:(i+1)*52]))
My data consists of weekly stock and S&P index returns for 33 years, so the output should have 33 betas.
I tried simplifying your code and creating an example. I think the problem is that your calculation returns a float. You want to make it a pd.Series. DataFrame.append takes:
DataFrame or Series/dict-like object, or list of these
np.random.seed(20)
df = pd.DataFrame(np.random.randn(33*53, 2),
                  columns=['a', 'b'])
betas = pd.DataFrame()
for year in range(len(df['a'])//52 -1):
    # Take some data
    in_slice = pd.IndexSlice[year*52:(year+1)*52]
    numerator = df['a'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    denominator = df['b'].iloc[in_slice].cov(df['b'].iloc[in_slice])
    # Do some calculations and create a pd.Series from the result
    data = pd.Series(numerator / denominator, name = year)
    # Append to the DataFrame
    betas = betas.append(data)
betas.index.name = 'years'
betas.columns = ['beta']
betas.head():
beta
years
0 0.107669
1 -0.009302
2 -0.063200
3 0.025681
4 -0.000813
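Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current installation the loop above needs a different way of collecting results. A sketch of one alternative is to gather plain records in a list and build the DataFrame once. The random data, the 52-row window, and the choice to use every full window (which gives 33 betas for 33 * 52 rows) are assumptions made for this illustration, not taken from the original data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(20)
df = pd.DataFrame(rng.standard_normal((33 * 52, 2)), columns=["a", "b"])

window = 52  # assumed number of weekly observations per year
records = []
for year in range(len(df) // window):
    chunk = df.iloc[year * window:(year + 1) * window]
    # beta = cov(a, b) / var(b); both use the sample (ddof=1) estimator
    beta = chunk["a"].cov(chunk["b"]) / chunk["b"].var()
    records.append({"years": year, "beta": beta})

# Build the result in one step instead of appending row by row
betas = pd.DataFrame(records).set_index("years")
print(betas.head())
```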
I have a large CSV file (3000*20000) of data without headers, to which I added one column to represent the classes. How can I fit the data to a model when the features have no headers and headers cannot be added manually because of the large number of columns?
Is there a way to automatically iterate over each column in a row?
When I had a small file of 4 columns, I used the following code:
import pandas as pd
pd = pd.ExcelFile("bcs.xlsx")
col = [0, 1, 2, 3]
data = pd.parse(pd.sheet_names[0], parse_cols = col)
pdc = list(data["pdc"])
pds = list(data["pds"])
pdsh = list(data["pdsh"])
pd_class = list(data["class"])
features = []
for i in range(len(pdc)):
    features.append([pdc[i],pds[i],pdsh[i]])
labels = []
labels = pd_class
But with a 3000 by 20000 file I don't know how to identify the features and the labels/target.
Let's say you have a CSV like this:
1,2,3,4,0
1,2,3,4,1
1,2,3,4,1
1,2,3,4,0
where the first 4 columns are features and the last one is the label or class you want. You can read the file with pandas.read_csv and create one dataframe for your features and one for your labels, which you can then fit to your model.
import pandas as pd
#CSV localPath
mypath ='C:\\...'
#The names of the columns you want to have in your dataframe
colNames = ['Feature1','Feature2','Feature3','Feature4','class']
#Read the data as dataframe
df = pd.read_csv(filepath_or_buffer = mypath,
                 names = colNames , sep = ',' , header = None)
#Get the first four columns as features (.ix was removed from pandas; .iloc selects by position)
features = df.iloc[:,:4]
#and last columns as label
labels = df['class']
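With 20000 columns, typing out the column names is not practical, and it is not needed either: with header=None pandas numbers the columns itself, and the split can be done purely by position. The sketch below assumes the class column is the last one in the file; if it was added as the first column, the slices would be swapped accordingly. The sample data is read from an in-memory string for illustration.

```python
import io

import pandas as pd

raw = "1,2,3,4,0\n1,2,3,4,1\n1,2,3,4,1\n1,2,3,4,0\n"

# header=None: pandas labels the columns 0, 1, 2, ... automatically
df = pd.read_csv(io.StringIO(raw), header=None)

features = df.iloc[:, :-1]  # every column except the last
labels = df.iloc[:, -1]     # the last column (assumed to be the class)

print(features.shape, labels.tolist())
```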