Python Pandas dataframe, how to integrate new columns into a new csv - python-3.x

Guys, I need a bit of help with Pandas and would greatly appreciate your input.
My original file looks like this:
I would like to convert it by merging some pairs of columns (taking their averages), producing a new file that looks like this:
Also, if possible, I would like to split the column 'RateDateTime' into two columns, one containing the date and the other only the time. How should I do it? I tried the code below, but it doesn't work:
import pandas as pd
dateparse = lambda x: pd.datetime.strptime(x, '%Y/%m/%d %H:%M:%S')
df = pd.read_csv('data.csv', parse_dates=['RateDateTime'], index_col='RateDateTime',date_parser=dateparse)
a=pd.to_numeric(df['RateAsk_open'])
b=pd.to_numeric(df['RateAsk_high'])
c=pd.to_numeric(df['RateAsk_low'])
d=pd.to_numeric(df['RateAsk_close'])
e=pd.to_numeric(df['RateBid_open'])
f=pd.to_numeric(df['RateBid_high'])
g=pd.to_numeric(df['RateBid_low'])
h=pd.to_numeric(df['RateBid_close'])
df['Open'] = (a + e) / 2
df['High'] = (b + f) / 2
df['Low'] = (c + g) / 2
df['Close'] = (d + h) / 2
grouped = df.groupby('CurrencyPair')
Open=grouped['Open']
High=grouped['High']
Low=grouped['Low']
Close=grouped['Close']
w=pd.concat([Open, High,Low,Close], axis=1, keys=['Open', 'High','Low','Close'])
w.to_csv('w.csv')
Python returns:
TypeError: cannot concatenate object of type "<class 'pandas.core.groupby.groupby.SeriesGroupBy'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
Can someone help me please? Many thanks!!!

IIUYC, you don't need grouping here. You can simply add new columns to the existing dataframe and specify which columns to save to the csv file in the to_csv method. Here is an example:
df['Open'] = df[['RateAsk_open', 'RateBid_open']].mean(axis=1)
df['RateDate'] = df['RateDateTime'].dt.date
df['RateTime'] = df['RateDateTime'].dt.time
df.to_csv('w.csv', columns=['CurrencyPair', 'Open', 'RateDate', 'RateTime'])
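If you need all four averaged columns, the same `mean(axis=1)` pattern extends directly. A minimal, self-contained sketch (the sample values and timestamps below are made up to mirror the question's layout):

```python
import pandas as pd

# Hypothetical sample mirroring the question's columns
df = pd.DataFrame({
    'CurrencyPair': ['EURUSD', 'EURUSD'],
    'RateDateTime': pd.to_datetime(['2018/06/01 00:00:00', '2018/06/01 00:01:00']),
    'RateAsk_open': [1.17, 1.18], 'RateBid_open': [1.16, 1.17],
    'RateAsk_high': [1.19, 1.20], 'RateBid_high': [1.18, 1.19],
    'RateAsk_low': [1.15, 1.16], 'RateBid_low': [1.14, 1.15],
    'RateAsk_close': [1.18, 1.19], 'RateBid_close': [1.17, 1.18],
})

# Average each ask/bid pair into a single column
for name in ['open', 'high', 'low', 'close']:
    df[name.capitalize()] = df[[f'RateAsk_{name}', f'RateBid_{name}']].mean(axis=1)

# Split the timestamp into separate date and time columns
df['RateDate'] = df['RateDateTime'].dt.date
df['RateTime'] = df['RateDateTime'].dt.time

# Keep only the columns of interest in the output file
df.to_csv('w.csv', columns=['CurrencyPair', 'Open', 'High', 'Low', 'Close',
                            'RateDate', 'RateTime'], index=False)
```

Since no aggregation across rows is wanted here, there is no need for groupby at all; to_csv with columns= selects what to keep.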

Related

Any optimize way to iterate excel and provide data into pd.read_sql() as a string one by one

# here I have to apply a loop which provides the queries from Excel for the respective reports:
df1 = pd.read_sql(SQLqueryB2, con=con1)
df2 = pd.read_sql(ORCqueryC2, con=con2)
if df1.equals(df2):
    print(Report2 + " : is Pass")
Can we achieve the above by doing something like this (iterating the ndarray)?
df = pd.read_excel(path)
for col, item in df.iteritems():
Or is the only option left to read the Excel file with the "openpyxl" library and iterate over rows and columns to provide the values? I hope the question is clear; if anything is unclear, please comment.
You are trying to loop through an Excel file, run the two queries, see if they match, and output the result, correct?
import pandas as pd
from sqlalchemy import create_engine
# add user, password, host, database name
con = create_engine(f"mysql+pymysql://{USER}:{PWD}@{HOST}/{DB}")
file = pd.read_excel('excel_file.xlsx')
file['Result'] = '' # placeholder
for i, row in file.iterrows():
    df1 = pd.read_sql(row['SQLQuery'], con)
    df2 = pd.read_sql(row['Oracle Queries'], con)
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'
file.to_excel('results.xlsx', index=False)
This will save a file named results.xlsx that mirrors the original data but adds a column named Result that will be Pass or Fail.
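The loop's logic can be tried out without a MySQL server; this sketch uses the stdlib sqlite3 module as a stand-in connection, with a made-up table and query pairs for illustration:

```python
import sqlite3
import pandas as pd

# In-memory database standing in for the real MySQL/Oracle servers
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE t (x INTEGER)')
con.execute('INSERT INTO t VALUES (1), (2)')

# Hypothetical "excel" rows: each holds a pair of queries to compare
file = pd.DataFrame({
    'SQLQuery': ['SELECT x FROM t', 'SELECT x FROM t WHERE x = 1'],
    'Oracle Queries': ['SELECT x FROM t', 'SELECT x FROM t WHERE x = 2'],
})
file['Result'] = ''  # placeholder

# Run each query pair and record whether the results match
for i, row in file.iterrows():
    df1 = pd.read_sql(row['SQLQuery'], con)
    df2 = pd.read_sql(row['Oracle Queries'], con)
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'
```

Any DB-API connection or SQLAlchemy engine works the same way with pd.read_sql, so swapping sqlite3 for the MySQL engine above changes nothing in the loop itself.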

Pandas : how to consider content of certain columns as list

Let's say I have a simple pandas dataframe named df :
   0          1
0  a  [b, c, d]
I save this dataframe into a CSV file as follow :
df.to_csv("test.csv", index=False, sep="\t", encoding="utf-8")
Then later in my script I read this csv :
df = pd.read_csv("test.csv", index_col=False, sep="\t", encoding="utf-8")
Now what I want to do is use explode() on column '1', but it does not work, because the content of column '1' is no longer a list after the round trip through the CSV file.
What I have tried so far is changing the type of column '1' back to a list with astype(), without any success.
Thank you in advance.
Try this. Since you are reading from a csv file, the values in your column A ('1' in your case) are essentially strings, so you need to parse them back into lists.
import pandas as pd
import ast
df = pd.DataFrame({"A": ["['a','b']", "['c']"], "B": [1, 2]})
df["A"] = df["A"].apply(ast.literal_eval)
Now the following works!
df.explode("A")
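Alternatively, the parsing can happen at read time via the converters argument of read_csv, so the column never exists as plain strings. A sketch (the column name '1' and the tab separator are taken from the question):

```python
import ast
import io
import pandas as pd

# Stand-in for the saved test.csv file from the question
csv_text = "0\t1\na\t['b', 'c', 'd']\n"

# Parse column '1' back into real lists while reading
df = pd.read_csv(io.StringIO(csv_text), sep='\t',
                 converters={'1': ast.literal_eval})

exploded = df.explode('1')
```

This avoids the separate apply() pass over the column after loading.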

Unquoted date in first column of CSV for Python/Pandas read_csv

The incoming CSV from an American Express download looks like the sample below. (I would prefer each field to have quotes around it, but it doesn't.) Pandas is treating the quoted long number in the second CSV column as the first column of the data frame, i.e. 320193480240275508 becomes my "Date" column:
12/13/19,'320193480240275508',Alamo Rent A Car,John Doe,-12345,178.62,Travel-Vehicle Rental,DEBIT,
colnames = ['Date', 'TransNum', 'Payee', 'NotUsed4', 'NotUsed5', 'Amount', 'AmexCategory', 'DebitCredit']
df = pd.read_csv(filenameIn, names=colnames, header=0, delimiter=",")
pd.set_option('display.max_rows', 15)
pd.set_option('display.width', 200)
print (df)
print (df.values)
The output starts:
Date ... DebitCredit
12/13/19 '320193480240275508' ... NaN
I have a routine to reformat the date (to handle things like 1/3/19, and to add the century). It is called like this:
df['Date'][j] = reformatAmexDate2(df['Date'][j])
That routine begins:
def reformatAmexDate2(oldDate):
    print("oldDate=" + oldDate)
and it prints:
oldDate='320193480240275508'
I saw this post which recommended dayfirst=True and added it, but got the same result. I never even told Pandas that column 1 is a date, so I believe it should treat it as text.
IIUC, the problem seems to be names=colnames; it sets new names for the columns being read from the csv file. As you are trying to read specific columns from the csv file, you can use usecols:
df = pd.read_csv(filenameIn,usecols=colnames, header=0, delimiter=",")
Looking at the data, I hadn't noticed the comma after the last column value, i.e. the trailing comma after "DEBIT":
12/13/19,'320193480240275508',Alamo Rent A Car,John Doe,-12345,178.62,Travel-Vehicle Rental,DEBIT,
I just added another column at the end of my columns array:
colnames = ['Date', 'TransNum', 'Payee', 'NotUsed4', 'NotUsed5', 'Amount', 'AmexCategory', 'DebitCredit', 'NotUsed9']
and life is wonderful.
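The misalignment and its fix can be reproduced in isolation. A sketch with an inline copy of the sample row (note the trailing comma, which yields a ninth, empty field):

```python
import io
import pandas as pd

# The sample Amex line; the trailing comma produces 9 fields
line = ("12/13/19,'320193480240275508',Alamo Rent A Car,John Doe,"
        "-12345,178.62,Travel-Vehicle Rental,DEBIT,\n")

# Nine names to match the nine fields
colnames = ['Date', 'TransNum', 'Payee', 'NotUsed4', 'NotUsed5',
            'Amount', 'AmexCategory', 'DebitCredit', 'NotUsed9']

df = pd.read_csv(io.StringIO(line), names=colnames, header=None)
```

With only eight names, pandas treats the leftmost surplus field as the index, which is exactly why the transaction number showed up under 'Date'.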

Error when using pandas read_excel(header=[0,1])

I'm trying to use pandas read_excel to work with a file. The file has two header rows, so I'm trying to use the MultiIndex feature of the header keyword argument.
import pandas as pd, os
"""data in 2015 MOR Folder"""
filename = 'MOR-JANUARY 2015.xlsx'
print(os.path.isfile(filename))
df1 = pd.read_excel(filename, header=[0,1], sheetname='MOR')
print(df1)
The error I get is ValueError: Length of new names must be 1, got 2. The file is in this Google Drive folder: https://drive.google.com/drive/folders/0B0ynKIVAlSgidFFySWJoeFByMDQ?usp=sharing
I'm trying to follow the solution posted here
Read excel sheet with multiple header using Pandas
I could be mistaken, but I don't think pandas handles parsing Excel rows where there are merged cells. So in that first row, the merged cells get parsed as mostly empty cells. You'd need the header values nicely repeated for this to work correctly; that is what motivates the ffill below. If you could control the Excel workbook ahead of time, you might be able to use the code you have.
my solution
It's not pretty, but it'll get it done.
filename = 'MOR-JANUARY 2015.xlsx'
df1 = pd.read_excel(filename, sheetname='MOR', header=None)
vals = df1.values
mux = pd.MultiIndex.from_arrays(df1.ffill(1).values[:2, 1:], names=[None, 'DATE'])
df1 = pd.DataFrame(df1.values[2:, 1:], df1.values[2:, 0], mux)
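The ffill trick is easier to see on a small in-memory frame standing in for the raw sheet (values are made up; row 0 mimics merged header cells, where only the first cell of each merged block survives as text):

```python
import numpy as np
import pandas as pd

# Row 0 mimics merged header cells: only the first cell of each
# merged block is filled, the rest come back as NaN
raw = pd.DataFrame([
    [np.nan, 'GroupA', np.nan, 'GroupB'],
    ['DATE', 'x', 'y', 'z'],
    ['d1', 1, 2, 3],
    ['d2', 4, 5, 6],
])

# Forward-fill across columns so merged headers repeat, then build
# the MultiIndex from the first two rows (skipping the index column)
mux = pd.MultiIndex.from_arrays(raw.ffill(axis=1).values[:2, 1:],
                                names=[None, 'DATE'])

# Remaining rows become the data; column 0 becomes the index
df1 = pd.DataFrame(raw.values[2:, 1:], raw.values[2:, 0], mux)
```

After the ffill, the top header row reads ['GroupA', 'GroupA', 'GroupB'], which is what a properly repeated merged header would have looked like.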

Splitting the rows of a Dataframe using pyspark

code:
import os.path
file_name = os.path.join('databricks-datasets', 'cs190', 'data-001', 'millionsong.txt')
raw_data_df = sqlContext.read.load(file_name, 'text')
sample_points = raw_data_df.take(5)
print sample_points
Example output:
[Row(1,2,3),Row(4,5,6)]
From this output, I wanted to parse each row of the DataFrame into individual elements, using Spark's select and split methods.
For example, split "1,2,3" into ['1','2','3'].
Code:
raw_data_df.select((explode(split(raw_data_df.value,"\s+"))))
But the code doesn't seem to work as expected; any suggestions would be helpful.
Try this; since the values are comma-separated, split on "," rather than "\s+":
from pyspark.sql.functions import explode, split
raw_data_df.select(explode(split(raw_data_df.value, ",")))

Resources