Pandas df.to_parquet write() got an unexpected keyword argument 'index' when ignoring index column - python-3.x

I am trying to export a pandas dataframe to parquet format using the following:
df.to_parquet("codeset.parquet", index=False)
I don't want an index column in the parquet file. Is dropping the index handled automatically by to_parquet, or how can I get around this so that no index column is included in the exported parquet?
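For what it's worth, `to_parquet` only gained the `index` keyword in pandas 0.24 as far as I can tell (worth checking against your installed version); on older versions, or with an engine that does not accept it, the keyword gets passed through to the engine's `write()` and rejected. A minimal sketch of the usual workarounds, using a made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({"code": ["A", "B"], "value": [1, 2]})  # made-up example data

# pandas >= 0.24: the index kwarg is supported directly.
# df.to_parquet("codeset.parquet", index=False)

# Older pandas: drop the index beforehand so only real columns remain
# (whether the default RangeIndex still gets written depends on the engine):
df_flat = df.reset_index(drop=True)

# ...or go through pyarrow directly, which can skip the index on write:
# import pyarrow as pa
# import pyarrow.parquet as pq
# pq.write_table(pa.Table.from_pandas(df, preserve_index=False), "codeset.parquet")
```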

Related

How to convert header of csv into valid data of pyspark dataframe

I am reading csv data with a defined schema, but the valid data in the header row is being overwritten by the schema's column names.
I want the header row to end up as the 1st row of the dataframe.
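If the Spark reader is consuming the first row as column names, reading with `header=False` (together with your defined schema) should keep that row as data. Here is the same idea sketched in pandas with made-up data, since the mechanics are analogous:

```python
import io
import pandas as pd

csv_text = "id,name\n1,alice\n2,bob\n"  # made-up csv content

# header=0 consumes the first row as column names...
with_header = pd.read_csv(io.StringIO(csv_text), header=0)

# ...while header=None with explicit names keeps that row as data,
# analogous to spark.read.csv(path, header=False, schema=my_schema)
no_header = pd.read_csv(io.StringIO(csv_text), header=None, names=["c1", "c2"])
```

Note that if your schema declares non-string types, the header row's string values may fail to cast, so a string schema (or a post-read cast) may be needed for that first row to survive.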

Prevent pyspark/spark from transforming timestamp when creating a dataframe from a parquet file

I am reading a parquet file into a dataframe. My goal is to verify that my time data (column type in parquet: timestamp) is ISO 8601.
The dates in the time column look like this: 2021-03-13T05:34:27.100Z or 2021-03-13T05:34:27.100+0000
But when I read my dataframe, pyspark transforms 2021-03-13T05:34:27.100Z into 2021-03-13 05:34:27.100
I want to keep the original format, but I can't figure out how to stop pyspark from doing this. I tried to use a custom schema with string for the dates, but I get this error: Parquet column cannot be converted in file file.snappy.parquet. Column: [time], Expected: string, Found: INT96
I also tried using conf parameters, but they didn't work for me.
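Parquet stores the timestamp as an integer (INT96 in this file), so the trailing `Z`/`+0000` is display formatting rather than stored data, which is why a string schema cannot be applied. One option is to keep the timestamp type and re-derive an ISO 8601 string after reading, e.g. with Spark's `date_format`. A pandas sketch of the idea (the exact format pattern is an assumption you may need to adjust):

```python
import pandas as pd

# A UTC timestamp as it appears in the question
ts = pd.to_datetime(["2021-03-13T05:34:27.100Z"], utc=True)

# Re-derive an ISO 8601 string representation after reading;
# %f emits six fractional digits, so trim if you need milliseconds only.
iso = ts.strftime("%Y-%m-%dT%H:%M:%S.%f%z")
```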

How to Load csv File into MySQL from Python

I am trying to figure out how to load csv files into MySQL using Python.
I am getting an error while inserting values. I have dropped the header row.
for row in df.iterrows():
    cursor.execute("insert into products(pr_id,pr_name,pr_category) values(%%s,%%s,%%s)", row)
print(df)
Programming error:
not all arguments converted during bytes formatting
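Two things likely cause "not all arguments converted": `%%s` is an escaped literal percent sign, while the driver expects plain `%s` placeholders, and `df.iterrows()` yields `(index, Series)` pairs rather than value tuples. A sketch of the fix with made-up data (the actual connection and cursor setup is assumed and not shown):

```python
import pandas as pd

df = pd.DataFrame({"pr_id": [1], "pr_name": ["pen"], "pr_category": ["office"]})

# iterrows() yields (index, Series) pairs, so unpack the index away
# and convert each Series to a plain tuple of values:
rows = [tuple(row) for _, row in df.iterrows()]

# With an open connection this becomes (note %s, not the escaped %%s):
# cursor.executemany(
#     "insert into products(pr_id,pr_name,pr_category) values(%s,%s,%s)",
#     rows,
# )
# conn.commit()
```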

PySpark parquet datatypes

I am using PySpark to read a relative large csv file (~10GB):
ddf = spark.read.csv('directory/my_file.csv')
All the columns have the datatype string.
After changing the datatype of, for example, column_a, I can see that the datatype changed to integer. But if I write the ddf to a parquet file and read the parquet file back, I notice that all columns have the datatype string again.
Question: How can I make sure the parquet file contains the correct datatypes, so that I do not have to change the datatypes again when reading the parquet file?
Notes:
I write the ddf as a parquet file as follows:
ddf.repartition(10).write.parquet('directory/my_parquet_file', mode='overwrite')
I use:
PySpark version 2.0.0.2
Python 3.x
I read my large files with pandas and do not have this problem, so you could try pandas instead:
http://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.read_csv.html
In[1]: import pandas as pd
In[2]: df = pd.read_csv('directory/my_file.csv')
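For the Spark side, the usual options are `spark.read.csv(path, inferSchema=True)` or an explicit schema, so the types are correct before the parquet is written (parquet itself preserves whatever types the dataframe has at write time). If you do go the pandas route, the analogous idea is to declare dtypes at read time, sketched below with made-up data:

```python
import io
import pandas as pd

csv_text = "column_a,column_b\n1,x\n2,y\n"  # made-up csv content

# Without dtype hints every column may come back as object/string;
# declaring dtypes up front avoids re-casting after the fact.
df = pd.read_csv(io.StringIO(csv_text), dtype={"column_a": "int64"})
```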

Compile a dataframe from multiple CSVs using list of dfs

I am trying to create a single dataframe from 50 csv files. I need only two columns of the csv files, namely 'Date' and 'Close'. I tried using the df.join function inside a for loop, but it eats up a lot of memory and I get the error "Killed: 9" after processing almost 22-23 csv files.
So, now I am trying to create a list of Dataframes with only 2 columns using the for loop and then I am trying to concat the dfs outside the loop function.
I have the following issues to resolve:
(i) Though most of the csv files have a start date of 2000-01-01, a few have later start dates. I want the main dataframe to have all the dates, with NaN or empty fields for the csvs with later start dates.
(ii) I want to concat them with Date as the index.
My code is:
def compileData(symbol):
    with open("nifty50.pickle", "rb") as f:
        symbols = pickle.load(f)

    dfList = []
    main_df = pd.DataFrame()
    for symbol in symbols:
        df = pd.read_csv('/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/{}.csv'.format(symbol),
                         infer_datetime_format=True, usecols=['Date', 'Close'],
                         index_col=None, header=0)
        df.rename(columns={'Close': symbol}, inplace=True)
        dfList.append(df)
    main_df = pd.concat(dfList, axis=1, ignore_index=True, join='outer')
    print(main_df.head())
You can use index_col=0 in the read_csv, or dfList.append(df.set_index('Date')), to put your Date column in the index of each dataframe. Then, using pd.concat with axis=1, pandas will use intrinsic data alignment to align all the dataframes on the index. (Also drop ignore_index=True, or the symbol column names will be discarded.)
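A minimal sketch of that answer with two made-up CSVs, showing the outer alignment on the Date index:

```python
import io
import pandas as pd

# Two hypothetical CSVs with different start dates
csv_a = "Date,Close\n2000-01-01,10.0\n2000-01-02,11.0\n"
csv_b = "Date,Close\n2000-01-02,20.0\n"

dfList = []
for symbol, text in [("AAA", csv_a), ("BBB", csv_b)]:
    df = pd.read_csv(io.StringIO(text), usecols=['Date', 'Close'], index_col='Date')
    df.rename(columns={'Close': symbol}, inplace=True)
    dfList.append(df)

# axis=1 with an outer join aligns on the Date index,
# leaving NaN where a symbol has no data for that date.
main_df = pd.concat(dfList, axis=1, join='outer')
```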
