How to load a CSV file into MySQL from Python - python-3.x

I am trying to figure out how to load CSV files into MySQL using Python. I am getting an error while inserting values. I have already dropped the header row:
for row in df.iterrows():
    cursor.execute("insert into products(pr_id,pr_name,pr_category) values(%%s,%%s,%%s)", row)
print(df)
ProgrammingError:
not all arguments converted during bytes formatting
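A minimal sketch of one likely fix (assuming the mysql-connector-python driver and a hypothetical products.csv): the placeholders should be %s rather than %%s, and df.iterrows() yields (index, Series) pairs, so the row's values need to be unpacked into a plain tuple.

import mysql.connector  # assuming mysql-connector-python
import pandas as pd

df = pd.read_csv("products.csv")  # hypothetical file name

conn = mysql.connector.connect(user="user", password="pass", database="db")
cursor = conn.cursor()

for _, row in df.iterrows():
    # %s placeholders (not %%s), and native Python values per row
    cursor.execute(
        "insert into products(pr_id,pr_name,pr_category) values(%s,%s,%s)",
        tuple(row.tolist()),  # .tolist() converts numpy scalars to plain types
    )

conn.commit()
cursor.close()
conn.close()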

Related

How to convert header of csv into valid data of pyspark dataframe

I am reading the CSV data with a defined schema, but the valid data in the header row is getting overwritten by the schema.
I want the header to appear as the first row of the dataframe.
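A hedged sketch of one way to keep the header line as data (assuming Spark 2.x with an existing SparkSession named spark, and hypothetical column names): read with header=False and an all-string schema, so the first line is not consumed as column names and survives type casting.

from pyspark.sql.types import StructType, StructField, StringType

# hypothetical column names; with header=False the file's header
# line stays in the data as the first row
schema = StructType([
    StructField("c1", StringType()),
    StructField("c2", StringType()),
    StructField("c3", StringType()),
])
df = spark.read.csv("data.csv", schema=schema, header=False)
df.show(1)  # first row is the original header line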

Pandas df.to_parquet write() got an unexpected keyword argument 'index' when ignoring index column

I am trying to export a pandas dataframe to parquet format using the following:
df.to_parquet("codeset.parquet", index=False)
I don't want an index column in the parquet file. Does to_parquet drop it automatically, or how can I get around this so that no index column is included in the exported parquet?
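One likely cause (an assumption, since the question doesn't give versions) is a pandas/pyarrow installation that predates the index keyword on to_parquet, so upgrading may be enough. As a hedged workaround, you can drop the index by going through pyarrow directly:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"code": [1, 2], "label": ["a", "b"]})  # toy data

# preserve_index=False leaves the pandas index out of the file
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "codeset.parquet")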

In Spark 1.6, how to read a CSV file with a duplicated column name

I am unable to find a solution for reading a CSV file which has a column name repeated twice; reading it gives an error complaining about duplicate column names.
Is there a way to handle this in Spark without altering the CSV file?
My CSV data looks like this, delimited by tab (\t), with some extra spaces in each column:
col1 col2 col3
2020 100 sometext
You can also try using the textFile method to read the CSV files and then convert them to a DataFrame, or use them as RDDs after splitting and mapping them back.
Hope this works!
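A sketch of that textFile route, Spark 1.6 style (the path, app name, and column names are assumptions for illustration): splitting each line yourself lets you assign unique column names and sidestep the duplicate header.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="dup-columns")
sqlContext = SQLContext(sc)  # enables rdd.toDF()

rdd = sc.textFile("data.tsv")
header = rdd.first()

# drop the header line, split on tab, and strip the extra spaces
rows = (rdd.filter(lambda line: line != header)
           .map(lambda line: [c.strip() for c in line.split("\t")]))

# assign your own unique names instead of the duplicated header
df = rows.toDF(["col1", "col2", "col3"])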

Error while writing data from python to redshift - Invalid date format - length must be 10 or more

I have a dataframe in Python with date columns of dtype datetime64[ns]. Now I am trying to write this dataframe to Redshift. I am getting the following stl_load_errors:
Invalid date format - length must be 10 or more
All my dates are in 2016-10-21 format and thus have a length of 10. Moreover, I have ensured that no row has a malformed value like 2016-1-8, which would be only 8 characters. So the error makes no sense.
Has anyone faced a similar error while writing data to Redshift? Any explanation?
Note:
Here's some context: I am running the Python script from EC2. The script writes the data in JSON format to an S3 bucket, and that JSON is then loaded into an empty Redshift table. The Redshift table declares the date columns as the 'date' type. I know there's another way using boto3/copy, but for now I am stuck with this method.
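One plausible explanation (an assumption about this setup, not confirmed by the question): pandas to_json serializes datetime64 columns as epoch milliseconds by default, so the JSON that reaches Redshift no longer looks like 2016-10-21. A minimal sketch of formatting the dates explicitly before writing, with a hypothetical column name:

import pandas as pd

df = pd.DataFrame({"order_date": pd.to_datetime(["2016-10-21", "2016-10-22"])})

# write dates as plain YYYY-MM-DD strings instead of to_json's
# default epoch milliseconds, which a Redshift DATE column rejects
df["order_date"] = df["order_date"].dt.strftime("%Y-%m-%d")
df.to_json("out.json", orient="records", lines=True)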

Compile a dataframe from multiple CSVs using list of dfs

I am trying to create a single dataframe from 50 CSV files. I need only two columns from each CSV, namely 'Date' and 'Close'. I tried using the df.join function inside the for loop, but it eats up a lot of memory and I get a "Killed: 9" error after processing roughly 22-23 CSV files.
So now I am trying to build a list of dataframes with only those 2 columns inside the for loop, and then concat the dfs outside the loop.
I have the following issues to resolve:
(i) Most of the CSV files have a start date of 2000-01-01, but a few CSVs have later start dates. I want the main dataframe to contain all the dates, with NaN or empty fields for the CSVs with later start dates.
(ii) I want to concat them with Date as the index.
My code is:
def compileData(symbol):
    with open("nifty50.pickle","rb") as f:
        symbols=pickle.load(f)
    dfList=[]
    main_df=pd.DataFrame()
    for symbol in symbols:
        df=pd.read_csv('/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/{}.csv'.format(symbol),infer_datetime_format=True,usecols=['Date','Close'],index_col=None,header=0)
        df.rename(columns={'Close':symbol}, inplace=True)
        dfList.append(df)
    main_df=pd.concat(dfList,axis=1,ignore_index=True,join='outer')
    print(main_df.head())
You can use index_col=0 in read_csv, or dfList.append(df.set_index('Date')), to put the Date column in the index of each dataframe. Then, with pd.concat and axis=1, pandas will use intrinsic data alignment to align all the dataframes on the index.
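A hedged sketch of the whole function with that change applied (note that ignore_index=True has to go too, since it would discard the per-symbol column labels that the alignment relies on):

import pickle
import pandas as pd

def compileData():
    with open("nifty50.pickle", "rb") as f:
        symbols = pickle.load(f)

    dfList = []
    for symbol in symbols:
        df = pd.read_csv(
            '/Users/uditvashisht/Documents/udi_py/stocks/stock_dfs/{}.csv'.format(symbol),
            usecols=['Date', 'Close'],
            parse_dates=['Date'],
        )
        df.rename(columns={'Close': symbol}, inplace=True)
        dfList.append(df.set_index('Date'))  # Date becomes the alignment key

    # outer join on the index keeps every date; symbols that start
    # later get NaN for the missing rows
    main_df = pd.concat(dfList, axis=1, join='outer')
    return main_df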
