Save pandas dataframe into csv file - python-3.x

I have a problem saving a large pandas dataframe to a CSV file.
Below is a snapshot of the first 3 rows of the dataframe.
parent_sernum parent_pid created_date sernum pid pn
0 FCH21467XBN UXXXX-XXX-XXXX 2017-12-20 11:02:00.177 SSGQA20741370EA85A,SSGQA... UXXXX, 22RV-A,Uxxx-xx... 15-104065-01,15-104065-0...
1 FCH21467XBN Uxxxx-xxx-xx 2017-12-20 11:38:45.373 SSGQA20741370EA85A,SSGQA... Uxx-xxx-xxxx-A,Uxx-xx-... 15-104065-01,15-104065-0...
2 FCH2145V0UW Uxxx-xxxx-M4S 2017-12-02 11:01:26.993 SSH8A2071935A2ACDE,SSH8A... Uxx-xx-1X324RV-A,UCS-ML-... 15-104064-01,15-104064-0...
When I save this dataframe to a CSV file, it captures only the first letter and drops the rest from most of the columns, as below.
parent_sernum,parent_pid,created_date,sernum,pid,pn
F,U,2017-12-20 11:02:00.177,S,U,1
F,U,2017-12-20 11:38:45.373,S,U,1
F,U,2017-12-02 11:01:26.993,S,U,1
Below is the code I use to save the data (df is the dataframe).
Is there an option I need to set so that the full contents of the dataframe are saved?
Or is there a CSV file size limit, so that it automatically keeps only a fraction of the data to meet the size restriction?
df.to_csv('sample.csv', index=False, header=True)
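For what it's worth, to_csv itself does not truncate cell contents, so this symptom usually means the cells hold something other than plain strings. A minimal diagnostic sketch, using the column names from the snapshot above; the collection-joining branch is only a guess at what the cells might contain:
# Inspect what the cells actually contain before blaming to_csv.
print(df.dtypes)
print(type(df.loc[0, 'sernum']))

# If a cell turns out to be a collection (e.g. a set of serial numbers)
# rather than one string, join it explicitly before writing:
for col in ['sernum', 'pid', 'pn']:
    df[col] = df[col].apply(
        lambda v: ','.join(map(str, v)) if isinstance(v, (set, list, tuple)) else v
    )
df.to_csv('sample.csv', index=False, header=True)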

Related

Comparing a pandas dataframe created from a DB with one created from a CSV: dtype issue

I have a requirement to compare DB-migrated data with an S3-created CSV file for the same table, using a Python script with the pandas library.
While doing this, I am facing a dtype issue, because data types change when the data moves to the CSV file. For example, the table-created dataframe has dtype object, whereas the CSV-created one has dtype float.
So df1table.equals(df2csv) returns False.
I tried to change the dtype of the table dataframe but got an error saying a string can't be changed to float. I am also facing an issue with null values in the table dataframe compared to the CSV dataframe.
I need a generic solution that works for every table and its respective CSV file.
Is there a better way to compare them, for example converting both dataframes to the same type and then comparing?
Looking forward to your reply. Thanks!
To prevent pandas from inferring the data types, you can pass dtype=object as a parameter of pd.read_csv:
df2csv = pd.read_csv('file.csv', dtype=object)  # plus any other params you need
Example:
df1 = pd.read_csv('data.csv')
df2 = pd.read_csv('data.csv', dtype=object)
# df1
A B C
0 0.988888 1.789871 12.7
# df2
A B C
0 0.988887546565 1.789871131 12.7
CSV file:
A,B,C
0.988887546565,1.789871131,12.7
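Building on that, a minimal sketch of a generic comparison; the file names and the null normalisation are assumptions:
import pandas as pd

# Read both sources as plain objects so pandas applies no type inference.
df_table = pd.read_csv('table_export.csv', dtype=object)  # hypothetical file
df_s3 = pd.read_csv('s3_export.csv', dtype=object)        # hypothetical file

# Normalise nulls so missing values on both sides compare equal.
df_table = df_table.fillna('')
df_s3 = df_s3.fillna('')

print(df_table.equals(df_s3))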

Live-updating graph from a growing number of csv files

I need to analyse some spectral data in real-time and plot it as a self-updating graph.
The program I use outputs a text file every two seconds.
Usually I do the analysis after gathering the data, and the code works just fine. I create a dataframe in which each csv file represents a column. The problem is that with several thousand csv files the import becomes very slow, and creating a dataframe out of all of them usually takes more than half an hour.
Below is the code for creating the dataframe from multiple csv files.
''' import, append and concat files into one dataframe '''
import glob
import os
from datetime import datetime

import pandas as pd

def build_spectra(path, filter):  # wrapped in a function, since the snippet ends with return
    all_files = glob.glob(os.path.join(path, filter + "*.txt"))  # path to the files by joining path and file name
    all_files.sort(key=os.path.getmtime)
    data_frame = []
    name = []
    for file in all_files:
        creation_time = os.path.getmtime(file)
        readible_date = datetime.fromtimestamp(creation_time)
        df = pd.read_csv(file, index_col=0, header=None, sep='\t',
                         engine='python', decimal=",", skiprows=15)
        df.rename(columns={1: readible_date}, inplace=True)
        data_frame.append(df)
    full_spectra = pd.concat(data_frame, axis=1)
    # relabel the columns as minutes elapsed since the first file
    for column in full_spectra.columns:
        time_step = column - full_spectra.columns[0]
        minutes = time_step.total_seconds() / 60
        name.append(minutes)
    full_spectra.columns = name
    return full_spectra
The solution I thought of was to use the watchdog module: every time a new text file is created, it gets appended as a new column to the existing dataframe and the updated dataframe is plotted. That way I would not need to loop over all the csv files every time.
I found a very nice example on how to use watchdog here
My problem is that I could not find out how, after detecting the new file with watchdog, to read it and append it to the existing dataframe.
A minimalistic example code should look something like this:
def latest_filename():
    """a function that checks within a directory for new text files"""
    return filename

df = pd.DataFrame()                         # create a dataframe
newdata = pd.read_csv(latest_filename())    # the new file is found by watchdog
df["newcolumn"] = newdata["desiredcolumn"]  # append the new data as a column
df.plot()                                   # plot the data
The plotting part should be easy and my thoughts were to adapt the code presented here. I am more concerned with the self-updating dataframe.
I appreciate any help or other solutions that would solve my issue!
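For reference, a minimal sketch of the watchdog approach the question describes, assuming each new file has the same layout as in the loader above; the handler and the re-plotting hook are illustrative, not a tested solution:
import time

import pandas as pd
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

full_spectra = pd.DataFrame()  # starts empty; one column is added per file

class NewSpectrumHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory or not event.src_path.endswith(".txt"):
            return
        # Read only the newly created file and attach it as one more column.
        df = pd.read_csv(event.src_path, index_col=0, header=None, sep='\t',
                         engine='python', decimal=",", skiprows=15)
        full_spectra[event.src_path] = df[1]
        # re-draw the plot here, e.g. with matplotlib in interactive mode

observer = Observer()
observer.schedule(NewSpectrumHandler(), path=".", recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()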

How to convert dataframe to a text file in spark?

I unloaded a Snowflake table and created a dataframe.
The table has data of various datatypes.
I tried to save it as a text file but got an error:
Text data source does not support Decimal(10,0).
To resolve the error, I cast the columns in my select query, converting them all to string. Then I got the error below:
Text data source supports only single column, and you have 5 columns.
My requirement is to create a text file as follows:
"column1value column2value column3value and so on"
You can use a CSV output with a space delimiter:
import pyspark.sql.functions as F
df.select([F.col(c).cast('string') for c in df.columns]).write.csv('output', sep=' ')
If you want only 1 output file, you can add .coalesce(1) before .write.
You need to have a single column if you want to write with df.write.text. You can use csv instead, as suggested in #mck's answer, or you can concatenate all columns into one before you write:
import org.apache.spark.sql.functions.{col, concat_ws}

df.select(
  concat_ws(" ", df.columns.map(c => col(c).cast("string")): _*).as("value")
).write
  .text("output")

Issue when exporting dataframe to csv

I'm working on a mechanical engineering project. For the following code, the user enters the number of cylinders that their compressor has. A dataframe is then created with the correct number of columns and is exported as a CSV file to be opened in Excel.
The output dataframe looks exactly like I want it to, as shown in the first image, but when opened in Excel it looks like the second:
1. my dataframe (image)
2. Excel table (image)
Why is my dataframe not exporting properly to Excel and what can I do to get the same dataframe in Excel?
import pandas as pd

CylinderNo = int(input('Enter CylinderNo: '))
new_number = CylinderNo * 3

# one 'CylinderNo i' row per cylinder and component (3 components each)
list1 = []
for i in range(1, CylinderNo + 1):
    for j in range(0, 3):
        Cylinder_name = str('CylinderNo ') + str(i)
        list1.append(Cylinder_name)
df = pd.DataFrame(list1, columns=['Kurbel/Zylinder'])

list2 = ['Triebwerk', 'Packung', 'Ventile'] * CylinderNo
Bauteil = {'Bauteil': list2}
df2 = pd.DataFrame(Bauteil, columns=['Bauteil'])
new = pd.concat([df, df2], axis=1)

list3 = ['Nan', 'Nan', 'Nan'] * CylinderNo
Bewertung = {'Bewertung': list3}
df3 = pd.DataFrame(Bewertung, columns=['Bewertung'])
new2 = pd.concat([new, df3], axis=1)

Empfehlung = {'Empfehlung': list3}
df4 = pd.DataFrame(Empfehlung, columns=['Empfehlung'])
new3 = pd.concat([new2, df4], axis=1)

new3 = new3.set_index('Kurbel/Zylinder', append=True).swaplevel(0, 1)

# export dataframe to csv
new3.to_csv('new3.csv')
To be clear, a comma-separated values (CSV) file is not an Excel format type or table. It is a delimited text file that Excel, like other applications, can open.
What you are comparing is simply presentation. Both data frames are exactly the same. For multi-index data frames, pandas' print output does not repeat index values, for readability on the console or in IDEs like Jupyter. But such values are not removed from the underlying data frame, only from its presentation. If you re-order the indexes, you will see the presentation change. The complete data frame is what is exported to CSV. And ideally, for data integrity, you want the full data set exported with to_csv to be importable back into pandas with read_csv (which can set indexes), or into other languages and applications.
Essentially, CSV is an industry format to store and transfer data. Consider using Excel spreadsheets, HTML, markdown, or other reporting formats for your presentation needs. Therefore, to_csv may not be the best method. You could build the text file manually with Python I/O write methods (with open('new.csv', 'w') as f: ...), but that would be an extensive workaround. See also #Jeff's answer here, but do note that the latter part of that solution removes data.
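As a sketch of the spreadsheet route suggested above, assuming openpyxl is installed: to_excel with merge_cells=True (the default) merges repeated MultiIndex labels, reproducing the grouped look pandas prints to the console.
# Write a real Excel workbook instead of a CSV; repeated MultiIndex
# labels are merged, matching the console presentation of new3.
new3.to_excel('new3.xlsx', merge_cells=True)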

Python3: How to keep the leading zero when exporting a dataframe to a csv or text file

I am using the following code (Python 3):
import csv
import io

string_io = io.StringIO()  # buffer assumed from the getvalue() call below
Mydataframe_df.to_csv(string_io, sep=',', quoting=csv.QUOTE_ALL,
                      header=True, index=False, encoding='utf-8')
df_writer = Mydata_Output.get_writer('/MYFILE_TEST.csv')
df_string = string_io.getvalue()
# save the string as bytes with the writer
df_writer.write(df_string.encode('utf-8'))
# close the writer connection
df_writer.close()
The issue is that for columns with values like "012345", the leading 0 is removed in the output file, even when the file is opened with Notepad and even when the column is set to string in the dataframe.
I'm kind of new here too, so I don't have the street cred to comment.
You can preserve leading zeros by converting to a string before outputting the data. Let's say, for example, you want eight digits in your data columns: you could use zfill to left-pad the string with zeros so it is eight digits long.
outvar = str(numvar).zfill(8)
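Applied to a whole dataframe column, that would look something like this (the column name is illustrative):
# Cast to string first, then left-pad every value to eight digits.
df['account_id'] = df['account_id'].astype(str).str.zfill(8)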
The issue with the leading 0 is that when we load the dataframe into pandas before writing the CSV, pandas by default infers its own data types.
Using .get_dataframe(infer_with_pandas=False) forces it to keep the source schema.
The remaining issue is that when there are nulls in the data (in fields other than string), pandas does not like it, so we need to convert everything to string or clean the data first.
I found .get_dataframe(infer_with_pandas=False) in one of the publications here; I will try to reference it later.
# Read recipe inputs without pandas type inference
Mydataframe = dataiku.Dataset("TESTING_for_leading0")
Mydataframe_df = Mydataframe.get_dataframe(infer_with_pandas=False)

Mydataframe_df.to_csv(string_io, sep=',', quoting=csv.QUOTE_ALL,
                      header=True, index=False, encoding='utf-8')
df_writer = Mydata_Output.get_writer('/MYFILE_TEST.csv')
df_string = string_io.getvalue()
# save the string as bytes with the writer
df_writer.write(df_string.encode('utf-8'))
# close the writer connection
df_writer.close()
