Issue when exporting dataframe to csv - python-3.x

I'm working on a mechanical engineering project. For the following code, the user enters the number of cylinders that their compressor has. A dataframe is then created with the correct number of columns and is exported to Excel as a CSV file.
The outputted dataframe looks exactly like I want it to as shown in the first link, but when opened in Excel it looks like the image in the second link:
1.my dataframe
2.Excel Table
Why is my dataframe not exporting properly to Excel and what can I do to get the same dataframe in Excel?
import pandas as pd
CylinderNo=int(input('Enter CylinderNo: '))
new_number=CylinderNo*3
list1=[]
for i in range(1,CylinderNo+1):
    for j in range(0,3):
        Cylinder_name=str('CylinderNo ')+str(i)
        list1.append(Cylinder_name)
df = pd.DataFrame(list1,columns =['Kurbel/Zylinder'])
list2=['Triebwerk', 'Packung','Ventile']*CylinderNo
Bauteil = {'Bauteil': list2}
df2 = pd.DataFrame (Bauteil, columns = ['Bauteil'])
new=pd.concat([df, df2], axis=1)
list3=['Nan','Nan','Nan']*CylinderNo
Bewertung={'Bewertung': list3}
df3 = pd.DataFrame (Bewertung, columns = ['Bewertung'])
new2=pd.concat([new, df3], axis=1)
Empfehlung={'Empfehlung': list3}
df4 = pd.DataFrame (Empfehlung, columns = ['Empfehlung'])
new3=pd.concat([new2, df4], axis=1)
new3.set_index('Kurbel/Zylinder')
new3 = new3.set_index('Kurbel/Zylinder', append=True).swaplevel(0,1)
#export dataframe to csv
new3.to_csv('new3.csv')

To be clear, a comma-separated values (CSV) file is not an Excel format or table. It is a delimited text file that Excel, like many other applications, can open.
What you are comparing is simply presentation. Both data frames are exactly the same. For MultiIndex data frames, the pandas print output does not repeat outer index values, for readability on the console or in an IDE like Jupyter. Such values are not removed from the underlying data frame, only from its presentation. If you re-order the index levels, you will see this presentation change. The complete data frame is what is exported to CSV. And ideally, for data integrity, you want the full data set exported with to_csv to be importable back into pandas with read_csv (which can set indexes) or into other languages and applications.
Essentially, CSV is an industry format to store and transfer data. Consider using Excel spreadsheets, HTML, Markdown, or other reporting formats for your presentation needs; to_csv may not be the best method for that. You can try to build the text file manually with Python I/O write methods (with open('new.csv', 'w') as f), but that will be an extensive workaround. See also @Jeff's answer here, but do note that the latter part of that solution does remove data.
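As a minimal sketch of the point above (the sample values and the in-memory buffer are illustrative assumptions, not the asker's data), the following shows that to_csv writes every outer index label even though the printed frame shows each one only once, and that read_csv can rebuild the same MultiIndex:

```python
import io

import pandas as pd

# Hypothetical two-cylinder frame with a two-level index.
index = pd.MultiIndex.from_product(
    [["Cylinder 1", "Cylinder 2"], ["Triebwerk", "Packung", "Ventile"]],
    names=["Kurbel/Zylinder", "Bauteil"],
)
df = pd.DataFrame({"Bewertung": [1, 2, 3, 4, 5, 6]}, index=index)

# The console view prints each outer label once per group.
print(df)

# The CSV text repeats the outer label on every row.
csv_text = df.to_csv()
print(csv_text)

# Reading the text back with both index columns restores the MultiIndex.
restored = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])
print(restored.index.nlevels)
```

If the goal is a report rather than data exchange, a spreadsheet writer such as DataFrame.to_excel may be a better fit than to_csv.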

Related

pandas db created dataframe compare with csv created dataframe having type issue while comparing

I have a requirement to compare data migrated from a database with a CSV file created on S3 for the same table, using a Python script with the pandas library.
While doing this, I am facing a dtype issue, because the data type changes when the data moves to the CSV file. For example, the data frame created from the table has dtype object, whereas the one created from the CSV file has dtype float.
As a result, df1table.equals(df2csv) returns False.
I even tried to change the dtype of the table data frame, but got an error saying that a string can't be converted to float. I am also facing issues with null values in the table data frame compared to the CSV data frame.
I need a generic solution which works for every table and its respective CSV file.
Is there a better way to compare them? For example, converting both data frames to the same type and then comparing.
Looking forward to your reply. Thanks!
To prevent pandas from inferring the data types, you can pass dtype=object to pd.read_csv:
df2csv = pd.read_csv('file.csv', dtype=object)  # plus any other params you need
Example:
df1 = pd.read_csv('data.csv')
df2 = pd.read_csv('data.csv', dtype=object)
# df1
A B C
0 0.988888 1.789871 12.7
# df2
A B C
0 0.988887546565 1.789871131 12.7
CSV file:
A,B,C
0.988887546565,1.789871131,12.7
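A self-contained sketch of the same idea follows. The in-memory CSV text and the table_df frame are stand-ins (assumptions) for the real S3 file and database query; the point is only that dtype=object keeps the CSV fields as the original text, so they can be compared with string data from the table:

```python
import io

import pandas as pd

# Stand-in for the CSV file on S3.
csv_text = "A,B,C\n0.988887546565,1.789871131,12.7\n"

# Default behaviour: pandas infers float columns.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred.dtypes)

# dtype=object: every field stays as the text found in the file.
verbatim = pd.read_csv(io.StringIO(csv_text), dtype=object)
print(verbatim.dtypes)

# Stand-in for the frame built from the database table (string values).
table_df = pd.DataFrame(
    {"A": ["0.988887546565"], "B": ["1.789871131"], "C": ["12.7"]},
    dtype=object,
)

# Compare cell by cell once both sides hold the same kind of value.
same_values = verbatim.values.tolist() == table_df.values.tolist()
print(same_values)
```

Null handling is a separate question: empty CSV fields still become NaN unless options such as keep_default_na=False are passed to read_csv.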

How do I convert my response with byte characters to readable CSV - PYTHON

I am building an API to save CSVs from the SharePoint REST API using Python 3. I am using a public dataset as an example. The original CSV has 3 columns (Group, Team, FIFA Ranking) with the corresponding data in the rows. For reference, the original CSV in the SharePoint UI looks like this:
After using data=response.content, the value of data is:
b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\nA,Netherlands,8\r\nB,England,5\r\nB,Iran,20\r\nB,United States,16\r\nB,Wales,19\r\nC,Argentina,3\r\nC,Saudi Arabia,51\r\nC,Mexico,13\r\nC,Poland,26\r\nD,France,4\r\nD,Australia,38\r\nD,Denmark,10\r\nD,Tunisia,30\r\nE,Spain,7\r\nE,Costa Rica,31\r\nE,Germany,11\r\nE,Japan,24\r\nF,Belgium,2\r\nF,Canada,41\r\nF,Morocco,22\r\nF,Croatia,12\r\nG,Brazil,1\r\nG,Serbia,21\r\nG,Switzerland,15\r\nG,Cameroon,43\r\nH,Portugal,9\r\nH,Ghana,61\r\nH,Uruguay,14\r\nH,South Korea,28\r\n'
How do I convert the above into a CSV that pandas can manipulate, with the columns being Group, Team, FIFA Ranking followed by the corresponding data, in a dynamic way so that this method works for any CSV?
I tried:
data=response.content.decode('utf-8', 'ignore').split(',')
However, when I convert the data variable to a dataframe and then export it to CSV, the file just contains all the values in one column.
I tried:
data=response.content.decode('utf-8') or data=response.content.decode('utf-8', 'ignore') without the split
However, pandas does not accept this as valid input and reports an invalid use of the DataFrame constructor.
I tried:
data=json.loads(response.content)
However, the content is not valid JSON, so you get the error json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Given:
data = b'Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\n' #...
If you just want a CSV version of your data you can simply do:
with open("foo.csv", "wt", encoding="utf-8", newline="") as file_out:
    file_out.write(data.decode())
If your objective is to load this data into a pandas dataframe and the CSV is not actually important, you can:
import io
import pandas
foo = pandas.read_csv(io.StringIO(data.decode()))
print(foo)
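As a variation on the snippet above, read_csv also accepts a binary buffer, so the bytes can be wrapped in io.BytesIO without decoding them first. This is a minimal sketch using a shortened copy of the sample payload; the variable names are illustrative:

```python
import io

import pandas as pd

# Shortened copy of the response body shown in the question.
data = b"Group,Team,FIFA Ranking\r\nA,Qatar,50\r\nA,Ecuador,44\r\nA,Senegal,18\r\n"

# Wrap the raw bytes in a file-like object and let pandas parse them.
frame = pd.read_csv(io.BytesIO(data), encoding="utf-8")

print(list(frame.columns))
print(len(frame))
```

Because the header row is read from the payload itself, the same call should work for any CSV response, whatever its columns are.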

Python data source - first two columns disappear

I have started using PowerBI and am using Python as a data source with the code below. The source data can be downloaded from here (it's about 700 megabytes). The data is originally from here (contained in IOT_2019_pxp.zip).
import pandas as pd
import numpy as np
import os
path = '/path/to/file'
to_chunk = pd.read_csv(os.path.join(path,'A.txt'), delimiter = '\t', header = [0,1], index_col = [0,1],
                       iterator=True, chunksize=1000)
def chunker(to_chunk):
    to_concat = []
    for chunk in to_chunk:
        try:
            to_concat.append(chunk['BG'].loc['BG'])
        except KeyError:
            # this chunk contains no rows for region BG
            pass
    return to_concat
A = pd.concat(chunker(to_chunk))
I = np.identity(A.shape[0])
L = pd.DataFrame(np.linalg.inv(I-A), index=A.index, columns=A.columns)
The code simply:
1. Loads the file A.txt, which is a symmetrical matrix. This matrix has every sector in every region for both rows and columns. In pandas, these form a MultiIndex.
2. Filters just the region that I need, which is BG. Since it's a symmetrical matrix, both rows and columns are filtered.
3. Calculates the inverse of the matrix, giving us L, which I want to load into PowerBI. This matrix now just has a single regular Index for sector.
This is all well and good; however, when I load it into PowerBI, the first column (the sector names for each row, i.e. the DataFrame Index) disappears. When the query gets processed, it is as if it were never there. This is true for both dataframes A and L, so it's not an issue of data processing. The column of row names (the DataFrame index) is still there in Python; PowerBI just drops it for some reason.
I need this column so that I can link these tables to other tables in my data model. Any ideas on how to keep it from disappearing at load time?
For what it's worth, calling reset_index() moved the index out of the dataframes into regular columns, and those got loaded. For whatever reason, PBI does not properly load pandas indices.
For a regular 1D index, I had to do S.reset_index() (this returns a new dataframe, so the result has to be assigned).
For a MultiIndex, I had to do L.reset_index(inplace=True).
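A small sketch of that workaround, using a made-up 2x2 matrix in place of the real A.txt data (the sector names and values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the filtered matrix A, indexed by sector.
sectors = pd.Index(["Agriculture", "Mining"], name="sector")
A = pd.DataFrame(
    [[0.1, 0.2], [0.3, 0.4]],
    index=sectors,
    columns=["Agriculture", "Mining"],
)

# Same calculation as in the question: the inverse of (I - A).
I = np.identity(A.shape[0])
L = pd.DataFrame(np.linalg.inv(I - A), index=A.index, columns=A.columns)

# Move the index into an ordinary column so it survives the load.
L_for_powerbi = L.reset_index()
print(list(L_for_powerbi.columns))
```

Naming the index first (here "sector") gives the new column a meaningful header; an unnamed index would come out as a column called "index".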

Read excel into pandas dataframe without modifying the values of excel?

I am reading an xlsx file using pandas, pd.read_excel('myfile.xlsx', sheet_name="my_sheet", header=2), and writing the df to a CSV file using df.to_csv.
The Excel file contains several columns with percentage values (e.g. 27.44 %). In the dataframe the values are converted to 0.2744, but I don't want any modification of the data. How can I achieve this?
I already tried:
Using a lambda function to convert the values back from 0.2744 to 27.44 %, but I don't want this because the column names/indexes are not fixed. Any column can contain % values.
df = pd.read_excel('myexcel.xlsx',sheet_name="my_sheet",header=5,dtype={'column_name':str}) - Didn't work
df = pd.read_excel('myexcel.xlsx',sheet_name="my_sheet",header=5,dtype={'column_name':object}) - Didn't work
Tried the xlrd module also, but that too converted % values to float.
df = pd.read_excel('myexcel.xlsx',sheet_name="my_sheet")
df.to_csv('mycsv.csv',sep=",",index=False)
From Excel, save the xlsx file directly in CSV format.
To import your CSV file, use the pandas library as follows:
import pandas as pd
df=pd.read_csv('my_sheet.csv') # assuming the file is located in the same directory
More information on pandas.read_csv
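A hedged sketch of why this route can help: it assumes that Excel's CSV export writes percentage cells as they are displayed (e.g. 27.44 %), and it uses a made-up in-memory file in place of my_sheet.csv. Passing dtype=str asks read_csv to keep every field as text, so no column names need to be known in advance:

```python
import io

import pandas as pd

# Hypothetical content of a CSV saved from Excel, percentages as displayed.
csv_text = "name,share\nA,27.44 %\nB,3.10 %\n"

# dtype=str keeps all fields as text instead of converting them.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(df)
```

The values then pass through to_csv unchanged, at the cost of having to convert columns explicitly before doing arithmetic on them.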

Error when using pandas read_excel(header=[0,1])

I'm trying to use pandas read_excel to work with a file. The file has two rows of headers, so I'm trying to use the MultiIndex feature as part of the header keyword argument.
import pandas as pd, os
"""data in 2015 MOR Folder"""
filename = 'MOR-JANUARY 2015.xlsx'
print(os.path.isfile(filename))
df1 = pd.read_excel(filename, header=[0,1], sheetname='MOR')
print(df1)
The error I get is ValueError: Length of new names must be 1, got 2. The file is in this Google Drive folder: https://drive.google.com/drive/folders/0B0ynKIVAlSgidFFySWJoeFByMDQ?usp=sharing
I'm trying to follow the solution posted here
Read excel sheet with multiple header using Pandas
I could be mistaken, but I don't think pandas handles parsing Excel rows where there are merged cells. So in that first row, the merged cells get parsed as mostly empty cells. You'd need the labels repeated in every column for the header to be built correctly. This is what motivates the ffill below. If you can control the Excel workbook ahead of time, you might be able to use the code you have.
my solution
It's not pretty, but it'll get it done.
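filename = 'MOR-JANUARY 2015.xlsx'
# recent pandas versions spell the keyword sheet_name (older ones used sheetname)
df1 = pd.read_excel(filename, sheet_name='MOR', header=None)
vals = df1.values
# forward-fill across columns so merged header cells repeat their label
mux = pd.MultiIndex.from_arrays(df1.ffill(axis=1).values[:2, 1:], names=[None, 'DATE'])
df1 = pd.DataFrame(vals[2:, 1:], vals[2:, 0], mux)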
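The same technique can be tried without the workbook. The sketch below uses a small hand-built frame (the labels and numbers are invented) that mimics what read_excel returns with header=None when the first header row contains merged cells:

```python
import numpy as np
import pandas as pd

# Mimics a sheet read with header=None: merged header cells appear as NaN.
raw = pd.DataFrame([
    [np.nan, "Sales", np.nan, "Costs", np.nan],
    ["DATE", "Q1", "Q2", "Q1", "Q2"],
    ["2015-01-01", 1, 2, 3, 4],
    ["2015-01-02", 5, 6, 7, 8],
])

# Forward-fill the two header rows left to right, skipping the label column.
header = raw.iloc[:2, 1:].ffill(axis=1)
columns = pd.MultiIndex.from_arrays(header.to_numpy().tolist())

# Rebuild the frame: data rows as values, first column as the row index.
tidy = pd.DataFrame(
    raw.iloc[2:, 1:].to_numpy(),
    index=raw.iloc[2:, 0].to_numpy(),
    columns=columns,
)
print(tidy)
```

The rebuilt values keep the object dtype, so a follow-up conversion (for example with pd.to_numeric per column) may be needed before calculations.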
