CSV UTF-8 vs Normal CSV in Excel - excel

we have a CSV file that was creating validation errors in a process we run. The validation error made no sense as the error indicated didn't appear as the file was created exactly as instructed. I've tried several ways to resolve it without success. I eventually tried re-saving the file as CSV via Excel and noticed the file was in CSV UTF-8 format and this apparently resolved the error. I noticed the new file size is 3 bytes less than the old one but the content should be exactly the same. The file is completely in English so I am not sure what was causing this. Can anyone advise why the file size is 3 bytes less when saving as CSV rather than CSV UTF-8 format even though data in the file should be identical? These extra 3 bytes likely have caused the validation error.
Thanks for your help

Related

Not able to read .xlsb file or .xlsx (large files - 150 MB) from shared drive using python

I am facing this problem where when I try to read the file directly from shared drive it's throwing invalid path error. Trying to explain the situation below:
The data files in the form of .xlsx and .xlsb is copied to the sharepoint, which works as the source.
I used 'open in explorer' function from sharepoint and got the drive address.
Mapped the path after opening in explorer with my network drive, and added as p drive.
Now i am using this path to read the file directly using pandas read_excel.
it is throwing invalid path OS22 error
Issues :
When i am reading .xlsx file which is smaller in size 15MB, it is working well.
Trying to read another excel file 150 MB in size, getting invalid path error.
Same is happening when reading .xlsb binary files.
Already tried forward and back slashes, same error.
used open to read the file, got same invalid path error.
Though if i download the same file to local, it is working without any issue. Easily able to read the files, with same codes.
Any suggestion?

Read .csv that contains commas

I have a .csv file that contains multiple columns with texts in it. These texts contain commas, which makes things messy when I try to read the file into Python.
When I tried:
import pandas as pd
directory = 'some directory'
dataset = pd.read_csv(directory)
I got the following error:
ParserError: Error tokenizing data. C error: Expected 3 fields in line 42, saw 5
After doing some research, I found the clevercsv package.
So, I ran:
import clevercsv as csv
dataset = csv.read_csv(directory)
Running this, I got the error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4359705: character maps to <undefined>
To overcome this, I tried:
dataset = csv.read_csv(directory, encoding="utf8")
However, 10 hours later my computer was still working on reading it. So I expect that something went wrong there.
Furthermore, when I open the file in Excel, it does split cells well. Therefore, What I tried was to save the .csv file as a .xlsx and then save it as .csv in Python with an uncommon delimiter ('~'). However, when I save my .csv file as a .xlsx file, the size of the file gets smaller, which indicates that only a part of the file is saved and that is not what I want.
Lastly, I have tried the solutions given here and here. But neither seem to work for my problem.
Given that Excel reads in the file without problems, I do expect that it should be possible to read it into Python as well. Who can help me with this?
UPDATE:
When using dataset = pd.read_csv(directory, sep = ',', error_bad_lines=False)I manage to read in the .csv. But many lines are skipped. Is there a better way for this?
panda should be work https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Dou you tried somthing like dataset = pd.read_csv(directory, sep = ',', header = None)
Regards

Talend 7.1 tFileOutputExcel corrupt file

I'm trying to output an excel file from Talend 7.1. I've tried a few different setups and both xls and xlsx formats but they all result in the output file being corrupt and not being able to open it.
What am I doing wrong? I am loading an xlsx file into a database and this part works fine but outputting to excel I just can't figure it out! I was writing from the tMap directly to the tFileOutputExcel and it wasn't working (corrupt) so I changed it to write to a csv file first and then write that csv to the tFileOutputExcel but it is still corrupt.
This is my job detail:
And this is the settings in the tFileOutputExcel
I got this working by changing the transfer mode in the FTP component from 'ascii' to 'binary'. Such a simple thing but if this helps anyone else with this issue who is a newb like me :)

.csv is empty after reading it with pd.read_csv()

After running
df = pd.read_csv('my_file.csv'),
my original .csv file goes blank. Is there a way to read the .csv data without emptying the original file?
pd.read_csv() does not modify the file!
Here, the file before using pd.read_csv():
Using it:
And now if we check it again, the file hasn't changed (as expected):
So the problem isn't with pd.read_csv(). I would assume that you have other code that's messing things up. Take a look and tell us, so we can help you better.

openpyxl close archive after breaking read operation because max rows are 1048498 rows

I have two problems using openpyxl
The number of rows in the spreadsheet are 1048498. The iteration hogs memory so I put a logic to check for first five empty columns and break from it
Logic 1 works for me and code does not indefinitely iterate over the spreadsheet blank cells. I am using P4Python to delete this read only file after I am done reading it. However, openpyxl is still using that file and there is no method except save to close the archive used internally. Since my file is in read only mode, I cannot save the file. When P4 is trying to delete this file, I get this error - "The process cannot access the file because it is being used by another process."
Help is appreciated :)
If you open the file in read-only mode then it will not hog memory. Cells are created only when read. Memory use has been tested with huge files but if you think this is a bug then please submit a bug report with a sample file.
This looks like an existing issue or intended beahvior with openpyxl. If you have a read only file (P4Python sync operation - p4.run_sync(file_path_to_sync)) and if you are reading it using openpyxl, you will not be able to delete the file (P4Python p4.run_sync(file_path_to_sync + '#0') - Remove from workspace) until you save the file which is not possible (or intended in my case) since it is a read only file.

Resources