How to open a large csv file in pandas

I collected data from social media platforms and stored it in a CSV file. The file is 122 MB and holds around 1.8 million comments, which are code-mixed and contain emojis. I am trying to open it with pandas. With engine='python' I get this error:
ParserError: NULL byte detected. This byte cannot be processed in Python's native csv library at the moment, so please pass in engine='c' instead
So I tried engine='c', but now I get a different error:
ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
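
One thing that may be worth trying, since the first error points at NUL bytes, is to write a cleaned copy of the file with those bytes removed and read the copy instead. The sketch below is only an assumption about the cause: the helper name strip_nul_bytes and the chunk size are made up for illustration. If the file is actually UTF-16 encoded, the NUL bytes are part of the encoding, and passing the right encoding to read_csv would be the better fix.

```python
def strip_nul_bytes(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, dropping every NUL (0x00) byte.

    Hypothetical helper for illustration. Works in binary mode and in
    chunks, so the whole file never has to fit in memory at once.
    Returns the number of NUL bytes that were removed.
    """
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            removed += chunk.count(b"\x00")
            dst.write(chunk.replace(b"\x00", b""))
    return removed
```

The cleaned copy can then be passed to pd.read_csv in the usual way. Whether this also clears the tokenizing error depends on what else is wrong with the file, so treat it as a first diagnostic step rather than a guaranteed fix.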

Related

Not able to read .xlsb file or .xlsx (large files - 150 MB) from shared drive using python

I am facing a problem where reading a file directly from a shared drive throws an invalid path error. The situation is as follows:
The data files, in .xlsx and .xlsb form, are copied to SharePoint, which works as the source.
I used the 'open in explorer' function in SharePoint to get the drive address.
I mapped that path as a network drive and added it as the P drive.
Now I am using this path to read the file directly with pandas read_excel.
It throws an invalid path OS22 error.
Issues:
Reading a smaller .xlsx file (15 MB) works well.
Reading another Excel file of 150 MB gives the invalid path error.
The same happens when reading .xlsb binary files.
I already tried forward slashes and backslashes; same error.
I used open to read the file and got the same invalid path error.
However, if I download the same file to a local drive, it works without any issue and the same code reads the files easily.
Any suggestions?

Unable to solve multiprocessing.Manager.Lock() error in Python code (VS editor)

I am using machine learning in my Python (version 3.8.5) code. In the preprocessing step I need to hash-encode a few features. During training I fitted a hash encoder on those features and dumped it to a pickle file named 'hash_encoder.pkl'. Now, in the testing phase, I need to transform the features using this pickle file. I am using the code shown in the screenshot to hash-encode the three string features listed in its first line.
On the encoder.transform line I get an error pointing at "data_lock=multiprocessing.Manager().Lock()".
At the end I also get 'raise EOFError'.
I have tried using the same version of pandas (1.1.3) both to dump the hash_encoder file and to load it. I'm not sure why this is happening.
Can someone help me understand or debug this?
I have added a screenshot of the error.

Read .csv that contains commas

I have a .csv file that contains multiple columns with texts in it. These texts contain commas, which makes things messy when I try to read the file into Python.
When I tried:
import pandas as pd
directory = 'some directory'
dataset = pd.read_csv(directory)
I got the following error:
ParserError: Error tokenizing data. C error: Expected 3 fields in line 42, saw 5
After doing some research, I found the clevercsv package.
So, I ran:
import clevercsv as csv
dataset = csv.read_csv(directory)
Running this, I got the error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 4359705: character maps to <undefined>
To overcome this, I tried:
dataset = csv.read_csv(directory, encoding="utf8")
However, 10 hours later my computer was still working on reading it. So I expect that something went wrong there.
Furthermore, when I open the file in Excel, it splits the cells correctly. So I tried saving the .csv file as .xlsx and then writing it back to .csv from Python with an uncommon delimiter ('~'). However, when I save the .csv file as .xlsx, the file gets smaller, which indicates that only part of the file is saved, and that is not what I want.
Lastly, I have tried the solutions given here and here, but neither seems to work for my problem.
Given that Excel reads in the file without problems, I do expect that it should be possible to read it into Python as well. Who can help me with this?
UPDATE:
When using dataset = pd.read_csv(directory, sep = ',', error_bad_lines=False) I manage to read in the .csv, but many lines are skipped. Is there a better way to do this?

pandas should work: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
Did you try something like dataset = pd.read_csv(directory, sep = ',', header = None)?
Regards
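
Before skipping lines, it may help to see which records actually have the wrong number of fields. An error like "Expected 3 fields in line 42, saw 5" often means some rows contain commas that are not inside quotes, though that is only a guess about this file. The sketch below uses Python's standard csv module; the function name find_bad_rows and the idea of taking the most common field count as the expected one are assumptions made for illustration.

```python
import csv
from collections import Counter


def find_bad_rows(path, encoding="utf-8", delimiter=","):
    """Return (expected_width, bad_rows) for the CSV file at path.

    Hypothetical helper for illustration. expected_width is the most
    common number of fields per record; bad_rows is a list of
    (record_number, width) pairs for records that differ from it.
    Record numbers start at 1 and count parsed records, which can differ
    from physical line numbers when quoted fields span several lines.
    """
    widths = []
    with open(path, newline="", encoding=encoding) as f:
        for row in csv.reader(f, delimiter=delimiter):
            widths.append(len(row))
    if not widths:
        return 0, []
    expected = Counter(widths).most_common(1)[0][0]
    bad = [(i, w) for i, w in enumerate(widths, start=1) if w != expected]
    return expected, bad
```

If only a handful of records show up, they can be fixed by hand or quoted properly at the source. Note also that in pandas 1.3 and later, error_bad_lines is deprecated in favor of the on_bad_lines parameter of read_csv.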

CSV UTF-8 vs Normal CSV in Excel

We have a CSV file that was causing validation errors in a process we run. The error made no sense: the condition it reported did not appear in the file, which was created exactly as instructed. I tried several ways to resolve it without success. Eventually I re-saved the file as CSV via Excel, noticed that the file had been in CSV UTF-8 format, and this apparently resolved the error. The new file is 3 bytes smaller than the old one, but the content should be exactly the same. The file is entirely in English, so I am not sure what caused this. Can anyone explain why the file is 3 bytes smaller when saved as CSV rather than CSV UTF-8, even though the data in the file should be identical? These extra 3 bytes likely caused the validation error.
Thanks for your help.
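
A likely explanation for the 3 bytes is the UTF-8 byte order mark (BOM), the sequence EF BB BF that Excel writes at the start of a file saved as "CSV UTF-8". The sketch below shows one way to check for it and to read the text without it. The function names are made up for illustration, and whether the BOM is what tripped the validation step is an assumption.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file starts with the 3-byte UTF-8 BOM."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def read_text_without_bom(path):
    """Read a UTF-8 file as text, dropping a leading BOM if present.

    The "utf-8-sig" codec decodes plain UTF-8 as well, so this works
    for files saved with or without the BOM. newline="" keeps the
    original line endings untouched.
    """
    with open(path, encoding="utf-8-sig", newline="") as f:
        return f.read()
```

If has_utf8_bom returns True for the original file and False for the re-saved one, the size difference is accounted for.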

Trouble importing VTK files into Paraview (Error reading ascii data)

I am very new to using Paraview, and I'm trying to import a few VTK files and view them. However, I'm receiving the following errors:
Generic Warning: In /Users/kitware/dashboards/buildbot-slave/8275bd07/build/superbuild/paraview/src/VTK/IO/Legacy/vtkDataReader.cxx, line 1436
Error reading ascii data. Possible mismatch of datasize with declaration.
ERROR: In /Users/kitware/dashboards/buildbot-slave/8275bd07/build/superbuild/paraview/src/VTK/IO/Legacy/vtkUnstructuredGridReader.cxx, line 346
vtkUnstructuredGridReader (0x7fb15582bd10): Unrecognized keyword: ,
I can't seem to figure out what's wrong. I've tried converting them to other formats, to no avail.

I don't think there's a problem with the files. I can open them with Paraview 5.6. Maybe they were generated with a version of VTK that is more recent than the one used for your version of Paraview. You should install the latest version of Paraview (or at least 5.6).
The big file shows some visible geometry; the smaller one does not. But I get no error message, and everything seems OK.
