.csv is empty after reading it with pd.read_csv() - python-3.x

After running
df = pd.read_csv('my_file.csv'),
my original .csv file goes blank. Is there a way to read the .csv data without emptying the original file?

pd.read_csv() does not modify the file!
You can check this yourself: look at the file before calling pd.read_csv(), call it, and then look at the file again; the contents are unchanged, as expected. The sketch below shows one way to verify that.
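A minimal sketch (reusing the my_file.csv name from the question) that compares the file's raw bytes before and after the call:

import pandas as pd

# snapshot the raw bytes before reading
with open('my_file.csv', 'rb') as f:
    before = f.read()

df = pd.read_csv('my_file.csv')

# snapshot the raw bytes again afterwards
with open('my_file.csv', 'rb') as f:
    after = f.read()

print(before == after)  # True: read_csv only reads, it never writes to the file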
So the problem isn't with pd.read_csv(). I would assume that you have other code that's messing things up. Take a look and tell us, so we can help you better.

Related

CSV UTF-8 vs Normal CSV in Excel

We have a CSV file that was causing validation errors in a process we run. The validation errors made no sense, because the problem they pointed to wasn't visible and the file had been created exactly as instructed. I tried several ways to resolve it without success. Eventually I re-saved the file as CSV via Excel, noticed the original had been in CSV UTF-8 format, and this apparently resolved the error. The new file is 3 bytes smaller than the old one, but the content should be exactly the same. The file is entirely in English, so I am not sure what caused this. Can anyone explain why the file is 3 bytes smaller when saved as CSV rather than CSV UTF-8, even though the data in the file should be identical? Those extra 3 bytes have likely caused the validation error.
Thanks for your help
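For what it's worth, a 3-byte difference at the start of the file is consistent with the UTF-8 byte order mark (BOM, bytes EF BB BF) that Excel writes in its "CSV UTF-8" format. A quick way to check (the file name is a placeholder):

import codecs

# Excel's "CSV UTF-8" files start with the UTF-8 BOM (b'\xef\xbb\xbf');
# plain "CSV" files do not, which accounts for a 3-byte size difference.
with open('my_file.csv', 'rb') as f:
    first_bytes = f.read(3)

print(first_bytes == codecs.BOM_UTF8)  # True if the file carries a BOM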

Python script that reads csv files

I need a script that reads CSV files, gets the headers, and filters by a specific column. I have tried researching this but haven't managed to find anything of quality.
Any help will be deeply appreciated.
There's a standard csv library included with Python.
https://docs.python.org/3/library/csv.html
Its csv.DictReader will read each row as a dictionary, with the keys taken from the first row of the CSV.
You can also use pandas.read_csv for the same task.
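A minimal sketch with the standard library (the file name, column name, and filter value are hypothetical placeholders):

import csv

with open('data.csv', newline='') as f:
    reader = csv.DictReader(f)       # keys come from the header row
    print(reader.fieldnames)         # the headers

    # keep only the rows whose "status" column equals "active"
    matching = [row for row in reader if row['status'] == 'active']

for row in matching:
    print(row)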

Difficulty with encoding while reading data in Spark

In connection with my earlier question, when I give the command,
filePath = sc.textFile("/user/cloudera/input/Hin*/datafile.txt")
filePath.collect()
some of the data has '\xa0' prefixed to every word, while other parts of the data don't have that special character. I am attaching two pictures, one with '\xa0' and one without. The content shown in the two pictures belongs to the same file; only some of the data from that file is read that way by Spark. I have checked the original data file in HDFS, and it was problem-free.
I feel that it has something to do with encoding. I tried using replace inside flatMap, e.g. flatMap(lambda line: line.replace(u'\xa0', ' ').split(" ")) and flatMap(lambda line: line.replace(u'\xa0', u' ').split(" ")), but neither worked for me. This question might sound dumb, but I am a newbie with Apache Spark and need some assistance to overcome this problem.
Can anyone please help me? Thanks in advance.
Check the encoding of your file. When you use sc.textFile, Spark expects a UTF-8-encoded file.
One solution is to read the file with sc.binaryFiles and then decode it with the expected encoding.
sc.binaryFiles creates a key/value RDD where the key is the file path and the value is the file content as bytes.
If you need to keep only the text, apply a decoding function:
filePath = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
texts = filePath.map(lambda x: x[1].decode('utf-8'))  # or another encoding, depending on your file
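A slightly fuller sketch of the same idea, combining the explicit decode with the word split from the question (assuming PySpark and a UTF-8 source file):

# read (path, bytes) pairs, decode each file's content explicitly,
# then replace non-breaking spaces and split into words as in the original flatMap
rdd = sc.binaryFiles("/user/cloudera/input/Hin*/datafile.txt")
words = (rdd.map(lambda kv: kv[1].decode('utf-8'))
            .flatMap(lambda text: text.replace(u'\xa0', ' ').split(' ')))
print(words.take(10))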

openpyxl close archive after breaking read operation because max rows are 1048498 rows

I have two problems using openpyxl
The spreadsheet has 1048498 rows. Iterating over all of them hogs memory, so I added logic to check for the first five empty columns and break out of the loop.
Logic 1 works for me, and the code no longer iterates indefinitely over the spreadsheet's blank cells. I am using P4Python to delete this read-only file after I am done reading it. However, openpyxl still has the file open, and there is no method except save to close the archive it uses internally. Since my file is in read-only mode, I cannot save it. When P4 tries to delete the file, I get this error: "The process cannot access the file because it is being used by another process."
Help is appreciated :)
If you open the file in read-only mode then it will not hog memory; cells are created only when they are read. Memory use has been tested with huge files, but if you think this is a bug, please submit a bug report with a sample file.
This looks like an existing issue or intended behavior in openpyxl. If you have a read-only file (from a P4Python sync operation, p4.run_sync(file_path_to_sync)) and you read it with openpyxl, you will not be able to delete the file (with P4Python's p4.run_sync(file_path_to_sync + '#0') to remove it from the workspace) until you save the workbook, which is not possible (or intended, in my case) since it is a read-only file.
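Recent openpyxl releases do expose a close() method on read-only workbooks that releases the underlying file handle without saving; a minimal sketch, assuming such a version is available (the path and the empty-row check are placeholders for the question's "Logic 1"):

from openpyxl import load_workbook

wb = load_workbook('huge.xlsx', read_only=True)    # read-only: cells are created lazily
ws = wb.active

for row in ws.iter_rows(values_only=True):
    if all(cell is None for cell in row[:5]):      # stand-in for the "first five empty columns" check
        break
    # ... process(row) ...

wb.close()  # releases the file handle so another process (e.g. P4) can delete the file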

itk ImageFileReader exception reading if I add VTK Imagewriter object creation

That's it:
I successfully read a DICOM file with itk::ImageFileReader.
Now I want to export an image.
I use vtkJPEGWriter.
When I add the line
vtkJPEGWriter* writer = vtkJPEGWriter::New();
even though that line is never executed before the read, I can no longer read the file. If I comment the line out, I can read the file again.
But the writer is not even connected to the file reader. I don't get it; it shouldn't have anything to do with the read at that point!!
I'm wasting so much time just trying to figure out what the problem is.
The problem is in the file. I don't know why it works with that file without that line. Really weird.
I just don't get it.
I will try with other files.
These lines worked for me:
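// create the JPEG writer, set the output file name, connect it to the pipeline, and write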
vtkSmartPointer<vtkJPEGWriter> JPEGWriter = vtkSmartPointer<vtkJPEGWriter>::New();
JPEGWriter->SetFileName("d:\\Tempx\\Pacienttest3\\Sagital.bmp");
JPEGWriter->SetInputConnection(m_pColor->GetOutputPort());
JPEGWriter->Write();
where m_pColor is of type vtkImageMapToColors ...
