weird characters in Pandas dataframe - how to standardize to UTF-8? - python-3.x

I'm using Python + Camelot (a PDF table-extraction library) to read a PDF, clean it up, and write to Excel or csv. There are some non-standard dashes that print out as a weird character.
Using Camelot means I'm not calling "read_csv"; the text comes straight from the PDF. A value that is supposed to be "1-4" prints out as "1–4" (the dash is an en dash, U+2013, not an ASCII hyphen).
I fixed this using a regular expression, but a colleague mentioned I should standardize to UTF-8. I tried to do that for the header like this:
header = df.iloc[0, 1:].str.encode('utf-8')
but then that value becomes b'1\xe2\x80\x934'.
Any advice? The goal is simply to get standard text.
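For what it's worth, .str.encode('utf-8') returns Python bytes objects, which is why the value displays as b'1\xe2\x80\x934' (those are the UTF-8 bytes of the en dash); the data is already valid UTF-8, so the fix is to normalize the dash characters rather than re-encode. A minimal sketch of the regex approach, using a hypothetical dataframe in place of Camelot's output:

import pandas as pd

# Hypothetical dataframe standing in for the table Camelot extracts.
df = pd.DataFrame({"range": ["1\u20134", "5\u20138"]})  # en dashes from the PDF

# Replace common Unicode dash variants (figure dash, en dash, em dash,
# minus sign) with a plain ASCII hyphen across all string cells.
df = df.replace({r"[\u2012\u2013\u2014\u2212]": "-"}, regex=True)

print(df["range"].tolist())  # ['1-4', '5-8']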

Related

np.save is converting floats to weird characters

I am attempting to append results to an ongoing csv file. Each result comes out as an ndarray:
[IN]: print(savearray)
[OUT]: [[ 0.55219001 0.39838119]]
Initially I tried
np.savetxt('flux_ratios.csv', savearray, delimiter=",")
But this overwrites the old data every time I save, so instead I am attempting to append the data like this:
f = open('flux_ratios.csv', 'ab')
np.save(f, 'a', savearray)
f.close()
This is (in a sense) appending, but it saves the numerical data as weird, unreadable characters.
I have no idea why or how this is happening so any help would be greatly appreciated!
First off, np.save writes a binary .npy format, whereas np.savetxt writes text. You are mixing binary data into a text file, which is why you get the odd characters when you try to read it.
You could just change np.save(f, 'a', savearray) to np.savetxt(f, savearray, delimiter=',').
Otherwise you could also consider using pandas.DataFrame.to_csv in append mode (mode='a').
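A minimal sketch of the first suggestion, reusing the array from the question:

import numpy as np

savearray = np.array([[0.55219001, 0.39838119]])

# Open in append mode ('ab') so each run adds rows instead of overwriting;
# np.savetxt writes plain text, so the file stays a readable CSV.
with open('flux_ratios.csv', 'ab') as f:
    np.savetxt(f, savearray, delimiter=',')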

Pandas read_csv method can't get 'œ' character properly while using encoding ISO 8859-15

I'm having some trouble reading a csv file with pandas which includes the special character 'œ'.
I've done some research and it appears that this character was added to the ISO 8859-15 encoding standard.
I've tried to specify this encoding to the pandas read_csv method, but it doesn't get this special character properly (I get a '☐' instead) in the resulting dataframe:
df = pd.read_csv(my_csv_path, sep=";", header=None, encoding="ISO-8859-15")
Does anyone know how I could get the right 'œ' character (or even better, the string 'oe') instead?
Thanks a lot :)
As a matter of fact, I've just tried writing back out the dataframe that I get from read_csv with ISO-8859-15 encoding (using the to_csv method and "ISO-8859-15" encoding), and the special 'œ' character appears properly in the resulting csv file:
df.to_csv(my_csv_full_path, sep=';', index=False, encoding="ISO-8859-15")
So it seems that pandas has properly read the special character from my csv file but can't display it within the dataframe...
Does anyone have a clue? I've managed to work around the problem by manually rewriting this special character before reading my csv with pandas, but that doesn't answer my question :(
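One thing worth checking is whether the data is actually wrong or merely rendered as '☐' by a console font that lacks the glyph. A minimal sketch, assuming the path and layout from the question:

import pandas as pd

df = pd.read_csv(my_csv_path, sep=";", header=None, encoding="ISO-8859-15")

# If only the display is at fault, the underlying code point is still
# correct: 'œ' is U+0153, so inspect a suspect cell's raw code points.
cell = df.iloc[0, 0]
print([hex(ord(c)) for c in str(cell)])

# Optionally transliterate 'œ' to 'oe' after reading, as asked.
df = df.replace({"œ": "oe"}, regex=True)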

Extract strings from shapefile attribute using GDAL and Python 3.X

I have a shapefile that consists of two fields/attributes, one being integers, the other being strings.
I can extract the integers into a Python array by first using the function gdal.RasterizeLayer() to burn the shapefile into a .tiff image as the first band. Then, I use my_raster.GetRasterBand(1).ReadAsArray() to read the integers as an array.
However, I would like to extract the string values from the other field/attribute. I'm doing the exact same thing, except that I changed the attribute name in the gdal.RasterizeLayer() call. But calling GetRasterBand(1).ReadAsArray() then only gives me zeros.
Does anyone know whether it is possible to read strings from rasters?
Btw: I'm using the exact same code as here.
Check out the "Pure Python version -- gdal.RasterizeLayer" example.
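For what it's worth, raster bands store numeric values, not text, so burning a string attribute with gdal.RasterizeLayer will not produce readable strings in the band. If the goal is just the string values, a minimal sketch reads them directly from the vector layer with OGR (the shapefile path and field name here are hypothetical):

from osgeo import ogr

ds = ogr.Open("my_shapefile.shp")
layer = ds.GetLayer(0)

# Iterate the features and read the string field directly,
# skipping the rasterization step entirely.
values = [feature.GetField("name_field") for feature in layer]
print(values)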

Pyspark dataframe.write.csv use pipe as separator cause strange characters in output file

I have a dataframe with two columns. Both are of string type.
When I tried to save the dataframe as csv with pipe as separator with the following code:
df.write.csv("/outputpath/", sep="|")
the output file contains strange characters.
If I instead use tab as separator sep="\t", everything looks good.
Just wonder if anyone has any idea what could go wrong here?
I'm using Spark 2.2.0 with Python 3.4.
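One possibility worth ruling out is quoting rather than encoding: if the string columns themselves contain '|' characters, rows written with sep="|" can come back scrambled. A minimal sketch, assuming that is the cause (quoteAll is a standard option of Spark's CSV writer):

# Quote every field so any '|' inside the string columns is treated
# as data rather than as the separator.
df.write.option("quoteAll", "true").csv("/outputpath/", sep="|")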

How to parse CSV files with double-quoted strings in Julia?

I want to read CSV files where the columns are separated by commas. The columns can be strings and if those strings contain a comma in their content, they are wrapped in double-quotes. Currently I'm loading my data using:
file = open("data.csv","r")
data = readcsv(file)
But this code would split the following string into 4 pieces whereas it should only be 3:
1,"text, more text",3,4
Is there a way in Julia's Standard Library to parse CSV while respecting quoting or do I have to write my own custom solution?
The readcsv function in base is super-basic (just blindly splitting on commas).
You will probably be happier with readtable from the DataFrames.jl package: http://juliastats.github.io/DataFrames.jl/io.html
To use the package, you just need to Pkg.add("DataFrames") and then import it with using DataFrames.
The readcsv function in base (0.3 prerelease) can now read quoted columns.
julia> readcsv(IOBuffer("1,\"text, more text\",3,4"))
1x4 Array{Any,2}:
1.0 "text, more text" 3.0 4.0
It is much simpler than DataFrames, but it may be quicker if you just need the data as an array.
