Pass encoding option in PySpark text method - apache-spark

I have a fixed-length file encoded in ISO-8859-1. Spark 2.4 is not honoring the encoding passed as an option.
Below is the sample code (the characters got corrupted):
g_df = g_spark.read.option("encoding", "ISO-8859-1").text(loc)
g_df.repartition(1).write.csv(path=loc, header="true", mode="overwrite", encoding="ISO-8859-1")
However, when I read it as a CSV file, the characters are stored as expected.
g_df = g_spark.read.option("encoding", "ISO-8859-1").csv(loc)
g_df.repartition(1).write.csv(path=loc, header="true", mode="overwrite", encoding="ISO-8859-1")
It looks like Spark does not support the encoding option for the text method.
As this is a fixed-length file, I cannot use the csv method.
Could you please suggest a way out?

The text method does not honor the encoding option, so I tried the RDD API, which works for me.
Below is the sample code:
from pyspark.sql import Row

encoded_rdd = sc.textFile(loc, use_unicode=False).map(lambda x: x.decode("iso-8859-1"))
encoded_df = encoded_rdd.map(Row("val")).toDF()
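Since the file is fixed-width, the decoded lines still need to be split into columns. A minimal sketch, assuming hypothetical field positions (characters 1-10 for an id, 11-40 for a name):
from pyspark.sql.functions import col

# Hypothetical fixed-width layout; adjust the offsets to the real record spec.
parsed_df = encoded_df.select(
    col("val").substr(1, 10).alias("id"),
    col("val").substr(11, 30).alias("name"),
)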

Related

Python Tabula Library - Output File Is Empty

I am using the Tabula module in Python, and I am trying to output text from a PDF. I am using this code:
import tabula

pdf_read = tabula.read_pdf(
    input_path="Test File.pdf",
    pages=start_page_number,
    guess=False,
    area=(81.735, 18.55, 391.285, 273.61),
    relative_area=False,
    format="TSV",
    output_path="testing_area.tsv"
)
When I go to run my code, it says "The output file is empty."
Any idea why this could be?
Edit: If I remove everything except the input_path and pages, my data is read into pdf_read correctly; it just does not output to an external file.
Something is wrong with this option... hmm...
Edit #2: I figured out why the area part was not working and now it is, but I still can't get this to output a file for some reason.
Edit #3: I tried looking at this: How to convert PDF to CSV with tabula-py?
But I keep getting an error message: "build_options() got an unexpected keyword argument 'spreadsheet'".
Edit #4: I'm using the latest version of tabula-py, which doesn't have the spreadsheet option.
Still can't output a file with data, though.
I don't know why the direct output wasn't working above, but the output of pdf_read is a list.
I converted the list into a dataframe and then output the dataframe using to_csv.
Code is below:
import pandas as pd

df = pd.DataFrame(pdf_read, columns=["column_a"])
output_df = df.to_csv(
    "alternative_attempt_1.txt",
    header=True,
    index=True,
    sep='\t',
    mode='w',
    encoding="cp1252"
)
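If writing the file directly is still the goal, tabula-py's convert_into renders straight to an output file without the DataFrame step; a sketch using the paths and options from the question:
import tabula

# convert_into writes the extraction result directly to the output path.
tabula.convert_into(
    "Test File.pdf",
    "testing_area.tsv",
    output_format="tsv",
    pages=start_page_number,
    guess=False,
    area=(81.735, 18.55, 391.285, 273.61),
)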

How to read Greek characters in pandas?

I am dealing with a dataframe that has Greek characters, which appear garbled.
The data are here:
toy.to_json()
'{"a_a":{"0":49.0,"1":50.0,"2":52.0,"3":53.0,"4":54.0},"grade":{"0":3.0,"1":5.0,"2":4.0,"3":5.0,"4":4.0},"sex":{"0":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9","1":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9","2":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9","3":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9","4":"\\u00c1\\u00e3\\u00fc\\u00f1\\u00e9"},"age":{"0":122.0,"1":125.0,"2":119.0,"3":122.0,"4":127.0},"fath_job":{"0":2.0,"1":2.0,"2":2.0,"3":2.0,"4":2.0},"phscs":{"0":49.0,"1":73.0,"2":61.0,"3":75.0,"4":59.0},"pcc":{"0":10.0,"1":26.0,"2":19.0,"3":28.0,"4":23.0},"pcg":{"0":21.0,"1":28.0,"2":20.0,"3":25.0,"4":19.0},"tasc":{"0":17.0,"1":5.0,"2":17.0,"3":8.0,"4":11.0},"class":{"0":0.0,"1":0.0,"2":0.0,"3":0.0,"4":0.0},"grade3":{"0":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00ef\\u00f2","1":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00fc\\u00f2","2":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00fc\\u00f2","3":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00fc\\u00f2","4":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00fc\\u00f2"},"pcc3":{"0":"\\u00f7\\u00e1\\u00ec\\u00e7\\u00eb\\u00de","1":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","2":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1","3":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","4":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1"},"tasc3":{"0":3.0,"1":1.0,"2":3.0,"3":2.0,"4":2.0},"pcg3":{"0":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1","1":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","2":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1","3":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","4":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1"},"phscs3":{"0":"\\u00f7\\u00e1\\u00ec\\u00e7\\u00eb\\u00de","1":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","2":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1","3":"\\u00f5\\u00f8\\u00e7\\u00eb\\u00de","4":"\\u00ec\\u00dd\\u00f4\\u00f1\\u00e9\\u00e1"}}'
I tried to import the file with encoding = 'utf_8' but it did not work.
Here are some other approaches I tried:
toy.to_csv('toy.csv', index=False)

import chardet
rawdata = open('toy.csv', 'rb').read()
result = chardet.detect(rawdata)
charenc = result['encoding']
pd.read_csv('toy.csv', encoding=charenc)
pd.read_csv('toy.csv', encoding='cp737')
Try something like this:
pd.read_csv('toy.csv', encoding='iso8859_7')
It solved the problem for me; I hope it's the same for you.
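For context, a minimal sketch of why iso8859_7 is the right guess: the escapes in the JSON above are ISO-8859-7 (Greek) bytes that were misdecoded as Latin-1, so an already-garbled string can also be repaired in memory:
# Undo the wrong Latin-1 decoding, then apply the Greek one.
s = "\u00c1\u00e3\u00fc\u00f1\u00e9"
print(s.encode("latin-1").decode("iso-8859-7"))  # -> Αγόρι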
On Mac (macOS Catalina 10.15.7) this works:
gdf = gpd.read_file("./../data/processed/landmarks.shp", encoding="mac_greek")
Apparently, you have to find the right encoding for your system. A list of Python encodings can be found in the documentation. However, it is still not perfect: I still have some random characters. I do not know if that is due to the data source or because the list refers to Python 2.
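If you are unsure which Greek encoding applies, a small sketch for eyeballing candidates (the file name and the candidate list are assumptions):
# Try a few Greek-capable codecs and inspect one data line from each.
candidates = ["iso8859_7", "cp737", "cp1253", "mac_greek"]
raw = open("toy.csv", "rb").read()
for enc in candidates:
    try:
        print(enc, "->", raw.decode(enc).splitlines()[1])
    except UnicodeDecodeError as err:
        print(enc, "failed:", err)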

Error in reading an ASCII-encoded csv file?

I have a csv file named Qid-NamedEntityMapping.csv having data like this:
Q1000070 b'Myron V. George'
Q1000296 b'Fred (footballer, born 1979)'
Q1000799 b'Herbert Greenfield'
Q1000841 b'Stephen A. Northway'
Q1001203 b'Buddy Greco'
Q100122 b'Kurt Kreuger'
Q1001240 b'Buddy Lester'
Q1001867 b'Fyodor Stravinsky'
The second column is 'ascii' encoded, but even when I read the file using the following code, it is not being read properly:
import chardet
import pandas as pd

def find_encoding(fname):
    r_file = open(fname, 'rb').read()
    result = chardet.detect(r_file)
    charenc = result['encoding']
    return charenc

my_encoding = find_encoding('datasets/KGfacts/Qid-NamedEntityMapping.csv')
df = pd.read_csv('datasets/KGfacts/Qid-NamedEntityMapping.csv',
                 error_bad_lines=False, encoding=my_encoding)
But the output is still not read properly.
Also, I tried to use encoding='UTF-8', but the output is still the same.
What can be done to read it properly?
Looks like you have an improperly saved TSV file. Once you circumvent the TAB problem (as suggested in my comment), you can convert the column with names to a more suitable representation.
Let's assume that the second column of the dataframe is called "names". The b'XXX' thing is probably a bytes [mis]representation of a string. Convert it to a bytes object with ast.literal_eval and then decode to a string:
import ast
df["names"].apply(ast.literal_eval).apply(bytes.decode)
#0 Myron...
#1 Fred...
Last but not least, your problem has almost nothing to do with encodings or charsets.
Your issue looks like the CSV is actually tab-separated, so you need sep='\t' in the read_csv call. Otherwise everything is read as a single column, except "born 1979" in the first row, as that is the only cell with a comma in it.
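Putting both answers together, a sketch of the full fix (the column names are assumptions):
import ast
import pandas as pd

# Read as tab-separated, then turn the b'...' text back into real strings.
df = pd.read_csv('Qid-NamedEntityMapping.csv', sep='\t', names=['qid', 'names'])
df['names'] = df['names'].apply(ast.literal_eval).apply(bytes.decode)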

csv to pandas.DataFrame while keeping the data's original encoding

I have a csv file with some utf8 unicode characters in it, which I want to load into a pandas.DataFrame while keeping the unicode characters as is, not escaping them.
Input .csv:
letter,unicode_primary,unicode_alternatives
8,\u0668,"\u0668,\u06F8"
Code:
df = pd.DataFrame.from_csv("file.csv")
print(df.loc[0].unicode_primary)
Result:
> \\u0668
Desired Result:
> \u0668
or
> 8
Please use read_csv instead of from_csv, as follows:
df = pd.DataFrame(pd.read_csv("file.csv", encoding='utf_8'))
print(df.loc[0].unicode_primary)
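Note that if the file stores literal \uXXXX escape sequences, as in the sample input, read_csv returns them as plain six-character strings. A sketch for decoding them into real characters with the unicode_escape codec (the column name is taken from the sample):
import pandas as pd

df = pd.read_csv("file.csv", encoding="utf_8")
# Turn the literal text "\u0668" into the character it names (the digit ٨).
df["unicode_primary"] = (
    df["unicode_primary"].str.encode("ascii").str.decode("unicode_escape")
)
print(df.loc[0].unicode_primary)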

Custom filetype in Python 3

How do I start creating my own filetype in Python? I have a design in mind, but how do I pack my data into a file with a specific format?
For example, I would like my file format to be a mix of an archive (like zip, apk, jar, etc., which are basically all archives) with some room for packed files, plus a section of the file containing settings and serialized data that will not be accessed by an archive-manager application.
My requirement is to do all this with the default modules of CPython, without external modules.
I know that this can take a while to explain and do, but I can't see how to start this in Python 3.x with CPython.
Try this:
from zipfile import ZipFile
import json

data = json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
with ZipFile('foo.filetype', 'w') as myzip:
    myzip.writestr('digest.json', data)
The file is now a zip archive containing a JSON file (that's easy to read back in many languages) for your data. You can add files to the archive with myzip.write or writestr. You can read the data back with:
with ZipFile('foo.filetype', 'r') as myzip:
    json_data_read = myzip.read('digest.json')
    newdata = json.loads(json_data_read)
Edit: you can append arbitrary data to the file with:
f = open('foo.filetype', 'a')
f.write(data)
f.close()
This works for WinRAR, but Python can no longer process the zipfile.
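A safer sketch for the hidden-settings part of the question: the zip format has a built-in archive comment field, which can carry serialized data without adding a visible member (assuming the settings fit within the comment's 64 KB limit):
from zipfile import ZipFile
import json

# Store settings in the archive comment; it is written when the file closes.
settings = json.dumps({"version": 1, "theme": "dark"}).encode()
with ZipFile('foo.filetype', 'a') as myzip:
    myzip.comment = settings

# Read the settings back; the file remains a valid zip throughout.
with ZipFile('foo.filetype', 'r') as myzip:
    print(json.loads(myzip.comment.decode()))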
Use this:
import base64
import gzip
import ast

def save(data):
    # repr([data]) keeps quotes around strings, so ast.literal_eval can parse them back
    data = repr([data]).encode()
    data = base64.b64encode(data)
    return gzip.compress(data)

def load(data):
    data = gzip.decompress(data)
    data = base64.b64decode(data)
    return ast.literal_eval(data.decode())[0]
How to use this with a file:
open(filename, "wb").write(save(data)) # save data
data = load(open(filename, "rb").read()) # load data
This might look like something that could be opened with an archive program, but it cannot, because it is base64-encoded and gzip-compressed, so an archive manager would have to decode it first.
Also, you can store any Python literal in it (strings, bytes, numbers, lists, dicts, and more)!
example:
open(filename, "wb").write(save({"foo": "bar"})) # dict
open(filename, "wb").write(save("foo bar")) # string
open(filename, "wb").write(save(b"foo bar")) # bytes
# there's more you can store!
This may not be appropriate for your question, but I think it may help you.
I faced a similar problem... but ended up creating a zip file and then renaming the zip extension to my custom file format... But it can still be opened with WinRAR.
