Read data with multi delimiter in pyspark - apache-spark

I have an input file which looks like this and uses "|" (quote-pipe-quote) as a multi-character delimiter:
162300111000000000106779"|"2005-11-16 14:12:32.860000000"|"1660320"|"0"|"2005-11-16 14:12:32.877000000"|""|""|""|""|""|""|""|"False"|"120600111000000000106776
I can read this type of record with a UDF as below:
import re
from pyspark.sql.functions import UserDefinedFunction

inputDf = glueContext.sparkSession.read.option("delimiter", input_file_delimiter) \
    .csv("s3://" + landing_bucket_name + "/" + input_file_name)
# Strip the stray double quotes from every column.
udf = UserDefinedFunction(lambda x: re.sub('"', '', str(x)))
new_df = inputDf.select(*[udf(column).alias(column) for column in inputDf.columns])
but when I get an input file like this:
000/00"|"AE71501"|"Complaint for Attachment of Earnings Order"|"In accordance with section test of the Attachment of Test Act Test."|"Non-Test"|"Other non-test offences"|"N"|"Other Non-Test"|"Non-Test
I am getting the exception below while reading it with the same UDF; my code fails at the exact location of the UDF:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 66: ordinal not in range(128)
Any help on the following would be great:
- Optimized code to read both types of files, considering "|" as the separator.
- How my existing UDF can handle the second type of input records.

This is likely caused by running Python 2.x, which has two separate types for string-like objects (unicode strings and non-unicode strings, which are nowadays simply byte sequences).
Spark will read in your data (which are bytes, as there is no such thing as plain text), and decode the lines as a sequence of Unicode strings. When you call str on a Unicode string that has a codepoint that is not in the ASCII range of codepoints, Python 2 will produce an error:
# python2.7
>>> unicode_str = u"ú"
>>> type(unicode_str)
<type 'unicode'>
>>> str(unicode_str)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 0: ordinal not in range(128)
The recommended path is that you work with Unicode strings (which is the default string object in Python 3) all throughout your program, except at the point where you either read/receive data (where you should provide a suitable encoding scheme, so that you can decode the raw bytes) and at the point where you write/send data (again, where you use an encoding to encode the data as a series of bytes). This is called “the Unicode sandwich”.
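A minimal sketch of that sandwich in Python 2 (the file names are hypothetical):
# -*- coding: utf-8 -*-
import io

# Decode at the edge: io.open with an encoding yields unicode objects in Python 2.
with io.open("input.txt", encoding="utf-8") as f:
    text = f.read()

# Work on unicode only in the middle of the program.
cleaned = text.replace(u'"', u'')

# Encode at the edge: the file object encodes the unicode text back to bytes.
with io.open("output.txt", "w", encoding="utf-8") as f:
    f.write(cleaned)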
Many libraries, including Spark, already decode bytes and encode unicode strings for you. If you simply remove the call to str in your user defined function, your code will likely work:
#pyspark shell using Python 2.7
>>> spark.sparkContext.setLogLevel("OFF") # to hide the big Py4J traceback that is dumped to the console, without modifying the log4j.properties file
>>> from py4j.protocol import Py4JJavaError
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> df = spark.read.csv("your_file.csv", sep="|")
>>> def strip_double_quotes_after_str_conversion(s):
...     import re
...     return re.sub('"', '', str(s))
...
>>> def strip_double_quotes_without_str_conversion(s):
...     import re
...     return re.sub('"', '', s)
...
>>> df.select(*[udf(strip_double_quotes_without_str_conversion, StringType())(column).alias(column) for column in df.columns]).show()
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
| _c0| _c1| _c2| _c3| _c4| _c5|_c6| _c7| _c8|
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
|037/02|TH68150|Aggravated vehicl...|Contrary to secti...|Theft of motor ve...|Vehicle offences| Y|Aggravated Vehicl...|37.2|
+------+-------+--------------------+--------------------+--------------------+----------------+---+--------------------+----+
>>> try:
...     df.select(*[udf(strip_double_quotes_after_str_conversion, StringType())(column).alias(column) for column in df.columns]).show()
... except Py4JJavaError as e:
...     print("That failed. Root cause: %s" % e.java_exception.getCause().getMessage().rsplit("\n", 2)[-2])
...
That failed. Root cause: UnicodeEncodeError: 'ascii' codec can't encode character u'\xfa' in position 78: ordinal not in range(128)
So, the solution to the experienced problem is simple: don’t use str in your UDF.
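Applied to the original Glue snippet, that amounts to dropping the str() call; a sketch (guarding against NULL columns, which arrive in the UDF as None, is an extra assumption):
from pyspark.sql.functions import UserDefinedFunction
import re

# Same UDF minus str(); None is passed through so NULL columns don't crash re.sub.
udf = UserDefinedFunction(lambda x: re.sub('"', '', x) if x is not None else None)
new_df = inputDf.select(*[udf(column).alias(column) for column in inputDf.columns])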
Note that Python 2.x will no longer be maintained as of January 1st 2020. You’d do well transitioning to Python 3.x before that. In fact, had you executed this in a Python 3 interpreter, you would not have experienced the issue at all.
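For comparison, the very first example from above, run under Python 3, where str is already the Unicode type:
# python3
>>> unicode_str = "ú"
>>> str(unicode_str)
'ú'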

Related

How to get python to tolerate UTF-8 encoding errors

I have a set of UTF-8 texts I have scraped from web pages. I am trying to extract keywords from these files like so:
import os
import json
from rake_nltk import Rake

rake_nltk_var = Rake()
directory = 'files'
results = {}
for filename in os.scandir(directory):
    if filename.is_file():
        with open("files/" + filename.name, encoding="utf-8", mode='r') as infile:
            text = infile.read()
        rake_nltk_var.extract_keywords_from_text(text)
        keyword_extracted = rake_nltk_var.get_ranked_phrases()
        results[filename.name] = keyword_extracted
with open("extracted-keywords.json", "w") as outfile:
    json.dump(results, outfile)
One of the files I've managed to process so far is throwing the following error on read:
Traceback (most recent call last):
File "extract-keywords.py", line 11, in <module>
text = infile.read()
File "c:\python36\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 66: invalid start byte
0x92 is a right single quotation mark, but the 66th character of the file is a "u", so I don't know where this error is coming from. Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same? I have a lot of files and can't afford to stop and debug every encoding error they might contain.
I have a set of UTF-8 texts I have scraped from web pages
If they can't be read with the script you've shown, then these are not actually UTF-8 encoded files.
We'd have to know about the code which wrote the files in the first place to tell the correct way to decode. However, the ’ character is byte 0x92 in code page 1252, so try using that encoding instead, i.e.:
with open("files/" + filename.name, encoding="cp1252") as infile:
text = infile.read()
Ignoring decoding errors corrupts the data, so it's best to use the correct decoder when possible; try that first! However, about this part of the question:
Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same?
Yes, you can specify errors="replace":
>>> with open("/tmp/f.txt", "w", encoding="cp1252") as f:
... f.write('this is a right quote: \N{RIGHT SINGLE QUOTATION MARK}')
...
>>> with open("/tmp/f.txt", encoding="cp1252") as f:
... print(f.read()) # using correct encoding
...
this is a right quote: ’
>>> with open("/tmp/f.txt", encoding="utf-8", errors="replace") as f:
... print(f.read()) # using incorrect encoding and replacing errors
this is a right quote: �
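Putting both pieces of advice together (try the likely encodings first, fall back to replacement only as a last resort), a small helper along these lines is one option; the encoding list is an assumption you should adapt to your data:
def read_text(path):
    # Try the plausible encodings in order of likelihood.
    for enc in ("utf-8", "cp1252"):
        try:
            with open(path, encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    # Last resort: replace undecodable bytes with U+FFFD, losing information.
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()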

Why are some emojis not converted back into their representation?

I am working on an emoji detection module. For some emojis I am observing weird behavior: after converting them to UTF-8 encoding, they are not converted back to their original representation form. I need their exact colored representation to be sent as the API response instead of sending a Unicode-escaped string. Any leads?
In [1]: x = "example1: 🤭 and example2: 😁 and example3: 🥺"
In [2]: x.encode('utf8')
Out[2]: b'example1: \xf0\x9f\xa4\xad and example2: \xf0\x9f\x98\x81 and example3: \xf0\x9f\xa5\xba'
In [3]: x.encode('utf8').decode('utf8')
Out[3]: 'example1: \U0001f92d and example2: 😁 and example3: \U0001f97a'
In [4]: print( x.encode('utf8').decode('utf8') )
example1: 🤭 and example2: 😁 and example3: 🥺
Update 1:
This example should make it clearer: here, two of the emojis are rendered when I send the Unicode-escaped string, but the third fails to convert to the exact emoji. What can I do in such a case?
'\U0001f92d' == '🤭' is True. It is an escape code but is still the same character... two ways of display/entry. The former is the repr() of the string; printing calls str(). Example:
>>> s = '🤭'
>>> print(repr(s))
'\U0001f92d'
>>> print(str(s))
🤭
>>> s
'\U0001f92d'
>>> print(s)
🤭
When Python generates the repr() it uses an escape code representation if it thinks the display can't handle the character. The content of the string is still the same...the Unicode code point.
It's a debug feature. For example, is the white space spaces or tabs? The repr() of the string makes it clear by using \t as an escape code.
>>> s = 'a\tb'
>>> print(s)
a b
>>> s
'a\tb'
As to why an escape code is used for one emoji and not another, it depends on the version of Unicode supported by the version of Python used.
Python 3.6 uses Unicode 9.0, and one of your emoji isn't defined at that version level:
>>> import unicodedata as ud
>>> ud.unidata_version
'9.0.0'
>>> ud.name('😁')
'GRINNING FACE WITH SMILING EYES'
>>> ud.name('🤭')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: no such name
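As for sending the actual emoji in an API response: make sure your serializer does not escape non-ASCII characters. With the standard library's json module, for example, ensure_ascii=False keeps the raw characters (a sketch; the payload shape is hypothetical):
import json

x = "example1: 🤭 and example2: 😁 and example3: 🥺"
print(json.dumps({"text": x}))                      # default: \ud83e\udd2d escapes
print(json.dumps({"text": x}, ensure_ascii=False))  # keeps the emoji as-is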

read tweets extracted with python

I am trying to read tweets in Excel. The tweets have been retrieved with Python (and tweepy), then saved in a CSV file:
# -*- coding: utf-8 -*-
import csv
import tweepy

writer = csv.writer(open(r"C:\path\twitter_"+date+".csv", "w"),
                    lineterminator='\n', delimiter=';')
writer.writerow(["username", "nb_followers", "tweet_text"])

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

for tweet in tweepy.Cursor(api.search, q="dengue+OR+%23dengue", lang="en",
                           since=date, until=end_date).items():
    username = tweet.user.screen_name
    nb_followers = tweet.user.followers_count
    tweet_text = tweet.text.encode('utf-8')
    writer.writerow([username, nb_followers, tweet_text])
Due to the utf-8 encoding, I have problems reading them in a text editor or excel.
For example, one tweet gives this in Excel:
b"\xe2\x80\x9c#ThislsWow: I want to do this \xf0\x9f\x98\x8d http://t.co/rGfv9e70Tj\xe2\x80\x9d pu\xc3\xb1eta you're going to get bitten by the mosquito and get dengue"
How can I get the original characters back? And how can I remove the b at the beginning, which is only useful inside a Python program?
EDIT :
As per Alastair McCormack's comment:
I removed the encoding of my field and added it in the writer:
writer= csv.writer(open(r"C:\path\twitter_"+date+".csv", "w", encoding="UTF-8"), lineterminator='\n', delimiter =';')
tweet_text=tweet.text.replace("\n", "").replace("\r", "")
Now I have the following error:
tweet: Traceback (most recent call last):
File "twitter_influence.py", line 88, in <module>
print("tweet:", tweet_text)
File "C:\Users\rlalande\Envs\tweepy\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 137: character maps to <undefined>
EDIT2 :
I am now using the following:
import sys
import codecs

sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
(seen in this post: https://stackoverflow.com/a/4374457/1875861)
There is no more error but it doesn't output the correct characters.
For example, the same tweet now gives this output in Excel:
Malay Mail Online Alarming rise in dengue casesMalay Mail Online“The ministry started a campaign for construction… http://t.co/MuLFlMwkY0
Before, with direct encoding of the field, I had:
b'Malay Mail Online\n\nAlarming rise in dengue casesMalay Mail Online\xe2\x80\x9cThe ministry started a campaign for construction\xe2\x80\xa6 http://t.co/MuLFlMwkY0'
The result is different but not really better... Why is the ellipsis character not output correctly? In one case it outputs … and in the other case \xe2\x80\xa6.
It's because the CSV writer expects all input to be Unicode strings. You're getting the __repr__() of a byte string.
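A quick illustration of the effect (the value is hypothetical):
>>> tweet_text = "puñeta".encode("utf-8")  # what the original code stores
>>> str(tweet_text)                        # what csv.writer then receives
"b'pu\\xc3\\xb1eta'"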
Set the encoding of your output file by replacing the first line with:
writer= csv.writer(open(r"C:\path\twitter_"+date+".csv", "w", encoding="UTF-8"), lineterminator='\n', delimiter =';')
This means that any Unicode strings written to the file will be translated automagically. Then remove the explicit encode():
tweet_text=tweet.text
Edit:
Excel needs to be coerced into reading UTF-8 files if you don't use the import function. The easiest way to do this is to add a UTF-8 BOM signature to the start of the file.
Python provides a shortcut if you use the utf_8_sig encoding. E.g.
writer= csv.writer(open(r"C:\path\twitter_"+date+".csv", "w", encoding="utf_8_sig"), lineterminator='\n', delimiter =';')
You can also check your file in a decent UTF-8 editor like Notepad++ or Atom.

How do I convert a Python 3 byte-string variable into a regular string? [duplicate]

I have read in an XML email attachment with
bytes_string=part.get_payload(decode=False)
The payload comes in as a byte string, as my variable name suggests.
I am trying to use the recommended Python 3 approach to turn this string into a usable string that I can manipulate.
The example shows:
str(b'abc','utf-8')
How can I apply the b (bytes) keyword argument to my variable bytes_string and use the recommended approach?
The way I tried doesn't work:
str(bbytes_string, 'utf-8')
You had it nearly right in the last line. You want
str(bytes_string, 'utf-8')
because the type of bytes_string is bytes, the same as the type of b'abc'.
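A quick check in the interpreter, with a hypothetical payload:
>>> bytes_string = b'<?xml version="1.0"?>'
>>> str(bytes_string, 'utf-8')
'<?xml version="1.0"?>'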
Call decode() on a bytes instance to get the text which it encodes:
text = bytes_string.decode()
How to filter (skip) non-UTF8 characters from an array?
To address this comment in #uname01's post and the OP's question, ignore the errors:
Code
>>> b'\x80abc'.decode("utf-8", errors="ignore")
'abc'
Details
From the docs, here are more examples using the same errors parameter:
>>> b'\x80abc'.decode("utf-8", "replace")
'\ufffdabc'
>>> b'\x80abc'.decode("utf-8", "backslashreplace")
'\\x80abc'
>>> b'\x80abc'.decode("utf-8", "strict")
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
invalid start byte
The errors argument specifies the response when the input string can’t be converted according to the encoding’s rules. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (use U+FFFD, REPLACEMENT CHARACTER), or 'ignore' (just leave the character out of the Unicode result).
UPDATED:
To avoid having the b prefix and the quotes at the start and end:
How to convert bytes to strings as displayed, even in weird situations.
Since your data may contain characters the 'utf-8' codec cannot decode,
it can be simpler to use plain str() without any additional parameters:
some_bad_bytes = b'\x02-\xdfI#)'
text = str(some_bad_bytes)[2:-1]  # strip the leading "b'" and the trailing "'"
print(text)
Output: \x02-\xdfI#)
If you pass the 'utf-8' parameter for these specific bytes, you will get an error. And since Python 3 strings are Unicode by standard, the resulting text can be handled without further concern.
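If the goal of the [2:-1] slicing is just to keep the undecodable bytes visible without the b'' wrapper, decoding with errors='backslashreplace' is a less fragile alternative, with the caveat that only the invalid bytes get escaped:
some_bad_bytes = b'\x02-\xdfI#)'
# Only bytes that are invalid UTF-8 become \xNN escapes; valid bytes
# (including the \x02 control character) decode to real characters.
text = some_bad_bytes.decode('utf-8', errors='backslashreplace')
print(repr(text))  # '\x02-\\xdfI#)'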

Python 3.2 TypeError - can't figure out what it means

I originally ran this code under Python 2.7 but needed to move to Python 3.x because of work. I've been trying to figure out how to get this code to work in Python 3.2, with no luck.
import subprocess

cmd = subprocess.Popen('net use', shell=True, stdout=subprocess.PIPE)
for line in cmd.stdout:
    if 'no' in line:
        print(line)
I get this error
if 'no' in (line):
TypeError: Type str doesn't support the buffer API
Can anyone provide me with an answer as to why this is and/or some documentation to read?
Much appreciated.
Python 3 uses the bytes type in a lot of places where the encoding is not clearly defined. The stdout of your subprocess is a file object working with bytes data. So, you cannot check if there is some string within a bytes object, e.g.:
>>> 'no' in b'some bytes string'
Traceback (most recent call last):
File "<pyshell#13>", line 1, in <module>
'no' in b'some bytes string'
TypeError: Type str doesn't support the buffer API
What you need to do instead is test whether the bytes string contains another bytes string:
>>> b'no' in b'some bytes string'
False
So, back to your problem, this should work:
if b'no' in line:
    print(line)
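Alternatively, you can ask subprocess to decode the stream for you, so the loop deals in str from the start; passing universal_newlines=True does that:
import subprocess

# With universal_newlines=True, cmd.stdout yields str objects, so plain
# string literals work in the membership test.
cmd = subprocess.Popen('net use', shell=True, stdout=subprocess.PIPE,
                       universal_newlines=True)
for line in cmd.stdout:
    if 'no' in line:
        print(line)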
