Spark does not recognize newlines, &amp;, etc. in strings - apache-spark

I'm trying to process text data (Twitter tweets) with PySpark. Emojis and special characters are being read correctly, but "\n" and "&amp;" appear to remain escaped; Spark does not recognize them. Probably other sequences are affected too. One example tweet in my Spark DataFrame looks like this:
"Hello everyone\n\nHow is it going? 😉 Take care & enjoy"
I would like Spark to read them correctly. The files are stored as parquet and I'm reading them like this:
tweets = spark.read.format('parquet')\
    .option('header', 'True')\
    .option('encoding', 'utf-8')\
    .load(path)
Below are some sample input data, which I took from the original JSONL files (I stored the data as parquet later).
"full_text": "RT #OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026"
"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED"
Reading directly from the JSONL files results in the same recognition problems.
tweets = spark.read\
    .option('encoding', 'utf-8')\
    .json(path)
How can Spark recognize them correctly? Thank you in advance.

The code below might help solve your problem.
Input taken:
"Hello everyone\n\nHow is it going? 😉 Take care & enjoy"
"full_text": "RT #OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026 &"
"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED #theNCI #NCIprevention #AmericanCancer #cancereu #uicc #IARCWHO #EuropeanCancer #KanserSavascisi #AUTF_DEKANLIK #OncoAlert"
code to solve the problem:
from pyspark.sql.functions import regexp_replace

df = spark.read.csv("file:///home/sathya/Desktop/stackoverflo/raw-data/input.tweet")
df1 = df.withColumn("cleandata", regexp_replace('_c0', '&amp;|\\\\n', ''))
df1.select("cleandata").show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cleandata |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Hello everyoneHow is it going? 😉 Take care & enjoy |
|"full_text": "RT #OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026 &" |
|"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \u2b55\ufe0fThat means they can be PREVENTED #theNCI #NCIprevention #AmericanCancer #cancereu #uicc #IARCWHO #EuropeanCancer #KanserSavascisi #AUTF_DEKANLIK #OncoAlert"|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
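If you would rather convert the sequences than delete them, a plain-Python sketch of the idea (the clean_tweet helper is hypothetical and could be registered as a Spark UDF); it assumes the tweets contain a literal backslash-n and HTML entities such as "&amp;":

```python
import html

def clean_tweet(text):
    # The two characters "\" + "n" are literal data, not a real newline:
    # turn them into an actual newline character.
    text = text.replace("\\n", "\n")
    # "&amp;" and other HTML entities come from Twitter's HTML escaping.
    return html.unescape(text)

print(clean_tweet("Hello everyone\\n\\nHow is it going? Take care &amp; enjoy"))
```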

Related

Training / using OpenAI GPT-3 for translations

I'm trying to use OpenAI to translate my product descriptions from one language to several others (EN, DE, CZ, SK, HU, PL, SI...). The translations, especially into the SK/CZ/HU/PL languages, are quite bad, mainly grammatically (using the text-davinci-003 model). I've got an idea - I already have a few thousand similar products fully translated into all of these languages by professional translators. Is it possible to use those existing correct translations to train GPT-3 and then use that model to translate new texts? Has anybody already tried something similar?
Have you tried using the Edits endpoint? You can fix grammar and spelling with it.
If you run test.py the OpenAI API will return the following completion:
What day of the week is it?
test.py
import openai

openai.api_key = 'sk-xxxxxxxxxxxxxxxxxxxx'

response = openai.Edit.create(
    model = 'text-davinci-edit-001',
    input = 'What day of the wek is it?',
    instruction = 'Fix the spelling mistakes'
)
content = response['choices'][0]['text']
print(content)
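As for training on your existing translations: legacy GPT-3 fine-tuning expected a JSONL file of prompt/completion pairs, which you could build from the professionally translated products. A hedged sketch only; the product strings, file name, prompt template, and "END" stop sequence are made up for illustration:

```python
import json

# Hypothetical sample of existing professional translations (EN -> SK)
pairs = [
    ("Red cotton T-shirt", "Červené bavlnené tričko"),
    ("Leather wallet with zip", "Kožená peňaženka so zipsom"),
]

def to_finetune_records(pairs, target_lang="SK"):
    """Build legacy-style fine-tuning records: one prompt/completion pair each."""
    records = []
    for source, target in pairs:
        records.append({
            # "###" acts as a separator so the model knows where the prompt ends
            "prompt": f"Translate to {target_lang}: {source}\n\n###\n\n",
            # completions conventionally start with a space and end with a stop sequence
            "completion": f" {target} END",
        })
    return records

with open("translations.jsonl", "w", encoding="utf-8") as f:
    for record in to_finetune_records(pairs):
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```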

Cleaning up a column based on spelling? Pandas

I've got two very important, user-entered information columns in my data frame. They are mostly cleaned up except for one issue: the spelling and the way names are written differ. For example, I have several entries for one name: "red rocks canyon", "redrcks", "redrock canyon", "red rocks canyons". This data set is too large for me to go through and clean manually (2 million entries). Are there any strategies to clean these features up with code?
I would look into doing phonetic string matching here. The basic idea behind this approach is to obtain a phonetic encoding for each entered string, and then group spelling variations by their encoding. Then, you could choose the most frequent variation in each group to be the "correct" spelling.
There are several different variations on phonetic encoding, and a great package in Python for trying some of them out is jellyfish. Here is an example of how to use it with the Soundex encoding:
import jellyfish
import pandas as pd

data = pd.DataFrame({
    "name": [
        "red rocks canyon",
        "redrcks",
        "redrock canyon",
        "red rocks canyons",
        "bosque",
        "bosque escoces",
        "bosque escocs",
        "borland",
        "borlange"
    ]
})

data["soundex"] = data.name.apply(lambda x: jellyfish.soundex(x))
print(data.groupby("soundex").agg({"name": lambda x: ", ".join(x)}))
This prints:
name
soundex
B200 bosque
B222 bosque escoces, bosque escocs
B645 borland, borlange
R362 red rocks canyon, redrcks, redrock canyon, red...
This definitely won't be perfect and you'll have to be careful as it might group things too aggressively, but I hope it gives you something to try!
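To make the grouping concrete without any third-party packages, here is a simplified Soundex-like encoder in plain Python (it skips full Soundex's special H/W rule, so treat it as a sketch rather than a drop-in replacement for jellyfish); taking the first member of each group then gives a candidate canonical spelling:

```python
from collections import defaultdict

# Letter -> digit table used by Soundex; vowels, H, W and Y get no digit
DIGITS = {}
for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")]:
    for ch in letters:
        DIGITS[ch] = digit

def soundex(word):
    """Simplified Soundex: first letter + up to three digits, zero-padded."""
    word = "".join(ch for ch in word.upper() if ch.isalpha())
    if not word:
        return ""
    codes = [DIGITS.get(ch, "") for ch in word]
    out, prev = [], codes[0]
    for d in codes[1:]:
        if d and d != prev:   # skip adjacent duplicate digits; vowels reset prev
            out.append(d)
        prev = d
    return (word[0] + "".join(out) + "000")[:4]

names = ["red rocks canyon", "redrcks", "redrock canyon", "red rocks canyons",
         "bosque", "bosque escoces", "bosque escocs", "borland", "borlange"]

groups = defaultdict(list)
for name in names:
    groups[soundex(name)].append(name)
# Use the first (or most frequent) member of each group as the canonical form
canonical = {code: members[0] for code, members in groups.items()}
print(dict(groups))
```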

Improve speed of wikipedia.search with Python

I have a dictionary of famous people's names sorted by their initials. I want to convert these names into their respective Wikipedia title page names. These are the same for the first three given in this example, but Alexander Bell gets correctly converted to Alexander Graham Bell after running this code.
The algorithm works, although it took about an hour to do all the 'AA' names, and I am hoping for it to run all the way up to 'ZZ'.
Is there any optimisation I can do on this? For example I saw something about batch requests but am not sure if it applies to my algorithm.
Or is there a more efficient method that I could use to get this same information?
Thanks.
import wikipedia

PeopleDictionary = {'AA': ['Amy Adams', 'Aaron Allston'], 'AB': ['Alia Bhatt', 'Alexander Bell']}

for key in PeopleDictionary:
    for i in range(len(PeopleDictionary[key])):
        Name_URL_All = wikipedia.search(PeopleDictionary[key][i])
        if Name_URL_All:
            Name_URL = Name_URL_All[0]
            PeopleDictionary[key][i] = Name_URL
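The lookups above are network-bound, so one optimisation in the spirit of the batch-request idea the question mentions is to run them concurrently with a thread pool. A sketch only: the `lookup` function below is a stand-in for `wikipedia.search` so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def first_search_result(lookup, name):
    """Return the first search hit for a name, or the name itself if none."""
    results = lookup(name)
    return results[0] if results else name

def resolve_titles(people, lookup, max_workers=8):
    """Resolve every name concurrently; lookup would be wikipedia.search."""
    names = [n for group in people.values() for n in group]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        resolved = pool.map(lambda n: first_search_result(lookup, n), names)
    mapping = dict(zip(names, resolved))
    return {key: [mapping[n] for n in group] for key, group in people.items()}

# Stand-in for wikipedia.search, so this sketch runs without network access
fake_index = {"Alexander Bell": ["Alexander Graham Bell"]}
lookup = lambda name: fake_index.get(name, [name])

people = {'AB': ['Alia Bhatt', 'Alexander Bell']}
print(resolve_titles(people, lookup))
```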

Accessing unicode content from DataFrame returns unicode content with additional backslash in Python3

I have a CSV file that consists of some tweets downloaded through the API. The tweets contain some Unicode characters, and I have a pretty fair idea of how to decode them.
I put the CSV File into DataFrame,
df = pd.read_csv('sample.csv', header=None)
columns = ['time', 'tweet']
df.columns = columns
one of the tweets is -
b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
But when I access this tweet through the command -
df['tweet'][0]
the output is returned in below format -
"b'RT : This little girl dressed as her father for Halloween, a employee \\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x91\\x8c (via ) '"
I am not able to figure out why this extra backslash is getting appended to the tweet. As a result, this content is not getting decoded. Below are the few rows from the DataFrame.
time tweet
0 2018-11-02 05:55:46 b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
1 2018-11-02 05:46:41 b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
2 2018-11-02 03:44:35 b'Like, you could use a line map that just shows the whole thing instead of showing a truncated map that\xe2\x80\x99s confusing.\xe2\x80\xa6 (via )
3 2018-11-02 03:37:03 b' service is a joke. No service northbound No service northbound from Navy Yard after a playoff game at 11:30pm. And they\xe2\x80\xa6'
As I mentioned before, if any of these tweets is accessed directly, an extra backslash appears in the output.
Can anyone please explain why this is happening and how to avoid it?
thanks
You did not show the contents of your CSV file, but it looks like whoever created it recorded the string representation of the bytes object as it came from Twitter - that is, inside the CSV file itself, you will find the literal b'\xff...' characters.
So, when you read it from Python, although the printed value looks like a bytes object (the kind represented with b'...'), it is actually a string whose content is that representation.
One way to turn these back into proper strings is to let Python evaluate their content - they then become valid bytes objects, which can be decoded into text. It is always a good idea to use ast.literal_eval here, as plain eval will execute arbitrary code.
So, after you have your data loaded into your dataframe, this could fix your tweets column:
import ast
df['tweet'] = df['tweet'].map(lambda x: ast.literal_eval(x).decode('utf-8') if x.startswith("b'") else x)
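A self-contained illustration of what that lambda does, using a made-up cell value:

```python
import ast

# The cell content is a *string* that merely looks like a bytes literal
cell = "b'caf\\xc3\\xa9 \\xf0\\x9f\\x98\\x82'"

raw = ast.literal_eval(cell)   # now a real bytes object: b'caf\xc3\xa9 ...'
text = raw.decode('utf-8')     # decode the UTF-8 bytes into text
print(text)                    # café 😂
```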

Python 3: Saving API Results into CSV

I'm writing a script which requires a daily updated CSV source file which lists many movie details and have decided to use Python3 to create and update it even though I don't know too much about it.
I believe I've got the code down to pull the information I need via TheMovieDB.org's API, but currently I can only get it to echo the results, not save them in a CSV. Below are a couple of questions I have, the code that I currently have, and an example of its current output.
Questions:
1. What do I need to add to get the resulting data into a CSV? I've tried many things but so far haven't gotten anything to work
2. What would I need to add so that rerunning the script would completely overwrite the CSV produced from the last run? (not append or error out)
3. Optional: Unless tedious or a pain, it would be nice to have a column for each of the values provided per title within the CSV.
Thanks!!
Current Code
import http.client
import requests
import csv
conn = http.client.HTTPSConnection("api.themoviedb.org")
payload = "{}"
conn.request("GET", "/3/discover/movie?page=20&include_video=false&include_adult=false&sort_by=primary_release_date.desc&language=en-US&api_key=XXXXXXXXXXXXXXXXXXXXXXXXXXX", payload)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))
Result That's Echoed from the above Current Code
{"page":20,"total_results":360846,"total_pages":18043,"results":[{"vote_count":0,"id":521662,"video":false,"vote_average":0,"title":"森のかたみ","popularity":1.098018,"poster_path":"/qmj1gJ33lF7BhEOWAvK0mt6hRGH.jpg","original_language":"ja","original_title":"森のかたみ","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":518636,"video":false,"vote_average":0,"title":"Stadtkomödie: Geschenkt","popularity":1.189812,"poster_path":null,"original_language":"de","original_title":"Stadtkomödie: Geschenkt","genre_ids":[35],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":520720,"video":false,"vote_average":0,"title":"Kim Possible","popularity":1.188148,"poster_path":"/3QGHTLgNKRphu3bLvGpoTZ1Ce9U.jpg","original_language":"en","original_title":"Kim Possible","genre_ids":[10751,28,12],"backdrop_path":null,"adult":false,"overview":"Live-action film adaptation of the Disney Channel original series Kim Possible.","release_date":"2019-01-01"},{"vote_count":0,"id":521660,"video":false,"vote_average":0,"title":"Speak Low","popularity":1.098125,"poster_path":"/qYQQlizCTfD5km7GIrTWrBb4E9b.jpg","original_language":"ja","original_title":"小さな声で囁いて","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":497834,"video":false,"vote_average":0,"title":"Saturday Fiction","popularity":1.148142,"poster_path":null,"original_language":"zh","original_title":"兰心大剧院","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"An actress working undercover for the Allies in 1941 Shanghai discovers the Japanese plan to attack Pearl Harbor.","release_date":"2019-01-01"},{"vote_count":0,"id":523461,"video":false,"vote_average":0,"title":"Wie gut ist deine Beziehung?","popularity":1.188171,"poster_path":null,"original_language":"de","original_title":"Wie gut ist deine Beziehung?","genre_ids":[35],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":507118,"video":false,"vote_average":0,"title":"Schwartz & Schwartz","popularity":1.345715,"poster_path":null,"original_language":"de","original_title":"Schwartz & Schwartz","genre_ids":[80],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":505916,"video":false,"vote_average":0,"title":"Kuru","popularity":1.107158,"poster_path":null,"original_language":"ja","original_title":"来る","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"After the inexplicable message, at his workplace, of a mysterious death, a man is introduced to a freelance writer and his girlfriend.","release_date":"2019-01-01"},{"vote_count":0,"id":521028,"video":false,"vote_average":0,"title":"Tsokos: Zersetzt","popularity":1.115739,"poster_path":null,"original_language":"de","original_title":"Tsokos: Zersetzt","genre_ids":[53],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":516910,"video":false,"vote_average":0,"title":"Rufmord","popularity":1.658291,"poster_path":null,"original_language":"de","original_title":"Rufmord","genre_ids":[18],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":514224,"video":false,"vote_average":0,"title":"Shadows","popularity":1.289124,"poster_path":null,"original_language":"en","original_title":"Shadows","genre_ids":[16],"backdrop_path":null,"adult":false,"overview":"Plot kept under wraps.","release_date":"2019-01-01"},{"vote_count":0,"id":483202,"video":false,"vote_average":0,"title":"Eli","popularity":1.118757,"poster_path":null,"original_language":"en","original_title":"Eli","genre_ids":[27],"backdrop_path":null,"adult":false,"overview":"A boy receiving treatment for his auto-immune disorder discovers that the house he's living isn't as safe as he thought.","release_date":"2019-01-01"},{"vote_count":0,"id":491287,"video":false,"vote_average":0,"title":"Untitled Lani Pixels Project","popularity":1.951231,"poster_path":null,"original_language":"en","original_title":"Untitled Lani Pixels Project","genre_ids":[10751,16,12,35],"backdrop_path":null,"adult":false,"overview":"Evil forces have invaded an isolated island and have targeted Patrick and Susan's grandfather, Mr. Campbell. Guided by Jack, a charming Irish rogue, the siblings end up on a dangerous journey filled with magic and mystery.","release_date":"2019-01-01"},{"vote_count":2,"id":49046,"video":false,"vote_average":0,"title":"All Quiet on the Western Front","popularity":6.197559,"poster_path":"/jZWVtbxyztDTSM0LXDcE6vdVTVC.jpg","original_language":"en","original_title":"All Quiet on the Western Front","genre_ids":[28,12,18,10752],"backdrop_path":null,"adult":false,"overview":"A young German soldier's terrifying experiences and distress on the western front during World War I.","release_date":"2018-12-31"},{"vote_count":1,"id":299782,"video":false,"vote_average":0,"title":"The Other Side of the Wind","popularity":4.561363,"poster_path":"/vnfNbuyPqo5zJavqlgI3J50xJSi.jpg","original_language":"en","original_title":"The Other Side of the Wind","genre_ids":[35,18],"backdrop_path":null,"adult":false,"overview":"Orson Welles' unfinished masterpiece, restored and assembled based on Welles' own notes. During the last 15 years of his life, Welles, who died in 1985, worked obsessively on the film, which chronicles a temperamental film director—much like him—who is battling with the Hollywood establishment to finish an iconoclastic work.","release_date":"2018-12-31"},{"vote_count":0,"id":289600,"video":false,"vote_average":0,"title":"The Sandman","popularity":3.329464,"poster_path":"/eju4vLNx9sSvscowmnKNLi3sFVe.jpg","original_language":"en","original_title":"The Sandman","genre_ids":[27],"backdrop_path":"/zo67d5klQiFR3PCyvER39IMwZ73.jpg","adult":false,"overview":"THE SANDMAN tells the story of Nathan, a young student in the city who struggles to forget his childhood trauma at the hands of the serial killer dubbed \"The Sandman.\" Nathan killed The Sandman years ago, on Christmas Eve, after he witnessed the murder of his mother... until he sees the beautiful woman who lives in the apartment across the way dying at the hands of that same masked killer. This brutal murder plunges Nathan into an odyssey into the night country of his past, his dreams... and the buried secrets of The Sandman.","release_date":"2018-12-31"},{"vote_count":0,"id":378177,"video":false,"vote_average":0,"title":"Luxembourg","popularity":1.179703,"poster_path":null,"original_language":"en","original_title":"Luxembourg","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"The story of a group of people living in a permanent nuclear winter in the ruins of the old civilisation destroyed by an atomic war.","release_date":"2018-12-31"},{"vote_count":0,"id":347392,"video":false,"vote_average":0,"title":"Slice","popularity":3.248065,"poster_path":"/ySWPZihd5ynCc1aNLQUXmiw5H2V.jpg","original_language":"en","original_title":"Slice","genre_ids":[35],"backdrop_path":"/rtL9nzXtSvo1MW05kho9oeimCdb.jpg","adult":false,"overview":"When a pizza delivery driver is murdered on the job, the city searches for someone to blame: ghosts? drug dealers? a disgraced werewolf?","release_date":"2018-12-31"},{"vote_count":0,"id":438674,"video":false,"vote_average":0,"title":"Dragged Across Concrete","popularity":3.659627,"poster_path":"/p4tpV4nGeocuOKhp0enuiQNDvhi.jpg","original_language":"en","original_title":"Dragged Across Concrete","genre_ids":[18,80,53,9648],"backdrop_path":null,"adult":false,"overview":"Two policemen, one an old-timer (Gibson), the other his volatile younger partner (Vaughn), find themselves suspended when a video of their strong-arm tactics becomes the media's cause du jour. Low on cash and with no other options, these two embittered soldiers descend into the criminal underworld to gain their just due, but instead find far more than they wanted awaiting them in the shadows.","release_date":"2018-12-31"},{"vote_count":0,"id":437518,"video":false,"vote_average":0,"title":"Friend of the World","popularity":4.189267,"poster_path":"/hf3LucIg7t7DUvgGJ9DjQyHcI4J.jpg","original_language":"en","original_title":"Friend of the World","genre_ids":[35,18,27,878,53,10752],"backdrop_path":null,"adult":false,"overview":"After a catastrophic war, an eccentric general guides a filmmaker through a ravaged bunker.","release_date":"2018-12-31"}]}
import json
import http.client
import csv

conn = http.client.HTTPSConnection("api.themoviedb.org")
payload = "{}"
conn.request("GET", "/3/discover/movie?page=20&include_video=false&include_adult=false&sort_by=primary_release_date.desc&language=en-US&api_key=XXXXXXXXXXXXXXXXXXXXXXXXXXX", payload)
res = conn.getresponse()
data = res.read()

json_data = json.loads(data)
results = json_data["results"]
for item in results:
    print(item['vote_count'])
    # write code here to pull the necessary fields and write them to the CSV
This is one way you can do it. Comment if you have any query.
That looks like a JSON object, so you can parse it into a Python dictionary using:
import json
mydict = json.loads(data)
The values you probably want are in mydict["results"], which is a list of further key:value pairs. Depending on how you want these, you could use the csv library, or just iterate through them and print the contents with a tab between them:
for item in mydict["results"]:
    for k in item:
        print("{}\t{}".format(k, item.get(k)))
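Pulling the pieces together for the original three questions: opening the file in mode "w" overwrites it on each run, and csv.DictWriter gives one column per field. A sketch using an inline sample instead of the live API call; the field list and file name are illustrative choices, not part of the original code:

```python
import csv
import json

# In the real script this would be the res.read() payload from the API call
data = json.dumps({"results": [
    {"id": 521662, "title": "森のかたみ", "vote_average": 0, "release_date": "2019-01-01"},
    {"id": 497834, "title": "Saturday Fiction", "vote_average": 0, "release_date": "2019-01-01"},
]})

results = json.loads(data)["results"]
fieldnames = ["id", "title", "vote_average", "release_date"]  # one column each

# mode "w" truncates the file, so rerunning the script overwrites the old CSV
with open("movies.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(results)
```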