Spark does not recognize new lines, &amp;, etc. from String - apache-spark
I'm trying to process text data (Twitter tweets) with PySpark. Emojis and special characters are being read correctly, but "\n" and "&amp;" appear to be escaped: Spark does not recognize them, and probably others too. One example tweet in my Spark DF looks like this:
"Hello everyone\n\nHow is it going? 😉 Take care & enjoy"
I would like Spark to read them correctly. The files are stored as parquet and I'm reading them like this:
tweets = spark.read.format('parquet') \
    .option('header', 'True') \
    .option('encoding', 'utf-8') \
    .load(path)
Below are some sample input data, which I took from the original JSONL files (I stored the data as parquet later).
"full_text": "RT #OurWarOnCancer: Where is our FEDERAL vaccination
education campaign for HPV?! Where is our FEDERAL #lungcancer
screening program?! (and\u2026"
"full_text": "\u2b55\ufe0f#HPV is the most important cause of
#CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED"
Reading directly from the JSONL files results in the same recognition problem:
tweets = spark.read \
    .option('encoding', 'utf-8') \
    .json(path)
How can I make Spark recognize them correctly? Thank you in advance.
The code below might help solve your problem.
Input taken:
"Hello everyone\n\nHow is it going? 😉 Take care & enjoy"
"full_text": "RT #OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026 &"
"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED #theNCI #NCIprevention #AmericanCancer #cancereu #uicc #IARCWHO #EuropeanCancer #KanserSavascisi #AUTF_DEKANLIK #OncoAlert"
Code to solve the problem:

from pyspark.sql.functions import regexp_replace

df = spark.read.csv("file:///home/sathya/Desktop/stackoverflo/raw-data/input.tweet")

# Replace the HTML entity &amp; with a plain & and strip the literal \n sequences
# (in the regex, '\\\\n' matches a backslash followed by 'n').
df1 = df.withColumn("cleandata", regexp_replace(regexp_replace('_c0', '&amp;', '&'), '\\\\n', ''))
df1.select("cleandata").show(truncate=False)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cleandata |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Hello everyoneHow is it going? 😉 Take care & enjoy |
|"full_text": "RT #OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026 &" |
|"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \u2b55\ufe0fThat means they can be PREVENTED #theNCI #NCIprevention #AmericanCancer #cancereu #uicc #IARCWHO #EuropeanCancer #KanserSavascisi #AUTF_DEKANLIK #OncoAlert"|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
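If you would rather have Spark interpret the escapes than delete or flatten them, here is a hedged alternative: a sketch under the assumption that the tweets contain literal \n sequences and HTML entities such as &amp; (which is how Twitter's API escapes tweet text). It reuses df and the _c0 column from the answer above.

import html

from pyspark.sql.functions import regexp_replace, udf
from pyspark.sql.types import StringType

# Turn the two-character sequence \n into a real newline...
df2 = df.withColumn("cleandata", regexp_replace('_c0', '\\\\n', '\n'))

# ...and decode HTML entities such as &amp; with html.unescape via a UDF.
unescape_html = udf(lambda s: html.unescape(s) if s is not None else s, StringType())
df2 = df2.withColumn("cleandata", unescape_html('cleandata'))

This keeps the information ("&" stays an ampersand, line breaks stay line breaks) instead of removing it.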
Related
Training / using OpenAI GPT-3 for translations
I'm trying to use OpenAI to translate my product descriptions from one language into several other languages (EN, DE, CZ, SK, HU, PL, SI...). The translations, especially into the SK/CZ/HU/PL languages, are quite bad, mainly grammatically (using the text-davinci-003 model). I've got an idea: I already have a few thousand similar products fully translated into all of these languages by professional translators. Is it possible to use those existing correct translations to train GPT-3 and then use that model to translate new texts? Has anybody already tried something similar?
Have you tried using the Edits endpoint? You can fix grammar and spelling with it. If you run test.py below, the OpenAI API will return the following completion: "What day of the week is it?"

test.py:

import openai

openai.api_key = 'sk-xxxxxxxxxxxxxxxxxxxx'

response = openai.Edit.create(
    model='text-davinci-edit-001',
    input='What day of the wek is it?',
    instruction='Fix the spelling mistakes'
)

content = response['choices'][0]['text']
print(content)
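If you specifically want the training route the question asks about, GPT-3 fine-tuning (as it worked in that API generation) accepts a JSONL file of prompt/completion pairs, so existing professional translations map onto it naturally. A minimal sketch, where pairs.jsonl, the prompt wording, and the products example data are all illustrative assumptions, not anything from the original post:

import json

# Hypothetical example data standing in for your professionally translated products.
products = [
    {"en": "Wooden garden chair", "sk": "Drevená záhradná stolička"},
]

# Write one prompt/completion pair per product, the format fine-tuning expects.
with open("pairs.jsonl", "w", encoding="utf-8") as f:
    for p in products:
        f.write(json.dumps({
            "prompt": f"Translate to SK: {p['en']}\n\n###\n\n",
            "completion": " " + p["sk"] + " END",
        }, ensure_ascii=False) + "\n")

The file could then be passed to the fine-tuning endpoint (at the time, openai api fine_tunes.create -t pairs.jsonl -m davinci from the CLI). Whether a fine-tuned base model actually beats text-davinci-003 on grammar is something you would have to evaluate on held-out products.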
Cleaning up a column based on spelling? Pandas
I've got two very important, user-entered information columns in my data frame. They are mostly cleaned up, except for one issue: the spelling, and the way names are written, differ. For example, I have several entries for one name: "red rocks canyon", "redrcks", "redrock canyon", "red rocks canyons". This data set is too large for me to go through and clean manually (2 million entries). Are there any strategies to clean these features up with code?
I would look into doing phonetic string matching here. The basic idea behind this approach is to obtain a phonetic encoding for each entered string, and then group spelling variations by their encoding. Then, you could choose the most frequent variation in each group to be the "correct" spelling.

There are several different variations on phonetic encoding, and a great package in Python for trying some of them out is jellyfish. Here is an example of how to use it with the Soundex encoding:

import jellyfish
import pandas as pd

data = pd.DataFrame({
    "name": [
        "red rocks canyon", "redrcks", "redrock canyon", "red rocks canyons",
        "bosque", "bosque escoces", "bosque escocs",
        "borland", "borlange"
    ]
})

data["soundex"] = data.name.apply(lambda x: jellyfish.soundex(x))
print(data.groupby("soundex").agg({"name": lambda x: ", ".join(x)}))

This prints:

                                                      name
soundex
B200                                                bosque
B222                         bosque escoces, bosque escocs
B645                                     borland, borlange
R362     red rocks canyon, redrcks, redrock canyon, red...

This definitely won't be perfect and you'll have to be careful, as it might group things too aggressively, but I hope it gives you something to try!
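The "choose the most frequent variation in each group" step could look like the following sketch, reusing the data frame from above (with this toy data every name is unique, so ties are broken arbitrarily; on 2 million rows the frequencies become meaningful):

# Pick the most common spelling within each soundex group and map it back.
canonical = data.groupby("soundex")["name"].agg(lambda s: s.value_counts().idxmax())
data["canonical"] = data["soundex"].map(canonical)
print(data[["name", "canonical"]])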
Improve speed of wikipedia.search with Python
I have a dictionary of famous people's names sorted by their initials. I want to convert these names into their respective Wikipedia page titles. These are the same for the first three names in this example, but Alexander Bell gets correctly converted to Alexander Graham Bell after running this code. The algorithm works, although it took about an hour to do all the 'AA' names, and I am hoping for it to do this all the way up to 'ZZ'. Is there any optimisation I can do on this? For example, I saw something about batch requests but am not sure if it applies to my algorithm. Or is there a more efficient method that I could use to get this same information? Thanks.

import wikipedia

PeopleDictionary = {'AA': ['Amy Adams', 'Aaron Allston'], 'AB': ['Alia Bhatt', 'Alexander Bell']}

for key, val in PeopleDictionary.items():
    for val in range(len(PeopleDictionary[key])):
        Name_URL_All = wikipedia.search(PeopleDictionary[key][val])
        if Name_URL_All:
            Name_URL = Name_URL_All[0]
            PeopleDictionary[key][val] = Name_URL
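One plausible optimisation, sketched here under the assumption that the lookups are network-bound (same wikipedia package as the question; max_workers is an arbitrary illustrative choice, not a recommendation):

import wikipedia
from concurrent.futures import ThreadPoolExecutor

PeopleDictionary = {'AA': ['Amy Adams', 'Aaron Allston'], 'AB': ['Alia Bhatt', 'Alexander Bell']}

def first_hit(name):
    # Return the top search result, or keep the original name if nothing is found.
    results = wikipedia.search(name)
    return results[0] if results else name

# Run the searches concurrently; each call is a separate HTTP round trip,
# so a thread pool overlaps the waiting rather than doing it serially.
with ThreadPoolExecutor(max_workers=8) as pool:
    for key, names in PeopleDictionary.items():
        PeopleDictionary[key] = list(pool.map(first_hit, names))

print(PeopleDictionary)

This keeps the per-request cost but overlaps the waiting; mind Wikipedia's rate limits before raising the worker count.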
Accessing unicode content from DataFrame returns unicode content with additional backslash in Python3
I have a CSV file that consists of some tweets downloaded through an API. The tweets contain some Unicode characters, and I have a pretty fair idea of how to decode them. I put the CSV file into a DataFrame:

import pandas as pd

df = pd.read_csv('sample.csv', header=None)
columns = ['time', 'tweet']
df.columns = columns

One of the tweets is:

b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'

But when I access this tweet through the command df['tweet'][0], the output is returned in the format below:

"b'RT : This little girl dressed as her father for Halloween, a employee \\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x98\\x82\\xf0\\x9f\\x91\\x8c (via ) '"

I am not able to figure out why this extra backslash is getting appended to the tweet. As a result, this content is not getting decoded. Below are a few rows from the DataFrame:

   time                 tweet
0  2018-11-02 05:55:46  b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
1  2018-11-02 05:46:41  b'RT : This little girl dressed as her father for Halloween, a employee \xf0\x9f\x98\x82\xf0\x9f\x98\x82\xf0\x9f\x91\x8c (via )'
2  2018-11-02 03:44:35  b'Like, you could use a line map that just shows the whole thing instead of showing a truncated map that\xe2\x80\x99s confusing.\xe2\x80\xa6 (via )
3  2018-11-02 03:37:03  b' service is a joke. No service northbound No service northbound from Navy Yard after a playoff game at 11:30pm. And they\xe2\x80\xa6'

(A screenshot of 'sample.csv' was attached here.) As I mentioned before, if any of these tweets is accessed directly, there will be an extra backslash appended in the output. Can anyone please explain why this is happening and how to avoid it? Thanks.
You did not show the contents of your CSV file, but it looks like whoever created it recorded the string representation of the bytes object as it came from Twitter; that is, inside the CSV file itself you will find the literal b'\xff...' characters. So, when you read it from Python, although each value prints like a bytes object (the ones represented with b'...'), it is actually a plain string whose content is that representation.

One way to get these back as proper strings is to let Python evaluate their content; they then become valid bytes objects, which can be decoded into text. It is always a good idea to use ast.literal_eval here, as plain eval is too arbitrary. So, after you have your data loaded into your DataFrame, this should fix your tweet column:

import ast

df['tweet'] = df['tweet'].map(lambda x: ast.literal_eval(x).decode('utf-8') if x.startswith("b'") else x)
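A quick self-contained check of that idea, on a made-up value rather than the asker's actual data:

import ast

raw = "b'caf\\xc3\\xa9'"  # the literal characters as they would sit inside a CSV cell
fixed = ast.literal_eval(raw).decode('utf-8')  # evaluate to bytes, then decode
print(fixed)  # café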
Python 3: Saving API Results into CSV
I'm writing a script which requires a daily-updated CSV source file listing many movie details, and I have decided to use Python 3 to create and update it even though I don't know too much about it. I believe I've got the code down to pull the information I need via TheMovieDB.org's API, but currently I can only get it to echo the results, not save them in a CSV. Below are a couple of questions I have, the code I currently have, and an example of its current output.

Questions:
1. What do I need to add to get the resulting data into a CSV? I've tried many things but so far haven't gotten anything to work.
2. What would I need to add so that rerunning the script completely overwrites the CSV produced by the last run (not append or error out)?
3. Optional: unless it's tedious or a pain, it would be nice to have a column for each of the values provided per title within the CSV. Thanks!!

Current code:

import http.client
import requests
import csv

conn = http.client.HTTPSConnection("api.themoviedb.org")
payload = "{}"
conn.request("GET", "/3/discover/movie?page=20&include_video=false&include_adult=false&sort_by=primary_release_date.desc&language=en-US&api_key=XXXXXXXXXXXXXXXXXXXXXXXXXXX", payload)
res = conn.getresponse()
data = res.read()
print(data.decode("utf-8"))

Result that's echoed from the above current code:

{"page":20,"total_results":360846,"total_pages":18043,"results":[{"vote_count":0,"id":521662,"video":false,"vote_average":0,"title":"森のかたみ","popularity":1.098018,"poster_path":"/qmj1gJ33lF7BhEOWAvK0mt6hRGH.jpg","original_language":"ja","original_title":"森のかたみ","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":518636,"video":false,"vote_average":0,"title":"Stadtkomödie: Geschenkt","popularity":1.189812,"poster_path":null,"original_language":"de","original_title":"Stadtkomödie: Geschenkt","genre_ids":[35],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":520720,"video":false,"vote_average":0,"title":"Kim Possible","popularity":1.188148,"poster_path":"/3QGHTLgNKRphu3bLvGpoTZ1Ce9U.jpg","original_language":"en","original_title":"Kim Possible","genre_ids":[10751,28,12],"backdrop_path":null,"adult":false,"overview":"Live-action film adaptation of the Disney Channel original series Kim Possible.","release_date":"2019-01-01"},{"vote_count":0,"id":521660,"video":false,"vote_average":0,"title":"Speak Low","popularity":1.098125,"poster_path":"/qYQQlizCTfD5km7GIrTWrBb4E9b.jpg","original_language":"ja","original_title":"小さな声で囁いて","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":497834,"video":false,"vote_average":0,"title":"Saturday Fiction","popularity":1.148142,"poster_path":null,"original_language":"zh","original_title":"兰心大剧院","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"An actress working undercover for the Allies in 1941 Shanghai discovers the Japanese plan to attack Pearl Harbor.","release_date":"2019-01-01"},{"vote_count":0,"id":523461,"video":false,"vote_average":0,"title":"Wie gut ist deine Beziehung?","popularity":1.188171,"poster_path":null,"original_language":"de","original_title":"Wie gut ist deine Beziehung?","genre_ids":[35],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":507118,"video":false,"vote_average":0,"title":"Schwartz & Schwartz","popularity":1.345715,"poster_path":null,"original_language":"de","original_title":"Schwartz & 
Schwartz","genre_ids":[80],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":505916,"video":false,"vote_average":0,"title":"Kuru","popularity":1.107158,"poster_path":null,"original_language":"ja","original_title":"来る","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"After the inexplicable message, at his workplace, of a mysterious death, a man is introduced to a freelance writer and his girlfriend.","release_date":"2019-01-01"},{"vote_count":0,"id":521028,"video":false,"vote_average":0,"title":"Tsokos: Zersetzt","popularity":1.115739,"poster_path":null,"original_language":"de","original_title":"Tsokos: Zersetzt","genre_ids":[53],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":516910,"video":false,"vote_average":0,"title":"Rufmord","popularity":1.658291,"poster_path":null,"original_language":"de","original_title":"Rufmord","genre_ids":[18],"backdrop_path":null,"adult":false,"overview":"","release_date":"2019-01-01"},{"vote_count":0,"id":514224,"video":false,"vote_average":0,"title":"Shadows","popularity":1.289124,"poster_path":null,"original_language":"en","original_title":"Shadows","genre_ids":[16],"backdrop_path":null,"adult":false,"overview":"Plot kept under wraps.","release_date":"2019-01-01"},{"vote_count":0,"id":483202,"video":false,"vote_average":0,"title":"Eli","popularity":1.118757,"poster_path":null,"original_language":"en","original_title":"Eli","genre_ids":[27],"backdrop_path":null,"adult":false,"overview":"A boy receiving treatment for his auto-immune disorder discovers that the house he's living isn't as safe as he thought.","release_date":"2019-01-01"},{"vote_count":0,"id":491287,"video":false,"vote_average":0,"title":"Untitled Lani Pixels Project","popularity":1.951231,"poster_path":null,"original_language":"en","original_title":"Untitled Lani Pixels Project","genre_ids":[10751,16,12,35],"backdrop_path":null,"adult":false,"overview":"Evil forces have invaded an isolated island and have targeted Patrick and Susan's grandfather, Mr. Campbell. Guided by Jack, a charming Irish rogue, the siblings end up on a dangerous journey filled with magic and mystery.","release_date":"2019-01-01"},{"vote_count":2,"id":49046,"video":false,"vote_average":0,"title":"All Quiet on the Western Front","popularity":6.197559,"poster_path":"/jZWVtbxyztDTSM0LXDcE6vdVTVC.jpg","original_language":"en","original_title":"All Quiet on the Western Front","genre_ids":[28,12,18,10752],"backdrop_path":null,"adult":false,"overview":"A young German soldier's terrifying experiences and distress on the western front during World War I.","release_date":"2018-12-31"},{"vote_count":1,"id":299782,"video":false,"vote_average":0,"title":"The Other Side of the Wind","popularity":4.561363,"poster_path":"/vnfNbuyPqo5zJavqlgI3J50xJSi.jpg","original_language":"en","original_title":"The Other Side of the Wind","genre_ids":[35,18],"backdrop_path":null,"adult":false,"overview":"Orson Welles' unfinished masterpiece, restored and assembled based on Welles' own notes. 
During the last 15 years of his life, Welles, who died in 1985, worked obsessively on the film, which chronicles a temperamental film director—much like him—who is battling with the Hollywood establishment to finish an iconoclastic work.","release_date":"2018-12-31"},{"vote_count":0,"id":289600,"video":false,"vote_average":0,"title":"The Sandman","popularity":3.329464,"poster_path":"/eju4vLNx9sSvscowmnKNLi3sFVe.jpg","original_language":"en","original_title":"The Sandman","genre_ids":[27],"backdrop_path":"/zo67d5klQiFR3PCyvER39IMwZ73.jpg","adult":false,"overview":"THE SANDMAN tells the story of Nathan, a young student in the city who struggles to forget his childhood trauma at the hands of the serial killer dubbed \"The Sandman.\" Nathan killed The Sandman years ago, on Christmas Eve, after he witnessed the murder of his mother... until he sees the beautiful woman who lives in the apartment across the way dying at the hands of that same masked killer. This brutal murder plunges Nathan into an odyssey into the night country of his past, his dreams... and the buried secrets of The Sandman.","release_date":"2018-12-31"},{"vote_count":0,"id":378177,"video":false,"vote_average":0,"title":"Luxembourg","popularity":1.179703,"poster_path":null,"original_language":"en","original_title":"Luxembourg","genre_ids":[],"backdrop_path":null,"adult":false,"overview":"The story of a group of people living in a permanent nuclear winter in the ruins of the old civilisation destroyed by an atomic war.","release_date":"2018-12-31"},{"vote_count":0,"id":347392,"video":false,"vote_average":0,"title":"Slice","popularity":3.248065,"poster_path":"/ySWPZihd5ynCc1aNLQUXmiw5H2V.jpg","original_language":"en","original_title":"Slice","genre_ids":[35],"backdrop_path":"/rtL9nzXtSvo1MW05kho9oeimCdb.jpg","adult":false,"overview":"When a pizza delivery driver is murdered on the job, the city searches for someone to blame: ghosts? drug dealers? a disgraced werewolf?","release_date":"2018-12-31"},{"vote_count":0,"id":438674,"video":false,"vote_average":0,"title":"Dragged Across Concrete","popularity":3.659627,"poster_path":"/p4tpV4nGeocuOKhp0enuiQNDvhi.jpg","original_language":"en","original_title":"Dragged Across Concrete","genre_ids":[18,80,53,9648],"backdrop_path":null,"adult":false,"overview":"Two policemen, one an old-timer (Gibson), the other his volatile younger partner (Vaughn), find themselves suspended when a video of their strong-arm tactics becomes the media's cause du jour. Low on cash and with no other options, these two embittered soldiers descend into the criminal underworld to gain their just due, but instead find far more than they wanted awaiting them in the shadows.","release_date":"2018-12-31"},{"vote_count":0,"id":437518,"video":false,"vote_average":0,"title":"Friend of the World","popularity":4.189267,"poster_path":"/hf3LucIg7t7DUvgGJ9DjQyHcI4J.jpg","original_language":"en","original_title":"Friend of the World","genre_ids":[35,18,27,878,53,10752],"backdrop_path":null,"adult":false,"overview":"After a catastrophic war, an eccentric general guides a filmmaker through a ravaged bunker.","release_date":"2018-12-31"}]}
This is one way you can do it. Comment if you have any query.

import json
import http.client
import requests
import csv

conn = http.client.HTTPSConnection("api.themoviedb.org")
payload = "{}"
conn.request("GET", "/3/discover/movie?page=20&include_video=false&include_adult=false&sort_by=primary_release_date.desc&language=en-US&api_key=XXXXXXXXXXXXXXXXXXXXXXXXXXX", payload)
res = conn.getresponse()
data = res.read()

json_data = json.loads(data)
results = json_data["results"]
for item in results:
    print(item['vote_count'])
    # write code here to get the necessary objects to write to the CSV
That looks like a JSON object, so you can parse it into a Python dictionary using:

import json

mydict = json.loads(data)

Probably the values you want are in mydict["results"], which is another set of key:value pairs. Depending on how you want these, you could use a CSV library or just iterate through them and print the contents with a tab between them:

for item in mydict["results"]:
    for k in item:
        print("{}\t{}".format(k, item.get(k)))
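Neither snippet shows the actual CSV write, so here is a hedged sketch of that final step, reusing mydict from above; the file name movies.csv is an illustrative choice. Opening with mode "w" also answers question 2, since the file is truncated and overwritten on every run, and csv.DictWriter gives one column per key, per question 3:

import csv
import json

mydict = json.loads(data)  # `data` as read in the question's code
results = mydict["results"]

# Mode "w" truncates any previous file, so rerunning overwrites rather than appends.
with open("movies.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)

List-valued fields such as genre_ids will be written in their Python repr (e.g. "[35, 28]"); if that matters, join them into a string first.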