read tweets extracted with python - excel

I am trying to read tweets in excel. Tweets have been retrieved with python (and tweepy) then saved in a csv file:
# -*- coding: utf-8 -*-
writer = csv.writer(open(r"C:\path\twitter_" + date + ".csv", "w"), lineterminator='\n', delimiter=';')
writer.writerow(["username", "nb_followers", "tweet_text"])
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token_key, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
for tweet in tweepy.Cursor(api.search, q="dengue+OR+%23dengue", lang="en", since=date, until=end_date).items():
    username = tweet.user.screen_name
    nb_followers = tweet.user.followers_count
    tweet_text = tweet.text.encode('utf-8')
    writer.writerow([username, nb_followers, tweet_text])
Due to the utf-8 encoding, I have problems reading them in a text editor or excel.
For example, one tweet comes out like this in Excel:
b"\xe2\x80\x9c#ThislsWow: I want to do this \xf0\x9f\x98\x8d http://t.co/rGfv9e70Tj\xe2\x80\x9d pu\xc3\xb1eta you're going to get bitten by the mosquito and get dengue"
How can I get the original characters back? And how do I remove the b at the beginning, which is only useful inside a Python program?
EDIT :
As per Alastair McCormack's comment:
I removed the encoding of my field and added it in the writer:
writer= csv.writer(open(r"C:\path\twitter_"+date+".csv", "w", encoding="UTF-8"), lineterminator='\n', delimiter =';')
tweet_text=tweet.text.replace("\n", "").replace("\r", "")
Now I have the following error:
tweet: Traceback (most recent call last):
File "twitter_influence.py", line 88, in <module>
print("tweet:", tweet_text)
File "C:\Users\rlalande\Envs\tweepy\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 137: character maps to <undefined>
EDIT2 :
I am now using the following:
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
(seen in this post: https://stackoverflow.com/a/4374457/1875861)
There is no more error but it doesn't output the correct characters.
For example, one tweet gives this output in Excel:
Malay Mail Online Alarming rise in dengue casesMalay Mail Online“The ministry started a campaign for construction… http://t.co/MuLFlMwkY0
Before, with direct encoding of the field, I had:
b'Malay Mail Online\n\nAlarming rise in dengue casesMalay Mail Online\xe2\x80\x9cThe ministry started a campaign for construction\xe2\x80\xa6 http://t.co/MuLFlMwkY0'
The result is different but not really better... Why is the quote character not outputted correctly? In one case it outputs … and in the other case \xe2\x80\xa6.

It's because the CSV writer expects all input to be Unicode strings. You're getting the __repr__() of a byte string.
Set the encoding of your output file by replacing the first line with:
writer= csv.writer(open(r"C:\path\twitter_"+date+".csv", "w", encoding="UTF-8"), lineterminator='\n', delimiter =';')
This means that any Unicode strings written to the file will be translated automagically. Then remove the explicit encode():
tweet_text=tweet.text
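Putting both changes together, a minimal sketch of the corrected writing loop might look like this (it assumes the api, date and end_date variables from the question are already set up):
import csv
import tweepy

# Open the CSV in text mode with an explicit UTF-8 encoding so the csv module
# receives plain str objects and the file handles the encoding.
with open(r"C:\path\twitter_" + date + ".csv", "w", encoding="UTF-8") as f:
    writer = csv.writer(f, lineterminator='\n', delimiter=';')
    writer.writerow(["username", "nb_followers", "tweet_text"])
    for tweet in tweepy.Cursor(api.search, q="dengue+OR+%23dengue", lang="en",
                               since=date, until=end_date).items():
        # No .encode() here: write the str and let the file encode it.
        writer.writerow([tweet.user.screen_name, tweet.user.followers_count, tweet.text])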
Edit:
Excel needs to be coerced into reading UTF-8 files if you don't use the import function. The easiest way to do this is to add a UTF-8 BOM signature to the start of the file.
Python provides a shortcut if you use the utf_8_sig encoding. E.g.
writer= csv.writer(open(r"C:\path\twitter_"+date+".csv", "w", encoding="utf_8_sig"), lineterminator='\n', delimiter =';')
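If you want to confirm the BOM made it into the file, a quick sanity check (just a sketch) is to read the first three bytes back:
# utf_8_sig writes the byte sequence EF BB BF at the start of the file
with open(r"C:\path\twitter_" + date + ".csv", "rb") as f:
    print(f.read(3) == b'\xef\xbb\xbf')  # True if the BOM is present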
You can also check your file in a decent UTF-8 editor like Notepad++ or Atom.

Related

How to get python to tolerate UTF-8 encoding errors

I have a set of UTF-8 texts I have scraped from web pages. I am trying to extract keywords from these files like so:
import os
import json
from rake_nltk import Rake
rake_nltk_var = Rake()
directory = 'files'
results = {}
for filename in os.scandir(directory):
    if filename.is_file():
        with open("files/" + filename.name, encoding="utf-8", mode='r') as infile:
            text = infile.read()
        rake_nltk_var.extract_keywords_from_text(text)
        keyword_extracted = rake_nltk_var.get_ranked_phrases()
        results[filename.name] = keyword_extracted
with open("extracted-keywords.json", "w") as outfile:
    json.dump(results, outfile)
One of the files I've managed to process so far is throwing the following error on read:
Traceback (most recent call last):
File "extract-keywords.py", line 11, in <module>
text = infile.read()
File "c:\python36\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 66: invalid start byte
0x92 is a right single quotation mark, but the 66th char of the file is a "u", so I don't know where this error is coming from. Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same? I have a lot of files and can't afford to stop and debug every encoding error they might contain.
I have a set of UTF-8 texts I have scraped from web pages
If they can't be read with the script you've shown, then these are not actually UTF-8 encoded files.
We would have to know about the code which wrote the files in the first place to tell the correct way to decode. However, the ’ character is byte 0x92 in code page 1252, so try using that encoding instead, i.e.:
with open("files/" + filename.name, encoding="cp1252") as infile:
text = infile.read()
Ignoring decoding errors corrupts the data, so it's best to use the correct decoder when possible; try that first! However, about this part of the question:
Regardless, is there some way to make the codec tolerate such encoding errors? For example, Perl simply substitutes a question mark for any character it can't decode. Is there some way to get Python to do the same?
Yes, you can specify errors="replace"
>>> with open("/tmp/f.txt", "w", encoding="cp1252") as f:
...     f.write('this is a right quote: \N{RIGHT SINGLE QUOTATION MARK}')
...
>>> with open("/tmp/f.txt", encoding="cp1252") as f:
...     print(f.read())  # using correct encoding
...
this is a right quote: ’
>>> with open("/tmp/f.txt", encoding="utf-8", errors="replace") as f:
...     print(f.read())  # using incorrect encoding and replacing errors
...
this is a right quote: �
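Applied to the loop in the question, the change is just the open() call; a minimal sketch with the replacement behaviour (accepting the possible data loss mentioned above):
# Undecodable bytes become U+FFFD (�) instead of raising UnicodeDecodeError
with open("files/" + filename.name, encoding="utf-8", errors="replace") as infile:
    text = infile.read()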

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte while reading a text file

I am training a word2vec model, using about 700 text files as my corpus. But, when I start reading the files after the preprocessing step, I get the mentioned error. The code is as follows
class MyCorpus(object):
    def __iter__(self):
        for i in ceo_path:  # ceo_path contains the absolute paths of all text files
            file = open(i, 'r', encoding='utf-8')
            text = file.read()
            ###########
            ###########  text preprocessing steps
            ###########
            yield final_text  # returns the preprocessed text

sentences = MyCorpus()
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt='%H:%M:%S', level=logging.INFO)
# training the model
cores = multiprocessing.cpu_count()
w2v_model = Word2Vec(min_count=5,
                     iter=30,
                     window=3,
                     size=200,
                     sample=6e-5,
                     alpha=0.025,
                     min_alpha=0.0001,
                     negative=20,
                     workers=cores-1,
                     sg=1)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
w2v_model.save('ceo1.model')
The error that I am getting is:
Traceback (most recent call last):
File "C:/Users/name/PycharmProjects/prac2/hbs_word2vec.py", line 131, in <module>
w2v_model.build_vocab(sentences)
File "C:\Users\name\PycharmProjects\prac1\venv\lib\site-packages\gensim\models\base_any2vec.py", line 921, in build_vocab
total_words, corpus_count = self.vocabulary.scan_vocab(
File "C:\Users\name\PycharmProjects\prac1\venv\lib\site-packages\gensim\models\word2vec.py", line 1403, in scan_vocab
total_words, corpus_count = self._scan_vocab(sentences, progress_per, trim_rule)
File "C:\Users\name\PycharmProjects\prac1\venv\lib\site-packages\gensim\models\word2vec.py", line 1372, in _scan_vocab
for sentence_no, sentence in enumerate(sentences):
File "C:/Users/name/PycharmProjects/prac2/hbs_word2vec.py", line 65, in __iter__
text = file.read()
File "C:\Users\name\AppData\Local\Programs\Python\Python38-32\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
I am not able to understand the error as I am new to this. I was not getting the error in reading the text files when I wasn't using the iter function and sending the data in chunks as I am doing currently.
It looks like one of your files doesn't have proper utf-8-encoded text.
(Your Word2Vec-related code probably isn't necessary for hitting the error, at all. You could probably trigger the same error with just: sentences_list = list(MyCorpus()).)
To find which file, two different possibilities might be:
Change your MyCorpus class so that it prints the path of each file before it tries to read it.
Add a Python try: ... except UnicodeDecodeError: ... statement around the read, and when the exception is caught, print the offending filename (a sketch of this follows below).
Once you know the file involved, you may want to fix the file, or change the code to be able to handle the files you have.
Maybe they're not really in utf-8 encoding, in which case you'd specify a different encoding.
Maybe just one or a few have problems, and it'd be OK to just print their names for later investigation and skip them. (You could use the exception-handling approach above to do that.)
Maybe, those that aren't utf-8 are always in some other platform-specific encoding, so when utf-8 fails, you could try a 2nd encoding.
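For illustration, a minimal sketch of the try/except suggestion, naming and skipping the offending file (ceo_path, the preprocessing placeholder and final_text are from the question):
class MyCorpus(object):
    def __iter__(self):
        for i in ceo_path:  # absolute paths of all text files, as in the question
            try:
                with open(i, 'r', encoding='utf-8') as file:
                    text = file.read()
            except UnicodeDecodeError:
                print('Not valid utf-8, skipping:', i)
                continue
            # ... text preprocessing steps ...
            yield final_text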
Separately, when you solve the encoding issue, your iterable MyCorpus is not yet returning what the Word2Vec class expects.
It doesn't want full text plain strings. It needs those texts to already be broken up into individual word-tokens.
(Often, simply performing a .split() on a string is close-enough-to-real-tokenization to try as a starting point, but usually, projects use some more-sophisticated punctuation-aware tokenization.)
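As a starting point, the yield at the end of __iter__ could hand back a token list instead of the raw string (a simple .split(), as mentioned above; a real project would likely use a proper tokenizer):
# Word2Vec expects each item of the corpus to be a list of word-tokens
yield final_text.split()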

What to do if using utf-8 while writing in a file results in replacement of some characters?

I'm writing some raw text + HTML data into a file with these lines:
ready_file = open('example.txt', 'w', encoding='utf-8')
ready_file.write(raw_data_html)
ready_file.close()
If I'm using encoding, it results in some characters, such as ' and ", being replaced with ”, “, Ђ™ and so on.
If I'm not using encoding, it results in this type of error:
UnicodeEncodeError: 'charmap' codec can't encode character 'X' in position Y: character maps to <undefined>
If I'm writing only text data without the HTML part and not using encoding, as in the example below:
ready_file = open('example.txt', 'w')
ready_file.write(raw_data)
ready_file.close()
then it's fine and no ' or " are being replaced with ” and so on.
How do I avoid this error or avoid my characters being replaced with god knows what?
UPD: figured it out! I had ” instead of " in my initial files.
Thanks for the answers!
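For context, characters like “ are what you see when UTF-8 bytes are displayed as cp1252; a small stdlib demonstration (purely illustrative, not from the original post):
s = '\u201cquote\u201d'                        # “quote”
raw = s.encode('utf-8')                        # b'\xe2\x80\x9cquote\xe2\x80\x9d'
print(raw.decode('cp1252', errors='replace'))  # shows the familiar “ mojibake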

listing filenames in a directory with .docx extension using python

'''This script is to copy text from documents (docx) to simple text file
'''
import sys
import ntpath
import os
from docx import Document
docpath = os.path.abspath(r'C:\Users\Khairul Basar\Documents\CWD Projects\00_WORKING\WL_SLOT1_submission_date_30-03-2018\1-100')
txtpath = os.path.abspath(r'C:\Users\Khairul Basar\Documents\CWD Projects\00_WORKING\WL_SLOT1_submission_date_30-03-2018\Textfiles')
for filename in os.listdir(docpath):
    try:
        document = Document(os.path.join(docpath, filename))
        # print(document.paragraphs)
        print(filename)
        savetxt = os.path.join(txtpath, ntpath.basename(filename).split('.')[0] + ".txt")
        print('Reading ' + filename)
        # print(savetxt)
        fullText = []
        for para in document.paragraphs:
            # print(para.text)
            fullText.append(para.text)
        with open(savetxt, 'wt') as newfile:
            for item in fullText:
                newfile.write("%s\n" % item)
        # with open(savetxt, 'a') as f:
        #     f.write(para.text)
        #     print(" ".join([line.rstrip('\n') for line in f]))
        # newfile.write(fullText)
        # newfile.save()
        #
        # newfile.write('\n\n'.join(fullText))
        # newfile.close()
    except:
        # print(filename)
        # document = Document(os.path.join(docpath, filename))
        # print(document.paragraphs)
        print('Please fix an error')
        exit()
        # print("Please supply an input and output file. For example:\n"
        #       "  example-extracttext.py 'My Office 2007 document.docx' 'outp"
        #       "utfile.txt'")

# Fetch all the text out of the document we just created
# Make explicit unicode version
# Print out text of document with two newlines under each paragraph
print(savetxt)
The above Python 3 script reads docx files and creates txt files. In one directory I have hundreds of docx files, but it only creates 19 txt files and then exits the program. I couldn't figure out why.
The docx files are output from OCR software; all are English text (no images, tables, graphs or anything special).
Today I ran the program again after removing the try/except instruction, and the result is the same:
1.docx
Reading 1.docx
10.docx
Reading 10.docx
100.docx
Reading 100.docx
11.docx
Reading 11.docx
12.docx
Reading 12.docx
13.docx
Reading 13.docx
14.docx
Reading 14.docx
15.docx
Reading 15.docx
16.docx
Reading 16.docx
17.docx
Reading 17.docx
18.docx
Reading 18.docx
Traceback (most recent call last):
  File "C:\Users\Khairul Basar\Documents\CWD Projects\docx2txtv2.py", line 26, in <module>
    newfile.write("%s\n" % item)
  File "C:\Python36\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u0113' in position 77: character maps to <undefined>
Another post here resolves this with .encode("utf-8"), but if I use it then I get b'my text' in every line, which I don't need.
UPDATE: fixed
I changed the following line:
with open(savetxt, 'w', encoding='utf-8') as newfile:
by adding encoding='utf-8' (help taken from this post).
Thank you to whoever formatted my post nicely.
usr2564301 pointed out that I should remove the try/except from the code. By doing so I got the exact error showing why it was exiting the program prematurely.
The problem was that my docx files contain many characters which are beyond an 8-bit character set. Writing those non-English characters needs encoding='utf-8'.
That solved the problem. Anyway, all credit goes to usr2564301, who is somewhere I don't know.
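For completeness, the corrected write block in context might look like this (a sketch of the code above with the fix applied and the bare try/except removed so real errors surface):
with open(savetxt, 'w', encoding='utf-8') as newfile:
    for item in fullText:
        newfile.write("%s\n" % item)  # characters like '\u0113' now encode without error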

UnicodeEncodeError: 'charmap' codec can't encode character '\ufe0f' in position 62: character maps to <undefined>

I am trying to scrape geo locations based on the URLs, and after about 500 searches and extractions of geo locations I am getting an encoding error. I have included the utf-8 encoding in the code and also ran the following commands in cmd:
chcp 65001
set PYTHONIOENCODING=utf-8
And yet I am getting the following error :
Traceback (most recent call last):
File "__main__.py", line 33, in <module>
outputfile.write(newline)
File "C:\Program Files\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufe0f' in position 62: character maps to <undefined>
I am using an updated Python 3.x on Anaconda with all packages up to date.
#!/usr/bin/python
import sys
from twitter_location import get_location
from google_location import get_coordinates

# Open output file
outputfile = open(sys.argv[2], 'w')

# Read input file
with open(sys.argv[1], 'r', encoding="utf-8", errors='ignore') as csv:
    # Skip headers line
    next(csv)
    # Loop in lines
    for line in csv:
        # Extract userid
        print(line)
        permalink = line.split(',')[-1].strip()
        userid = permalink.split('/')[3]
        # Get location as string if exists
        location = get_location(userid)
        if location is None:
            print('user {} can not be reached or do not exposes any location.'.format(userid))
            continue
        else:
            # If location is ok, get coordinates
            coordinates = get_coordinates(location)
            print('{}: {}'.format(userid, coordinates))
            # Copy current input line and add coordinates at the end
            newline = '{},{}\n'.format(line.strip(), coordinates)
            # Write in output file
            outputfile.write(newline)
I am looking for two things here:
1. Help with the encoding error.
2. I want to put the input headers back in the output, plus the new column header.
My input files have the following headers:
username date retweets favorites text geo mentions hashtags id permalink
While writing the output I get all the columns plus the new geo coordinates column, but I am not able to put the headers back in the output files.
Appreciate the help, thanks in advance.
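Following the same pattern as the answers above, a minimal sketch of both fixes (the new 'coordinates' column name is an assumption):
# Open the output with an explicit encoding so characters such as '\ufe0f' can be written
outputfile = open(sys.argv[2], 'w', encoding='utf-8')

with open(sys.argv[1], 'r', encoding='utf-8', errors='ignore') as csv:
    # Keep the header line instead of discarding it, and add the new column name
    headers = next(csv).strip()
    outputfile.write('{},{}\n'.format(headers, 'coordinates'))
    for line in csv:
        ...  # rest of the loop unchanged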
