how to convert string stored in list into utf8? - python-3.x

i have tokenized the text form the text files stored in a list and stored the tokenized text in a variable and when i print that variable it shows the wrong result.
import glob
files = glob.glob("D:\Pakistan Constitution\*.txt")
documents = []
for file in files:
with open(file) as f:
documents.append(f.read())
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
print(texts)
I expect the tokenized words but result occur is like that
['ÿþp\x00a\x00r\x00t\x00', '\x00v\x00', '\x00', '\x00r\x00e\x00l\x00a\x00t\x00i\x00o\x00n\x00s\x00', '\x00b\x00e\x00t\x00w\x00e\x00e\x00n\x00',
So anyone can help me regarding this

Related

How to find the number of common words in a text file and delete them in python?

The question is to:
Firstly,find the number of all words in a text file
Secondly, delete the common words like, a, an , and, to, in, at, but,... (it is allowed to write a list of these words)
Thirdly, find the number of the remaining words (unique words)
Make a list of them
the file name should be used as the parameter of the function
I have done the first part of the question
import re
file = open('text.txt', 'r', encoding = 'latin-1')
word_list = file.read().split()
for x in word_list:
print(x)
res = len(word_list)
print ('The number of words in the text:' + str(res))
def uncommonWords (file):
uncommonwords = (list(file))
for i in uncommonwords:
i += 1
print (i)
The code shows till the number of the words and nothing appears after that.
you can do it like this
# list of common words you want to remove
stop_words = set(["is", "the", "to", "in"])
# set to collect unique words
words_in_file = set()
with open("words.txt") as text_file:
for line in text_file:
for word in line.split():
words_in_file.add(word)
# remove common words from word list
unique_words = words_in_file - stop_words
print(list(unique_words))
First, you may want to get rid of punctuation : as showed in this answer, you should do :
nonPunct = re.compile('.*[A-Za-z0-9].*')
filtered = [w for w in text if nonPunct.match(w)]
then, you could do
from collections import Counter
counts = Counter(filtered)
you can then access the list of unique words with list(counts.keys()) and then you can chose to ignore the words you don't want with
[word for word in list(counts.keys()) if word not in common_words]
Hope this answers your question.

how to extract text from PDF file using python , i never did this and not getting the DOM of PDF file

this is my PDF file "https://drive.google.com/open?id=1M9k1AO17ZSwT6HTrTrB-uz85ps3WL1wS"
Help me someone to extract this , as i search on SO getting some clue to extract text using these libries PyPDF2, PyPDF2.pdf , PageObject, u_, ContentStream, b_, TextStringObject ,but not getting how to use it.
someone please help me to extract this with some explanation, so i can understand the code and tell me how to read DOM of PDF file.
you need to install some libaries:
pip install PyPDF2
pip install textract
pip install nltk
This will download the libraries you require t0 parsePDF documents and extract keywords. In order to do this, make sure your PDF file is stored within the folder where you’re writing your script.
Startup your favourite editor and type:
Note: All lines starting with # are comments.
Step 1: Import all libraries:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
Step 2: Read PDF File
#write a for-loop to open many files -- leave a comment if you'd #like to learn how
filename = 'enter the name of the file here'
#open allows you to read the file
pdfFileObj = open(filename,'rb')
#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#discerning the number of pages will allow us to parse through all #the pages
num_pages = pdfReader.numPages
count = 0
text = ""
#The while loop will read each page
while count < num_pages:
pageObj = pdfReader.getPage(count)
count +=1
text += pageObj.extractText()
#This if statement exists to check if the above library returned #words. It's done because PyPDF2 cannot read scanned files.
if text != "":
text = text
#If the above returns as False, we run the OCR library textract to #convert scanned/image based PDF files into text
else:
text = textract.process(fileurl, method='tesseract', language='eng')
# Now we have a text variable which contains all the text derived #from our PDF file. Type print(text) to see what it contains. It #likely contains a lot of spaces, possibly junk such as '\n' etc.
# Now, we will clean our text variable, and return it as a list of keywords.
Step 3: Convert text into keywords
#The word_tokenize() function will break our text phrases into #individual words
tokens = word_tokenize(text)
#we'll create a new list which contains punctuation we wish to clean
punctuations = ['(',')',';',':','[',']',',']
#We initialize the stopwords variable which is a list of words like #"The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
#We create a list comprehension which only returns a list of words #that are NOT IN stop_words and NOT IN punctuations.
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
Now you have keywords for your file stored as a list. You can do whatever you want with it. Store it in a spreadsheet if you want to make the PDF searchable, or parse a lot of files and conduct a cluster analysis. You can also use it to create a recommender system for resumes for jobs ;)

Python:Comparing the strings in two files and printing the match does not works properly

I am trying to compare the strings of file "formatted_words.txt" with another customised file "dictionary.txt" and in the output I am trying to print those words from "formatted_words.txt"formatted_words file which are present in file "dictionary.txt"dictionary file.
from itertools import izip
with open("formatted_words.txt") as words_file:
with open("dictionary.txt") as dict_file:
all_strings = list(map(str.strip,dict_file))
for word in words_file:
for a_string in all_strings:
if word in a_string:
print a_string
Nevertheless, in the output, all the words of the file "formatted_words.txt" are getting printed, though many words from this file are not in the "dictionary.txt".I cannot use any builtin python dictionary.Any help would be appreciated.
Using sets:
with open('formatted_words.txt') as words_file:
with open('dictionary.txt') as dict_file:
all_strings = set(map(str.strip, dict_file))
words = set(map(str.strip, words_file))
for word in all_strings.intersection(words):
print(word)
Prints nothing because the intersection is empty

How can I remove all POS tags except for 'VBD' and 'VBN' from my CSV file?

I want to remove words tagged with the specific part-of-speech tags VBD and VBN from my CSV file. But, I'm getting the error "IndexError: list index out of range" after entering the following code:
for word in POS_tag_text_clean:
if word[1] !='VBD' and word[1] !='VBN':
words.append(word[0])
My CSV file has 10 reviews of 10 people and the row name is Comment.
Here is my full code:
df_Comment = pd.read_csv("myfile.csv")
def clean(text):
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
tagged = nltk.pos_tag(text)
text = text.rstrip()
text = re.sub(r'[^a-zA-Z]', ' ', text)
stop_free = " ".join([i for i in text.lower().split() if((i not in stop) and (not i.isdigit()))])
punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
return normalized
text_clean = []
for text in df)Comment['Comment']:
text_clean.append(clean(text).split())
print(text_clean)
POS_tag_text_clean = [nltk.pos_tag(t) for t in text_clean]
print(POS_tag_text_clean)
words=[]
for word in POS_tag_text_clean:
if word[1] !='VBD' and word[1] !='VBN':
words.append(word[0])
How can I fix the error?
It is a bit hard to understand your problem without an example and the corresponding outputs, but it might be this:
Assuming that text is a string, text_clean will be a list of lists of strings, where every string represents a word. After the part-of-speech tagging, POS_tag_text_clean will therefore be a list of lists of tuples, each tuple containing a word and its tag.
If I'm right, then your last loop actually loops over items from your dataframe instead of words, as the name of the variable suggests. If an item has only one word (which is not so unlikely, since you filter a lot in clean()), your call to word[1] will fail with an error similar to the one you report.
Instead, try this code:
words = []
for item in POS_tag_text_clean:
words_in_item = []
for word in item:
if word[1] !='VBD' and word[1] !='VBN':
words_in_item .append(word[0])
words.append(words_in_item)

i get this error "expexted string or buffer"

file = open("C:\\Users\\file.txt")
text = file.read()
def ie_preprocess(text):
sent_tokenizer = PunktSentenceTokenizer(text)
sents=sent_tokenizer.tokenize(text)
print(sents)
word_tokenizer = WordPunctTokenizer()
words =nltk.word_tokenize(sents)
print(words)
tagges = nltk.pos_tag(words)
print(tagges)
ie_preprocess(text)
nltk.word_tokenize() takes in text which is expected to be a string, but you are passing in sents which is a list of sentences.
Instead, you want:
words = nltk.word_tokenize(text)
If you would like to tokenize each sentence into a list of words and get this back as a list of lists, you could use
words = [nltk.word_tokenize(sentence) for sentence in sents]

Resources