How to remove duplicate sentences from paragraph using NLTK? - python-3.x

I have a huge document with many repeated sentences (footer text, hyperlinks with alphanumeric characters), and I need to get rid of those repeated hyperlinks and footer text. I tried the code below but unfortunately couldn't get it to work. Please review and help.
corpus = "We use file handling methods in python to remove duplicate lines in python text file or function. The text file or function has to be in the same directory as the python program file. Following code is one way of removing duplicates in a text file bar.txt and the output is stored in foo.txt. These files should be in the same directory as the python script file, else it won’t work.Now, we should crop our big image to extract small images with amounts.In terms of topic modelling, the composites are documents and the parts are words and/or phrases (phrases n words in length are referred to as n-grams).We use file handling methods in python to remove duplicate lines in python text file or function.As an example I will use some image of a bill, saved in the pdf format. From this bill I want to extract some amounts.All our wrappers, except of textract, can’t work with the pdf format, so we should transform our pdf file to the image (jpg). We will use wand for this.Now, we should crop our big image to extract small images with amounts."
from nltk.tokenize import sent_tokenize

sentences_with_dups = []
for sentence in corpus:
    words = sentence.sent_tokenize(corpus)
    if len(set(words)) != len(words):
        sentences_with_dups.append(sentence)
        print(sentences_with_dups)
    else:
        print('No duplicates found')
Error message for the above code :
AttributeError: 'str' object has no attribute 'sent_tokenize'
Desired Output :
Duplicates = ['We use file handling methods in python to remove duplicate lines in python text file or function.','Now, we should crop our big image to extract small images with amounts.']
Cleaned_corpus = {removed duplicates from corpus}

First of all, the example you provided is messed up: there are missing spaces between the final period of one sentence and the start of the next, so I cleaned that up first.
Then you can do:
from nltk.tokenize import sent_tokenize

corpus = "......"
sentences = sent_tokenize(corpus)
duplicates = list(set([s for s in sentences if sentences.count(s) > 1]))
cleaned = list(set(sentences))
The above will lose the original order. If you care about the order, you can do the following to preserve it:
duplicates = []
cleaned = []
for s in sentences:
    if s in cleaned:
        if s in duplicates:
            continue
        else:
            duplicates.append(s)
    else:
        cleaned.append(s)
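For what it's worth, the first-seen de-duplication can also be sketched with dict.fromkeys, which preserves insertion order in Python 3.7+; sentences here stands in for any tokenized sentence list:

```python
sentences = ["A b.", "C d.", "A b.", "E f.", "C d."]

# dict.fromkeys keeps the first occurrence of each sentence, in order
cleaned = list(dict.fromkeys(sentences))

# a sentence is a duplicate if it is seen again; keep first-seen order
seen = set()
duplicates = []
for s in sentences:
    if s in seen and s not in duplicates:
        duplicates.append(s)
    seen.add(s)

print(cleaned)     # ['A b.', 'C d.', 'E f.']
print(duplicates)  # ['A b.', 'C d.']
```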

Related

Automating The Boring Stuff With Python - Chapter 8 - Exercise - Regex Search

I'm trying to complete the exercise for Chapter 8, which takes a user-supplied regular expression and uses it to search each string in each text file in a folder.
I keep getting the error:
AttributeError: 'NoneType' object has no attribute 'group'
The code is here:
import os, glob, re

os.chdir("C:\Automating The Boring Stuff With Python\Chapter 8 - \
Reading and Writing Files\Practice Projects\RegexSearchTextFiles")

userRegex = re.compile(input('Enter your Regex expression :'))

for textFile in glob.glob("*.txt"):
    currentFile = open(textFile)  # open the text file and assign it to a file object
    textCurrentFile = currentFile.read()  # read the contents of the text file and assign to a variable
    print(textCurrentFile)
    #print(type(textCurrentFile))
    searchedText = userRegex.search(textCurrentFile)
    searchedText.group()
When I try this individually in the IDLE shell it works:
textCurrentFile = "What is life like for those left behind when the last foreign troops flew out of Afghanistan? Four people from cities and provinces around the country told the BBC they had lost basic freedoms and were struggling to survive."
>>> userRegex = re.compile(input('Enter the your Regex expression :'))
Enter the your Regex expression :troops
>>> searchedText = userRegex.search(textCurrentFile)
>>> searchedText.group()
'troops'
But I can't seem to make it work in the code when I run it. I'm really confused.
Thanks
Since you are just looping across all .txt files, there could be files that don't contain the word "troops". To prove this, don't call .group(); just perform:
print(textFile, textCurrentFile, searchedText)
If you see that searchedText is None, then that means the contents of that textFile (which is textCurrentFile) don't contain the word "troops".
You could either:
Add the word troops in all .txt files.
Only select the target .txt files, not all.
Check first if the match is found before accessing .group():
print(searchedText.group() if searchedText else None)
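A minimal sketch of that guard, using a throwaway pattern and in-memory strings in place of the folder of .txt files:

```python
import re

user_regex = re.compile(r'troops')  # stands in for the user-supplied pattern

files = {
    'a.txt': 'the troops flew out',
    'b.txt': 'no match in this one',
}

results = {}
for name, contents in files.items():
    match = user_regex.search(contents)
    # .search() returns None when nothing matches; guard before calling .group()
    results[name] = match.group() if match else None

print(results)  # {'a.txt': 'troops', 'b.txt': None}
```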

PIL Drawing text and breaking lines at \n

Hi, I'm having trouble drawing longer texts, which should be line-broken at specific points, onto an image: it always just prints the \n along with the text instead of breaking the line, and I can't find any info online on whether this is even possible, or if PIL just draws the raw string onto the image without checking for line breaks. The text is just some random stuff from a CSV.
def place_text(self, text, x, y):
    temp = self.csv_input[int(c)][count]
    font = ImageFont.truetype('arial.ttf', 35)  # font e.g.: arial.ttf
    w_txt, h_txt = font.getsize(text)
    print("Jetzt sind wie in der zweiten möglichkeit")
    draw_text = ImageDraw.Draw(self.card[self.cardCount])
    draw_text.text((x, y), temp, fill="black", font=font, align="left")
Yeah, I know this code is kind of all over the place, but for putting the text on the image that shouldn't cause any issues, should it?
Writing text on an image with this results in just one line of continuous text with the \n's still in there and no line breaks.
Found the answer: the string pulled from the CSV had to be decoded again before being placed.
text = bytes(text, 'utf-8').decode("unicode_escape")
did the trick.
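A stripped-down sketch of that fix, without the PIL parts: text read from a CSV can contain the two literal characters backslash and n, and unicode_escape turns them into a real newline that ImageDraw.text will then honor:

```python
raw = "line one\\nline two"  # what comes out of the CSV: literal backslash + n
assert "\n" not in raw       # no real newline yet

# decode the escape sequences into real control characters
text = bytes(raw, "utf-8").decode("unicode_escape")
print(text.splitlines())  # ['line one', 'line two']
```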

How to extract text from a PDF file using Python? I have never done this and am not getting the DOM of the PDF file

This is my PDF file: "https://drive.google.com/open?id=1M9k1AO17ZSwT6HTrTrB-uz85ps3WL1wS"
Can someone help me extract this? Searching on SO, I got some clues about extracting text using these libraries: PyPDF2, PyPDF2.pdf, PageObject, u_, ContentStream, b_, TextStringObject, but I can't work out how to use them.
Please help me extract this with some explanation, so I can understand the code, and tell me how to read the DOM of a PDF file.
You need to install some libraries:
pip install PyPDF2
pip install textract
pip install nltk
This will download the libraries you require to parse PDF documents and extract keywords. In order to do this, make sure your PDF file is stored within the folder where you're writing your script.
Startup your favourite editor and type:
Note: All lines starting with # are comments.
Step 1: Import all libraries:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
Step 2: Read PDF File
#write a for-loop to open many files -- leave a comment if you'd like to learn how
filename = 'enter the name of the file here'

#open allows you to read the file
pdfFileObj = open(filename, 'rb')

#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

#discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""

#The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()

#This if statement exists to check if the above library returned words.
#It's done because PyPDF2 cannot read scanned files.
if text != "":
    text = text
#If the above returns as False, we run the OCR library textract to
#convert scanned/image-based PDF files into text
else:
    text = textract.process(filename, method='tesseract', language='eng')
# Now we have a text variable which contains all the text derived from our PDF file.
# Type print(text) to see what it contains. It likely contains a lot of spaces, possibly junk such as '\n' etc.
# Now, we will clean our text variable, and return it as a list of keywords.
Step 3: Convert text into keywords
#The word_tokenize() function will break our text phrases into individual words
tokens = word_tokenize(text)

#we'll create a new list which contains punctuation we wish to clean
punctuations = ['(',')',';',':','[',']',',']

#We initialize the stopwords variable, which is a list of words like "The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')

#We create a list comprehension which only returns words that are NOT IN stop_words and NOT IN punctuations
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
Now you have keywords for your file stored as a list. You can do whatever you want with it. Store it in a spreadsheet if you want to make the PDF searchable, or parse a lot of files and conduct a cluster analysis. You can also use it to create a recommender system for resumes for jobs ;)
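As a self-contained illustration of Step 3 (without needing nltk.download), here is the same filter with a tiny hardcoded stopword list and a plain split(); the real code should keep using word_tokenize and stopwords.words('english'):

```python
text = "The invoice lists the total amount, and the due date."

tokens = text.split()  # word_tokenize is finer-grained, but split() shows the idea
punctuations = ['(', ')', ';', ':', '[', ']', ',']
stop_words = {'the', 'and', 'a', 'is', 'in'}  # tiny stand-in for nltk's English list

# keep words that are neither stopwords nor punctuation, stripping trailing marks
keywords = [w.strip(',.').lower() for w in tokens
            if w.lower() not in stop_words and w not in punctuations]
print(keywords)  # ['invoice', 'lists', 'total', 'amount', 'due', 'date']
```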

Why the output of "open" function doesn't allow me to attribute index?

I started to learn programming in Python 3, and I am doing a project that reads the content of a text file and tells you how many words are in the file. Being me, I always want to challenge myself, so I tried to add the name of the file to the output message, so that in the future I can build a GUI for it, and so on.
The error that I get is: AttributeError: '_io.TextIOWrapper' object has no attribute 'index'
Here is my code:
# Open text file
document = open("text2.txt", "r+")
# Reads the text file and splits it into arrays
text_split = document.read().split()
# Count the words
words = len(text_split)
# Display the counted words
document_name = document[document.index("name=")]
output = "In the file {} there are {} words.".format(document_name, words)
print (output)
Decided to take @Jean-François Fabre's advice and abandoned the idea of also outputting the name of the file (FOR NOW).
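For the record, the filename is already available without indexing into anything: Python file objects expose the path passed to open() as a .name attribute. A minimal sketch using a temporary file so it runs anywhere:

```python
import os
import tempfile

# create a small throwaway file so the example is self-contained
path = os.path.join(tempfile.gettempdir(), "text2.txt")
with open(path, "w") as f:
    f.write("one two three")

with open(path, "r") as document:
    words = len(document.read().split())
    # .name holds the path that was given to open()
    document_name = os.path.basename(document.name)

print("In the file {} there are {} words.".format(document_name, words))
# In the file text2.txt there are 3 words.
```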

Save Tweets as .csv, Contains String Literals and Entities

I have tweets saved in JSON text files. I have a friend who wants tweets containing keywords, and the tweets need to be saved in a .csv. Finding the tweets is easy, but I run into two problems and am struggling with finding a good solution.
Sample data are here. I have included the .csv file that is not working as well as a file where each row is a tweet in JSON format.
To get into a dataframe, I use pd.io.json.json_normalize. It works smoothly and handles nested dictionaries well, but pd.to_csv does not work because it does not handle, as far as I can tell, string literals well. Some of the tweets contain '\n' in the text field, and pandas writes new lines when that happens.
No problem, I process pd['text'] to remove '\n'. The resulting file still has too many rows, 1863 compared to the 1388 it should. I then modified my code to replace all string-literals:
tweets['text'] = [item.replace('\n', '') for item in tweets['text']]
tweets['text'] = [item.replace('\r', '') for item in tweets['text']]
tweets['text'] = [item.replace('\\', '') for item in tweets['text']]
tweets['text'] = [item.replace('\'', '') for item in tweets['text']]
tweets['text'] = [item.replace('\"', '') for item in tweets['text']]
tweets['text'] = [item.replace('\a', '') for item in tweets['text']]
tweets['text'] = [item.replace('\b', '') for item in tweets['text']]
tweets['text'] = [item.replace('\f', '') for item in tweets['text']]
tweets['text'] = [item.replace('\t', '') for item in tweets['text']]
tweets['text'] = [item.replace('\v', '') for item in tweets['text']]
Same result, pd.to_csv saves a file with more rows than actual tweets. I could replace string literals in all columns, but that is clunky.
Fine, don't use pandas. with open(outpath, 'w') as f: and so on creates a .csv file with the correct number of rows. Reading the file, either with pd.read_csv or reading line by line will fail, however.
It fails because of how Twitter handles entities. If a tweet's text contains a url, mention, hashtag, media, or link, then Twitter returns a dictionary that contains commas. When pandas flattens the tweet, the commas get preserved within a column, which is good. But when the data are read in, pandas splits what should be one column into multiple columns. For example, a column might look like [{'screen_name': 'ProfOsinbajo','name': 'Prof Yemi Osinbajo','id': 2914442873,'id_str': '2914442873', 'indices': [0,' 13]}]', so splitting on commas creates too many columns:
[{'screen_name': 'ProfOsinbajo',
'name': 'Prof Yemi Osinbajo',
'id': 2914442873",
'id_str': '2914442873'",
'indices': [0,
13]}]
That is the outcome when I use with open(outpath) as f: as well. With that approach, I have to split lines, so I split on commas. Same problem: I do not want to split on commas when they occur inside a list.
I want those data to be treated as one column when saved to file or read from file. What am I missing? In terms of the data at the repository above, I want to convert forstackoverflow2.txt to a .csv with as many rows as tweets. Call this file A.csv, and let's say it has 100 columns. When opened, A.csv should also have 100 columns.
I'm sure there are details I've left out, so please let me know.
Using the csv module works. It writes the file out as a .csv while counting the lines, then reads it back in and counts the lines again.
The result matched, and opening the .csv in Excel also gives 191 columns and 1338 lines of data.
import json
import csv

with open('forstackoverflow2.txt') as f,\
     open('out.csv','w',encoding='utf-8-sig',newline='') as out:
    data = json.loads(next(f))
    print('columns',len(data))
    writer = csv.DictWriter(out,fieldnames=sorted(data))
    writer.writeheader()  # write header
    writer.writerow(data) # write the first line of data
    for i,line in enumerate(f,2): # start line count at two
        data = json.loads(line)
        writer.writerow(data)
    print('lines',i)

with open('out.csv',encoding='utf-8-sig',newline='') as f:
    r = csv.DictReader(f)
    lines = list(r)
    print('readback columns',len(lines[0]))
    print('readback lines',len(lines))
Output:
columns 191
lines 1338
readback lines 1338
readback columns 191
@Mark Tolonen's answer is helpful, but I ended up going a separate route. When saving the tweets to file, I removed all \r, \n, \t, and \0 characters from anywhere in the JSON. Then I saved the file as tab-separated so that commas in fields like location or text do not confuse a read function.
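The reason the csv module succeeds where naive comma-splitting fails is that it quotes fields containing commas or newlines on the way out, and its reader honors those quotes on the way back in. A minimal sketch with an in-memory file:

```python
import csv
import io

row = {'text': 'line one\nwith, commas', 'id': '2914442873'}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['text', 'id'])
writer.writeheader()
writer.writerow(row)  # the text field is quoted automatically

buf.seek(0)
readback = list(csv.DictReader(buf))
# the embedded comma and newline survive the round trip as one field
print(readback[0]['text'] == row['text'])  # True
```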