python not removing punctuation - python-3.x

i have a text file i want to remove punctuation and save it as a new file but it is not removing anything any idea why?
code:
def punctuation(string):
punctuations = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
for x in string.lower():
if x in punctuations:
string = string.replace(x, "")
# Print string without punctuation
print(string)
file = open('ir500.txt', 'r+')
file_no_punc = (file.read())
punctuation(l)
with open('ir500_no_punc.txt', 'w') as file:
file.write(file_no_punc)
removing any punctuation why?

def punctuation(string):
punctuations = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
for x in string.lower():
if x in punctuations:
string = string.replace(x, "")
# return string without punctuation
return string
file = open('ir500.txt', 'r+')
file_no_punc = (file.read())
file_no_punc = punctuation(file_no_punc)
with open('ir500_no_punc.txt', 'w') as file:
file.write(file_no_punc)
Explanation:
I changed only punctuation(l) to file_no_punc = punctuation(file_no_punc) and print(string) to return string
1) what is l in punctuation(l) ?
2) you are calling punctuation() - which works correctly - but do not use its return value
3) because it is not currently returning a value, just printing it ;-)
Please note that I made only the minimal change to make it work. You might want to post it to our code review site, to see how it could be improved.
Also, I would recommend that you get a good IDE. In my opinion, you cannot beat PyCharm community edition. Learn how to use the debugger; it is your best friend. Set breakpoints, run the code; it will stop when it hits a breakpoint; you can then examine the values of your variables.

taking out the file reading/writing, you could to remove the punctuation from a string like this:
table = str.maketrans("", "", r"!()-[]{};:'\"\,<>./?##$%^&*_~")
# # or maybe even better
# import string
# table = str.maketrans("", "", string.punctuation)
file_with_punc = r"abc!()-[]{};:'\"\,<>./?##$%^&*_~def"
file_no_punc = file_with_punc.lower().translate(table)
# abcdef
where i use str.maketrans and str.translate.
note that python strings are immutable. there is no way to change a given string; every operation you perform on a string will return a new instance.

Related

Creating a python spellchecker using tkinter

For school, I need to create a spell checker, using python. I decided to do it using a GUI created with tkinter. I need to be able to input a text (.txt) file that will be checked, and a dictionary file, also a text file. The program needs to open both files, check the check file against the dictionary file, and then display any words that are misspelled.
Here's my code:
import tkinter as tk
from tkinter.filedialog import askopenfilename
def checkFile():
# get the sequence of words from a file
text = open(file_ent.get())
dictDoc = open(dict_ent.get())
for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':
text = text.replace(ch, ' ')
words = text.split()
# make a dictionary of the word counts
wordDict = {}
for w in words:
wordDict[w] = wordDict.get(w,0) + 1
for k in dictDict:
dictDoc.pop(k, None)
misspell_lbl["text"] = dictDoc
# Set-up the window
window = tk.Tk()
window.title("Temperature Converter")
window.resizable(width=False, height=False)
# Setup Layout
frame_a = tk.Frame(master=window)
file_lbl = tk.Label(master=frame_a, text="File Name")
space_lbl = tk.Label(master=frame_a, width = 6)
dict_lbl =tk.Label(master=frame_a, text="Dictionary File")
file_lbl.pack(side=tk.LEFT)
space_lbl.pack(side=tk.LEFT)
dict_lbl.pack(side=tk.LEFT)
frame_b = tk.Frame(master=window)
file_ent = tk.Entry(master=frame_b, width=20)
dict_ent = tk.Entry(master=frame_b, width=20)
file_ent.pack(side=tk.LEFT)
dict_ent.pack(side=tk.LEFT)
check_btn = tk.Button(master=window, text="Spellcheck", command=checkFile)
frame_c = tk.Frame(master=window)
message_lbl = tk.Label(master=frame_c, text="Misspelled Words:")
misspell_lbl = tk.Label(master=frame_c, text="")
message_lbl.pack()
misspell_lbl.pack()
frame_a.pack()
frame_b.pack()
check_btn.pack()
frame_c.pack()
# Run the application
window.mainloop()
I want the file to check against the dictionary and display the misspelled words in the misspell_lbl.
The test files I'm using to make it work, and to submit with the assignment are here:
check file
dictionary file
I preloaded the files to the site that I'm submitting this on, so it should just be a matter of entering the file name and extension, not the entire path.
I'm pretty sure the problem is with my function to read and check the file, I've been beating my head on a wall trying to solve this, and I'm stuck. Any help would be greatly appreciated.
Thanks.
The first problem is with how you try to read the files. open(...) will return a _io.TextIOWrapper object, not a string and this is what causes your error. To get the text from the file, you need to use .read(), like this:
def checkFile():
# get the sequence of words from a file
with open(file_ent.get()) as f:
text = f.read()
with open(dict_ent.get()) as f:
dictDoc = f.read().splitlines()
The with open(...) as f part gives you a file object called f, and automatically closes the file when it's done. This is more concise version of
f = open(...)
text = f.read()
f.close()
f.read() will get the text from the file. For the dictionary I also added .splitlines() to turn the newline separated text into a list.
I couldn't really see where you'd tried to check for misspelled words, but you can do it with a list comprehension.
misspelled = [x for x in words if x not in dictDoc]
This gets every word which is not in the dictionary file and adds it to a list called misspelled. Altogether, the checkFile function now looks like this, and works as expected:
def checkFile():
# get the sequence of words from a file
with open(file_ent.get()) as f:
text = f.read()
with open(dict_ent.get()) as f:
dictDoc = f.read().splitlines()
for ch in '!"#$%&()*+,-./:;<=>?#[\\]^_`{|}~':
text = text.replace(ch, ' ')
words = text.split()
# make a dictionary of the word counts
wordDict = {}
for w in words:
wordDict[w] = wordDict.get(w,0) + 1
misspelled = [x for x in words if x not in dictDoc]
misspell_lbl["text"] = misspelled

How to split strings from .txt file into a list, sorted from A-Z without duplicates?

For instance, the .txt file includes 2 lines, separated by commas:
John, George, Tom
Mark, James, Tom,
Output should be:
[George, James, John, Mark, Tom]
The following will create the list and store each item as a string.
def test(path):
filename = path
with open(filename) as f:
f = f.read()
f_list = f.split('\n')
for i in f_list:
if i == '':
f_list.remove(i)
res1 = []
for i in f_list:
res1.append(i.split(', '))
res2 = []
for i in res1:
res2 += i
res3 = [i.strip(',') for i in res2]
for i in res3:
if res3.count(i) != 1:
res3.remove(i)
res3.sort()
return res3
print(test('location/of/file.txt'))
Output:
['George', 'James', 'John', 'Mark', 'Tom']
Your file opening is fine, although the 'r' is redundant since that's the default. You claim it's not, but it is. Read the documentation.
You have not described what task is so I have no idea what's going on there. I will assume that it is correct.
Rather than populating a list and doing a membership test on every iteration - which is O(n^2) in time - can you think of a different data structure that guarantees uniqueness? Google will be your friend here. Once you discover this data structure, you will not have to perform membership checks at all. You seem to be struggling with this concept; the answer is a set.
The input data format is not rigorously defined. Separators may be commas or commas with trailing spaces, and may appear (or not) at the end of the line. Consider making an appropriate regular expression and using its splitting feature to split individual lines, though normal splitting and stripping may be easier to start.
In the following example code, I've:
ignored task since you've said that that's fine;
separated actual parsing of file content from parsing of in-memory content to demonstrate the function without a file;
used a set comprehension to store unique results of all split lines; and
used a generator to sorted that drops empty strings.
from io import StringIO
from typing import TextIO, List
def parse(f: TextIO) -> List[str]:
words = {
word.strip()
for line in f
for word in line.split(',')
}
return sorted(
word for word in words if word != ''
)
def parse_file(filename: str) -> List[str]:
with open(filename) as f:
return parse(f)
def test():
f = StringIO('John, George , Tom\nMark, James, Tom, ')
words = parse(f)
assert words == [
'George', 'James', 'John', 'Mark', 'Tom',
]
f = StringIO(' Han Solo, Boba Fet \n')
words = parse(f)
assert words == [
'Boba Fet', 'Han Solo',
]
if __name__ == '__main__':
test()
I came up with a very simple solution if anyone will need:
lines = x.read().split()
lines.sort()
new_list = []
[new_list.append(word) for word in lines if word not in new_list]
return new_list
with open("text.txt", "r") as fl:
list_ = set()
for line in fl.readlines():
line = line.strip("\n")
line = line.split(",")
[list_.add(_) for _ in line if _ != '']
print(list_)
I think that you missed a comma after Jim in the first line.
You can avoid the use of a loop by using split property :
content=file.read()
my_list=content.split(",")
to delete the occurence in your list you can transform it to set :
my_list=list(set(my_list))
then you can sort it using sorted
so the finale code :
with open("file.txt", "r") as file :
content=file.read()
my_list=content.replace("\n","").replace(" ", "").split(",")
result=sorted(list(set(my_list)))
you can add a key to your sort function

Extract characters within certain symbols

I have extracted text from an HTML file, and have the whole thing in a string.
I am looking for a method to loop through the string, and extract only values that are within square brackets and put strings in a list.
I have looked in to several questions, among them this one: Extract character before and after "/"
But i am having a hard time modifying it. Can someone help?
Solved!
Thank you for all your inputs, I will definitely look more into regex. I managed to do what i wanted in a pretty manual way (may not be beautiful):
#remove all html code and append to string
for i in html_file:
html_string += str(html2text.html2text(i))
#set this boolean if current character is either [ or ]
add = False
#extract only values within [ or ], based on add = T/F
for i in html_string:
if i == '[':
add = True
if i == ']':
add = False
clean_string += str(i)
if add == True:
clean_string += str(i)
#split string into list without square brackets
clean_string_list = clean_string.split('][')
The HTML file I am looking to get as pure text (dataframe later on) instead of HTML, is my personal Facebook data that i have downloaded.
Try out this regex, given a string it will place all text inside [ ] into a list.
import re
print(re.findall(r'\[(\w+)\]','spam[eggs][hello]'))
>>> ['eggs', 'hello']
Also this is a great reference for building your own regex.
https://regex101.com
EDIT: If you have nested square brackets here is a function that will handle that case.
import re
test ='spam[eg[nested]gs][hello]'
def square_bracket_text(test_text,found):
"""Find text enclosed in square brackets within a string"""
matches = re.findall(r'\[(\w+)\]',test_text)
if matches:
found.extend(matches)
for word in found:
test_text = test_text.replace('[' + word + ']','')
square_bracket_text(test_text,found)
return found
match = []
print(square_bracket_text(test,match))
>>>['nested', 'hello', 'eggs']
hope it helps!
You can also use re.finditer() for this, see below example.
Let suppose, we have word characters inside brackets so regular expression will be \[\w+\].
If you wish, check it at https://rextester.com/XEMOU85362.
import re
s = "<h1>Hello [Programmer], you are [Excellent]</h1>"
g = re.finditer("\[\w+\]", s)
l = list() # or, l = []
for m in g:
text = m.group(0)
l.append(text[1: -1])
print(l) # ['Programmer', 'Excellent']

Python changing file name

My application offers the ability to the user to export its results. My application exports text files with name Exp_Text_1, Exp_Text_2 etc. I want it so that if a file with the same file name pre-exists in Desktop then to start counting from this number upwards. For example if a file with name Exp_Text_3 is already in Desktop, then I want the file to be created to have the name Exp_Text_4.
This is my code:
if len(str(self.Output_Box.get("1.0", "end"))) == 1:
self.User_Line_Text.set("Nothing to export!")
else:
import os.path
self.txt_file_num = self.txt_file_num + 1
file_name = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt" + "_" + str(self.txt_file_num) + ".txt")
file = open(file_name, "a")
file.write(self.Output_Box.get("1.0", "end"))
file.close()
self.User_Line_Text.set("A text file has been exported to Desktop!")
you likely want os.path.exists:
>>> import os
>>> help(os.path.exists)
Help on function exists in module genericpath:
exists(path)
Test whether a path exists. Returns False for broken symbolic links
a very basic example would be create a file name with a formatting mark to insert the number for multiple checks:
import os
name_to_format = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_{}.txt")
#the "{}" is a formatting mark so we can do file_name.format(num)
num = 1
while os.path.exists(name_to_format.format(num)):
num+=1
new_file_name = name_to_format.format(num)
this would check each filename starting with Exp_Txt_1.txt then Exp_Txt_2.txt etc. until it finds one that does not exist.
However the format mark may cause a problem if curly brackets {} are part of the rest of the path, so it may be preferable to do something like this:
import os
def get_file_name(num):
return os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_" + str(num) + ".txt")
num = 1
while os.path.exists(get_file_name(num)):
num+=1
new_file_name = get_file_name(num)
EDIT: answer to why don't we need get_file_name function in first example?
First off if you are unfamiliar with str.format you may want to look at Python doc - common string operations and/or this simple example:
text = "Hello {}, my name is {}."
x = text.format("Kotropoulos","Tadhg")
print(x)
print(text)
The path string is figured out with this line:
name_to_format = os.path.join(os.path.expanduser("~"), "Desktop", "Exp_Txt_{}.txt")
But it has {} in the place of the desired number. (since we don't know what the number should be at this point) so if the path was for example:
name_to_format = "/Users/Tadhg/Desktop/Exp_Txt_{}.txt"
then we can insert a number with:
print(name_to_format.format(1))
print(name_to_format.format(2))
and this does not change name_to_format since str objects are Immutable so the .format returns a new string without modifying name_to_format. However we would run into a problem if out path was something like these:
name_to_format = "/Users/Bob{Cat}/Desktop/Exp_Txt_{}.txt"
#or
name_to_format = "/Users/Bobcat{}/Desktop/Exp_Txt_{}.txt"
#or
name_to_format = "/Users/Smiley{:/Desktop/Exp_Txt_{}.txt"
Since the formatting mark we want to use is no longer the only curly brackets and we can get a variety of errors:
KeyError: 'Cat'
IndexError: tuple index out of range
ValueError: unmatched '{' in format spec
So you only want to rely on str.format when you know it is safe to use. Hope this helps, have fun coding!

Insert spaces next to punctuation when writing to .txt file

I have written a function that uses an nltk tokenizer to preprocess .txt files. Basically, the function takes a .txt file, modifies it so that each sentence appears on a separate line, and overwrites the modified file on the old file.
I would like to modify the function (or maybe to create another function) to also insert spaces before punctuation and sometimes after punctuation, as in the case of a parenthesis. In other words, leaving aside what the function already does, I also would like it to change "I want to write good, clean sentences." into "I want to write good , clean sentences ."
I am a beginner, and I suspect I probably am just missing something pretty simple. A little help would be much appreciated.
My existing code is below:
import nltk.data
def readtowrite(filename):
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
with open(filename, 'r+') as f:
fout = str(f.read())
stuff = str('\n'.join(sent_detector.tokenize(fout.strip())))
f.seek(0)
f.write(stuff)
Here is the answer I came up with. Basically, I created a separate function to insert spaces before and after the punctuation in a sentence. I then called that function in the readtowrite function.
Code below:
import string
import nltk.data
def strip_punct(sentence):
wordlist = []
for word in sentence:
for char in word:
cleanword = ""
if char in string.punctuation:
char = " " + char + " "
cleanword += char
wordlist.append(cleanword)
return ''.join(wordlist)
def readtowrite(filename):
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
with open(filename, 'r+') as f:
fout = str(f.read())
stuff = str('\n'.join(sent_detector.tokenize(fout.strip())))
morestuff = str(strip_punct(stuff))
f.seek(0)
f.write(morestuff)
I think loading nltk.data.load('tokenizers/punkt/english.pickle') is equivalent to calling the sent_tokenize() and word_tokenize function in NLTK.
Maybe this script will be more helpful:
def readtowrite(infile, outfile):
with open(outfile, 'w') as fout:
with open(filename, 'r') as fin:
output = "\n".join([" ".join(word_tokenize(i)) for i in sent_tokenize(str(f.read()))])
fout.write(output)

Resources