I'm attempting to make a program that removes all punctuation from a text file, but I keep running into a problem: it only prints the name of the file rather than the contents of the file.
def removePunctuation(word):
    punctuation_list=['.', ',', '?', '!', ';', ':', '\\', '/', "'", '"']
    for character in word:
        for punctuation in punctuation_list:
            if character == punctuation:
                word = word.replace(punctuation, "")
    return word
print(removePunctuation('phrases.txt'))
Whenever I run the code, it just prints the name of the file, 'phrasestxt', without any punctuation. I want the program to print all the text in the document, which is a few paragraphs long. Any help would be appreciated!
In this case, you must open your file and read it:
def removePunctuation(file_path):
    with open(file_path, 'r') as fd:
        word = fd.read()
    punctuation_list=['.', ',', '?', '!', ';', ':', '\\', '/', "'", '"']
    for character in word:
        for punctuation in punctuation_list:
            if character == punctuation:
                word = word.replace(punctuation, "")
    return word
print(removePunctuation('phrases.txt'))
If you want, you can replace your double for loop with:
word = "".join([i for i in word if i not in punctuation_list])
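As a further alternative, the same filtering can be sketched with str.translate, which deletes every listed character in one pass. This is only a minimal sketch and not part of the answer above: the name strip_punctuation and the sample sentence are made up for illustration, and it assumes the text has already been read from the file into a string.

```python
# Same characters as punctuation_list in the answer above.
PUNCTUATION = '.,?!;:\\/\'"'

# The third argument of str.maketrans lists characters to map to None,
# which makes str.translate delete them.
DELETE_TABLE = str.maketrans('', '', PUNCTUATION)

def strip_punctuation(text):
    """Return text with every character in PUNCTUATION removed (hypothetical helper)."""
    return text.translate(DELETE_TABLE)

sample = 'Hello, world! Is this "phrases.txt"?'
print(strip_punctuation(sample))  # Hello world Is this phrasestxt
```

To use it on the file, pass in the result of fd.read() instead of the sample string.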
I am trying to parse a long string of 'objects' enclosed by quotes and delimited by commas. Example:
s='"12345","X","description of x","X,Y",,,"345355"'
output=['"12345"','"X"','"description of x"','"X,Y"','','','"345355"']
I am using split to delimit by commas:
s='"12345","X","description of x","X,Y",,,"345355"'
s.split(',')
This almost works, but the string segment ...,"X,Y",... is split inside the quotes into "X and Y". I need the split to ignore commas inside quotes.
Is there a way I can split on commas except for those inside quotes?
I tried using a regex, but it ignores the ...,,,... in the data because blank fields are not quoted in the file I'm parsing. I am not an expert with regex; the sample I used is from "Python split string on quotes". I understand what that example is doing, but I am not sure how to modify it to parse data that is not enclosed by quotes.
Thanks!
Split by " (quote) instead of by , (comma). This splits the string into a list that contains extra comma-only elements, which you can then remove. Note that this also drops the empty fields:
s='"12345","X","description of x","X,Y",,,"345355"'
temp = s.split('"')
print(temp)
#> ['', '12345', ',', 'X', ',', 'description of x', ',', 'X,Y', ',,,', '345355', '']
values_to_remove = ['', ',', ',,,']
result = list(filter(lambda val: val not in values_to_remove, temp))
print(result)
#> ['12345', 'X', 'description of x', 'X,Y', '345355']
This should work:
In [1]: import re
In [2]: s = '"12345","X","description of x","X,Y",,,"345355"'
In [3]: pattern = r"(?<=[\",]),(?=[\",])"
In [4]: re.split(pattern, s)
Out[4]: ['"12345"', '"X"', '"description of x"', '"X,Y"', '', '', '"345355"']
Explanation:
(?<=...) is a "positive lookbehind assertion". It causes your pattern (in this case, just a comma, ",") to match commas in the string only if they are preceded by the pattern given by .... Here, ... is [\",], which means "either a quotation mark or a comma".
(?=...) is a "positive lookahead assertion". It causes your pattern to match commas in the string only if they are followed by the pattern specified as ... (again, [\",]: either a quotation mark or a comma).
Since both of these assertions must be satisfied for the pattern to match, it will still work correctly if any of your 'objects' begin or end with commas as well.
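To see the two assertions at work, here is a small sketch that lists which commas the pattern actually matches. The shortened sample string is an assumption made for illustration; the pattern is the one from the answer, written in a single-quoted raw string so the quotation mark needs no escaping.

```python
import re

# Shortened sample (an assumption): one quoted field with a comma inside, one empty field.
s = '"a","X,Y",,"b"'
pattern = r'(?<=[",]),(?=[",])'

# Indexes of the commas that satisfy both the lookbehind and the lookahead.
split_points = [m.start() for m in re.finditer(pattern, s)]
print(split_points)  # [3, 9, 10]

fields = re.split(pattern, s)
print(fields)  # ['"a"', '"X,Y"', '', '"b"']
```

The comma at index 6, inside "X,Y", is skipped because it is preceded by X and followed by Y, so neither assertion holds.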
You can walk through the string, dropping the quotes and rewriting each separating comma as ", " so that a plain split works afterwards:
s='"12345","X","description of x","X,Y",,,"345355"'
n = ''
i = 0
while i < len(s):
    if i >= len(s):
        break
    if i<len(s) and s[i] == '"':
        i+=1
        while i<len(s) and s[i] != '"':
            n+=s[i]
            i+=1
        i+=1
    if i < len(s) and s[i] == ",":
        n+=", "
        i+=1
n.split(", ")
output: ['12345', 'X', 'description of x', 'X,Y', '', '', '345355']
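If the input is ordinary comma-separated data, the standard library's csv module might be worth a look as well. This is a different technique from the answers above, shown only as a sketch: it assumes the quoting follows the usual CSV rules, and unlike the output requested in the question it strips the surrounding quotes.

```python
import csv

s = '"12345","X","description of x","X,Y",,,"345355"'

# csv.reader expects an iterable of lines, so wrap the single string in a list.
row = next(csv.reader([s]))
print(row)  # ['12345', 'X', 'description of x', 'X,Y', '', '', '345355']
```

Commas inside quotes are kept as part of the field, and the unquoted blank fields come back as empty strings.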
I need to split a string by multiple delimiters.
My string is HELLO+WORLD-IT*IS=AMAZING.
I would like the result to be
["HELLO", "+", "WORLD", "-", "IT", "*", "IS", "=", "AMAZING"]
I hear that re.findall() may handle it, but I can't work out the solution.
Using re.split works in this case. Put every delimiter in a capturing group:
import re
string = "HELLO+WORLD-IT*IS=AMAZING"
pattern = r"(\+|-|\*|=)"
result = re.split(pattern, string)
Given:
s='HELLO+WORLD-IT*IS=AMAZING'
As a general case, you can split on any break between a word and a non-word character with the word boundary assertion \b (splitting on an empty match requires Python 3.7 or later):
>>> re.split(r'\b', s)
['', 'HELLO', '+', 'WORLD', '-', 'IT', '*', 'IS', '=', 'AMAZING', '']
And remove the '' at the start and end like so:
>>> re.split(r'\b', s)[1:-1]
['HELLO', '+', 'WORLD', '-', 'IT', '*', 'IS', '=', 'AMAZING']
Or if you know that is the full set of delimiters that you want to use for a split, define a character class of them and capture the delimiter:
>>> re.split(r'([+\-*=])', s)
['HELLO', '+', 'WORLD', '-', 'IT', '*', 'IS', '=', 'AMAZING']
Since \b is a zero-width assertion (it consumes no characters), the delimiters stay in the result without a capturing group. \b also matches at the start and end of the string, which is why the empty strings at both ends have to be removed.
Since - is used in a character class to define a range of characters such as [0-9], you have to escape the - in [+\-*=].
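Since the question mentions re.findall(), here is a minimal sketch of that route too. It assumes the delimiter set is exactly + - * = as in the sample string; the alternation matches either a run of non-delimiter characters or a single delimiter.

```python
import re

s = 'HELLO+WORLD-IT*IS=AMAZING'

# Either a run of characters that are not delimiters, or exactly one delimiter.
tokens = re.findall(r'[^+\-*=]+|[+\-*=]', s)
print(tokens)  # ['HELLO', '+', 'WORLD', '-', 'IT', '*', 'IS', '=', 'AMAZING']
```

Because findall collects matches instead of splitting, there are no empty strings to trim at either end.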
As mentioned in the title, I am trying to create a function that parses a string (my file name) and returns another version with all the backslashes replaced by forward slashes. My file names are saved with backslashes instead of forward slashes and thus do not work unless I put 'r' before the file name. I know this is an easy workaround, but I am now interested in defining a function to fix this problem.
Here is the code I am attempting to use:
backslash = '\''
def parser(string, character):
    letters = []
    for i in string:
        if i != character:
            letters.append(i)
        else:
            letters.append('/')
    return letters
This is my output, which is obviously wrong. Does anyone have any ideas how I can fix my issue or a way to circumvent this?
['B', 'o', 'b', '\\', 'g', 'o', 'e', 's', '\\', 's', 'h', 'o', 'p', 'p', 'i', 'n', 'g']
P.S. If it makes any difference, I am using Windows 10 and Microsoft.
The backslash (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character.
Your backslash variable is holding ' (a single quote), which is not correct, because '\'' is an escaped quote. For the variable to hold a literal backslash, you must use two backslashes, as in the code below.
The following code parses correctly:
backslash = '\\'
def parser(string, character):
    letters = []
    for i in string:
        if i != character:
            letters.append(i)
        else:
            letters.append('/')
    return letters
# The address location to be parsed
address_with_backslash = 'C:\\user\\something\\InvestingScientist'
print("Original address : " + address_with_backslash)
print("\nAddress after Parsing : " + "".join(parser(address_with_backslash,backslash)))
Output :
Original address : C:\user\something\InvestingScientist
Address after Parsing : C:/user/something/InvestingScientist
Hope this helps!
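The escaping rules described in this answer can be checked directly. The following is a small sketch with made-up variable names; it only compares string literals and does not touch the file system.

```python
single_quote = '\''   # the backslash escapes the quote: this is one ' character
backslash = '\\'      # two backslashes in the source: one \ character in the string
raw_path = r'C:\user\something'  # raw string: backslashes are kept as written

print(single_quote == "'")                 # True
print(len(backslash))                      # 1
print(raw_path == 'C:\\user\\something')   # True
```

This is why the original `backslash = '\''` never matched anything in the path: the variable held a quote character, not a backslash.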
Backslash is the escape character, so if you want a literal backslash in a string you need to use a double backslash.
Here's the solution:
backslash = '\\'
def parser(string, character):
    letters = []
    for i in string:
        if i != character:
            letters.append(i)
        else:
            letters.append('/')
    return letters
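For comparison, the same conversion can probably be done without a hand-written loop. This sketch swaps in two standard-library techniques that the answers above do not use, str.replace and pathlib.PureWindowsPath; the sample address is the one from the earlier answer.

```python
from pathlib import PureWindowsPath

address = 'C:\\user\\something\\InvestingScientist'

# str.replace swaps every backslash for a forward slash in one call.
replaced = address.replace('\\', '/')
print(replaced)  # C:/user/something/InvestingScientist

# PureWindowsPath parses Windows-style paths on any OS; as_posix() renders
# the path with forward slashes.
as_posix = PureWindowsPath(address).as_posix()
print(as_posix)  # C:/user/something/InvestingScientist
```

Both return a string, so there is no need to join a list of characters afterwards.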
Say for example you are looping through letters in a list, but you have to check for punctuation. Would the following code still be O(n), n being the max characters in a line? I think this because the punctuation list is a fixed size, so that if statement would still be O(1) right?
punctuation = [',', '.', '?', '!', ':', ';', '"', ' ', '\t', '\n']
for letters in line:
    if letters not in punctuation:
        word += letters
Yes, you're right: since the punctuation list is fixed in size (and not dependent on N), the overall time complexity of your code should be O(N).
As other commenters have pointed out, O(N*M) would probably be more precise, with N being the number of characters you are reading in total, and M the number of punctuation characters.
If you want to optimize there, you could store the punctuation characters in a set, where the in operator runs in constant time on average:
punctuation = {',', '.', '?', '!', ':', ';', '"', ' ', '\t', '\n'}
for letter in line:
    if letter not in punctuation:
        word += letter
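To make the O(N*M) bound concrete, here is a rough sketch that spells out the linear scan behind `letter not in punctuation` for a list and counts the equality checks. The helper name filter_with_count and the sample line are assumptions made for illustration; real list membership is implemented in C, so this only approximates it.

```python
def filter_with_count(line, punctuation):
    """Drop punctuation from line and count the equality checks performed.

    Hypothetical helper: it imitates the linear scan that
    `letter not in punctuation` does when punctuation is a list.
    """
    comparisons = 0
    kept = []
    for letter in line:
        found = False
        for mark in punctuation:
            comparisons += 1
            if letter == mark:
                found = True
                break
        if not found:
            kept.append(letter)
    return ''.join(kept), comparisons

punctuation = [',', '.', '?', '!', ':', ';', '"', ' ', '\t', '\n']
line = 'Hi, you!'
word, comparisons = filter_with_count(line, punctuation)
print(word, comparisons)
print(comparisons <= len(line) * len(punctuation))  # True
```

The count never exceeds N*M, and since M is fixed at 10 here, it grows linearly with the length of the line.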
text='I miss Wonderland #feeling sad #omg'
prefix=('#','#')
for line in text:
    if line.startswith(prefix):
        text=text.replace(line,'')
print(text)
The output should be:
'I miss Wonderland'
But my output is the original string with only the prefix removed.
So it seems that you do not in fact want to remove the whole "string" or "line", but rather the word? Then you'll want to split your string into words:
words = text.split(' ')
And now iterate through each element in words, performing your check on the first letter. Lastly, combine these elements back into one string:
result = ""
for word in words:
    if not word.startswith(prefix):
        result += (word + " ")
In your case, for line in text iterates over each character in the text, not each word. So when it gets to, e.g., the '#' in '#feeling', it removes the '#', but 'feeling' remains because none of its other characters is '#'. You can confirm that your code goes character by character by running:
for line in text:
    print(line)
Try the following instead, which does the filtering in a single line:
text = 'I miss Wonderland #feeling sad #omg'
prefix = ('#','#')
words = text.split() # Split the text into a list of its individual words.
# Join only those words that don't start with prefix
print(' '.join([word for word in words if not word.startswith(prefix)]))
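The same filtering can be wrapped in a small function, which makes the result easier to check. This is only a sketch: drop_prefixed_words is a made-up name, and it assumes words are separated by whitespace.

```python
def drop_prefixed_words(text, prefixes):
    """Return text without the words that start with any of prefixes (hypothetical helper)."""
    return ' '.join(word for word in text.split() if not word.startswith(prefixes))

text = 'I miss Wonderland #feeling sad #omg'
cleaned = drop_prefixed_words(text, ('#',))
print(cleaned)  # I miss Wonderland sad
```

Note that 'sad' stays in the result because it does not start with the prefix; only '#feeling' and '#omg' are dropped.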