Strip symbols/punctuation from a given string - python-3.x

I'm trying to remove all forms of punctuation in a string/file.
This is my code thus far.
def remove_symbols(p):
    punc=set('''`~!##$%^&*()-_=+\|]}[{;:'",<.>/?''')
    for line in p:
        clean =''.join(c for c in line if not c in punc)
        print(clean)
But the end result looks like this if p = "I'm your's!"
I
m
y
o
u
r
s
When really, I want it to look like this --> "Im yours"
I would appreciate any suggestions.

It looks like you're trying to remove symbols from a paragraph by iterating through it one line at a time. But instead of iterating through each line, you're iterating through each character. To iterate through each line instead, use split:
def remove_symbols(p):
    punc=set(r'''`~!##$%^&*()-_=+\|]}[{;:'",<.>/?''')
    for line in p.split("\n"):
        clean =''.join(c for c in line if c not in punc)
        print(clean)
remove_symbols("I'm your's!")
Result:
Im yours
Alternatively, get rid of the for loop entirely, and let your expression run over the whole text at once.
def remove_symbols(p):
    punc=set(r'''`~!##$%^&*()-_=+\|]}[{;:'",<.>/?''')
    return ''.join(c for c in p if c not in punc)
print(remove_symbols("I'm your's!"))
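Another option, not part of the answer above, is to let str.translate delete the characters. This is only a sketch, and it assumes that the ASCII set in string.punctuation is what you mean by "punctuation"; it will not touch non-ASCII symbols such as curly quotes.

```python
import string

def remove_symbols(text):
    # Map every ASCII punctuation character to None, which deletes it.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

print(remove_symbols("I'm your's!"))  # Im yours
```

The translation table is built once per call here; for large inputs it could be built once at module level and reused.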

Related

How to achieve below situation using python list comprehension?

rows = [(d = re.split("\s{2,}|\|", line)) for line in lines if len(d) > 5 and d[0]!='' ]
As the code snippet shows, I am splitting each line in a list of lines on runs of spaces. I am trying to assign the result of the split to a variable d so that I can use it in the if condition and avoid repeating the split.
Is there a way to achieve this?
rows = [d for d in (re.split(r"\s{2,}|\|", line) for line in lines) if len(d) > 5 and d[0]!='']
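On Python 3.8 or later, an assignment expression (the := operator) can express what the question was reaching for. The sketch below assumes that version, and the sample lines are made up for illustration.

```python
import re

# Hypothetical sample input, invented for this example.
lines = [
    "a  b  c  d  e  f",
    "  x  y  1  2  3  4",
    "a|b|c|d|e|f|g",
    "short  line",
]

# d is bound inside the condition and reused as the element, so each line is split once.
rows = [
    d
    for line in lines
    if len(d := re.split(r"\s{2,}|\|", line)) > 5 and d[0] != ""
]
print(rows)
```

The second sample line is dropped because its first field is empty, and the last one because it has too few fields.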

How to count strings in specified field within each line of one or more csv files

I am writing a Python 3 program to count strings in a specified field within each line of one or more CSV files.
The CSV file contains:
Field1, Field2, Field3, Field4
A, B, C, D
A, E, F, G
Z, E, C, D
Z, W, C, Q
The script is executed like this, for example:
$ ./script.py 1,2,3,4 file.csv
And the result is:
A 10
C 7
D 2
E 2
Z 2
B 1
Q 1
F 1
G 1
W 1
ERROR
The script is executed like this, for example:
$ ./script.py 1,2,3,4 file.csv file.csv file.csv
The error occurs here:
for rowitem in reader:
    for pos in field:
        pos = rowitem[pos] ##<---LINE generating error--->##
        if pos not in fieldcnt:
            fieldcnt[pos] = 1
        else:
            fieldcnt[pos] += 1
TypeError: list indices must be integers or slices, not str
Thank you!
Judging from the output, I'd say that the fields in the CSV file do not influence the count of each string. If string uniqueness is case-insensitive, remember to use yourstring.lower() so that matches in different cases are counted as one. Also keep in mind that if your text is large, the number of unique strings may be very large as well, so some sort of sorting must be in place to make sense of it. (Otherwise it might be a long list of random counts, with a large portion of them being just 1s.)
Now, an easy way to get a count of unique strings is to use the collections module.
file = open('yourfile.txt', encoding="utf8")
a = file.read()
# if you have some words you'd like to exclude
stopwords = set(line.strip() for line in open('stopwords.txt'))
stopwords = stopwords.union(set(['<media','omitted>','it\'s','two','said']))
# make an empty key-value dict to contain matched words and their counts
wordcount = {}
for word in a.lower().split(): # use the delimiter you want (a comma I think?)
    # replace punctuation so it isn't counted as part of a word
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace("\"","")
    word = word.replace("!","")
    if word not in stopwords:
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
That should do it. The wordcount dict should contain each word and its frequency. After that, just sort it using collections.Counter and print it out.
import collections
word_counter = collections.Counter(wordcount)
for word, count in word_counter.most_common(20):
    print(word, ": ", count)
I hope this solves your problem. Lemme know if you face problems.
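As for the TypeError itself: it is raised because pos is still a string when it is used as a list index. The sketch below assumes that field comes from splitting the command-line argument 1,2,3,4 on commas and that the positions are 1-based, neither of which the question shows. count_fields is a hypothetical helper name, and an in-memory string stands in for file.csv.

```python
import csv
import io
from collections import Counter

def count_fields(lines, positions):
    """Count the values found at the given 1-based column positions."""
    counts = Counter()
    # skipinitialspace drops the blank that follows each comma in the sample data.
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # skip the header row
    for row in reader:
        for pos in positions:
            counts[row[pos - 1]] += 1
    return counts

# "1,2,3,4" stands in for sys.argv[1]; int() is the conversion the failing loop lacked.
positions = [int(p) for p in "1,2,3,4".split(",")]
sample = io.StringIO(
    "Field1, Field2, Field3, Field4\n"
    "A, B, C, D\n"
    "A, E, F, G\n"
    "Z, E, C, D\n"
    "Z, W, C, Q\n"
)
counts = count_fields(sample, positions)
for value, count in counts.most_common():
    print(value, count)
```

To handle several files, the same Counter could be updated once per file before printing.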

How to process each word in a python program

I want to write a program that reads every word from every line of a text file.
I tried using a nested loop, but the second loop reads each letter. Can someone explain this? I expected it to read the individual words instead of letters.
fh=open("romeo.txt")
d=dict()
c=0
for i in fh:
    for j in i:
        d[c]=j
        c+=1
print(d)
for i in d:
    print(d.get('moon',None))
The output is shown in Picture 1.
I wrote code that does what I want, but is there a shorter way to do it?
fh=open("romeo.txt")
d=dict()
c=0
for i in fh:
    i=i.rstrip()
    print("by the first loop ######################", i)
    k=i.split()
    for j in k:
        print("by the second loop ##################", j)
        d[c]=j
        c+=1
print(d)
The output I want is given in Picture 2.
Also, can I use the split() function here to do it?
How can I use it? It seems to get only the last line of the file as a list, and I want all the words in a list or dictionary.
Thank You
for i in fh:
This line iterates through each line of text in the file.
    for j in i:
Since i is a string, this line iterates through each letter in each line. Instead of doing it this way, split() the line over whitespace and then iterate through the resulting list:
for line in fh:
    for word in line.split():
        #do stuff
Anyway, since you wanted a short way to do it, here's a neat one-liner.
To make a list of each word in the file:
[word for line in open("romeo.txt") for word in line.split()]
To make a dict (a list is better, since your keys are integer indices anyway):
dict(enumerate(word for line in open("romeo.txt") for word in line.split()))
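The one-liners leave the file open until the interpreter cleans it up. A slightly longer sketch with a with block closes it explicitly. read_words is a hypothetical helper name, and a temporary file with made-up text stands in for romeo.txt so the example runs anywhere.

```python
import os
import tempfile

def read_words(path):
    # The with block closes the file even if an error occurs while reading.
    with open(path) as fh:
        return [word for line in fh for word in line.split()]

# Hypothetical stand-in for romeo.txt.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("But soft what light\nthrough yonder window breaks\n")

words = read_words(tmp.name)
os.remove(tmp.name)

print(words)
print(dict(enumerate(words)))  # same words, keyed by position
```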

Python calling class variable in forloop

I am new to Python, and I am confused by the for loop below. Can anyone please help me understand how the class is used in it?
import sys
def checkline():
    glb.linecount += 1
    w = glb.l.split()
    glb.wordcount += len(w)
class glb:
    linecount = 0
    wordcount = 0
    l = []
f = open('Untitled9.ipynb','r')
for glb.l in f.readlines(): #what glb.l exactly does?
    checkline()
print(glb.linecount, glb.wordcount)
This entire program counts the lines and words in a file. Specifically,
glb.l becomes each line of the file in turn, so checkline() can count the words in each one.
Let me pseudocode it for you.
Open the file `Untitled9.ipynb` for reading. // f
For each line in the file: // the for loop, which calls checkline
Store the line. // the loop assigns the current line to glb.l, replacing the previous one
Add one to the line count.
Add the number of whitespace-separated words in the line to the word count. // len() of the result of split() on glb.l
Print the line count and the word count.
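A for loop can assign to any valid target, not just a plain name, which is why for glb.l in ... works. The sketch below reproduces the program with an in-memory stand-in for the file (the two sample lines are made up) so it can be run directly.

```python
import io

class glb:
    # Class attributes used as shared, global-like state.
    linecount = 0
    wordcount = 0
    l = ""

def checkline():
    glb.linecount += 1
    glb.wordcount += len(glb.l.split())

# Hypothetical stand-in for the opened file.
f = io.StringIO("one two\nthree\n")

# Equivalent to: for line in f: glb.l = line; checkline()
for glb.l in f:
    checkline()

print(glb.linecount, glb.wordcount)  # 2 3
```

Passing the line to checkline() as an argument would do the same job without shared state; the class here only acts as a namespace.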

remove the item in string

How do I remove the unwanted parts of the string (blank lines and surrounding whitespace) and return a list made of the remaining lines? This is what I have written. Thanks in advance!
def get_poem_lines(poem):
    r""" (str) -> list of str
    Return the non-blank, non-empty lines of poem, with whitespace removed
    from the beginning and end of each line.
    >>> get_poem_lines('The first line leads off,\n\n\n'
    ... + 'With a gap before the next.\nThen the poem ends.\n')
    ['The first line leads off,', 'With a gap before the next.', 'Then the poem ends.']
    """
    list=[]
    for line in poem:
        if line == '\n' and line == '+':
            poem.remove(line)
        s = poem.remove(line)
        for a in s:
            list.append(a)
    return list
str.split and str.strip might be what you need:
s = 'The first line leads off,\n\n\n With a gap before the next.\nThen the poem ends.\n'
print([line.strip() for line in s.split("\n") if line])
['The first line leads off,', 'With a gap before the next.', 'Then the poem ends.']
I'm not sure where the + fits in. If it is actually part of the text, either strip it or str.replace it. Also avoid using list as a variable name, since it shadows the built-in list type.
Lastly, strings have no remove method. You can use .replace, but since strings are immutable you need to reassign poem to the return value of replace, i.e. poem = poem.replace("+","")
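Putting those pieces together, one possible version of the function looks like this. It is a sketch that treats the + in the docstring as ordinary string concatenation rather than a character in the poem, and it uses splitlines() instead of split("\n").

```python
def get_poem_lines(poem):
    """Return the non-blank lines of poem, stripped of surrounding whitespace."""
    # line.strip() is empty (falsy) for blank or whitespace-only lines.
    return [line.strip() for line in poem.splitlines() if line.strip()]

poem = ('The first line leads off,\n\n\n'
        + 'With a gap before the next.\nThen the poem ends.\n')
print(get_poem_lines(poem))
```

Filtering on line.strip() rather than line also drops lines that contain only spaces or tabs.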
You can read all non-empty lines like this:
list_m = [line for line in file if line not in ["\n", "\r\n"]]
Without looking at your input sample, I am assuming that you simply want the trailing white space removed. In that case,
import re
for x in range(len(list_m)):
    # str.replace() matches its argument literally, so a pattern needs re.sub()
    list_m[x] = re.sub(r"[ ]+(?=\n)", "", list_m[x])

Resources