How to replace question marks with a word - python-3.x

I have Arabic tweets and I want to replace question marks and exclamation marks with the equivalent Arabic words. I tried the code below, which uses regular expressions, but nothing happens. I am using Jupyter Notebook.
def replace_questionmark(tweet):
    text = re.sub("!", "تعجب",tweet)
    text = re.sub('استفهام','؟' ,tweet)
    return tweet
data_df['clean text'] = data_df['Text'].apply(lambda x: replace_questionmark(x))

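The original function has three problems: the second re.sub call has the pattern and the replacement swapped, it runs on tweet instead of on the result of the first call, and the function returns the untouched tweet. The following code solves your problem: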
import pandas as pd
Text = ['I am feeling good !', 'I am testing this code ؟']
data_df = pd.DataFrame(columns=['Text'], data=Text)
def replace_questionmark(tweet):
    # Plain str.replace is enough here; no regular expression is needed
    text = tweet.replace('!', 'تعجب')
    text = text.replace('؟', 'استفهام')
    # Return str, not bytes: in Python 3, encode('utf-8') would give b'...'
    return text
data_df['clean text'] = data_df['Text'].apply(replace_questionmark)
print(data_df)
Output
                       Text                      clean text
0       I am feeling good !          I am feeling good تعجب
1  I am testing this code ؟  I am testing this code استفهام
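The same substitution can also be written as a small table-driven helper, which makes it easier to add more marks later. This is only a sketch: the names REPLACEMENTS and replace_marks are made up for illustration, and treating the ASCII question mark the same as the Arabic one is an assumption that goes beyond the original question.

```python
# Hypothetical mapping from punctuation marks to Arabic words.
REPLACEMENTS = {
    "!": "تعجب",
    "؟": "استفهام",  # Arabic question mark (U+061F)
    "?": "استفهام",  # ASCII question mark, assumed to mean the same thing
}

def replace_marks(tweet):
    """Return a new string with every mark replaced by its word."""
    for mark, word in REPLACEMENTS.items():
        tweet = tweet.replace(mark, word)
    return tweet

print(replace_marks("I am feeling good !"))
print(replace_marks("Is this working?"))
```

str.replace treats its argument literally, so "?" needs no escaping here. With re.sub, the ASCII "?" is a metacharacter and would have to be written as r"\?" or passed through re.escape.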

Related

error while removing the stop-words from the text

I am trying to remove stopwords from my data, and I used this statement to load the stopwords:
stop = set(stopwords.words('english'))
This set has the character 'd' as one of its stopwords, so when I use it in my function, 'd' is removed from words. Please see the attached picture for reference and tell me how to fix this.
I checked the code and noticed that you are applying the rem_stopwords function to the clean_text column, while you should apply it to the tweet column.
Apart from that, the stopword filter removes d, I, and similar entries only when they are independent tokens. A token here is a word obtained by splitting on spaces, so in i'd neither d nor I is removed, because they are combined into one word. In 'I like Football', however, I is removed, since it is an independent token.
You can try this code; it should solve your problem:
import pandas as pd
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop = set(stopwords.words('english'))
df['clean_text'] = df['Tweet'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in (stop)]))
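The token-level behaviour described above can be checked without downloading anything. The sketch below uses a tiny hand-written stopword set as a stand-in for NLTK's English list; STOP and remove_stopwords are illustrative names, not part of the original code.

```python
# A tiny stand-in for NLTK's English stopword set (assumption for this sketch).
STOP = {"i", "d", "the", "a", "is"}

def remove_stopwords(text, stop=STOP):
    """Drop whitespace-separated tokens whose lowercase form is a stopword."""
    return " ".join(word for word in text.split() if word.lower() not in stop)

print(remove_stopwords("I like Football"))   # I is a standalone token
print(remove_stopwords("I'd like a donut"))  # I'd is one token, so it stays
```

Because the comparison is done per token, the letter d inside a longer word such as donut is never touched.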

python not removing punctuation

I have a text file. I want to remove the punctuation and save the result as a new file, but nothing is being removed. Any idea why?
code:
def punctuation(string):
    punctuations = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
    for x in string.lower():
        if x in punctuations:
            string = string.replace(x, "")
    # Print string without punctuation
    print(string)
file = open('ir500.txt', 'r+')
file_no_punc = (file.read())
punctuation(l)
with open('ir500_no_punc.txt', 'w') as file:
    file.write(file_no_punc)
It is not removing any punctuation. Why?
def punctuation(string):
    punctuations = '''!()-[]{};:'"\,<>./?##$%^&*_~'''
    for x in string.lower():
        if x in punctuations:
            string = string.replace(x, "")
    # return string without punctuation
    return string
file = open('ir500.txt', 'r+')
file_no_punc = (file.read())
file_no_punc = punctuation(file_no_punc)
with open('ir500_no_punc.txt', 'w') as file:
    file.write(file_no_punc)
Explanation:
I changed only punctuation(l) to file_no_punc = punctuation(file_no_punc) and print(string) to return string.
1) What is l in punctuation(l)? It is never defined in the code shown.
2) You are calling punctuation(), which works correctly, but you do not use its return value
3) because it is not currently returning a value, just printing it ;-)
Please note that I made only the minimal change to make it work. You might want to post it to our code review site, to see how it could be improved.
Also, I would recommend that you get a good IDE. In my opinion, you cannot beat PyCharm community edition. Learn how to use the debugger; it is your best friend. Set breakpoints, run the code; it will stop when it hits a breakpoint; you can then examine the values of your variables.
Taking out the file reading/writing, you could remove the punctuation from a string like this:
table = str.maketrans("", "", r"!()-[]{};:'\"\,<>./?##$%^&*_~")
# # or maybe even better
# import string
# table = str.maketrans("", "", string.punctuation)
file_with_punc = r"abc!()-[]{};:'\"\,<>./?##$%^&*_~def"
file_no_punc = file_with_punc.lower().translate(table)
# abcdef
Here I use str.maketrans and str.translate.
Note that Python strings are immutable. There is no way to change a given string in place; every operation you perform on a string returns a new string.
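The immutability point is also why the return value has to be captured, as in the corrected function earlier. A minimal sketch, using string.punctuation and made-up variable names (original, cleaned):

```python
import string

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)

original = "Hello, world! (test)"
cleaned = original.translate(table)

print(original)  # the original string is unchanged
print(cleaned)   # the cleaned text lives in a new string
```

Calling original.translate(table) without assigning the result would leave nothing behind, which is the same kind of mistake as calling a function and ignoring what it returns.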

I can't convert a unicode into a plain string

I'm stuck trying to convert a single word from unicode into a plain string. I looked for answers, but none of them helped me solve this simple problem.
I've already tried the following links:
https://www.oreilly.com/library/view/python-cookbook/0596001673/ch03s18.html
Convert a Unicode string to a string in Python (containing extra symbols)
How to convert unicode string into normal text in python
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.mpgo.mp.br/coliseu/concursos/inscricoes_abertas')
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find('table', attrs={'class':'grid'})
text = table.get_text()
text_str = text[0:7]
text_str = text_str.encode('utf-8')
test_str = 'Nenhum'
test_str = test_str.encode('utf-8')
if text_str == test_str:
    print('Ok they are equal')
else:
    print(id(text_str))
    print(id(test_str))
    print(type(test_str))
    print(type(test_str))
    print(test_str)
    print(test_str)
My expected result is that text_str equals test_str.
Welcome to SO. You have a typo in your debug output: the last four values are all test_str, when two of them should be text_str.
Had you printed text_str, you would have noticed that the variable you read in contains:
'\nNenhum'
So if you either change your slice to text_str = text[1:7] or set your test string accordingly:
test_str = '\nNenhum'
it works. Happy hacking...
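A quick way to spot this kind of hidden character is to print the repr() of both values. The sketch below uses a hard-coded string in place of the scraped text; the assumption that the page text starts with a newline comes from the answer above, and the rest of the string is invented.

```python
# Stand-in for the text scraped from the page (assumed to start with a newline).
text = "\nNenhum concurso"

text_str = text[0:7]
test_str = "Nenhum"

# repr() makes invisible characters such as "\n" visible.
print(repr(text_str))
print(repr(test_str))

# Stripping surrounding whitespace before slicing avoids the off-by-one.
clean_str = text.strip()[0:6]
print(clean_str == test_str)
```

In Python 3 both values are already str, so the comparison works without calling encode('utf-8') on either side.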

Using Pyperclip in Python 3 Does Not Paste Data in Desired Format

I'm using python 3.7 and I want to:
Copy IPs from a column in excel
Add a comma followed by a space between the IPs
Return as one line
Copy back to clipboard using pyperclip.
Below is the desired pasted results:
10.10.10.10, 10.10.10.11, 10.10.10.12, 10.10.10.13, 10.10.10.14
I have looked at some answers I found here and here, but they do not produce the desired result. I have tried the following snippets, but none of them does it for me:
#code 1
import pyperclip
text = pyperclip.paste()
lines = text.split('\n')
pyperclip.copy('\n'.join(lines))
#code2
import pyperclip
text = pyperclip.paste()
lines = text.split('\n')
a = '{}'.format(', '.join(lines[:-1]))
pyperclip.copy(''.join(a))
#code3
import pyperclip
text = pyperclip.paste()
lines = text.split('\n')
a = '{}'.format(', '.join(lines[:-1]))
pyperclip.copy(a)
Assistance on fixing this and understanding would be greatly appreciated.
Try this way:
import pyperclip
text = pyperclip.paste()
lines = text.split()
pyperclip.copy(', '.join(lines))
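Why this version behaves differently can be seen without touching the clipboard. The sketch below assumes the copied column arrives as one IP per line with Windows line endings and a trailing newline; the actual clipboard contents may differ by platform.

```python
# Simulated clipboard contents after copying a column from Excel
# (assumption: one IP per line, "\r\n" line endings, trailing newline).
text = "10.10.10.10\r\n10.10.10.11\r\n10.10.10.12\r\n"

# split("\n") keeps the "\r" characters and yields a trailing empty string.
print(text.split("\n"))

# split() with no argument splits on any run of whitespace and drops empties.
lines = text.split()
result = ", ".join(lines)
print(result)
```

With no argument, split() needs no lines[:-1] workaround, because it never produces the empty trailing element in the first place.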

Create a list everytime I encounter a certain word in a str

My problem is that I want to write code that does this:
input => str_of_words = '<post>30blueyellow<post>2skyearth<post>5summerwinter'
output => post30 = ["blue","yellow"]
post2 = ["sky","earth"]
post5 = ["summer", "winter"]
At first I thought I could do something like
if "<post>" in str_of_words:
    occurrence = str_of_words.count("<post>")
    #and from there I had no idea how to continue coding it
So I thought I would ask whether anyone knows a trick for doing that.
You can use the nltk module:
import re
import nltk
nltk.download('words')
from nltk.corpus import words
# Build the vocabulary once; calling words.words() inside the loop is slow
word_set = set(words.words())
def split(a):
    # Try every split point and return the first one where both halves are words
    for i in range(len(a)):
        if a[:i] in word_set and a[i:] in word_set:
            return [a[:i],a[i:]]
str_of_words = '<post>30blueyellow<post>2skyearth<post>5summerwinter'
post = {i:split(j) for i,j in dict(re.findall(r'post>(\d+)(\w+)',str_of_words)).items()}
post['30']
['blue', 'yellow']
post['5']
['summer', 'winter']
post['2']
['sky', 'earth']
This might get you started:
import re
str_of_words = '<post>30blueyellow<post>2skyearth<post>5summerwinter'
posts = {}
lst = str_of_words.split('<post>')
for item in lst:
    # Leading digits are the post number; the rest is the run of words
    match = re.match(r'(\d+)(\D+)', item)
    if not match:
        continue
    posts[int(match.group(1))] = match.group(2)
print(posts)
It prints:
{30: 'blueyellow', 2: 'skyearth', 5: 'summerwinter'}
So posts[30] = 'blueyellow'.
The re module is very helpful when it comes to separating numbers (\d) from non-numbers (\D).
I don't know according to what rules you would like to split the words. Do you have a list of words that could appear?
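If such a list of possible words does exist, the runs of letters can be split against it. The following is a rough sketch under that assumption: VOCAB and split_words are hypothetical, and the vocabulary here contains only the six words from the example input.

```python
import re

# Hypothetical vocabulary; the real list of possible words is not given.
VOCAB = {"blue", "yellow", "sky", "earth", "summer", "winter"}

def split_words(run, vocab=VOCAB):
    """Split a run of letters into vocabulary words, or return None."""
    if not run:
        return []
    # Try each prefix that is a known word, then split the remainder.
    for i in range(1, len(run) + 1):
        if run[:i] in vocab:
            rest = split_words(run[i:], vocab)
            if rest is not None:
                return [run[:i]] + rest
    return None

str_of_words = '<post>30blueyellow<post>2skyearth<post>5summerwinter'
posts = {int(num): split_words(run)
         for num, run in re.findall(r'<post>(\d+)([a-z]+)', str_of_words)}
print(posts)
```

Unlike a fixed two-way split, the recursive helper handles any number of words per post, and it returns None when a run cannot be covered by the vocabulary.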
