how to extract n digit numbers in a text including special characters - python-3.x

I have a text full of special characters and I want to extract the numbers that have 4 digits.
mytext ="""A text including special characters like 1000+(100)=1100 """
numbers = []
seperators=[
'(', ')', '[', ']', '{', '}', ';', ':', '=', '+', '-', '/', '*', '&', '%', '$', '#', '#', '^', '*', '~', '`', '"', '>', '|', '\\', '?', '.', '<', "'"]
How can I use the split function to extract the numbers?
for word in mytext.split(seperators):  # fails: str.split() does not accept a list of separators
    if word.isdigit():
        numbers.append(int(word))
#print(numbers)
for mynumbers in numbers:
    if mynumbers >999 and 10000>mynumbers: #for 4 digits
        print(mynumbers)
#this should print all the 4 digit numbers

text = "A text including special characters like 1000+(100)=1100 "
import re
numbers = [int(number) for number in re.findall(r'\b\d{4}\b', text)]
print(numbers)
# Outputs [1000, 1100]
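To use the same regex idea for any digit count, the pattern can be built from n. This is only a minimal sketch: the helper name extract_n_digit_numbers is made up for illustration and is not part of the answer above.

```python
import re

def extract_n_digit_numbers(text, n):
    """Return every standalone run of exactly n digits in text as an int."""
    # \b is a word boundary, so "12345" or "1970s" do not yield a 4-digit match.
    # The doubled braces produce a literal {n} quantifier inside the f-string.
    pattern = rf'\b\d{{{n}}}\b'
    return [int(match) for match in re.findall(pattern, text)]

text = "A text including special characters like 1000+(100)=1100 "
print(extract_n_digit_numbers(text, 4))  # [1000, 1100]
print(extract_n_digit_numbers(text, 3))  # [100]
```

Because \b treats letters, digits and underscores alike as word characters, a number glued to a letter (such as 1970s) is skipped, which may or may not be what you want.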

mytext ="""Alain Fabien Maurice Marcel Delon (French: [alɛ̃ dəlɔ̃]; born 8 November 1935) is a French actor and businessman. He is known as
one of Europe's most prominent actors and screen sex symbols from the 1960s and 1970s. He achieved critical acclaim for roles in
films such as Rocco and His Brothers (1960), Plein Soleil (1960), L'Eclisse (1962), The Leopard (1963), The Yellow Rolls-
Royce (1965), Lost Command (1966), and Le Samouraï (1967). Over the course of his career Delon worked with many wellknown directors, including Luchino Visconti, Jean-Luc Godard, Jean-Pierre Melville, Michelangelo Antonioni, and Louis Malle. He
acquired Swiss citizenship in 1999"""
numbers = []
# Characters to replace with a space; add more here if your text contains others.
seperators = '()[]{};:=+-/*&%$#^~`">|\\?.<\''
mytext2 = mytext
for sep in seperators:
    mytext2 = mytext2.replace(sep, ' ')
#print(mytext2)
for word in mytext2.split():
    if word.isdigit():
        numbers.append(int(word))
#print(numbers)
for mynumbers in numbers:
    if mynumbers > 999 and 10000 > mynumbers:  # 4-digit numbers only
        print(mynumbers)
This code prints all the 4-digit numbers in the text. If your text contains more special characters, add them to the replacement step at the top. For a different number of digits, adjust the range check in the last loop.
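Since the question asks about split specifically: str.split takes only one separator, but re.split can split on a whole class of characters at once. The following is a sketch under that assumption; the function name n_digit_numbers_with_split is hypothetical.

```python
import re

def n_digit_numbers_with_split(text, n):
    """Split text on runs of non-word characters and keep the n-digit tokens."""
    # \W+ matches one or more characters that are not letters, digits or "_",
    # so punctuation and whitespace both act as separators.
    tokens = re.split(r'\W+', text)
    # isdecimal() rejects empty strings and mixed tokens such as "1970s".
    return [int(tok) for tok in tokens if tok.isdecimal() and len(tok) == n]

print(n_digit_numbers_with_split("like 1000+(100)=1100 and the 1970s", 4))  # [1000, 1100]
```

Note that this version counts digits in the token rather than checking a numeric range, so a token like 0123 would be treated as a 4-digit match.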

Related

Need help to remove punctuation and replace numbers for an nlp task

For example, I have a string:
sentence = ['cracked $300 million','she\'s resolutely, smitten ', 'that\'s creative [r]', 'the market ( knowledge check : prices up!']
I want to remove the punctuation and replace numbers with the '£' symbol.
I have tried this but can only replace one or the other when I try to run them both.
my code is below
import re
s =([re.sub(r'[!":$()[]\',]',' ', word) for word in sentence])
s= [([re.sub(r'\d+','£', word) for word in s])]
print(s)
I think the problem could be in the square brackets??
thank you!
If you want to replace some specific punctuation symbols with a space and any digit chunks with a £ sign, you can use
import re
rx = re.compile(r'''[][!":$()',]|(\d+)''')
sentence = ['cracked $300 million','she\'s resolutely, smitten ', 'that\'s creative [r]', 'the market ( knowledge check : prices up!']
s = [rx.sub(lambda x: '£' if x.group(1) else ' ', word) for word in sentence]
print(s) # => ['cracked £ million', 'she s resolutely smitten ', 'that s creative r ', 'the market knowledge check prices up ']
Note where [] are inside a character class: when ] is at the start, it does not need to be escaped and [ does not have to be escaped at all inside character classes. I also used a triple-quoted string literal, so you can use " and ' as is without extra escaping.
So, here, [][!":$()',]|(\d+) matches ], [, !, ", :, $, (, ), ' or , or matches and captures into Group 1 one or more digits. If Group 1 matched, the replacement is the pound sign; otherwise, it is a space.
Sorry, I didn't see the second part of your request, but you can do this for both the numbers and the punctuation:
import string
sentence = ['cracked $300 million', 'she\'s resolutely, smitten ', 'that\'s creative [r]',
            'the market ( knowledge check : prices up!']
def replaceDigitAndPunctuation(newSentence):
    new_word = ""
    for char in newSentence:
        if char in string.digits:
            new_word += "£"  # every digit becomes its own £, so 300 turns into £££
        elif char in string.punctuation:
            pass  # punctuation is dropped rather than replaced with a space
        else:
            new_word += char
    return new_word
for i in range(len(sentence)):
    sentence[i] = replaceDigitAndPunctuation(sentence[i])
Using your input and pattern:
>>> ([re.sub(r'[!":$()[]\',]',' ', word) for word in sentence])
['cracked $300 million', "she's resolutely, smitten ", "that's creative [r]", 'the market ( knowledge check : prices up!']
>>>
The reason is that [!":$()[] is treated as a character class and \',] as a literal sequence, i.e. the engine looks for one character from the class followed by exactly ',].
With the closing bracket in the class escaped (\]):
>>> ([re.sub(r'[!":$()[\]\',]',' ', word) for word in sentence])
['cracked 300 million', 'she s resolutely smitten ', 'that s creative r ', 'the market knowledge check prices up ']
>>>
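The difference between the two patterns can be checked on small made-up strings. This is just an illustrative sketch; the variable names unescaped and escaped and the test strings are not from the original post.

```python
import re

# Class [!":$()[] followed by the literal text ',]
unescaped = r'[!":$()[]\',]'
# One single class that also contains ] ' and ,
escaped = r'[!":$()[\]\',]'

# The unescaped pattern only matches a class character followed by ',]
first = re.sub(unescaped, ' ', "x!',]y")
# Ordinary punctuation is therefore left untouched
second = re.sub(unescaped, ' ', "she's [r]")
# With \] every listed character is replaced individually
third = re.sub(escaped, ' ', "a]b[c,d")

print(repr(first), repr(second), repr(third))
```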
Edit:
If you're trying to stack multiple actions into a single list comprehension, then place your actions in a function and call the function:
def process_word(word):
    word = re.sub(r'[!":$()[\]\',]',' ', word)
    word = re.sub(r'\d+','£', word)
    return word
Results in:
>>> [process_word(word) for word in sentence]
['cracked £ million', 'she s resolutely smitten ', 'that s creative r ', 'the market knowledge check prices up ']

define a map function that only returns city names longer than 5 characters. Instead of other names - a line with a hyphen ("-")

The names of cities are entered in one line separated by a space.
You need to define a map function that only
returns city names longer than 5 characters.
Instead of other names - a line with a hyphen ("-").
Generate a list of the obtained values and display it
on the screen in one line separated by a space.
cities = [i.replace(i, '-') for i in input().split() if len(i) < 5]
print(cities)
Input
cities = ['Moscow', 'Ufa', 'Vologda', 'Tula', 'Vladivostok', 'Habarovsk']
Output
['Moscow', '-', 'Vologda', '-', 'Vladivostok', 'Habarovsk']
If you can use a lambda function, then:
cities = ['Moscow', 'Ufa', 'Vologda', 'Tula', 'Vladivostok', 'Habarovsk']
print(list(map(lambda x: '-' if len(x)<5 else x,cities)))
output:
['Moscow', '-', 'Vologda', '-', 'Vladivostok', 'Habarovsk']
Also, if you want to do it without the map function, using only a list comprehension:
cities = ['Moscow', 'Ufa', 'Vologda', 'Tula', 'Vladivostok', 'Habarovsk']
print([ i.replace(i, '-') if len(i)<5 else i for i in cities ])
output:
['Moscow', '-', 'Vologda', '-', 'Vladivostok', 'Habarovsk']
As mentioned by @JonSG, the version without map is more efficient if you skip the replace call:
['-' if len(i) < 5 else i for i in cities]
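One detail worth checking: the task says "longer than 5 characters", while the snippets above test len < 5, so a five-letter name would be kept. For the sample input the result is the same. Below is a sketch that follows the task wording and also produces the space-separated output; the function name mask_short_names and the min_length parameter are assumptions for illustration.

```python
def mask_short_names(line, min_length=6):
    """Replace every name shorter than min_length characters with '-'."""
    # map applies the lambda to each name produced by split().
    masked = map(lambda name: name if len(name) >= min_length else '-', line.split())
    # Join the results into one line separated by spaces.
    return ' '.join(masked)

print(mask_short_names("Moscow Ufa Vologda Tula Vladivostok Habarovsk"))
# Moscow - Vologda - Vladivostok Habarovsk
```

To read the names from the keyboard as the task describes, pass input() as the argument: print(mask_short_names(input())).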

Split without separator with different arrays

Could you please help me? I need to split a string that doesn't have a single separator into tokens of different types.
For example, the following strings should generate the same list as output:
"ak = bib+c*(data+1005)"
" ak= bib +c* (data +1005 )"
" ak =bib + c * (data + 1005)"
The output should be:
['ak', '=', 'bib', '+', 'c', '*', '(', 'data', '+', '1005', ')']
Thank you!
You can use re.findall with a pattern that matches either a word or a non-space character:
import re
re.findall(r'\w+|\S', "ak = bib+c*(data+1005) ")
This returns:
['ak', '=', 'bib', '+', 'c', '*', '(', 'data', '+', '1005', ')']
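As a quick check, the same pattern can be run against all three spacings from the question. This is a small sketch; the function name tokenize is chosen here for illustration.

```python
import re

def tokenize(expression):
    # \w+ grabs a run of letters, digits or underscores (names and numbers);
    # \S grabs any other single non-space character (operators and brackets).
    return re.findall(r'\w+|\S', expression)

samples = [
    "ak = bib+c*(data+1005)",
    " ak= bib +c* (data +1005 )",
    " ak =bib + c * (data + 1005)",
]
expected = ['ak', '=', 'bib', '+', 'c', '*', '(', 'data', '+', '1005', ')']
for sample in samples:
    print(tokenize(sample) == expected)  # True for each sample
```

Keep in mind that \S matches one character at a time, so multi-character operators such as ** or <= come out as separate tokens, and 3.14 is split into 3, . and 14.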

Remove punctuation from textfile error - Python 3

I'm attempting to make a program that removes all punctuation from a textfile, however I keep running into an error where it only prints the title of the file rather than the contents within the file.
def removePunctuation(word):
    punctuation_list=['.', ',', '?', '!', ';', ':', '\\', '/', "'", '"']
    for character in word:
        for punctuation in punctuation_list:
            if character == punctuation:
                word = word.replace(punctuation, "")
    return word
print(removePunctuation('phrases.txt'))
Whenever I run the code, it just prints the name of the file; 'phrasestxt' without any punctuation. I want the program to print all the text that is present within the document, which is a few paragraphs long. Any help would be appreciated!
In this case, you must open your file and read it:
def removePunctuation(file_path):
    with open(file_path, 'r') as fd:
        word = fd.read()
    punctuation_list=['.', ',', '?', '!', ';', ':', '\\', '/', "'", '"']
    for character in word:
        for punctuation in punctuation_list:
            if character == punctuation:
                word = word.replace(punctuation, "")
    return word
print(removePunctuation('phrases.txt'))
If you want, you can replace the double for loop with:
word = "".join([i for i in word if i not in punctuation_list])
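Another common option is str.translate, which deletes all unwanted characters in one pass. The sketch below is an alternative, not the method from the answer above: it uses string.punctuation, which covers more characters than the list in the question, and the function names and the UTF-8 encoding are assumptions.

```python
import string

def strip_punctuation(text):
    """Return text with every ASCII punctuation character removed."""
    # str.maketrans('', '', chars) builds a table that deletes each char in chars.
    return text.translate(str.maketrans('', '', string.punctuation))

def strip_punctuation_from_file(file_path):
    # Assumes the file is UTF-8 text; adjust the encoding if yours differs.
    with open(file_path, 'r', encoding='utf-8') as fd:
        return strip_punctuation(fd.read())

print(strip_punctuation("Hello, world! It's 5 o'clock."))  # Hello world Its 5 oclock
```

Keeping the text function separate from the file function makes the original mistake harder to repeat: one takes a path, the other takes the contents.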

What should I do to make this diamond shape using Python 3?

--*--
-***-
--*--
(the dashes stand for blanks)
print('', '*', ' \n', '***', ' \n', '', '*', '')
This is what I made and it doesn't work. I thought '' is a blank, and since each comma adds one more blank, there should be 2 blanks as a result?
Anyway, what should I do using only one print function?
Just put it in as a single string:
print(' * \n***\n * ')
Output:
 *
***
 *
You can do this, because Python treats \n as new line character and it will not interfere with the rest of the text, even if it "touches" it. Putting it in a single string makes it more readable. There is no reason to fragment the whole statement with commas, when you can do it all in one string.
Basically:
'' --> empty string
' ' --> one space char (or blank)
So, modifying your print:
Only change the first argument from '' to ' '
print(' ', '*', ' \n', '***', ' \n', '', '*', '')
You can also simplify it passing only 1 argument:
print('  *  \n *** \n  *  ')
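If the shape needs to grow later, the rows can be generated instead of typed by hand. This is a minimal sketch that still ends in a single print call; the function name diamond and its parameters are hypothetical, and it assumes max_stars is odd and width is at least max_stars.

```python
def diamond(max_stars, width):
    """Build a diamond whose widest row has max_stars stars, padded to width."""
    # Star counts for the top half including the middle row: 1, 3, ..., max_stars
    counts = list(range(1, max_stars + 1, 2))
    # Mirror the top half for the bottom, without repeating the middle row
    counts += counts[-2::-1]
    # str.center pads each row with spaces on both sides
    return "\n".join(("*" * n).center(width) for n in counts)

print(diamond(3, 5))
```

With max_stars=3 and width=5 this produces the three rows from the question: two blanks around the single star and one blank around the three stars.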
