I have tried creating the following function:
def 3utr():
However, I get a SyntaxError. Replacing the "3" by "three" fixes the problem.
My questions are:
Why is it a syntax error?
Is there a way to have a function name start with a number in Python 3?

It is a syntax error because the language specification does not allow identifiers to start with a digit. So it’s not possible to have function names (which are identifiers) that start with digits in Python.
identifier ::= (letter|"_") (letter | digit | "_")*
Python 2 Language Reference
Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.
Python 3 Language Reference
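The rule quoted above can be checked at runtime. The following is a minimal sketch, assuming you only want to know whether a candidate name would be accepted after `def`; the helper name `is_valid_function_name` is made up for this illustration, while `str.isidentifier()` and `keyword.iskeyword()` are standard-library calls.

```python
import keyword


def is_valid_function_name(name: str) -> bool:
    """Return True if `name` is a legal identifier and not a reserved keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


# Names starting with a digit are rejected; an underscore prefix is accepted.
for candidate in ("3utr", "three_utr", "_3utr", "utr3", "def"):
    print(candidate, is_valid_function_name(candidate))
```

Note that `isidentifier()` alone returns True for keywords such as `def`, which is why the sketch also consults the `keyword` module.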

One workaround is to use Roman numerals:
>>> def xxiv():
...     print("ok\n")
...
>>> xxiv()
ok

If you really want to be distinctive.
You can add '_' in front of the identifier, for instance:
def _3utr():
Then call the function as _3utr().


How to substitute a repeating character with the same number of a different character in regex python?

Assume there's a string
"An example striiiiiing with other words"
I need to replace the 'i's with '*'s like 'str******ng'. The number of '*' must be same as 'i'. This replacement should happen only if there are consecutive 'i' greater than or equal to 3. If the number of 'i' is less than 3 then there is a different rule for that. I can hard code it:
import re
text = "An example striiiiing with other words"
out_put = re.sub(re.compile(r'i{3}', re.I), r'*'*3, text)
# An example str***iing with other words
But the number of i's could be any number greater than 3. How can we do that using regex?
The i{3} pattern matches exactly three consecutive i's anywhere in the string. You need i{3,} to match three or more i's. However, to make it all work, you need to pass a callable as the replacement argument to re.sub; it receives each match object, so you can take the length of the matched text and repeat the replacement character that many times.
Also, it is advisable to declare the regex outside of re.sub, or just use a string pattern since patterns are cached.
Here is the code that fixes the issue:
import re
text = "An example striiiiing with other words"
rx = re.compile(r'i{3,}', re.I)
out_put = rx.sub(lambda x: '*' * len(x.group()), text)
# => An example str*****ng with other words
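The same idea can be wrapped in a small reusable function. This is only a sketch under stated assumptions: the name `mask_runs` and its parameters are invented for illustration, and `char` is assumed to be a single character.

```python
import re


def mask_runs(text, char="i", min_run=3, mask="*"):
    """Replace each run of `min_run` or more `char` with the same number of `mask`."""
    # re.escape keeps characters such as '.' or '*' literal inside the pattern.
    pattern = re.compile(re.escape(char) + "{%d,}" % min_run, re.IGNORECASE)
    # The callable receives a match object; its length gives the run length.
    return pattern.sub(lambda m: mask * len(m.group()), text)


text = "An example str" + "i" * 5 + "ng with other words"
print(mask_runs(text))  # An example str*****ng with other words
```

Shorter runs, such as the single "i" in "with", are left untouched, so a different rule can still be applied to them in a separate pass.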

python converting strings into three blocks and if not two blocks

I want to write a function that converts the given string T and group them into three blocks.
However, I want to split the last block into two if it can't be broken down to three numbers.
For example, this is my code
import re
def num_format(T):
    clean_number = re.sub('[^0-9]+', '', T)
    formatted_number = re.sub(r"(\d{3})(?=(\d{3})+(?!\d{3}))", r"\1-", clean_number)
    return formatted_number
this returns : '055-537-000-2654' as a result.
However, I want it to be '055-537-000-26-54'.
I used the regular expression, but have no idea how to split the last remaining numbers into two blocks!
I would really appreciate helping me to figure this problem out!!
Thanks in advance.
You can use
def num_format(T):
    clean_number = ''.join(c for c in T if c.isdigit())
    return re.sub(r'(\d{3})(?=\d{2})|(?<=\d{2})(?=\d{2}$)', r'\1-', clean_number)
See the regex demo.
Note you can get rid of all non-numeric chars using plain Python comprehension, the solution is borrowed from Removing all non-numeric characters from string in Python.
The regex matches
(\d{3}) - Group 1 (\1): three digits...
(?=\d{2}) - followed with two digits
| - or
(?<=\d{2})(?=\d{2}$) - a location between a two-digit sequence and the two digits at the end of the string.
See the Python demo:
import re
def num_format(T):
    clean_number = ''.join(c for c in T if c.isdigit())
    return re.sub(r'(\d{3})(?=\d{2})|(?<=\d{2})(?=\d{2}$)', r'\1-', clean_number)
# => 055-537-000-26-54
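For comparison, the grouping can also be done without a regex. The sketch below is an alternative, not the method from the answer above. It assumes that only a trailing single digit has to be rebalanced (the last four digits become 2 + 2) and that a trailing pair stays as it is. The function name `num_format_blocks` and the sample input are hypothetical.

```python
def num_format_blocks(raw):
    """Group the digits of `raw` in threes; split a trailing 3 + 1 into 2 + 2."""
    digits = "".join(c for c in raw if c.isdigit())
    # Cut into blocks of three, starting from the left.
    blocks = [digits[i:i + 3] for i in range(0, len(digits), 3)]
    # A lone trailing digit is merged with the previous block and re-split as 2 + 2.
    if len(blocks) > 1 and len(blocks[-1]) == 1:
        last_four = blocks[-2] + blocks[-1]
        blocks[-2:] = [last_four[:2], last_four[2:]]
    return "-".join(blocks)


print(num_format_blocks("055 537 000 2654"))  # 055-537-000-26-54
```

Keeping the block logic in plain Python makes the edge cases (digit counts that are a multiple of three, or that leave two digits over) explicit and easy to test.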

Python - Replacing repeated consonants with other values in a string

I want to write a function that, given a string, returns a new string in which each run of two or more of the same consonant is replaced with the same run, except that its first consonant is replaced with the character 'm'.
The explanation was probably very confusing, so here are some examples:
"hello world" should return "hemlo world"
"Hannibal" should return "Hamnibal"
"error" should return "emror"
"although" should return "although" (returns the same string because none of the characters are repeated in a sequence)
"bbb" should return "mbb"
I looked into using regex but wasn't able to achieve what I wanted. Any help is appreciated.
Thank you in advance!
Regex is probably the best tool for the job here. The 'correct' expression is
import re
test = """
hello world
Hannibal
error
although
bbb
"""
output = re.sub(r'(.)\1+', lambda g: f'm{g.group(0)[1:]}', test)
# '''
# hemlo world
# Hamnibal
# emror
# although
# mbb
# '''
The only really complicated part of this is the lambda that we give as an argument. re.sub() can accept one as its 'replacement criteria' - it gets passed a match object (on which we call .group(0) to get the full match, i.e. all of the repeated letters) and should return the string with which to replace whatever was matched. Here, we use it to output the character 'm' followed by the second character onwards of the match, in an f-string.
The regex itself is pretty straightforward as well. Any character (.), then the same character (\1) again one or more times (+). If you wanted just alphanumerics (i.e. not to replace duplicate whitespace characters), you could use (\w) instead of (.)
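Since the question asks about consonants only, the capturing group can be narrowed to a consonant character class. This is a hedged variant of the expression above, assuming plain ASCII letters and treating "y" as a consonant; the names `CONSONANT_RUN` and `replace_repeated_consonants` are illustrative.

```python
import re

# One ASCII consonant, followed by one or more repeats of that same letter.
CONSONANT_RUN = re.compile(r"([b-df-hj-np-tv-z])\1+", re.IGNORECASE)


def replace_repeated_consonants(text):
    """Replace the first letter of every repeated-consonant run with 'm'."""
    return CONSONANT_RUN.sub(lambda m: "m" + m.group(0)[1:], text)


for word in ("hello world", "Hannibal", "error", "although", "bbb", "book"):
    print(word, "->", replace_repeated_consonants(word))
```

With this class, doubled vowels (as in "book") and repeated whitespace are left alone, which `(.)` and `(\w)` would not guarantee.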

Python3 and combining Diacritics

I've been having a problem with Unicode in python3 and I can't seem to understand why that's happening.
symbol= "ῇ̣"
This letter comes from a word: ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ where I have combining diacritical marks. I want to do the statistical analysis in Python 3 and store the results in a database, the thing is that I also store the character's position (index) in the text. The database-application correctly counts the symbol-variable in the example as one-character, whereas Python counts it as two - throwing off the entire indexing.
The project requires me to keep the diacritics, so I can't simply ignore them or do a .replace("combining diacritical mark","") on the string.
Since Python3 has unicode as default for strings I'm a bit dumbfounded by this.
I have tried to use the base(), strip(), and strip_length() methods from Greek-accentuation, but that's not helping either.
Project requirements are:
Detect the alphabet belonging to the character (OK)
Store string-positions (needed for highlighting in the database) (NotOK)
Be able to process multiple languages/alphabets mixed in one string. (OK)
Iterate over CSV-input. (OK)
Ignore set of predefined strings (OK)
Ignore set of strings that match certain conditions (OK)
This is the simplified code for this project:
# -*- coding: utf-8 -*-
import csv
from alphabet_detector import AlphabetDetector
ad = AlphabetDetector()
with open("tbltext.csv", "r", encoding="utf8") as txt:
    data = csv.reader(txt)
    for row in data:
        text = row[1]
        ### Here I have some string manipulation (lowering everything, replacing the predefined set of strings by equal-length '-',...)
        ###then I use the ad-module to detect the language by looping over my characters, this is where it goes wrong.
        for letter in text:
            lang = ad.detect_alphabet(letter)
If I use the word: ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ as example with a forloop; my result is:
>>> word = "ἐ̣ν̣τ̣ῇ̣[αὐτ]ῇ"
>>> for letter in word:
...     print(letter)
How can I make Python see letters with a combining diacritical mark as one letter instead of making it print the letter and the diacritical mark separately?
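The string has a length of 2, so this is correct; it contains two code points:
>>> list(hex(ord(c)) for c in symbol)
['0x1fc7', '0x323']
>>> import unicodedata
>>> list(unicodedata.name(c) for c in symbol)
['GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI', 'COMBINING DOT BELOW']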
So you should not use len to count the characters.
You could count the characters that are non-combining, so:
>>> import unicodedata
>>> len(''.join(ch for ch in symbol if unicodedata.combining(ch) == 0))
1
From: How do I get the "visible" length of a combining Unicode string in Python? (but I ported it to python3).
But this is also not the optimal solution, depending on why you are counting characters. I think it is enough in your case, but fonts can merge characters into ligatures. In some languages these ligatures are visually new (and very different) characters, unlike ligatures in Western languages.
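For the indexing problem in the question, the counting idea can be extended to group each base character with the combining marks that follow it. This is a rough sketch, not full Unicode grapheme segmentation: it relies only on `unicodedata.combining()`, and the function name `split_clusters` is made up for illustration.

```python
import unicodedata


def split_clusters(text):
    """Group every base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the previous base character
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


symbol = "\u1fc7\u0323"            # eta with perispomeni and ypogegrammeni + dot below
print(len(symbol))                 # 2
print(len(split_clusters(symbol))) # 1

# Positions counted per cluster rather than per code point:
for index, cluster in enumerate(split_clusters("e\u0301a")):
    print(index, [hex(ord(c)) for c in cluster])
```

Iterating over the clusters instead of the raw string gives one position per visible letter, which is closer to what the database in the question appears to count.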
As a last comment: I think you should normalize strings. With the above code it doesn't matter in this case, but in other cases you may get different results, especially if someone used compatibility characters (e.g. mu for units, or Eszett, instead of the true Greek characters).
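A short illustration of what normalization changes, using `unicodedata.normalize()` from the standard library. The sample strings are chosen for this sketch only; which normalization form suits the project is a separate decision.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two code points, one visible letter.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1

# MICRO SIGN (a compatibility character) versus GREEK SMALL LETTER MU.
micro = "\u00b5"
greek_mu = "\u03bc"
print(micro == greek_mu)                                 # False
print(unicodedata.normalize("NFKC", micro) == greek_mu)  # True
```

NFC only composes canonically equivalent sequences, while NFKC also folds compatibility characters, so the two forms can yield different lengths and different comparison results for the same input.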

Count word occurrences in R

Is there a function for counting the number of times a particular keyword is contained in a dataset?
For example, if dataset <- c("corn", "cornmeal", "corn on the cob", "meal") the count would be 3.
Let's for the moment assume you wanted the number of elements containing "corn":
length(grep("corn", dataset))
[1] 3
After you get the basics of R down better you may want to look at the "tm" package.
EDIT: I realize that this time around you wanted any-"corn" but in the future you might want to get word-"corn". Over on r-help Bill Dunlap pointed out a more compact grep pattern for gathering whole words:
grep("\\<corn\\>", dataset)
Another quite convenient and intuitive way to do it is to use the str_count function of the stringr package:
library(stringr)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for mere occurrences of the pattern:
str_count(dataset, "corn")
# [1] 1 1 1 0
# for occurrences of the word alone:
str_count(dataset, "\\bcorn\\b")
# [1] 1 0 1 0
# summing it up
sum(str_count(dataset, "corn"))
# [1] 3
You can also do it with string division (the %s/% operator is not part of base R; it comes from an add-on package, which has to be loaded first):
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for each vector element:
dataset %s/% 'corn'
# for everything:
sum(dataset %s/% 'corn')
You can use the str_count function from the stringr package to get the number of keywords that match a given character vector.
The pattern argument of the str_count function accepts a regular expression that can be used to specify the keyword.
The regular expression syntax is very flexible and allows matching whole words as well as character patterns.
For example the following code will count all occurrences of the string "corn" and will return 3:
sum(str_count(dataset, regex("corn")))
To match complete words use:
sum(str_count(dataset, regex("\\bcorn\\b")))
The "\b" is used to specify a word boundary. When using str_count function, the default definition of word boundary includes apostrophe. So if your dataset contains the string "corn's", it would be matched and included in the result.
This is because the apostrophe is considered a word boundary by default. To prevent words containing an apostrophe from being counted, use the regex function with the parameter uword = T. This will cause the regular expression engine to use the Unicode TR 29 definition of word boundaries, which does not consider the apostrophe a word boundary.
The following code will give the number of times the word "corn" occurs. Words such as "corn's" will not be included.
sum(str_count(dataset, regex("\\bcorn\\b", uword = T)))
