Regex Pattern Matching -a substring in words in CSV File - python-3.x

'Neighborhood,eattend10,eattend11,eattend12,eattend13,mattend10,mattend11,mattend12,mattend13,
hsattend10,hsattend11,hsattend12,hsattend13,eenrol11,eenrol12,eenrol13,menrol11,menrol12,
menrol13,hsenrol11,hsenrol12,hsenrol13,aastud10,aastud11,aastud12,aastud13,wstud10,wstud11,
wstud12,wstud13,hstud10,hstud11,hstud12,hstud13,abse10,abse11,abse12,abse13,absmd10,absmd11,
absmd12,absmd13,abshs10,abshs11,abshs12,abshs13,susp10,susp11,susp12,susp13,farms10,farms11,
farms12,farms13,sped10,sped11,sped12,sped13,ready11,ready12,ready13,math310,math311,math312,
math313,read310,read311,read312,read313,math510,math511,math512,math513,read510,read511,read512,
read513,math810,math811,math812,math813,read810,read811,read812,read813,hsaeng10,hsaeng11,
hsaeng12,hsaeng13,hsabio10,hsabio11,hsabio12,hsabio13,hsagov10,hsagov11,hsagov13,hsaalg10,
hsaalg11,hsaalg12,hsaalg13,drop10,drop11,drop12,drop13,compl10,compl11,compl12,compl13,
sclsw11,sclsw12,sclsw13,sclemp13'
I have this data set. I need to know how many "drop" words there are and print them.
The same should work for any other word, such as mattend.
I tried using findall, but I don't think that's correct.
I assume we can use re.search or re.match.
How can I do this with a regex?

You can use len() on re.findall() to get the length of the returned list:
import re
with open('example.csv') as f:
    data = f.read().strip()
print(len(re.findall('drop', data)))

I think re.findall should be correct.
From python re module documentation:
Search:
Scan through string looking for the first location where this regular expression produces a match, and return a corresponding match object.
Match:
If zero or more characters at the beginning of string match this regular expression, return a corresponding match object.
Findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
I tried it on your example (with the file contents in a string named text) and it worked for me:
re.findall("drop", text)
If you want to capture the digits after it, you can try something like:
re.findall(r"drop\d*", text)
If you want to count the words, you can use:
len(re.findall(r"drop\d*", text))
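As a minimal sketch of the findall approach above, the snippet below counts and prints the column names that start with a given word. The shortened header string and the helper name find_columns are assumptions made for illustration; the \b word boundaries are an addition so that a prefix is not matched in the middle of a longer name.

```python
import re

# Shortened sample of the header line from the question (assumption: the
# real file would be read with open(...).read() instead).
header = "Neighborhood,eattend10,mattend10,mattend11,drop10,drop11,drop12,drop13,compl10"

def find_columns(prefix, text):
    """Return every whole word that is `prefix` followed by optional digits."""
    # re.escape guards against regex metacharacters in the prefix;
    # \b keeps "mattend" from matching inside some longer name.
    pattern = r"\b" + re.escape(prefix) + r"\d*\b"
    return re.findall(pattern, text)

drops = find_columns("drop", header)
print(len(drops), drops)

mattends = find_columns("mattend", header)
print(len(mattends), mattends)
```

Because findall returns a plain list, len() gives the count and the list itself holds the matched words.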

Related

pass regex group to function for substituting [duplicate]

I have a string S = '02143' and a list A = ['a','b','c','d','e']. I want to replace all those digits in 'S' with their corresponding element in list A.
For example, replace 0 with A[0], 2 with A[2] and so on. Final output should be S = 'acbed'.
I tried:
S = re.sub(r'([0-9])', A[int(r'\g<1>')], S)
However this gives an error ValueError: invalid literal for int() with base 10: '\\g<1>'. I guess it is considering backreference '\g<1>' as a string. How can I solve this especially using re.sub and capture-groups, else alternatively?
The reason re.sub(r'([0-9])', A[int(r'\g<1>')], S) does not work is that the \g<1> backreference (an unambiguous way of writing \1) is only interpreted when it appears in the replacement string passed to re.sub. If you pass it to another function, that function just sees the literal string \g<1>, because the re module has no chance to evaluate it at that point. The re engine only expands backreferences while processing a match, but the A[int(r'\g<1>')] expression is evaluated before re.sub is even called.
That is why it is made possible to use callback methods inside re.sub as the replacement argument: you may pass the matched group values to any external methods for advanced manipulation.
See the re documentation:
re.sub(pattern, repl, string, count=0, flags=0)
If repl is a function, it is called for every non-overlapping
occurrence of pattern. The function takes a single match object
argument, and returns the replacement string.
Use
import re
S = '02143'
A = ['a','b','c','d','e']
print(re.sub(r'[0-9]',lambda x: A[int(x.group())],S))
See the Python demo
Note you do not need to capture the whole pattern with parentheses, you can access the whole match with x.group().
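The same callback idea can be sketched with a named function instead of a lambda, which leaves room for extra checks. The function name digit_to_letter and the choice to leave out-of-range digits unchanged are assumptions for illustration, not part of the original answer.

```python
import re

A = ['a', 'b', 'c', 'd', 'e']

def digit_to_letter(match):
    """Replacement callback: re.sub calls this once per match object."""
    index = int(match.group())
    # Assumption: digits with no corresponding element in A are kept as-is
    # rather than raising IndexError.
    if index < len(A):
        return A[index]
    return match.group()

S = re.sub(r'[0-9]', digit_to_letter, '02143')
print(S)
```

A named function is easier to test on its own than a lambda, but the behavior inside re.sub is the same.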

How to substitute a repeating character with the same number of a different character in regex python?

Assume there's a string
"An example striiiiiing with other words"
I need to replace the 'i's with '*'s like 'str******ng'. The number of '*' must be same as 'i'. This replacement should happen only if there are consecutive 'i' greater than or equal to 3. If the number of 'i' is less than 3 then there is a different rule for that. I can hard code it:
import re
text = "An example striiiiing with other words"
out_put = re.sub(re.compile(r'i{3}', re.I), r'*'*3, text)
print(out_put)
# An example str***iing with other words
But the number of i's could be any number greater than 3. How can we do that using a regex?
The i{3} pattern matches exactly three consecutive i characters anywhere in the string. You need i{3,} to match three or more. To make it all work, pass a callable as the replacement argument to re.sub: it receives the match object, so you can take the length of the matched text and repeat the '*' that many times.
Also, it is advisable to declare the regex outside of re.sub, or just use a string pattern since patterns are cached.
Here is the code that fixes the issue:
import re
text = "An example striiiiing with other words"
rx = re.compile(r'i{3,}', re.I)
out_put = rx.sub(lambda x: r'*'*len(x.group()), text)
print(out_put)
# => An example str*****ng with other words
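The question also mentions a different rule for runs shorter than three. One way to sketch that, assuming a hypothetical helper named mask_runs, is to match every run with + and branch on its length inside the callback. The short-run branch below simply returns the run unchanged, as a placeholder for whatever the other rule is.

```python
import re

def mask_runs(text, char="i", min_run=3, mask="*"):
    """Replace runs of `char` that are at least `min_run` long with `mask`."""
    pattern = re.compile(re.escape(char) + "+", re.IGNORECASE)

    def repl(match):
        run = match.group()
        if len(run) >= min_run:
            # Same number of mask characters as matched characters.
            return mask * len(run)
        # Placeholder (assumption): shorter runs are left untouched here;
        # the "different rule" from the question would go in this branch.
        return run

    return pattern.sub(repl, text)

print(mask_runs("An example striiiiing with other words"))
```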

How can i use lambda function and re.search to get substrings from a list of filenames in python

I have a list of filenames from a certain directory:
list_files = ["filename_ew1_234_rt", "filename_ew1_456_rt", "filename_ew1_78946464_rt"]
I am trying to use re.search on it as follows:
filtered_values = list(filter(lambda v: re.search('.*(ew1.+rt)', v), list_files))
Now when I print filtered_values, it prints the entire filenames again. How can I get it to print only a certain part of each filename?
Here is what I see:
filename_ew1_234_rt
filename_ew1_456_rt
filename_ew1_78946464_rt
Instead I would like to get:
ew1_234_rt
ew1_456_rt
ew1_78946464_rt
How can I do that?
filter keeps each original element for which the lambda returns a truthy value, so you get the full filenames back. Instead, you can use a list comprehension with two for clauses and re.match, extracting the group 1 value from each match.
import re
list_files = ["filename_ew1_234_rt", "filename_ew1_456_rt", "filename_ew1_78946464_rt", "test"]
res = [m.group(1) for file in list_files for m in [re.match(r".*(ew1.+rt)", file)] if m]
print(res)
Output
['ew1_234_rt', 'ew1_456_rt', 'ew1_78946464_rt']
Note that for the current examples, the pattern ew1.+rt could also be written more specifically, matching the underscores and the digits:
.*_(ew1_\d+_rt)$
See a Regex demo.
filter() yields the elements that satisfy the condition you set in the lambda, i.e. those for which it returns a truthy value. If the lambda returns None, that is treated as False; any match object is treated as True. Do you see the problem here? re.search() returns either a match object or None, but filter only uses that as a yes/no test and hands back the original string, not the matched text.
A simpler approach is to do this:
import re
list_files = ["filename_ew1_234_rt", "filename_ew1_456_rt", "filename_ew1_78946464_rt"]
generated = [re.search(r'(ew1.+rt)', v) for v in list_files]
filtered = [i.group() for i in generated if i is not None]
print(filtered)
You can use a basic list comprehension to get the search result for each element, and if a match was found (i.e. the result is not None), call .group() on the match object to get the matched text.
Or, if all the filenames start the same way, you could just slice the prefix off:
list_files = ["filename_ew1_234_rt", "filename_ew1_456_rt", "filename_ew1_78946464_rt"]
for filtered in list_files:
    print(filtered[9:])
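To make the difference between filtering and extracting concrete, here is a small sketch that runs both on the same list. It assumes Python 3.8 or later for the := assignment expression, and the extra "test" entry is added only to show a non-matching name.

```python
import re

list_files = ["filename_ew1_234_rt", "filename_ew1_456_rt", "test"]
pattern = re.compile(r"ew1.+rt")

# filter() only uses the match object as a truth value, so it keeps the
# ORIGINAL strings that contain a match.
kept = list(filter(pattern.search, list_files))

# Extraction: bind the match object with := (Python 3.8+) and take the
# matched text from it.
extracted = [m.group() for name in list_files if (m := pattern.search(name))]

print(kept)
print(extracted)
```

On older Python versions, the two-step comprehension shown above does the same job without :=.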

find regex expression based character match

I have a list of strings something like this:
a=['bukt/id=gdhf/year=989/month=98/day=12/hgjhg.csv','bukt/id=76fhfh/year=989/month=08/day=128/hkngjhg.csv']
The ids are unique. I want an output list like this:
output_list = ['bukt/id=gdhf/','bukt/id=76fhfh/']
So basically I need a regular expression that matches any id and removes the rest of the string.
How can I do that in the most efficient way, considering the input list has more than 100K entries?
import re
rgx = re.compile(r'(bukt/id=[a-zA-Z0-9]+/).+')
output_list = [rgx.search(s).group(1) for s in a]
The result will be in group 1. This captures "bukt/id=", followed by any alphanumeric characters and then a slash, and throws away the rest.
There's no need for regex, you can just split your string on /, discard everything after the second / and then join again with /:
a=['bukt/id=gdhf/year=989/month=98/day=12/hgjhg.csv','bukt/id=76fhfh/year=989/month=08/day=128/hkngjhg.csv']
out = ['/'.join(u.split('/')[:2]) for u in a]
print(out)
Output:
['bukt/id=gdhf', 'bukt/id=76fhfh']
If you want the trailing /, just add an empty string to the end of the split array:
out = ['/'.join(u.split('/')[:2] + ['']) for u in a]
Output:
['bukt/id=gdhf/', 'bukt/id=76fhfh/']
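For a long input list, a sketch like the following compiles the pattern once and skips entries that do not match instead of raising AttributeError on None. The helper name id_prefixes and the looser pattern [^/]+/id=[^/]+/ are assumptions for illustration and are not taken from the answers above.

```python
import re

# Assumption: the prefix is "<bucket>/id=<anything without a slash>/".
ID_PREFIX = re.compile(r'[^/]+/id=[^/]+/')

def id_prefixes(paths):
    """Return the leading 'bucket/id=.../' part of each path that has one."""
    result = []
    for path in paths:
        match = ID_PREFIX.match(path)  # match() anchors at the start
        if match:
            result.append(match.group())
    return result

a = ['bukt/id=gdhf/year=989/month=98/day=12/hgjhg.csv',
     'bukt/id=76fhfh/year=989/month=08/day=128/hkngjhg.csv',
     'not-a-matching-path']
print(id_prefixes(a))
```

Whether this or the split-based version is faster depends on the data, so it is worth timing both on a real sample.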

Doubts about string

I'm doing an exercise in Python, and I tried stepping through it in the terminal to understand what's happening, but I couldn't figure it out.
Mainly, I want to understand why the reference solution returns only index 0.
Isn't checking 'casino' in 'Casinoville'.lower() the same thing?
Exercise:
Takes a list of documents (each document is a string) and a keyword.
Returns list of the index values into the original list for all documents containing the keyword.
Exercise solution
def word_search(documents, keyword):
    indices = []
    for i, doc in enumerate(documents):
        tokens = doc.split()
        normalized = [token.rstrip('.,').lower() for token in tokens]
        if keyword.lower() in normalized:
            indices.append(i)
    return indices
My solution
def word_search(documents, keyword):
    return [i for i, word in enumerate(documents) if keyword.lower() in word.rstrip('.,').lower()]
Run
>>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
Expected output
>>> word_search(doc_list, 'casino')
[0]
Actual output
>>> word_search(doc_list, 'casino')
[0, 2]
Let's try to understand the difference.
The reference solution can be written as a list comprehension:
def word_search(documents, keyword):
    return [i for i, word in enumerate(documents)
            if keyword.lower() in
            [token.rstrip('.,').lower() for token in word.split()]]
The problem happens with the string : "Casinoville" at index 2.
See the output:
print([token.rstrip('.,').lower() for token in doc_list[2].split()])
# ['casinoville']
And here is the point: the reference solution checks whether the keyword is in the list of words. The answer is True only if the keyword equals one of the words exactly (this gives the expected output).
However, in your solution, you only check whether the document contains the keyword as a substring. In this case, the in test is applied to the string itself and not to a list.
See it:
# On the list :
print('casino' in [token.rstrip('.,').lower() for token in doc_list[2].split()])
# False
# On the string:
print('casino' in [token.rstrip('.,').lower() for token in doc_list[2].split()][0])
# True
As a result, "Casinoville" isn't included in the first case, while it is in the second.
Hope that helps!
The exercise says: "Returns list of the index values into the original list for all documents containing the keyword".
You need to consider whole words only.
In the "Casinoville" case, the word "casino" is not in the document, since the only word it contains is "Casinoville".
When you use the in operator, the result depends on the type of object on the right hand side. When it's a list (or most other kinds of containers), you get an exact membership test. So 'casino' in ['casino'] is True, but 'casino' in ['casinoville'] is False because the strings are not equal.
When the right hand side of in is a string though, it does something different. Rather than looking for an exact match against a single character (which is what strings contain if you think of them as sequences), it does a substring match. So 'casino' in 'casinoville' is True, as would be 'casino' in 'montecasino' or 'casino' in 'foocasinobar' (it's not just prefixes that are checked).
For your problem, you want exact matches to whole words only. The reference solution uses str.split to separate words (with no argument, it splits on any kind of whitespace). It then cleans up the words a bit (stripping off trailing punctuation marks), then does an in match against the list of strings.
Your code never splits the strings you are passed. So when you do an in test, you're doing a substring match on the whole document, and you'll get false positives when you match part of a larger word.
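Since the rest of this page is about regular expressions, here is a sketch of a whole-word search that uses \b word boundaries instead of splitting. This is a swapped-in technique, not the exercise's reference solution, and it treats any non-word character as a boundary, so it is more permissive about punctuation than stripping only '.' and ','.

```python
import re

def word_search(documents, keyword):
    """Return indices of documents that contain `keyword` as a whole word."""
    # \b matches between a word character and a non-word character, so
    # "casino" is found in "Casino." but not inside "Casinoville".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [i for i, doc in enumerate(documents) if pattern.search(doc)]

doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
print(word_search(doc_list, 'casino'))
```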
