How can i use lambda function and re.search to get substrings from a list of filenames in python - python-3.x

I have a list of filenames from a certain directory ,
list_files = [filename_ew1_234_rt, filename_ew1_456_rt, filename_ew1_78946464_rt]
I am trying to use re.search on this as follows
filtered_values = list(filter(lambda v: re.search('.*(ew1.+rt)', v), list_files))
Now when I print filtered values it prints the entire filenames again, how can i get it to print only certain part of filename
Here is what i see
filename_ew1_234_rt
filename_ew1_456_rt
filename_ew1_78946464_rt
Instead i would like to get
ew1_234_rt
ew1_456_rt
ew1_78946464_rt
How can i do that?

Instead of using filter, which will have the same value if the lambda returns true, you can use 2 for comprehensions and re.match extracting the group 1 value.
import re
list_files = ["filename_ew1_234_rt", "filename_ew1_456_rt", "filename_ew1_78946464_rt", "test"]
res = [m.group(1) for file in list_files for m in [re.match(r".*(ew1.+rt)", file)] if m]
print(res)
Output
['ew1_234_rt', 'ew1_456_rt', 'ew1_78946464_rt']
Note that the pattern ew1.+rt for the current examples might also be written a bit more specific matching the underscores and the digits:
.*(_ew1_\d+_rt)$
See a Regex demo.

filter() returns a list of elements which satisfy the condition you set in the lambda i.e. which return true. If the condition returns None, it's interpreted as False, but anything else is True. Do you see the problem here? re.search() returns a match object, which may or may not be None, but this match object won't be the result of the search.
A simpler and better approach is simply to do this:
import re
list_files = ["filename_ew1_234_rt", "filename_ew1_456_rt", "filename_ew1_78946464_rt"]
generated = [re.search(r'(ew1.+rt)', v) for v in list_files]
filtered = [i.group() for i in generated if i != None]
print(filtered)
You can use a basic list comprehension to get the search results from each element, and if the result was found (i.e. not None) you can group the match object to get the result.

or if all the filenames start the same way you could just slice it out.
list_files = ["filename_ew1_234_rt", "filename_ew1_456_rt", "filename_ew1_78946464_rt"]
for filtered in list_files:
print(filtered[9:])

Related

trouble with tripling letters [duplicate]

How can I iterate over a string in Python (get each character from the string, one at a time, each time through a loop)?
As Johannes pointed out,
for c in "string":
#do something with c
You can iterate pretty much anything in python using the for loop construct,
for example, open("file.txt") returns a file object (and opens the file), iterating over it iterates over lines in that file
with open(filename) as f:
for line in f:
# do something with line
If that seems like magic, well it kinda is, but the idea behind it is really simple.
There's a simple iterator protocol that can be applied to any kind of object to make the for loop work on it.
Simply implement an iterator that defines a next() method, and implement an __iter__ method on a class to make it iterable. (the __iter__ of course, should return an iterator object, that is, an object that defines next())
See official documentation
If you need access to the index as you iterate through the string, use enumerate():
>>> for i, c in enumerate('test'):
... print i, c
...
0 t
1 e
2 s
3 t
Even easier:
for c in "test":
print c
Just to make a more comprehensive answer, the C way of iterating over a string can apply in Python, if you really wanna force a square peg into a round hole.
i = 0
while i < len(str):
print str[i]
i += 1
But then again, why do that when strings are inherently iterable?
for i in str:
print i
Well you can also do something interesting like this and do your job by using for loop
#suppose you have variable name
name = "Mr.Suryaa"
for index in range ( len ( name ) ):
print ( name[index] ) #just like c and c++
Answer is
M r . S u r y a a
However since range() create a list of the values which is sequence thus you can directly use the name
for e in name:
print(e)
This also produces the same result and also looks better and works with any sequence like list, tuple, and dictionary.
We have used tow Built in Functions ( BIFs in Python Community )
1) range() - range() BIF is used to create indexes
Example
for i in range ( 5 ) :
can produce 0 , 1 , 2 , 3 , 4
2) len() - len() BIF is used to find out the length of given string
If you would like to use a more functional approach to iterating over a string (perhaps to transform it somehow), you can split the string into characters, apply a function to each one, then join the resulting list of characters back into a string.
A string is inherently a list of characters, hence 'map' will iterate over the string - as second argument - applying the function - the first argument - to each one.
For example, here I use a simple lambda approach since all I want to do is a trivial modification to the character: here, to increment each character value:
>>> ''.join(map(lambda x: chr(ord(x)+1), "HAL"))
'IBM'
or more generally:
>>> ''.join(map(my_function, my_string))
where my_function takes a char value and returns a char value.
Several answers here use range. xrange is generally better as it returns a generator, rather than a fully-instantiated list. Where memory and or iterables of widely-varying lengths can be an issue, xrange is superior.
You can also do the following:
txt = "Hello World!"
print (*txt, sep='\n')
This does not use loops but internally print statement takes care of it.
* unpacks the string into a list and sends it to the print statement
sep='\n' will ensure that the next char is printed on a new line
The output will be:
H
e
l
l
o
W
o
r
l
d
!
If you do need a loop statement, then as others have mentioned, you can use a for loop like this:
for x in txt: print (x)
If you ever run in a situation where you need to get the next char of the word using __next__(), remember to create a string_iterator and iterate over it and not the original string (it does not have the __next__() method)
In this example, when I find a char = [ I keep looking into the next word while I don't find ], so I need to use __next__
here a for loop over the string wouldn't help
myString = "'string' 4 '['RP0', 'LC0']' '[3, 4]' '[3, '4']'"
processedInput = ""
word_iterator = myString.__iter__()
for idx, char in enumerate(word_iterator):
if char == "'":
continue
processedInput+=char
if char == '[':
next_char=word_iterator.__next__()
while(next_char != "]"):
processedInput+=next_char
next_char=word_iterator.__next__()
else:
processedInput+=next_char

Regex Pattern Matching -a substring in words in CSV File

'Neighborhood,eattend10,eattend11,eattend12,eattend13,mattend10,mattend11,mattend12,mattend13,
hsattend10,hsattend11,hsattend12,hsattend13,eenrol11,eenrol12,eenrol13,menrol11,menrol12,
menrol13,hsenrol11,hsenrol12,hsenrol13,aastud10,aastud11,aastud12,aastud13,wstud10,wstud11,
wstud12,wstud13,hstud10,hstud11,hstud12,hstud13,abse10,abse11,abse12,abse13,absmd10,absmd11,
absmd12,absmd13,abshs10,abshs11,abshs12,abshs13,susp10,susp11,susp12,susp13,farms10,farms11,
farms12,farms13,sped10,sped11,sped12,sped13,ready11,ready12,ready13,math310,math311,math312,
math313,read310,read311,read312,read313,math510,math511,math512,math513,read510,read511,read512,
read513,math810,math811,math812,math813,read810,read811,read812,read813,hsaeng10,hsaeng11,
hsaeng12,hsaeng13,hsabio10,hsabio11,hsabio12,hsabio13,hsagov10,hsagov11,hsagov13,hsaalg10,
hsaalg11,hsaalg12,hsaalg13,drop10,drop11,drop12,drop13,compl10,compl11,compl12,compl13,
sclsw11,sclsw12,sclsw13,sclemp13\
I have this data set. I need to know how many drop words are there and print them.
Or similarly for any word like mattend and print those.
I tried using findall but I think that's not correct
I assume we can use re.search or re.match.
How can I do it in RegEx?
You can use len() on re.findall() to get the length of the returned list:
import re
with open('example.csv') as f:
data = f.read().strip()
print(len(re.findall('drop',data)))
I think re.findall should be correct.
From python re module documentation:
Search:
Scan through string looking for the first location where this regular expression produces a match, and return a corresponding match object.
Match:
If zero or more characters at the beginning of string match this regular expression, return a corresponding match object.
Findall:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
I tried it on your example and it worked for me:
re.findall("drop", str)
If you want to see digits after it you can try something like:
re.findall("drop\d*", str)
If you want to count the words you can use:
len(re.findall("drop\d*", str))

Python - how to recursively search a variable substring in texts that are elements of a list

let me explain better what I mean in the title.
Examples of strings where to search (i.e. strings of variable lengths
each one is an element of a list; very large in reality):
STRINGS = ['sftrkpilotndkpilotllptptpyrh', 'ffftapilotdfmmmbtyrtdll', 'gftttepncvjspwqbbqbthpilotou', 'htfrpilotrtubbbfelnxcdcz']
The substring to find, which I know is for sure:
contained in each element of STRINGS
is also contained in a SOURCE string
is of a certain fixed LENGTH (5 characters in this example).
SOURCE = ['gfrtewwxadasvpbepilotzxxndffc']
I am trying to write a Python3 program that finds this hidden word of 5 characters that is in SOURCE and at what position(s) it occurs in each element of STRINGS.
I am also trying to store the results in an array or a dictionary (I do not know what is more convenient at the moment).
Moreover, I need to perform other searches of the same type but with different LENGTH values, so this value should be provided by a variable in order to be of more general use.
I know that the first point has been already solved in previous posts, but
never (as far as I know) together with the second point, which is the part of the code I could not be able to deal with successfully (I do not post my code because I know it is just too far from being fixable).
Any help from this great community is highly appreciated.
-- Maurizio
You can iterate over the source string and for each sub-string use the re module to find the positions within each of the other strings. Then if at least one occurrence was found for each of the strings, yield the result:
import re
def find(source, strings, length):
for i in range(len(source) - length):
sub = source[i:i+length]
positions = {}
for s in strings:
# positions[s] = [m.start() for m in re.finditer(re.escape(sub), s)]
positions[s] = [i for i in range(len(s)) if s.startswith(sub, i)] # Using built-in functions.
if not positions[s]:
break
else:
yield sub, positions
And the generator can be used as illustrated in the following example:
import pprint
pprint.pprint(dict(find(
source='gfrtewwxadasvpbepilotzxxndffc',
strings=['sftrkpilotndkpilotllptptpyrh',
'ffftapilotdfmmmbtyrtdll',
'gftttepncvjspwqbbqbthpilotou',
'htfrpilotrtubbbfelnxcdcz'],
length=5
)))
which produces the following output:
{'pilot': {'ffftapilotdfmmmbtyrtdll': [5],
'gftttepncvjspwqbbqbthpilotou': [21],
'htfrpilotrtubbbfelnxcdcz': [4],
'sftrkpilotndkpilotllptptpyrh': [5, 13]}}

Doubts about string

So, I'm doing an exercise using python, and I tried to use the terminal to do step by step to understand what's happening but I didn't.
I want to understand mainly why the conditional return just the index 0.
Looking 'casino' in [Casinoville].lower() isn't the same thing?
Exercise:
Takes a list of documents (each document is a string) and a keyword.
Returns list of the index values into the original list for all documents containing the keyword.
Exercise solution
def word_search(documents, keyword):
indices = []
for i, doc in enumerate(documents):
tokens = doc.split()
normalized = [token.rstrip('.,').lower() for token in tokens]
if keyword.lower() in normalized:
indices.append(i)
return indices
My solution
def word_search(documents, keyword):
return [i for i, word in enumerate(doc_list) if keyword.lower() in word.rstrip('.,').lower()]
Run
>>> doc_list = ["The Learn Python Challenge Casino.", "They bought a car", "Casinoville"]
Expected output
>>> word_search(doc_list, 'casino')
>>> [0]
Actual output
>>> word_search(doc_list, 'casino')
>>> [0, 2]
Let's try to understand the difference.
The "result" function can be written with list-comprehension:
def word_search(documents, keyword):
return [i for i, word in enumerate(documents)
if keyword.lower() in
[token.rstrip('.,').lower() for token in word.split()]]
The problem happens with the string : "Casinoville" at index 2.
See the output:
print([token.rstrip('.,').lower() for token in doc_list[2].split()])
# ['casinoville']
And here is the matter: you try to ckeck if a word is in the list. The answer is True only if all the string matches (this is the expected output).
However, in your solution, you only check if a word contains a substring. In this case, the condition in is on the string itself and not the list.
See it:
# On the list :
print('casino' in [token.rstrip('.,').lower() for token in doc_list[2].split()])
# False
# On the string:
print('casino' in [token.rstrip('.,').lower() for token in doc_list[2].split()][0])
# True
As result, in the first case, "Casinoville" isn't included while it is in the second one.
Hope that helps !
The question is "Returns list of the index values into the original list for all documents containing the keyword".
you need to consider word only.
In "Casinoville" case, word "casino" is not in, since this case only have word "Casinoville".
When you use the in operator, the result depends on the type of object on the right hand side. When it's a list (or most other kinds of containers), you get an exact membership test. So 'casino' in ['casino'] is True, but 'casino' in ['casinoville'] is False because the strings are not equal.
When the right hand side of is is a string though, it does something different. Rather than looking for an exact match against a single character (which is what strings contain if you think of them as sequences), it does a substring match. So 'casino' in 'casinoville' is True, as would be casino in 'montecasino' or 'casino' in 'foocasinobar' (it's not just prefixes that are checked).
For your problem, you want exact matches to whole words only. The reference solution uses str.split to separate words (the with no argument it splits on any kind of whitespace). It then cleans up the words a bit (stripping off punctuation marks), then does an in match against the list of strings.
Your code never splits the strings you are passed. So when you do an in test, you're doing a substring match on the whole document, and you'll get false positives when you match part of a larger word.

How to split a string using 2 parameters - python 3.5

I was wondering if there was a way to use a .split() function to split a string up using 2 parameters.
For example in the maths equation:
x^2+6x-9
Is it possible to split it using the + and -?
So that it ends up as the list:
[x^2, 6x, 9]
I think you need to use regx for solve your problem.
Since .split() returns a list, you'd have to iterate over the returned list one more time. This function (below) lets you split an arbitrary number of times, but as mentioned in another answer, you should consider using the re (regular expressions) module.
from itertools import chain
def split_n_times(s, chars):
if not chars:
# Return a single-item list if chars is empty
return [s]
lst = s.split(chars[0])
for char in chars:
# `chain` concatenates iterables
lst = chain(*(item.split(char) for item in lst))
return lst
The regex version would be considerably shorter.
This will do want you are requesting in the question
def multisplit(text, targets):
input = []
output = [text]
for target in targets:
input = output
output = []
for text in input:
output += text.split(target)
return output
Learning regex will probably help you do what you are intending to do. Play around with something like [+-][^+-]* over at https://regex101.com/ to get a better feel for how they work.

Resources