Find a specific item from a list using python - python-3.x

I have a list of 20,000 products with their descriptions.
This shows the variety of the products.
I want to write code that searches for a particular word, say 'TAPA',
and outputs all the TAPAs.
I found "Find a specific word from a list in python", but it uses startswith, which only matches at the beginning of each item, for example:
new = [x for x in df1['A'] if x.startswith('00320')]
## output ['00320671-01 Guide rail 25N/1660', '00320165S02 - Miniature rolling table']
How can I find the word when it is the second, third, or any other part of an item?
P.S. The list consists of strings, integers, and floats.

You can use string.find(substring) for this purpose. So in your case this should work:
new = [x for x in df1['A'] if x.find('00320') != -1]
The find() method returns the lowest index at which the substring is found, or -1 if it is not found.
To know more about usage of find() refer to Geeksforgeeks.com - Python String | find()
Edit 1:
As suggested by @Thierry in the comments, a cleaner way to do this is:
new = [x for x in df1['A'] if '00320' in x]
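The question notes that the list mixes strings, integers, and floats, and the `in` test above raises a TypeError on a non-string entry. A minimal sketch of one way around that, assuming a plain Python list named `products` (hypothetical sample data standing in for df1['A']) and assuming the search should ignore case:

```python
# Hypothetical sample data standing in for df1['A']; note the mixed types.
products = ["TAPA 12mm steel", "Guide rail 25N/1660", 320, 4.5, "Plastic tapa, blue"]

def find_items(items, word):
    """Return every item whose text contains `word`, ignoring case."""
    needle = word.lower()
    # str(x) lets integers and floats take part in the substring test
    return [x for x in items if needle in str(x).lower()]

matches = find_items(products, "TAPA")
print(matches)
```

The helper name `find_items` is only illustrative. If the search should be case-sensitive, dropping the two `.lower()` calls should be enough.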

You can use the built-in functions of Pandas to find partial string matches and generate lists:
new = df1['A'][df1['A'].astype(str).str.contains('00320')].tolist()
An advantage of pandas str.contains() is that the use of regex is possible.

Related

How to filter a certain type of python list

I have a list of strings. Each string has the same length/number of characters in the format
xyzw01.ext or xyzv02.ext, etc.
For example
list 1: ['ABCJ01.ext','CDEJ02.ext','ADEJ01.ext','CDEJ01.ext','ABCJ02.ext','CDEJ03.ext']
list 2: ['ABCJ01.ext','ADEJ01.ext','CDEJ01.ext','RPNJ01.ext','PLEJ01.ext']
I would like from these lists to build new lists with only the strings with highest number.
So from list 1 I would like to get
['ADEJ01.ext','ABCJ02.ext','CDEJ03.ext']
while for list 2 I would like to get the same list since all numbers are 01.
Is there a "simple" way of achieving this?
You can use defaultdict and max
from collections import defaultdict
def fun(lst):
    res = defaultdict(list)
    for x in lst:
        res[x[:4]].append(x)
    return [max(res[x], key=lambda x: x[4:6]) for x in res]
lst = ['ABCJ01.ext','CDEJ02.ext','ADEJ01.ext','CDEJ01.ext','ABCJ02.ext','CDEJ03.ext']
lst2 = ['ABCJ01.ext','ADEJ01.ext','CDEJ01.ext','RPNJ01.ext','PLEJ01.ext']
print(fun(lst))
print(fun(lst2))
Output:
['ABCJ02.ext', 'CDEJ03.ext', 'ADEJ01.ext']
['ABCJ01.ext', 'ADEJ01.ext', 'CDEJ01.ext', 'RPNJ01.ext', 'PLEJ01.ext']
The easiest way is probably to use an intermediate data structure, like a dict - sort the list items into buckets based on the first part of their names, and then take the maximum number for each bucket. We can just use the built-in max() without a key, since as-given lexicographic sorting works to find the largest. If that's not sufficient, you could use more regex to take the number out of the item and use it as the key instead.
import re
def filter_list(lst):
    prefixes = {}
    for item in lst:
        # use regex to isolate the non-numeric characters at the start of the string
        prefix = re.match(r'^([^0-9]*)', item).group(1)
        # make a bucket based on each prefix, and put the item in it
        prefixes.setdefault(prefix, [])
        prefixes[prefix].append(item)
    # make a list comprehension taking the maximum item from each bucket
    return [max(value) for value in prefixes.values()]
>>> a = ['ABCJ01.ext','CDEJ02.ext','ADEJ01.ext','CDEJ01.ext','ABCJ02.ext','CDEJ03.ext']
>>> b = ['ABCJ01.ext','ADEJ01.ext','CDEJ01.ext','RPNJ01.ext','PLEJ01.ext']
>>> filter_list(a)
['ABCJ02.ext', 'CDEJ03.ext', 'ADEJ01.ext']
>>> filter_list(b)
['ABCJ01.ext', 'ADEJ01.ext', 'CDEJ01.ext', 'RPNJ01.ext', 'PLEJ01.ext']
In python 3.7+, this should preserve the order of list from the first occurrence of each prefix (i.e. CDEJ03.ext will precede ADEJ01.ext in the output because CDEJ02.ext precedes it in the input).
To order the output by where each prefix last appears in the original list instead, plain reassignment is not enough, because assigning to an existing dict key keeps its original position. You'd want to remove and re-insert the key instead of using .setdefault(), perhaps with a pattern like prefixes[prefix] = prefixes.pop(prefix, []).
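The suggestion above to pull the number out of the item and use it as the key matters once the numbers differ in width, since as strings '9' sorts above '10'. A minimal sketch, assuming every name is a non-numeric prefix followed by digits; the function name `filter_highest` and the decision to skip items without a number are assumptions, not part of the original answer:

```python
import re

def filter_highest(lst):
    """Keep, for each non-numeric prefix, the item with the largest number."""
    best = {}  # prefix -> (number, item)
    for item in lst:
        # split e.g. "ABCJ02.ext" into the prefix "ABCJ" and the number 2
        m = re.match(r"([^0-9]*)(\d+)", item)
        if m is None:
            continue  # assumption: items without a number are skipped
        prefix, number = m.group(1), int(m.group(2))
        if prefix not in best or number > best[prefix][0]:
            best[prefix] = (number, item)
    return [item for _, item in best.values()]

print(filter_highest(['ABCJ9.ext', 'ABCJ10.ext', 'CDEJ02.ext', 'CDEJ01.ext']))
```

Because the numbers are compared as integers, ABCJ10.ext wins over ABCJ9.ext here, which a plain string max() would get wrong.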

sort a list based on integer returns wrong sort

I am trying to sort a list in which each element contains an integer and a string, like the one in the example.
I used sort() and split(), but I always get a different order from the one I expect.
def takeSecond(elem):
    return elem.split("|")[2]
list = ['|val1: 0|0','|val: 0|80','|val1 0|140','|val1: 0|20','|val1: 0|90']
list.sort(key=takeSecond)
print(list)
that returns:
['|val1: 0|90','|val: 0|80','|val1: 0|20','|val1: 0|0','|val1 0|140']
and I expect to get this:
['|val1: 0|140','|val: 0|90','|val1: 0|80','|val1: 20|0','|val1 0|0']
Where is my mistake in here?
Try this:
l = ['|val1: 0|0','|val: 0|80','|val1 0|140','|val1: 0|20','|val1: 0|90']
l.sort(key=lambda x:int(x.rsplit('|')[-1]), reverse=True)
This will sort your list the way you need. The resulting output is:
In [18]: l
Out[18]: ['|val1 0|140', '|val1: 0|90', '|val: 0|80', '|val1: 0|20', '|val1: 0|0']
In addition, note:
Do not use list as a variable name. list is a built-in name in Python, and assigning to it shadows the built-in.
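The reason the original key gave the wrong order is that it returned strings, and strings are compared character by character, so '90' sorts above '140'. A small sketch contrasting the two keys on the data from the question; the names `rows` and `last_field_as_int` are illustrative only:

```python
rows = ['|val1: 0|0', '|val: 0|80', '|val1 0|140', '|val1: 0|20', '|val1: 0|90']

def last_field_as_int(elem):
    """Return the text after the last '|' converted to an integer."""
    return int(elem.rsplit('|', 1)[-1])

# String keys compare character by character, so '90' sorts above '140'.
as_text = sorted(rows, key=lambda e: e.rsplit('|', 1)[-1], reverse=True)
# Integer keys compare by value.
as_numbers = sorted(rows, key=last_field_as_int, reverse=True)

print(as_text[0])     # the row ending in 90
print(as_numbers[0])  # the row ending in 140
```

sorted() is used here instead of list.sort() only so that both orderings can be kept side by side; it returns a new list and leaves `rows` unchanged.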

Python: Print entire line of string match and not cut off after the period

See bottom for the solution I came up with.
Hopefully this is an easy question for you guys. I am trying to match a string against a list and print just the matched string. I was successful using re, but it cuts off the rest of the string after the period. The span reported by re is 0,10, whereas the output without re spans 0,14, so the match drops the information after the period. I would like to learn how to tell it to print the entire span, or learn another way to match a string variable against a list and print exactly that string. My original attempt printed anything with TESTPR in it, 3 items in total. Of the two I do not want printed, one has a 1 in front and the other has an additional R at the end. Here is my current match code:
#OLD See below
for element in catalog:
    z = re.match("((TESTPRR )\w+)", element)
    if z:
        print((z.group()))
Output: TESTPR 105
It should show:
Wanted output: TESTPT 105.465
It will go up to 3 decimal places after the period and no more. I am currently taking a Python class to learn Python and love it so far, but this one has me stumped as I am just now learning about re and matching by reading as we have not gotten to that yet in class.
I am open to learning a different way to search for and match a string and print just that string. For my first attempt that prints 3 results was this:
catalog = [ long list pulled from API then code here to make it a nice column]
prod = 'TESTPR'
print ([s for s in catalog if prod in s])
When I add a space at the end of prod, I can get rid of the match with the extra character at the end, but I cannot add a space to do the same for the match that has an extra character at the front. This applies to the code above, not to the re match code. Thanks!
Answer below!
Since you are interested in learning about ways to match strings and solve your problem: try fuzzywuzzy.
In your case you could try:
from fuzzywuzzy import process
catalog = [long list pulled from API then code here to make it a nice column]
prod = "TESTPR"
hit = process.extractOne(prod, catalog, score_cutoff = 75) #you can adjust this to suit how close the match should be
print(hit[0]) #hit will be sth like ("TESTPT 105.465", 75)
Output: TESTPT 105.465
For information on different ways of using fuzzywuzzy, check out this link.
You can use different matching methods, such as:
fuzz.partial_ratio
fuzz.ratio
fuzz.token_sort_ratio
fuzz.token_set_ratio
For these, use from fuzzywuzzy import fuzz.
I kept at it with re.match and got the regex right, so the entire match prints and the numbers after the period are no longer cut off.
My original match, as you can see above, was re.match("((TESTPRR )\w+)", element). Some of the parentheses were unneeded, and I needed to add a few more expressions; now it prints the correct match. See above for the old code and below for the new code that works.
# New code, replaced w+ with w*\d*[.,]?\d*$
for element in catalog:
    z = re.match(r"STRING\w*\d*[.,]?\d*$", element)
    if z:
        print(z.group())
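The original pattern stopped at the period because \w matches only letters, digits, and the underscore. One possible way to match a single product exactly, sketched under the assumption that entries look like the product name, one space, and a number with up to three decimals; the catalog values below are hypothetical, modelled on the description in the question:

```python
import re

# Hypothetical catalog entries modelled on the description in the question.
catalog = ["1TESTPR 200.1", "TESTPR 105.465", "TESTPRR 300.25"]
prod = "TESTPR"

# product name, one space, digits, then optionally a period and 1-3 decimals
pattern = re.compile(re.escape(prod) + r" \d+(?:\.\d{1,3})?")

# fullmatch requires the whole string to fit the pattern, so entries with an
# extra character in front of or behind the product name are rejected.
hits = [entry for entry in catalog if pattern.fullmatch(entry)]
print(hits)
```

re.escape() is there in case a product name ever contains characters that are special in a regex; for a plain name like the one here it changes nothing.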

finding elements from list with different format of strings

I have a large list with elements as:
#1
#10
(on
)
0.0574
122-124
122A
Cat
Dog
elephant
elephant12
elephant-1
I want to search and be able to find only the following:
Cat
Dog
elephant
elephant12
elephant-1
i.e. elements which have an English alphabet at the beginning.
Use list comprehension:
import string
result = [item for item in my_list if item[0] in string.ascii_letters]
As @Jon commented, checking whether a character is a letter can simply be:
result = [item for item in my_list if item[0].isalpha()]
The above works when all items are non-empty strings and you want the items with a leading letter. Change the if condition as needed, or write a function if it gets too complex.
If you are looking for a memory-optimized version, consider a generator.
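A rough sketch of the generator idea, assuming the items are strings (possibly empty). Note that isalpha() also accepts non-English letters such as 'é', so this version tests against string.ascii_letters instead. The sample list and the helper name `starts_with_ascii_letter` are illustrative:

```python
import string

# Hypothetical sample based on the list in the question, plus an empty string.
my_list = ["#1", "#10", "(on", ")", "0.0574", "122-124", "122A",
           "Cat", "Dog", "elephant", "elephant12", "elephant-1", ""]

def starts_with_ascii_letter(item):
    """True if the first character is one of a-z or A-Z."""
    first = item[:1]  # '' for an empty string, so no IndexError is raised
    # the explicit emptiness check matters: '' in "abc" is True in Python
    return first != "" and first in string.ascii_letters

# A generator expression yields matches one at a time instead of building a list.
matches = (item for item in my_list if starts_with_ascii_letter(item))

for item in matches:
    print(item)
```

A generator can only be consumed once; if the results are needed again, building a list with the same condition is the simpler choice.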
You could work with a list of animals.
For example:
animals = ["cat", "dog", "elephant", "fish", "bird", "snake"]
Then search for any of those strings in your input.
I know there are better methods, but I would do it with:
def search(text):
    result = []  # animal names found in text
    for item in animals:
        if item in text:
            result.append(item)
    return result
You would then add case-insensitive matching.
If you want the number attached to the animal name (as in elephant12), you need to append the input item instead of the animal name. In that case, iterate over your input items.
Of course, your list could grow to the size of a database if needed, for example to search for every existing word.

Need help working with lists within lists

I'm taking a programming class and have our first assignment. I understand how it's supposed to work, but apparently I haven't hit upon the correct terms to search to get help (and the book is less than useless).
The assignment is to take a provided data set (names and numbers) and perform some manipulation and computation with it.
I'm able to get the names into a list, and know the general format of what commands I'm giving, but the specifics are evading me. I know that you refer to the numbers as names[0][1], names[1][1], etc, but not how to refer to just that record that is being changed. For example, we have to have the program check if a name begins with a letter that is Q or later; if it does, we double the number associated with that name.
This is what I have so far, with ??? indicating where I know something goes, but not sure what it's called to search for it.
It's homework, so I'm not really looking for answers, but guidance to figure out the right terms to search for my answers. I already found some stuff on the site (like the statistics functions), but just can't find everything the book doesn't even mention.
names = [("Jack",456),("Kayden",355),("Randy",765),("Lisa",635),("Devin",358),("LaWanda",452),("William",308),("Patrcia",256)]
length = len(names)
count = 0
while True
count < length:
if ??? > "Q" # checks if first letter of name is greater than Q
??? # doubles number associated with name
count += 1
print(names) # self-check
numberNames = names # creates new list
import statistics
mean = statistics.mean(???)
median = statistics.median(???)
print("Mean value: {0:.2f}".format(mean))
alphaNames = sorted(numberNames) # sorts names list by name and creates new list
print(alphaNames)
First of all, you need to iterate over your names list. To do so, use a for loop:
for person in names:
    print(person)
But names are a list of tuples so you will need to get the person name by accessing the first item of the tuple. You do this just like you do with lists
name = person[0]
score = person[1]
Finally, to get the ASCII code of a character, use the ord() function. That will help you find out whether a name starts with Q or a later letter.
print(ord('A'))
print(ord('Q'))
print(ord('R'))
This should be enough information to get you started.
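As an illustration of how ord() can drive the comparison, here is a small sketch. The function name `starts_at_or_after` is hypothetical, and "Quinn" is an extra made-up name added to show the boundary case:

```python
def starts_at_or_after(name, letter="Q"):
    """True if the first letter of `name` is `letter` or later in the alphabet."""
    # upper() makes the comparison ignore case; ord() gives the code point
    return ord(name[0].upper()) >= ord(letter.upper())

for name in ["Jack", "Randy", "William", "Quinn"]:
    print(name, starts_at_or_after(name))
```

Single characters can also be compared directly (for example name[0] >= "Q"), since Python compares strings by code point; ord() just makes the numbers visible.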
I see a few parts to your question, so I'll try to separate them out in my response.
check if first letter of name is greater than Q
Hopefully this will help you with the syntax here. Like list, str also supports element access by index with the [] syntax.
$ names = [("Jack",456),("Kayden",355)]
$ names[0]
('Jack', 456)
$ names[0][0]
'Jack'
$ names[0][0][0]
'J'
$ names[0][0][0] < 'Q'
True
$ names[0][0][0] > 'Q'
False
double number associated with name
$ names[0][1]
456
$ names[0][1] * 2
912
"how to refer to just that record that is being changed"
We are trying to update the value associated with the name.
In theme with my previous code examples - that is, we want to update the value at index 1 of the tuple stored at index 0 in the list called names
However, tuples are immutable so we have to be a little tricky if we want to use the data structure you're using.
$ names = [("Jack",456), ("Kayden", 355)]
$ names[0]
('Jack', 456)
$ tpl = names[0]
$ tpl = (tpl[0], tpl[1] * 2)
$ tpl
('Jack', 912)
$ names[0] = tpl
$ names
[('Jack', 912), ('Kayden', 355)]
Do this for all tuples in the list
We need to do this for the whole list; it looks like you were onto that with your while loop. Your counter variable for indexing the list is named count, so use it to index a specific tuple: names[count][0] for the name at position count, or names[count][1] for the number at position count.
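One possible shape for that loop, sketched with a for loop and enumerate() rather than the while loop and count variable from the question (a swapped-in alternative, not the asker's approach). It uses a shortened copy of the data:

```python
names = [("Jack", 456), ("Kayden", 355), ("Randy", 765), ("William", 308)]

# enumerate() supplies the index needed to store the rebuilt tuple
for i, (name, number) in enumerate(names):
    if name[0] >= "Q":
        # tuples cannot be changed in place, so build a new one
        names[i] = (name, number * 2)

print(names)
```

Replacing an element by index while iterating is safe here because the list never changes length.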
using statistics for calculating mean and median
I recommend looking at the documentation for a module when you want to know how to use it. Here is an example for mean:
mean(data)
Return the sample arithmetic mean of data.
$ mean([1, 2, 3, 4, 4])
2.8
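Applied to the data structure in the question, the functions need a list of plain numbers rather than the (name, number) tuples. A minimal sketch using a shortened copy of the data; the variable name `numbers` is illustrative:

```python
import statistics

names = [("Jack", 456), ("Kayden", 355), ("Randy", 765), ("Lisa", 635)]

# statistics functions expect plain numbers, so pull out the second field
numbers = [number for _, number in names]

mean = statistics.mean(numbers)
median = statistics.median(numbers)
print("Mean value: {0:.2f}".format(mean))
print("Median value: {0:.2f}".format(median))
```

With an even number of values, statistics.median() returns the average of the two middle values.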
Hopefully these examples help you with the syntax for continuing your assignment, although this could turn into a long discussion.
The title of your post is "Need help working with lists within lists" ... well, your code example uses a list of tuples
$ names = [("Jack",456),("Kayden",355)]
$ type(names)
<class 'list'>
$ type(names[0])
<class 'tuple'>
$ names = [["Jack",456], ["Kayden", 355]]
$ type(names)
<class 'list'>
$ type(names[0])
<class 'list'>
Notice the difference between [] and ().
If you are free to structure the data however you like, then I would recommend using a dict (read: dictionary).
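To give an idea of what the dict version might look like, here is a short sketch. It assumes every name is unique, since dict keys cannot repeat; the variable name `scores` is illustrative:

```python
# Name -> number; assumes every name is unique (dict keys cannot repeat).
scores = {"Jack": 456, "Kayden": 355, "Randy": 765}

# A value can be updated in place through its key, unlike a tuple field.
scores["Randy"] = scores["Randy"] * 2

for name, number in scores.items():
    print(name, number)
```

The trade-off is that a record is looked up by name instead of by position, so there is no names[0][1] style of indexing any more.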
I know that you refer to the numbers as names[0][1], names[1][1], etc, but
not how to refer to just that record that is being changed. For
example, we have to have the program check if a name begins with a
letter that is Q or later; if it does, we double the number associated
with that name.
It's not entirely clear what else you have to do in this assignment, but regarding your concerns above, to reference the ith "record that is being changed" in your names list, simply use names[i]. So, if you want to access the first record in names, simply use names[0], since indexing in Python begins at zero.
Since each element in your list is a tuple (which can also be indexed), using constructs like names[0][0] and names[0][1] are ways to index the values within the tuple, as you pointed out.
I'm unsure why you're using while True if you're trying to iterate through each name and check whether it begins with "Q". It seems like a for loop would be better, unless your class hasn't gotten there yet.
As for checking whether the first letter is 'Q', str (string) objects are indexed similarly to lists and tuples. To access the first letter in a string, for example, see the following:
>>> my_string = 'Hello'
>>> my_string[0]
'H'
If you give more information, we can help guide you with the statistics piece, as well. But I would first suggest you get some background around mean and median (if you're unfamiliar).
