how to extract lists having same element value? - python-3.x

I have a list of lists like this: data=[["date1","a",14,15],["date1","b",14,15],["date1","c",14,15],["date2","a",14,15],["date2","b",14,15],["date2","c",14,15],["date3","a",14,15],["date3","b",14,15],["date3","c",14,15]]. I want to group the inner lists that share the same second element (index 1). I tried this code, but I got 9 lists when I only need 3.
data=[["date1","a",14,15],["date1","b",14,15],["date1","c",14,15],["date2","a",14,15],["date2","b",14,15],["date2","c",14,15],["date3","a",14,15],["date3","b",14,15],["date3","c",14,15]]
for i in data:
    a=[]
    for j in data:
        if (i[1]==j[1]):
            a.append(j)
    print(a)
I expected to get:
["date1","a",14,15],["date2","a",14,15],["date3","a",14,15]
["date1","b",14,15],["date2","b",14,15],["date3","b",14,15]
["date1","c",14,15],["date2","c",14,15],["date3","c",14,15]

data=[["date1","a",14,15],["date1","b",14,15],["date1","c",14,15],["date2","a",14,15],["date2","b",14,15],["date2","c",14,15],["date3","a",14,15],["date3","b",14,15],["date3","c",14,15]]
from itertools import groupby
from operator import itemgetter
print(
[list(v) for k,v in groupby(sorted(data, key=itemgetter(1)), key=itemgetter(1))]
)
groupby only groups consecutive items that share a key, so the data has to be sorted by that key first.
Depending on your use case, you may not need to turn each group into a list. I did it here so the output shows the actual rows instead of <itertools._grouper... >
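If sorting is not wanted, a dictionary of lists is another way to do the same grouping. This is only a sketch, not the answer's method: it uses collections.defaultdict on a shortened copy of the data, and the names groups and result are made up for illustration.

```python
from collections import defaultdict

# Shortened sample of the question's data (two dates, two letters).
data = [["date1", "a", 14, 15], ["date1", "b", 14, 15],
        ["date2", "a", 14, 15], ["date2", "b", 14, 15]]

groups = defaultdict(list)
for row in data:
    # row[1] is the second element, the value to group on.
    groups[row[1]].append(row)

# One list of rows per distinct second element, in first-seen order.
result = list(groups.values())
print(result)
```

Each row is visited once, and the groups come out in the order their key first appears (Python 3.7+).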

Related

How to filter a certain type of python list

I have a list of strings. Each string has the same length/number of characters in the format
xyzw01.ext or xyzv02.ext, etc.
For example
list 1: ['ABCJ01.ext','CDEJ02.ext','ADEJ01.ext','CDEJ01.ext','ABCJ02.ext','CDEJ03.ext']
list 2: ['ABCJ01.ext','ADEJ01.ext','CDEJ01.ext','RPNJ01.ext','PLEJ01.ext']
I would like from these lists to build new lists with only the strings with highest number.
So from list 1 I would like to get
['ADEJ01.ext','ABCJ02.ext','CDEJ03.ext']
while for list 2 I would like to get the same list since all numbers are 01.
Is there a "simple" way of achieving this?
You can use defaultdict and max
from collections import defaultdict
def fun(lst):
    res = defaultdict(list)
    for x in lst:
        res[x[:4]].append(x)
    return [max(res[x], key=lambda x: x[4:6]) for x in res]
lst = ['ABCJ01.ext','CDEJ02.ext','ADEJ01.ext','CDEJ01.ext','ABCJ02.ext','CDEJ03.ext']
lst2 = ['ABCJ01.ext','ADEJ01.ext','CDEJ01.ext','RPNJ01.ext','PLEJ01.ext']
print(fun(lst))
print(fun(lst2))
Output:
['ABCJ02.ext', 'CDEJ03.ext', 'ADEJ01.ext']
['ABCJ01.ext', 'ADEJ01.ext', 'CDEJ01.ext', 'RPNJ01.ext', 'PLEJ01.ext']
The easiest way is probably to use an intermediate data structure, like a dict: sort the list items into buckets based on the first part of their names, and then take the maximum item from each bucket. We can just use the built-in max() without a key, since plain lexicographic comparison finds the largest here (the numbers are zero-padded to the same width). If that's not sufficient, you could use more regex to take the number out of the item and use it as the key instead.
import re
def filter_list(lst):
    prefixes = {}
    for item in lst:
        # use regex to isolate the non-numeric characters at the start of the string
        prefix = re.match(r'^([^0-9]*)', item).group(1)
        # make a bucket based on each prefix, and put the item in it
        prefixes.setdefault(prefix, [])
        prefixes[prefix].append(item)
    # make a list comprehension taking the maximum item from each bucket
    return [max(value) for value in prefixes.values()]
>>> a = ['ABCJ01.ext','CDEJ02.ext','ADEJ01.ext','CDEJ01.ext','ABCJ02.ext','CDEJ03.ext']
>>> b = ['ABCJ01.ext','ADEJ01.ext','CDEJ01.ext','RPNJ01.ext','PLEJ01.ext']
>>> filter_list(a)
['ABCJ02.ext', 'CDEJ03.ext', 'ADEJ01.ext']
>>> filter_list(b)
['ABCJ01.ext', 'ADEJ01.ext', 'CDEJ01.ext', 'RPNJ01.ext', 'PLEJ01.ext']
In Python 3.7+, this preserves the order in which each prefix first occurs in the list (i.e. CDEJ03.ext will precede ADEJ01.ext in the output because CDEJ02.ext precedes it in the input).
To get the output in the order the selected items appear in the original list instead, reassigning the key is not enough, because updating the value of an existing key does not move it within the dict. One option is to sort the result by each item's position in the input, for example sorted(result, key=lst.index).
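A sketch that combines both ideas from this answer: it compares the numbers as integers rather than as text, and returns the winners in the order they appear in the input. The function name highest_per_prefix is made up for illustration, and it assumes every item starts with a non-numeric prefix followed by at least one digit.

```python
import re
from collections import defaultdict

def highest_per_prefix(items):
    """For each non-numeric prefix, keep the item with the largest number."""
    buckets = defaultdict(list)
    for item in items:
        # Assumption: every item looks like <prefix><digits>...
        match = re.match(r"([^0-9]*)(\d+)", item)
        buckets[match.group(1)].append((int(match.group(2)), item))
    # max() on (number, item) tuples compares the integer first.
    winners = [max(bucket)[1] for bucket in buckets.values()]
    # Order the winners by where they appear in the original list.
    return sorted(winners, key=items.index)

a = ['ABCJ01.ext', 'CDEJ02.ext', 'ADEJ01.ext', 'CDEJ01.ext', 'ABCJ02.ext', 'CDEJ03.ext']
print(highest_per_prefix(a))
# ['ADEJ01.ext', 'ABCJ02.ext', 'CDEJ03.ext']
```

Note that items.index is a linear search, so for very long lists a precomputed position dict would be a better sort key.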

sort a list based on integer returns wrong sort

I am trying to sort a list in which each item contains a string and an integer, like the one in the example.
I used sort() and split(), but the result is never in the order I expect.
def takeSecond(elem):
    return elem.split("|")[2]
list = ['|val1: 0|0','|val: 0|80','|val1 0|140','|val1: 0|20','|val1: 0|90']
list.sort(key=takeSecond)
print(list)
that returns:
['|val1: 0|90','|val: 0|80','|val1: 0|20','|val1: 0|0','|val1 0|140']
and I expect to get this:
['|val1: 0|140','|val: 0|90','|val1: 0|80','|val1: 20|0','|val1 0|0']
Where is my mistake in here?
Try this:
l = ['|val1: 0|0','|val: 0|80','|val1 0|140','|val1: 0|20','|val1: 0|90']
l.sort(key=lambda x:int(x.rsplit('|')[-1]), reverse=True)
This sorts the list by the number after the last "|", converted to an integer, in descending order. The output is:
In [18]: l
Out[18]: ['|val1 0|140', '|val1: 0|90', '|val: 0|80', '|val1: 0|20', '|val1: 0|0']
In addition, note that:
Do not use list as a variable name. list is a built-in name in Python, and assigning to it shadows the built-in for the rest of that scope.
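The original key function returns a string, so the items are compared character by character rather than by value. A minimal illustration, using only the numeric parts of the question's items:

```python
values = ['0', '80', '140', '20', '90']

# Compared as text: '140' sorts before '20' because '1' < '2'.
as_text = sorted(values, reverse=True)
print(as_text)      # ['90', '80', '20', '140', '0']

# Compared as integers: the order the question asks for.
as_numbers = sorted(values, key=int, reverse=True)
print(as_numbers)   # ['140', '90', '80', '20', '0']
```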

Printing a list method return None

I am an absolute beginner learning Python to tackle some biology problems, and I came across lists and their various methods. Basically, when I print my variable, I get None.
Example, trying to print a sorted list assigned to a variable
list1=[1,3,4,2]
sorted=list1.sort()
print(sorted)
I get None. Shouldn't this give me [1,2,3,4]?
However, when printing the original list variable (list1), it gives me the sorted list fine.
Because list.sort() sorts the list in place and always returns None. What you should do is:
list1=[1,3,4,2]
list1.sort()
print(list1)
Or
list1=[1,3,4,2]
list2 = sorted(list1)
print(list2)
You can sort lists in two ways. list.sort() sorts the list in place and returns None; new_list = sorted(list) returns a new sorted list and leaves the original unmodified.
So, you can do this:
list1=[1,3,4,2]
sorted_list=sorted(list1)
print(sorted_list)
Or you can do this:
list1=[1,3,4,2]
list1.sort()
print(list1)

List, tuples or dictionary, differences and usage, How can I store info in python

I'm very new to Python (I usually write PHP). I want to understand how to store information in an associative array, and it would be wonderful if you could explain the difference between "tuples", "arrays", "dictionaries" and "lists" (I tried reading different sources but I'm still not getting it).
So This is my code:
#!/usr/bin/python3.4
import csv
import string
nidless_keys = dict()
nidless_keys = ['test_string1','test_string2'] # the strings to be
                                               # searched for in linesreader
data = {'type':[],'id':[]} # here I want to store my information
with open('path/to/csv/file.csv',newline="") as csvfile:
    linesreader = csv.reader(csvfile,delimiter=',',quotechar="|")
    for row in linesreader: # every line in this csv has a url like
                            # www.test.com/?test_string1&id=123456
        current_row_string = str(row)
        for needle in nidless_keys:
            current_needle = str(needle)
            if current_needle in current_row_string:
                data[current_needle][current_row_string[-8:]] += 1 # I also
                # need to count, for every id, how many rows there are.
In conclusion:
my_data_stored = [current_needle][current_row_string[-8]]
current_row_string[-8:] is the last 8 characters of the URL, which hold the ID.
So the array should look like this at the end of the script:
test_string1 = 123456 = 20
= 256468 = 15
test_string2 = 123155 = 10
Edit 1:
Which type do I need here to store the information?
Can you tell me how to resolve this script?
It seems you want to count how many times an ID in combination with a test string occurs.
There can be multiple ID/count combinations associated with every test string.
This suggests that you should use a dictionary indexed by the test strings to store the results. In that dictionary I would suggest to store collections.Counter objects.
This way, you would have to add a special case when a key in the results dictionary isn't found to add an empty Counter. This is a common problem, so there is a specialized form of dictionary in the collections module called defaultdict.
import collections
import csv
# Using a tuple for the keys so it cannot be accidentally modified
keys = ('test_string1', 'test_string2')
result = collections.defaultdict(collections.Counter)
with open('path/to/csv/file.csv',newline="") as csvfile:
    linesreader = csv.reader(csvfile,delimiter=',',quotechar="|")
    for row in linesreader:
        # csv.reader yields each row as a list of fields; join them back
        # into one string so the substring test and the slice below work.
        line = ','.join(row)
        for key in keys:
            if key in line:
                id = line[-6:] # ID's are six digits in your example.
                # The first index is into the dict, the second into the Counter.
                result[key][id] += 1
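To try the counting logic without a file on disk, here is a self-contained sketch. The sample rows are made up and fed through io.StringIO in place of the real CSV file, and each row is joined back into a single string before searching (an assumption about how the rows should be treated).

```python
import collections
import csv
import io

# Hypothetical sample rows standing in for the real CSV file.
sample = io.StringIO(
    "www.test.com/?test_string1&id=123456\n"
    "www.test.com/?test_string1&id=123456\n"
    "www.test.com/?test_string2&id=123155\n"
)

keys = ('test_string1', 'test_string2')
result = collections.defaultdict(collections.Counter)

for row in csv.reader(sample, delimiter=',', quotechar='|'):
    line = ','.join(row)  # treat the whole row as one string
    for key in keys:
        if key in line:
            # Assumes the ID is the last six characters of the line.
            result[key][line[-6:]] += 1

# Print one line per (test string, ID) combination with its count.
for key, counts in result.items():
    for id_, count in counts.items():
        print(key, id_, count)
```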
There is an even easier way, by using regular expressions.
Since you seem to treat every row in a CSV file as a string, there is little need to use the CSV reader, so I'll just read the whole file as text.
import re
with open('path/to/csv/file.csv') as datafile:
    text = datafile.read()
pattern = r'\?(.*)&id=(\d+)'
The pattern is a regular expression. This is a large topic in and of itself, so I'll only cover briefly what it does. (You might also want to check out the relevant HOWTO.) At first glance it looks like complete gibberish, but it is actually a small language of its own.
It looks for two things in a line: anything between ? and &id=, and a sequence of digits after &id=.
I'll be using IPython to give an example.
(If you don't know it, check out IPython. It is great for trying things and see if they work.)
In [1]: import re
In [2]: pattern = r'\?(.*)&id=(\d+)'
In [3]: text = """www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=123456
....: www.test.com/?test_string1&id=234567
....: www.test.com/?foo&id=234567
....: www.test.com/?foo&id=123456
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234
....: www.test.com/?foo&id=1234"""
The text variable points to the string which is a mock-up for the contents of your CSV file.
I am assuming that:
every URL is on its own line
ID's are a sequence of digits.
If these assumptions are wrong, this won't work.
Using findall to extract every match of the pattern from the text.
In [4]: re.findall(pattern, text)
Out[4]:
[('test_string1', '123456'),
('test_string1', '123456'),
('test_string1', '234567'),
('foo', '234567'),
('foo', '123456'),
('foo', '1234'),
('foo', '1234'),
('foo', '1234')]
The findall function returns a list of 2-tuples (that is key, ID pairs). Now we just need to count those.
In [5]: import collections
In [6]: result = collections.defaultdict(collections.Counter)
In [7]: intermediate = re.findall(pattern, text)
Now we fill the result dict from the list of matches that is the intermediate result.
In [8]: for key, id in intermediate:
....:     result[key][id] += 1
....:
In [9]: print(result)
defaultdict(<class 'collections.Counter'>, {'foo': Counter({'1234': 3, '123456': 1, '234567': 1}), 'test_string1': Counter({'123456': 2, '234567': 1})})
So the complete code would be:
import collections
import re
with open('path/to/csv/file.csv') as datafile:
    text = datafile.read()
result = collections.defaultdict(collections.Counter)
pattern = r'\?(.*)&id=(\d+)'
intermediate = re.findall(pattern, text)
for key, id in intermediate:
    result[key][id] += 1
This approach has two advantages.
You don't have to know the keys in advance.
ID's are not limited to six digits.
A brief summary of the python data types you mentioned:
A dictionary is an associative array, aka hashtable.
A list is a sequence of values.
An array is essentially the same as a list, but limited to basic datatypes. My impression is that arrays only exist for performance reasons; I don't think I've ever used one. If performance is that critical to you, you probably don't want to use Python in the first place.
A tuple is a fixed-length sequence of values (whereas lists and arrays can grow).
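The four types summarized above can be compared side by side in a short sketch. The variable names are made up for illustration; array here means the standard library's array.array.

```python
from array import array

# dict: maps keys to values (the "associative array")
stock = {'potato': 3}
stock['tomato'] = 5

# list: ordered, growable, may mix types
items = [1, 'a', True]
items.append(2.5)

# tuple: ordered, fixed once created
point = (3, 4)
try:
    point[0] = 10
except TypeError:
    tuple_is_immutable = True  # item assignment is not allowed

# array: growable like a list, but restricted to one basic type
numbers = array('i', [1, 2, 3])
numbers.append(4)

print(stock, items, point, numbers.tolist())
```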
Let's take them one by one.
Lists:
A list is a very basic data structure, written much like arrays in other languages:
['a','b','c']
This is a list in Python, and it looks very similar to an array.
However, there is a big difference between how lists are used in Python and how arrays are usually used.
Lists are heterogeneous. This means that we can store different kinds of data in the same list, like:
ls = [1,2,'a','g',True]
As you can see, this list holds several kinds of data and is still a valid list.
One important thing about lists is that we access their items using zero-based indices. So we can write:
print(ls[0], ls[3])
output: 1 g
Dictionary:
This data structure is similar to a hash map. It stores (key, value) pairs. An empty dictionary looks like:
dc = {}
Now, to store key/value pairs, e.g. ('potato', 3) and ('tomato', 5), we can do:
dc['potato'] = 3
dc['tomato'] = 5
and the data is saved in the dictionary dc.
The important thing is that we can even store another data structure, like a list, as a value in a dictionary:
dc['list1'] = ls
where ls is the list defined above.
This shows the power of using a dictionary.
In your case, you have defined a dictionary like this:
data = {'type':[],'id':[]}
This means that your dictionary consists of only two keys, and each key corresponds to a list, which is empty for now.
Talking a bit about your script, the expression:
current_row_string[-8:]
doesn't do what you want. The index should have been -6 instead of -8, which would give you the id part of the current row.
This part is the id and should be stored in a variable, say:
id = current_row_string[-6:]
Further steps can be performed as shown in the answer given by Roland.

Most efficient way to compare two dictionaries in python

dic1 = {'memory':'4','cpu':'2','disk':{'total':'160','swap':'4','/':'26','/var':'7','/tmp':'2'}}
dic2 = {'memory':'8','cpu':'2','disk':{'total':'120','swap':'4','/':'26','/var':'7','/tmp':'2'}}
Please note that both dictionaries themselves contain another dictionary.
What is the most efficient way to compare each item without doing dic1==dic2?
I need to see the % change in the values, so the only option left seems to be iterating through each dictionary's items. Something like:
for key1 in dic1:
    for key2 in dic2:
        if not isinstance(dic1[key1],dict):
            #compare cpu & memory here
            if int(dic1[key1]) > int(dic2[key2]):
                pass
        else:
            #compare disk(internal dictionary here)
            pass
You can use itertools.zip_longest to zip the values and compare them in a list comprehension:
>>> from itertools import zip_longest
>>> ["do something" if isinstance(i,dict) and all(k<v for k,v in zip_longest(i.values(),j.values())) else "do something" for i,j in zip_longest(dic1.values(),dic2.values())]
Note that, depending on your needs, you can use another function instead of all here: you may be interested in any, or you may want to do some arithmetic on the values.
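Since the question asks for the % change of each value, a recursive walk over the keys may be easier to follow than zipping values. This is a sketch of a different technique, not the answer's method. The function name percent_change is made up, and it assumes both dictionaries have the same keys, that the values are numeric strings, and that no old value is zero.

```python
def percent_change(old, new):
    """Return a dict shaped like `old` holding the % change of each value."""
    changes = {}
    for key, old_value in old.items():
        new_value = new[key]  # assumes the key exists in both dicts
        if isinstance(old_value, dict):
            # Nested dictionary: compare it the same way, one level down.
            changes[key] = percent_change(old_value, new_value)
        else:
            old_num, new_num = float(old_value), float(new_value)
            changes[key] = (new_num - old_num) / old_num * 100
    return changes

dic1 = {'memory': '4', 'cpu': '2',
        'disk': {'total': '160', 'swap': '4', '/': '26', '/var': '7', '/tmp': '2'}}
dic2 = {'memory': '8', 'cpu': '2',
        'disk': {'total': '120', 'swap': '4', '/': '26', '/var': '7', '/tmp': '2'}}

result = percent_change(dic1, dic2)
print(result['memory'], result['disk']['total'])  # 100.0 -25.0
```

Matching by key instead of by position also avoids relying on both dictionaries listing their values in the same order.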
