Python: Symmetrical Difference Between List of Sets of Strings

I have a list that contains multiple sets of strings, and for each set I would like to count the strings that do not appear in any of the other sets (a kind of symmetric difference between each set and the rest).
For example, I have the following list:
targets = [{'B', 'C', 'A'}, {'E', 'C', 'D'}, {'F', 'E', 'D'}]
For the above, desired output is:
[2, 0, 1]
because in the first set, A and B are not found in any of the other sets, for the second set, there are no unique elements to the set, and for the third set, F is not found in any of the other sets.
I thought about approaching this backwards; finding the intersection of each set and subtracting the length of the intersection from the length of the list, but set.intersection(*) does not appear to work on strings, so I'm stuck:
set1 = {'A', 'B', 'C'}
set2 = {'C', 'D', 'E'}
set3 = {'D', 'E', 'F'}
targets = [set1, set2, set3]
>>> set.intersection(*targets)
set()

The issue you're having is that there are no strings shared by all three sets, so your intersection comes up empty. That's not a string issue; it would work the same with numbers or anything else you can put in a set.
The only way I see to do a global calculation over all the sets, then use that to find the number of unique values in each one is to first count all the values (using collections.Counter), then for each set, count the number of values that showed up only once in the global count.
from collections import Counter
def unique_count(sets):
    count = Counter()
    for s in sets:
        count.update(s)
    return [sum(count[x] == 1 for x in s) for s in sets]
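As a quick check, the same counting idea can be run against the list from the question. This is a minimal sketch: it builds the Counter in one step with itertools.chain.from_iterable instead of the explicit loop (an equivalent variation, not the answer's exact code), and the name unique_count_chained is made up for illustration.

```python
from collections import Counter
from itertools import chain

def unique_count_chained(sets):
    # Count how many sets each value appears in
    # (a set holds a given value at most once).
    count = Counter(chain.from_iterable(sets))
    # For each set, count the values whose global count is exactly 1.
    return [sum(count[x] == 1 for x in s) for s in sets]

targets = [{'B', 'C', 'A'}, {'E', 'C', 'D'}, {'F', 'E', 'D'}]
print(unique_count_chained(targets))  # [2, 0, 1]
```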

Try something like below:
Get symmetric difference with every set. Then intersect with the given input set.
def symVal(index,targets):
    bseSet = targets[index]
    symSet = bseSet
    for j in range(len(targets)):
        if index != j:
            symSet = symSet ^ targets[j]
    print(len(symSet & bseSet))
for i in range(len(targets)):
    symVal(i,targets)
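One caveat is worth checking before relying on this approach: chaining ^ over several sets keeps every element that occurs in an odd number of them, so an element present in all three sets survives the XOR and is counted as if it were unique. A small sketch with a made-up input shows the difference (sym_val_count is a hypothetical variant of symVal that returns the count instead of printing it):

```python
def sym_val_count(index, targets):
    # Same logic as symVal, but returns the count.
    base = targets[index]
    sym = base
    for j, other in enumerate(targets):
        if j != index:
            sym = sym ^ other
    return len(sym & base)

# Input from the question: no element is shared by all three sets.
original = [{'A', 'B', 'C'}, {'C', 'D', 'E'}, {'D', 'E', 'F'}]
print([sym_val_count(i, original) for i in range(len(original))])  # [2, 0, 1]

# Made-up input: 'X' is in every set, yet it is counted each time.
shared = [{'A', 'X'}, {'B', 'X'}, {'C', 'X'}]
print([sym_val_count(i, shared) for i in range(len(shared))])  # [2, 2, 2]
```

For the second input the intended answer would be [1, 1, 1], so the XOR approach is only safe when no element occurs in an odd number of sets greater than one.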

Your code example doesn't work because it's finding the intersection between all of the sets, which is empty (since no element occurs everywhere). You want to find the difference between each set and the union of all other sets. For example:
set1 = {'A', 'B', 'C'}
set2 = {'C', 'D', 'E'}
set3 = {'D', 'E', 'F'}
targets = [set1, set2, set3]
result = []
for set_element in targets:
    result.append(len(set_element.difference(set.union(*[x for x in targets if x is not set_element]))))
print(result)
(note that [x for x in targets if x is not set_element] is just the list of all the other sets)
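The same idea can be wrapped in a function. A sketch, assuming the hypothetical name unique_per_set and picking "the other sets" by position with enumerate; set().union(*others) also copes with a list that holds a single set, where set.union(*[]) would raise a TypeError.

```python
def unique_per_set(targets):
    result = []
    for i, current in enumerate(targets):
        # Every set except the one at position i.
        others = [s for j, s in enumerate(targets) if j != i]
        # Elements of the current set that appear in none of the others.
        result.append(len(current - set().union(*others)))
    return result

targets = [{'A', 'B', 'C'}, {'C', 'D', 'E'}, {'D', 'E', 'F'}]
print(unique_per_set(targets))  # [2, 0, 1]
```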

Related

find the lengths of all sublists containing common repeated element

I need to find all the sublists of a list that consist only of the element 'F' repeated consecutively, and get their lengths.
g = ['T','F','F','F','F','T','T','T','F','F','F','T']
In this case the list contains two sublists of repeated 'F':
['F','F','F','F'] at indices 1, 2, 3, 4, so the answer is 4
and
['F','F','F'] at indices 8, 9, 10, so the answer is 3
Note:
The list contains only two distinct elements, 'T' and 'F', and the operation is always performed for the element 'F'.
You can get the lengths of consecutive sequences with itertools.groupby:
from itertools import groupby
data = ['T','F','F','F','F','T','T','T','F','F','F','T']
# Consecutive sequences of "F".
# "groupby(data)" produces an iterator that calculates on-the-fly.
# The iterator returns consecutive keys and groups from the iterable "data".
seqs = [list(g) for k, g in groupby(data) if k == 'F']
print(seqs)
# [['F', 'F', 'F', 'F'], ['F', 'F', 'F']]
seq_lens = [len(k) for k in seqs]
print(seq_lens)
# [4, 3]
The maximum length of such consecutive sequences is also easy to get:
max_len_seq = len(max(seqs, key=len))
print(max_len_seq)
# 4
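If only the lengths are needed, the groups do not have to be turned into lists first. A small sketch along the same lines, counting each group as it is consumed; max(..., default=0) is used so that a list without any 'F' gives 0 instead of raising.

```python
from itertools import groupby

data = ['T', 'F', 'F', 'F', 'F', 'T', 'T', 'T', 'F', 'F', 'F', 'T']

# Count the items of each 'F' group without building intermediate lists.
run_lengths = [sum(1 for _ in group) for key, group in groupby(data) if key == 'F']
print(run_lengths)                  # [4, 3]
print(max(run_lengths, default=0))  # 4
```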
See itertools.groupby for more info:
class groupby:
# [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B
# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D
...
etc
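You can keep a running count of the repeated letter. Traverse the list and increase the count each time you find an 'F'; when you find a 'T', check the count first: if it is bigger than 1 there was a repeat, so print the count of the repetition and reset it. (The same pattern with a second counter works for 'T'.)
fcount = 0
for e in g:
    if e == "F":
        fcount += 1        # extend the current run of 'F'
    else:
        if fcount > 1:
            print(fcount)  # a run of repeated 'F' just ended
        fcount = 0
if fcount > 1:             # the list may end in the middle of a run
    print(fcount)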

Find cartesian product of the elements in a program generated dynamic "sub-list"

I have a program which produces and modifies a list of "n" elements/members, with n remaining constant throughout a particular run of the program. (The value of "n" might change in the next run.)
Each member of the list is a "sub-list"! These sub-lists are not only of variable length, but are also dynamic and might keep changing while the program runs.
So, eventually, at some given point, my list would look something like (assuming n=3):
[['1', '2'], ['a', 'b', 'c', 'd'], ['x', 'y', 'z']]
I want the output to be like the following:
['1ax', '1ay', '1az', '1bx', '1by', '1bz',
'1cx', '1cy', '1cz', '1dx', '1dy', '1dz',
'2ax', '2ay', '2az', '2bx', '2by', '2bz',
'2cx', '2cy', '2cz', '2dx', '2dy', '2dz']
i.e. a list with exactly (2 * 3 * 4) elements where each element is of length exactly 3 and has exactly 1 member from each of the "sub-lists".
Easiest is itertools.product:
from itertools import product
lst = [['1', '2'], ['a', 'b', 'c', 'd'], ['x', 'y', 'z']]
output = [''.join(p) for p in product(*lst)]
# OR
output = list(map(''.join, product(*lst)))
# ['1ax', '1ay', '1az', '1bx', '1by', '1bz',
# '1cx', '1cy', '1cz', '1dx', '1dy', '1dz',
# '2ax', '2ay', '2az', '2bx', '2by', '2bz',
# '2cx', '2cy', '2cz', '2dx', '2dy', '2dz']
A manual implementation specific to strings could look like this:
def prod(*pools):
    if pools:
        *rest, pool = pools
        for p in prod(*rest):
            for el in pool:
                yield p + el
    else:
        yield ""
list(prod(*lst))
# ['1ax', '1ay', '1az', '1bx', '1by', '1bz',
# '1cx', '1cy', '1cz', '1dx', '1dy', '1dz',
# '2ax', '2ay', '2az', '2bx', '2by', '2bz',
# '2cx', '2cy', '2cz', '2dx', '2dy', '2dz']
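Since the sub-lists keep changing, it can help to check the size of the result before building it. A sketch, assuming Python 3.8+ for math.prod: the number of combinations is the product of the sub-list lengths, and itertools.product can be consumed lazily when the full list is not needed.

```python
import math
from itertools import islice, product

lst = [['1', '2'], ['a', 'b', 'c', 'd'], ['x', 'y', 'z']]

# Number of combinations = product of the sub-list lengths.
expected_size = math.prod(len(sub) for sub in lst)
print(expected_size)  # 24

# Generator expression: combinations are joined only as they are requested.
lazy = (''.join(p) for p in product(*lst))
first_four = list(islice(lazy, 4))
print(first_four)  # ['1ax', '1ay', '1az', '1bx']
```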

Check if a list is in a custom order

I am trying to validate a list say:
X = ['a','c', 'c', 'b', 'd','d','d']
against a custom ordered list:
Y = ['a','b','d']
In this case X validated against Y should return True regardless of the extra elements and duplicates in it as long as it goes with the order in Y and contains at least two elements.
Case Examples:
X = ['a','b'] # Returns True
X = ['d','a', 'a', 'c','b'] # Returns False
X = ['c','a','b', 'b', 'c'] # Returns True
The most I can do right now is remove the duplicates and extra elements. I am not trying to sort them using the custom list. I just need to validate the order. What I have done, or at least tried, is to create a dictionary where the value is the index of the order. Can anyone point me in the right direction?
from itertools import zip_longest, groupby
okay = list(x == y for y, (x, _) in zip_longest(
    (y for y in Y if y in X), groupby(x for x in X if x in Y)))
print(len(okay) >= 2 and all(okay))
First we discard unnecessary elements from both lists. Then we can use groupby to collapse sequences of the same elements of X. For example, your first example ['a', 'c', 'c', 'b', 'd', 'd', 'd'] first becomes ['a', 'b', 'd', 'd', 'd'] (by discarding the unnecessary 'c'), then [('a', _), ('b', _), ('d', _)]. If we compare its keys element by element to the Y without the unnecessary bits, and there are at least 2 of them, we have a match. If the order was violated (e.g. ['b', 'c', 'c', 'a', 'd', 'd', 'd']), there would have been a False in okay, and it would fail. If an extra element appeared somewhere, there would be a comparison with None (thanks to zip_longest), and again a False would have been in okay.
This can be improved by use of sets to speed up the membership lookup.
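A sketch of that improvement, assuming a hypothetical function name in_custom_order: the membership tests go through sets, and the intermediate lists are named so each step of the explanation is visible.

```python
from itertools import groupby, zip_longest

def in_custom_order(X, Y):
    x_set, y_set = set(X), set(Y)
    # Elements of Y that actually occur in X, in Y's order.
    wanted = [y for y in Y if y in x_set]
    # X restricted to Y's elements, with consecutive repeats collapsed.
    runs = [key for key, _ in groupby(x for x in X if x in y_set)]
    # zip_longest pads the shorter side with None, which never matches.
    okay = [x == y for x, y in zip_longest(runs, wanted)]
    return len(okay) >= 2 and all(okay)

Y = ['a', 'b', 'd']
print(in_custom_order(['a', 'c', 'c', 'b', 'd', 'd', 'd'], Y))  # True
print(in_custom_order(['d', 'a', 'a', 'c', 'b'], Y))            # False
```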
Create a new list from X that only contains the elements from Y without duplicates. Then, similarly, remove all elements from Y not contained in X and deduplicate. Then your check is just a simple equality check.
def deduplicate(iterable):
    seen = set()
    return [seen.add(x) or x for x in iterable if x not in seen]
def goes_with_order(X, Y):
    Xs = set(X); Ys = set(Y)
    X = deduplicate(x for x in X if x in Ys)
    Y = deduplicate(y for y in Y if y in Xs)
    return X == Y
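As written, goes_with_order does not enforce the "at least two elements" requirement from the question. A sketch of a variant that does, using dict.fromkeys for order-preserving de-duplication (dicts keep insertion order on Python 3.7+); the name goes_with_order_min2 is made up for illustration.

```python
def goes_with_order_min2(X, Y):
    x_set, y_set = set(X), set(Y)
    # dict.fromkeys keeps the first occurrence of each key, in order.
    xs = list(dict.fromkeys(x for x in X if x in y_set))
    ys = list(dict.fromkeys(y for y in Y if y in x_set))
    return len(xs) >= 2 and xs == ys

Y = ['a', 'b', 'd']
print(goes_with_order_min2(['a', 'b'], Y))                 # True
print(goes_with_order_min2(['d', 'a', 'a', 'c', 'b'], Y))  # False
print(goes_with_order_min2(['a', 'a', 'c'], Y))            # False (only one matching element)
```

Note that de-duplicating X differs from collapsing consecutive repeats: an X such as ['a', 'b', 'a'] passes this check, because only the first occurrence of each element is kept.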

Efficient way of calculating specific length combinations of adjacent data?

I have a list of elements, of which I'd like to determine all possible combinations that can be arranged - preserving their order - to arrive at 'n' groups
So as an example, if I have an ordered list of A, B, C, D, E, and only want 2 groups, the four solutions would be;
ABCD, E
ABC, DE
AB, CDE
A, BCDE
Now, with some help from another StackOverflow post I've come up with a workable brute-force solution that calculates all possible combinations of all possible groupings from which I simply extract those cases that meet my target number of groupings.
For reasonable numbers of elements, this is just fine, but as I extend the numbers of elements, the number of combinations increases very very quickly, and I was wondering if there might be a clever way to limit the solutions calculated to only those that meet my target groupings number?
Code so far is as follows;
import itertools
import string
import collections
def generate_combination(source, comb):
    res = []
    for x, action in zip(source,comb + (0,)):
        res.append(x)
        if action == 0:
            yield "".join(res)
            res = []
#Create a list of first 20 letters of the alphabet
seq = list(string.ascii_uppercase[0:20])
seq
['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T']
#Generate all possible combinations
combinations = [list(generate_combination(seq,c)) for c in itertools.product((0,1), repeat=len(seq)-1)]
len(combinations)
524288
#Create a list that counts the number of groups in each solution,
#and counter to allow easy query
group_counts = [len(i) for i in combinations]
count_dic = collections.Counter(group_counts)
count_dic[1], count_dic[2], count_dic[3], count_dic[4], count_dic[5], count_dic[6]
(1, 19, 171, 969, 3876, 11628)
So as you can see, while over half a million combinations were calculated, if I had only wanted ones of length = 5, only 3,876 need have been calculated
Any suggestions?
A partition of seq into 5 parts is equivalent to a choice of 4 locations in range(1, len(seq)) at which to cut seq.
Thus you could use itertools.combinations(range(1, len(seq)), 4) to generate all the partitions of seq into 5 parts:
import itertools as IT
import string
def partition_into_n(iterable, n, chain=IT.chain, map=map):
    """
    Return a generator of all partitions of iterable into n parts.
    Based on http://code.activestate.com/recipes/576795/ (Raymond Hettinger)
    which generates all partitions.
    """
    s = iterable if hasattr(iterable, '__getitem__') else tuple(iterable)
    size = len(s)
    first, middle, last = [0], range(1, size), [size]
    getitem = s.__getitem__
    return (map(getitem, map(slice, chain(first, div), chain(div, last)))
            for div in IT.combinations(middle, n-1))
seq = list(string.ascii_uppercase[0:20])
ngroups = 5
for partition in partition_into_n(seq, ngroups):
    print(' '.join([''.join(grp) for grp in partition]))
print(len(list(partition_into_n(seq, ngroups))))
yields
A B C D EFGHIJKLMNOPQRST
A B C DE FGHIJKLMNOPQRST
A B C DEF GHIJKLMNOPQRST
A B C DEFG HIJKLMNOPQRST
...
ABCDEFGHIJKLMNO P Q RS T
ABCDEFGHIJKLMNO P QR S T
ABCDEFGHIJKLMNO PQ R S T
ABCDEFGHIJKLMNOP Q R S T
3876
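The cut-point idea can be checked on the small example from the question. A minimal sketch (the helper name partitions is hypothetical, and it slices directly rather than using the map/slice construction): choosing n-1 cut positions out of len(seq)-1 gives C(len(seq)-1, n-1) partitions, which is C(19, 4) = 3876 for 20 letters and 5 groups. math.comb needs Python 3.8+.

```python
from itertools import combinations
from math import comb

def partitions(seq, n):
    """Yield every split of seq into n contiguous, non-empty groups."""
    for cuts in combinations(range(1, len(seq)), n - 1):
        # Group boundaries: start, the chosen cut positions, end.
        bounds = (0, *cuts, len(seq))
        yield [seq[a:b] for a, b in zip(bounds, bounds[1:])]

for groups in partitions("ABCDE", 2):
    print(", ".join(groups))
# A, BCDE
# AB, CDE
# ABC, DE
# ABCD, E

print(comb(19, 4))  # 3876
```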

find elements in lists using For Loop

keys = ['a','H','c','D','m','l']
values = ['a','c','H','D']
category = []
for index, i in enumerate(keys):
    for j in values:
        if j in i:
            category.append(j)
            break
        if index == len(category):
            category.append("other")
print(category)
My expected output is ['a', 'H', 'c', 'D', 'other', 'other']
But I am getting ['a', 'other', 'H', 'c', 'D', 'other']
EDIT: OP edited his question multiple times.
Python documentation break statement:
It terminates the nearest enclosing loop.
You break out of the outer loop using the "break" statement. The execution never even reaches the inner while loop.
Now.. To solve your problem of categorising strings:
xs = ['Am sleeping', 'He is walking','John is eating']
ys = ['walking','eating','sleeping']
categories = []
for index, x in enumerate(xs):
    for y in ys:
        if y in x:
            categories.append(y)
            break
    # Nothing was appended for this string, so the list is one short.
    if len(categories) == index:
        categories.append("other")
print(categories) # ['sleeping', 'walking', 'eating']
Iterate over both lists and check if any category matches. If one does, append it to the categories list and continue with the next string to categorise. If no matching category was found (detected by the number of collected categories being equal to the current index: the index is 0-based, so after a match the length is index + 1, and a length equal to index means nothing was appended), categorise the string as "other".
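An alternative sketch that avoids the index bookkeeping altogether: next() with a default returns the first matching category, or "other" when none matches. The helper name categorise is made up for illustration; the data is the list from the original question.

```python
def categorise(items, categories, default="other"):
    # For each item, take the first category contained in it, else the default.
    return [next((c for c in categories if c in item), default) for item in items]

keys = ['a', 'H', 'c', 'D', 'm', 'l']
values = ['a', 'c', 'H', 'D']
print(categorise(keys, values))  # ['a', 'H', 'c', 'D', 'other', 'other']
```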
