Most efficient way to remove duplicates - python-3.x

I have a log file that I need to remove duplicate entries from. Each line in the file consists of three parts separated by commas, let's call them A, B and C respectively.
Two entries are duplicates if and only if their A's and C's are equal. If duplicates are found, the one with the greatest B shall remain.
The real log file has a large number of lines, the following serves only as a simplified example:
Log file (input):
hostA, 1507300700.0, xyz
hostB, 1507300700.0, abc
hostB, 1507300800.0, xyz
hostA, 1507300800.0, xyz
hostA, 1507300900.0, xyz
Log file after duplicates have been removed (output):
hostB, 1507300700.0, abc
hostB, 1507300800.0, xyz
hostA, 1507300900.0, xyz
I've tried reading in the file as two lists, then comparing them along the lines of:
for i in full_log_list_a:
    for j in full_log_list_b:
        if i[0] == j[0] and i[2] == j[2] and i[1] > j[1]:
            print(', '.join(i[0]), file=open(new_file, 'a'))
I've also tried a few other things, but whatever I do it ends up iterating over the list too many times and creating a bunch of repeat entries, or it fails to find ONLY the item with the greatest B. I know there's probably an obvious answer, but I'm stuck. Can someone please point me in the right direction?

I think a dict is what you're looking for, instead of lists.
As you read the log file you add entries to the dict, where each entry consists of a key (A, C) and a value B. If a key already exists, you compare B with the value mapped to the key, and remap the key if necessary (i.e. if B is greater than the value currently mapped to the key).
Example (do use better names for variables a, b and c):
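log_file_entries = {}
with open(log_file, 'r') as f:
    for line in f:
        # strip() drops the trailing newline so it does not end up in c
        a, b_str, c = line.strip().split(', ')
        # values such as 1507300700.0 are floats; int() would raise ValueError
        b = float(b_str)
        if (a, c) in log_file_entries:
            if b < log_file_entries[(a, c)]:
                continue
        log_file_entries[(a, c)] = b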
It's one loop. Since the required operations on dicts are (typically) constant in time, i.e. O(1), the overall time complexity will be O(n), much better than your nested loops' time complexity of O(n²).
When you later rewrite the file, you can just loop over the dict like so:
with open(new_file, 'a') as f:
    for (a, c), b in log_file_entries.items():
        print('{0}, {1}, {2}'.format(a, b, c), file=f)
Apologies if any code or terms are incorrect, I haven't touched Python in a while.
(P.S. In your example code you use two lists, whereas you could have used the same list in both loops.)
UPDATE
If you want the value of a key to contain every part of a line in the log file, you could rewrite the above code like so:
log_file_entries = {}
with open(log_file, 'r') as f:
    for line in f:
        a, b_str, c = line.strip().split(', ')
        b = float(b_str)  # int() would fail on values such as 1507300700.0
        if (a, c) in log_file_entries:
            if b < log_file_entries[(a, c)][1]:
                continue
        log_file_entries[(a, c)] = (a, b, c)
with open(new_file, 'a') as f:
    for a, b, c in log_file_entries.values():
        # b is a float, so format the entry instead of using str.join
        print('{0}, {1}, {2}'.format(a, b, c), file=f)
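For reference, the dict-based idea can be sketched as a self-contained function that works on an in-memory list of lines. The function name `dedupe` and the `sample` list are illustrative assumptions, not part of the original answer; the sketch splits on ',' and strips whitespace so that a trailing newline never ends up in C.

```python
def dedupe(lines):
    """For each (A, C) pair, keep the line with the greatest B."""
    best = {}  # (a, c) -> (b, original line)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        a, b_str, c = [part.strip() for part in line.split(',')]
        b = float(b_str)
        key = (a, c)
        # Store the line if the key is new or this B is greater.
        if key not in best or b > best[key][0]:
            best[key] = (b, line)
    return [line for _, line in best.values()]


sample = [
    "hostA, 1507300700.0, xyz\n",
    "hostB, 1507300700.0, abc\n",
    "hostB, 1507300800.0, xyz\n",
    "hostA, 1507300800.0, xyz\n",
    "hostA, 1507300900.0, xyz\n",
]
result = dedupe(sample)
for line in result:
    print(line)
```

On Python 3.7+ the result follows the order in which each (A, C) pair first appeared, so the order of the output can differ from the order shown in the question even though the same three lines survive.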

Related

I need to add the items from list A to list B, only when the item already exists in list B

I need to add every item in list A to list B if and only if the item already exists in list B, even when list A contains the same item several times. How can I do this? Right now, it stops after the first occurrence of a repeated word. For example, if list A is [the, a, sure, book, is, the, best] and list B is [the, rock], I need to add the two "the"s from list A to list B, for a total of three "the"s in list B.
It's for a beginner's class in Python. I've tried different forms of the for loop, but none of them seems to work. There is a for loop before this one that creates list B; I think the problem may be caused by the back-to-back for loops.
for word in list_a:
    if word[0].isupper() == True:
        list_b.append(word)
list_b = [word.lower() for word in list_b]
for word in list_a:
    if word in list_b:
        list_b.append(word)
list_b = [word.capitalize() for word in list_b]
The second for loop is the one giving me trouble. I have pasted the larger piece of code I am working with. My main objective is first to separate the capitalized words in list A into list B.
Then I lowercase list B so I can find the same words in list A that are not capitalized, and add those to list B as well. Finally, I capitalize all of list B again to print the count.
I know there is an easier way to do this: make the original string lowercase and work from there. However, in this case I need to keep track of the words that were originally capitalized, because when I print the words with their total counts, those words need to print in their capitalized form rather than in lowercase.
Therefore, the expected outcome I am looking for is list B as [the, rock, the, the].
a = ['the', 'a', 'sure', 'book', 'is', 'the', 'best']
b = ['the', 'rock']
b = b + [text for text in a if text in b]
list_b = list_b + ([x for x in list_a if x in list_b])
result:
['the', 'rock', 'the', 'the']
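A variant of the same comprehension, as a minimal sketch: if list B is large, each `in` test against a list scans the whole list, so it may help to snapshot B's words in a set first. The name `known` is an illustrative assumption.

```python
list_a = ['the', 'a', 'sure', 'book', 'is', 'the', 'best']
list_b = ['the', 'rock']

# Snapshot the words of list_b once; set lookups are O(1) on average.
known = set(list_b)

# Keep every occurrence from list_a whose word is already in list_b.
list_b = list_b + [word for word in list_a if word in known]
print(list_b)
```

Because `known` is built before anything is appended, repeated words in list A are all kept, and the membership test does not slow down as list B grows.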

How to remove a character nested in a list?

I am given a sample string AABCAAADA. I then split it into 3 parts: AAB, CAA, ADA.
I have nested these 3 elements into a list. In each part, I should check whether a duplicate character is present and delete the duplicate character. I know strings are immutable, but is there any trick to do that?
Below is the sample approach I tried but I am unable to use del and pop method to delete that duplicate character.
s='AABCAAADA'
x = int(input())
l=[]
#for i in range(0,len(s),x):
for j in range(0,len(s),3):
    l.append(s[j:j+3])
j=0
for i in range(0,len(s)//x):
    for j in range(0,len(l[j])-1):
        if(l[i][j] == l[i][j+1]):
            pass
            #need to remove the (j+1)th term if it is duplicate
The output should be AB, CA, AD.
delete duplicate character in nested list
from functools import reduce
l = ['AAB','CAA','ADA']
print([''.join(reduce(lambda a, b: a if b in a else a + b, s, '')) for s in l])
Or, for Python 3.6+:
print([''.join({a: 1 for a in s}) for s in l])
Both output:
['AB', 'CA', 'AD']
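Another way to write the same idea, as a hedged sketch: `dict.fromkeys` keeps the first occurrence of each character, and dicts preserve insertion order on Python 3.7+. The names `size`, `chunks`, and `deduped` are illustrative assumptions; the chunking step mirrors the split described in the question.

```python
s = 'AABCAAADA'
size = 3

# Split the string into consecutive parts of `size` characters.
chunks = [s[i:i + size] for i in range(0, len(s), size)]

# dict.fromkeys keeps one key per character, in first-seen order.
deduped = [''.join(dict.fromkeys(chunk)) for chunk in chunks]
print(deduped)
```

Since strings are immutable, each part is rebuilt as a new string rather than edited in place.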

How to determine if two elements from a list appear consecutively in a string? Python

I am trying to solve a problem that can be modelled most simply as follows.
I have a large collection of letter sequences. The letters come from two lists: (1) member list (2) non-member list. The sequences are of different compositions and lengths (e.g. AQFG, CCPFAKXZ, HBODCSL, etc.). My goal is to insert the number '1' into these sequences when any 'member' is followed by any two 'non-members':
Rule 1: Insert '1' after the first member letter that is followed
by 2 or more non-members letters.
Rule 2: Insert not more than one '1' per sequence.
The 'Members': A, B, C, D
'Non-members': E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z
In other words, once a member letter is followed by 2 non-member letters, insert a '1'. In total, only one '1' is inserted per sequence. Examples of what I am trying to achieve are this:
AQFG ---> A1QFG
CCPFAKXZ ---> CC1PFAKXZ
BDDCCA ---> BDDCCA1
HBODCSL ---> HBODC1SL
ABFCC ---> ABFCC
ACKTBB ---> AC1KTBB # there is no '1' to be inserted after BB
I assume the code will be something like this:
members = ['A','B','C','D']
non_members = ['E','F','G','H','I','J','K','L','M','N',
               'O','P','Q','R','S','T','U','V','W','X','Y','Z']
strings = ['AQFG', 'CCPFAKXZ', 'BDDCCA', 'HBODCSL', 'ABFCC']
for i in members:
    if i in strings:
        if member is followed by 2 non-members: # Struggling here
            i.insert(index_member, '1')
            return i
return ''
EDIT
I have found that one solution could be to generate a list of all permutations of two 'non-member' items using itertools.permutations(non_members, 2), and then test for their presence in the string.
But is there a more elegant solution for this problem?
Generating all permutations is going to explode the number of things you are checking. You need to change how you are iterating, to something like:
members = ...
non_members = ...
s = 'AQFG'
out = ""
look = 2
for i in range(len(s) - look):
    out += s[i]
    if (s[i] in members
            and s[i+1] in non_members
            and s[i+2] in non_members):
        out += '1' + s[i+1:]
        break
else:
    # no member was followed by two non-members: keep the rest unchanged
    out += s[max(len(s) - look, 0):]
This way you only need to go through the target string once, and you don't need to generate permutations. The same idea could be extended to look further ahead, which the permutation approach would not handle well.
I believe this can also be done with a regex.
import re
s = 'AQFG'
x = re.sub(r'([ABCD])([EFGHIJKLMNOPQRSTUVWXYZ])',r'\g<1>1\2',s)
print(x)
This will print A1QFG
Sorry. I missed that. re.sub can take an optional count parameter that can stop after the given number of replacements are made.
s = 'HBODCSL'
x = re.sub(r'([ABCD]+)([EFGHIJKLMNOPQRSTUVWXYZ])',r'\g<1>1\2',s,count=1)
print(x)
This will print HB1ODCSL
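As a hedged sketch of how the regex idea could follow the stated rule more closely (first member followed by two non-members, at most one insertion), a lookahead can require two non-member letters without consuming them. The names `PATTERN` and `insert_marker` are illustrative assumptions, and the character classes assume the member and non-member sets given in the question.

```python
import re

# Members are A-D and non-members are E-Z, as defined in the question.
# (?=[E-Z]{2}) requires two non-members to follow without consuming them.
PATTERN = re.compile(r'([A-D])(?=[E-Z]{2})')


def insert_marker(seq):
    """Insert '1' after the first member followed by two non-members."""
    return PATTERN.sub(r'\g<1>1', seq, count=1)


for seq in ['AQFG', 'CCPFAKXZ', 'HBODCSL', 'ABFCC', 'ACKTBB', 'BDDCCA']:
    print(seq, '--->', insert_marker(seq))
```

Under this reading of the rule, BDDCCA is left unchanged because no member in it is followed by two non-members; the question lists BDDCCA1 for that case, which would need an extra rule that the question does not spell out.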

How to create a dictionary from a file with multiple lines

I'm trying to create a dictionary from multiple lines in a file, for example:
grocery store
apples
banana
bread

shopping mall
movies
clothing stores
shoe stores
What I'm trying to do is make the first row of each section (i.e. grocery store and shopping mall) the keys and everything underneath (apple, banana, bread & movies, clothing stores, shoe stores respectively) the values. I've been fiddling around with the readline approach + while loop, but I haven't been able to figure it out. If anyone knows, please help. Thanks.
One solution is to store in a variable the boolean value for whether you're at the start of a section. I don't want to give away the exciting (?) ending, but you could start with is_first=True.
OK, I guess I do want to give away the ending after all. Here's what I had in mind, more or less:
with open(fname) as f:
    content = f.readlines()
is_first = True
d = {}
for line in content:
    if line == '\n':
        is_first = True
    elif is_first:
        key = line
        is_first = False
    else:
        # Python dicts have no put(); use get() with a default instead
        d[key] = d.get(key, '') + line
I find it easier to plan the code that way. Of course you could also solve this without an is_first variable, especially if you've already gone through the exercise of doing it with an is_first variable. I think the following is correct, but I wasn't incredibly careful:
with open(fname) as f:
    content = f.readlines()
d = {}
while content:
    key, content = content[0], content[1:]
    if key != '\n':
        # consume values until a blank line or the end of the file
        while content and content[0] != '\n':
            value, content = content[0], content[1:]
            d[key] = d.get(key, '') + value
@minopret has already given a pedagogically useful answer, and one that's important for beginners to understand. In a sense, even some more seemingly-sophisticated approaches are often doing that under the hood -- using a kind of state machine, I mean -- so it's important to know.
But for the heck of it, I'll describe a higher-level approach. There's a handy function, itertools.groupby, which splits a sequence into runs of consecutive items that share a key. Here we use bool as the key: bool(line) is False if the line is empty and True otherwise, so each run of non-empty lines forms one group, and we then build a dict from those groups.
from itertools import groupby
with open("shopdict.txt") as fin:
    stripped = map(str.strip, fin)
    grouped = (list(g) for k,g in groupby(stripped, bool) if k)
    d = {g[0]: g[1:] for g in grouped}
from itertools import groupby
with open("shopdict.txt") as fin:
    stripped = map(str.strip, fin)
    d = {k: g for b, (k, *g) in groupby(stripped, bool) if b}
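To see what groupby produces without needing a file on disk, here is a minimal sketch that runs the same idea over an in-memory list. The `lines` list is an assumption standing in for the contents of the file, with a blank line separating the two sections.

```python
from itertools import groupby

# Stand-in for the file contents; a blank line separates the sections.
lines = [
    "grocery store\n", "apples\n", "banana\n", "bread\n",
    "\n",
    "shopping mall\n", "movies\n", "clothing stores\n", "shoe stores\n",
]

stripped = (line.strip() for line in lines)

d = {}
# groupby yields (key, run) pairs; key is bool(line) for every line in the run.
for non_empty, group in groupby(stripped, key=bool):
    if non_empty:
        # The first line of a run is the section name, the rest are its items.
        key, *values = group
        d[key] = values

print(d)
```

Runs of blank lines come out with a False key and are simply skipped, so several consecutive blank lines between sections behave the same as one.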
And here's a way just using for loops
d={}
with open("shopdict.txt") as fin:
    for key in fin:
        key = key.strip()
        d[key] = []
        for item in fin:
            if item.isspace():
                break
            d[key].append(item.strip())

set from the union of elements contained in two lists

This is for a pre-interview questionnaire. I believe I have the answer; I just wanted confirmation that I'm right.
Part 1 - Tell me what this code does, and its big-O performance
Part 2 - Re-write it yourself and tell me the big-O performance of your solution
def foo(a, b):
    """ a and b are both lists """
    c = []
    for i in a:
        if is_bar(b, i):
            c.append(i)
    return unique(c)
def is_bar(a, b):
    for i in a:
        if i == b:
            return True
    return False
def unique(arr):
    b = {}
    for i in arr:
        b[i] = 1
    return b.keys()
ANSWERS:
It creates a set from the union of elements contained in two lists. Its big-O performance is O(n²).
My solution, which I believe achieves O(n):
Set A = getSetA();
Set B = getSetB();
Set UnionAB = new Set(A);
UnionAB.addAll(B);
for (Object inA : a)
    if(B.contains(inA))
        UnionAB.remove(inA);
It seems like the original code is doing an intersection not a union. It's traversing all the elements in the first list (a) and checking if it exists in the second list (b), in which case it is adding it to list c. Then it is returning the unique elements from c. Performance of O(n^2) seems right.
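If the original code is read as an intersection, one possible rewrite in Python is sketched below. It assumes the elements are hashable (the original `unique` already relies on that by using them as dict keys) and that the order of the result does not matter; building both sets takes O(len(a) + len(b)) time on average.

```python
def foo(a, b):
    """Return the distinct elements of list a that also occur in list b."""
    # Set intersection replaces the nested scan done by is_bar.
    return list(set(a) & set(b))


print(sorted(foo([1, 2, 2, 3], [2, 3, 4])))
```

The result is sorted in the example only to make the printed output stable, since sets do not guarantee any particular order.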
