Produce the most unique elements with least lists - python-3.x

I am new here. I hope the example below makes clear what I am trying to do.
example1: What is your name?
example1: Where are you from?
example1: How are you doing?
example2: What is your name?
example2: Where are you from?
example2: How are you doing?
example2: When did you move here?
example9: What is your name?
example3: Where are you from?
example23: Who gave you this book?
In the above example, I would like to print every unique question while choosing the example groups so that as few groups as possible are used. Expected output:
example2: What is your name?
example2: Where are you from?
example2: How are you doing?
example2: When did you move here?
example23: Who gave you this book?
In other words, I am searching for all unique questions in the file while using as few example groups as possible.
Here is what I have tried so far:
import collections

s = collections.defaultdict(list)
u_s = set()
with open('file.txt', 'r') as s1:
    for line in s1:
        data = line.split(':', maxsplit=1)
        start = data[0]
        end = data[-1]
        if end not in u_s:
            u_s.add(end)
            s[start] += [end]

for start, ends in s.items():
    print(start, ends[0])
    for end in ends[1:]:
        print(start, end)
Result that I am getting:
example1 What is your name?
example1 Where are you from?
example1 How are you doing?
example2 When did you move here?
example23 Who gave you this book?
Here, instead of printing example1, I want it to pick example2, because example2 covers more questions.
I tried sorting the lines by how often each line repeats, but I couldn't get it to work. I appreciate your help. Thanks

Your code prints all the unique questions, but it never compares the example sets as a whole.
Rather than sorting, I would formulate the problem as comparing combinations of example sets and selecting the ones that cover the most unique questions with the fewest sets, so to me your question is really about the algorithm.
import collections

def calculate_contrib(values, seen):
    '''Calculate the contribution to the number of unique questions,
    based on the values to add.
    values: the list of questions in the set being considered.
    seen: the already-added questions.'''
    contrib = 0
    for value in values:
        if value not in seen:
            contrib += 1
    return contrib

def print_result(x):
    '''Print the result, x, as a dictionary, without repetition.'''
    u_s = set()
    for key, values in x.items():
        for value in values:
            if value not in u_s:
                print(key, value)
                u_s.add(value)

s = collections.defaultdict(list)
# get all questions in examples
with open('file.txt', 'r') as s1:
    for line in s1:
        data = line.split(':', maxsplit=1)
        start = data[0]
        end = data[-1]
        s[start] += [end]

# Get the initial contribution to the unique questions' number for each example set
contrib = dict()
u_s = set()
result = dict()
for key, values in s.items():
    contrib[key] = calculate_contrib(values, u_s)

# Keep looping while there are unique questions left to add to u_s
while not all(x == 0 for x in contrib.values()):
    # Add the example set with the maximum contribution
    max_contrib = 0
    max_key = ""
    for key, value in contrib.items():
        if max_contrib < value:
            max_key = key
            max_contrib = value
    result[max_key] = s[max_key]
    u_s.update(s[max_key])
    del s[max_key]
    del contrib[max_key]
    for key, values in s.items():
        contrib[key] = calculate_contrib(values, u_s)

# print the result
print_result(result)
The above is a straightforward greedy implementation: each step adds the example set that increases the number of unique questions the most, until no unique question remains.
Further improvements are possible. Hope it gives you some insight.
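The same greedy idea can be written more compactly with set operations. Below is a minimal sketch over hard-coded sample lines (standing in for file.txt); the variable names are invented for illustration:

```python
from collections import defaultdict

# Toy input standing in for file.txt
lines = [
    "example1: What is your name?",
    "example1: Where are you from?",
    "example2: What is your name?",
    "example2: Where are you from?",
    "example2: When did you move here?",
    "example23: Who gave you this book?",
]

groups = defaultdict(set)
for line in lines:
    key, _, question = line.partition(":")
    groups[key].add(question.strip())

covered, chosen = set(), []
# Greedy set cover: repeatedly take the group that adds the most new questions
while any(g - covered for g in groups.values()):
    best = max(groups, key=lambda k: len(groups[k] - covered))
    chosen.append(best)
    covered |= groups[best]

print(chosen)  # ['example2', 'example23']
```

Here `groups[k] - covered` is the set of questions group `k` would newly contribute, which plays the same role as calculate_contrib above.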


Iterating through an SRT file until index is found

This might sound like an "iterate through a file until a condition is met" question (which I have already checked), but those answers don't work for me.
Given an SRT file (any) as srtDir, I want to go to the index choice and get the timecode and caption values.
I did the following, which is supposed to iterate through the SRT file until the condition is met:
import os

srtDir = "./media/srt/001.srt"
index = 100  # Index. Number is an example
found = False
with open(srtDir, "r") as SRT:
    print(srtDir)
    content = SRT.readlines()
    content = [x.strip() for x in content]
    for x in content:
        print(x)
        if x == index:
            print("Found")
            found = True
            break
if not found:
    print("Nothing was found")
As said, it is supposed to iterate until the index is found, but it prints "Nothing was found", which is weird, because I can see the number printed on screen.
What did I do wrong?
(I have checked for libraries; AFAIK there is none that returns the timecode and caption given the index.)
You have a type mismatch in your code: index is an int but x in your loop is a str. In Python, 100 == "100" evaluates to False. The solution to this kind of bug is to adopt a well-defined data model and write library methods that apply it consistently.
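A minimal sketch of that fix, shown on hard-coded lines rather than a file (the helper name is invented for illustration): bare subtitle indexes pass str.isdigit(), so only those lines are converted and compared as integers.

```python
def find_index(lines, index):
    """Return True once a stripped line equals the given integer index."""
    for line in lines:
        line = line.strip()
        # compare as integers, so the int 100 can match the str "100"
        # (timecode lines like "00:31:37,894 --> ..." fail isdigit())
        if line.isdigit() and int(line) == index:
            return True
    return False

content = ["1", "00:31:37,894 --> 00:31:39,928", "Hello", "", "100"]
print(find_index(content, 100))  # True
```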
However, with something like this, it's best not to reinvent the wheel and let other people do the boring work for you.
import srt

# Sample SRT file (entries are separated by blank lines)
raw = '''\
1
00:31:37,894 --> 00:31:39,928
OK, look, I think I have a plan here.

2
00:31:39,931 --> 00:31:41,931
Using mainly spoons,

3
00:31:41,933 --> 00:31:43,435
we dig a tunnel under the city and release it into the wild.
'''

# Parse and get index
subs = list(srt.parse(raw))

def get_index(n, subs_list):
    for i in subs_list:
        if i.index == n:
            return i
    return None

s = get_index(2, subs)
print(s)
See:
https://github.com/cdown/srt
https://srt.readthedocs.io/en/latest/quickstart.html
https://srt.readthedocs.io/en/latest/api.html

Is there any way to make this more efficient?

I have 24 more attempts to submit this task. I spent hours and my brain does not work anymore. I am a beginner with Python; can you please help me figure out what is wrong? I would love to see the correct code if possible.
Here is the task itself and the code I wrote below.
Note that you can have access to all standard modules/packages/libraries of your language. But there is no access to additional libraries (numpy in python, boost in c++, etc).
You are given the content of a CSV file with information about a set of trades. It contains the following columns:
TIME - Timestamp of a trade in format Hour:Minute:Second.Millisecond
PRICE - Price of one share
SIZE - Count of shares executed in this trade
EXCHANGE - The exchange that executed this trade
For each exchange, find the one-minute window during which the largest number of trades took place on that exchange.
Note that:
You need to send the source code of your program.
You have only 25 attempts to submit solutions for this task.
You have access to all standard modules/packages/libraries of your language. But there is no access to additional libraries (numpy in python, boost in c++, etc).
Input format
Input contains several lines. You can read it from standard input or from the file “trades.csv”.
Each line contains information about one trade: TIME, PRICE, SIZE and EXCHANGE. Values are separated by commas.
Lines are listed in ascending order of timestamps. Several lines can contain the same timestamp.
Size of input file does not exceed 5 MB.
See the example below to understand the exact input format.
Output format
If the input contains information about k exchanges, print k lines to standard output.
Each line should contain a single number: the maximum number of trades during a one-minute window.
Print the answers for the exchanges in lexicographical order of their names.
Sample
Input:
09:30:01.034,36.99,100,V
09:30:55.000,37.08,205,V
09:30:55.554,36.90,54,V
09:30:55.556,36.91,99,D
09:31:01.033,36.94,100,D
09:31:01.034,36.95,900,V
Output:
2
3
Notes
In the example, four trades were executed on exchange “V” and two trades were executed on exchange “D”. Not all of the “V”-trades fit in a single one-minute window, so the answer for “V” is three.
X = []
with open('trades.csv', 'r') as tr:
    for line in tr:
        line = line.strip('\xef\xbb\xbf\r\n ')
        X.append(line.split(','))

dex = {}
for item in X:
    dex[item[3]] = []
for item in X:
    dex[item[3]].append(float(item[0][:2])*60. + float(item[0][3:5]) + float(item[0][6:8])/60. + float(item[0][9:])/60000.)

for item in dex:
    count = 1
    ccount = 1
    if dex[item][len(dex[item])-1] - dex[item][0] < 1:
        count = len(dex[item])
    else:
        for t in range(len(dex[item])-1):
            for tt in range(len(dex[item])-t-1):
                if dex[item][tt+t+1] - dex[item][t] < 1:
                    ccount += 1
                else:
                    break
            if ccount > count:
                count = ccount
            ccount = 1
    print(count)
First of all, it is not necessary to use the datetime and csv modules for such a simple case (as in Ed-Ward's example).
If we remove the colon and dot signs from the time strings, they can be converted with int() directly - an easier way than the one you tried in your example.
CSV features like dialects and special formatting are not used here, so I suggest a simple split(",").
Now about efficiency. Efficiency means time complexity.
The more times you pass over your array of dates from beginning to end, the more expensive the algorithm becomes.
So the goal is to minimize the number of passes - ideally a single pass over all rows - and especially to avoid nested loops and repeated scans of collections from beginning to end.
For such a task it is better to use a deque instead of a tuple or list, because you can remove the first element with popleft() and append a last element with O(1) complexity.
Just append each time to the end of the relevant exchange's queue until the difference between the current and first elements becomes more than 1 minute. Then remove the first element with popleft() and continue comparing. After the whole file is done, the length of each queue will be the maximum 1-minute window.
Example with linear time complexity O(n):
from collections import deque

ex_list = {}
s = open("trades.csv").read().replace(":", "").replace(".", "")
for line in s.splitlines():
    s = line.split(",")
    curr_tm = int(s[0])
    curr_ex = s[3]
    if curr_ex not in ex_list:
        ex_list[curr_ex] = deque()
    ex_list[curr_ex].append(curr_tm)
    if curr_tm >= ex_list[curr_ex][0] + 100000:
        ex_list[curr_ex].popleft()

print("\n".join([str(len(ex_list[k])) for k in sorted(ex_list.keys())]))
This code should work:
import csv
import datetime

diff = datetime.timedelta(minutes=1)

def date_calc(start, dates):
    for i, date in enumerate(dates):
        if date >= start + diff:
            return i
    return i + 1

exchanges = {}
with open("trades.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        this_exchange = row[3]
        if this_exchange not in exchanges:
            exchanges[this_exchange] = []
        time = datetime.datetime.strptime(row[0], "%H:%M:%S.%f")
        exchanges[this_exchange].append(time)

ex_max = {}
for name, dates in exchanges.items():
    ex_max[name] = 0
    for i, d in enumerate(dates):
        x = date_calc(d, dates[i:])
        if x > ex_max[name]:
            ex_max[name] = x

print('\n'.join([str(ex_max[k]) for k in sorted(ex_max.keys())]))
Output:
2
3
( obviously please check it for yourself before uploading it :) )
I think the issue with your current code is that you don't put the output in lexicographical order of their names...
If you want to use your current code, then here is a (hopefully) fixed version:
X = []
with open('trades.csv', 'r') as tr:
    for line in tr:
        line = line.strip('\xef\xbb\xbf\r\n ')
        X.append(line.split(','))

dex = {}
counts = []
for item in X:
    dex[item[3]] = []
for item in X:
    dex[item[3]].append(float(item[0][:2])*60. + float(item[0][3:5]) + float(item[0][6:8])/60. + float(item[0][9:])/60000.)

for item in dex:
    count = 1
    ccount = 1
    if dex[item][len(dex[item])-1] - dex[item][0] < 1:
        count = len(dex[item])
    else:
        for t in range(len(dex[item])-1):
            for tt in range(len(dex[item])-t-1):
                if dex[item][tt+t+1] - dex[item][t] < 1:
                    ccount += 1
                else:
                    break
            if ccount > count:
                count = ccount
            ccount = 1
    counts.append((item, count))

counts.sort(key=lambda x: x[0])
print('\n'.join([str(x[1]) for x in counts]))
Output:
2
3
I do think you can make your life easier in the future by using Python's standard library, though :)

Duplicate entries in the list

The user has to enter 8 unique entries; the logic should detect whether an entry is a duplicate, and otherwise continue until all 8 entries are entered.
I've played around with the for loop, but it doesn't give me the desired output. I want to go back to the last entry if a duplicate is scanned; instead it gives the message "Duplicate SCAN, please rescan" but the counter moves on anyway.
Sorry, I'm new to this; I thought I had included the code. Hoping it goes through this time.
x = 1
mac_list = []
while x <= 8:
    MAC1 = input("SCAN MAC" + str(x) + ":")
    for place in mac_list:
        print(mac_list)
        if place == MAC1:
            print("place" + place)
            print("Duplicate SCAN, please rescan")
    else:
        mac_list.append(MAC1)
        x += 1
Python's in comparison should do what you need:
values = []
while True:
    value = input('Input value: ')
    if value in values:
        print('Duplicate, please try again')
    else:
        values.append(value)
    if len(values) > 7:
        break
print(values)
Would something like this not work?
Sets can only hold unique elements and thus drop duplicates by default - that should remove a lot of your worries. This also scales better for larger datasets than element-wise comparison.
entries = set()
while len(entries) < 8:
    # union: adding a duplicate leaves the set unchanged
    entries = entries | {input("You do not have enough unique items yet, add another")}
To detect a duplicate you can compare the old set with the new one:
entries = set()
new = set()
while True:
    latest = input("You do not have enough unique items yet, add another")
    new = entries | {latest}
    if len(new) == len(entries):
        print("you entered a duplicate:", latest, " Please rescan")
    else:
        entries = new
    if len(entries) == 8:
        break
Store the entries in a set and keep looping while the set has fewer than 8 elements; set.add() silently ignores duplicates.
entries = set()
while len(entries) < 8:
    entries.add(input("Enter an item: "))

Matching the value of a word in a list with the place value of another list

I am trying to work out how I can compare a list of words against a string and report back each word's number from list one when they match. I can easily get the unique list of words from a sentence - just removing duplicates - and with enumerate I can give each word a value, so "Mary had a little lamb" becomes 1 Mary, 2 had, 3 a, etc. But I cannot work out how to then go through the original list again and replace each word with its number value (so it becomes 1 2 3 etc.).
Any ideas greatly received!
my_list.index(word)
will return the index of the item word within my_list. You can start digging into the documentation here
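Combined with duplicate removal, a short sketch of the whole round trip (using a hard-coded sentence instead of input()):

```python
sentence = "Mary had a little lamb Mary had"
words = sentence.split()

unique = []
for w in words:          # keep first-occurrence order
    if w not in unique:
        unique.append(w)

# replace each word with its 1-based position in the unique list
numbers = [unique.index(w) + 1 for w in words]
print(numbers)  # [1, 2, 3, 4, 5, 1, 2]
```

Note that index() must be called on each individual word, not on the whole list at once.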
Thank you for this info. I can see the logic for this and it should work; however, with the following code I get:
line 27, in <module>: output = words.index(result)
ValueError: ['word1', 'word2'] is not in list
def remove_duplicates(words):
    output = []
    seen = set()
    for value in words:
        # If value has not been encountered yet,
        # add it to both list and set.
        if value not in seen:
            output.append(value)
            seen.add(value)
    return output

# Remove duplicates from this list.
sentence = input("Enter a sentence ")
words = sentence.split(' ')
result = remove_duplicates(words)
print(result)
Very confusing :(
I have found an answer on here:
positions = [ i+1 for i in range(len(result)) if each == result[i]]
Which works well.

Generators for processing large result sets

I am retrieving information from a sqlite DB that gives me back around 20 million rows that I need to process. This information is then transformed into a dict of lists which I need to use. I am trying to use generators wherever possible.
Can someone please take a look at this code and suggest optimization please? I am either getting a “Killed” message or it takes a really long time to run. The SQL result set part is working fine. I tested the generator code in the Python interpreter and it doesn’t have any problems. I am guessing the problem is with the dict generation.
EDIT/UPDATE FOR CLARITY:
I have 20 million rows in my result set from my sqlite DB. Each row is of the form:
(2786972, 486255.0, 4125992.0, 'AACAGA', '2005')
I now need to create a dict keyed by the fourth element of the row, 'AACAGA'. The value the dict holds is the third element, collected across all occurrences in the result set. So, in our case, 'AACAGA' will hold a list containing multiple values from the SQL result set. The problem here is to find tandem repeats in a genome sequence. A tandem repeat is a genome read ('AACAGA') that is repeated at least three times in succession. To calculate this, I need all the values in the third index as a list keyed by the genome read, in our case 'AACAGA'. Once I have the list, I can subtract successive values in the list to see if there are three consecutive matches to the length of the read. This is what I aim to accomplish with the dictionary and lists as values.
#!/usr/bin/python3.3
import sqlite3 as sql

sequence_dict = {}
tandem_repeat = {}

def dict_generator(large_dict):
    dkeys = large_dict.keys()
    for k in dkeys:
        yield (k, large_dict[k])

def create_result_generator():
    conn = sql.connect('sequences_mt_test.sqlite', timeout=20)
    c = conn.cursor()
    try:
        conn.row_factory = sql.Row
        sql_string = "select * from sequence_info where kmer_length > 2"
        c.execute(sql_string)
    except sql.Error as error:
        print("Error retrieving information from the database : ", error.args[0])
    result_set = c.fetchall()
    if result_set:
        conn.close()
        return (row for row in result_set)

def find_longest_tandem_repeat():
    sortList = []
    for entry in create_result_generator():
        sequence_dict.setdefault(entry[3], []).append(entry[2])
    for key, value in dict_generator(sequence_dict):
        sortList = sorted(value)
        for i in range(0, (len(sortList)-1)):
            if ((sortList[i+1]-sortList[i]) == (sortList[i+2]-sortList[i+1])
                    == (sortList[i+3]-sortList[i+2]) == (len(key))):
                tandem_repeat[key] = True
                break
    print(max(k for k, v in tandem_repeat.items() if v))

if __name__ == "__main__":
    find_longest_tandem_repeat()
I got some help with this on Code Review, as @hivert suggested. Thanks. This is much better solved in SQL than in application code. I was new to SQL and hence could not write complex queries; someone helped me out with that.
SELECT *
FROM sequence_info AS middle
JOIN sequence_info AS preceding
  ON preceding.sequence_info = middle.sequence_info
 AND preceding.sequence_offset = middle.sequence_offset - length(middle.sequence_info)
JOIN sequence_info AS following
  ON following.sequence_info = middle.sequence_info
 AND following.sequence_offset = middle.sequence_offset + length(middle.sequence_info)
WHERE middle.kmer_length > 2
ORDER BY length(middle.sequence_info) DESC, middle.sequence_info,
         middle.sequence_offset;
Hope this helps someone with around the same idea. Here is a link to the thread on codereview.stackexchange.com
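The self-join above can be exercised from Python directly. Below is a sketch against a tiny in-memory table (the schema and sample rows are invented for illustration, keeping only the columns the query touches); it also shows iterating the cursor lazily, so even a huge result set never has to be loaded at once with fetchall():

```python
import sqlite3

# Tiny in-memory stand-in for the real database; column names are
# assumed from the query above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sequence_info
                (sequence_offset INTEGER, sequence_info TEXT, kmer_length INTEGER)""")
# 'AACAGA' (length 6) appears three times in succession at offsets 0, 6, 12
rows = [(0, 'AACAGA', 6), (6, 'AACAGA', 6), (12, 'AACAGA', 6), (30, 'CCT', 3)]
conn.executemany("INSERT INTO sequence_info VALUES (?, ?, ?)", rows)

query = """
SELECT middle.sequence_info
FROM sequence_info AS middle
JOIN sequence_info AS preceding
  ON preceding.sequence_info = middle.sequence_info
 AND preceding.sequence_offset = middle.sequence_offset - length(middle.sequence_info)
JOIN sequence_info AS following
  ON following.sequence_info = middle.sequence_info
 AND following.sequence_offset = middle.sequence_offset + length(middle.sequence_info)
WHERE middle.kmer_length > 2
"""
# iterating the cursor streams rows one at a time instead of
# materializing the whole result set in memory
hits = [row[0] for row in conn.execute(query)]
print(hits)  # ['AACAGA']
```

Only the middle occurrence (offset 6) has both a preceding and a following copy exactly one read-length away, so it is the only row the join returns.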
