Generators for processing large result sets - python-3.x

I am retrieving information from a sqlite DB that gives me back around 20 million rows that I need to process. This information is then transformed into a dict of lists which I need to use. I am trying to use generators wherever possible.
Can someone please take a look at this code and suggest optimizations? I either get a "Killed" message or it takes a really long time to run. The SQL result-set part is working fine. I tested the generator code in the Python interpreter and it doesn't have any problems. I am guessing the problem is with the dict generation.
EDIT/UPDATE FOR CLARITY:
I have 20 million rows in my result set from my sqlite DB. Each row is of the form:
(2786972, 486255.0, 4125992.0, 'AACAGA', '2005')
I now need to create a dict that is keyed by the fourth element of the row, 'AACAGA'. The value that the dict will hold is the third element, but it has to hold the values for all the occurrences in the result set. So, in our case, 'AACAGA' will map to a list containing multiple values from the SQL result set. The problem here is to find tandem repeats in a genome sequence. A tandem repeat is a genome read ('AACAGA') that is repeated at least three times in succession. For me to calculate this, I need all the values in the third position as a list keyed by the genome read, in our case 'AACAGA'. Once I have the list, I can subtract successive values in the list to see if there are consecutive matches to the length of the read. This is what I aim to accomplish with the dictionary and lists as values.
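For instance, here is a toy sketch of that successive-difference check (the read and offsets are made up, and it uses the "repeated at least three times in succession" definition, i.e. two equal gaps); real code would run this once per dict key:

read = 'AACAGA'
offsets = sorted([350.0, 112.0, 100.0, 106.0])   # made-up offsets where this read was found
is_tandem = any(
    offsets[i + 1] - offsets[i] == offsets[i + 2] - offsets[i + 1] == len(read)
    for i in range(len(offsets) - 2)
)
print(is_tandem)  # True: 100, 106, 112 are three consecutive copies of a 6-letter read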
#!/usr/bin/python3.3
import sqlite3 as sql

sequence_dict = {}
tandem_repeat = {}

def dict_generator(large_dict):
    dkeys = large_dict.keys()
    for k in dkeys:
        yield (k, large_dict[k])

def create_result_generator():
    conn = sql.connect('sequences_mt_test.sqlite', timeout=20)
    c = conn.cursor()
    try:
        conn.row_factory = sql.Row
        sql_string = "select * from sequence_info where kmer_length > 2"
        c.execute(sql_string)
    except sql.Error as error:
        print("Error retrieving information from the database : ", error.args[0])
    result_set = c.fetchall()
    if result_set:
        conn.close()
    return (row for row in result_set)

def find_longest_tandem_repeat():
    sortList = []
    for entry in create_result_generator():
        sequence_dict.setdefault(entry[3], []).append(entry[2])
    for key, value in dict_generator(sequence_dict):
        sortList = sorted(value)
        for i in range(0, len(sortList) - 3):
            if ((sortList[i+1] - sortList[i]) == (sortList[i+2] - sortList[i+1])
                    == (sortList[i+3] - sortList[i+2]) == len(key)):
                tandem_repeat[key] = True
                break
    print(max(k for k, v in tandem_repeat.items() if v))

if __name__ == "__main__":
    find_longest_tandem_repeat()

I got some help with this on Code Review, where #hivert suggested solving it in SQL rather than in Python code. Thanks. I was new to SQL and hence could not write complex queries, but someone there helped me out with that.
SELECT *
FROM sequence_info AS middle
JOIN sequence_info AS preceding
  ON preceding.sequence_info = middle.sequence_info
 AND preceding.sequence_offset = middle.sequence_offset - length(middle.sequence_info)
JOIN sequence_info AS following
  ON following.sequence_info = middle.sequence_info
 AND following.sequence_offset = middle.sequence_offset + length(middle.sequence_info)
WHERE middle.kmer_length > 2
ORDER BY length(middle.sequence_info) DESC, middle.sequence_info, middle.sequence_offset;
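For reference, a minimal sketch of how this query could be run from Python; the database file name comes from the original code, and taking the first row of the ordered result as the longest repeat is my own assumption about how the output would be used:

import sqlite3 as sql

# The self-join query from above, unchanged; since rows are ordered by repeat
# length (longest first), the first row is the longest tandem repeat found.
TANDEM_QUERY = """
SELECT *
FROM sequence_info AS middle
JOIN sequence_info AS preceding
  ON preceding.sequence_info = middle.sequence_info
 AND preceding.sequence_offset = middle.sequence_offset - length(middle.sequence_info)
JOIN sequence_info AS following
  ON following.sequence_info = middle.sequence_info
 AND following.sequence_offset = middle.sequence_offset + length(middle.sequence_info)
WHERE middle.kmer_length > 2
ORDER BY length(middle.sequence_info) DESC, middle.sequence_info, middle.sequence_offset
"""

conn = sql.connect('sequences_mt_test.sqlite', timeout=20)
try:
    first_row = conn.execute(TANDEM_QUERY).fetchone()
    print(first_row)  # longest tandem repeat, or None if there is none
finally:
    conn.close()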
Hope this helps someone with around the same idea. Here is a link to the thread on codereview.stackexchange.com

Related

Is there any way to make this more efficient?

I have 24 more attempts left to submit this task. I have spent hours on it and my brain does not work anymore. I am a beginner with Python; can you please help me figure out what is wrong? I would love to see correct code if possible.
Here is the task itself and the code I wrote below.
Note that you have access to all standard modules/packages/libraries of your language, but there is no access to additional libraries (numpy in Python, boost in C++, etc.).
You are given a content of CSV-file with information about set of trades. It contains the following columns:
TIME - Timestamp of a trade in format Hour:Minute:Second.Millisecond
PRICE - Price of one share
SIZE - Count of shares executed in this trade
EXCHANGE - The exchange that executed this trade
For each exchange find the one minute-window during which the largest number of trades took place on this exchange.
Note that:
You need to send source code of your program.
You have only 25 attempts to submit a solution for this task.
You have access to all standard modules/packages/libraries of your language, but no access to additional libraries (numpy in Python, boost in C++, etc.).
Input format
Input contains several lines. You can read it from standard input or from the file "trades.csv".
Each line contains information about one trade: TIME, PRICE, SIZE and EXCHANGE. Numbers are separated by comma.
Lines are listed in ascending order of timestamps. Several lines can contain the same timestamp.
Size of input file does not exceed 5 MB.
See the example below to understand the exact input format.
Output format
If the input contains information about k exchanges, print k lines to standard output.
Each line should contain a single number: the maximum number of trades during a one-minute window.
You should print answers for exchanges in lexicographical order of their names.
Sample
Input:
09:30:01.034,36.99,100,V
09:30:55.000,37.08,205,V
09:30:55.554,36.90,54,V
09:30:55.556,36.91,99,D
09:31:01.033,36.94,100,D
09:31:01.034,36.95,900,V
Output:
2
3
Notes
In the example four trades were executed on exchange “V” and two trades were executed on exchange “D”. Not all of the “V”-trades fit in one minute-window, so the answer for “V” is three.
X = []
with open('trades.csv', 'r') as tr:
    for line in tr:
        line = line.strip('\xef\xbb\xbf\r\n ')
        X.append(line.split(','))
dex = {}
for item in X:
    dex[item[3]] = []
for item in X:
    dex[item[3]].append(float(item[0][:2])*60. + float(item[0][3:5]) + float(item[0][6:8])/60. + float(item[0][9:])/60000.)
for item in dex:
    count = 1
    ccount = 1
    if dex[item][len(dex[item])-1] - dex[item][0] < 1:
        count = len(dex[item])
    else:
        for t in range(len(dex[item])-1):
            for tt in range(len(dex[item])-t-1):
                if dex[item][tt+t+1] - dex[item][t] < 1:
                    ccount += 1
                else:
                    break
            if ccount > count:
                count = ccount
            ccount = 1
    print(count)
First of all, it is not necessary to use the datetime and csv modules for such a simple case (as in Ed-Ward's example).
If we remove the colon and dot signs from the time strings, they can be converted with int() directly, which is easier than the conversion you tried in your example.
CSV features like dialects and special formatting are not used here, so I suggest a simple split(",").
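For example, the conversion the answer has in mind looks like this (timestamp taken from the sample input):

t = "09:30:01.034"
print(int(t.replace(":", "").replace(".", "")))  # 93001034; 09:31:01.034 would be 93101034, i.e. exactly 100000 larger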
Now about efficiency. Efficiency here means time complexity.
The more passes you make over the array of timestamps from beginning to end, the worse the complexity of the algorithm becomes.
So the goal is to minimize the number of passes, ideally making only one pass over all rows, and especially to avoid nested loops that scan collections from beginning to end.
For such a task it is better to use a deque instead of a tuple or list, because you can remove the first element with popleft() and append a last element with O(1) complexity.
Just append the timestamp of every trade to the end of that exchange's queue; once the difference between the current element and the first element is at least one minute, remove the first element with popleft() and continue comparing. Since each iteration appends one element and removes at most one, a queue never shrinks, so after the whole file is processed the length of each queue equals the largest one-minute window seen for that exchange.
Example with linear time complexity O(n):
from collections import deque

ex_list = {}
s = open("trades.csv").read().replace(":", "").replace(".", "")
for line in s.splitlines():
    s = line.split(",")
    curr_tm = int(s[0])
    curr_ex = s[3]
    if curr_ex not in ex_list:
        ex_list[curr_ex] = deque()
    ex_list[curr_ex].append(curr_tm)
    if curr_tm >= ex_list[curr_ex][0] + 100000:
        ex_list[curr_ex].popleft()
print("\n".join([str(len(ex_list[k])) for k in sorted(ex_list.keys())]))
This code should work:
import csv
import datetime

diff = datetime.timedelta(minutes=1)

def date_calc(start, dates):
    for i, date in enumerate(dates):
        if date >= start + diff:
            return i
    return i + 1

exchanges = {}
with open("trades.csv") as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        this_exchange = row[3]
        if this_exchange not in exchanges:
            exchanges[this_exchange] = []
        time = datetime.datetime.strptime(row[0], "%H:%M:%S.%f")
        exchanges[this_exchange].append(time)

ex_max = {}
for name, dates in exchanges.items():
    ex_max[name] = 0
    for i, d in enumerate(dates):
        x = date_calc(d, dates[i:])
        if x > ex_max[name]:
            ex_max[name] = x

print('\n'.join([str(ex_max[k]) for k in sorted(ex_max.keys())]))
Output:
2
3
( obviously please check it for yourself before uploading it :) )
I think the issue with your current code is that you don't put the output in lexicographical order of their names...
If you want to use your current code, then here is a (hopefully) fixed version:
X = []
with open('trades.csv', 'r') as tr:
    for line in tr:
        line = line.strip('\xef\xbb\xbf\r\n ')
        X.append(line.split(','))
dex = {}
counts = []
for item in X:
    dex[item[3]] = []
for item in X:
    dex[item[3]].append(float(item[0][:2])*60. + float(item[0][3:5]) + float(item[0][6:8])/60. + float(item[0][9:])/60000.)
for item in dex:
    count = 1
    ccount = 1
    if dex[item][len(dex[item])-1] - dex[item][0] < 1:
        count = len(dex[item])
    else:
        for t in range(len(dex[item])-1):
            for tt in range(len(dex[item])-t-1):
                if dex[item][tt+t+1] - dex[item][t] < 1:
                    ccount += 1
                else:
                    break
            if ccount > count:
                count = ccount
            ccount = 1
    counts.append((item, count))
counts.sort(key=lambda x: x[0])
print('\n'.join([str(x[1]) for x in counts]))
Output:
2
3
I do think you can make your life easier in the future by using Python's standard library, though :)

Slow loop aggregating rows and columns

I have a DataFrame with a column named 'UserNbr' and a column named 'Spclty', which is composed of elements like this:
[['104', '2010-01-31'], ['215', '2014-11-21'], ['352', '2016-07-13']]
where there can be 0 or more elements in the list.
Some UserNbr keys appear in multiple rows, and I wish to collapse each such group into a single row such that 'Spclty' contains all the unique entries like those in the list shown above.
To save the overhead of appending to a DataFrame, I'm appending each output row to a list instead of to the DataFrame.
My code is working, but it's taking hours to run on 0.7M rows of input. (Actually, I've never been able to keep my laptop open long enough for it to finish executing.)
Is there a better way to aggregate into a structure like this, maybe using a library that provides more data-reshaping options instead of looping over UserNbr? (In R, I'd use the data.table and dplyr libraries.)
# loop over all UserNbr:
# consolidate specialty fields into dict-like sets (to remove redundant codes);
# output one row per user to new data frame
out_rows = list()
spcltycol = df_tmp.columns.get_loc('Spclty')
all_UserNbr = df_tmp['UserNbr'].unique()
for user in all_UserNbr:
    df_user = df_tmp.loc[df_tmp['UserNbr'] == user]
    if df_user.shape[0] > 0:
        open_combined = df_user.iloc[0, spcltycol]   # capture 1st row
        for row in range(1, df_user.shape[0]):       # union with any subsequent rows
            open_combined = open_combined.union(df_user.iloc[row, spcltycol])
        new_row = df_user.drop(['Spclty', 'StartDt'], axis=1).iloc[0].tolist()
        new_row.append(open_combined)
        out_rows.append(new_row)

# construct new dataframe with no redundant UserNbr rows:
df_out = pd.DataFrame(out_rows, columns=['UserNbr', 'Spclty'])

# convert Spclty sets to dicts:
df_out['Spclty'] = [dict(df_out['Spclty'][row]) for row in range(df_out.shape[0])]
The conversion to dict gets rid of specialties that are repeated between rows. In the output, a Spclty value should look like this:
{'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}
except that there may be more key-value pairs than in any corresponding input row (resulting from aggregation over UserNbr).
I withdraw this question.
I had hoped there was an efficient way to use groupby with something else, but I haven't found any examples with a complex data structure like this one and have received no guidance.
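For what it's worth, here is a minimal sketch of the kind of groupby-based aggregation I was hoping for, assuming 'Spclty' holds lists of [code, date] pairs as shown above (the tiny frame below is made up for illustration). Iterating over groupby groups avoids the repeated boolean filtering that made the original loop slow:

import pandas as pd

# Made-up miniature input in the shape described above.
df_tmp = pd.DataFrame({
    'UserNbr': [101, 101, 202],
    'Spclty': [
        [['104', '2010-01-31'], ['215', '2014-11-21']],
        [['104', '2010-01-31'], ['352', '2016-07-13']],
        [['215', '2014-11-21']],
    ],
})

out_rows = []
for user, grp in df_tmp.groupby('UserNbr'):
    merged = {}
    for spclty_list in grp['Spclty']:       # each row's list of [code, date] pairs
        for code, date in spclty_list:
            merged[code] = date             # duplicate codes collapse onto one key
    out_rows.append({'UserNbr': user, 'Spclty': merged})

df_out = pd.DataFrame(out_rows)
print(df_out)
# UserNbr 101 -> {'104': '2010-01-31', '215': '2014-11-21', '352': '2016-07-13'}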
For anyone who gets similarly stuck with very slow aggregation problems in Python, I suggest stepping up to PySpark. I am now tackling this problem with a Databricks notebook and am making headway with the pyspark.sql.window Window functions. (Now, it only takes minutes to run a test instead of hours!)
A partial solution is in the answer here:
PySpark list() in withColumn() only works once, then AssertionError: col should be Column

Python: how to append an empty row in a cycle with try-except

This is my actual function:
import requests
import pandas as pd

def Function(con):
    details = []
    for ID in con:
        print(ID)
        try:
            req = requests.get('https://XYZ').json()['Data']
            details.append(req)
        except:
            pass
    return pd.DataFrame(details)
Since a lot of data is missing during the download, I created a try-except to handle possible errors and keep the loop going.
The problem is that this way many rows are completely skipped, and I can no longer figure out which line is associated with which ID (from con) used as input.
Two possible solutions come to mind:
Append an empty row when the exception branch is taken
Keep track of the IDs during the loop by appending an ID column
Can you help me?
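Not an authoritative answer, but a minimal sketch combining both ideas (the function name is mine and the URL is the placeholder from the question): record the ID with every row and store a placeholder when the request fails, so the output keeps exactly one row per input ID.

import requests
import pandas as pd

def function_with_ids(con):
    details = []
    for ID in con:
        try:
            data = requests.get('https://XYZ').json()['Data']   # placeholder URL from the question
        except Exception:
            data = None                 # keep a placeholder instead of skipping the ID
        details.append({'ID': ID, 'Data': data})
    return pd.DataFrame(details)        # one row per input ID; failures show up as None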

Can sqlite3 reference primary keys during a for loop?

I have a database where I have imported texts as primary keys.
I then have columns with keywords that can pertain to the texts, for example column "arson". Each of these columns has a default value of 0.
I am trying to get the SQLite3 database to read the texts, check for the presence of specific keywords, and then assign a 1 value to the keywords column, for the row where the text contained the keyword.
The example below is me trying to change the values in the arson column only for rows where the text contains the word "Arson".
The program is reading the texts and printing yes three times, indicating that three of the texts have the word "Arson" in them. However, I cannot get the individual rows to update with 1's. I have tried a few variations of the code below but seem to be stuck on this one.
#!/usr/bin/env python3
import sqlite3

sqlite_file = 'C:\\Users\\xxxx\\AppData\\Local\\Programs\\Python\\Python35-32\\database.sqlite'
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()
texts = c.execute("SELECT texts FROM database")
for articles in texts:
    for words in articles:
        try:
            if "Arson" in words:
                print('yes')
                x = articles
                c.execute("UPDATE database SET arson = 1 WHERE ID = ?" (x))
        except TypeError:
            pass
conn.commit()
conn.close()
This expression:
c.execute("UPDATE database SET arson = 1 WHERE ID = ?" (x))
always will raise a TypeError, because you are trying to treat the string as a function. You are basically doing "..."(argument), as if "..." were callable.
You'd need to add some commas for it to be an attempt to pass in x as a SQL parameter:
c.execute("UPDATE database SET arson = 1 WHERE ID = ?", (x,))
The first comma separates the two arguments passed to c.execute(), so now you pass a query string, and a separate sequence of parameters.
The second comma makes (..,) a tuple with one element in it. It is the comma that matters there, although the (...) parentheses are still needed to disambiguate what the comma represents.
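A quick way to see the difference the trailing comma makes:

print(type(("abc")))    # <class 'str'>   - just a parenthesised string
print(type(("abc",)))   # <class 'tuple'> - a one-element tuple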
You can drop the try...except TypeError altogether. If the code is still raising TypeError exceptions, you still have a bug.
Four hours later I have finally been able to fix this. I added the commas as recommended above; however, this led to other issues, as the code did not execute the entire loop correctly. To do this, I had to add another cursor object and use the second cursor inside my loop. The revised code may be seen below:
#!/usr/bin/env python3
import sqlite3

sqlite_file = 'C:\\Users\\xxxx\\AppData\\Local\\Programs\\Python\\Python35-32\\database.sqlite'
conn = sqlite3.connect(sqlite_file)
c = conn.cursor()
c2 = conn.cursor()
atexts = c.execute("SELECT texts FROM database")
for articles in atexts:
    for words in articles:
        if "arson" in words:
            print('yes')
            c2.execute("UPDATE database SET arson = 1 WHERE texts = ?", (words,))
conn.commit()
conn.close()

How to Transpose an rdd in pyspark (list is not a matrix)

I have an rdd where the list of strings are ['abc', 'ccd', 'xyz'...'axd']
When I "print rdd.take(2), I am expecting it to return me ['abc', 'ccd'], but instead it gives me everything. I am very new to spark or python so please forgive me if this is a dumb question. is there a way to transpose this list to rows?
Eventually i need to convert this into a dataframe and insert into a hive table.
Here is a pice of my code
domainsrdd = zonerdd.reduceByKey(lambda x,y: x + ' ' + y).map(lambda a: (a[0], a[1].split(' ')))
print domainsrdd.take(2)
[(u'COOL', [u'shirtmaker.cool', u'videocandy.cool', u'the-happy-factory.cool', u'vic.cool', u'atl.cool',...... u'booze.cool'])]
def sampler(l, tldvar):
    tld = l[0]
    domain_data = l[1]
    domains = []
    ct = tldvar.value[tld]
    for item in domain_data:
        domains.extend([item])
        if len(domains) == ct:
            break
    return domains

domainslist = domainsrdd.map(lambda l: sampler(l, tldvar))
print domainslist.take(2) # still returns everything
[[u'shirtmaker.cool', u'videocandy.cool', u'the-happy-factory.cool',...])]
Long story short, I am trying to loop through a set of domains grouped by TLD and produce a sample of those domain names. tldvar is a dictionary that holds the number of domains I need to return for each specific TLD (TLD = com, net, org, etc.).
domainslist is of type RDD[Array[String]], so when you do a take you are going to get an Array[Array[String]], which in your case is filled with arrays that are never limited down based on what you are saying (len(domains) == ct is never true).
This is resolved: I used flatMap instead of map. Basically this worked:
domainslist = domainsrdd.flatMap(lambda l: sampler(l, tldvar))
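To make the map vs. flatMap difference concrete, a small sketch (not from the original code; it assumes a SparkContext named sc is already available, and the data is made up):

rdd = sc.parallelize([('COOL', ['a.cool', 'b.cool']), ('NET', ['c.net'])])

mapped = rdd.map(lambda kv: kv[1])       # RDD of lists
flat = rdd.flatMap(lambda kv: kv[1])     # RDD of individual strings

print(mapped.take(2))   # [['a.cool', 'b.cool'], ['c.net']]  -- each element is a whole list
print(flat.take(2))     # ['a.cool', 'b.cool']               -- take(2) now limits to two domains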
