Can't figure out update many on mongoengine - python-3.x

This code is returning an error that I don't understand:
query = Analytic.objects(uid__type="binData")
analytics = []
for analytic in query:
    analytic.sessionId = str(analytic.sessionId)
    analytic.uid = str(analytic.uid)
    analytics.append(analytic)
    if len(analytics) % 10000 == 0:
        print(".")
    if len(analytics) == 100000:
        Analytic.objects.update(analytics, upsert=False)
        analytics = []

TypeError: update() got multiple values for argument 'upsert'

When updating multiple documents at the same time, I was able to get it working by following the atomic updates section of the user guide in the documentation (atomic-updates).
So your update should look something like
Analytic.objects(query_params='value').update(set__param='value')
or
query = Analytic.objects(query_params='value')
query.update(set__param='value')
The section has a list of modifiers that you might want to look at. You still might want to do the update outside of your loop, as otherwise you'll be issuing the update many times over.
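For illustration, here is a minimal sketch of a single server-side atomic update done outside any loop. The connection, the Analytic model and the status field are assumptions for the example, not the asker's actual schema:

from mongoengine import Document, StringField, connect

connect('analyticsdb')  # hypothetical database name

class Analytic(Document):
    uid = StringField()
    sessionId = StringField()
    status = StringField()

# The server applies one atomic update to every matching document,
# without loading them into Python first:
Analytic.objects(status='pending').update(set__status='processed')

# Or, keeping a reference to the queryset:
query = Analytic.objects(status='pending')
query.update(set__status='processed', upsert=False)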

It looks like you are already looping through all the objects in the queryset.
query = Analytic.objects(uid__type="binData")
Then, for every iteration of the loop that satisfies
if len(analytics) == 100000:
    Analytic.objects.update(analytics, upsert=False)
    analytics = []
you start another query and set analytics to an empty list, so you are retrieving many objects per query. Since you are already in a loop, I think you want:
analytics_array = []
...
if len(analytics) == 100000:
    analytics.save()
    analytics_array.append(analytics)
The save will update objects that are already created. Not sure if that's what you wanted, but the error is definitely coming from the line that reads "Analytic.objects.update(analytics, upsert=False)". Hope this helps!
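If the per-document string conversion really has to happen in Python (as in the original loop), a minimal sketch of the save-per-document approach would be the following, assuming the Analytic model from the question; save() issues one write per document, so it is simpler but slower than a server-side update for large collections:

# Convert each document in Python and persist it individually
for analytic in Analytic.objects(uid__type="binData"):
    analytic.sessionId = str(analytic.sessionId)
    analytic.uid = str(analytic.uid)
    analytic.save()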

Related

Neo4j graph blind search for any node and relationship containing an expression?

I am trying to build a blind search, given an expression/string.
Using the Python Neo4j driver I am running:
from neo4j import GraphDatabase

driver = GraphDatabase.driver("neo4j://localhost:7687")

def query_engine(tx, query):
    res = tx.run(query)
    values = [record for record in res]
    return values

def fuzzy_search(tx, search_expression):
    query = f"MATCH (n) WHERE ANY(x in keys(n) WHERE n[x] =~ '(i?){search_expression}.*') RETURN n"
    res = query_engine(tx, query)
    return res

with driver.session() as session:
    result = session.read_transaction(fuzzy_search, "kuku.*")
driver.close()
I know I need to add a full-text index to make it faster. Please advise what the best practice is for defining the full-text index in Neo4j when I want to perform a full-graph search across node/relationship properties.
For example, I am searching for 'kuku' across all nodes and relationships in my graph, and if any nodes/relationships contain kuku, I would like to return them as a result.
Additional info:
I have added an additional label (FTIndex) to all my nodes and I am able to create a full-text index, BUT(!) how can I configure it to index ALL available node properties, and be sure it will be updated if I add new ones?
You would have to enumerate the properties you want to search for in the full-text index.
Unfortunately there is no way around that.
So basically, create an index for your label FTIndex and all its properties; that index should then find your results efficiently.
In general, please don't use string interpolation but parameters, i.e. $search_expression, to avoid injection security issues,
and then:
MATCH (n)
WHERE ANY(x in keys(n)
WHERE n[x] =~ '(i?)'+$search_expression+'.*')
RETURN n
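For example, the fuzzy_search function from the question could hand the expression to the driver as a parameter instead of interpolating it into the query string. This is only a sketch; the '(i?)' prefix is kept exactly as it appears in the question:

def fuzzy_search(tx, search_expression):
    # The expression travels as a parameter, separately from the query text,
    # so it cannot break out of the Cypher statement.
    query = (
        "MATCH (n) "
        "WHERE ANY(x IN keys(n) WHERE n[x] =~ '(i?)' + $search_expression + '.*') "
        "RETURN n"
    )
    res = tx.run(query, search_expression=search_expression)
    return [record for record in res]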

Concatenating FOR loop output

I am very new to Python (first week of active use). I have some bash scripting experience but have decided to learn Python.
I have a list of multiple strings which I am using to build URLs in a for loop. The output of each URL is JSON, and I would like to concatenate the complete output into one file.
I will use a random URL for privacy reasons.
The code looks like this:
==================
numbers = ['24246', '83367', '37643', '24245', '24241', '77968', '63157', '76004', '71665']
for id in numbers:
    restAPI = s.get(urljoin(baseurl, '/test/' + id + '&test2'))
    result = restAPI.json
==================
The problem is that if I do print(result) I only get the output of the last iteration, i.e. www.google.com/test/71665&test2.
Creating a list by adding text = [] worked (the content was concatenated), but I would like to keep the original format.
text = []
for id in numbers:
    restAPI = s.get(urljoin(baseurl, '/test/' + id + '&test2'))
Does anyone have an idea how to do this?
When the for loop ends, the variable assigned inside the loop only keeps the last value; the restAPI variable is overwritten on every iteration.
If you wanted to keep each URL, you could append to a list outside the scope of the for loop every time, i.e.
restAPI = s.get(urljoin(baseurl, ...
url_list.append(restAPI.json())
Or if you just wanted to print...
for id in numbers:
    restAPI = s.get(urljoin(baseurl, ...
    print(restAPI.json())
If you added them to a list, you could perform separate functions with the new list of URLs.
If you think there might be duplicates, feel free to use a set() instead (which automatically removes the dupes inside the iterable as new values are added). You can use set_name.add(restAPI.json)
To be better, you could implement a dict and assign the id as the key and the json object as the value. So you could:
dict_obj = dict()
for id in numbers:
    restAPI = s.get(urljoin(baseurl, ...
    dict_obj[id] = restAPI.json()
That way you can query the dictionary later in the script.
Note that if you're querying many URLs, storing the JSON responses in memory might be intensive depending on your hardware.
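As a rough sketch of the original goal (concatenating the complete output into one file), assuming the session s, baseurl and numbers from the question:

import json
from urllib.parse import urljoin

results = {}
for id in numbers:
    restAPI = s.get(urljoin(baseurl, '/test/' + id + '&test2'))
    results[id] = restAPI.json()          # .json() parses the response body into Python objects

# One file holding the combined output of every iteration
with open('combined_output.json', 'w') as f:
    json.dump(results, f, indent=2)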

Python: how to append an empty row in a cycle with try-except

This is my current function:
import requests
import pandas as pd

def Function(con):
    details = []
    for ID in con:
        print(ID)
        try:
            req = requests.get('https://XYZ').json()['Data']
            details.append(req)
        except:
            pass
    return pd.DataFrame(details)
Since a lot of data is missing during the download, I created a try-except to handle possible errors and keep the loop going.
The problem is that this way many rows are skipped entirely, and I can no longer figure out which row is associated with which ID (from con) used as input.
Two possible solutions in my book:
Append an empty row when an iteration fails
Keep track of the IDs during the loop by appending an ID column
Can you help me?
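A minimal sketch combining both ideas (an explicit ID column plus an empty placeholder when a request fails); the URL and the 'Data' key are taken from the question, everything else is an assumption:

import pandas as pd
import requests

def Function(con):
    details = []
    for ID in con:
        try:
            req = requests.get('https://XYZ').json()['Data']
        except Exception:
            req = None                            # empty placeholder so the row is not lost
        details.append({'ID': ID, 'Data': req})   # the ID column keeps every row traceable
    return pd.DataFrame(details)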

Generators for processing large result sets

I am retrieving information from a sqlite DB that gives me back around 20 million rows that I need to process. This information is then transformed into a dict of lists which I need to use. I am trying to use generators wherever possible.
Can someone please take a look at this code and suggest optimizations? I am either getting a “Killed” message or it takes a really long time to run. The SQL result set part is working fine. I tested the generator code in the Python interpreter and it doesn’t have any problems. I am guessing the problem is with the dict generation.
EDIT/UPDATE FOR CLARITY:
I have 20 million rows in my result set from my sqlite DB. Each row is of the form:
(2786972, 486255.0, 4125992.0, 'AACAGA', '2005')
I now need to create a dict keyed on the fourth element of the row, 'AACAGA'. The value the dict holds is the third element, but it has to hold the values for all occurrences in the result set. So, in our case, 'AACAGA' will hold a list containing multiple values from the SQL result set. The problem here is to find tandem repeats in a genome sequence. A tandem repeat is a genome read ('AACAGA') that is repeated at least three times in succession. To calculate this, I need all the values at the third index as a list keyed by the genome read, in our case 'AACAGA'. Once I have the list, I can subtract successive values in the list to see if there are three consecutive matches to the length of the read. This is what I aim to accomplish with the dictionary and lists as values.
#!/usr/bin/python3.3
import sqlite3 as sql

sequence_dict = {}
tandem_repeat = {}

def dict_generator(large_dict):
    dkeys = large_dict.keys()
    for k in dkeys:
        yield (k, large_dict[k])

def create_result_generator():
    conn = sql.connect('sequences_mt_test.sqlite', timeout=20)
    c = conn.cursor()
    try:
        conn.row_factory = sql.Row
        sql_string = "select * from sequence_info where kmer_length > 2"
        c.execute(sql_string)
    except sql.Error as error:
        print("Error retrieving information from the database : ", error.args[0])
    result_set = c.fetchall()
    if result_set:
        conn.close()
    return (row for row in result_set)

def find_longest_tandem_repeat():
    sortList = []
    for entry in create_result_generator():
        sequence_dict.setdefault(entry[3], []).append(entry[2])
    for key, value in dict_generator(sequence_dict):
        sortList = sorted(value)
        for i in range(0, (len(sortList) - 1)):
            if ((sortList[i+1] - sortList[i]) == (sortList[i+2] - sortList[i+1])
                    == (sortList[i+3] - sortList[i+2]) == (len(key))):
                tandem_repeat[key] = True
                break
    print(max(k for k, v in tandem_repeat.items() if v))

if __name__ == "__main__":
    find_longest_tandem_repeat()
I got some help with this on Code Review, as #hivert suggested. Thanks. This is much better solved in SQL rather than in application code. I was new to SQL and hence could not write complex queries; someone helped me out with that.
SELECT *
FROM sequence_info AS middle
JOIN sequence_info AS preceding
  ON preceding.sequence_info = middle.sequence_info
 AND preceding.sequence_offset = middle.sequence_offset - length(middle.sequence_info)
JOIN sequence_info AS following
  ON following.sequence_info = middle.sequence_info
 AND following.sequence_offset = middle.sequence_offset + length(middle.sequence_info)
WHERE middle.kmer_length > 2
ORDER BY length(middle.sequence_info) DESC, middle.sequence_info, middle.sequence_offset;
Hope this helps someone working on a similar problem. Here is a link to the thread on codereview.stackexchange.com
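Independently of the SQL rewrite, the “Killed” message is most likely caused by fetchall() materialising all 20 million rows at once. A sketch of streaming the cursor lazily instead (same database file and query as in the question, schema assumed as described) would be:

import sqlite3 as sql

def iter_rows(db_path='sequences_mt_test.sqlite'):
    conn = sql.connect(db_path, timeout=20)
    try:
        c = conn.cursor()
        c.execute("select * from sequence_info where kmer_length > 2")
        for row in c:                       # the cursor yields rows lazily, one at a time
            yield row
    finally:
        conn.close()

sequence_dict = {}
for entry in iter_rows():
    # key: the read (4th column), value: list of its offsets (3rd column)
    sequence_dict.setdefault(entry[3], []).append(entry[2])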

Python whois like function

Okay so I have a file called 'whois.txt' which contains
["96363612", "#a2743, coil, charge"]
["12101258", "#a0272, climate, vault"]
["83157521", "sith"]
["33907120", "#a1321, missile, wired"]
["55553768", "#a2722, legal, illegal"]
["22686400", "#a5619, mindless, #a5637, bank"]
["97436430", "jedi, #a5770, charge, lantern, #a9491, legal"]
["91645905", "sith"]
["89514799", "lantern, #a2563, #a2693"]
["19658307", "Umbrechu"]
["56112504", "#a0473, lantern, kryptonian"]
["12195491", "riyoken"]
["53281943", "#a5135, gateway, jedi"]
["76515035", "#a4023, gateway, wired"]
["79444876", "#a2716, loyalty"]
What I'm doing here is using json, with the first number as an ID; the accounts associated with that ID are joined by ', '. So, using Python, I am using this code to try to get all the associated accounts:
def getWhois(self):
    x = []
    f = open('whois.txt', 'r')
    for line in f.readlines():
        rid, names = json.loads(line.strip())
        x.append([rid, names])
    return x

def recvWhois(self, user):
    returned = self.getWhois()
    x = []
    for data in returned:
        rid, names = data[0], data[1]
        if user in names:
            x.append(names)
    matches = list(set(', '.join(x).split(', ')))
    return matches
So what that does is get the matches for the user you are searching. But I also want to search the users in those matches. I have done this, but it feels like I would have to re-search the pulled matches an infinite number of times. For example, if I do self.recvWhois('missile') it pulls ['missile', 'wired', '#a1321'], and I would then try to search all of those accounts to link more; by now you probably see my problem, because I would have to do that x number of times depending on how many matches are linked to the previously matched accounts. If any of you have a solution to my problem it would be very appreciated.
First I would suggest maintaining an index for searching. You could use a search engine, but a Python map can also serve as a poor man's search engine. The idea is to have an inverted index where the usernames point to the records they belong to. For searching all linked accounts you can write a memoized recursive function, which will cut down the infinite recursive paths. Also, in case you have a large number of records, you can limit the recursion to a predefined maximum level. A rough sketch of this is given below.
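Here is one way that idea could look, assuming the whois.txt format from the question; the inverted index maps each name to the set of names it co-occurs with, and the search is an iterative walk with a visited set instead of unbounded recursion (function names and the depth limit are illustrative):

import json
from collections import defaultdict

def build_index(path='whois.txt'):
    index = defaultdict(set)
    with open(path) as f:
        for line in f:
            rid, names = json.loads(line.strip())
            parts = names.split(', ')
            for name in parts:
                index[name].update(parts)   # every name points at its co-occurring names
    return index

def linked_accounts(index, user, max_depth=10):
    seen, frontier = {user}, {user}
    for _ in range(max_depth):               # optional depth limit for very large graphs
        nxt = set()
        for name in frontier:
            nxt |= index.get(name, set()) - seen
        if not nxt:
            break
        seen |= nxt
        frontier = nxt
    return sorted(seen)

index = build_index()
print(linked_accounts(index, 'missile'))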
It is really hard to tell what you are trying to do, but I think you are making it too complicated. Your data structure lends itself to a dictionary. Why not load it using rid as the key and names as the values?
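For instance, a small sketch of that dictionary-based approach (same file format as above; the example lookups are illustrative):

import json

whois = {}
with open('whois.txt') as f:
    for line in f:
        rid, names = json.loads(line.strip())
        whois[rid] = names.split(', ')        # rid -> list of associated names

# e.g. look up one record, or find every rid mentioning a given name
print(whois.get('33907120'))
print([rid for rid, names in whois.items() if 'missile' in names])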
