I have a function that produces a large amount of data and stores it in a dictionary that is returned at the end of the function. However, I've been told that this is not memory-efficient and that I should use a generator in Python.
Example:
def gen_data(dataz):
    data_dict = {"uniqz": [], "random": []}
    for v in dataz:
        if dataz[v]["row"] == "ay1":
            data_dict["uniqz"].append(dataz[v]["row"])
        else:
            data_dict["random"].append(dataz[v]["val"])
    return data_dict

dataz = {"ax1": {"row": "ay1", "val": 2},
         "ax2": {"row": "ay2", "val": 3}}

print(gen_data(dataz))
How do I convert this block of code to make use of a generator/yield?
I know how to yield an individual dictionary, for example:
for i in range(num_people):
    datadict = {
        'id': i
    }
    yield datadict
But in the scenario above, all the necessary data would have to be appended to the dictionary before yielding it. How do I do this?
It depends on how you want to use the data. In your code above you are creating all the values and processing them at once; generators are useful when you want to process data one item at a time (or one batch at a time).
You can modify your code like this:
def gen_data(dataz):
    for v in dataz:
        if dataz[v]["row"] == "ay1":
            yield ("uniqz", dataz[v]["row"])
        else:
            yield ("random", dataz[v]["val"])
Then you can process your data like below:
for key, value in gen_data(dataz):
    process(key, value)
This way you save the memory occupied by the dictionary. It can also help with parallel processing: since there is some latency both in fetching new data and in processing it, a generator lets the items already yielded be processed in parallel in a different thread or on a different machine while more data is still being fetched.
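If a caller still needs the grouped dictionary, it can be rebuilt from the generator on the consuming side. A minimal sketch (collect_data is just an illustrative name):

def collect_data(pairs):
    # Rebuilds the grouped dictionary from the (key, value) pairs
    # yielded by gen_data(); only needed when the full dict really
    # has to be held in memory at once.
    data_dict = {"uniqz": [], "random": []}
    for key, value in pairs:
        data_dict[key].append(value)
    return data_dict

print(collect_data(gen_data(dataz)))  # same result as the original function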
I want to create a method/function that queries a database and returns 3 values if the searched value exists in the database table.
If there are no results or an error occurs, I need to identify that so I can use it in a conditional statement.
So far I have this:
def read_log(self, ftpsource):
    try:
        conn = pyodbc.connect(...)
        sql_query = f"""Select ID,file_modif_date,size FROM [dbo].[Table] WHERE Source='{ftpsource}'"""
        cursor = conn.cursor()
        cursor.execute(sql_query)
        for row in cursor:
            id = row[0]
            file_mtime = row[1]
            file_size = row[2]
        return ('1', id, file_mtime, file_size)
    except Exception as e:
        return ('0', '0', '0', '0')
# Call the function
sql_status, id, time, size = log.read_log('hol1a')
if sql_status != 0:
    print(id, time, size)
elif sql_status == 0:
    print("Error")
But it doesn't look good to me... So I wonder, what is the best practice for returning multiple values from a function and identifying whether there are no results or an error occurred?
Thanks!
There is no one best way to return values.
Do you return coordinates or vectors? Then it is obvious that the function returns more than one value and which values to expect:
def get_position2d():
    return x, y
If the returned values are not obvious or ordered, you may need names, so use a dictionary!
def get_ingredients():
    return {'sugar': 10, 'salt': 30, 'flour': 150, 'turbo_reactor': 1}
Do you return big datasets for machine learning? Use Pandas or numpy arrays!
import pandas as pd

def get_input_data():
    # Imagine this is a 5k+ entries dataset...
    return pd.DataFrame(dict(
        speed=[1, 2, 3, 4, 5], heat=[50, 50, 35, 24, 10], tire_type=[1, 1, 1, 1, 2]))
The one rule is that it should make sense and should be considered as "one element". For a position, x and y are one element in the sense that they are "the position". Same for ingredients, they are the "ingredients for one specific recipe".
So avoid mixing nonsensical return values. One function, one job, one output (within the current context).
Finally, you may want to consider whether the data you return should be mutable or immutable. If you are going to modify the data afterwards you may need a list; if not, a tuple is the safer choice.
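For example (a small illustration; get_readings is a made-up helper), a position that should never change can be returned as a tuple, while a list of readings that callers are expected to extend stays a list:

def get_position2d():
    # Immutable: callers cannot accidentally modify the position.
    return (4.0, 2.5)

def get_readings():
    # Mutable: callers are expected to append further readings.
    return [1.2, 1.4, 1.7]

x, y = get_position2d()   # unpacking works the same for tuples
readings = get_readings()
readings.append(2.0)      # fine, lists are mutable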
Now, about errors!
If you are talking about Python errors (exceptions), you can only have one error at a time, so the following pattern lets you catch errors and work with them:
def i_will_throw_an_error():
    try:
        do_something_bad()
    except:
        do_something_if_a_bad_thing_happens()
Now, if you are talking about error messages and values, you may (or may not) want to raise a Python error yourself if the data given to you is wrong!
def i_get_bad_data_and_i_raise_it():
    bad_data = get_bad_data()
    if bad_data['is_really_bad'] is True:
        raise ValueError('Oh no, the data we got is really bad!')
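Applying this to read_log from the question, one option (a sketch only, reusing the question's names and assuming pyodbc and the table shown above) is to return None when no row matches and let genuine database errors propagate as exceptions, so the caller can test the result directly:

import pyodbc

def read_log(self, ftpsource):
    # Returns (id, file_modif_date, size) for the first matching row,
    # or None when the source is not in the table.  Unexpected database
    # errors propagate instead of being silently swallowed.
    conn = pyodbc.connect(...)  # connection string elided, as in the question
    cursor = conn.cursor()
    # Parameterized query instead of an f-string, to avoid SQL injection.
    cursor.execute(
        "SELECT ID, file_modif_date, size FROM [dbo].[Table] WHERE Source = ?",
        ftpsource,
    )
    row = cursor.fetchone()
    return (row[0], row[1], row[2]) if row else None

# Caller
result = log.read_log('hol1a')
if result is None:
    print("Error")  # no row found for that source
else:
    id, time, size = result
    print(id, time, size)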
How do I sort data that is stored in a global list after inserting it within a method, so that the entries are stacked into another list according to the elements that were inserted? Or is it bad practice, and does it complicate things, to store data in a single global list instead of separate ones within each method and sort them afterwards?
Below is the example of the scenario
list = []
dictionary = {}

def MethodA():  # returns title
    # searches for corresponding data using beautifulsoup
    # adds data into dictionary
    # list.append(dictionary)
    # returns list

def MethodB():  # returns description
    # searches for corresponding data using beautifulsoup
    # adds data into dictionary
    # list.append(dictionary)
    # returns list
Example of wanted output:

MethodA(): [title]        # scrapes (text.title) data from the web
MethodB(): [description]  # scrapes (text.description) from the web
# print(list)
>>> list = [{title, description}, {title, description}, {title, description}, {title, description}]

Actual output:

MethodA(): [title]        # scrapes (text.title) data from the web
MethodB(): [description]  # scrapes (text.description) from the web
# print(list)
>>> list = [{title}, {title}, {description}, {description}]
There are a few examples I have seen, such as using NumPy and sorting the data in an array:

arraylist = np.array(list)
arraylist[:, 0]
# but I get a 'too many indices for array' error,
# because I have too much data loading in, and some entries
# have no data and are replaced with `None`, so the indexes are unbalanced.
I'm trying to keep it as modular as possible. I've tried the usual iteration approach, but it gets complicated because I have to nest more loops. I've also tried NumPy and enumerate, but I can't work out how to apply them; and because it's an unbalanced list (some values are returned as None), I get the error: all the input array dimensions except for the concatenation axis must match exactly.
Example : ({'Toy Box','Has a toy inside'},{'Phone', None }, {'Crayons','Used for colouring'})
Update: code sample of MethodA
def MethodA(tableName, rowName, selectedLink):
    try:
        for table_tag in selectedLink.find_all(tableName, {'class': rowName}):
            topic_title = table_tag.find('a', href=True)
            if topic_title:
                def_dict1 = {
                    'Titles': topic_title.text.replace("\n", "")}
                global_list.append(def_dict1)
                return def_dict1
    except:
        def_dict1 = None
Assuming you have something of the form:
x = [{'a'}, {'a1'}, {'b'}, {'b1'}, {'c'}, {None}]
you can do:
dictionary = {list(k)[0]: list(v)[0] for k, v in zip(x[::2], x[1::2])}
or
dictionary = {k.pop(): v.pop() for k, v in zip(x[::2], x[1::2])}
The second method will empty the sets in x, since pop() removes their elements.
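For example, with the alternating titles and descriptions from the question's sample data (assuming that ordering holds), the first version gives:

x = [{'Toy Box'}, {'Has a toy inside'}, {'Phone'}, {None}, {'Crayons'}, {'Used for colouring'}]
dictionary = {list(k)[0]: list(v)[0] for k, v in zip(x[::2], x[1::2])}
print(dictionary)
# {'Toy Box': 'Has a toy inside', 'Phone': None, 'Crayons': 'Used for colouring'}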
I would like to know if there is a safe and fast way of retrieving the index of an element in a list in Python 3.
Idea:
# how I do it
try:
    test_index = [1, 2, 3].index(4)  # raises ValueError
except:
    # handle error
    pass

# how I would like to do it:
test_index = [1, 2, 3].awesome_index(4)  # test_index = -1
if test_index == -1:
    # handle error
    pass
The reason why I would like to avoid try/except is that it appears to be a little slower than an if-statement. Since I am handling a lot of data, performance is important to me.
Yes, I know that I could implement awesome_index by simply looping through the list, but I assume there is already a function that does exactly that in a very efficient way.
There is no such thing, but I like your thinking, so let's build upon it:
If you have only one such list and run many queries against it, why not maintain a set() or a dict() right next to that list?
numbers = [1, 2, 3]
unique_numbers = set(numbers)

if 4 in unique_numbers:
    return numbers.index(4)
Dict option:
numbers = [1, 2, 3]
unique_numbers = dict(zip(numbers, range(len(numbers))))

index = unique_numbers.get(4)
if index is None:
    # Do stuff
    pass
Lookups in sets and dicts are O(1), so 4 in unique_numbers costs almost nothing. Iterating over the whole list when something doesn't exist, by contrast, is a wasted O(n) operation.
This way you get more efficient code, a better algorithm, and no exceptions!
Keep in mind the same cannot be said for 4 in numbers, as that iterates over the whole list.
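If you would still like a single call that behaves like the hypothetical awesome_index, you can wrap the dict lookup in a small helper; a sketch (build_index and safe_index are illustrative names only):

def build_index(numbers):
    # Map each value to the index of its first occurrence,
    # matching list.index() semantics for duplicates.
    index = {}
    for i, value in enumerate(numbers):
        index.setdefault(value, i)
    return index

def safe_index(index, value, default=-1):
    # O(1) average-case lookup; returns `default` instead of raising.
    return index.get(value, default)

numbers = [1, 2, 3]
lookup = build_index(numbers)
print(safe_index(lookup, 4))  # -1
print(safe_index(lookup, 2))  # 1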
I am trying to apply multiprocessing to the following function, or to the for loops inside it, but I am new to multiprocessing and have failed miserably :/
Additional info:
The json_file is loaded as a dict, and its keys contain the full path to a file (many different locations), e.g. /foo/bar/fofo/bar.h
The input_list contains paths to a file starting from a different level of the filesystem, e.g. fofo/bar.h
def matcher(json_file, input_list):
    with open(json_file) as jf:
        data = json.load(jf)
    key_list = data.keys()
    full_path_list = []
    for target in input_list:
        for key in key_list:
            if key.endswith("{}".format(target)):
                full_path_list.append(key)
    return full_path_list
Can you guys help?
Thank you in advance!
Here are examples of the key_list and the input list:
key list:
['/foo/bar/123456/.BAR/fofo/baba/dir/BAR.py', '/foo/bar/123456/.BAR/fofo/baba/dir/BAR.pyc',
 '/foo/bar/123456/.BAR/fofo/baba/dir/BAR.pye', '/foo/bar/123456/.BAR/fofo/baba/dir/BAR_fight.h',
 '/foo/bar/123456/.BAR/fofo/baba/dir/BARfoo.h', '/bar/dir/98765/.FOO/barbar/foofoo/rid/MEH.py',
 '/bar/dir/98765/.FOO/barbar/foofoo/rid/MEH.pyc', '/bar/dir/98765/.FOO/barbar/foofoo/rid/MEH.pye',
 '/bar/dir/98765/.FOO/barbar/foofoo/rid/MEH_fight.h', '/bar/dir/98765/.FOO/barbar/foofoo/rid/MEHfoo.h']
input list:
['.BAR/fofo/baba/dir/BAR.py', '.BAR/fofo/baba/dir/BAR.pyc', '.BAR/fofo/baba/dir/BAR.pye',
 '.BAR/fofo/baba/dir/BAR_fight.h', '.BAR/fofo/baba/dir/BARfoo.h', '.FOO/barbar/foofoo/rid/MEH.py',
 '.FOO/barbar/foofoo/rid/MEH.pyc', '.FOO/barbar/foofoo/rid/MEH.pye',
 '.FOO/barbar/foofoo/rid/MEH_fight.h', '.FOO/barbar/foofoo/rid/MEHfoo.h']
The length of both lists is 30000+.
It would help if you could share the dataset size and an example use case. Based on your current implementation, I'd say a list comprehension along with any would help speed things up:
full_path_list = [key for key in data.keys() if any(target in key for target in input_list)]
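Note that target in key is a substring match rather than a suffix match; if you want to keep the original endswith() semantics, str.endswith also accepts a tuple of suffixes, so a close variation is:

# Suffix-only matching, closer to the original endswith() check.
suffixes = tuple(input_list)
full_path_list = [key for key in data.keys() if key.endswith(suffixes)]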
Of course, it can be parallelized by having multiple threads or processes each take a slice of input_list and search the key_list, as sketched below. But it would be better if you shared a concrete use case with an example, to understand what you are trying to achieve.
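A rough sketch of one way to do that with multiprocessing (assuming the matching, not the JSON loading, is the bottleneck; the worker count and chunk size are arbitrary choices):

import json
from functools import partial
from multiprocessing import Pool

def match_chunk(targets, key_list):
    # Same logic as the original nested loops, restricted to one chunk of targets.
    return [key for target in targets
                for key in key_list
                if key.endswith(target)]

def matcher_parallel(json_file, input_list, workers=4, chunk_size=1000):
    with open(json_file) as jf:
        key_list = list(json.load(jf).keys())
    # Split the targets into chunks; key_list is pickled and sent to each worker.
    chunks = [input_list[i:i + chunk_size]
              for i in range(0, len(input_list), chunk_size)]
    with Pool(workers) as pool:
        results = pool.map(partial(match_chunk, key_list=key_list), chunks)
    # Flatten the per-chunk results back into a single list.
    return [key for chunk_result in results for key in chunk_result]

# On Windows and macOS this must be called under the usual guard:
# if __name__ == "__main__":
#     full_path_list = matcher_parallel(json_file, input_list)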
I have a loop that generates data and writes it to a database:
myDatabase = Database('myDatabase')

for i in range(10):
    # some code here that generates dictionaries that can be saved as activities
    myDatabase.write({('myDatabase', 'valid code'): activityDict})
Single activities created this way can be saved to the database. However, when creating more than one, the length of the database is always 1 and only the last activity makes its way into the database.
Because I have lots of very big datasets, it is not convenient to store all of them in a single dictionary and write to the database all at once.
Is there a way to incrementally add activities to an existing database?
Normal activity writing
Database.write() will replace the entire database. The best approach is to create the whole database in Python, and then write it all at once:
data = {}

for i in range(10):
    # some code here that generates data
    data['foo'] = 'bar'

Database('myDatabase').write(data)
Dynamically generating datasets
However, if you are dynamically creating aggregated datasets from an existing database, you can create the individual datasets in a custom generator. This generator will need to support the following:
__iter__: Returns the database keys. Used to check that each dataset belongs to the database being written. Therefore we only need to return the first element.
__len__: Number of datasets to write.
keys: Used to add keys to mapping.
values: Used to add activity locations to geomapping. As the locations will be the same in our source database and aggregated system database, we can just give the original datasets here.
items: The new keys and datasets.
Here is the code:
class IterativeSystemGenerator(object):
    def __init__(self, from_db_name, to_db_name):
        self.source = Database(from_db_name)
        self.new_name = to_db_name
        self.lca = LCA({self.source.random(): 1})
        self.lca.lci(factorize=True)

    def __len__(self):
        return len(self.source)

    def __iter__(self):
        yield ((self.new_name,))

    def get_exchanges(self):
        vector = self.lca.inventory.sum(axis=1)
        assert vector.shape == (len(self.lca.biosphere_dict), 1)
        return [{
            'input': flow,
            'amount': float(vector[index]),
            'type': 'biosphere',
        } for flow, index in self.lca.biosphere_dict.items()
          if abs(float(vector[index])) > 1e-17]

    def keys(self):
        for act in self.source:
            yield (self.new_name, act['code'])

    def values(self):
        for act in self.source:
            yield act

    def items(self):
        for act in self.source:
            self.lca.redo_lci({act: 1})
            obj = copy.deepcopy(act._data)
            obj['database'] = self.new_name
            obj['exchanges'] = self.get_exchanges()
            yield ((self.new_name, obj['code']), obj)
And usage:
new_name = "ecoinvent 3.2 cutoff aggregated"
new_data = IterativeSystemGenerator("ecoinvent 3.2 cutoff", new_name)
Database(new_name).write(new_data)
Limitations of this approach
If you are writing so many datasets, or so many exchanges within datasets, that you are running into memory problems, then you are probably also using the wrong tool. The current system of database tables and matrix builders uses sparse matrices; in this case, dense matrices would make much more sense. For example, the IO table backend skips the database entirely and just writes processed arrays. It will take a long time to load and create the biosphere matrix if it has 13,000 * 1,500 ≈ 20,000,000 entries. In this specific case, my first instinct is to try one of the following:
Don't write the biosphere flows into the database, but save them separately per aggregated process, and then add them after the inventory calculation.
Create a separate database for each aggregated system process.