I have a loop that generates data and writes it to a database:
myDatabase = Database('myDatabase')
for i in range(10):
    # some code here that generates dictionaries that can be saved as activities
    myDatabase.write({('myDatabase', 'valid code'): activityDict})
Single activities created this way are saved to the database. However, when creating more than one, the length of the database is always 1 and only the last activity makes it into the database.
Because I have lots of very big datasets, it is not convenient to store all of them in a single dictionary and write them to the database all at once.
Is there a way to incrementally add activities to an existing database?
Normal activity writing
Database.write() will replace the entire database. The best approach is to build the database in Python first, and then write the entire thing:
data = {}
for i in range(10):
    # some code here that generates data
    data['foo'] = 'bar'
Database('myDatabase').write(data)
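Applied to the loop in the question, that would look something like this (a sketch that reuses the question's activityDict placeholder; the per-activity codes here are made up):
data = {}
for i in range(10):
    # some code here that generates dictionaries that can be saved as activities
    data[('myDatabase', 'valid code %d' % i)] = activityDict
Database('myDatabase').write(data)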
Dynamically generating datasets
However, if you are dynamically creating aggregated datasets from an existing database, you can create the individual datasets in a custom generator. This generator will need to support the following:
__iter__: Returns the database keys. Used to check that each dataset belongs to the database being written. Therefore we only need to return the first element.
__len__: Number of datasets to write.
keys: Used to add keys to mapping.
values: Used to add activity locations to geomapping. As the locations will be the same in our source database and aggregated system database, we can just give the original datasets here.
items: The new keys and datasets.
Here is the code:
import copy

from brightway2 import Database, LCA

class IterativeSystemGenerator(object):
    def __init__(self, from_db_name, to_db_name):
        self.source = Database(from_db_name)
        self.new_name = to_db_name
        self.lca = LCA({self.source.random(): 1})
        self.lca.lci(factorize=True)

    def __len__(self):
        return len(self.source)

    def __iter__(self):
        yield (self.new_name,)

    def get_exchanges(self):
        vector = self.lca.inventory.sum(axis=1)
        assert vector.shape == (len(self.lca.biosphere_dict), 1)
        return [{
            'input': flow,
            'amount': float(vector[index]),
            'type': 'biosphere',
        } for flow, index in self.lca.biosphere_dict.items()
          if abs(float(vector[index])) > 1e-17]

    def keys(self):
        for act in self.source:
            yield (self.new_name, act['code'])

    def values(self):
        for act in self.source:
            yield act

    def items(self):
        for act in self.source:
            self.lca.redo_lci({act: 1})
            obj = copy.deepcopy(act._data)
            obj['database'] = self.new_name
            obj['exchanges'] = self.get_exchanges()
            yield ((self.new_name, obj['code']), obj)
And usage:
new_name = "ecoinvent 3.2 cutoff aggregated"
new_data = IterativeSystemGenerator("ecoinvent 3.2 cutoff", new_name)
Database(new_name).write(new_data)
Limitations of this approach
If you are writing so many datasets, or so many exchanges within datasets, that you are running into memory problems, then you are probably also using the wrong tool. The current system of database tables and matrix builders uses sparse matrices; in this case, dense matrices would make much more sense. For example, the IO table backend skips the database entirely and just writes processed arrays. It will take a long time to load and create the biosphere matrix if it has 13,000 * 1,500 ≈ 20,000,000 entries. In this specific case, my first instinct is to try one of the following:
Don't write the biosphere flows into the database, but save them separately per aggregated process, and then add them after the inventory calculation.
Create a separate database for each aggregated system process (see the sketch after this list).
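A rough sketch of the second option, reusing the generator above but writing each aggregated process into its own small database (the database names here are made up for illustration, and the write pattern follows the same dictionary format as above):
generator = IterativeSystemGenerator("ecoinvent 3.2 cutoff", "temporary name")
for (key, obj) in generator.items():
    db_name = "aggregated - " + obj['code']
    obj['database'] = db_name
    # depending on your Brightway2 version you may need to register the database first
    Database(db_name).write({(db_name, obj['code']): obj})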
Related
When invoking Azure ML Batch Endpoints (creating jobs for inferencing), the run() method should return a pandas DataFrame or an array as explained here
However, the example shown there doesn't produce an output with headers for a CSV, which is often needed.
The first thing I tried was to return the data as a pandas DataFrame, and the result was just a simple CSV with a single column and without headers.
When trying to pass the values with several columns and their corresponding headers, to be saved later as a CSV, I get awkward square brackets (representing the Python lists) and apostrophes (representing strings).
I haven't been able to find documentation elsewhere to fix this.
This is the way I found to create clean CSV output with Python from a batch endpoint invocation in Azure ML:
import pandas as pd

# `logger` and `model` are assumed to be set up as globals in init(),
# as usual for an Azure ML scoring script.
def run(mini_batch):
    batch = []
    for file_path in mini_batch:
        df = pd.read_csv(file_path)
        # Do any data quality verification here:
        if 'id' not in df.columns:
            logger.error("ERROR: CSV file uploaded without id column")
            return None
        else:
            df['id'] = df['id'].astype(str)
        # Now we create the predictions, with the model previously loaded in init():
        df['prediction'] = model.predict(df)
        # or alternatively: df[MULTILABEL_LIST] = model.predict(df)
        batch.append(df)
    batch_df = pd.concat(batch)
    # After joining all data, we create the column headers as a string,
    # removing the square brackets and apostrophes:
    azureml_columns = str(batch_df.columns.tolist())[1:-1].replace('\'', '')
    result = []
    result.append(azureml_columns)
    # Now we parse all values as strings, row by row,
    # adding a comma between each value:
    for row in batch_df.iterrows():
        azureml_row = str(row[1].values).replace(' ', ',')[1:-1].replace('\'', '').replace('\n', '')
        result.append(azureml_row)
    logger.info("Finished Run")
    return result
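To see what those string manipulations do, here is a small standalone example with made-up data that you can run locally, without Azure ML:
import pandas as pd

df = pd.DataFrame({'id': ['a1', 'a2'], 'prediction': [0, 1]})

# headers, with brackets and apostrophes stripped
header = str(df.columns.tolist())[1:-1].replace('\'', '')
print(header)  # id, prediction

# each row as a comma-separated string
for row in df.iterrows():
    line = str(row[1].values).replace(' ', ',')[1:-1].replace('\'', '').replace('\n', '')
    print(line)  # a1,0 then a2,1
Each string in the returned list then corresponds to one line of the resulting CSV.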
I have a function that produces a large amount of data and stores it in a dictionary that is returned at the end of the function. However, I've been told that this is not memory-efficient and that I should use a data generator in Python.
Example:
def gen_data(dataz):
    data_dict = {"uniqz": [], "random": []}
    for v in dataz:
        if dataz[v]["row"] == "ay1":
            data_dict["uniqz"].append(dataz[v]["row"])
        else:
            data_dict["random"].append(dataz[v]["val"])
    return data_dict

dataz = {"ax1": {"row": "ay1", "val": 2},
         "ax2": {"row": "ay2", "val": 3}}

print(gen_data(dataz))
How do I convert this block of code and make use of data generator/yield?
I know how to yield individual dictionary data, for example:
for i in xrange(num_people):
    datadict = {
        'id': i
    }
    yield datadict
But in the above scenario it should append all necessary data into the dictionary first before yielding. How do I do this?
It depends on how you want to use the data. From your code above, you are creating all the values and processing them at once; generators are useful when you want to process data one item at a time (or one batch at a time).
Here you can modify your code like this:
def gen_data(dataz):
    for v in dataz:
        if dataz[v]["row"] == "ay1":
            yield ("uniqz", dataz[v]["row"])
        else:
            yield ("random", dataz[v]["val"])
Then you can process your data like below:
for key, value in gen_data(dataz):
    process(key, value)
This way you save the memory the dictionary would occupy. It can also be efficient for parallel processing: there is some latency both in fetching new data and in processing it, and with a generator the items already fetched from the source can be processed in a different thread or machine while the next ones are still being retrieved.
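If you still need the original dictionary shape at some point, you can rebuild it from the generator without changing its interface; a small sketch using the gen_data above:
data_dict = {"uniqz": [], "random": []}
for key, value in gen_data(dataz):
    data_dict[key].append(value)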
My gensim model is like this:
from gensim import corpora

class MyCorpus(object):
    parametersList = []

    def __init__(self, dictionary):
        self.dictionary = dictionary

    def __iter__(self):
        # for line in open('mycorpus.txt'):
        for line in texts:
            # assume there's one document per line, tokens separated by whitespace
            yield self.dictionary.doc2bow(line[0].lower().split())

if __name__ == "__main__":
    texts = [['human human interface computer'],
             ['survey user user computer system system system response time'],
             ['eps user interface system'],
             ['system human system eps'],
             ['user response time'],
             ['trees'],
             ['graph trees'],
             ['graph minors trees'],
             ['graph minors minors survey survey survey']]

    dictionary = corpora.Dictionary(line[0].lower().split() for line in texts)
    corpus = MyCorpus(dictionary)
The frequency of each token in each document is automatically evaluated.
I also can define the tf-idf model and access the tf-idf statistic for each token in each document.
model = TfidfModel(corpus)
However, I have no clue how to count (in a memory-friendly way) the number of documents in which a given word appears. How can I do that? [Sure... I could use the tf-idf values and document frequency to work it out... However, I would like to get it directly from some counting process.]
For instance, for the first document, I would like to get something like
[('human',2), ('interface',2), ('computer',2)]
since each of those tokens appears in two of the documents.
For the second document:
[('survey',2), ('user',3), ('computer',2),('system',3), ('response',2),('time',2)]
How about this?
from collections import Counter

documents = [...]
# count each token at most once per document, so summing gives document frequencies
count_dicts = [Counter(set(document.lower().split())) for document in documents]
total = sum(count_dicts, Counter())
I assumed that all your strings are different documents/files; you can make the related changes. I also made a small change to the code.
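If you are using the gensim Dictionary anyway, note that it already tracks this count for you: Dictionary.dfs maps each token id to the number of documents that contain it. A small sketch based on the dictionary and texts defined in the question:
doc_tokens = set(texts[0][0].lower().split())
doc_freqs = [(token, dictionary.dfs[dictionary.token2id[token]]) for token in doc_tokens]
print(doc_freqs)  # [('human', 2), ('interface', 2), ('computer', 2)], in some order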
How do I sort data that is stored in a global list after inserting it within a method, so that it can then be stacked into another list according to the elements that were inserted? Or is it bad practice, and does it overcomplicate things, to store the data in a single global list instead of separate ones within each method and sort them afterwards?
Below is an example of the scenario:
list = []
dictionary = {}

def MethodA():  # returns title
    # searches for corresponding data using BeautifulSoup
    # adds data into dictionary
    # list.append(dictionary)
    # returns list
    pass

def MethodB():  # returns description
    # searches for corresponding data using BeautifulSoup
    # adds data into dictionary
    # list.append(dictionary)
    # returns list
    pass
Example of wanted output:
MethodA(): [title]        # scrapes (text.title) data from the web
MethodB(): [description]  # scrapes (text.description) from the web
# print(list)
>>> list = [{title, description}, {title, description}, {title, description}, {title, description}]
Actual output:
MethodA(): [title]        # scrapes (text.title) data from the web
MethodB(): [description]  # scrapes (text.description) from the web
# print(list)
>>> list = [{title}, {title}, {description}, {description}]
There are a few examples I've seen, such as using NumPy and sorting the data in an array:
arraylist = np.array(list)
arraylist[:, 0]
# but I get a 'too many indices for array' error,
# because I have a lot of data loading in, and some entries have no data
# and are replaced with `None`, so there is an imbalance of indexes.
I'm trying to keep it as modular as possible. I've tried plain iteration, but it gets complicated because I have to nest more loops. I've also tried NumPy and enumerate, but I can't work out how to go about it. Because the list is unbalanced (some values are returned as None), I get the error: all the input array dimensions except for the concatenation axis must match exactly.
Example: ({'Toy Box', 'Has a toy inside'}, {'Phone', None}, {'Crayons', 'Used for colouring'})
Update: code sample of MethodA
def MethodA(tableName, rowName, selectedLink):
    try:
        for table_tag in selectedLink.find_all(tableName, {'class': rowName}):
            topic_title = table_tag.find('a', href=True)
            if topic_title:
                def_dict1 = {
                    'Titles': topic_title.text.replace("\n", "")}
                global_list.append(def_dict1)
                return def_dict1
    except:
        def_dict1 = None
Assuming you have something of the form:
x = [{'a'}, {'a1'}, {'b'}, {'b1'}, {'c'}, {None}]
you can do:
dictionary = {list(k)[0]: list(v)[0] for k, v in zip(x[::2], x[1::2])}
or
dictionary = {k.pop(): v.pop() for k, v in zip(x[::2], x[1::2])}
The second method will clear the sets in x.
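For example, with the list above, the first comprehension pairs consecutive elements up like this:
x = [{'a'}, {'a1'}, {'b'}, {'b1'}, {'c'}, {None}]
dictionary = {list(k)[0]: list(v)[0] for k, v in zip(x[::2], x[1::2])}
print(dictionary)  # {'a': 'a1', 'b': 'b1', 'c': None}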
I have a requirement to create a BigQuery dataset at runtime and assign the required roles at runtime, using Python scripting. I searched Google for help on how to update the access setup after the dataset is created and came across the following solution:
entry = bigquery.AccessEntry(
    role='READER',
    entity_type='userByEmail',
    entity_id='sample.bigquery.dev#gmail.com')

assert entry not in dataset.access_entries
entries = list(dataset.access_entries)
entries.append(entry)
dataset.access_entries = entries
dataset = client.update_dataset(dataset, ['access_entries'])  # API request
assert entry in dataset.access_entries
My requirement is to assign multiple roles to a dataset, depending on the region for which the dataset is created, like below:
"access": [
    {"role": "OWNER", "groupByEmail": "gcp.abc.bigquery-admin#xyz.com"},
    {"role": "READER", "groupByEmail": "gcp.def.bigdata#xyz.com"},
    {"role": "READER", "groupByEmail": "gcp.ghi.bigquery#xyz.com"}]
Can anyone suggest the best way to get this done? I am thinking of storing groupByEmail and role as key/value pairs in a dictionary in a config file, then reading and assigning each value one by one. Is there a better way to do it?
Any suggestion would be helpful.
The above code is fine for assigning access controls when the BigQuery dataset is created, but it is not ideal for updating access:
Let's say 'sample.bigquery.dev#gmail.com' already had role='OWNER'. If you run the above code, you will end up with two access entries for that user: one with the OWNER role and one with the READER role.
To update, you probably want to check whether the entity_id already exists: if not, append the entry; otherwise, overwrite it (see the sketch below). It's probably easier to do this through the BigQuery UI.
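A minimal sketch of that update logic, assuming the client, dataset, and entry objects from the snippet above:
# keep every existing entry for a different entity_id, then add the new entry;
# this effectively overwrites any previous role held by that entity_id
entries = [e for e in dataset.access_entries if e.entity_id != entry.entity_id]
entries.append(entry)
dataset.access_entries = entries
dataset = client.update_dataset(dataset, ['access_entries'])  # API request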
Now, having said that, if you have to assign multiple roles, you can pass a list of entries:
from google.cloud import bigquery

client = bigquery.Client()

dataset_id = 'test_dataset'
dataset_ref = client.dataset(dataset_id)
dataset = bigquery.Dataset(dataset_ref)
dataset.location = 'EU'
dataset = client.create_dataset(dataset)

entries_list = [bigquery.AccessEntry('OWNER', 'groupByEmail', 'gcp.abc.bigquery-admin#xyz.com'),
                bigquery.AccessEntry('READER', 'groupByEmail', 'gcp.def.bigdata#xyz.com'),
                bigquery.AccessEntry('READER', 'groupByEmail', 'gcp.ghi.bigquery#xyz.com')]

entries = list(dataset.access_entries)
entries.extend(entries_list)
dataset.access_entries = entries
dataset = client.update_dataset(dataset, ['access_entries'])  # API request
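If you keep the region-specific groups in a config file as you suggested, you can build the entries list from it; a sketch with a made-up config structure:
# hypothetical mapping of group email -> role, e.g. loaded from a JSON config per region
region_access = {
    'gcp.abc.bigquery-admin#xyz.com': 'OWNER',
    'gcp.def.bigdata#xyz.com': 'READER',
    'gcp.ghi.bigquery#xyz.com': 'READER',
}
entries_list = [bigquery.AccessEntry(role, 'groupByEmail', email)
                for email, role in region_access.items()]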