I have a requirement to create a BigQuery dataset at runtime and assign the required roles at runtime, using Python scripting. I searched on Google for help on how to update the access setup after the dataset is created and came across the following solution:
entry = bigquery.AccessEntry(
    role='READER',
    entity_type='userByEmail',
    entity_id='sample.bigquery.dev@gmail.com')

assert entry not in dataset.access_entries

entries = list(dataset.access_entries)
entries.append(entry)
dataset.access_entries = entries

dataset = client.update_dataset(dataset, ['access_entries'])  # API request

assert entry in dataset.access_entries
My requirement is to assign multiple roles to a dataset, depending on the region for which the dataset is created, like below:
"access": [
{"role": "OWNER","groupByEmail": "gcp.abc.bigquery-admin#xyz.com"},
{"role": "READER","groupByEmail": "gcp.def.bigdata#xyz.com"},
{"role": "READER","groupByEmail": "gcp.ghi.bigquery#xyz.com"}]
Can anyone suggest the best way to get this done? I am thinking of storing the groupByEmail and role values as key/value pairs in a dictionary in a config file, and then reading and assigning each value one by one. Is there any better way to do it?
Any suggestion will be helpful.
The above code is fine for assigning access controls to a dataset at creation time, but it is not ideal for updating access:
Let's say 'sample.bigquery.dev@gmail.com' already had role='OWNER' and you run the above code: you will end up with two access entries, one with the OWNER role and one with the READER role.
To update, you probably want to check whether the entity_id already exists; if not, append the entry, otherwise overwrite it. (It's probably easier to do this through the BigQuery UI.) A minimal sketch of that check follows.
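A minimal sketch of the "append or replace" logic, assuming the same dataset and client objects as in the snippet above:

new_entry = bigquery.AccessEntry(
    role='READER',
    entity_type='userByEmail',
    entity_id='sample.bigquery.dev@gmail.com')

# Keep every entry that does not refer to the same identity, then add the new
# one, so an existing OWNER entry for that user is replaced instead of duplicated.
entries = [e for e in dataset.access_entries
           if e.entity_id != new_entry.entity_id]
entries.append(new_entry)

dataset.access_entries = entries
dataset = client.update_dataset(dataset, ['access_entries'])  # API request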
Now having said that, if you have to assign multiple roles, you can have a list of entries.
from google.cloud import bigquery

client = bigquery.Client()

dataset_id = 'test_dataset'
dataset_ref = client.dataset(dataset_id)
dataset = bigquery.Dataset(dataset_ref)
dataset.location = 'EU'
dataset = client.create_dataset(dataset)

entries_list = [
    bigquery.AccessEntry('OWNER', 'groupByEmail', 'gcp.abc.bigquery-admin@xyz.com'),
    bigquery.AccessEntry('READER', 'groupByEmail', 'gcp.def.bigdata@xyz.com'),
    bigquery.AccessEntry('READER', 'groupByEmail', 'gcp.ghi.bigquery@xyz.com')
]

entries = list(dataset.access_entries)
entries.extend(entries_list)
dataset.access_entries = entries

dataset = client.update_dataset(dataset, ['access_entries'])  # API request
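To address the config-file idea from the question, here is a hedged sketch; the file name access_config.json and its layout are my own assumptions, not a fixed convention:

# access_config.json might look like:
# {"EU": [{"role": "OWNER", "entity_id": "gcp.abc.bigquery-admin@xyz.com"},
#         {"role": "READER", "entity_id": "gcp.def.bigdata@xyz.com"}]}
import json

with open('access_config.json') as f:
    access_config = json.load(f)

region = 'EU'
entries_list = [
    bigquery.AccessEntry(item['role'], 'groupByEmail', item['entity_id'])
    for item in access_config[region]
]

entries = list(dataset.access_entries)
entries.extend(entries_list)
dataset.access_entries = entries
dataset = client.update_dataset(dataset, ['access_entries'])  # API request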
from simple_salesforce import Salesforce
import pandas as pd

# username = 'USER_NAME'
# password = 'PASSWORD'
# security_token = 'SECURITY_TOKEN'
uname = 'USER_NAME'
passwd = 'PASSWORD'
token = 'SECURITY_TOKEN'

sfdc_query = Salesforce(username=uname, password=passwd, security_token=token)

# list every object in the org
object_list = []
for x in sfdc_query.describe()["sobjects"]:
    object_list.append(x["name"])

object_list = ['ag1__c']  # my custom object in Salesforce
obj = ", ".join(object_list)

soql = 'SELECT FIELDS(CUSTOM) FROM ag1__c LIMIT 200'  # CUSTOM
sfdc_rec = sfdc_query.query_all(soql)
sfdc_df = pd.DataFrame(sfdc_rec['records'])
sfdc_df
Here I am trying to get all the records from my custom object in Salesforce, which has 1,044 rows, and I want to extract all of them.
I have tried a lot of things but it's not working. Please help me out with this; it would be a great help to me.
Thanks.
ERROR:
SalesforceMalformedRequest: Malformed request https://pujatiles-dev-ed.develop.my.salesforce.com/services/data/v52.0/query/?q=SELECT+FIELDS%28CUSTOM%29+FROM+ag1__c. Response content: [{'message': 'The SOQL FIELDS function must have a LIMIT of at most 200', 'errorCode': 'MALFORMED_QUERY'}]
If you need more rows, you need to list all the fields you need and mention them explicitly in the SELECT. The FIELDS(ALL) / FIELDS(CUSTOM) trick is limited to 200 records, period. If you don't know what fields are there, simple_salesforce has a "describe" operation for you. And then the query has to be under 100K characters.
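As a sketch of that suggestion with simple_salesforce (the ag1__c object comes from the question; the field list may need adjusting for your org):

# Build an explicit field list from describe(), then let query_all() page
# through all rows (it follows nextRecordsUrl, so no 200-row cap applies).
field_names = [f['name'] for f in sfdc_query.ag1__c.describe()['fields']]
soql = 'SELECT {} FROM ag1__c'.format(', '.join(field_names))

sfdc_rec = sfdc_query.query_all(soql)
sfdc_df = pd.DataFrame(sfdc_rec['records']).drop(columns='attributes')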
As for "size in MB" - you can't query that. You can fetch record counts per table (it's a special call, not a query: https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/resources_record_count.htm)
And then... for most objects it's count * 2 kB. There are a few exceptions (CampaignMember uses 3 kB; EmailMessage uses however many kilobytes the actual email had), but that's a good rule of thumb.
https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/dome_limits.htm is interesting too, DataStorageMB and FileStorageMB.
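If I read the linked record-count resource correctly, it can be reached from simple_salesforce via the generic restful() helper; the response layout (an 'sObjects' list with 'name' and 'count' keys) is taken from that doc and should be treated as an assumption:

# Hedged sketch: per-object record counts via the REST resource above.
counts = sfdc_query.restful('limits/recordCount', params={'sObjects': 'ag1__c'})
for entry in counts['sObjects']:
    # ~2 kB per record is the rule of thumb for data storage mentioned above
    approx_mb = entry['count'] * 2 / 1024
    print(entry['name'], entry['count'], 'records, ~{:.1f} MB'.format(approx_mb))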
I'm trying to create a dictionary and my dictionary keys keep overwriting themselves. I don't understand how I can handle this issue.
Here's the script:
import MDAnalysis as mda

u = mda.Universe('rps5.prmtop', 'rps5.inpcrd')
ca = u.select_atoms('protein')

charges = ca.charges
atom_types = ca.names
resnames = ca.resnames

charge_dict = {}
for i in range(len(charges)):
    #print(i+1, resnames[i], atom_types[i], charges[i])
    charge_dict[resnames[i]] = {}
    charge_dict[resnames[i]][atom_types[i]] = charges[i]

print(charge_dict)
The charges, atom_types and resnames are all lists, with the same number of elements.
I want my dictionary to look like this: charge_dict[resname][atom_types] = charges (charge_dict['MET']['CA'] = 0.32198, for example).
Could you please help me with this issue?
Without actually seeing a complete problem description, my guess is that your final result is that each charge_dict[name] is a dictionary with just one key. That's not because the keys "overwrite themselves". Your program overwrites them explicitly: charge_dict[resnames[i]] = {}.
What you want is to only reset the value for that key if it is not already set. You could easily do that by first testing if resnames[i] not in charge_dict:, but the Python standard library provides an even simpler mechanism: collections.defaultdict. A defaultdict is a dictionary with an associated default value creator. So you can do the following:
from collections import defaultdict
charge_dict = defaultdict(dict)
After that, you won't need to worry about initializing charge_dict[name] because a new dictionary will automatically spring into existence when the default value function (dict) is called.
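For completeness, here is how the loop from the question could look with defaultdict; the arrays are the ones produced by MDAnalysis above, and the printed value is just the example from the question:

from collections import defaultdict

charge_dict = defaultdict(dict)
for resname, atom_type, charge in zip(resnames, atom_types, charges):
    # no explicit initialisation needed: the inner dict is created on first access
    charge_dict[resname][atom_type] = charge

print(charge_dict['MET']['CA'])  # e.g. 0.32198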
This code is returning an error that I don't understand:
query = Analytic.objects(uid__type="binData")
analytics = []
for analytic in query:
    analytic.sessionId = str(analytic.sessionId)
    analytic.uid = str(analytic.uid)
    analytics.append(analytic)
    if len(analytics) % 10000 == 0:
        print(".")
    if len(analytics) == 100000:
        Analytic.objects.update(analytics, upsert=False)
        analytics = []
TypeError: update() got multiple values for argument 'upsert'
To update multiple documents at the same time, I was able to get it working using the atomic updates section of the user guide in the documentation:
atomic-updates
So your update should look something like:
Analytic.objects(query_params='value').update(set__param='value')
or
query = Analytic.objects(query_params='value')
query.update(set__param='value')
The section has a list of modifiers that you might want to look at. You still might want to do the update outside of your loop, as you'll otherwise be updating your query many times over; a sketch of a single queryset-level update follows.
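For illustration, a single queryset-level atomic update might look like the sketch below; the set__processed field is purely hypothetical, and this only works when every matched document should get the same new value (it cannot run str() per document the way the original loop does):

# Hypothetical single atomic update on the whole queryset (no Python-side loop).
Analytic.objects(uid__type="binData").update(set__processed=True, upsert=False)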
It looks like you are already looping through all the objects in the queryset.
query = Analytic.objects(uid__type="binData")
Then for every iteration of the loop that satisfies:
if len(analytics) == 100000:
    Analytic.objects.update(analytics, upsert=False)
    analytics = []
you start another query and reset analytics to an empty list, so you end up retrieving many objects per query even though you are already iterating over them. Since you are already in the loop, I think you want something like:
analytics_array = []
for analytic in query:
    analytic.sessionId = str(analytic.sessionId)
    analytic.uid = str(analytic.uid)
    analytic.save()  # persists the changes on this document
    analytics_array.append(analytic)
save() will update objects that are already created. Not sure if that's exactly what you wanted, but the error is definitely coming from the line that reads Analytic.objects.update(analytics, upsert=False). Hope this helps!
I would like to access an HDF5 file structure with h5py, where the groups and data sets are stored as following :
/Group 1/Sub Group 1/*/Data set 1/
where the asterisk signifies a sub-sub group which has a unique address. However, its address is irrelevant, since I am simply interested in the data sets it contains. How can I access any random sub-sub group without having to specify its unique address?
Here is a script for a specific case:
import h5py as h5
deleteme = h5.File("deleteme.hdf5", "w")
nobody_in_particular = deleteme.create_group("/grp_1/subgr_1/nobody_in_particular/")
dt = h5.special_dtype(vlen=str)
dataset_1 = nobody_in_particular.create_dataset("dataset_1",(1,),dtype=dt)
dataset_1.attrs[str(1)] = "Some useful data 1"
dataset_1.attrs[str(2)] = "Some useful data 2"
deleteme.close()
# access data from nobody_in_particular subgroup and do something
deleteme = h5.File("deleteme.hdf5", "r")
deleteme["/grp_1/subgr_1/nobody_in_particular/dataset_1"]
This gives output:
<HDF5 dataset "dataset_1": shape (1,), type "|O">
Now I wish to accomplish the same result, but without knowing which group in particular. Any random subgroup in place of nobody_in_particular will do for me. How can I access this random subgroup?
In other words:
deleteme["/grp_1/subgr_1/<any random sub-group>/dataset_1"]
Assuming you only want to read and not create groups/datasets, then using visit (http://docs.h5py.org/en/latest/high/group.html#Group.visit) with a suitable function will allow you to select the desired groups/datasets.
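As an illustration of that suggestion, here is a sketch that uses visititems() (the variant of visit() that also hands you the object) to collect every dataset named dataset_1, whatever the intermediate sub-group is called; the file and path names are the ones from the script above.

import h5py as h5

found = []

def collect(name, obj):
    # `name` is the path relative to the group being visited
    if isinstance(obj, h5.Dataset) and name.split('/')[-1] == 'dataset_1':
        found.append(obj)

with h5.File("deleteme.hdf5", "r") as f:
    f["/grp_1/subgr_1"].visititems(collect)
    for ds in found:
        print(ds.name, dict(ds.attrs))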
I have a loop that generates data and writes it to a database:
myDatabase = Database('myDatabase')
for i in range(10):
    # some code here that generates dictionaries that can be saved as activities
    myDatabase.write({('myDatabase', 'valid code'): activityDict})
Single activities thus created can be saved to the database. However, when creating more than one, the length of the database is always 1 and only the last activity makes its way to the database.
Because I have lots of very big datasets, it is not convenient to store all of them in a single dictionary and write to the database all at once.
Is there a way to incrementally add activities to an existing database?
Normal activity writing
Database.write() will replace the entire database. The best approach is to create the database in python, and then write the entire thing:
data = {}
for i in range(10):
    # some code here that generates an activity dictionary, e.g. activity_dict
    data[('myDatabase', 'code_{}'.format(i))] = activity_dict
Database('myDatabase').write(data)
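If the goal really is to add to a database that already exists, one hedged workaround is to merge the old contents with the new ones before writing. This assumes the brightway2 API where Database.load() returns the stored data as a dict and databases is the project's database registry; activity_dict stands in for your newly generated activity.

db = Database('myDatabase')
# start from what is already stored, if anything
data = db.load() if 'myDatabase' in databases else {}
data[('myDatabase', 'new code')] = activity_dict  # your newly generated activity
db.write(data)  # write() still replaces the database, but the old activities are kept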
Dynamically generating datasets
However, if you are dynamically creating aggregated datasets from an existing database, you can create the individual datasets in a custom generator. This generator will need to support the following:
__iter__: Returns the database keys. Used to check that each dataset belongs to the database being written. Therefore we only need to return the first element.
__len__: Number of datasets to write.
keys: Used to add keys to mapping.
values: Used to add activity locations to geomapping. As the locations will be the same in our source database and aggregated system database, we can just give the original datasets here.
items: The new keys and datasets.
Here is the code:
class IterativeSystemGenerator(object):

    def __init__(self, from_db_name, to_db_name):
        self.source = Database(from_db_name)
        self.new_name = to_db_name
        self.lca = LCA({self.source.random(): 1})
        self.lca.lci(factorize=True)

    def __len__(self):
        return len(self.source)

    def __iter__(self):
        yield ((self.new_name,))

    def get_exchanges(self):
        vector = self.lca.inventory.sum(axis=1)
        assert vector.shape == (len(self.lca.biosphere_dict), 1)
        return [{
            'input': flow,
            'amount': float(vector[index]),
            'type': 'biosphere',
        } for flow, index in self.lca.biosphere_dict.items()
          if abs(float(vector[index])) > 1e-17]

    def keys(self):
        for act in self.source:
            yield (self.new_name, act['code'])

    def values(self):
        for act in self.source:
            yield act

    def items(self):
        for act in self.source:
            self.lca.redo_lci({act: 1})
            obj = copy.deepcopy(act._data)
            obj['database'] = self.new_name
            obj['exchanges'] = self.get_exchanges()
            yield ((self.new_name, obj['code']), obj)
And usage:
new_name = "ecoinvent 3.2 cutoff aggregated"
new_data = IterativeSystemGenerator("ecoinvent 3.2 cutoff", new_name)
Database(new_name).write(new_data)
Limitations of this approach
If you are writing so many datasets, or so many exchanges within datasets, that you are running into memory problems, then you are probably using the wrong tool. The current system of database tables and matrix builders uses sparse matrices, and in this case dense matrices would make much more sense. For example, the IO table backend skips the database entirely and just writes processed arrays. It will take a long time to load and create the biosphere matrix if it has 13,000 * 1,500 ≈ 20,000,000 entries. In this specific case, my first instinct is to try one of the following:
Don't write the biosphere flows into the database, but save them separately per aggregated process, and then add them after the inventory calculation.
Create a separate database for each aggregated system process.