DataProc BigQuery Connector Access Across Projects - apache-spark

I am writing a Spark job to run on a Dataproc cluster in Project A, but the job itself will pull data from a BigQuery instance in Project B using the BigQuery Connector. I have owner privileges for both projects, but the job is run using a service account. The response I'm getting in the stack trace is this:
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "Access Denied: Table ABC:DEF.ghi: The user me-compute@developer.gserviceaccount.com does not have bigquery.tables.get permission for table ABC:DEF.ghi.",
    "reason" : "accessDenied"
  } ],
  "message" : "Access Denied: Table ABC:DEF.ghi: The user me-compute@developer.gserviceaccount.com does not have bigquery.tables.get permission for table ABC:DEF.ghi."
}

As you noticed, Dataproc clusters run on behalf of service accounts instead of individual users. This is intentional: different users may be creating Dataproc clusters in a shared project where they do not want their personal permissions to leak to other members of the org using the same project, so permissions should instead be defined on service accounts, each of which represents a particular scope of workloads.
In this case, all you have to do is go into project B and grant the service account from project A a role that can access BigQuery in project B. If it's not a complex arrangement with lots of users and different teams, you could just add it as "project viewer" on project B; otherwise you'll want something more fine-grained, like a BigQuery viewer or BigQuery editor role.
Add that service account the same way you would add any user to project B.
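If you prefer to script the grant rather than use the console, here is a minimal sketch using the Cloud Resource Manager API via google-api-python-client; the project ID and role below are placeholders, and it assumes application-default credentials that can administer project B:
from googleapiclient import discovery

PROJECT_B = "project-b-id"  # placeholder: the project that owns the BigQuery dataset
MEMBER = "serviceAccount:me-compute@developer.gserviceaccount.com"  # project A's service account

crm = discovery.build("cloudresourcemanager", "v1")

# Read-modify-write project B's IAM policy to add the binding.
policy = crm.projects().getIamPolicy(resource=PROJECT_B, body={}).execute()
policy.setdefault("bindings", []).append({
    "role": "roles/bigquery.dataViewer",  # or a broader role if you prefer
    "members": [MEMBER],
})
crm.projects().setIamPolicy(resource=PROJECT_B, body={"policy": policy}).execute()
The same grant can also be made from the IAM page in the Cloud Console or with gcloud projects add-iam-policy-binding.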

Related

Google Cloud Platform - Dataflow submit batch job from python web app. Intermittent access issue when writing staging files to Google Cloud Storage

Background: My web app calls Dataflow using a created service account; prior to the Easter weekend this was working fine.
Issue: Since then, when the job submits and attempts to create the staging folders/files in my Google Cloud Storage bucket (in the same project as Dataflow), I run into the following issue. To make matters worse, it is not even consistent: at the moment I would say around 1 in 4 jobs succeeds and runs fine, whilst the other attempts receive the following error message.
OSError: Could not upload to GCS path gs://{my bucket name}/{dataflow staging location}/{dataflow jobname}: access denied.
This occurs when trying to upload the dataflow_python_sdk.tar
The HTTP error message:
{
  "code": "403",
  "errors": [ {
    "domain": "global",
    "message": "Access denied.",
    "reason": "forbidden"
  } ],
  "message": "Access denied."
}
The underlying exception is:
apitools.base.py.exceptions.HttpForbiddenError: HttpError accessing <https://www.googleapis.com/upload/storage/v1/b/{my bucket name}/o?&alt=json&name={my dataflow job name staging location}dataflow_python_sdk.tar&uploadType=multipart>: response....
The dataflow pipeline options are as follows:
{
  service_account_email={service account email},
  runner='DataflowRunner',
  project=os.getenv('project_id'),
  job_name=<job name> + '-' + start_date_time.replace('_', '-'),
  temp_location=os.getenv('GCP_DATAFLOW_OPTIMAL_N_TEMP'),
  staging_location=os.getenv('GCP_DATAFLOW_OPTIMAL_N_STAGING'),
  setup_file='./setup.py',
  region='europe-west2',
  machine_type=<machine_type>,
  max_num_workers=10,
  profile_memory=True,
  subnetwork=os.getenv('GCP_SUBNET'),
  use_public_ips=False
}
My service account has the following permissions:
Compute Instance Admin (v1)
Dataflow Developer
Dataflow Worker
Logs Writer
Service Account User
Storage Object Admin
The service account JSON is stored on the web app as an environment variable called GOOGLE_APPLICATION_CREDENTIALS.
Has anyone ever come across this and if so how did you solve it?
EDIT:
I have looked at the logging more closely and have noticed that, during the attempt to upload the .tar file to Google Cloud Storage, the process has to refresh the access token.
In failed job attempts, it logs the following message twice:
Attempting refresh to obtain initial access_token
I assume this means that it fails to refresh the access token; as far as I am aware, it can only try twice.
I am currently using apache-beam[gcp] == 2.30.0
Any solutions to this?
It seems like your pipeline is basically working as intended (failing after a certain number of errors), and this looks like a problem with your interactions with GCS, or potentially something to do with the paths you are getting from the environment variables (os.getenv). I'd suggest reaching out to support if you cannot diagnose it.
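One quick way to narrow it down is to check, outside of Dataflow, that the same credentials can write to the staging location read from the environment. A minimal sketch, assuming google-cloud-storage is installed and GOOGLE_APPLICATION_CREDENTIALS points at the same service-account key the pipeline uses (the object name below is hypothetical):
import os
from google.cloud import storage

# Staging location as configured for the pipeline, e.g. gs://bucket/path
staging = os.getenv("GCP_DATAFLOW_OPTIMAL_N_STAGING")
assert staging and staging.startswith("gs://"), f"Bad staging location: {staging!r}"

bucket_name, _, prefix = staging[len("gs://"):].partition("/")
client = storage.Client()
name = f"{prefix}/preflight-check.txt" if prefix else "preflight-check.txt"
blob = client.bucket(bucket_name).blob(name)
blob.upload_from_string("ok")  # raises Forbidden (403) if the token/ACLs are wrong
print("Staging location is writable:", staging)
If this also fails intermittently with 403, the problem is on the credentials/bucket side rather than in the Beam pipeline itself.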

Airflow authentication with LDAP based on owner

I am trying to configure my Airflow (version 2.10) LDAP authentication with RBAC.
UI access is restricted by the AD groups (multiple groups for Python Developer, ML Developer, etc.)
Members belonging to a particular group should only be able to view the DAGs created by fellow group members, while members of other groups should not.
I am able to provide access to users via AD groups, but all the users are able to see all the DAGs created. I want to restrict this access based on a defined set of owners. (This can be achieved by switching off LDAP and creating users directly in Airflow, but I want it with AD groups.)
I added filter_by_owner=True in the airflow.cfg file, but it seems to have no effect.
Any thoughts on this?
EDIT1:
From FAB,
we can configure roles & then map it to AD groups as below:
FAB_ROLES = {
    "ReadOnly_Altered": [
        [".*", "can_list"],
        [".*", "can_show"],
        [".*", "menu_access"],
        [".*", "can_get"],
        [".*", "can_info"]
    ]
}
FAB_ROLES_MAPPING = {
    1: "ReadOnly_Altered"
}
And to use this, I assume we need to have endpoints created on the application side, similar to can_list and can_show.
In the case of Airflow, I am unable to find endpoints that provide access based on owner (or based on tags). I believe if we had them, I could map them to roles and then to AD groups accordingly.
With newer versions of Airflow you can map LDAP groups to Airflow roles.
Owner-based filtering is an old feature which is deprecated and currently defunct.
You can see some examples of FAB configuration (Flask AppBuilder implements all of the authentication features):
https://flask-appbuilder.readthedocs.io/en/latest/security.html
See the part which starts with:
You can give FlaskAppBuilder roles based on LDAP roles (note, this requires AUTH_LDAP_SEARCH to be set):
From the docs:
# a mapping from LDAP DN to a list of FAB roles
AUTH_ROLES_MAPPING = {
"cn=fab_users,ou=groups,dc=example,dc=com": ["User"],
"cn=fab_admins,ou=groups,dc=example,dc=com": ["Admin"],
}
# the LDAP user attribute which has their role DNs
AUTH_LDAP_GROUP_FIELD = "memberOf"
# if we should replace ALL the user's roles each login, or only on registration
AUTH_ROLES_SYNC_AT_LOGIN = True
# force users to re-auth after 30min of inactivity (to keep roles in sync)
PERMANENT_SESSION_LIFETIME = 1800
See here about roles (including custom roles) https://airflow.apache.org/docs/apache-airflow/stable/security/access-control.html
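For context, here is a hedged sketch of how those settings typically fit together in webserver_config.py; the server address, bind DN, and group DNs below are placeholders for your AD environment, and the import is the one provided by Flask AppBuilder:
import os
from flask_appbuilder.security.manager import AUTH_LDAP

AUTH_TYPE = AUTH_LDAP
AUTH_LDAP_SERVER = "ldap://ad.example.com"            # hypothetical AD host
AUTH_LDAP_SEARCH = "dc=example,dc=com"                # required for group lookups
AUTH_LDAP_BIND_USER = "cn=airflow,ou=svc,dc=example,dc=com"
AUTH_LDAP_BIND_PASSWORD = os.environ["LDAP_BIND_PASSWORD"]
AUTH_LDAP_UID_FIELD = "sAMAccountName"

# Map AD group DNs to Airflow/FAB roles; users pick up roles from memberOf.
AUTH_ROLES_MAPPING = {
    "cn=python_devs,ou=groups,dc=example,dc=com": ["User"],
    "cn=ml_devs,ou=groups,dc=example,dc=com": ["User"],
    "cn=airflow_admins,ou=groups,dc=example,dc=com": ["Admin"],
}
AUTH_LDAP_GROUP_FIELD = "memberOf"
AUTH_ROLES_SYNC_AT_LOGIN = True
Note that this only controls which roles a user gets; restricting which DAGs each role can see still has to be done through DAG-level permissions, as described in the access-control docs linked above.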

Creating multiple bundles in Azure API for FHIR

Using Synthea I have generated information for 10 patients. I have an Azure account where I have set up the "Azure API for FHIR" service. I did all the setup and tried pushing a sample patient (as mentioned in the official docs). I am able to retrieve the patient information by patient ID as well.
However, the generated resources from Synthea are not just one resource type. They have many entries, like Patient, Organization, Claim, etc., everything bundled under one resource: a Bundle.
Something like this, but with more than 100 resource types for a patient. It's good that it covers the entire journey of the patient.
{
  "resourceType": "Bundle",
  "type": "transaction",
  "entry": [
    .....
    {
      ....
      "resourceType": "Patient"
      ....
    },
    {
      ....
      "resourceType": "Organization"
      ....
    },
  ]
}
Using Postman I tried to insert this bundle with the API below:
https://XXXXXX.azurehealthcareapis.com/Bundle/
I was able to insert multiple bundles.
However, when I query the patients using the following API:
https://XXXXXX.azurehealthcareapis.com/Patient/
not all of the patient information gets retrieved.
Here are my questions:
Inserting bundle by bundle - is that the right approach? Or
Inserting resource by resource - Patient, Organization, Patient, Organization... But this looks meaningless, because if I need to find the entire journey of a patient, how would I map it?
Is there any way I can convert each bundle to CSV files? I would like to extract the information and run a machine learning model on it.
When you need to process bundles at the FHIR endpoint, you need to POST them to the root (/) of the FHIR server. This is all described in https://www.hl7.org/fhir/http.html#transaction.
That said, the managed Azure API for FHIR only supports "batch" bundles at the moment. Bundle type transaction is not currently supported on Azure API for FHIR.
Both batch and transaction are supported on the OSS FHIR Server for Azure (https://github.com/Microsoft/fhir-server) when deployed with the SQL server persistence provider.
If you want to convert the transaction bundle that Synthea produces to a batch bundle, then you could take a look at something like this: https://github.com/hansenms/FhirTransactionToBatch
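As a rough illustration of that last point, here is a minimal sketch (the endpoint, token, and file name are placeholders) that flips a Synthea bundle's type to batch and POSTs it to the server root; note that a full conversion also has to rewrite intra-bundle references, which is what the linked FhirTransactionToBatch tool takes care of:
import json
import requests

FHIR_BASE = "https://XXXXXX.azurehealthcareapis.com"  # your FHIR endpoint
TOKEN = "..."                                         # OAuth2 bearer token for the service

with open("synthea_patient_bundle.json") as f:        # hypothetical Synthea output file
    bundle = json.load(f)

bundle["type"] = "batch"  # Azure API for FHIR processes batch bundles, not transactions

resp = requests.post(
    FHIR_BASE + "/",  # bundles are processed at the root of the FHIR server
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/fhir+json",
    },
    json=bundle,
)
print(resp.status_code)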

BigQuery Node.js API Permission Bug

I am building a Node.js server to run queries against BigQuery. For security reasons, I want this server to be read-only. For example, if I write a query with a DROP, INSERT, or ALTER statement, it should get rejected. However, something like SELECT * FROM DATASET.TABLE LIMIT 10 should be allowed.
To solve this problem, I decided to use a service account with "jobUser" level access. According to BQ documentation, that should allow me to run queries, but I shouldn't be able to "modify/delete tables".
So I created such a service account using the Google Cloud Console UI and I pass that file to the BigQuery Client Library (for Node.js) as the keyFilename parameter in the code below.
// Get service account key from the .env file
require( 'dotenv' ).config()
const BigQuery = require( '@google-cloud/bigquery' );

// Query goes here
const query = `
  SELECT *
  FROM \`dataset.table0\`
  LIMIT 10
`

// Creates a client
const bigquery = new BigQuery({
  projectId: process.env.BQ_PROJECT,
  keyFilename: process.env.BQ_SERVICE_ACCOUNT
});

// Use standard sql
const query_options = {
  query : query,
  useLegacySql : false,
  useQueryCache : false
}

// Run query and log results
bigquery
  .query( query_options )
  .then( console.log )
  .catch( console.log )
I then ran the above code against my test dataset/table in BigQuery. However, running this code results in the following error message (FYI: exemplary-city-194015 is the project ID for my test account):
{ ApiError: Access Denied: Project exemplary-city-194015: The user test-bq-jobuser@exemplary-city-194015.iam.gserviceaccount.com does not have bigquery.jobs.create permission in project exemplary-city-194015.
What is strange is that my service account (test-bq-jobuser@exemplary-city-194015.iam.gserviceaccount.com) has the 'Job User' role, and the Job User role does contain the bigquery.jobs.create permission. So that error message doesn't make sense.
In fact, I tested out all possible access control levels (dataViewer, dataEditor, ..., admin) and I get error messages for every role except the "admin" role. So either my service account isn't correctly configured or @google-cloud/bigquery has some bug. I don't want to use a service account with 'admin' level access because that allows me to run DROP TABLE-esque queries.
Solution:
I created a service account and assigned it a custom role with bigquery.jobs.create and bigquery.tables.getData permissions. And that seemed to work. I can run basic SELECT queries but DROP TABLE and other write operations fail, which is what I want.
As the error message shows, your service account doesn't have permission to create BigQuery jobs.
You need to grant it roles/bigquery.user or roles/bigquery.jobUser access; see BigQuery Access Control Roles. As that reference shows, dataViewer and dataEditor don't include "Create jobs/queries", while admin does, but you don't need admin.
To grant the required roles, you can follow the instructions in Granting Access to a Service Account for a Resource.
From command line using gcloud, run
gcloud projects add-iam-policy-binding $BQ_PROJECT \
  --member serviceAccount:$SERVICE_ACCOUNT_EMAIL \
  --role roles/bigquery.user
where BQ_PROJECT is your project ID and SERVICE_ACCOUNT_EMAIL is your service-account email/ID.
Or, from the Google Cloud Platform console, search for or add your service-account email/ID and give it the required roles.
I solved my own problem. To run queries you need both the bigquery.jobs.create and bigquery.tables.getData permissions. The Job User role has the former but not the latter. I created a custom role that has both permissions (and assigned my service account to that custom role), and now it works. I did this using the Google Cloud Console UI (IAM -> Roles -> +Add), then (IAM -> IAM -> <set service account to custom role>).
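For anyone who wants to script that instead of clicking through the console, here is a hedged sketch using the IAM API through google-api-python-client; the role ID below is a made-up placeholder:
from googleapiclient import discovery

PROJECT = "exemplary-city-194015"  # your project ID
iam = discovery.build("iam", "v1")

# Create a custom role that can run query jobs and read table data, nothing else.
iam.projects().roles().create(
    parent=f"projects/{PROJECT}",
    body={
        "roleId": "readOnlyQueryRunner",  # hypothetical custom role ID
        "role": {
            "title": "Read-only query runner",
            "includedPermissions": [
                "bigquery.jobs.create",     # run query jobs
                "bigquery.tables.getData",  # read table data
            ],
            "stage": "GA",
        },
    },
).execute()

# The service account then needs a binding to
# projects/<PROJECT>/roles/readOnlyQueryRunner, e.g. via the console IAM page
# or gcloud projects add-iam-policy-binding as shown above.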

Getting all B2B directories user is member of

Since we have Azure AD's B2B feature in GA, I am curious how to make use of B2B in multi-tenant applications. More specifically, how do I get a list of the directories of which a user is a member? For example, the Azure Portal does this by calling https://portal.azure.com/AzureHubs/api/tenants/List, and Microsoft's My Apps calls https://account.activedirectory.windowsazure.com/responsive/multidirectoryinfo to get the information - is there a public endpoint for this?
The use case is to enable B2B cooperation across a multi-tenant application which is provisioned in each user's directory so they have their own instances, but there is no way to centrally pull the information about user's directories.
A simple workaround would be to query all tenants which have the application provisioned for the user's UPN and if found, display it in the list, but imagine if there were hundreds of tenants... I believe that this is quite crucial for app developers who want to leverage the B2B functions in multi-tenant applications.
Update: It seems like there is a way to do this by accessing the Azure Service Management API, however this API and method is undocumented and I suppose that if any issues would occur, Microsoft would say that it is not a supported scenario.
Update 2: I wrote an article about the whole setup, including a sample project of how to make use of this in a scenario, it can be found here https://hajekj.net/2017/07/24/creating-a-multi-tenant-application-which-supports-b2b-users/
There is a publicly documented Azure Management API that allows you to do this: https://learn.microsoft.com/en-us/rest/api/resources/tenants
GET https://management.azure.com/tenants?api-version=2016-06-01 HTTP/1.1
Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJSUz...
...
The response body looks something like this:
{
  "value" : [{
      "id" : "/tenants/d765d508-7139-4851-b9c5-74d6dbb1edf0",
      "tenantId" : "d765d508-7139-4851-b9c5-74d6dbb1edf0"
    }, {
      "id" : "/tenants/845415f3-7a05-45c2-8376-ee67080661e2",
      "tenantId" : "845415f3-7a05-45c2-8376-ee67080661e2"
    }, {
      "id" : "/tenants/97bcb93f-8dee-48ed-afa3-356ba40f3a61",
      "tenantId" : "97bcb93f-8dee-48ed-afa3-356ba40f3a61"
    }
  ]
}
The resource for which you need to acquire an access token is https://management.azure.com/ (with the trailing slash!).
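As a quick illustration, here is a minimal sketch using the Python requests library, assuming you already hold a user-delegated bearer token for that resource (the token value below is a placeholder):
import requests

ACCESS_TOKEN = "eyJ0eXAiOiJKV1QiLCJhbGciOiJSUz..."  # user-delegated token for https://management.azure.com/

resp = requests.get(
    "https://management.azure.com/tenants",
    params={"api-version": "2016-06-01"},
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
# Each entry corresponds to a directory the signed-in user can access.
for tenant in resp.json().get("value", []):
    print(tenant["tenantId"])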
