How to access BigQuery from Spark running outside of GCP - apache-spark

I'm trying to connect a Spark job running in a private datacenter to BigQuery. I have created a service account, obtained its private JSON key, and been granted read access to the dataset I want to query. But when I try integrating with Spark, I receive "User does not have bigquery.tables.create permission for dataset xxx:yyy." Do we need table-creation permission just to read data from a table using BigQuery?
Below is the response that gets printed on the console:
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "Access Denied: Dataset xxx:yyy: User does not have bigquery.tables.create permission for dataset xxx:yyy.",
    "reason" : "accessDenied"
  } ],
  "message" : "Access Denied: Dataset xxx:yyy: User does not have bigquery.tables.create permission for dataset xxx:yyy.",
  "status" : "PERMISSION_DENIED"
}
Below is the Spark code I'm using to access BigQuery:
object ConnectionTester extends App {
  val session = SparkSession.builder()
    .appName("big-query-connector")
    .config(getConf)
    .getOrCreate()

  session.read
    .format("bigquery")
    .option("viewsEnabled", true)
    .load("xxx.yyy.table1")
    .select("col1")
    .show(2)

  private def getConf : SparkConf = {
    val sparkConf = new SparkConf
    sparkConf.setAppName("big-query-connector")
    sparkConf.setMaster("local[*]")
    sparkConf.set("parentProject", "my-gcp-project")
    sparkConf.set("credentialsFile", "<path to my credentialsFile>")
    sparkConf
  }
}

For reading regular tables there is no need for the bigquery.tables.create permission. However, the code sample you've provided hints that the table is actually a BigQuery view. BigQuery views are logical references; they are not materialized on the server side, and in order for Spark to read them they first need to be materialized into a temporary table. Creating this temporary table requires the bigquery.tables.create permission.
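If you do want to read the view, one workable pattern is to point the connector at a dataset the service account is allowed to write to, so the temporary table can be created there. A minimal PySpark sketch of the relevant connector options (the same options apply to the Scala API above; the materialization dataset name below is a placeholder you would create and grant the service account write access to):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("big-query-connector")
         .getOrCreate())

df = (spark.read
      .format("bigquery")
      .option("parentProject", "my-gcp-project")
      .option("credentialsFile", "/path/to/key.json")
      .option("viewsEnabled", "true")
      # dataset where the connector may create the temporary table;
      # the service account needs bigquery.tables.create here
      .option("materializationDataset", "temp_materialization")
      .load("xxx.yyy.table1"))

df.select("col1").show(2)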

Check the code below.
Credentials
val credentials = """
  | {
  |   "type": "service_account",
  |   "project_id": "your project id",
  |   "private_key_id": "your private_key_id",
  |   "private_key": "-----BEGIN PRIVATE KEY-----\nxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\n-----END PRIVATE KEY-----\n",
  |   "client_email": "xxxxx@company.com",
  |   "client_id": "111111111111111111111111111",
  |   "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  |   "token_uri": "https://oauth2.googleapis.com/token",
  |   "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  |   "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/xxxxx40vvvvvv.iam.gserviceaccount.com"
  | }
  | """.stripMargin
Encode it as Base64 and pass it to the Spark conf.
def base64(data: String) = {
  import java.nio.charset.StandardCharsets
  import java.util.Base64
  Base64.getEncoder.encodeToString(data.getBytes(StandardCharsets.UTF_8))
}

spark.conf.set("credentials", base64(credentials))
spark
  .read
  .option("parentProject", "parentProject")
  .option("table", "dataset.table")
  .format("bigquery")
  .load()

Related

Streaming not working in Delta Live table pipeline (Databricks)?

I am working on a pipeline in Databricks > Workflows > Delta Live Tables and having an issue with the streaming part.
Expectations:
One bronze table reads the json files with AutoLoader (cloudFiles), in a streaming mode (spark.readStream)
One silver table reads and flattens the bronze table in streaming (dlt.read_stream)
Result:
When taking the root location as the source (load /*, several hundreds of files): the pipeline starts, but the number of rows/files appended is not updated in the graph until the bronze part is completed. Then the silver part starts, the number of files/rows never updates either, and the pipeline terminates with a memory error.
When taking a very small number of files (/specific_folder among hundreds): the pipeline runs well and terminates with no error, but again, the number of rows/files appended is not updated in the graph until each part is completed.
This led me to the conclusion that the pipeline does not seem to run in streaming mode.
Maybe I am missing something about the config or how to run a DLT pipeline properly, and I would need your help on this, please.
Here is the configuration of the pipeline:
{
  "id": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "clusters": [
    {
      "label": "default",
      "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::xxxxxxxxxxxx:instance-profile/iam_role_example"
      },
      "autoscale": {
        "min_workers": 1,
        "max_workers": 10,
        "mode": "LEGACY"
      }
    }
  ],
  "development": true,
  "continuous": false,
  "channel": "CURRENT",
  "edition": "PRO",
  "photon": false,
  "libraries": [
    {
      "notebook": {
        "path": "/Repos/user_example#xxxxxx.xx/dms/bronze_job"
      }
    }
  ],
  "name": "01-landing-task-1",
  "storage": "dbfs:/pipelines/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "configuration": {
    "SCHEMA": "example_schema",
    "RAW_MOUNT_NAME": "xxxx",
    "DELTA_MOUNT_NAME": "xxxx",
    "spark.sql.parquet.enableVectorizedReader": "false"
  },
  "target": "landing"
}
Here is the code of the pipeline (the query in the silver table contains many more columns with a get_json_object, ~30 actually):
import dlt
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql.window import Window
RAW_MOUNT_NAME = spark.conf.get("RAW_MOUNT_NAME")
SCHEMA = spark.conf.get("SCHEMA")
SOURCE = spark.conf.get("SOURCE")
TABLE_NAME = spark.conf.get("TABLE_NAME")
PRIMARY_KEY_PATH = spark.conf.get("PRIMARY_KEY_PATH")
@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_bronze",
    table_properties={
        "quality": "bronze"
    }
)
def bronze_job():
    load_path = f"/mnt/{RAW_MOUNT_NAME}/{SOURCE}/5e*"
    return spark \
        .readStream \
        .format("text") \
        .option("encoding", "UTF-8") \
        .load(load_path) \
        .select("value", "_metadata") \
        .withColumnRenamed("value", "json") \
        .withColumn("id", F.expr(f"get_json_object(json, '$.{PRIMARY_KEY_PATH}')")) \
        .withColumn("_etl_timestamp", F.col("_metadata.file_modification_time")) \
        .withColumn("_metadata", F.col("_metadata").cast(T.StringType())) \
        .withColumn("_etl_operation", F.lit("U")) \
        .withColumn("_etl_to_delete", F.lit(False)) \
        .withColumn("_etl_file_name", F.input_file_name()) \
        .withColumn("_etl_job_processing_timestamp", F.current_timestamp()) \
        .withColumn("_etl_table", F.lit(f"{TABLE_NAME}")) \
        .withColumn("_etl_partition_date", F.to_date(F.col("_etl_timestamp"), "yyyy-MM-dd")) \
        .select("_etl_operation", "_etl_timestamp", "id", "json", "_etl_file_name", "_etl_job_processing_timestamp", "_etl_table", "_etl_partition_date", "_etl_to_delete", "_metadata")

@dlt.table(
    name=f"{SCHEMA}_{TABLE_NAME}_silver",
    table_properties={
        "quality": "silver",
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true"
    }
)
def silver_job():
    df = dlt.read_stream(f"{SCHEMA}_{TABLE_NAME}_bronze").where("_etl_table == 'extraction'")
    return df.select(
        df.id.alias('medium_id'),
        F.get_json_object(df.json, '$.request').alias('request_id'))
Thank you very much for your help!

How to create DNS record-set in GCP using python script

I am trying to develop a Python automation script that adds DNS record-sets of type "A" to my existing GCP DNS managed zone "my-sites".
import json
from google.oauth2 import service_account
from google.cloud import dns
from google.cloud.exceptions import NotFound
gcp_dns_credentials = {
    "type": "service_account",
    "project_id": "mygcpprojectid-1122",
    "private_key_id": "myprivkeyid",
    "private_key": "-----BEGIN PRIVATE KEY-----\nmyprivatekey\n-----END PRIVATE KEY-----\n",
    "client_email": "client-mail@mygcpprojectid-1122.iam.gserviceaccount.com",
    "client_id": "myclientid",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/client-mail%40mygcpprojectid-1122.iam.gserviceaccount.com"
}

project_id = "mygcpprojectid-1122"
zone_name = "my-sites"
dns_credentials = service_account.Credentials.from_service_account_info(gcp_dns_credentials)
client = dns.Client(project=project_id, credentials=dns_credentials)
zone = client.zone(zone_name)
create_records = dns.resource_record_set.ResourceRecordSet(name="mydnsrecord2.mygcpproject.com", record_type="A", ttl=300, rrdatas=["13.66.xx.xx"], zone=zone)
This script execution neither throws an error nor creates the DNS record-set.
I referred to this doc: https://cloud.google.com/python/docs/reference/dns/latest/resource-record-set
Can someone help me :)
Just reiterating @JohnHanley's solution with Python code:
from google.oauth2 import service_account
from googleapiclient import discovery
gcp_dns_credentials = {
    "blah blah": "all dummy credentials in json format already mentioned in the question"
}

project_id = "mygcpprojectid-1122"
zone_name = "my-sites"
credentials = service_account.Credentials.from_service_account_info(gcp_dns_credentials)
service = discovery.build('dns', 'v1', credentials=credentials)

change_body = {
    "additions": [
        {
            "name": "mydnsrecord2.mygcpproject.com.",
            "type": "A",
            "ttl": 300,
            "rrdatas": ["13.66.xx.xx"]
        }
    ]
}
request = service.changes().create(project=project_id, managedZone=zone_name, body=change_body)
response = request.execute()
This script execution creates the mydnsrecord2.mygcpproject.com record-set.
I referred to this doc: https://cloud.google.com/dns/docs/reference/v1/changes/create#python
No error is reported because nothing has been done yet on the Google Cloud DNS side.
DNS changes are made atomically, which means you can make multiple changes (add, delete, etc) and apply them all at once. All changes take effect or none do (rollback).
Operations with DNS are performed via Change Sets. This means creating a list of the changes (e.g. create / modify / delete a resource record).
The add_record_set() method appends a record to the change set.
The create() method applies the change set. This method is what actually modifies your DNS server resource records.
Google Cloud DNS Change Sets
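Putting that together with the google-cloud-dns client from the question, a rough sketch of the change-set flow looks like this (the record name and IP are the question's placeholders):

from google.cloud import dns
from google.oauth2 import service_account
import time

credentials = service_account.Credentials.from_service_account_info(gcp_dns_credentials)
client = dns.Client(project="mygcpprojectid-1122", credentials=credentials)
zone = client.zone("my-sites")

# Build the record set (note the trailing dot on the DNS name)
record_set = zone.resource_record_set(
    "mydnsrecord2.mygcpproject.com.", "A", 300, ["13.66.xx.xx"])

# Stage the addition in a change set, then apply it
changes = zone.changes()
changes.add_record_set(record_set)
changes.create()  # this call is what actually modifies the zone

while changes.status != "done":
    time.sleep(5)  # poll until Cloud DNS reports the change as applied
    changes.reload()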

EMR & Spark 3.0: row-level security (filtering)

Question: is it possible to achieve row-level filtering with EMR, Spark 3.0 & S3 without purchasing expensive enterprise solutions?
I want to make sure I haven't missed anything. The EMR-Ranger doc says the integration supports Spark, but it looks totally useless, because:
If it works through the EMR Record Server: it should provide fine-grained access control, but the Ranger plugins doc says the EMR Spark plugin doesn't support write operations, writes to CSV & Avro, Delta and Hudi, row filtering, or data masking.
If it works through the EMRFS S3 Ranger plugin: only coarse-grained (db-/table-/column-level) access control. No row-level filtering or data masking (due to storage-level auth). See the Ranger plugins doc.
P.S. Hive isn't an option and Lake Formation doesn't support EMR 6.x (Spark 3.x).
Question: your Q1) is it possible to achieve row-level filtering with EMR, Spark 3.0 & S3 without expensive solutions? Your Q2) only coarse-grained (db-/table-/column-level) access control, no row-level filtering or data masking. I listed the code snippets below to handle this.
Answer: to avoid paying for something like Privacera, I implemented my own row-level filtering for reports by forcing a row-filter policy and passing it "rowFilterPolicyItems". See both 1) the JSON and 2) the Java sample/options below; hopefully you can extend this for your custom needs.
The AWS docs are not much help here:
As of Aug 2021, reads are possible, but writes (regardless of the plug-in) are not supported; see the AWS Ranger docs:
... Here are some considerations and limitations before you enable Apache Ranger integration on Amazon EMR. 1/ Row-level authorization and data masking policies are currently only supported with Apache Hive. 2/ The EMR Ranger-Spark plugin enforces fine-grained authorization when reading and writing data using the Spark API with Java, Scala, R, and Pyspark. However, writing data using Spark SQL on Ranger-Enabled Clusters is currently not supported; only reading data using SparkSQL is supported. ...
Update 1:
Even AWS's own samples don't show how to tackle this, so I rolled my own by looking at the code in Apache Ranger.
Answer/workaround: I implemented this myself with a Ranger row filter and "rowFilterPolicyItems".
Option 1: JSON setup
{
  "rowFilterInfo" : {
    "filterExpr" : "..."
  },
  "roles" : [ "...", "..." ],
  "groups" : [ "...", "..." ],
  "conditions" : [ {
    "type" : "...",
    "values" : [ "...", "..." ]
  }, {
    "type" : "...",
    "values" : [ "...", "..." ]
  } ],
  "delegateAdmin" : true,
  "accesses" : [ {
    "type" : "...",
    "isAllowed" : true
  }, {
    "type" : "...",
    "isAllowed" : true
  } ],
  "users" : [ "...", "..." ]
}
For example, to implement "rowFilterPolicyItems" you can use the sample I have below:
// you would set up / pass this in your JSON file,
// or you can also do it programmatically in Java
"rowFilterPolicyItems": [
  {
    "rowFilterInfo": {
      "filterExpr": "page='ranger-aws-emr3.tykt.org'"
    },
    "accesses": [
      {
        // **** I would tinker with this access type;
        // if you are doing it in Java, I paste a sample below
        "type": "select",
        "isAllowed": true
      }
    ],
    "users": [
      "reports"
    ],
    "groups": [],
    "conditions": [],
    "delegateAdmin": false
  }
], ...
Option 2: Java setup
In Java you can set this up programmatically as below: RangerPolicy.RangerPolicyItem policyItem = new RangerPolicy.RangerPolicyItem(); these policy items can then be added to the row-filter item. Since AWS was missing this, I had to do all of this on my own. Lots of trial and error, but it paid off.
void createTyktPolicies(RangerService createdTyktRecordsPolicyService) throws Exception {
    RangerBaseService svc = serviceMgr.getRangerServiceByService(createdTyktRecordsPolicyService, this);
    if (svc != null) {
        List<String> serviceCheckUsers = getServiceCheckUsers(createdTyktRecordsPolicyService);
        // *** your policy as needed
        List<RangerPolicy.RangerPolicyItemAccess> allAccesses = svc.getAndAllowAllAccesses();
        List<RangerPolicy> defaultPolicies = svc.getDefaultRangerPolicies();
        if (CollectionUtils.isNotEmpty(defaultPolicies)) {
            createDefaultPolicyUsersAndGroups(defaultPolicies);
            for (RangerPolicy defaultPolicy : defaultPolicies) {
                if (CollectionUtils.isNotEmpty(serviceCheckUsers) && StringUtils.equalsIgnoreCase(defaultPolicy.getService(), createdTyktRecordsPolicyService.getName())) {
                    // this is where I am creating my policy in code
                    // You CAN MODIFY THE POLICY ACCORDINGLY to your needs and try
                    RangerPolicy.RangerPolicyItem policyItem = new RangerPolicy.RangerPolicyItem();
                    policyItem.setUsers(serviceCheckUsers);
                    policyItem.setAccesses(allAccesses);
                    policyItem.setDelegateAdmin(true);
                    defaultPolicy.getPolicyItems().add(policyItem);
                }
                createPolicy(defaultPolicy);
            }
        }
    }
}

How do I connect with BigQuery via Spark SQL?

I have simple Python code that connects to BigQuery using a JSON file containing my credentials.
data = pd.read_gbq(SampleQuery, project_id='XXXXXXXX', private_key='filename.json')
Here, filename.json has the following format:
{
  "type": "service_account",
  "project_id": "projectId",
  "private_key_id": "privateKeyId",
  "private_key": "privateKey",
  "client_email": "clientEmail",
  "client_id": "clientId",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://oauth2.googleapis.com/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/clientEmail"
}
Now I need to port this code to PySpark, but I am having difficulty finding out how to query using Spark SQL. I am using an AWS EMR cluster to run this query!
Any help would be appreciated!
Since a SQLContext object is required to use Spark SQL, the SparkContext needs to be configured first to connect to BigQuery. From my point of view, the BigQuery Connector (addressed by sramalingam24 and Kenneth Jung) can be used to query data in BigQuery.
Please note that sramalingam24 provided a link with an example; the following is a summary of the code:
import json
from pyspark.sql import SQLContext

bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)
# output path for the saved results (assumed; not shown in the original summary)
output_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_output'.format(bucket)

conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

word_counts = (
    table_data
    .map(lambda record: json.loads(record[1]))
    .map(lambda x: (x['word'].lower(), int(x['word_count'])))
    .reduceByKey(lambda x, y: x + y))

sql_context = SQLContext(sc)
(word_counts
 .toDF(['word', 'word_count'])
 .write.format('json').save(output_directory))
Then you can download the connector jar for other Hadoop clusters. And Kenneth Jung provided a link with information suggesting that the --jars option can be used to include the connector (--jars gs://spark-lib/bigquery/spark-bigquery-latest.jar), which puts the jar on the driver and executor classpaths.
One of your PySpark submit parameters should be --jars with the spark-bigquery-latest.jar package; this is how I added it to a Dataproc job on Google Cloud:
gcloud dataproc jobs submit pyspark --cluster ${CLUSTER_NAME} --jars gs://spark-lib/bigquery/spark-bigquery-latest.jar --driver-log-levels root=FATAL script.py
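For reference, once the connector jar is on the classpath, a minimal PySpark sketch for querying BigQuery through Spark SQL could look like this (the project id, key-file path, and table are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bigquery-spark-sql").getOrCreate()

df = (spark.read
      .format("bigquery")
      .option("parentProject", "XXXXXXXX")         # your GCP project id
      .option("credentialsFile", "filename.json")  # service-account key file
      .option("table", "publicdata.samples.shakespeare")
      .load())

# Register the DataFrame as a temp view so it can be queried with Spark SQL
df.createOrReplaceTempView("shakespeare")
spark.sql("SELECT word, word_count FROM shakespeare LIMIT 10").show()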

Hive table creation in HDP using Apache Spark job

I have written the following Scala program in Eclipse to read a CSV file from a location in HDFS and then save that data into a Hive table [I am using the HDP 2.4 sandbox running in VMware on my local machine]:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
object HDFS2HiveFileRead {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("HDFS2HiveFileRead")
      .setMaster("local")

    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)

    println("loading data")
    val loadDF = hiveContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", ",")
      .load("hdfs://192.168.159.129:8020/employee.csv")
    println("data loaded")
    loadDF.printSchema()

    println("creating table")
    loadDF.write.saveAsTable("%s.%s".format("default", "tblEmployee2"))
    println("table created")

    val selectQuery = "SELECT * FROM default.tblEmployee2"
    println("selecting data")
    val result = hiveContext.sql(selectQuery)
    result.show()
  }
}
When I run this program from Eclipse using the Run As -> Scala Application option, it shows me the following results on the Eclipse console:
loading data
data loaded
root
|-- empid: string (nullable = true)
|-- empname: string (nullable = true)
|-- empage: string (nullable = true)
creating table
17/06/29 13:27:08 INFO CatalystWriteSupport: Initialized Parquet
WriteSupport with Catalyst schema: { "type" : "struct", "fields" :
[ {
"name" : "empid",
"type" : "string",
"nullable" : true,
"metadata" : { } }, {
"name" : "empname",
"type" : "string",
"nullable" : true,
"metadata" : { } }, {
"name" : "empage",
"type" : "string",
"nullable" : true,
"metadata" : { } } ] } and corresponding Parquet message type: message spark_schema { optional binary empid (UTF8); optional
binary empname (UTF8); optional binary empage (UTF8); }
table created
selecting data
+-----+--------+------+
|empid| empname|empage|
+-----+--------+------+
| 1201| satish| 25|
| 1202| krishna| 28|
| 1203| amith| 39|
| 1204| javed| 23|
| 1205| prudvi| 23|
+-----+--------+------+
17/06/29 13:27:14 ERROR ShutdownHookManager: Exception while deleting
Spark temp dir:
C:\Users\c.b\AppData\Local\Temp\spark-c65aa16b-6448-434f-89dc-c318f0797e10
java.io.IOException: Failed to delete:
C:\Users\c.b\AppData\Local\Temp\spark-c65aa16b-6448-434f-89dc-c318f0797e10
This shows that the CSV data has been loaded from the desired HDFS location [present in HDP] and that a table named tblEmployee2 has also been created in Hive, as I could see the results in the console. I can even read this table again and again by running any Spark job that reads data from it.
BUT the issue is that as soon as I go to my HDP 2.4 box through PuTTY and try to see this table in Hive:
1) I could not see this table there.
2) I assumed this code would create a managed/internal table in Hive, so the CSV file present at the given location in HDFS should also get moved from its base location to the Hive metastore location, but that is not happening.
3) I can also see a metastore_db folder getting created in my Eclipse workspace. Does that mean tblEmployee2 is being created on my local/Windows machine?
4) How can I resolve this issue and make my code create the Hive table in HDP? Is there any configuration I am missing here?
5) Why am I getting the last error in my execution?
Any quick response/pointer would be appreciated.
UPDATE: After a lot of thought, I added hiveContext.setConf("hive.metastore.uris", "thrift://192.168.159.129:9083")
The code moved a bit further, but some permission-related issues started appearing. I can now see this table [tblEmployee2] in Hive's default database on my VMware machine, but Spark SQL persists it in its own format:
17/06/29 22:43:21 WARN HiveContext$$anon$2: Could not persist `default`.`tblEmployee2` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format.
Hence, I am still not able to use HiveContext properly, and my above-mentioned issues 2-5 still persist.
Regards,
Bhupesh
You are running Spark in local mode.
val conf = new SparkConf()
  .setAppName("HDFS2HiveFileRead")
  .setMaster("local")
In local mode, when you call saveAsTable, Spark will try to create the table on your local machine (with a local embedded metastore, which is why you see a metastore_db folder in Eclipse). Change your configuration to run in YARN mode.
You can refer to the below URL, for details:
http://www.coding-daddy.xyz/node/7
