Fetching username inside notebook in Databricks on high concurrency cluster? - databricks

While trying to fetch user data on a high concurrency cluster, I am facing this issue. I am using the command below to fetch the user details:
dbutils.notebook.entry_point.getDbutils().notebook().getContext().tags().apply('user')
Below is the error log for the run. Any help would be really appreciated.
Py4JError: An error occurred while calling o475.tags. Trace:
py4j.security.Py4JSecurityException: Method public scala.collection.immutable.Map com.databricks.backend.common.rpc.CommandContext.tags() is not whitelisted on class class com.databricks.backend.common.rpc.CommandContext
at py4j.security.WhitelistingPy4JSecurityManager.checkCall(WhitelistingPy4JSecurityManager.java:409)
at py4j.Gateway.invoke(Gateway.java:294)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:251)
at java.lang.Thread.run(Thread.java:748)

You can retrieve the information by using the dbutils command:
dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()

You can use the code below:
import json
# parse x:
y = dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
res = json.loads(y)
print(res['tags']['user'])
Note: tested code.

I've been using this:
user_id = spark.sql('select current_user() as user').collect()[0]['user']
current_user() is a documented SQL function in Databricks
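If you want one call that works on both standard and high concurrency clusters, a small helper along these lines can try the dbutils context first and fall back to the SQL function shown above. This is only a sketch; get_current_user is a hypothetical name, not a Databricks API:
def get_current_user(spark, dbutils):
    """Best-effort lookup of the notebook user (hypothetical helper)."""
    try:
        # Works on standard clusters; may be blocked by the py4j whitelist on high concurrency clusters.
        return dbutils.notebook.entry_point.getDbutils().notebook().getContext().userName().get()
    except Exception:
        # Fall back to the documented SQL function, which also works on high concurrency clusters.
        return spark.sql("SELECT current_user() AS user").collect()[0]["user"]
print(get_current_user(spark, dbutils))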

This is a gross workaround, but I haven't found anything much better yet.
import uuid
import shutil
# Create a unique temporary table location
tmpTable = f'/tmp/identifier/{uuid.uuid4()}'
# Write a single line to a delta format table.
spark.range(1).write.format('delta').save(tmpTable)
# Extract the username from the delta history
username = spark.sql(f'DESCRIBE HISTORY delta.`{tmpTable}`').select('userName').collect()[0]['userName']
# Delete the temporary table
shutil.rmtree('/dbfs'+tmpTable)

Related

Databricks: I met with an issue when I was trying to use autoloader to read json files from Azure ADLS Gen2

I met with an issue when I was trying to use Auto Loader to read JSON files from Azure ADLS Gen2. I am getting this issue for specific files only. I checked that the files are good and not corrupted.
Following is the issue:
Caused by: java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to string, but class Integer found.
at scala.Predef$.require(Predef.scala:281)
com.databricks.sql.io.FileReadException: Error while reading file /mnt/Source/kafka/customer_raw/filtered_data/year=2022/month=11/day=9/hour=15/part-00000-31413bcf-0a8f-480f-8d45-6970f4c4c9f7.c000.json.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:598)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:422)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(null:-1)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to string, but class Integer found.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:274)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at java.lang.Thread.run(Thread.java:750)
I am using a Delta Live Tables pipeline. Here is the code:
@dlt.table(name = tablename,
    comment = "Create Bronze Table",
    table_properties={
        "quality": "bronze"
    }
)
def Bronze_Table_Create():
    return (
        spark
        .readStream
        .schema(schemapath)
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", schemalocation)
        .option("cloudFiles.inferColumnTypes", "false")
        .option("cloudFiles.schemaEvolutionMode", "rescue")
        .load(sourcelocation)
    )
I got the issue resolved. The problem was that, by mistake, we had duplicate columns in the schema file; because of that it was showing this error. However, the error is totally misleading, which is why I wasn't able to rectify it sooner.
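Since the root cause was a duplicated column name in the schema file, a small pre-flight check along these lines can surface the problem before Auto Loader turns it into the misleading Literal error. This is only a sketch and assumes the schema is stored as Spark StructType JSON; the file path is a placeholder for whatever schemapath points at:
import json
from collections import Counter
from pyspark.sql.types import StructType
# Load the schema JSON and flag any duplicate column names (case-insensitive).
with open("/dbfs/path/to/schema.json") as f:  # placeholder path
    schema = StructType.fromJson(json.load(f))
duplicates = [name for name, count in Counter(field.name.lower() for field in schema.fields).items() if count > 1]
if duplicates:
    raise ValueError(f"Duplicate columns in schema file: {duplicates}")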

spark GroupBy throws StateSchemaNotCompatible exception with different "Existing key schema"

I am reading and writing events from Event Hub in Spark, aggregating based on a few keys like this:
val df1 = df0
  .groupBy(
    colKey,
    colTimestamp
  )
  .agg(
    collect_list(
      struct(
        colCreationTimestamp,
        colRecordId
      )
    ).as("Records")
  )
But I am getting this error at runtime:
Error
Caused by: org.apache.spark.sql.execution.streaming.state.StateSchemaNotCompatible: Provided schema doesn't match to the schema for existing state! Please note that Spark allow difference of field name: check count of fields and data type of each field.
- Provided key schema: StructType(StructField(Key,StringType,true), StructField(Timestamp,TimestampType,true)
- Provided value schema: StructType(StructField(buf,BinaryType,true))
- Existing key schema: StructType(StructField(_1,StringType,true), StructField(_2,TimestampType,true))
- Existing value schema: StructType(StructField(buf,BinaryType,true))
If you want to force running query without schema validation, please set spark.sql.streaming.stateStore.stateSchemaCheck to false.
Please note running query with incompatible schema could cause indeterministic behavior.
at org.apache.spark.sql.execution.streaming.state.StateSchemaCompatibilityChecker.check(StateSchemaCompatibilityChecker.scala:60)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$2(StateStore.scala:487)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at scala.util.Try$.apply(Try.scala:213)
at org.apache.spark.sql.execution.streaming.state.StateStore$.$anonfun$getStateStoreProvider$1(StateStore.scala:487)
at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
The exception doesn't contain the exact line number to reference in my code, so I narrowed it down to this code based on the provided key schema columns; also, if I change the groupBy key columns, the error changes accordingly.
I tried different things, like an explicit df0.select() before the groupBy for the required columns, to ensure the incoming data had those columns, but got the same error.
Can someone suggest how it is picking the existing key schema, or what I should look for to resolve this?
Update [solved for me]
While uploading the records to Event Hub, the Event Hubs Spark library stores state in the checkpoint directory, where it had old state that was causing the StateSchemaNotCompatible issue; pointing to a new checkpoint directory solved the issue for me.
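For context, the streaming state store validates its schema against whatever is stored under the query's checkpoint location, so the fix amounts to starting the query against a fresh, empty checkpoint directory. A minimal PySpark sketch (the question's code is Scala; the console sink and new_checkpoint_path are placeholders for the real sink and path):
query = (df1.writeStream
    .outputMode("update")                               # aggregation without a watermark needs update/complete mode
    .format("console")                                  # illustrative sink only
    .option("checkpointLocation", new_checkpoint_path)  # point at a NEW, empty directory so no old state schema is loaded
    .start())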

Error writing data to Bigquery using Databricks Pyspark

I run a daily job to write data to BigQuery using Databricks PySpark. There was a recent configuration update for Databricks (https://docs.databricks.com/data/data-sources/google/bigquery.html) which caused the job to fail. I followed all the steps in the docs. Reading data works again, but writing throws the following error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS not found
I also tried adding the configuration directly in the code (as advised for similar errors in Spark), but it did not help:
spark._jsc.hadoopConfiguration().set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
spark._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'true')
spark._jsc.hadoopConfiguration().set('google.cloud.auth.service.account.json.keyfile', "<path-to-key.json>")
My code is:
upload_table_dataset = 'testing_dataset'
upload_table_name = 'testing_table'
upload_table = upload_table_dataset + '.' + upload_table_name
(import_df.write.format('bigquery')
.mode('overwrite')
.option('project', 'xxxxx-test-project')
.option('parentProject', 'xxxxx-test-project')
.option('temporaryGcsBucket', 'xxxxx-testing-bucket')
.option('table', upload_table)
.save()
)
You need to install the GCS connector on your cluster first
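Beyond installing the gcs-connector JAR as a cluster library, note that the class in the error (GoogleHadoopFS) is the connector's AbstractFileSystem implementation, so if you set the configuration in code it typically also needs the fs.AbstractFileSystem.gs.impl key. A hedged sketch extending the question's own config lines, assuming the connector JAR is already installed on the cluster:
# Standard Hadoop keys for the GCS connector; only effective once the connector JAR is on the cluster.
hconf = spark._jsc.hadoopConfiguration()
hconf.set('fs.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
hconf.set('fs.AbstractFileSystem.gs.impl', 'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS')  # the class from the error
hconf.set('fs.gs.auth.service.account.enable', 'true')
hconf.set('google.cloud.auth.service.account.json.keyfile', '<path-to-key.json>')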

Incremental data load from Redshift to S3 using Pyspark and Glue Jobs

I have created a pipeline where data ingestion takes place between Redshift and S3. I was able to do the complete load using the method below:
def readFromRedShift(spark: SparkSession, schema, tablename):
    table = str(schema) + str(".") + str(tablename)
    (url, Properties, host, port, db) = con.getConnection("REDSHIFT")
    df = spark.read.jdbc(url=url, table=table, properties=Properties)
    return df
Where getConnection is a different method in a separate class that handles all the Redshift-related details. Later on, I used this method to create a data frame and wrote the results to S3, which worked like a charm.
Now I want to load the incremental data. Will enabling the Glue Job Bookmarks option help me, or is there another way to do it? I followed this official documentation, but it was of no help for my problem statement. If I run it for the first time it will load the complete data; if I rerun it, will it be able to load only the newly arrived records?
You are right: it can be achieved with job bookmarks, but at the same time it can be a bit tricky.
Please refer to this doc: https://aws.amazon.com/blogs/big-data/load-data-incrementally-and-optimized-parquet-writer-with-aws-glue/
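As a rough sketch of the pattern that blog describes (not your exact job): a JDBC-backed source needs a transformation_ctx plus jobBookmarkKeys so Glue can remember which rows it has already processed, and bookmarks must be enabled on the job. All database, table, key, and bucket names below are placeholders:
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read only rows not seen by previous runs; 'id' stands in for a monotonically increasing key column.
dyf = glueContext.create_dynamic_frame.from_catalog(
    database="redshift_db",              # placeholder catalog database
    table_name="schema_tablename",       # placeholder catalog table pointing at Redshift
    transformation_ctx="read_redshift",  # required for bookmarks to track this source
    additional_options={"jobBookmarkKeys": ["id"], "jobBookmarkKeysSortOrder": "asc"},
)

glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/target/"},  # placeholder bucket
    format="parquet",
    transformation_ctx="write_s3",
)

job.commit()  # commits the bookmark state for the next run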

Sqlalchemy Snowflake not closing connection after successfully retrieving results

I am connecting to the Snowflake data warehouse from Python and I am encountering weird behavior. The Python program exits successfully if I retrieve a small number of rows from Snowflake, but hangs indefinitely if I try to retrieve more than 200K rows. I am 100% sure there are no issues with my machine, because I am able to retrieve 5 to 10 million rows from other database systems such as Postgres.
My Python environment is Python 3.6 and I use the following library versions: SQLAlchemy 1.1.13, snowflake-connector-python 1.4.13, snowflake-sqlalchemy 1.0.7.
The following code prints the total number of rows and closes the connection.
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
engine = create_engine(URL(
    account=xxxx,
    user=xxxxx,
    password=xxxxx,
    database=xxxxx,
    schema=xxxxxx,
    warehouse=xxxxx))
query = """SELECT * FROM db_name.schema_name.table_name LIMIT 1000"""
results = engine.execute(query)
print(results.rowcount)
engine.dispose()
The following code prints the total number of rows, but the connection doesn't close; it just hangs until I manually kill the Python process.
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
engine = create_engine(URL(
    account=xxxx,
    user=xxxxx,
    password=xxxxx,
    database=xxxxx,
    schema=xxxxxx,
    warehouse=xxxxx))
query = """SELECT * FROM db_name.schema_name.table_name LIMIT 500000"""
results = engine.execute(query)
print(results.rowcount)
engine.dispose()
I tried multiple different tables and I encounter the same issue with Snowflake. Has anyone encountered similar issues?
Can you check the query status from the UI? The "History" page should include the query. If the warehouse is not ready, it may take a couple of minutes to start the query (I guess that's very unlikely, though).
Try changing the connection to this:
connection = engine.connect()
results = connection.execute(query)
print (results.rowcount)
connection.close()
engine.dispose()
SQLAlchemy's dispose doesn't close the connection if the connection is not explicitly closed. I have inquired about this before, but so far the workaround is just to close the connection explicitly.
https://groups.google.com/forum/#!searchin/sqlalchemy/shige%7Csort:date/sqlalchemy/M7IIJkrlv0Q/HGaQLBFGAQAJ
Lastly, if the issue still persists, add this logging setup at the top:
import logging
for logger_name in ['snowflake', 'botocore']:
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.FileHandler('log')
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
    logger.addHandler(ch)
and collect the log.
If the output is too long to fit here, I can take it at the issue page: https://github.com/snowflakedb/snowflake-sqlalchemy.
Note: I tried this myself but cannot reproduce the issue so far.
Have you tried using a with statement to manage your connection?
Instead of this:
engine = create_engine(URL(account=xxxx,user=xxxxx,password=xxxxx,database=xxxxx,schema=xxxxxx,warehouse=xxxxx))
results = engine.execute(query)
do the following:
engine = create_engine(URL(account=xxxx,user=xxxxx,password=xxxxx,database=xxxxx,schema=xxxxxx,warehouse=xxxxx))
with engine.connect() as connection:
    # do work
    results = connection.execute(query)
    ...
After the with block exits, the connection is closed automatically.
