I am trying to load data from an ADX cluster using the Spark connector and save it in Azure Storage. The table in the ADX database gets updated every week, so how can I write a KQL query that loads only the latest data from the cluster and saves it to storage?
Thanks in advance!
This is the code for the Spark connector.
query = """
Table
| count
"""
dataframe = spark.read \
    .format("com.microsoft.kusto.spark.datasource") \
    .option("kustoCluster", kustoOptions["kustoCluster"]) \
    .option("kustoDatabase", kustoOptions["kustoDatabase"]) \
    .option("kustoQuery", query) \
    .option("kustoAadAppId", kustoOptions["kustoAadAppId"]) \
    .option("kustoAadAppSecret", kustoOptions["kustoAadAppSecret"]) \
    .option("kustoAadAuthorityID", kustoOptions["kustoAadAuthorityID"]) \
    .load()
Database cursors
Database cursors are designed to address two important scenarios:
The ability to repeat the same query multiple times and get the same results, as long as the query indicates "same data set".
The ability to make an "exactly once" query. This query only "sees" the data that a previous query didn't see, because the data wasn't available then. The query lets you iterate, for example, through all the newly arrived data in a table without fear of processing the same record twice or skipping records by mistake.
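For example, here is a minimal sketch of using a database cursor with the Spark connector. It assumes the table has the IngestionTime policy enabled, that the cursor value returned by cursor_current() on the previous run was persisted somewhere (for example a small blob in storage), and it reuses the kustoOptions from your snippet:

previous_cursor = "<cursor-value-persisted-from-last-run>"

# Only rows ingested after the previous run's cursor.
incremental_query = f"""
Table
| where cursor_after('{previous_cursor}')
"""

latest_df = spark.read \
    .format("com.microsoft.kusto.spark.datasource") \
    .option("kustoCluster", kustoOptions["kustoCluster"]) \
    .option("kustoDatabase", kustoOptions["kustoDatabase"]) \
    .option("kustoQuery", incremental_query) \
    .option("kustoAadAppId", kustoOptions["kustoAadAppId"]) \
    .option("kustoAadAppSecret", kustoOptions["kustoAadAppSecret"]) \
    .option("kustoAadAuthorityID", kustoOptions["kustoAadAuthorityID"]) \
    .load()

# After writing latest_df to Azure Storage, run a query that returns
# cursor_current() and persist that value for next week's run.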
I was working with streaming into Delta tables using foreachBatch.
spark.readStream.format("delta") \
.option("readChangeFeed", "true").option("startingVersion","latest") \
.load("dbfs:/mnt/cg/file_to_up/table2") \
.writeStream.queryName("myquery") \
.foreachBatch(merge_deltas) \
.option("checkpointLocation", "dbfs:/mnt/cg/tmp/checkpoint") \
.trigger(processingTime='2 seconds') \
.start()
I would like to know if there is a way to do a merge between a Delta table (source) and a Synapse table (target). The section "Write to Azure Synapse Analytics using foreachBatch() in Python" on this page is the closest I have found to what I need, but I think the write modes (append, overwrite) will not help me.
Also, I know it could be simulated using insert and delete in the preActions and postActions options, but if there is a way to do it with a merge, that would be great.
You can use a combination of merge and foreachBatch (see the foreachBatch documentation for more information) to write complex upserts from a streaming query into a Delta table; a minimal sketch follows the note below. For example:
• Write streaming aggregates in Update Mode: This is much more efficient than Complete Mode.
• Write a stream of database changes into a Delta table: The merge query for writing change data can be used in foreachBatch to continuously apply a stream of changes to a Delta table.
• Write a stream data into Delta table with deduplication: The insert-only merge query for deduplication can be used in foreachBatch to continuously write data (with duplicates) to a Delta table with automatic deduplication.
For more information, refer to the official documentation.
Note:
• Make sure that your merge statement inside foreachBatch is idempotent, as restarts of the streaming query can apply the operation on the same batch of data multiple times.
• When merge is used in foreachBatch, the input data rate of the streaming query (reported through StreamingQueryProgress and visible in the notebook rate graph) may be reported as a multiple of the actual rate at which data is generated at the source. This is because merge reads the input data multiple times, causing the input metrics to be multiplied. If this is a bottleneck, you can cache the batch DataFrame before the merge and then uncache it afterwards.
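In case it helps, here is a minimal sketch of what a merge_deltas function passed to foreachBatch could look like. It targets a Delta table (as described above, not a Synapse table), and the target path and the id join key are placeholders, not taken from your question:

from delta.tables import DeltaTable

def merge_deltas(batch_df, batch_id):
    # Placeholder target Delta table and join key.
    target = DeltaTable.forPath(spark, "/mnt/delta/target_table")
    (target.alias("t")
        .merge(batch_df.dropDuplicates(["id"]).alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

Because the merge is keyed on id and the batch is deduplicated first, re-running the same batch after a restart produces the same result, which keeps the operation idempotent as the note requires.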
At the moment, we are using the steps in the article below to do a full load of the data from one of our Spark data sources (a Delta Lake table) and write it to a table in SQL DW.
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/synapse-analytics
Specifically, the write is carried out using:
df.write \
.format("com.databricks.spark.sqldw") \
.option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
.option("forwardSparkAzureStorageCredentials", "true") \
.option("dbTable", "<your-table-name>") \
.option("tempDir", "wasbs://<your-container-name>#<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
.option("maxStrLength",4000).mode("overwrite").save()
Now, our source data, by virtue of being a Delta Lake table, is partitioned on countryid, and we would like to load/refresh only certain partitions to the SQL DW instead of the full drop-table-and-load (because we specify "overwrite") that is happening now. I tried adding an additional option (partitionBy, countryid) to the above script, but that doesn't seem to work.
Also, the above article doesn't mention partitioning.
How do I work around this?
There might be better ways to do this, but this is how I achieved it. If the target Synapse table is partitioned, we can leverage the "preActions" option provided by the Synapse connector to delete the existing data for that partition, and then append the new data for that partition (read as a DataFrame from the source), instead of overwriting the whole table.
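A rough sketch of that approach, with placeholder paths and names (the source path, table name and connection string are assumptions, not from the article):

country_id = 101  # the partition being refreshed

partition_df = spark.read.format("delta") \
    .load("/mnt/delta/source_table") \
    .filter(f"countryid = {country_id}")

partition_df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "<your-table-name>") \
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .option("preActions", f"DELETE FROM <your-table-name> WHERE countryid = {country_id}") \
    .mode("append") \
    .save()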
I am extracting data from a MongoDB collection and writing it to a BigQuery table using Spark Python code.
Below is my code snippet:
df = spark.read\
.format("com.mongodb.spark.sql.DefaultSource")\
.option("uri","mongodb_url")\
.option("database","db_name")\
.option("collection", "collection_name")\
.load()
df.write \
.format("bigquery") \
.mode("append")\
.option("temporaryGcsBucket","gcs_bucket") \
.option("createDisposition","CREATE_IF_NEEDED")\
.save("bq_dataset_name.collection_name")
This will extract all the data from the MongoDB collection, but I want to extract only the documents that satisfy a condition (like a WHERE condition in a SQL query).
One way I found was to read the whole collection into a DataFrame and apply a filter on that DataFrame, like below:
df2 = df.filter(df['date'] < '12-03-2020 10:12:40')
But as my source Mongo collection has 8-10 GB of data, I cannot afford to read all of it from Mongo every time.
How can I use filtering while reading data from Mongo using spark.read?
Have you tried checking whether your whole data is being scanned even after applying filters?
Assuming you are using the official MongoDB connector for Spark, filter/predicate pushdown is supported.
“Predicates pushdown” is an optimization from the connector and the Catalyst optimizer to automatically “push down” predicates to the data nodes. The goal is to maximize the amount of data filtered out on the data storage side before loading it into Spark’s node memory.
There are two kinds of predicates automatically pushed down by the connector to MongoDB:
• the select clause (projections) as a $project
• the filter clause content (where) as one or more $match
You can find the supporting code for this over here.
Note:
There have been some issues regarding predicate pushdown on nested fields, but this is a bug in Spark itself and affects other sources as well. It has been fixed in Spark 3.x. Check this answer.
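So rather than reading everything and filtering afterwards, you can attach the filter to the read and let the connector translate it into a $match on the MongoDB side. A sketch using the same placeholder names as your snippet (top-level fields only):

from pyspark.sql import functions as F

df = spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb_url") \
    .option("database", "db_name") \
    .option("collection", "collection_name") \
    .load() \
    .filter(F.col("date") < "12-03-2020 10:12:40")

# Inspect the physical plan; predicates handed to the source show up
# under PushedFilters in the scan node.
df.explain(True)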
I have a table that is consistently updated from Spark Structured Streaming (Kafka source).
It is written like this (inside foreachBatch):
parsedDf \
.select("somefield", "anotherField",'partition', 'offset') \
.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save(f"/mnt/defaultDatalake/{append_table_name}")
I need to keep a fast view on this table for "items inserted in the last half hour".
How can this be achieved?
I can have a readStream from this table, but what I'm missing is how to keep just the "tail" of the stream there.
Databricks 7.5, Spark 3.
Given that Delta Lake does not have materialized views, and that Delta Lake time travel is not relevant because you want the most current data:
You can load the data and include a key that does not need to be looked up whilst inserting.
Pre-populate a time dimension for joining with your data. See it as a dimension with a grain of a minute.
Join the data with this dimension, relying on dynamic file pruning; every 30 minutes you query a rolling window of the relevant minute values and set them in the query.
See https://databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-pruning.html#:~:text=Dynamic%20File%20Pruning%20(DFP)%2C%20a%20new%20feature%20now%20enabled,queries%20on%20non%2Dpartitioned%20tables.
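Not the full time-dimension join, but a simplified variant of the same idea is to stamp each appended row with an ingestion timestamp and filter on it, so Delta's data skipping only reads recent files (load_ts is an assumed extra column, not something your current write produces):

from pyspark.sql import functions as F

parsedDf \
    .select("somefield", "anotherField", "partition", "offset") \
    .withColumn("load_ts", F.current_timestamp()) \
    .write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save(f"/mnt/defaultDatalake/{append_table_name}")

# The "last half hour" view is then just a filter on that column.
recent_df = spark.read.format("delta") \
    .load(f"/mnt/defaultDatalake/{append_table_name}") \
    .filter(F.col("load_ts") >= F.expr("current_timestamp() - INTERVAL 30 MINUTES"))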
Using the spark-elasticsearch connector, it is possible to directly load only the required columns from ES into Spark. However, there doesn't seem to be such a straightforward option to do the same using the spark-cassandra connector.
Reading data from ES into Spark
-- here only the required columns are brought from ES into Spark:
spark.conf.set('es.nodes', ",".join(ES_CLUSTER))
es_epf_df = spark.read.format("org.elasticsearch.spark.sql") \
.option("es.read.field.include", "id_,employee_name") \
.load("employee_0001") \
Reading data from Cassandra into Spark
-- here all the columns' data is brought into Spark and then a select is applied to pull the columns of interest:
spark.conf.set('spark.cassandra.connection.host', ','.join(CASSANDRA_CLUSTER))
cass_epf_df = spark.read.format('org.apache.spark.sql.cassandra') \
.options(keyspace="db_0001", table="employee") \
.load() \
.select("id_", "employee_name")
Is it possible to do the same for Cassandra? If yes, then how? If not, then why not?
Actually, the connector should do that itself, without needing to set anything explicitly; it's called "predicate pushdown", and the Cassandra connector does it, according to the documentation:
The connector will automatically pushdown all valid predicates to Cassandra. The Datasource will also automatically only select columns from Cassandra which are required to complete the query. This can be monitored with the explain command.
source: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md
The code which you have written is already doing that. You have written the select after load, so you may think all the columns are pulled first and the selected columns are filtered afterwards, but that is not the case.
Assumed: select * from db_0001.employee;
Actual: select id_, employee_name from db_0001.employee;
Spark will understand which columns you need and query only those from the Cassandra database. This feature is called predicate pushdown. It is not limited to Cassandra; many sources support it (this is a feature of Spark, not Cassandra).
For more info: https://docs.datastax.com/en/dse/6.7/dse-dev/datastax_enterprise/spark/sparkPredicatePushdown.html
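To see the pruning for yourself, a quick check with explain(), reusing the placeholder keyspace/table names from the question:

cass_epf_df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace="db_0001", table="employee") \
    .load() \
    .select("id_", "employee_name")

# The Cassandra scan node in the physical plan should list only the
# id_ and employee_name columns being read from Cassandra.
cass_epf_df.explain()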