SQL query taking too long in Azure Databricks - apache-spark

I want to execute a SQL query on a database that lives in an Azure SQL Managed Instance, using Azure Databricks. I have connected to the database using the Spark connector.
import com.microsoft.azure.sqldb.spark.config.Config
import com.microsoft.azure.sqldb.spark.connect._

val config = Config(Map(
  "url"          -> "mysqlserver.database.windows.net",
  "databaseName" -> "MyDatabase",
  "queryCustom"  -> "SELECT TOP 100 * FROM dbo.Clients WHERE PostalCode = 98074", // SQL query to run
  "user"         -> "username",
  "password"     -> "*********"
))
// Run the query above against dbo.Clients and load the result
val collection = sqlContext.read.sqlDB(config)
collection.show()
I am using the above method to fetch the data (example from the MSFT docs). The tables are over 10M rows in my case. My question is: how does Databricks process the query here?
Below is the documentation:
The Spark master node connects to databases in SQL Database or SQL Server and loads data from a specific table or using a specific SQL query.
The Spark master node distributes data to worker nodes for transformation.
The Spark worker nodes connect to the SQL Database or SQL Server database and write data to it. Users can choose row-by-row insertion or bulk insert.
It says the master node fetches the data and distributes the work to the worker nodes afterwards. In the above code, what happens if the query itself is complex and takes time while fetching the data? Does Spark spread that work across the worker nodes, or do I have to pull the table's data into Spark first and then run the SQL query there to get the result? Which method do you suggest?

So, the above method uses a single JDBC connection to pull the table into the Spark environment.
If you want the query pushed down to the database (predicate pushdown), you can pass it as a subquery in the table parameter, like this:
val pushdown_query = "(select * from employees where emp_no < 10008) emp_alias"
val df = spark.read.jdbc(url = jdbcUrl, table = pushdown_query, properties = connectionProperties)
display(df)
If you want to improve the performance, then you need to manage parallelism while reading.
You can provide split boundaries based on the dataset’s column values.
These options specify the parallelism of the read. They must all be specified if any of them is specified. lowerBound and upperBound decide the partition stride, but do not filter the rows of the table, so Spark partitions and returns all rows in the table.
The following example splits the table read across executors on the emp_no column using the columnName, lowerBound, upperBound, and numPartitions parameters.
val df = spark.read.jdbc(
  url = jdbcUrl,
  table = "employees",
  columnName = "emp_no",
  lowerBound = 1L,
  upperBound = 100000L,
  numPartitions = 100,
  connectionProperties = connectionProperties)
display(df)
For more details, see this link.

Related

How to execute a CQL query using PySpark

I want to execute a Cassandra CQL query using PySpark, but I am not finding a way to do it. I can load the whole table into a DataFrame, create a temp view, and query it.
df = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="country_production2", keyspace="country") \
    .load()
df.createOrReplaceTempView("Test")
Please suggest a better way so that I can execute a CQL query in PySpark.
Spark SQL doesn't support Cassandra's CQL dialect directly. It only allows you to load the table as a DataFrame and operate on it.
If you are concerned about reading a whole table just to query it, you can use filters as shown below to let Spark push the predicates down and load only the data you need.
from pyspark.sql.functions import col

df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table=table_name, keyspace=keys_space_name) \
    .load() \
    .filter(col("id") == "A")
df.createOrReplaceTempView("Test")
In PySpark you're using SQL, not CQL. If the SQL query happens to match what CQL supports, i.e. you're querying by partition or primary key, then the Spark Cassandra Connector (SCC) will transform the query into CQL and execute it (so-called predicate pushdown). If it doesn't match, Spark will load all the data via SCC and perform the filtering at the Spark level.
So after you've registered the temporary view, you can do:
val result = spark.sql("select ... from Test where ...")
and work with the results in the result variable. To check whether predicate pushdown happens, execute result.explain() and look for the * marker on the conditions in the PushedFilters section.
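For example (a minimal sketch, reusing the Test view and the id column from the question; the literal value is only illustrative):
// Filter on the partition/primary key so SCC can push the predicate down to Cassandra
val result = spark.sql("select * from Test where id = 'A'")

// Pushed predicates appear in the PushedFilters section of the physical plan,
// marked with a *, e.g. PushedFilters: [*EqualTo(id,A)]
result.explain()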

How to query and join csv data with Hbase data in Spark Cluster in Azure

In Microsoft Azure, we can create a Spark cluster in Azure HDInsight and an HBase cluster in Azure HDInsight. I have now created these two kinds of clusters. For the Spark cluster, I can create a DataFrame from a csv file and run a SQL query like this (the query below is executed in a Jupyter notebook):
%%sql
SELECT buildingID, (targettemp - actualtemp) AS temp_diff, date FROM hvac WHERE date = "6/1/13"
Meanwhile, in the Spark shell, I can create a connector to the HBase cluster and query a data table in HBase like this:
val query = spark.sqlContext.sql("select personalName, officeAddress from contacts")
query.show()
So my question is: is there a way to do a join operation across these two tables? For example:
select * from hvac a inner join contacts b on a.id = b.id
I referenced these two Microsoft Azure documents:
Run queries on Spark Cluster
Use Spark to read and write HBase data
Any ideas or suggestions for this?
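One hedged sketch of a possible approach, assuming the HBase table can be exposed as a DataFrame in the same Spark session (for example via the Spark-HBase connector from the second document) and that both sides share an id column; the DataFrame names below are placeholders, not from the question:
// hvacDf: DataFrame loaded from the csv file; contactsDf: DataFrame defined
// through the HBase connector catalog (both names are placeholders)
hvacDf.createOrReplaceTempView("hvac")
contactsDf.createOrReplaceTempView("contacts")

// Once both are visible as views in the same session, a plain Spark SQL join works
val joined = spark.sql(
  "SELECT * FROM hvac a INNER JOIN contacts b ON a.id = b.id")
joined.show()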

How to parallelize RDD work when using the Cassandra Spark connector for data aggregation?

Here is the sample scenario: we have real-time data records in Cassandra, and we want to aggregate the data over different time ranges. I wrote code like this:
val timeRanges = getTimeRanges(report)
timeRanges.foreach { timeRange =>
  val (timestampStart, timestampEnd) = timeRange
  val query = _sc.get.cassandraTable(report.keyspace, utilities.Helper.makeStringValid(report.scope))
    .where("TIMESTAMP > ?", timestampStart)
    .where("VALID_TIMESTAMP <= ?", timestampEnd)
  // ...do the aggregation work...
}
The issue with this code is that the aggregation work for each time range does not run in parallel. My question is: how can I parallelize the aggregation work, given that an RDD can't be used inside another RDD or a Future? Is there any way to parallelize the work, or can we not use the Spark connector here?
Use the joinWithCassandraTable function. This allows you to use the data from one RDD to access C* and pull records just like in your example.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#performing-efficient-joins-with-cassandra-tables-since-12
joinWithCassandraTable utilizes the Java driver to execute a single query for every partition required by the source RDD, so no unneeded data will be requested or serialized. This means a join between any RDD and a Cassandra table can be performed without doing a full table scan. When performed between two Cassandra tables which share the same partition key, this will not require movement of data between machines. In all cases this method will use the source RDD's partitioning and placement for data locality.
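A minimal sketch of that pattern (the key case class, column names, and keyspace/table names below are illustrative assumptions, not taken from the question):
import com.datastax.spark.connector._

// Hypothetical type matching the table's partition-key columns;
// field names must match the Cassandra column names
case class RecordKey(customer: String, day: String)

// One entry per partition to aggregate; driving the read from an RDD lets
// the connector issue the per-partition queries in parallel across executors
val keys = sc.parallelize(Seq(
  RecordKey("cust-1", "2014-05-25"),
  RecordKey("cust-1", "2014-05-26")))

// One query per required partition instead of a full table scan
val joined = keys.joinWithCassandraTable("my_keyspace", "my_table")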
Finally, we used union to combine the per-range RDDs so that they run in parallel, roughly as in the sketch below.
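Something along these lines, reusing the query from the question (that _sc.get returns the SparkContext is assumed from the original code):
import com.datastax.spark.connector._

// Build one lazy RDD per time range without triggering any action yet
val perRangeRdds = timeRanges.map { case (timestampStart, timestampEnd) =>
  _sc.get.cassandraTable(report.keyspace, utilities.Helper.makeStringValid(report.scope))
    .where("TIMESTAMP > ?", timestampStart)
    .where("VALID_TIMESTAMP <= ?", timestampEnd)
}

// Union them so a single action schedules all the reads and aggregations together
val allRanges = _sc.get.union(perRangeRdds.toSeq)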

Spark DataFrames: registerTempTable vs not

I just started with DataFrame yesterday and am really liking it so far.
I don't understand one thing though...
(Referring to the example under "Programmatically Specifying the Schema" here: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema)
In this example the DataFrame is registered as a table (I am guessing to provide access via SQL queries?), but the exact same information can also be accessed with peopleDataFrame.select("name").
So the question is: when would you want to register a DataFrame as a table instead of just using the given DataFrame functions? And is one option more efficient than the other?
The reason to use the registerTempTable( tableName ) method for a DataFrame, is so that in addition to being able to use the Spark-provided methods of a DataFrame, you can also issue SQL queries via the sqlContext.sql( sqlQuery ) method, that use that DataFrame as an SQL table. The tableName parameter specifies the table name to use for that DataFrame in the SQL queries.
val sc: SparkContext = ...
val hc = new HiveContext( sc )
val customerDataFrame = myCodeToCreateOrLoadDataFrame()
customerDataFrame.registerTempTable( "cust" )
val query = """SELECT custId, sum( purchaseAmount ) FROM cust GROUP BY custId"""
val salesPerCustomer: DataFrame = hc.sql( query )
salesPerCustomer.show()
Whether to use SQL or DataFrame methods like select and groupBy is probably largely a matter of preference. My understanding is that the SQL queries get translated into Spark execution plans.
In my case, I found that certain kinds of aggregation and windowing queries that I needed, like computing a running balance per customer, were available in the Hive SQL query language, that I suspect would have been very difficult to do in Spark.
If you want to use SQL, then you most likely will want to create a HiveContext instead of a regular SQLContext. The Hive query language supports a broader range of SQL than available via a plain SQLContext.
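For comparison, a rough sketch of the same aggregation from the example above expressed with DataFrame methods instead of SQL:
import org.apache.spark.sql.functions.sum

// Same result as the SQL query above, using the DataFrame API directly
val salesPerCustomer = customerDataFrame
  .groupBy("custId")
  .agg(sum("purchaseAmount"))
salesPerCustomer.show()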
It's convenient to load the DataFrame into a temp view in a notebook, for example, where you can run exploratory queries on the data:
df.createOrReplaceTempView("myTempView")
Then in another cell you can run a SQL query and get all the nice integration features that come out of the box, e.g. table and graph visualisation:
%sql
SELECT * FROM myTempView

How does a Spark worker distribute load in a Cassandra cluster?

I am trying to understand how Cassandra and Spark work together, especially when the data is distributed across nodes.
I have a Cassandra + Spark setup with a two-node cluster using DSE.
The schema is:
CREATE KEYSPACE foo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE bar (
  customer text,
  start timestamp,
  offset bigint,
  data blob,
  PRIMARY KEY ((customer, start), offset)
);
I populated the table with a huge set of test data and later figured out which keys lie on which nodes with the help of the "nodetool getendpoints" command. For example, in my case a particular customer's data with date '2014-05-25' is on node1 and '2014-05-26' is on node2.
When I run the following query from the Spark shell, I see that the Spark worker on node1 runs the task during the mapPartitions stage:
csc.setKeyspace("foo")
val query = "SELECT cl_ap_mac_address FROM bar WHERE customer='test' AND start IN ('2014-05-25')"
val srdd = csc.sql(query)
srdd.count()
And for the following query, the Spark worker on node2 runs the task:
csc.setKeyspace("foo")
val query = "SELECT cl_ap_mac_address FROM bar WHERE customer='test' AND start IN ('2014-05-26')"
val srdd = csc.sql(query)
srdd.count()
But when I give both dates, only one node's worker gets utilized:
csc.setKeyspace("foo")
val query = "SELECT cl_ap_mac_address FROM bar WHERE customer='test' AND start IN ('2014-05-25', '2014-05-26')"
val srdd = csc.sql(query)
srdd.count()
I was thinking that this should use both nodes in parallel during the mapPartitions stage. Am I missing something?
I think you are trying to understand the interaction between Spark and Cassandra, as well as the data distribution in Cassandra.
Basically, from the Spark application a request is made to one of the Cassandra nodes, which acts as the coordinator for that particular client request. More details can be found here.
Further data partitioning and replication is taken care of by Cassandra itself.
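As a rough way to check how much parallelism a given query actually gets, you can look at how many Spark partitions the read produces (a sketch assuming csc.sql returns a DataFrame, as on newer connector versions):
// Number of Spark partitions, and hence tasks, scheduled for this read
println(srdd.rdd.partitions.length)

// The physical plan also shows which predicates were pushed down to Cassandra
srdd.explain()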
