Spark SQL performance issue on large table - apache-spark

We are connecting to Teradata from Spark SQL with the API below:
Dataset<Row> jdbcDF = spark.read().jdbc(connectionUrl, tableQuery, connectionProperties);
We are facing an issue when we execute the above logic on a large table with a million rows: every time we run it, we see the extra query below being executed as well, and this results in a performance hit on the DB.
We got the information below from the DBA; we don't have any logs on the Spark SQL side.
SELECT 1 FROM ONE_MILLION_ROWS_TABLE;
1
1
1
1
1
1
1
1
1
Can you please clarify why this query is executing, or is there any chance that this type of query is being issued from our own code while checking the row count of the DataFrame?
Please provide your inputs on this.
I have searched various mailing lists and even created a Spark question.
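One thing worth checking on your side (this is an assumption to verify, not a confirmed cause): when Spark needs no columns from a JDBC source, for example when an action such as count() is run on the DataFrame, the query it pushes down to the database can reduce to a trivial per-row projection much like the one your DBA captured. A minimal PySpark sketch for reproducing and checking this; the URL, credentials, and driver class are placeholders:
# Placeholder connection details; substitute your real Teradata URL and credentials.
jdbc_url = "jdbc:teradata://<host>/DATABASE=<db>"
props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.teradata.jdbc.TeraDriver",  # assumed driver class name
}
jdbc_df = spark.read.jdbc(jdbc_url, "ONE_MILLION_ROWS_TABLE", properties=props)
# A count() needs no columns, so the pushed-down query can become a minimal
# projection over every row; compare what the DBA captures while this runs.
print(jdbc_df.count())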

Related

Spark request only a partial sorting for row_number().over partitioned window

Version: DBR 8.4 | Spark 3.1.2
I'm trying to get the top 500 rows per partition, but I can see from the query plan that it is sorting the entire data set (50K rows per partition) before eventually filtering to the rows I care about.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

max_rank = 500
ranking_order = (Window.partitionBy(['category', 'id'])
                 .orderBy(F.col('primary').desc(), F.col('secondary')))
df_ranked = (df
             .withColumn('rank', F.row_number().over(ranking_order))
             .where(F.col('rank') <= max_rank))
df_ranked.explain()
I read elsewhere that expressions such as df.orderBy(desc("value")).limit(n) are optimized by the query planner to use TakeOrderedAndProject and avoid sorting the entire table. Is there a similar approach I can use here to trigger an optimization and avoid fully sorting all partitions?
For context, right now my query is taking 3.5 hours on a beefy 4 worker x 40 core cluster, and the shuffle write time surrounding this query (including some projections not listed above) appears to be my main bottleneck, so I'm trying to cut down the amount of data as soon as possible.
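For reference, the non-windowed top-N pattern quoted above can be checked directly in the physical plan; a small sketch, assuming a DataFrame df with a placeholder column named value, that should show TakeOrderedAndProject rather than a full sort:
from pyspark.sql import functions as F
# Global top-N: the planner can replace the full sort with TakeOrderedAndProject.
# 'value' is the placeholder column name from the expression quoted above.
df.orderBy(F.desc('value')).limit(500).explain()
The open question is whether an equivalent shortcut exists for the per-partition row_number() filter shown above.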

Connection issue python with Cassandra

I am trying to fetch data from a Cassandra cluster. There is an issue: sometimes it fetches a few rows, and sometimes no rows at all.
I am using Python 3.8.5 and latest DataStax Python driver.
Below is the code
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['xx.xx.x.xx', 'xx.xx.x.xx', 'xx.xx.x.xx'])
session = cluster.connect('keyspace')

query = "SELECT * from table ALLOW FILTERING"  # ALLOW FILTERING is only for testing
statement = SimpleStatement(query)
for user_row in session.execute(statement):
    print(user_row)
The statement sometimes gives some results and sometimes none.
I have enabled tracing, as suggested, to see the execution; below is the result:
0:00:00.000150 Parsing SELECT * from table ALLOW FILTERING
0:00:00.000271 Preparing statement
0:00:00.000425 Computing ranges to query
0:00:00.000654 Submitting range requests on 769 ranges with a concurrency of 1 (0.0 rows per range expected)
0:00:00.001804 Submitted 1 concurrent range requests
0:00:00.001839 Executing seq scan across 0 sstables for (min(-9223372036854775808), min(-9223372036854775808)]
0:00:00.001924 Read 0 live rows and 0 tombstone cells
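For reference, a minimal sketch of how such a trace is typically captured with the DataStax Python driver, assuming the same session and query as above; the printed fields come from the driver's QueryTrace events:
from cassandra.query import SimpleStatement
statement = SimpleStatement("SELECT * from table ALLOW FILTERING")
# Ask the coordinator to trace this request, then read the trace back from the driver.
rs = session.execute(statement, trace=True)
trace = rs.get_query_trace()
for event in trace.events:
    # source_elapsed is the microseconds elapsed on the node that emitted the event
    print(event.source_elapsed, event.source, event.description)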

Power BI/Query avoid materialization of join results before GROUP BY

Objective
I'm using Power BI Desktop with DirectQuery to a Spark cluster. I want to join two tables and aggregate by the MONTH and DEP_NAME columns. The fact table is 10 GB+ (contains the MONTH column), while the Department table is only a few KB (contains the DEP_ID and DEP_NAME columns). The expected result is very small, about 100 rows.
Issue
Spark fails due to the following exception:
DataSource.Error: ODBC: ERROR [HY000] [Microsoft][Hardy] (35) Error
from server: error code: '0' error message: 'Error running query:
org.apache.spark.SparkException: Job aborted due to stage failure:
Total size of serialized results of 10 tasks (4.1 GB) is bigger than
spark.driver.maxResultSize (4.0 GB)'.
I'm pretty sure Power BI tries to materialize the join result (10 GB+) before applying the aggregation.
Question
Is there any way to keep Power BI from executing/materializing the full join result before the aggregation is applied?
Power Query
let
    Source = ApacheSpark.Tables("https://xxxxxxxxx.azuredatabricks.net:443/sql/protocolv1/o/yyyyyyyyyyy", 2, [BatchSize=null]),
    #"Result" = Table.Group(
        Table.Join(
            Source{[Schema="default",Item="Fact"]}[Data],
            "DEP_ID",
            Table.RenameColumns(Source{[Schema="default",Item="Department"]}[Data], {"DEP_ID", "DEP_ID_"}),
            "DEP_ID_",
            JoinKind.Inner
        ),
        {"MONTH", "DEP_NAME"},
        {{"TOTAL_SALARY", each List.Sum([SALARY]), type number}}
    )
in
    #"Result"
Power Query failed job execution plan
From the Spark SQL execution plan you can see that there is no aggregation step, only the join! I think Power BI tries to load the join result (10 GB+) through the Spark driver before applying the GROUP BY aggregation.
Expected execution plan
I can write the same job with PySpark:
from pyspark.sql import functions as F

dep = spark.read.csv(dep_path)
spark.read.parquet(fact_pat) \
    .join(F.broadcast(dep), ['DEP_ID']) \
    .groupBy('MONTH', 'DEP_NAME') \
    .agg(F.sum('SALARY')) \
    .show(1000)
The plan will be the following (pay attention to hash aggregate steps at the end):
P.S.
AFAIK, Power BI Desktop "View Native Query" is disabled for Spark DirectQuery.
UPD
Looks like the issue isn't in query folding; Power BI for some reason materializes the table before the GROUP BY even without the join. The following query leads to a full table load:
let
    Source = ApacheSpark.Tables("https://xxxxxxxx.azuredatabricks.net:443/sql/protocolv1/o/yyyyyyyyyyyy", 2, [BatchSize=null]),
    #"Result" = Table.Group(
        Source{[Schema="default",Item="Fact"]}[Data],
        {"MONTH", "DEP_ID"},
        {{"TOTAL_SALARY", each List.Sum([SALARY]), type number}}
    )
in
    #"Result"
Still, the full load happens only in the case of the List.Sum function. List.Count and List.Max work well, even with the table join before the GROUP BY.
Instead of joining then grouping, maybe you could do the reverse. Group by MONTH and DEP_ID from your Fact table and then join with the Department table to get the DEP_NAME.
Note: If multiple DEP_ID have the same DEP_NAME, you'll need to do one more group by after joining.
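For comparison, a sketch of that reordering expressed in PySpark, mirroring the earlier snippet (dep_path and fact_pat are the same placeholder paths): the large fact table is aggregated first, and the tiny Department table is joined afterwards.
from pyspark.sql import functions as F
dep = spark.read.csv(dep_path)
(spark.read.parquet(fact_pat)
    .groupBy('MONTH', 'DEP_ID')                  # aggregate the 10 GB+ fact table first
    .agg(F.sum('SALARY').alias('TOTAL_SALARY'))
    .join(F.broadcast(dep), ['DEP_ID'])          # then join the few-KB Department table
    .select('MONTH', 'DEP_NAME', 'TOTAL_SALARY')
    .show(1000))
# Per the note above: if several DEP_ID values map to one DEP_NAME, add another
# groupBy('MONTH', 'DEP_NAME') with a sum after the join.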

Spark window function on dataframe with large number of columns

I have an ML dataframe which I read from csv files. It contains three types of columns:
ID Timestamp Feature1 Feature2...Feature_n
where n is ~ 500 (500 features in ML parlance). The total number of rows in the dataset is ~ 160 million.
As this is the result of a previous full join, there are many features which do not have values set.
My aim is to run a "fill" function (fillna-style, as in Python pandas), where each empty feature value gets set to the previously available value for that column, per ID and date.
I am trying to achieve this with the following spark 2.2.1 code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, last}

val rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(-50000, -1)
val columns = Array(...) // first 30 columns initially, just to see it working
val rawDataSetFilled = columns.foldLeft(rawDataset) { (originalDF, columnToFill) =>
  originalDF.withColumn(columnToFill, coalesce(col(columnToFill), last(col(columnToFill), ignoreNulls = true).over(window)))
}
I am running this job on 4 m4.large instances on Amazon EMR, with Spark 2.2.1 and dynamic allocation enabled.
The job runs for over 2h without completing.
Am I doing something wrong, at the code level? Given the size of the data, and the instances, I would assume it should finish in a reasonable amount of time? And I haven't even tried with the full 500 columns, just with about 30!
Looking in the container logs, all I see are many logs like this:
INFO codegen.CodeGenerator: Code generated in 166.677493 ms
INFO execution.ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
I have tried setting the parameter spark.sql.windowExec.buffer.spill.threshold to something larger, without any impact. Is there some other setting I should know about? Those two lines are the only ones I see in any container log.
In Ganglia, I see most of the CPU cores peaking around full usage, but the memory usage is lower than the maximum available. All executors are allocated and are doing work.
I have managed to rewrite the foldLeft logic without using withColumn calls. Apparently they can be very slow for a large number of columns, and I was also getting stack overflow errors because of that.
I would be curious to know why there is such a massive difference, and what exactly happens behind the scenes during query plan execution that makes repeated withColumn calls so slow.
Links which proved very helpful: a Spark JIRA issue and this Stack Overflow question.
var rawDataset = sparkSession.read.option("header", "true").csv(inputLocation)
val window = Window.partitionBy("ID").orderBy("DATE").rowsBetween(Window.unboundedPreceding, Window.currentRow)
// Build every filled column in a single select instead of one withColumn call per column
rawDataset = rawDataset.select(rawDataset.columns.map(column => coalesce(col(column), last(col(column), ignoreNulls = true).over(window)).alias(column)): _*)
rawDataset.write.option("header", "true").csv(outputLocation)

Cassandra - IN or TOKEN query for querying an entire partition?

I want to query a complete partition of my table.
My compound partition key consists of (id, date, hour_of_timestamp). id and date are strings, hour_of_timestamp is an integer.
I needed to add the hour_of_timestamp field to my partition key because of hotspots while ingesting the data.
Now I'm wondering what's the most efficient way to query a complete partition of my data?
According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
Is it better to use the TOKEN function and query the partition with two tokens? Such as SELECT * from mytable WHERE TOKEN(id,date,hour_of_timestamp) >= TOKEN('x','10-10-2016',0) AND TOKEN(id,date,hour_of_timestamp) <= TOKEN('x','10-10-2016',23);
So my question is:
Should I use the IN or the TOKEN query for querying an entire partition of my data? Or should I use 24 queries (one for each value of hour_of_timestamp) and let the driver do the rest?
I am using Cassandra 3.0.8 and the latest Datastax Java Driver to connect to a 6 node cluster.
You say:
Now I'm wondering what's the most efficient way to query a complete partition of my data? According to this blog, using SELECT * from mytable WHERE id = 'x' AND date = '10-10-2016' AND hour_of_timestamp IN (0,1,...23); is causing a lot of overhead on the coordinator node.
but actually you'd query 24 partitions.
What you probably meant is that you had a design where a single partition was what now consists of 24 partitions, because you add the hour to avoid a hotspot during data ingestion. Noting that in both models (the old one with hotspots and this new one) data is still ordered by timestamp, you have three choices:
Run 1 query at a time.
Run 2 queries the first time, and then one at a time to "prefetch" results.
Run 24 queries in parallel.
CASE 1
If you process data sequentially, the first choice is to run the query for hour 0, process the data and, when finished, run the query for hour 1, and so on... This is a straightforward implementation, and I don't think it deserves more than this.
CASE 2
If your queries take more time than your data processing, you could "prefetch" some data. So, the first time you could run 2 queries in parallel to get the data for both hour 0 and hour 1, and start processing data for hour 0. In the meantime, data for hour 1 arrives, so when you finish processing data for hour 0 you can prefetch data for hour 2 and start processing data for hour 1. And so on... In this way you can speed up data processing. Of course, depending on your timings (data processing and query times) you should optimize the number of "prefetch" queries.
Also note that the Java Driver does pagination for you automatically, and depending on the size of the retrieved partition, you may want to disable that feature to avoid blocking the data processing, or may want to fetch more data preemptively with something like this:
ResultSet rs = session.execute("your query");
for (Row row : rs) {
    if (rs.getAvailableWithoutFetching() == 100 && !rs.isFullyFetched())
        rs.fetchMoreResults(); // this is asynchronous
    // Process the row ...
}
where you could tune that rs.getAvailableWithoutFetching() == 100 to better suit your prefetch requirements.
You may also want to prefetch more than one partition the first time, so that you ensure your processing won't wait on any data fetching part.
CASE 3
If you need to process data from different partitions together, e.g. you need data for both hour 3 and hour 6, then you could try to group data by "dependency" (e.g. query both hour 3 and hour 6 in parallel).
If you need all of them, then you should run 24 queries in parallel and join them at the application level (you already know why you should avoid IN across multiple partitions). Remember that your data is already ordered, so your application-level effort would be very small.
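A minimal sketch of case 3, shown with the Python driver purely for illustration (the question uses the Java driver); the contact point and keyspace are placeholders, while the table and key columns are the ones from the question:
from cassandra.cluster import Cluster
cluster = Cluster(['xx.xx.x.xx'])            # placeholder contact point
session = cluster.connect('my_keyspace')     # placeholder keyspace
query = ("SELECT * FROM mytable "
         "WHERE id = %s AND date = %s AND hour_of_timestamp = %s")
# Fire one asynchronous request per hour-partition, then merge at the application level.
futures = [session.execute_async(query, ('x', '10-10-2016', hour))
           for hour in range(24)]
rows = [row for future in futures for row in future.result()]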
