What does InMemoryColumnarTableScan do?

When I run a Spark SQL query and head to the Spark UI DAG visualization, the first step that is shown is called InMemoryColumnarTableScan.
Is my data loaded from disk every time I run a query?
If not, what does this step do exactly?

As its name suggests, the InMemoryColumnarTableScan class contains methods that scan a table stored in memory using columnar compression techniques.
It is used to gather, cache, and provide statistics on the data stored within an in-memory table, so that the table can be queried more efficiently.
The engine calls it first to figure out the best way to run your query against the RDD.
It has nothing to do with your actual data load, so no, your data is not read from disk every time you run a query.
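If you want to see the effect in a shell, here is a minimal sketch (the table is just generated data, not anything from the question): once a DataFrame is cached and materialized, its physical plan shows an in-memory table scan node instead of a fresh source scan.

val df = spark.range(1000).toDF("id")
df.cache()    // mark the DataFrame for in-memory columnar caching
df.count()    // an action materializes the cache
df.explain()  // the physical plan now shows an in-memory table scan node

After the cache is materialized, subsequent queries read from the columnar in-memory store rather than rescanning the source.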

Related

Spark: Is this the wrong way to cache a temp view?

I've seen the following code and I think it is the wrong way to cache a temp view in Spark. What do you think?
spark.sql(
  s"""
     |...
   """.stripMargin).createOrReplaceTempView(s"temp_view")
spark.table(s"temp_view").cache()
In my opinion, this code caches the DataFrame that I create with spark.table("temp_view"), but not the original temp view.
Am I right?
IMO yes, you are caching what you read from this table, but if, for example, you read it again on the next line, you will end up with a second scan.
I think that maybe you can try to use CACHE TABLE within your SQL:
https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-cache-table.html
CACHE TABLE statement caches contents of a table or output of a query
with the given storage level. If a query is cached, then a temp view
will be created for this query. This reduces scanning of the original
files in future queries.
To me it seems promising.
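A minimal sketch of that suggestion (the source table name events and its columns are hypothetical): CACHE TABLE ... AS SELECT runs the query, caches its output, and registers a temp view over it in a single statement.

spark.sql("CACHE TABLE temp_view AS SELECT id, value FROM events WHERE value > 0")
spark.sql("SELECT count(*) FROM temp_view").show()  // served from the cached result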
I think the caching in your example will actually work. Spark does not cache instances of DataFrame; instead, it uses logical plans as the cache key, and the view is transparent for that purpose. For example, here's the code I've just tried using a local table I have:
val df = spark.table("mart.dim_region")
df.createOrReplaceTempView("dim_region")
spark.table("dim_region").cache()
Even though cache() is applied to the view, if I repeatedly invoke df.show, the execution plan contains InMemoryTableScan, which is precisely the effect of caching.
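As a quick check of the plan-based keying (a sketch against the same table), reading the view by name again resolves to the same logical plan and therefore the same cache entry:

spark.table("dim_region").explain()  // shows InMemoryTableScan
df.explain()                         // same cached plan, same scan node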

How to speed up spark sql filter queries if the where clause is already fixed?

In my case, the data resides in Spark tables which are created by calling the createOrReplaceTempView API on a DataFrame. Once the table is created, several queries are going to run on top of it. Most of the time, the WHERE clause is going to be based on a particular column, and the name of the column concerned is already known. I would like to know if some sort of optimization can be done to improve the performance of the filter query.
I tried exploring the approach of indexing, but it turns out Spark does not support indexing a particular column.
Have you looked at the Spark UI to see where most of your time is being consumed? Is it really the query where most of the time is spent? Usually reading the data from disk is where most of the time goes. Learn to read the Spark UI and find where the real bottleneck is; the SQL tab is a really great place to start figuring things out.
Here are some tricks for running faster in Spark that apply to most jobs:
Can you reframe the problem? Was the data you are using in a format that helps you solve the query? Can you change how it's written to change the problem? (Could you start "pre-chewing" the data before you even query it to have it stored in the best format to help you solve the issue you want to solve?) Most performance gains come from changing the parameters of the problem to make them easier/faster to solve.
What format is the incoming data stored in? Are you using Parquet/ORC? They have a great disk-space/compression payoff that is worth using, and they also enable file-level filtering to speed up reads. Is there transformation work that you can push upstream to make the query do less work? Could you write the data via a partition scheme that would aid lookups? (See the sketch after this list.)
How many files is your input? Can you consolidate files to maximize read throughput? Reading/listing a lot of small input files slows down the processing of data.
If the temp-view query is of similar size every time, you could look at tweaking the partition count so that files are smaller but approximately the size of your HDFS block (assuming you are using HDFS; with HDFS you have to read an entire block whether you use all the data or not). Try to fit this to some multiple of your executors so that they finish together instead of straggling. This is hard to get perfect, but you can make decent strides toward a good ratio.
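Here is the sketch promised above for the partition-scheme idea (the path and the country column are hypothetical): writing Parquet partitioned by the column you always filter on lets Spark prune whole directories instead of scanning everything.

import org.apache.spark.sql.functions.col

// df is the DataFrame being stored
df.write.partitionBy("country").parquet("/data/events_by_country")
// a later query filtering on the partition column only reads matching directories
spark.read.parquet("/data/events_by_country").where(col("country") === "US").show()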
There is no need to optimize filter conditions with Spark; Spark is already smart enough to optimize its WHERE conditions to fetch the minimum number of rows first. The best you can do, I guess, is persist your temp view if you are querying the same view again and again.
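A sketch of that persisting suggestion (the view name is hypothetical): persist the view's data once with an explicit storage level, then reuse it across the repeated filter queries.

import org.apache.spark.storage.StorageLevel

val cached = spark.table("temp_view").persist(StorageLevel.MEMORY_AND_DISK)
cached.count()                               // an action materializes the persisted data
cached.createOrReplaceTempView("temp_view")  // later SQL against the view hits the cache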

Why is execution time of spark sql query different between first time and second time of execution?

I am using spark sql to run some aggregated query on the parquet data source.
My parquet data source includes a table with columns: id int, time timestamp, location int, counter_1 long, counter_2 long, ..., counter_48. The total data size is about 887 MB.
My spark version is 2.4.0. I run one master and one slave on a single machine (4 cores, 16G memory).
Using spark-shell, I ran the spark command:
spark.time(spark.sql("SELECT location, sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(cou
nter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35 )+sum(counter_40)+sum(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is 17s.
The second time, I ran a similar command (only the columns changed):
spark.time(spark.sql("SELECT location, sum(counter_2)+sum(counter_6)+sum(counter_11)+sum(counter_16)+sum(cou
nter_21)+sum(counter_26)+sum(counter_31)+sum(counter_36 )+sum(counter_41)+sum(counter_46) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is about 3s.
My first question is: why are they different? I know it is not data caching, because of the Parquet format. Is it about reusing something like the query plan?
I did another test: The first command is
spark.time(spark.sql("SELECT location, sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(cou
nter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35 )+sum(counter_40)+sum(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is 17s.
In the second command, I change the aggregate function:
spark.time(spark.sql("SELECT location, avg(counter_1)+avg(counter_5)+avg(counter_10)+avg(counter_15)+avg(cou
nter_20)+avg(counter_25)+avg(counter_30)+avg(counter_35 )+avg(counter_40)+avg(counter_45) from parquet.`/home/hungp
han227/spark_data/counters` group by location").show())
The execution time is about 5s.
My second question is: why is the second command faster than the first command, while the execution-time difference is slightly smaller than in the first scenario?
Finally, I have a problem related to the above scenarios: there are about 200 formulas like:
formula1 = sum(counter_1)+sum(counter_5)+sum(counter_10)+sum(counter_15)+sum(counter_20)+sum(counter_25)+sum(counter_30)+sum(counter_35)+sum(counter_40)+sum(counter_45)
formula2 = avg(counter_2)+avg(counter_5)+avg(counter_11)+avg(counter_15)+avg(counter_21)+avg(counter_25)+avg(counter_31)+avg(counter_35)+avg(counter_41)+avg(counter_45)
I have to run the following format frequently:
select formulaX, formulaY, ..., formulaZ from table where time > value1 and time < value2 and location in (value1, value2, ...) group by location
My third question is: is there any way to optimize the performance, so that a query used once runs faster when it is used again in the future? Does Spark optimize this itself, or do I have to write some code or change the config?
It's called Exchange Reuse. When Spark runs shuffling (i.e. an aggregation or a join), it stores a copy of the shuffle data on local worker nodes for potential reuse. This is internally controlled behavior and cannot be directly influenced by the end user. If you find you keep reusing a particular portion of data (or a query outcome), you could consider explicitly caching it with cache(). However, bear in mind that although this allows Spark to reuse the cached result for potentially faster query performance (if, and only if, the analyzed plan of your cached query matches your new query), overusing CACHE can cause a whole lot of different performance problems.
A bad example is when your dataset is very large: it may cause a disk spill problem, where the dataset doesn't fit into your cluster's available memory and needs to be written to slower hard disks.
Another bad example is when your query only needs to access a subset of the cached data. By caching the entire dataset in memory, Spark is forced to perform a full in-memory table scan. Not only is that a waste of resources, it also results in slower query performance as opposed to not using the cache at all.
The best thing to do is trial and error with a few of your own example queries: look at the Spark UI and check whether there are signs of disk spill or a large amount of input data being scanned.
Every query/data combination is unique, so you'll need to experiment a bit to find the best performance-tuning method for your own workload.
When doing an aggregate, Spark creates what are called shuffle files. If you run the same query twice, it will reuse the shuffle files, which are stored locally on the workers' filesystems. Unfortunately you can't rely on them always being there, because eventually the file handles get GC'd. If you're going to run 10 queries on the same dataset, cache it or use Databricks.
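A sketch of that advice using the path from the question (the aggregation shown is just a fragment of formula1): read the Parquet source once, cache it, and run the repeated aggregations against the cached data.

import org.apache.spark.sql.functions.sum

val counters = spark.read.parquet("/home/hungphan227/spark_data/counters")
counters.cache()
counters.count()  // materialize the cache once
counters.groupBy("location")
  .agg((sum("counter_1") + sum("counter_5")).as("formula1_part"))
  .show()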

How does Spark deal with JDBC data in relation to time?

I am trying to sync my Spark database on S3 with an older Oracle database via a daily ETL Spark job. I am trying to understand just what Spark does when it connects to an RDS like Oracle to fetch data.
Does it only grab the data that exists at the time of Spark's request to the DB (i.e. if it fetches data from an Oracle DB at 2/2 17:00:00, will it only grab data UP to that point in time)? Essentially, will any new data or updates arriving at 2/2 17:00:01 not be obtained by the fetch?
Well, it depends. In general you have to assume that this behavior is non-deterministic, unless explicitly ensured by your application and database design.
By default Spark will fetch data every time you execute an action on the corresponding Spark dataset. It means that every execution might see a different state of your database.
This behavior can be affected by multiple factors:
Explicit caching and possible cache evictions.
Implicit caching with shuffle files.
The exact set of parameters you use with the JDBC data source.
In the first two cases Spark can reuse already-fetched data without going back to the original data source. The third one is much more interesting. By default Spark fetches data using a single transaction, but there are methods which enable parallel reads based on column ranges or predicates. If one of these is used, Spark will fetch data using multiple transactions, and each one can observe a different state of your database.
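A minimal sketch of such a parallel read (the connection details, table, and bounds are hypothetical): once partitionColumn is set, Spark issues one query per partition, each in its own transaction, so each partition can observe a different database snapshot.

val orders = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
  .option("dbtable", "orders")
  .option("user", "etl_user")
  .option("password", "...")              // placeholder
  .option("partitionColumn", "order_id")  // must be a numeric, date, or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")           // 8 concurrent queries, 8 transactions
  .load()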
If consistent point-in-time semantics are required, you have basically two options:
Use immutable, append-only and timestamped records in your database and issue timestamp dependent queries from Spark.
Perform consistent database dumps and use these as a direct input to your Spark jobs.
While the first approach is much more powerful, it is much harder to implement if you're working with a pre-existing architecture.
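A sketch of the first option (the orders table and updated_at column are hypothetical): push a fixed cutoff timestamp into the query, so every partition sees the same point-in-time slice regardless of when it runs.

val cutoff = "2020-02-02 17:00:00"
val snapshot = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")
  .option("dbtable", s"(SELECT * FROM orders WHERE updated_at <= TIMESTAMP '$cutoff') snap")
  .load()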

Does registerTempTable cause the table to get cached?

I have a SQL query which does a group by on many fields. The table that it uses is also big (4 TB in size). I'm registering the table as a temp table; however, I don't know whether the table gets cached when I register it as a temp table. I also don't know whether it would be more performant to convert my query into Scala functions (e.g. df.groupBy().agg(...)) rather than having it as a SQL statement. Any help on that?
SQL is most likely going to be the fastest by far (see the Databricks blog).
Did you try to partition/repartition your DataFrame as well, to see whether it improves the performance?
Regarding registerTempTable: it only registers the table within a Spark context. You can check with the UI:
val test = List((1,2,3),(4,5,6)).toDF("bla","blb","blc")
test.createOrReplaceTempView("test")
test.show()
The Storage tab stays blank,
vs
val test = List((1,2,3),(4,5,6)).toDF("bla","blb","blc")
test.createOrReplaceTempView("test")
spark.table("test").cache()  // createOrReplaceTempView returns Unit, so cache() goes on the table
test.show()
after which the cached table shows up under Storage.
By the way, registerTempTable is deprecated since Spark 2.0 and has been replaced by createOrReplaceTempView.
I have a sql statement query which is doing a group by on many fields. The tables that it uses is also big (4TB in size). I'm registering the table as a temp table. However I don't know whether the table gets cached or not when I'm registering it as a temp table?
Neither registerTempTable nor createOrReplaceTempView caches the data in memory or on disk by itself; you have to use the cache() function for that.
I also don't know whether it is more performant if I convert my query into Scala function (e.g. df.groupby().aggr()...) rather than having it as a sql statement. Any help on that?
Keep in mind that the terms in a SQL query ultimately call the same functions underneath, so whether you use SQL query terms or the functions available in code, it doesn't matter; it is the same thing.
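A quick sketch of that point, reusing the toy test view from above: the SQL form and the DataFrame form compile down to essentially the same plan, which you can verify with explain().

val bySql = spark.sql("SELECT blb, count(*) FROM test GROUP BY blb")
val byApi = spark.table("test").groupBy("blb").count()
bySql.explain()
byApi.explain()  // the physical plans match apart from output column names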
