Is there way to get a rowcount on a query using Snowflake and its Spark Connector? - apache-spark

I am running a query in my Spark application that returns a substantially large amount of data. I would like to know how many rows of data are being queried for logging purposes. I can't seem to find a way to get the number of rows without either manually counting them, or calling a method to count for me, as the data is fairly large this gets expensive for logging. Is there a place that the rowcount is saved and available to grab?
I have read here that the Python connector saves the rowcount into the object model, but i can't seem to find any equivalent for the Spark Connector or its underlying JDBC.
The most optimal way I can find is rdd.collect().size on the RDD that Spark provides. It is about 15% faster than calling rdd.count()
Any help is appreciated 😃

The limitation is within Spark's APIs that do not directly offer metrics of a completed distributed operation such as a row count metric after a save to table or file. Snowflake's Spark Connector is limited to the calls Apache Spark offers for its integration, and the cursor attributes otherwise available in the Snowflake Python and JDBC Connectors are not accessible through Py/Spark.
The simpler form of the question of counting an executed result, removing away Snowflake specifics, has been discussed previously with solutions: Spark: how to get the number of written rows?

Related

apache-spark-Cost Based Optimizer(CBO) stats are not used while evaluating query plans in Spark Sql

We are trying to leverage CBO for getting better plan results for few critical queries run thru spark-sql or thru thrift server using jdbc driver. Following settings added to spark-defaults.conf {code}
spark.sql.cbo.enabled true spark.experimental.extrastrategies
intervaljoin spark.sql.cbo.joinreorder.enabled true {code}
The tables that we are using are not partitioned.
Please let me know if you need further details.
You provide little detail. Check if all steps have been followed as set out below pls.
From Spark 2.2 when I last looked at this excellent article: https://www.waitingforcode.com/apache-spark-sql/spark-sql-cost-based-optimizer/read
the following:
Spark SQL implementation
At the time of writing (2.2.0 released) Spark SQL Cost Based Optimization is disabled by default and can be activated through spark.sql.cbo.enabled property. When enabled, it applies in: filtering, projection, joins and aggregations, as we can see in corresponding estimation objects from org.apache.spark.sql.catalyst.plans.logical.statsEstimation package: FilterEstimation, ProjectEstimation, JoinEstimation and AggregateEstimation.
Even if at first glance the use of estimation objects seems to be conditioned only by the configuration property, it's not always the case. The Spark's CBO is applied only when the statistics about manipulated data are known (read more about them in the post devoted to Statistics in Spark SQL). This condition is expressed by EstimationUtils method:
def rowCountsExist(conf: SQLConf, plans: LogicalPlan*): Boolean =
plans.forall(_.stats(conf).rowCount.isDefined)
The filtering is an exception because it's checked against the number of rows existence:
if (childStats.rowCount.isEmpty) return None
The statistics can be gathered by the execution of ANALYZE TABLE $TABLE_NAME COMPUTE STATISTICS command before the processing execution. When ANALYZE command is called, it's executed by:
org.apache.spark.sql.execution.command.AnalyzeTableCommand#run(SparkSession) that updates
org.apache.spark.sql.catalyst.catalog.SessionCatalog statistics of processed data.
The only problem with ANALYZE command is that it can be called only for Hive and in-memory data stores.
Also, CBO does not work properly with Partitioned Hive Parquet tables; CBO only gives the size and not the number of rows estimated.

Structure streaming: First n rows

Recently, I encounter with the problem 'first n rows' in structure streaming during engineering with real-time data. I need to obtain the 50 newest event-time records as output, but structure streaming give me a whole unbounded table or several updated results. I search a lot online, and several methods is following:
(1) Using TTL, but I think that it is based on ingestion time, which is not my desired event-time;
(2) Using Flink to catch the newest event-time records. It is something messy to use flink and structure streaming in the meantime. As following, I have tried to use flink 1.6, statics is a table? I don't know how to process on because nothing output.
val source: KafkaTableSource = Kafka010JsonTableSource.builder()
.forTopic("BINANCE_BTCUSDT_RESULT")
.withKafkaProperties(properties)
.withSchema(TableSchema.builder()
.field("timestamp", Types.SQL_TIMESTAMP)
.field("future_max", Types.DOUBLE)
.field("future_min", Types.DOUBLE)
.field("close",Types.DOUBLE)
.field("quantities",Types.DOUBLE).build())
.fromEarliest()
.build()
tableEnv.registerTableSource("statics", source)
val statics = tableEnv.scan("statics")
statics.?
Any body could tell me more about the solving method with the first n rows problem? If the problem is solved, how to post the dataframe into url?
I recommend you use Flink 1.5, as 1.6 isn't stable yet (in fact, 1.5 was just released).
When using event time with Flink, Flink needs to be aware of your timestamps, and it needs watermarks, which indicate the flow of event time. To do this with a Kafka010JsonTableSource, you should configure a rowtime attribute.
Note that fetch() is only available when using Flink SQL in batch mode.

How to write data into a Hive table?

I use Spark 2.0.2.
While learning the concept of writing a dataset to a Hive table, I understood that we do it in two ways:
using sparkSession.sql("your sql query")
dataframe.write.mode(SaveMode."type of
mode").insertInto("tableName")
Could anyone tell me what is the preferred way of loading a Hive table using Spark ?
In general I prefer 2. First because for multiple rows you cannot build such a long sql and second because it reduces the chance of errors or other issues like SQL injection attacks.
In the same way that for JDBC I use PreparedStatements as much as possible.
Think in this fashion, we need to achieve updates on daily basis on hive.
This can be achieved in two ways
Process all the data of the hive
Process only effected partitions.
For the first option sql works like a gem, but keep in mind that the data should be less to process entire data.
Second option works well.If you want to process only effected partition. Use data.overwite.partitionby.path
You should write the logic in such a way that it process only effected partitions. This logic will be applied to tables where data is in millions T billions records

Writing SQL vs using Dataframe APIs in Spark SQL

I am a newbie in Spark SQL world. I am currently migrating my application's Ingestion code which includes ingesting data in stage,Raw and Application layer in HDFS and doing CDC(change data capture), this is currently written in Hive queries and is executed via Oozie. This needs to migrate into a Spark application(current version 1.6). The other section of code will migrate later on.
In spark-SQL, I can create dataframes directly from tables in Hive and simply execute queries as it is (like sqlContext.sql("my hive hql") ). The other way would be to use dataframe APIs and rewrite the hql in that way.
What is the difference in these two approaches?
Is there any performance gain with using Dataframe APIs?
Some people suggested, there is an extra layer of SQL that spark core engine has to go through when using "SQL" queries directly which may impact performance to some extent but I didn't find any material substantiating that statement. I know the code would be much more compact with Datafrmae APIs but when I have my hql queries all handy would it really worth to write complete code into Dataframe API?
Thank You.
Question : What is the difference in these two approaches?
Is there any performance gain with using Dataframe APIs?
Answer :
There is comparative study done by horton works. source...
Gist is based on situation/scenario each one is right. there is no
hard and fast rule to decide this. pls go through below..
RDDs, DataFrames, and SparkSQL (infact 3 approaches not just 2):
At its core, Spark operates on the concept of Resilient Distributed Datasets, or RDD’s:
Resilient - if data in memory is lost, it can be recreated
Distributed - immutable distributed collection of objects in memory partitioned across many data nodes in a cluster
Dataset - initial data can from from files, be created programmatically, from data in memory, or from another RDD
DataFrames API is a data abstraction framework that organizes your data into named columns:
Create a schema for the data
Conceptually equivalent to a table in a relational database
Can be constructed from many sources including structured data files, tables in Hive, external databases, or existing RDDs
Provides a relational view of the data for easy SQL like data manipulations and aggregations
Under the hood, it is an RDD of Row’s
SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL through:
SQL
DataFrames API
Datasets API
Test results:
RDD’s outperformed DataFrames and SparkSQL for certain types of data processing
DataFrames and SparkSQL performed almost about the same, although with analysis involving aggregation and sorting SparkSQL had a slight advantage
Syntactically speaking, DataFrames and SparkSQL are much more intuitive than using RDD’s
Took the best out of 3 for each test
Times were consistent and not much variation between tests
Jobs were run individually with no other jobs running
Random lookup against 1 order ID from 9 Million unique order ID's
GROUP all the different products with their total COUNTS and SORT DESCENDING by product name
In your Spark SQL string queries, you won't know a syntax error until runtime (which could be costly), whereas in DataFrames syntax errors can be caught at compile time.
Couple more additions. Dataframe uses tungsten memory representation , catalyst optimizer used by sql as well as dataframe. With Dataset API, you have more control on the actual execution plan than with SparkSQL
If query is lengthy, then efficient writing & running query, shall not be possible.
On the other hand, DataFrame, along with Column API helps developer to write compact code, which is ideal for ETL applications.
Also, all operations (e.g. greater than, less than, select, where etc.).... ran using "DataFrame" builds an "Abstract Syntax Tree(AST)", which is then passed to "Catalyst" for further optimizations. (Source: Spark SQL Whitepaper, Section#3.3)

Spark Decision tree fit runs in 1 task

I am trying to "train" a DecisionTreeClassifier using Apache Spark running in a cluster in Amazon EMR. Even though I can see that there are around 50 Executors added and that the features are created by querying a Postgres database using SparkSQL and stored in a DataFrame.
The DesisionTree fit method takes for many hours even though the Dataset is not that big (10.000 db entries with a couple of hundreds of bytes each row).I can see that there is only one task for this so I assume this is the reason that it's been so slow.
Where should I look for the reason that this is running in one task?
Is it the way that I retrieve the data?
I am sorry if this is a bit vague but I don't know if the code that retrieves the data is relevant, or is it a parameter in the algorithm (although I didn't find anything online), or is it just Spark tuning?
I would appreciate any direction!
Thanks in advance.
Spark relies on data locality. It seems that all the data is located in a single place. Hence spark uses a single partition to process it. You could apply a repartition or state the number of partitions you would like to use at load time. I would also look into the decision tree Api and see if you can set the number of partitions for it specifically.
Basically, partitions are your level of parallelism.

Resources