spark streaming persistent table updates - apache-spark

I have a Spark Structured Streaming application (listening to Kafka) that also reads from a persistent table in S3. I am trying to have each micro-batch check for updates to the table. I have tried
var myTable = spark.table("myTable!")
and
spark.sql("select * from parquet.`s3n://myFolder/`")
Neither works in a streaming context. The issue is that the parquet files change on each update, and none of the usual refresh commands help, such as:
spark.catalog.refreshTable("myTable!")
spark.sqlContext.clearCache()
I have also tried:
spark.sqlContext.setConf("spark.sql.parquet.cacheMetadata","false")
spark.conf.set("spark.sql.parquet.cacheMetadata",false)
to no avail. There has to be a way to do this. Would it be smarter to use a JDBC connection to a database instead?

Assuming I'm reading you right, I believe the issue is that because DataFrames are immutable, you cannot see changes to your parquet table unless you restart the streaming query and create a new DataFrame. This question has come up on the Spark mailing list before. The definitive answer appears to be that the only way to capture these updates is to restart the streaming query. If your application cannot tolerate 10-second hiccups, you might want to check out this blog post, which summarizes the above conversation and discusses how SnappyData enables mutations on Spark DataFrames.
Disclaimer: I work for SnappyData
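If a periodic restart is tolerable, a minimal sketch of that pattern could look like the following; the Kafka options, paths, join key and refresh interval are all assumptions, not details from the question.
import org.apache.spark.sql.SparkSession

// Sketch only: periodically stop and restart the streaming query so each new
// query reads a fresh snapshot of the parquet table. All names below are assumptions.
val spark = SparkSession.builder().appName("restart-sketch").getOrCreate()

while (true) {
  // Re-read the static table so the restarted query sees newly written parquet files
  val staticDf = spark.read.parquet("s3n://myFolder/")

  val streamDf = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092") // assumption
    .option("subscribe", "myTopic")                   // assumption
    .load()
    .selectExpr("CAST(key AS STRING) AS joinKey", "CAST(value AS STRING) AS payload")

  // Stream-static inner join; assumes staticDf also has a joinKey column
  val query = streamDf.join(staticDf, Seq("joinKey"))
    .writeStream
    .format("console")
    .start()

  query.awaitTermination(10 * 60 * 1000) // run for ~10 minutes, then refresh the table
  query.stop()
}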

This will accomplish what I'm looking for.
val df1Schema = spark.read.option("header", "true").csv("test1.csv").schema
val streamDf1 = spark.readStream.schema(df1Schema).option("header", "true").csv("/1")
// Continuously append the stream into an in-memory table named "df1"
streamDf1.writeStream.format("memory").outputMode("append").queryName("df1").start()
// Re-query the in-memory table whenever a fresh snapshot is needed
var df1 = spark.sql("select * from df1")
The downside is that it is append-only. One way around that is to remove duplicates based on ID, keeping the row with the newest date:
val dfOrder = df1.orderBy(col("id"), col("updateTableTimestamp").desc)
val dfMax = dfOrder.groupBy(col("id")).agg(first("name").as("name"),first("updateTableTimestamp").as("updateTableTimestamp"))
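A window-function variant of the same deduplication (keep only the newest row per id) is sketched below; it does the same thing without relying on the sort order surviving the groupBy.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Sketch: rank rows per id by updateTableTimestamp (newest first) and keep rank 1
val w = Window.partitionBy(col("id")).orderBy(col("updateTableTimestamp").desc)
val dfLatest = df1
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") === 1)
  .drop("rn")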

Related

How to paginate hive table in spark?

I wanted to paginate a Hive table with ~1.5 billion rows using PySpark. I came across one solution using ROW_NUMBER(). When I tried it, I ran out of memory. I'm not sure whether Spark is trying to bring the complete table into memory before doing the pagination.
After that, I came across the LIMIT clause in Hive SQL (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause) and tried it, but it failed in Spark; the reason, as far as I can tell, is that HiveQL is not completely supported in spark.sql(). Spark SQL's LIMIT does not accept a second argument for an offset -> https://spark.apache.org/docs/3.0.0/sql-ref-syntax-qry-select-limit.html
Is there a good approach for doing pagination in Spark?
PS: The Hive table does not have an ID column that I could sort on to paginate. :)
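For reference, the ROW_NUMBER()-style pagination the question refers to is usually written roughly like the sketch below; the sort column, table name, page size and page number are hypothetical. Note that a window without partitionBy forces all rows into a single partition, which is a likely cause of the out-of-memory behaviour described above.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val pageSize = 100000
val pageNumber = 3

// A global (un-partitioned) window: Spark moves all rows to one partition here
val w = Window.orderBy(col("some_sort_col"))

val page = spark.table("my_big_table")
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") > (pageNumber - 1) * pageSize && col("rn") <= pageNumber * pageSize)
  .drop("rn")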
Basic use of Spark:
# Extract the data
df = spark.read.table("my_table")
# Transform the data
df = df.withColumn("new_col", some_transformation())
# Load the data
df.write ... # write wherever you want

What is best approach to join data in spark streaming application?

Question: Essentially, rather than running a join against the C* table for each streaming record, is there any way to run the join once per micro-batch of records in Spark streaming?
We have almost finalized on spark-sql 2.4.x and the datastax-spark-cassandra-connector for Cassandra 3.x.
But I have one fundamental question regarding efficiency in the scenario below.
For the streaming records (i.e. streamingDataSet), I need to look up existing records (i.e. cassandraDataset) from a Cassandra (C*) table.
i.e.
Dataset<Row> streamingDataSet = // kafka read dataset
Dataset<Row> cassandraDataset = // loaded from the C* table (records persisted earlier)
To look up the data I need to join the above datasets, i.e.
Dataset<Row> joinDataSet = streamingDataSet.join(cassandraDataset).where(// some logic)
and then process joinDataSet further to implement the business logic ...
In the above scenario, my understanding is that for each record received from the Kafka stream it would query the C* table, i.e. make a database call. Wouldn't that take huge time and network bandwidth if the C* table contains billions of records? What should be the approach to improve the C* table lookups?
What is the best solution in this scenario? I CANNOT load the C* table once and keep looking it up, as data keeps being added to the C* table ... i.e. new lookups might need newly persisted data.
How should this kind of scenario be handled? Any advice, please.
If you're using Apache Cassandra, then you have only one possibility for an efficient join with data in Cassandra: the RDD API's joinWithCassandraTable. The open source version of the Spark Cassandra Connector (SCC) supports only this approach, while the DSE version contains code that allows an efficient join against Cassandra from Spark SQL as well, the so-called DSE Direct Join. If you use a plain Spark SQL join against a Cassandra table, Spark will need to read all data from Cassandra and then perform the join, which is very slow.
I don't have an example of the OSS SCC doing this join from Spark Structured Streaming, but I have some examples of a "normal" join, like this:
// Static helpers someColumns, mapRowToTuple and mapTupleToRow come from
// com.datastax.spark.connector.japi.CassandraJavaUtil; trdd is presumably the result of
// CassandraJavaUtil.javaFunctions(...) applied to a JavaRDD of Tuple1<Integer> keys.
CassandraJavaPairRDD<Tuple1<Integer>, Tuple2<Integer, String>> joinedRDD =
    trdd.joinWithCassandraTable("test", "jtest",
        someColumns("id", "v"), someColumns("id"),
        mapRowToTuple(Integer.class, String.class), mapTupleToRow(Integer.class));
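For the Structured Streaming part of the question, one hedged option is foreachBatch, joining each micro-batch against Cassandra through the same RDD API; the keyspace, table and column names below are assumptions, and streamingDataSet refers to the streaming dataframe from the question.
import com.datastax.spark.connector._
import org.apache.spark.sql.DataFrame

// Sketch: join every micro-batch with the C* table instead of one call per record
def joinBatchWithCassandra(batchDf: DataFrame, batchId: Long): Unit = {
  val joined = batchDf
    .select("id")                      // join key column is an assumption
    .rdd
    .map(row => Tuple1(row.getInt(0))) // assumes id is an Int
    .joinWithCassandraTable("test", "jtest")
  // continue with the business logic on the joined RDD ...
  joined.take(10).foreach(println)
}

streamingDataSet.writeStream
  .foreachBatch(joinBatchWithCassandra _)
  .start()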

How to execute streaming-static join faster than normal to be in sync with batch trigger duration?

I am using spark-sql 2.4.1 for streaming in my PoC.
I have the scenario below:
Dataset<Row> staticDs = // previous data from an HDFS/Cassandra table
Dataset<Row> streamingDs = // data from a Kafka topic, read as a stream
Dataset<Row> joinDs = streamingDs.join(staticDs, streamingDs.col("companyId").equalTo(staticDs.col("company_id")), "inner");
Even though this works, I have an issue with the timing of the join.
Currently my streaming trigger interval is around 10 seconds, whereas this join runs for almost 1 minute, so I am not getting results in the expected time.
How can I make the join complete within every 10-second trigger?
Thank you.
In your case, to perform the join Spark needs to read all data from Cassandra, and this is slow. As I mentioned before, you need to use DSE Analytics if you want to perform an efficient join on a Dataset/DataFrame, or use joinWithCassandraTable/leftJoinWithCassandraTable from the RDD API.
Update, September 2020: support for joins with Cassandra in dataframes was added in Spark Cassandra Connector 2.5.0.
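For completeness, a rough sketch of what the dataframe-level join looks like with SCC 2.5.0+, where the connector's Catalyst extensions can turn a join on the partition key into a direct join; the host, keyspace, table and key column are assumptions.
import org.apache.spark.sql.SparkSession

// Sketch: enable the Cassandra Catalyst extensions (SCC 2.5.0+) so eligible
// dataframe joins are executed as direct joins instead of full table scans.
val spark = SparkSession.builder()
  .appName("scc-direct-join-sketch")
  .config("spark.sql.extensions", "com.datastax.spark.connector.CassandraSparkExtensions")
  .config("spark.cassandra.connection.host", "127.0.0.1") // assumption
  .getOrCreate()

val cassandraDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "test", "table" -> "jtest")) // assumptions
  .load()

// Joining on the partition key column ("id" here) is what makes a direct join possible
val lookupKeys = spark.range(100).toDF("id")
val joined = lookupKeys.join(cassandraDf, Seq("id"))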

Does spark saveAsTable really create a table?

This may be a dumb question due to my lack of some fundamental knowledge of Spark. I tried this:
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").enableHiveSupport().getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("foo");
This creates a table under the 'default' database in Hive, and of course, I can fetch data from the table anytime I want.
I updated the above code to get rid of "enableHiveSupport":
SparkSession spark = SparkSession.builder().appName("spark ...").master("local").getOrCreate();
Dataset<Row> df = spark.range(10).toDF();
df.write().saveAsTable("bar");
The code runs fine without any error, but when I try "select * from bar", Spark says:
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'bar' not found in database 'default';
So I have two questions here:
1) Is it possible to create a 'raw' Spark table, not a Hive table? I know Hive maintains its metadata in a database like MySQL; does Spark have a similar mechanism?
2) In the second code snippet, what does Spark actually create when calling saveAsTable?
Many thanks.
Answers to both questions below:
If you want to create a raw table only in Spark, createOrReplaceTempView could help you. For the second part, see the next answer.
By default, if you call saveAsTable on your dataframe, it will persist the table into the Hive metastore if you use enableHiveSupport. If you don't enableHiveSupport, the table will be managed by Spark and the data will be stored under the spark-warehouse location. You will lose these tables after restarting the Spark session.
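A small sketch of both points, under the assumption that no Hive support is enabled: a temp view acts as a 'raw' Spark-only table, and saveAsTable without Hive writes its data under the spark-warehouse directory tracked by Spark's own catalog.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("catalog-sketch").master("local[*]").getOrCreate()

// 1) A "raw" Spark-only table: a temp view, visible only to this SparkSession
val df = spark.range(10).toDF("id")
df.createOrReplaceTempView("foo_view")
spark.sql("select * from foo_view").show()

// 2) Where saveAsTable puts data without Hive support
df.write.saveAsTable("bar")
println(spark.conf.get("spark.sql.warehouse.dir")) // defaults to ./spark-warehouse
spark.catalog.listTables().show()                  // lists both "bar" and "foo_view"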

Ignite Spark Dataframe slow performance

I was trying to improve the performance of some existing Spark dataframes by adding Ignite on top. The following code is how we currently read the dataframe:
val df = sparksession.read.parquet(path).cache()
I managed to save and load a Spark dataframe from Ignite using the example here: https://apacheignite-fs.readme.io/docs/ignite-data-frame. The following code is how I do it now with Ignite:
val df = spark.read()
.format(IgniteDataFrameSettings.FORMAT_IGNITE()) //Data source
.option(IgniteDataFrameSettings.OPTION_TABLE(), "person") //Table to read.
.option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), CONFIG) //Ignite config.
.load();
df.createOrReplaceTempView("person");
SQL queries (like select a, b, c from table where x) on the Ignite dataframe work, but the performance is much slower than Spark alone (i.e. without Ignite, querying the Spark DF directly). A SQL query often takes 5 to 30 seconds, and it's common for it to be 2 or 3 times slower than Spark alone. I noticed a lot of data (100MB+) is exchanged between the Ignite container and the Spark container for every query. A query with the same "where" clause but a smaller result is processed faster. Overall, the Ignite dataframe support seems to be a simple wrapper on top of Spark, and hence in most cases it is slower than Spark alone. Is my understanding correct?
Also, following the code example, when the cache is created in Ignite it automatically gets a name like "SQL_PUBLIC_name_of_table_in_spark", so I couldn't change any cache configuration in the XML (because I need to specify the cache name in the XML/code to configure it, and Ignite complains that it already exists). Is this expected?
Thanks
First of all, it doesn't seem that your test is fair. In the first case you prefetch Parquet data, cache it locally in Spark, and only then execute the query. In case of Ignite DF you don't use caching, so data is fetched during query execution. Typically you will not be able to cache all your data, so performance with Parquet will go down significantly once some of the data needs to be fetched during execution.
However, with Ignite you can use indexing to improve the performance. For this particular case, you should create an index on the x field to avoid scanning all the data every time the query is executed. Here is the information on how to create an index: https://apacheignite-sql.readme.io/docs/create-index
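A hedged sketch of creating such an index through Ignite's SQL DDL from the application side; the cache name follows the SQL_PUBLIC_ pattern mentioned above, and the config path, table and column names are assumptions.
import org.apache.ignite.Ignition
import org.apache.ignite.cache.query.SqlFieldsQuery

// Sketch: add a secondary index on the column used in the WHERE clause ("x" is an assumption)
val ignite = Ignition.start("ignite-config.xml")        // config path is an assumption
val cache = ignite.cache[Any, Any]("SQL_PUBLIC_PERSON") // auto-generated cache name for table "person"
cache.query(new SqlFieldsQuery(
  "CREATE INDEX IF NOT EXISTS person_x_idx ON person (x)")).getAll()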
