I'm trying to do this mainly because I have to save data from the same stream to two Cassandra tables; they have almost the same schema but different primary keys, to serve two different queries.
Will
rdd.saveToCassandra(keySpace, tableOne, allColumn)
rdd.saveToCassandra(keySpace, tableTwo, allColumn)
do the work?
Is this a normal thing to do? I googled a bit and someone said it may incur performance issues when the RDD is large:
https://groups.google.com/a/lists.datastax.com/forum/#!topic/spark-connector-user/e1nfWWyhZRo
It is OK to do so. To avoid performance issues you need to cache your RDD before its first use, like this:
rdd.cache()
Also, after use it's good practice to unpersist your RDD, like this:
rdd.unpersist()
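Putting it together, a minimal sketch with the spark-cassandra-connector (keyspace and table names here are placeholders, and AllColumns is the connector's default column selector):

import com.datastax.spark.connector._

// Cache once so the RDD is not recomputed for the second write.
rdd.cache()

// Both tables receive the same columns; only the primary keys differ.
rdd.saveToCassandra("my_keyspace", "events_by_user", AllColumns)
rdd.saveToCassandra("my_keyspace", "events_by_time", AllColumns)

// Release the cached blocks once both writes are done.
rdd.unpersist()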
In my case, the data resides in Spark tables which are created by calling the createOrReplaceTempView API on a dataframe. Once the table is created, several queries are going to run on top of it. Most of the time, the WHERE clause filters on a particular column, and the names of the columns concerned are already known. I would like to know if some sort of optimization can be done to improve the performance of these filter queries.
I tried exploring the approach of indexing, but it turns out Spark does not support indexing a particular column.
Have you looked at the Spark UI to see where most of your time is being consumed? Is it really the query where most of the time is spent? Usually reading the data from disk is where most of the time goes. Learn to read the Spark UI and find where the real bottleneck is; the SQL tab is a really great way to start figuring things out.
Here are some tricks for running faster in Spark that apply to most jobs:
Can you reframe the problem? Is the data you are using in a format that helps you solve the query? Can you change how it's written to change the problem? (Could you start "pre-chewing" the data before you even query it, so it is stored in the best format for the issue you want to solve?) Most performance gains come from changing the parameters of the problem to make them easier/faster to solve.
What format are you storing the incoming data in? Are you using Parquet/ORC? They have a great disk-space/compression payoff that makes them worth using, and they can enable file-level filtering to speed up reads. Is there transformation work that you can push upstream to help the query do less work? Can you write the data with a partitioning scheme that would aid lookups?
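As an illustration of that "pre-chewing" idea, a sketch of writing the data as Parquet partitioned by the hot filter column (paths and column names are made up; assumes a SparkSession spark and a dataframe df):

import org.apache.spark.sql.functions.col

// Write once, partitioned by the column most queries filter on; reads
// that filter on event_date only list and scan the matching directories.
df.write
  .partitionBy("event_date")
  .parquet("/data/events_parquet")

val slice = spark.read.parquet("/data/events_parquet")
  .where(col("event_date") === "2020-01-01")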
How many files is your input? Can you consolidate files to maximize read throughput? Reading/listing a lot of small files as input slows down the processing of data.
If the tempView query is of similar size every time, you could look at tweaking the partition count so that files are smaller but approximately the size of your HDFS block (assuming you are using HDFS). With HDFS you have to read an entire block whether you use all the data in it or not. Try to fit this to some multiple of your executor cores so that tasks finish together instead of straggling. This is hard to get perfect, but you can make decent strides toward a good ratio.
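A sketch of that tuning knob, with hypothetical numbers (aiming for output files near a 128 MB HDFS block size):

// Repartition before writing so each output file lands near the block
// size; pick a count that is also a multiple of your executor cores.
df.repartition(64)
  .write
  .parquet("/data/events_sized")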
There is no need to optimize filter conditions with Spark; Spark is already smart enough to push down WHERE conditions and fetch the minimum number of rows first. The best you can do, I guess, is to persist your TempView if you are querying the same view again and again.
I need to understand if there is any difference between the two approaches to caching below when using Spark SQL, and whether there is any performance benefit of one over the other (considering that building the dataframes is costly and I want to reuse them many times / hit many actions).
1> Cache the original dataframe before registering it as a temporary table
df.cache()
df.createOrReplaceTempView("dummy_table")
2> Register the dataframe as a temporary table and cache the table
df.createOrReplaceTempView("dummy_table")
sqlContext.cacheTable("dummy_table")
Thanks in advance.
df.cache() is a lazy cache, which means the caching only occurs when the next action is triggered.
sqlContext.cacheTable("dummy_table") is an eager cache, which means the table will get cached as soon as the command is called. An equivalent of this would be: spark.sql("CACHE TABLE dummy_table")
To answer your question whether there is a performance benefit of one over the other: it is hard to tell without understanding your entire workflow and how (and where) your cached dataframes are used. I'd recommend using the eager cache, so you won't have to second-guess when (and whether) your dataframe is cached.
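A side-by-side sketch of the two behaviors described above (the dataframe and table names follow the question):

// Option 1: lazy - nothing is materialized until an action runs.
df.cache()
df.createOrReplaceTempView("dummy_table")
df.count()  // the first action is what actually populates the cache

// Option 2: eager - the scan happens as soon as the statement runs.
df.createOrReplaceTempView("dummy_table")
spark.sql("CACHE TABLE dummy_table")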
Just trying to figure out what's the best way to handle this situation. I use dataset.write to write into an Oracle database, and the requirement is to find whether duplicates already exist in the table (not within the dataset) and, if they do, to write those duplicate records to a different table. Has anyone run into a similar issue? The table I am writing to is a huge one, and it would be costly to read its existing data to compare against before writing the dataset.
The save mode used is append. It's a Kafka streaming application which streams data continuously every 2 minutes.
There is no UPSERT mode for what I presume you mean, DF.write or DS.write.
The question is how often does such a duplicate occur, and why? And what is the impact if one slips through every now and again? I am not inclined to have a duplicate-key violation occurring with this scenario.
If the duplicate inserts are logically few and there is suitable time-based Oracle partitioning that restricts the amount of data to check, you can do that on the DBMS side as a periodic process there.
So, I would not be inclined to check on the Spark side. It also seems a little counter-intuitive to be ingesting with Kafka and then immediately writing it all out again.
An interesting question, as any approach to do something has some issue to contend with: caching, re-reading, etc. on the Spark side.
How can I export data over a period of time (like hourly or daily), or export updated records, from a Cassandra database? It seems like using an index with a date field might work, but I definitely get timeouts in my cqlsh when I try that by hand, so I'm concerned that it's not reliable.
If that's not the right way, then how do people get their data out of Cassandra and into a traditional database (for analysis, querying with JOINs, etc.)? We're not a Java shop, so using Spark is non-trivial (and we don't want to change our whole system to use Spark instead of Cassandra directly). Do I have to read sstables and try to keep track of them that way? Is there a way to say "get me all records affected after point in time X" or "get me all changes after timestamp X", or something similar?
It looks like Cassandra is really awesome at rapidly reading and writing individual records, but beyond that Cassandra seems not to be the right tool if you want to pull its data into anything else for analysis, warehousing, or querying...
Spark is the most typical way to do exactly that (as you say). It does it efficiently and is used often, so it's pretty reliable. Cassandra is not really designed for OLAP workloads, but things like the Spark connector help bridge the gap. DataStax Enterprise might have some more options available to you, but I am not sure of their current offerings.
You can still just query and page through the whole data set with normal CQL queries; it's just not as fast. You can even use ALLOW FILTERING, just be wary, as it's very expensive and can impact your cluster (creating a separate DC for the workload and running LOCAL consistency-level queries against it helps). In that scenario you will probably also add token() bounds (a < and a >) to the WHERE clause to split up the query and prevent too much work landing on any one coordinator. Organizing your data so that this query is more efficient is strongly recommended (i.e. if doing time slices, put things in a partition bucketed by time with clustering-key timeuuids, so each slice of time is a sequential read).
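A rough sketch of that token-range split, using the DataStax Java driver from Scala (keyspace, table, key column, and the split bounds are all illustrative):

import com.datastax.oss.driver.api.core.CqlSession
import scala.jdk.CollectionConverters._

val session = CqlSession.builder().build()  // connects to localhost:9042 by default

// Split the full Murmur3 token range into a few chunks so no single
// coordinator carries the whole scan (bounds here are illustrative).
val bounds = Seq(Long.MinValue, -3074457345618258603L, 3074457345618258602L, Long.MaxValue)

bounds.sliding(2).foreach { case Seq(lo, hi) =>
  val rs = session.execute(
    s"SELECT * FROM my_keyspace.events WHERE token(id) > $lo AND token(id) <= $hi")
  rs.iterator().asScala.foreach(println)  // the driver pages through results transparently
}
session.close()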
Kinda cheesy-sounding, but the CSV dump from cqlsh is actually fast and might work for you if your data set is small enough.
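For example, something like this in cqlsh (table and file names are placeholders):

COPY my_keyspace.my_table TO 'dump.csv' WITH HEADER = true;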
I would not recommend going to the sstables directly unless you are familiar with the internals and are using Hadoop or Spark.
I need a big-data storage solution for batch inserts of denormalized data, which happen infrequently, and queries on the inserted data, which happen frequently.
I've gone through Cassandra and feel that it's not that good for batch inserts, but an OK solution for querying. Also, it would be good if there were a mechanism to segregate data based on a data attribute.
As you mentioned Cassandra, I will talk about it:
Can you insert in an unbatched way, or is batching imposed by the system? If you can insert unbatched, Cassandra will probably be able to handle it easily.
Batched inserts should also be handleable by Cassandra nodes, but this won't distribute the load properly among all the nodes (note: I'm talking about load balancing, not about data balance, which depends only on your partition key setup). If you are not very familiar with Cassandra, you could tell us your data structure and your query types, and we could suggest how to use Cassandra's data model to fit them.
For the filtering part of the question, Cassandra has clustering keys and secondary indexes; these are basically like adding another column configuration alongside the clustering key so that you have both available for querying.
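For illustration, a minimal sketch of unbatched inserts with the DataStax Java driver from Scala (keyspace, table, and columns are made up):

import com.datastax.oss.driver.api.core.CqlSession

val session = CqlSession.builder().build()

// One prepared statement, executed row by row; each insert is routed
// token-aware to the right replica, spreading load across the cluster.
val insert = session.prepare(
  "INSERT INTO my_keyspace.readings (sensor_id, ts, value) VALUES (?, ?, ?)")

// myRows stands in for your denormalized batch of data.
val myRows: Seq[(String, java.time.Instant, java.lang.Double)] = Seq.empty
myRows.foreach { case (id, ts, v) =>
  session.execute(insert.bind(id, ts, v))
}
session.close()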