Do Spark tables keep data permanently stored, as an RDBMS does, with data available all the time? [duplicate] - apache-spark

This question already has an answer here:
Difference between createTempview and createGlobaltempview and CreateorReplaceTempview in spark 2.1?
(1 answer)
Closed 4 years ago.
I'm quite new to Spark and was trying to understand its functionality. I come from a database background and was confused by Spark databases and tables. So my confusion is: does Spark also store data permanently on its own and make it available all the time, as an RDBMS or other NoSQL store does?
Or does it just create a reference to the incoming data for the duration of processing, with the data going away once the process is over?
So basically, how is Spark used when we have to process data regularly, in batches or in continuous streaming? What is the time-to-live for data in Spark tables?

Spark is not a database. It does not store data permanently by itself. It's a cluster computing framework/engine that can also run in a standalone environment. What Spark does is pull data from various sources (HDFS, S3, the local filesystem, an RDBMS, NoSQL stores, etc.) and perform analysis or transformations in the memory (RAM) of the various worker nodes. It has the capability to spill data to local disk if the data does not fit in RAM. Once the action is finished, the data is flushed out. You can cache or persist data, though, and it will remain available as long as the Spark context is running; but even cached data is not guaranteed to stay put: when memory fills up, Spark evicts the least recently used (LRU) RDDs to make room for other RDDs. Memory management is an interesting concept in Spark.
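The LRU eviction behavior described above can be illustrated with a minimal plain-Python sketch. This is only an analogy to what Spark's block manager does, not Spark's actual implementation; the class name, block IDs, and sizes are invented for illustration:

```python
from collections import OrderedDict

class LRUBlockStore:
    """Toy model of an LRU-evicting in-memory block store."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.used_mb = 0
        self.blocks = OrderedDict()  # block_id -> size_mb, oldest first

    def put(self, block_id, size_mb):
        # Evict least-recently-used blocks until the new one fits.
        while self.blocks and self.used_mb + size_mb > self.capacity_mb:
            _evicted_id, evicted_size = self.blocks.popitem(last=False)
            self.used_mb -= evicted_size
        self.blocks[block_id] = size_mb
        self.used_mb += size_mb

    def get(self, block_id):
        # Accessing a block marks it as most recently used.
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return block_id
        return None  # a real engine would recompute or read from disk

store = LRUBlockStore(capacity_mb=100)
store.put("rdd_1_part_0", 60)
store.put("rdd_1_part_1", 30)
store.get("rdd_1_part_0")      # touch part 0, so part 1 is now the LRU block
store.put("rdd_2_part_0", 40)  # forces eviction of rdd_1_part_1
```

After the last put, rdd_1_part_1 has been evicted, while rdd_1_part_0 survives because it was touched more recently.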

Related

Spark dataframe creation through already distributed in-memory data sets

I am new to the Spark community. Please ignore if this question doesn't make sense.
My PySpark DataFrame spends only a fraction of its time (in ms) on 'Sorting', but moving the data is much more expensive (> 14 sec).
Explanation:
I have a huge collection of Arrow RecordBatches which is distributed evenly across all of my worker nodes' memory (in plasma_store). Currently, I am collecting all those RecordBatches back on my master node, merging them, and converting them to a single Spark DataFrame. Then I apply a sorting function to that DataFrame.
A Spark DataFrame is a data collection distributed across the cluster.
So my question is:
Is it possible to create a Spark DataFrame from all those Arrow RecordBatch collections already distributed in the workers' memory, so that the data remains in the respective workers' memory (instead of being brought to the master node, merged, and then turned into a distributed DataFrame)?
Thanks!
Yes, you can store the data in the Spark cache; whenever you try to access the data, it will be served from the cache rather than from the source.
Please use the links below to learn more about caching:
https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/
Where is df.cache() stored?
https://unraveldata.com/to-cache-or-not-to-cache/
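The "served from cache rather than the source" idea can be sketched in plain Python. This is a conceptual analogy only: Spark's df.cache() is lazy and cluster-wide, while this toy example just memoizes a load function, and the function names are invented:

```python
load_count = 0

def load_from_source(key):
    """Stand-in for an expensive read (an S3 scan, a parquet read, ...)."""
    global load_count
    load_count += 1
    return f"data-for-{key}"

cache = {}

def read(key):
    # First access hits the source; later accesses are served from cache.
    if key not in cache:
        cache[key] = load_from_source(key)
    return cache[key]

read("fact_table")
read("fact_table")
read("fact_table")
# load_from_source ran only once, despite three reads
```

The trade-off is the same as in Spark: repeated reads become cheap, at the price of holding the data in memory.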

Concept for temporary data in Apache Cassandra

I have a question regarding the usage of Cassandra for temporary data (data which is written once to the database, read once from the database, and then deleted).
We are using Cassandra to exchange data between processes which are running on different machines / different containers. Process1 writes some data to Cassandra, Process2 reads this data. After that, the data can be deleted.
As we learned that Cassandra doesn't handle frequent writes and deletes in one table well, because of tombstones and performance issues, we are creating temporary tables for this.
Process1 : Create table, write data to table.
Process2 : Read data from table, drop table.
But doing this at a very high rate (500-1000 tables created and dropped per hour), we are facing problems with schema synchronization between our nodes (we have a cluster of 6 nodes).
The Cassandra cluster became very slow, we got a lot of timeout warnings, we got errors about differing schemas on the nodes, the CPU load on the cluster nodes rose to 100%, and then the cluster was dead :-).
Is Cassandra the right database for this use case?
Is it a problem with how we configured our cluster?
Would it be a better solution to create temporary keyspaces for this?
Does anyone have experience handling such a use case with Cassandra?
You don't need any database here. Your use case is to enable your applications to handshake with each other to share data asynchronously. There are two possible solutions:
1) For batch-based writes and reads, consider using something like HDFS for intermediate storage. Process 1 writes data files into HDFS directories and Process 2 reads them from HDFS.
2) For a message-based system, consider something like Kafka. Process 1 processes the data stream and writes into Kafka topics, and Process 2's consumers read the data from those topics. Kafka does provide ack/nack features.
Continuously creating and deleting a large number of tables in Cassandra is not good practice and is never recommended.
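The write-once/read-once handshake in option (2) can be sketched in-process with a plain-Python queue. This is a toy stand-in for Kafka; the real thing adds durability, partitions, offsets, and consumer groups, none of which this sketch models:

```python
import queue
import threading

topic = queue.Queue()  # in-process stand-in for a Kafka topic
results = []

def process1_producer():
    # Process 1: write each record once, then signal end of stream.
    for record in ["row-1", "row-2", "row-3"]:
        topic.put(record)
    topic.put(None)  # sentinel instead of a real end-of-stream marker

def process2_consumer():
    # Process 2: read each record exactly once. Nothing needs to be
    # deleted afterwards; consuming the message *is* the handshake.
    while True:
        record = topic.get()
        if record is None:
            break
        results.append(record)

p1 = threading.Thread(target=process1_producer)
p2 = threading.Thread(target=process2_consumer)
p1.start(); p2.start()
p1.join(); p2.join()
```

Note how the "delete after read" step from the Cassandra design disappears entirely: a message queue consumes data as part of reading it, which is exactly why it fits this use case better than a table store.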

Spark's amnesia of parquet partitions when cached in-memory (native spark cache)

I am working on some batch processing with Spark, reading data from a partitioned parquet file which is around 2 TB. Right now, I am caching the whole file in memory, since I need to avoid reading the same parquet file multiple times (given the way we are analyzing the data).
Until some time back, the code was working fine. Recently, we added use cases which need to work on selective partitions only (like the average of a metric for the last 2 years, where the complete data spans 6+ years).
When we started taking metrics of the execution times, we observed that the use case working on a subset of the partitioned data takes about as long as the use case that works on the complete data.
So my question is: does Spark's in-memory caching honor the partitions of a parquet file, i.e., does Spark hold on to the partition information even after caching the data in memory?
Note: Since this is really a general question about Spark's processing style, I haven't added any metrics or code.

Spark Poor Query performance: How to improve query performance on Spark?

There is a lot of hype about how good and fast Spark is at processing large amounts of data.
So, we wanted to investigate the query performance of spark.
Machine configuration:
4 worker nodes, r3.2xlarge instances
Data
Our input data is stored in 12 gzip-compressed splits in S3.
What we did
We created a table using Spark SQL for the aforementioned input data set.
Then we cached the table. We found from the Spark UI that Spark did not load all the data into memory; rather, it loaded some data into memory and some onto disk.
UPDATE: We also tested with parquet files. In this case, all the data was loaded into memory. Then we executed the same queries as below. Performance was still not good enough.
Query Performance
Let's assume the table name is Fact_data. We executed the following query on that cached table:
select date_key,sum(value) from Fact_data where date_key between 201401 and 201412 group by date_key order by 1
The query takes 1268.93 sec to complete. This is huge compared to the execution time in Redshift (dc1.large cluster), which takes only 9.23 sec.
I also tested some other queries, e.g., count, join, etc. Spark gives me really poor performance for each of them.
Questions
Could you suggest anything that might improve the performance of the query? Maybe I am missing some optimization techniques. Any suggestion will be highly appreciated.
How can I compel Spark to load all the data into memory? Currently it stores some data in memory and some on disk.
Is there any performance difference between using a DataFrame and a SQL table? I think not, because under the hood they use the same optimizer.
I suggest you use Parquet as your file format instead of gzipped files.
You can try increasing your --num-executors, --executor-memory and --executor-cores.
If you're using YARN and your instance type is r3.2xlarge, make sure your container size yarn.nodemanager.resource.memory-mb is larger than your --executor-memory (maybe around 55G); you also need to set yarn.nodemanager.resource.cpu-vcores to 15.

Apache Spark node asking master for more data?

I'm trying to benchmark a few approaches to putting an image processing algorithm into apache spark. For one step in this algorithm, a computation on a pixel in the image will depend on an unknown amount of surrounding data, so we can't partition the image with guaranteed sufficient overlap a priori.
One solution to that problem I need to benchmark is for a worker node to ask the master node for more data when it encounters a pixel with insufficient surrounding data. I'm not convinced this is the way to do things, but I need to benchmark it anyway because of reasons.
Unfortunately, after a bunch of googling and reading docs I can't find any way for a processingFunc called as part of sc.parallelize(partitions).map(processingFunc) to query the master node for more data from a different partition mid-computation.
Does a way exist in Spark for a worker node to ask the master for more data, or will I need to hack something together that works around Spark?
The master node in Spark is for allocating resources to a particular job; once the resources are allocated, the driver ships the complete code, with all its dependencies, to the various executors.
The first step in every job is to load the data into the Spark cluster. You can read the data from any underlying data repository: a database, a filesystem, web services, etc.
Once the data is loaded, it is wrapped into an RDD, which is partitioned across the nodes in the cluster and stored in the workers'/executors' memory. You can control the number of partitions through the various RDD APIs, but you should do so only when you have valid reasons to.
All operations are then performed on RDDs using the various methods/operations exposed by the RDD API. An RDD keeps track of its partitions and partitioned data, and depending on the need or request, it automatically queries the appropriate partition.
In a nutshell, you do not have to worry about the way data is partitioned by an RDD, which partition stores which data, or how they communicate with each other; but if you do care, you can write your own custom partitioner, instructing Spark how to partition your data.
Secondly, if your data cannot be partitioned, then I do not think Spark would be an ideal choice, because that would result in everything being processed on one single machine, which is contrary to the idea of distributed computing.
I'm not sure what exactly your use case is, but there are people who have been leveraging Spark for image processing; see here for the comments from Databricks.
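If the required neighborhood can be bounded, a common alternative to workers calling back to the master is to pre-partition with a fixed overlap ("halo"), so each worker already holds the surrounding data it needs. A minimal 1-D sketch in plain Python; the halo size and helper name are invented for illustration:

```python
def partition_with_halo(data, num_parts, halo):
    """Split a 1-D sequence into num_parts chunks, each extended by
    `halo` extra elements on both sides (clipped at the boundaries)."""
    n = len(data)
    size = n // num_parts
    parts = []
    for i in range(num_parts):
        start = i * size
        end = n if i == num_parts - 1 else (i + 1) * size
        lo = max(0, start - halo)
        hi = min(n, end + halo)
        # (core slice bounds, padded slice): the core is what this
        # worker owns; the halo is read-only context from neighbors.
        parts.append(((start, end), data[lo:hi]))
    return parts

pixels = list(range(10))
parts = partition_with_halo(pixels, num_parts=2, halo=2)
# part 0 owns [0, 5) but also sees pixels 5 and 6;
# part 1 owns [5, 10) but also sees pixels 3 and 4.
```

This only works when an upper bound on the neighborhood is known; with a truly unbounded dependency, as described in the question, one may have to over-provision the halo or fall back to a re-fetch/retry pass for the few pixels that exceed it.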
