I evaluated the storage footprint of HDFS and Cassandra using the same input data on a single machine. Both HDFS and Cassandra have only 1 replica.
My input data is binary, 31M in total. It turned out that HDFS uses less space than Cassandra.
HDFS: 16.4M (using the COMPRESS.BLOCK strategy)
Cassandra: 50M (using the CQL interface with default settings, e.g. compression)
How is that possible, given that Cassandra uses columnar storage?
Could anyone help me figure this out? Thanks very much.
My Cassandra version is 2.1.9.
You will see better C* disk usage with 3.x and later. It's a 2.1 thing that stores the column name along with each field, so if you have 10 fields it will be a lot worse. 3.x is a lot better as it doesn't store redundant data.
HDFS and C* are two completely different things for solving different kinds of problems. If you're looking just for the most efficient use of disk space, then HDFS is probably what you want, as it can store bulk binary data much more efficiently. If you're looking for faster reads/writes, C* may be a better choice. C* adds to your data to organize it, make queries more efficient, and provide guarantees about the data (for consistency). Compression will earn some of that back, but in a lot of cases it's going to take up more space than just your raw data would.
I intend to read data from an Oracle DB with pyspark (running in local mode) and store it locally as parquet. Is there a way to tell whether a Spark session dataframe will be able to hold the amount of data from the query (which will be the whole table, i.e. select * from mytable)? Are there common solutions for when the data would not fit in a dataframe?
* Saw a similar question here, but was a little confused by the discussion in the comments
Since you are running in local mode, I assume this is not on a cluster. You cannot say exactly how much memory will be required, but you can get close. Check how much disk space the table is using. Suppose mytable occupies 1GB on disk; then Spark will require more RAM than that, because Spark's engine needs some memory for its own processing. To be on the safe side, try to have 2GB more than the actual table size.
To check your table size in Oracle, you can use the query below:
select segment_name,segment_type,bytes/1024/1024 MB
from dba_segments
where segment_type='TABLE' and segment_name='<yourtablename>';
It will give you a result in MB.
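As a rough illustration of the rule of thumb above (table size plus about 2GB of headroom), the arithmetic can be sketched in Python. The function name and the 2048MB default are assumptions made for this example and are not Spark settings; the input is the MB figure returned by the dba_segments query.

```python
def estimated_ram_mb(table_mb, headroom_mb=2048):
    """Rule-of-thumb RAM estimate: on-disk table size plus fixed headroom.

    Both the name and the default headroom are illustrative assumptions;
    the in-memory size of a dataframe can differ from the on-disk size.
    """
    return table_mb + headroom_mb


# A 1GB (1024MB) table with the suggested 2GB of headroom.
print(estimated_ram_mb(1024))
```

The result is only a starting point for sizing the driver memory in local mode.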
To configure JVM related parameter in Apache-Spark you can check this.
It doesn't matter how big the table is if you are running Spark in a distributed manner. You would need to worry about memory if:
You are reading the data in the driver and then doing a broadcast.
You are caching the dataframe for some computation.
Usually a DAG gets generated for your Spark application, and if you are using a JDBC source, the workers will read the data directly and use the shuffle space and off-heap memory, spilling to disk, for memory-intensive computation.
I am working on a solution to provide low latency results using spark. For this, I was planning to cache the data beforehand on which a user wants to query.
I am able to achieve good performance on the queries. One thing I noticed is that the data on the cluster (Parquet format) explodes in size when cached. I understand this is due to deserializing and decoding the data. I am just wondering if there are any other options to reduce the memory footprint.
I tried using
sqlContext.cacheTable("table_name") and also
tbl.persist(StorageLevel.MEMORY_AND_DISK_SER)
But nothing is helping reduce the memory footprint
Perhaps you want to try ORC? There have been improvements in ORC support recently (more here: https://www.slideshare.net/Hadoop_Summit/orc-improvement-in-apache-spark-23-95295487). I am not an expert, but I heard that ORC uses an in-memory columnar format. This format gives opportunities for things like compressing via run-length encoding of repeated values, which tends to lower the memory footprint.
It also explodes when not caching.
Caching has nothing to do with reducing the memory footprint. You do not state RDD or DF, but I presume the latter. This gives an idea of the RDD memory footprint in Spark and the improvements for DFs / DSs: https://spoddutur.github.io/spark-notes/deep_dive_into_storage_formats.html.
You cannot reuse the data for different users. What you could consider is Apache Ignite. See https://ignite.apache.org/use-cases/spark/shared-memory-layer.html
I recently started working with Spark and came across a few questions that I still couldn't resolve.
Let's say I have a dataset of 100GB and the RAM size of the cluster is 16 GB.
Now, I know that simply reading the file and saving it to HDFS will work, as Spark will do it partition by partition. What happens when I perform a sort or an aggregation on the 100GB of data? How will it process 100GB in memory, since sorting needs the entire dataset?
I have gone through the link below, but it only covers what Spark does when persisting. What I am looking for is how Spark aggregates or sorts a dataset larger than RAM.
Spark RDD - is partition(s) always in RAM?
Any help is appreciated.
There are 2 things you might want to know.
Once Spark reaches the memory limit, it will start spilling data to disk. Please check this Spark FAQ; there are also several questions on SO about the same thing, for example, this one.
There is an algorithm called external sort that allows you to sort datasets which do not fit in memory. Essentially, you divide the large dataset into chunks that do fit in memory, sort each chunk, and write each chunk to disk. Finally, you merge the sorted chunks to get the whole dataset sorted. Spark supports external sorting as you can see here and here is the implementation.
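The chunk, sort, and merge steps described above can be sketched in plain Python. This is a minimal illustration of the general technique, not Spark's implementation; the function name, the chunk_size parameter, and the one-integer-per-line temporary file format are all assumptions made for the example.

```python
import heapq
import os
import tempfile


def external_sort(values, chunk_size):
    """Yield the integers in `values` in sorted order.

    At most `chunk_size` items are sorted in memory at once; each sorted
    chunk (a "run") is written to a temporary file, and the runs are then
    merged lazily with heapq.merge.
    """
    run_paths = []

    def write_run(chunk):
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        run_paths.append(path)

    # Phase 1: split the input into sorted runs on disk.
    chunk = []
    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            write_run(chunk)
            chunk = []
    if chunk:
        write_run(chunk)

    # Phase 2: k-way merge, reading one line per run at a time.
    files = [open(p) for p in run_paths]
    try:
        runs = [(int(line) for line in f) for f in files]
        yield from heapq.merge(*runs)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)


print(list(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3)))
```

Only one chunk is held in memory during the first phase, and only one pending item per run during the merge.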
To answer your question: your data does not need to fit in memory in order to be sorted, as explained above. Now, I would encourage you to think about an algorithm for data aggregation that divides the data into chunks, just like external sort does.
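One possible shape for such a chunked aggregation is sketched below in plain Python. It assumes the aggregate is a per-key sum and that the set of distinct keys (not the full dataset) fits in memory; the helper names are hypothetical, and this is not how Spark itself implements aggregation.

```python
from collections import Counter
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def aggregate_in_chunks(pairs, chunk_size):
    """Sum values per key while holding one chunk of input in memory.

    Each chunk is reduced to a partial result first, and the partial
    results are then merged into the running totals.
    """
    totals = Counter()
    for block in chunks(pairs, chunk_size):
        partial = Counter()
        for key, value in block:
            partial[key] += value
        totals.update(partial)  # Counter.update adds counts per key
    return dict(totals)


data = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
print(aggregate_in_chunks(data, chunk_size=2))
```

This works because a sum can be computed from partial sums; aggregates without that property need a different decomposition.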
There are multiple things you need to consider. Because you have 16GB of RAM and a 100GB dataset, it is a good idea to persist to disk. Aggregation may be difficult if the dataset has high cardinality. If the cardinality is low, you will be better off aggregating within each RDD before merging into the whole dataset. Also remember to make sure that each partition in the RDD is smaller than the available memory (default value 0.4*container_size).
I find that Apache Spark is much slower than a MySQL server for the same query on the same table loaded into a Spark data frame.
So where would Spark be more efficient than MySQL?
Note: I tried this on a table with 1 million rows and 10 columns, all of type text.
The size of the table in JSON is about 10GB.
I am using a standalone pyspark notebook on a 16-core Xeon with 64GB RAM, with MySQL on the same server.
In general I would like to know guidelines on when to use Spark vs a SQL server, in terms of the size of the target data, to get really snappy results from analytic queries.
OK, I am going to try to help here, even though it's still very difficult to answer this without knowing more. Assuming there is no contention for resources, there are a number of things going on here. If you're running on YARN and your JSON is stored in HDFS, it is likely split into many blocks, and those blocks are then processed in different partitions. Since JSON doesn't split very well, you lose a lot of parallelism. Also, Spark isn't really meant for super low latency queries like a tuned RDBMS. Where you benefit from Spark is heavy data processing on large amounts of data (TB or PB). If you are looking for low latency queries, you should use Impala or Hive with Tez. You should also consider changing your file format to Avro, Parquet, or ORC.
I am facing a dilemma and cannot decide which solution is going to be better for me. I have a very large table (a couple of hundred GB) and a couple of smaller ones (a couple of GB each). In order to create my data pipeline in Spark and use Spark ML, I need to join these tables and do a couple of GroupBy (aggregate) operations. Those operations were really slow for me, so I chose to try one of these two:
Use Cassandra and use indexing to speed up the GroupBy operations.
Use Parquet and partitioning based on the layout of the data.
I can say that Parquet partitioning is faster and more scalable, with less memory overhead than Cassandra. So the question is this:
If the developer infers and understands the data layout and the way it is going to be used, wouldn't it be better to just use Parquet, since you will have more control over it? Why should I pay the price for the overhead that Cassandra causes?
Cassandra is also a good solution for analytics use cases, but in a different way. Before you model your keyspaces, you have to know how you need to read the data. You can also use where and range queries, but only in a heavily restricted way. Sometimes you will hate these restrictions, but there are reasons for them. Cassandra is not like MySQL. In MySQL, performance is not the key feature; it's more about flexibility and consistency. Cassandra is a high-performance write/read database, better at writes than at reads. Cassandra also scales linearly.
Okay, a bit about your use case: Parquet is the better option for you. This is why:
You aggregate raw data on really large, unsplit datasets.
Your Spark ML job sounds like a scheduled job rather than a long-running one (once a week, once a day?).
This fits Parquet's use cases better. Parquet is a solution for ad-hoc analysis and filter-style analysis. Parquet is really nice if you need to run a query 1 or 2 times a month. Parquet is also a nice solution if a marketing guy wants to know one thing and the response time is not so important. In short:
Use Cassandra if you know the queries.
Use Cassandra if a query will be used in daily business.
Use Cassandra if real time matters (I am talking about a maximum of 30 seconds latency between a customer taking an action and the result showing up in my dashboard).
Use Parquet if real time doesn't matter.
Use Parquet if the query will not run 100x a day.
Use Parquet if you want to do batch processing.
It depends on your use case. Cassandra makes it much easier (also outside of Spark) to access your data with (limited) pseudo-SQL. That makes it a perfect fit for building online applications on top of it (e.g. to display the data in a UI).
Cassandra also makes things easier if you have to deal with updates, that is, when your data pipeline ingests not only new data (e.g. logs) but also has to handle updates (e.g. the system has to handle corrections of data).
When your use case is to do analytics with Spark (and you don't care about the topics mentioned above), it should be feasible and considerably cheaper to use Parquet/HDFS, as you've stated. With HDFS you also achieve data locality with Spark, and your analytic Spark applications may be even faster if you are reading large blocks of data.