Is MySQL more efficient than Apache Spark in query optimization and general efficiency?

I find that Apache Spark is much slower than a MySQL server when I run the same query against the same table loaded as a Spark DataFrame.
So where would Spark be more efficient than MySQL?
Note: I tried this on a table with 1 million rows and 10 columns, all of type text.
The size of the table in JSON is about 10 GB.
I am using a standalone PySpark notebook on a 16-core Xeon with 64 GB RAM, with MySQL on the same server.
In general, I would like guidelines on when to use Spark vs. a SQL server, in terms of the size of the target data, to get really snappy results from analytic queries.

OK, I'm going to try to help here, even though it's still very difficult to answer this without knowing more. Assuming there is no contention for resources, a number of things are going on. If you're running on YARN and your JSON is stored in HDFS, it is likely split into many blocks, and those blocks are then processed in different partitions. Since JSON doesn't split very well, you'd lose a lot of parallelism. Also, Spark isn't really meant for the super-low-latency queries that a tuned RDBMS handles well. Where you benefit from Spark is heavy data processing on large amounts of data (TB or PB). If you are looking for low-latency queries, you should use Impala or Hive with Tez. You should also consider changing your file format to Avro, Parquet or ORC.
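To illustrate the point about JSON not splitting well, here is a minimal sketch in plain Python (not Spark, and not how Spark implements its readers). It assumes two layouts of the same hypothetical records: one JSON object per line, which can be cut at line boundaries and parsed chunk by chunk, and one large pretty-printed JSON document, where an arbitrary cut leaves a fragment that cannot be parsed on its own. The helper names are made up for the example.

```python
import json


def split_json_lines(text, n_chunks):
    """Cut line-delimited JSON into roughly equal chunks at line boundaries."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division, at least 1
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def parses(fragment):
    """Return True if the fragment is a complete JSON value on its own."""
    try:
        json.loads(fragment)
        return True
    except ValueError:
        return False


# Hypothetical sample records, used only for this illustration.
records = [{"id": i, "name": f"row{i}"} for i in range(10)]

# Layout 1: one JSON object per line. Each chunk can be parsed independently.
json_lines = "\n".join(json.dumps(r) for r in records)
chunks = split_json_lines(json_lines, 4)
parsed = [json.loads(line) for chunk in chunks for line in chunk]

# Layout 2: a single multi-line JSON document. Half of it is not valid JSON.
single_doc = json.dumps(records, indent=2)
first_half_ok = parses(single_doc[: len(single_doc) // 2])

print(len(chunks), parsed == records, first_half_ok)
```

A columnar format such as Parquet or ORC avoids the problem differently, by storing data in independently readable blocks, which is one reason the answer above suggests switching formats.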

Related

Differences in Execution between Hive and Spark

All: I am looking for someone with more knowledge to check my understanding of Hive and Spark.
I have been researching different large scale database solutions and I am trying to understand the difference in execution between Hive and Spark. I attempted to install Hadoop, Hive, and Spark to see how they perform. I was able to get Hadoop and Spark to work. I was unable to get Hive to work.
When I ran queries in Spark after they passed through the optimizer, it seems that the biggest advantage is that only the relevant table data is selected from the source at the earliest inception. So if I only needed Table1.columns(A,B,C) in the final answer, but told the system to JOIN Table1 & Table2 on (Table1.A=Table2.B) it immediately reduces the carried table to only the relevant items...I do not think Hive performs that way. I believe it will do the full join and perform the reduction later.
There are also differences in memory storage (Hive going back to HDFS frequently vs. Spark keeping things in RAM). This has both advantages and disadvantages depending on the data set and query.
Unfortunately because I cannot get Hive to run, my theory is based off of reading outputs of other people running things in Hive.
I think Hive and Spark originally had different goals, and their execution styles are based on those goals.
Apache Spark is a framework that allows you to do calculations on big datasets stored on HDFS.
Hive is a SQL interface for retrieving data stored in HDFS and in other clustered and object-store filesystems (S3 is an example) in a structured way.
Spark keeps things in RAM because it is more focused on making calculations with the data sets. Hive is more focused on retrieving data in a structured way, so it does not focus on speed that much (that being said, there have been improvements in Hive, like LLAP, that are meant to improve performance).
I like to use analogies with traditional software tools. On one side, you can have a relational database, and on the other side, a programming language. They overlap in some functionality (you can write to and read from disk with the programming language, and you can do some calculations with the SQL engine). However, if the task at hand requires intensive and complex calculations, you would probably use the programming language. If you are looking for a system that lets you store data in a structured way, you would go for the SQL engine.
Hive on Tez and Spark both use RAM (memory) for operating on data. The number of partitions computed, which are treated as individual tasks, can be quite different between Hive on Tez and Spark. Hive on Tez by default tries to use a combiner to merge certain splits into a single partition. Hive on Tez seems to handle cluster autoscaling better than Spark, and it works most of the time. Spark doesn't work well with autoscaling: it tends to hit a lot of shuffle errors and fail when there are multiple stages. But given a fixed-size cluster, Spark seems to perform better than Hive on Tez; this could be attributed to some of the optimizations done and to how shuffle, serialization, etc. are implemented.

2 million queries against a dataframe

I need to run 2 million queries against a three-column table t (s, p, o) with 10 billion rows. The data type of each column is string.
Only two types of queries:
select s, p, o from t where s = param
select s, p, o from t where o = param
If I store the table in a PostgreSQL database, running the queries with a Java ThreadPoolExecutor takes 6 hours.
Do you think Spark can speed up the query processing even more?
What would be the best strategy? These are my ideas:
Load the table into a dataframe and launch the queries against the dataframe.
Load the table into Parquet files and launch the queries against them.
Use Spark 2.4 to launch queries against the PostgreSQL database instead of querying it directly.
Use Spark 3.0 to launch queries against the database loaded into PG-Strom, an extension module of PostgreSQL with GPU support.
Thanks,
Using Apache Spark on top of the existing MySQL or PostgreSQL server(s) (without the need to export or even stream data to Spark or Hadoop) can increase query performance more than ten times. Using multiple MySQL servers (replication or Percona XtraDB Cluster) gives an additional performance increase for some queries. You can also use the Spark cache function to cache the whole result table of a MySQL query.
The idea is simple: Spark can read MySQL or PostgreSQL data via JDBC and can also execute SQL queries, so we can connect it directly to the databases and run the queries. Why is this faster? For long-running (i.e., reporting or BI) queries, it can be much faster because Spark is a massively parallel system. For example, MySQL can only use one CPU core per query, whereas Spark can use all cores on all cluster nodes.
But I recommend you use NoSQL (HBase, Cassandra,...) or NewSQL solutions for your analyses, because they perform better as the scale of your data increases.
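For the JDBC approach described in this answer, a partitioned read might be configured roughly as follows. This is only a sketch: the option keys (url, dbtable, partitionColumn, lowerBound, upperBound, numPartitions) are Spark's JDBC data source options, but the connection URL, credentials, bounds and the numeric `id` column are all hypothetical. Spark partitions a JDBC read on a numeric, date or timestamp column, which the table t (s, p, o) from the question does not have. The Spark calls themselves are left as comments so the snippet runs without a cluster.

```python
def jdbc_read_options(url, table, user, password,
                      partition_column, lower_bound, upper_bound,
                      num_partitions):
    """Build the option map for a partitioned Spark JDBC read."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        # Spark issues one query per partition, each covering a slice of
        # the [lowerBound, upperBound) range of this column.
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


# All values below are placeholders, not taken from the question.
options = jdbc_read_options(
    url="jdbc:postgresql://localhost:5432/mydb",
    table="t",
    user="spark_reader",
    password="change-me",
    partition_column="id",      # hypothetical numeric key column
    lower_bound=0,
    upper_bound=10_000_000_000,
    num_partitions=16,
)

# With a SparkSession named `spark` and the JDBC driver on the classpath:
# df = spark.read.format("jdbc").options(**options).load()
# df.cache()  # keep the result in memory for repeated queries
```

Without the partitioning options Spark reads the table through a single connection, so the parallelism argument made above would not apply to the read itself.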
Static Data? Spark; Otherwise tune Postgres
If the 10 billion rows are static or rarely updated, your best bet is going to be using Spark with appropriate partitions. The magic happens with parallelization, so the more cores you have, the better. You want to aim for partitions that are about half a gig in size each.
Determine the size of the data by running SELECT pg_size_pretty(pg_total_relation_size('tablename')); then divide the result by the number of cores available to Spark until each partition is between 1/8 and 3/4 GB.
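The sizing rule above can be sketched as a small calculation. This is one possible reading of it, not the author's exact procedure: it assumes the partition count is kept at a multiple of the core count and is increased until each partition is at most 3/4 GB. The function name and the example figures (100 GiB, 16 cores) are made up for illustration.

```python
GIB = 1024 ** 3
MIN_PARTITION_BYTES = GIB // 8       # 1/8 gig, lower end of the target range
MAX_PARTITION_BYTES = 3 * GIB // 4   # 3/4 gig, upper end of the target range


def pick_partition_count(total_bytes, cores):
    """Smallest multiple of `cores` that brings each partition to <= 3/4 GiB."""
    if total_bytes <= 0 or cores <= 0:
        raise ValueError("total_bytes and cores must be positive")
    multiple = 1
    while total_bytes / (cores * multiple) > MAX_PARTITION_BYTES:
        multiple += 1
    return cores * multiple


# Example: a table reported as 100 GB by pg_total_relation_size, 16 cores.
total = 100 * GIB
partitions = pick_partition_count(total, 16)
partition_bytes = total / partitions
print(partitions, round(partition_bytes / GIB, 3))
```

For a small table the loop never runs and the result is simply the core count, in which case partitions may end up below the 1/8 GB lower bound; that is expected when there is not enough data to fill every core.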
Save as parquet if you really have static data or if you want to recover from a failure quickly.
If the source data are updated frequently, you're going to want to add indices in Postgres. It could be as straightforward as adding an index on each column. Partitioning in Postgres would also help.
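As a small, self-contained illustration of the indexing advice, the sketch below uses Python's built-in sqlite3 module as a stand-in for Postgres. That substitution is an assumption made only so the example runs without a database server; the CREATE INDEX statements are the same in spirit, but the planner output shown is SQLite's, not Postgres's. The table layout t (s, p, o) and the two query shapes come from the question; the index names and sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (s TEXT, p TEXT, o TEXT)")

# Invented sample rows: 1000 triples with repeating subjects and objects.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(f"s{i % 100}", f"p{i % 7}", f"o{i % 50}") for i in range(1000)],
)

# One index per filtered column, matching the two query types.
conn.execute("CREATE INDEX idx_t_s ON t (s)")
conn.execute("CREATE INDEX idx_t_o ON t (o)")


def query_plan(sql, param):
    """Return SQLite's plan description for a parameterized query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (param,)).fetchall()
    return " ".join(row[-1] for row in rows)


plan_s = query_plan("SELECT s, p, o FROM t WHERE s = ?", "s1")
plan_o = query_plan("SELECT s, p, o FROM t WHERE o = ?", "o1")
rows_s = conn.execute("SELECT s, p, o FROM t WHERE s = ?", ("s1",)).fetchall()

print(plan_s)
print(plan_o)
print(len(rows_s))
```

In Postgres the equivalent check would be EXPLAIN on the same queries after creating the indices, to confirm that each lookup uses an index rather than a sequential scan.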
Stick to Postgres. Newer databases are not appropriate for structured data such as yours. There are parallelization options. Aurora, if you're on AWS.
PG-Strom is not going to work for you here. You have simple data with few columns. Getting them into and out of a GPU is going to slow you down too much.

Impala vs Spark performance for ad hoc queries

I'm interested only in query performance and the architectural differences behind it. All the answers I've seen before were outdated or didn't give me enough context on WHY Impala is better for ad hoc queries.
Of the 3 considerations below, only the 2nd explains why Impala is faster on bigger datasets. Could you please comment on the following statements?
Impala doesn't lose time on query pre-initialization, meaning the impalad daemons are always running and ready. On the other hand, Spark Job Server provides a persistent context for the same purpose.
Impala is in-memory and can spill data to disk, with a performance penalty, when the data doesn't fit in RAM. The same is true for Spark. The main difference is that Spark is written in Scala and has JVM limitations, so workers bigger than 32 GB aren't recommended (because of GC). In turn, [wrong, see UPD] Impala is implemented in C++ and has high hardware requirements: 128-256+ GB of RAM recommended. This is very significant, but should benefit Impala only on datasets that require 32-64+ GB of RAM.
Impala is integrated with the Hadoop infrastructure. AFAIK the main reason to use Impala over other in-memory DWHs is the ability to run over Hadoop data formats without exporting data from Hadoop. This means Impala usually uses the same storage/data/partitioning/bucketing that Spark can use, and does not gain any extra benefit from the data structure compared to Spark. Am I right?
P.S. Is Impala faster than Spark in 2019? Have you seen any performance benchmarks?
UPD:
Questions update:
I. Why does Impala recommend 128+ GB of RAM? What is the implementation language of each Impala component? The docs say that "Impala daemons run on every node in the cluster, and each daemon is capable of acting as the query planner, the query coordinator, and a query execution engine.". If impalad is Java, then which parts are written in C++? Is there something between impalad and the columnar data? Are 256 GB of RAM required for impalad or for some other component?
II. Impala loses all its in-memory performance benefits when it comes to cluster shuffles (JOINs), right? Does Impala have any mechanisms to boost JOIN performance compared to Spark?
III. Impala uses a Multi-Level Service Tree (something like the Dremel engine, see "Execution model" here) vs. Spark's Directed Acyclic Graph. What does MLST vs. DAG actually mean in terms of ad hoc query performance? Or is it a better fit for a multi-user environment?
First off, I don't think comparison of a general purpose distributed computing framework and distributed DBMS (SQL engine) has much meaning. But if we would still like to compare a single query execution in single-user mode (?!), then the biggest difference IMO would be what you've already mentioned -- Impala query coordinators have everything (table metadata from Hive MetaStore + block locations from NameNode) cached in memory, while Spark will need time to extract this data in order to perform query planning.
The second biggie would probably be the shuffle implementation, with Spark writing temp files to disk at stage boundaries versus Impala trying to keep everything in memory. This leads to a radical difference in resilience: while Spark can recover from losing an executor and move on by recomputing missing blocks, Impala will fail the entire query after a single impalad daemon crash.
Less significant performance-wise (since it typically takes much less time compared to everything else) but architecturally important is the work distribution mechanism: compiled whole-stage codegen sent to the workers in Spark vs. declarative query fragments communicated to daemons in Impala.
As far as specific query optimization techniques (query vectorization, dynamic partition pruning, cost-based optimization) -- they could be on par today or will be in the near future.

Parquet vs Cassandra using Spark and DataFrames

I have come to a dilemma: I cannot decide which solution is going to be better for me. I have one very large table (a couple of hundred GB) and a couple of smaller ones (a couple of GB each). In order to create my data pipeline in Spark and use Spark ML, I need to join these tables and do a couple of GroupBy (aggregate) operations. Those operations were really slow for me, so I chose to do one of these two:
Use Cassandra and use indexing to speed up the GroupBy operations.
Use Parquet and partitioning based on the layout of the data.
I can say that Parquet partitioning works faster and is more scalable, with less memory overhead than Cassandra uses. So the question is this:
If the developer infers and understands the data layout and the way it is going to be used, wouldn't it be better to just use Parquet, since you will have more control over it? Why should I pay the price for the overhead that Cassandra causes?
Cassandra is also a good solution for analytics use cases, but in a different way. Before you model your keyspaces, you have to know how you need to read the data. You can also use where and range queries, but in a heavily restricted way. Sometimes you will hate this restriction, but there are reasons for it. Cassandra is not like MySQL. In MySQL, performance is not the key feature; it's more about flexibility and consistency. Cassandra is a high-performance write/read database, better at writes than at reads. Cassandra also scales linearly.
Okay, a bit about your use case: Parquet is the better option for you. This is why:
You aggregate raw data on really large datasets that are not split up
Your Spark ML job sounds like a scheduled job, not a long-running one (once a week, once a day?)
This fits the use cases of Parquet better. Parquet is a solution for ad hoc analysis and filter-style analysis. Parquet is really nice if you need to run a query 1 or 2 times a month. Parquet is also a nice solution if a marketing guy wants to know one thing and the response time is not so important. In short:
Use Cassandra if you know the queries.
Use Cassandra if a query will be used in a daily business
Use Cassandra if real time matters (I'm talking about a maximum of 30 seconds latency from the moment a customer takes an action until I can see the result in my dashboard)
Use Parquet if real time doesn't matter
Use Parquet if the query will not run 100x a day.
Use Parquet if you want to do batch processing stuff
It depends on your use case. Cassandra makes it much easier (also outside of Spark) to access your data with (limited) pseudo-SQL. That makes it a perfect fit for building online applications on top of it (e.g. to display the data in a UI).
Cassandra also makes things easier if you have to deal with updates, that is, when your data pipeline not only ingests new data (e.g. logs) but also has to handle updates (e.g. the system has to handle corrections of data).
When your use case is to do analytics with Spark (and you don't care about the topics mentioned above), it should be feasible and considerably cheaper to use Parquet/HDFS, as you've stated. With HDFS you also get data locality with Spark, and your analytic Spark applications may be even faster if you are reading large blocks of data.

Spark Poor Query performance: How to improve query performance on Spark?

There is a lot of hype over how good and fast Spark is at processing large amounts of data.
So we wanted to investigate the query performance of Spark.
Machine configuration:
4 worker nodes, r3.2xlarge instances
Data
Our input data is stored in S3, split into 12 gzip files.
What we did
We created a table using Spark SQL for the aforementioned input data set.
Then we cached the table. We found from the Spark UI that Spark did not load all the data into memory; rather, it loaded some data into memory and some onto disk.
UPDATE: We also tested with Parquet files. In this case, all data was loaded into memory. Then we executed the same queries as below. Performance is still not good enough.
Query Performance
Let's assume the table name is Fact_data. We executed the following query on that cached table:
select date_key,sum(value) from Fact_data where date_key between 201401 and 201412 group by date_key order by 1
The query takes 1268.93 sec to complete. This is huge compared to the execution time in Redshift (dc1.large cluster), which is only 9.23 sec.
I also tested some other queries, e.g. count, join, etc. Spark is giving me really poor performance for each of them.
Questions
Could you suggest anything that might improve the performance of the query? Maybe I am missing some optimization techniques. Any suggestion will be highly appreciated.
How can I compel Spark to load all the data into memory? Currently it stores some data in memory and some on disk.
Is there any performance difference between using a DataFrame and a SQL table? I think not, because under the hood they use the same optimizer.
I suggest you use Parquet as your file format instead of gzipped files.
You can try increasing your --num-executors, --executor-memory and --executor-cores.
If you're using YARN and your instance type is r3.2xlarge, make sure your container size, yarn.nodemanager.resource.memory-mb, is larger than your --executor-memory (maybe around 55G). You also need to set yarn.nodemanager.resource.cpu-vcores to 15.
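The memory constraint mentioned above can be checked with a little arithmetic. The sketch below rests on two assumptions that are not stated in the answer: that the executor's YARN container needs --executor-memory plus an off-heap overhead of 10% with a 384 MB minimum (Spark's documented default for executor memory overhead in recent versions; older versions used a different fraction), and that the "around 55G" figure is the value of yarn.nodemanager.resource.memory-mb. The function names are invented for the example.

```python
MIN_OVERHEAD_MB = 384     # assumed minimum executor memory overhead
OVERHEAD_DIVISOR = 10     # assumed overhead of 10% of executor memory


def container_request_mb(executor_memory_mb):
    """Executor heap plus the assumed overhead, in MB."""
    ten_percent = -(-executor_memory_mb // OVERHEAD_DIVISOR)  # ceiling division
    return executor_memory_mb + max(MIN_OVERHEAD_MB, ten_percent)


def fits_on_node(executor_memory_mb, nodemanager_memory_mb):
    """True if one executor container fits in the NodeManager's memory."""
    return container_request_mb(executor_memory_mb) <= nodemanager_memory_mb


node_mb = 55 * 1024  # assumed yarn.nodemanager.resource.memory-mb (55G)

for executor_gb in (48, 55):
    executor_mb = executor_gb * 1024
    print(executor_gb, container_request_mb(executor_mb),
          fits_on_node(executor_mb, node_mb))
```

Under these assumptions a 48G executor fits on a 55G node, while setting --executor-memory to the full 55G does not, because the overhead pushes the container request above what the NodeManager offers.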
