2 million queries against a dataframe - apache-spark

I need to run 2 million queries against a three-column table t (s, p, o) whose size is 10 billion rows. The data type of each column is string.
Only two types of queries:
select s, p, o from t where s = param
select s, p, o from t where o = param
If I store the table in a PostgreSQL database, running all the queries takes 6 hours using a Java ThreadPoolExecutor.
Do you think Spark can speed up the query processing even more?
What would be the best strategy? These are my ideas:
Load the table into a dataframe and launch the queries against the dataframe.
Load the table into a parquet database and launch the queries against this database.
Use Spark 2.4 to launch queries against the Postgresql database instead of querying directly.
Use Spark 3.0 to launch queries against the database loaded into PG-Strom, an extension module of PostgreSQL with GPU support.
Thanks,

Using Apache Spark on top of the existing MySQL or PostgreSQL server(s) (without the need to export or even stream data to Spark or Hadoop) can increase query performance more than ten times. Using multiple MySQL servers (replication or Percona XtraDB Cluster) gives an additional performance increase for some queries. You can also use Spark's caching to keep the whole result table of a MySQL query in memory.
The idea is simple: Spark can read MySQL or PostgreSQL data via JDBC and can also execute SQL queries, so we can connect it directly to the databases and run the queries. Why is this faster? For long-running (i.e., reporting or BI) queries it can be much faster, because Spark is a massively parallel system. MySQL, for example, can use only one CPU core per query, whereas Spark can use all cores on all cluster nodes.
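For illustration, here is a minimal Java sketch of that setup. The JDBC URL, credentials, and table name are placeholders; the CACHE TABLE statement keeps the fetched rows in cluster memory so the repeated lookups do not hit the database again:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcReadExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jdbc-read")
                .getOrCreate();

        // Read the table over JDBC; url and credentials are placeholders.
        Dataset<Row> t = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://dbhost:5432/mydb")
                .option("dbtable", "t")
                .option("user", "user")
                .option("password", "secret")
                .load();

        // Cache the whole table in the cluster so repeated queries
        // do not go back to PostgreSQL.
        t.createOrReplaceTempView("t");
        spark.sql("CACHE TABLE t");

        Dataset<Row> bySubject =
                spark.sql("SELECT s, p, o FROM t WHERE s = 'param'");
        bySubject.show();
    }
}
```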
But I recommend NoSQL (HBase, Cassandra, ...) or NewSQL solutions for your analyses, because they perform better as the scale of your data increases.

Static Data? Spark; Otherwise tune Postgres
If the 10 billion rows are static or rarely updated, your best bet is Spark with appropriate partitioning. The magic happens through parallelization, so the more cores you have, the better. Aim for partitions of roughly half a gigabyte each.
Determine the size of the data by running SELECT pg_size_pretty(pg_total_relation_size('tablename')); then divide the result by the number of cores available to Spark, adjusting the partition count until each partition lands between 1/8 and 3/4 of a gigabyte.
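A minimal sketch of that arithmetic, assuming the relation size came back as roughly 1 TB (an invented figure):

```java
public class PartitionSizing {
    public static void main(String[] args) {
        // Suppose pg_total_relation_size('t') reported ~1 TB (assumed figure).
        long tableBytes = 1_099_511_627_776L;
        long targetBytes = 512L * 1024 * 1024;   // aim for ~0.5 GB per partition
        int numPartitions = (int) Math.ceil((double) tableBytes / targetBytes);
        System.out.println(numPartitions);       // 2048
        // Then spread the loaded DataFrame across them:
        // Dataset<Row> partitioned = t.repartition(numPartitions);
    }
}
```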
Save as Parquet if you really have static data or if you want to recover from a failure quickly.
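A sketch of the Parquet route, reusing the t DataFrame and numPartitions from the sketches above; the HDFS path is a placeholder:

```java
// Write once, then serve the point lookups from the columnar copy.
t.repartition(numPartitions)
 .write()
 .mode("overwrite")
 .parquet("hdfs:///data/t_parquet");          // placeholder path

Dataset<Row> tp = spark.read().parquet("hdfs:///data/t_parquet");
tp.createOrReplaceTempView("t_parquet");
spark.sql("SELECT s, p, o FROM t_parquet WHERE o = 'param'").show();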
If the source data are updated frequently, you're going to want to add indices in Postgres. It could be as straightforward as adding an index on each column. Partitioning in Postgres would also help.
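If you go that route, a plain-JDBC sketch of the two single-column indexes; connection details and index names are placeholders, and CREATE INDEX IF NOT EXISTS needs Postgres 9.5 or later:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddIndexes {
    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        try (Connection c = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost:5432/mydb", "user", "secret");
             Statement st = c.createStatement()) {
            // One index per looked-up column.
            st.execute("CREATE INDEX IF NOT EXISTS t_s_idx ON t (s)");
            st.execute("CREATE INDEX IF NOT EXISTS t_o_idx ON t (o)");
        }
    }
}
```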
Stick to Postgres. Newer databases are not appropriate for structured data such as yours. There are parallelization options, e.g., Aurora if you're on AWS.
PG-Strom is not going to work for you here. You have simple data with few columns. Getting them into and out of a GPU is going to slow you down too much.

Related

The mechanism behind the join operation between one local table and one DB table

When I register one table from a local RDD and one table from a DB, I found the join operation between the two tables was really slow.
The table from the DB is actually a SQL query with multiple joins, and the local RDD has only 20 records.
I am curious about the mechanism behind it.
Do we pull the data from the remote DB and execute all tasks in the local Spark cluster?
Or does Spark have an 'intelligent' SQL engine that sends an optimized query to the DB and waits for the result to come back? In my opinion that explanation does not fit, because the query executes really fast in the DB itself.
Spark SQL runs the query on its own side. When you define tables to join in a query, it first fetches the tables into the cluster and materializes them in memory as RDDs or DataFrames, and then runs tasks to perform the query operations.
In your example, suppose the first RDD is already in memory but the second needs to be fetched. The data source asks your SQL engine for the table. But before delivering it, since it is a query with multiple joins, your SQL engine runs the query on its side (separate from the Spark cluster) and delivers the results only when the table is ready. Suppose the SQL engine takes TA seconds to run the query (producing the full result, not the top rows you see in a SQL client), and moving the data to the Spark cluster (possibly over the network) takes TB seconds. After TA + TB seconds, the data is ready for Spark.
If TC is the time for the join operation, the total time will be total = TA + TB + TC. You need to check where your bottleneck is. I think TB may be the critical one for large data.
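One way to cut TB down is to push the multi-join SQL into the database yourself by passing it as a derived table in the JDBC dbtable option, so only the already-reduced result crosses the network. A sketch, where spark is an existing SparkSession, localDf stands for the 20-record local table, and the inner SQL, connection details, and join key id are all invented placeholders:

```java
// The inner SQL and the join key 'id' are invented placeholders.
Dataset<Row> dbResult = spark.read()
        .format("jdbc")
        .option("url", "jdbc:mysql://dbhost:3306/mydb")
        .option("dbtable",
                "(SELECT o.id, SUM(i.amount) AS total"
              + " FROM orders o JOIN items i ON i.order_id = o.id"
              + " GROUP BY o.id) AS q")
        .option("user", "user")
        .option("password", "secret")
        .load();

// Only the reduced result crossed the network; the join is now cheap.
Dataset<Row> joined = localDf.join(dbResult, "id");
```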
However, when using a cluster with two or more workers, make sure that all nodes are involved in the operation. Sometimes, because of wrong coding, Spark will use just one node for the work. Make sure your data is spread out over the cluster to benefit from data locality.

MySQL or Spark Processing of 400 GB data

If I use Spark in my case, based on blocks and cores, will it be useful?
I have 400 GB of data in a single table, user_events, with multiple columns in MySQL. This table stores all user events from the application. Indexes exist on the required columns. I have a user interface where users can try different permutations and combinations of fields under user_events.
Currently I am facing performance issues where a query takes 15-20 seconds, or even longer, or times out.
I have gone through a couple of Spark tutorials, but I am not sure it can help here. Per my understanding of Spark:
First, Spark has to bring all the data into memory. Bringing 100 million records over the network will be a costly operation, and I will need a lot of memory for it. Isn't that so?
Once the data is in memory, Spark can distribute it among partitions based on cores and input data size, then filter each partition in parallel. Here Spark can be beneficial, as it can do the work in parallel while MySQL is sequential. Is that correct?
Is my understanding correct ?

best failsafe strategy to store result of spark sql for structured streaming and OLAP queries

I would like to store the results of continuous queries running against streaming data in such a manner that they are persisted across distributed nodes, to ensure failover and scalability.
Can Spark SQL experts please shed some light on
- (1) which storage option I should choose so that OLAP queries are faster
- (2) how to ensure data is available for querying even if one node is down
- (3) internally, how does Spark SQL store the result set?
Thanks
Kaniska
It depends on what kind of latency you can afford.
One way is to persist the result into HDFS/Cassandra using the persist() API. If your data is small, then cache() on each RDD should give you a good result.
Store the data where your Spark executors are co-located; a minimal sketch follows below.
It is also possible to use memory-based storage like Tachyon (now Alluxio) to persist your stream (i.e., each RDD of the stream) and query against it.
If latency is not an issue, then persist(MEMORY_AND_DISK_2) should give you what you need. Mind you, performance is hit-or-miss in that scenario; this storage level also replicates the data on two executors.
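A minimal Java sketch of that option, using a socket stream as a stand-in source; host, port, and HDFS paths are placeholders:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class PersistStreamResults {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("persist-stream");
        JavaStreamingContext ssc =
                new JavaStreamingContext(conf, Durations.seconds(10));

        // Placeholder source; substitute your real continuous-query stream.
        JavaDStream<String> results = ssc
                .socketTextStream("localhost", 9999)
                .filter(line -> !line.isEmpty());

        // Replicate each result RDD on two executors so queries can
        // still be served if one node goes down.
        results.persist(StorageLevel.MEMORY_AND_DISK_2());

        // Also write each batch out to durable storage.
        results.foreachRDD(rdd -> rdd.saveAsTextFile(
                "hdfs:///results/" + System.currentTimeMillis()));

        ssc.start();
        ssc.awaitTermination();
    }
}
```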
In other cases, if your clients are more comfortable with an OLTP-like database where they just query the constantly updating result, you can use a conventional database like Postgres or MySQL. This is a preferred method for many, as query time is consistent and predictable. If the result is not update-heavy but is partitioned (say, by time), then Greenplum-like systems are also an option.

Is MySQL more efficient in query optimization and general efficiency than Apache Spark

I find that Apache Spark is much slower than a MySQL server for the same query, run against the same table loaded into a Spark DataFrame.
So where would Spark be more efficient than MySQL?
Note: tried on a table with 1 million rows, all 10 columns of type text.
The size of the table in JSON is about 10 GB.
Using a standalone PySpark notebook with a 16-core Xeon and 64 GB RAM, with MySQL on the same server.
In general, I would like guidelines on when to use Spark vs. a SQL server, in terms of the size of the target data, to get really snappy results from analytic queries.
OK, I am going to try to help here, even though it's still very difficult to answer this without knowing more. Assuming there is no contention for resources, a number of things are going on. If you're running on YARN and your JSON is stored in HDFS, it is likely split into many blocks, and those blocks are processed in different partitions. Since JSON doesn't split very well, you'd lose a lot of parallel capability. Also, Spark isn't meant for super-low-latency queries like a tuned RDBMS. Where you benefit from Spark is in heavy data processing over large amounts of data (TB or PB). If you are looking for low-latency queries, you should use Impala or Hive with Tez. You should also consider changing your file format to Avro, Parquet, or ORC.
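A sketch of that one-time format conversion, with placeholder paths and a made-up column name; after it, queries scan the columnar Parquet copy instead of re-parsing JSON:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonToParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("json-to-parquet")
                .getOrCreate();

        // Paths are placeholders. Reading line-delimited JSON forces a
        // full text parse; do it once, then keep the columnar copy.
        Dataset<Row> df = spark.read().json("hdfs:///data/events.json");
        df.write().mode("overwrite").parquet("hdfs:///data/events.parquet");

        // Subsequent queries scan only the referenced columns.
        // 'col1' is a hypothetical column name.
        Dataset<Row> events = spark.read().parquet("hdfs:///data/events.parquet");
        events.createOrReplaceTempView("events");
        spark.sql("SELECT col1, COUNT(*) FROM events GROUP BY col1").show();
    }
}
```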

spark datastax cassandra connector slow to read from heavy cassandra table

I am new to Spark and the Spark Cassandra Connector. We are trying Spark for the first time on our team, and we are using the Spark Cassandra Connector to connect to our Cassandra database.
I wrote a query that uses a heavy table of the database, and I saw that the Spark task didn't start until the query had fetched all the records from the table.
It is taking more than 3 hours just to fetch all the records from the database.
To get the data from the DB we use:
// Full-table scan as an RDD of CassandraRow
CassandraJavaRDD<CassandraRow> rdd = CassandraJavaUtil
    .javaFunctions(sparkContextManager.getJavaSparkContext(SOURCE).sc())
    .cassandraTable(keyspaceName, tableName);
Is there a way to tell Spark to start working even before all the data has finished downloading?
Is there an option to tell the spark-cassandra-connector to use more threads for the fetch?
thanks,
kokou.
If you look at the Spark UI, how many partitions is your table scan creating? I just did something like this, and I found that Spark was creating too many partitions for the scan, which made it take much longer. The way I decreased the job time was by setting the configuration parameter spark.cassandra.input.split.size_in_mb to a value higher than the default. In my case it took a 20-minute job down to about four minutes. There are also a couple more Cassandra-read-specific Spark variables that you can set, found here.
These stackoverflow questions are what I referenced originally, I hope they help you out as well.
Iterate large Cassandra table in small chunks
Set number of tasks on Cassandra table scan
EDIT:
After doing some performance testing and fiddling with some Spark configuration parameters, I found that Spark was creating far too many table partitions when I wasn't giving the Spark executors enough memory. In my case, upping the memory by a gigabyte was enough to render the input split size parameter unnecessary. If you can't give the executors more memory, you may still need to set spark.cassandra.input.split.size_in_mb higher as a workaround.
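Putting both knobs together, a sketch with illustrative values; the host name, split size, and memory figures are all assumptions to tune for your own cluster:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CassandraReadTuning {
    public static void main(String[] args) {
        // All values here are illustrative, not recommendations.
        SparkConf conf = new SparkConf()
                .setAppName("cassandra-read")
                .set("spark.cassandra.connection.host", "cassandra-host")
                // Larger splits mean fewer, bigger scan partitions.
                .set("spark.cassandra.input.split.size_in_mb", "256")
                // More executor memory also reduced the partition count
                // in the answer above.
                .set("spark.executor.memory", "4g");

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... then CassandraJavaUtil.javaFunctions(sc).cassandraTable(...)
        sc.stop();
    }
}
```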
