Spark SQL - Running twice

I came across Spark code for an ETL process in which long, complex SQL statements are written. The jobs hit OOM errors, and sometimes a single job takes 4 hours with multiple executions of the same code.
They have many ETL processes like this, and I have pasted an example query here with long, complex joins, nesting, aggregations, group by, order by, etc. Note that this is still not the full query.
Please look at the query below; they are using it as
sqlContext.sql(<the query below>).write.mode("append").insertInto(hiveTbl)
Is this the right way of utilizing Spark?
SELECT
above 30 joins with aggregations

I also had the same problem. Spark infers the column types by running the user's query wrapped so that it returns an empty result (check out https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala line 112, the getSchemaQuery function).
A simple query like (select ... from ...) where 1=0 would be super fast, but a query that contains joins will be as slow as the joins themselves.
Maybe you can use the "customSchema" option (https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html) to prevent Spark from inferring the result types.
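For reference, here is a minimal sketch of how the "customSchema" option is passed on a JDBC read (the connection URL, the joined subquery and the column types are placeholders, not taken from your job):
// spark is the usual SparkSession
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")   // placeholder connection
  .option("dbtable", "(select o.id, o.amount from orders o join order_lines l on o.id = l.order_id) t")
  .option("customSchema", "id BIGINT, amount DECIMAL(38, 2)")   // declare the result types up front
  .load()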

Did you try breaking the SQL into multiple fragments, perhaps one join per fragment? For example (a sketch follows below):
take one join and write its output to a Parquet file;
join that output with the next table and dump it to another Parquet file;
repeat until you have the final result.
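A rough sketch of that staging approach (the table names, paths and the target Hive table are placeholders, not taken from the real query):
// stage 1: first join, materialized to Parquet
spark.sql("SELECT a.*, b.col_b FROM table_a a JOIN table_b b ON a.id = b.id")
  .write.mode("overwrite").parquet("/tmp/etl/stage1")

// stage 2: join the materialized output with the next table
spark.read.parquet("/tmp/etl/stage1").createOrReplaceTempView("stage1")
spark.sql("SELECT s.*, c.col_c FROM stage1 s JOIN table_c c ON s.id = c.id")
  .write.mode("overwrite").parquet("/tmp/etl/stage2")

// ...repeat, then load the final stage into the Hive table
// (assumes the table already exists with a matching schema)
spark.read.parquet("/tmp/etl/stage2").write.mode("append").insertInto("hive_tbl")
Materializing each stage also gives you a natural restart point when one of the joins blows up.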
Joins in big data are distributed and rest on a lot of underlying assumptions. Incidentally, for your use case, prestodb seems to be the right fit.

Related

Incremental and parallel reads from an RDBMS in Spark using JDBC

I'm working on a project that involves reading data from an RDBMS using JDBC, and I have succeeded in reading the data. This is something I will be doing regularly, on a weekly basis, so I've been trying to come up with a way to ensure that after the initial read, subsequent reads only pull updated records instead of pulling the entire table.
I can do this with Sqoop incremental import by specifying the three parameters (--check-column, --incremental last-modified/append and --last-value). However, I don't want to use Sqoop for this. Is there a way I can replicate the same in Spark with Scala?
Secondly, some of the tables do not have a unique column that can be used as partitionColumn, so I thought of using a row_number function to add a unique column to those tables and then take the MIN and MAX of that column as lowerBound and upperBound respectively. My challenge now is how to dynamically pass these values into a read statement like the one below:
val queryNum = "select a1.*, row_number() over (order by sales) as row_nums from (select * from schema.table) a1"
val df = spark.read.format("jdbc").
  option("driver", driver).
  option("url", url).
  option("partitionColumn", row_nums).
  option("lowerBound", min(row_nums)).
  option("upperBound", max(row_nums)).
  option("numPartitions", some value).
  option("fetchsize", some value).
  option("dbtable", queryNum).
  option("user", user).
  option("password", password).
  load()
I know the above code is not right and might be missing a whole lot of processes but I guess it'll give a general overview of what I'm trying to achieve here.
It's surprisingly complicated to handle incremental JDBC reads in Spark. IMHO, it severely limits the ease of building many applications and may not be worth your trouble if Sqoop is doing the job.
However, it is doable. See this thread for an example using the dbtable option:
Apache Spark selects all rows
To keep this job idempotent, you'll need to read in the max row of your prior output, either by loading the data files directly or via a log file that you write out each time. If your data files are massive you may need the log file; if they are smaller, loading them directly could work.
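A rough sketch of that idea (the watermark column last_modified, the paths and the table name are assumptions made for illustration, not details from your setup):
import org.apache.spark.sql.functions.max

val outputPath = "/data/warehouse/my_table"

// 1. recover the high-water mark from what was written last time
val lastValue = spark.read.parquet(outputPath)
  .agg(max("last_modified"))
  .collect()(0)
  .getTimestamp(0)

// 2. push the predicate into the JDBC subquery so only new/changed rows are pulled
//    (format the literal the way your database expects)
val incrementalQuery = s"(select * from schema.table where last_modified > '$lastValue') t"

val updates = spark.read.format("jdbc")
  .option("driver", driver)        // same connection settings as in your snippet
  .option("url", url)
  .option("dbtable", incrementalQuery)
  .option("user", user)
  .option("password", password)
  .load()

// 3. append only the new slice next to the existing output
updates.write.mode("append").parquet(outputPath)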

Forcing pyspark join to occur sooner

PROBLEM: I have two tables that are vastly different in size. I want to join them on an id with a left outer join. Unfortunately, even after caching, my actions after the join are being executed on all records, even though I only want the ones that match the left table. See below:
MY QUESTIONS:
1. How can I set this up so only the records that match the left table get processed through the costly wrangling steps?
LARGE_TABLE => ~900M records
SMALL_TABLE => 500K records
CODE:
combined = SMALL_TABLE.join(LARGE_TABLE, SMALL_TABLE.id == LARGE_TABLE.id, 'left_outer')
print(combined.count())
...
...
# EXPENSIVE STUFF!
w = Window().partitionBy("id").orderBy(col("date_time"))
data = data.withColumn('diff_id_flag', when(lag('id').over(w) != col('id'), lit(1)).otherwise(lit(0)))
Unfortunately, my execution plan shows the expensive transformation operation above is being done on ~900M records. I find this odd since I ran df.count() to force the join to execute eagerly rather than lazily.
Any Ideas?
ADDITIONAL INFORMATION:
- note that the expensive transformation in my code flow occurs after the join (at least that is how I interpret it) but my DAG shows the expensive transformation occurring as a part of the join. This is exactly what I want to avoid as the transformation is expensive. I want the join to execute and THEN the result of that join to be run through the expensive transformation.
- Assume the smaller table CANNOT fit into memory.
The best way to do this is to broadcast the tiny dataframe. Caching is good for multiple actions, which doesn't seem to be applicable to your particular use case.
df.count has no effect on the execution plan at all. It is just an expensive operation executed without any good reason.
The window function application here requires the same logic as the join. Because you join by id and partitionBy id, both stages will require the same hash partitioning and a full data scan of both sides. There is no good reason to separate the two.
In practice the join logic should be applied before the window, serving as a filter for the downstream transformations in the same stage.
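For reference, a minimal sketch of what broadcasting looks like in PySpark (it assumes SMALL_TABLE actually fits in executor memory, which your last note says not to assume; the dataframe names are taken from the question):
from pyspark.sql.functions import broadcast

# inner join with the small side broadcast: only ids present in SMALL_TABLE survive,
# so the expensive window logic afterwards runs on the reduced result, not on ~900M rows
combined = LARGE_TABLE.join(broadcast(SMALL_TABLE), on="id", how="inner")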

Why is Spark still slower than MySQL?

I am trying to work with Apache Spark using MySQL as the data source. I have a cluster with 1 master and 1 slave node, both with 8 GB RAM and 2 cores. I am submitting my SQL query to Spark using spark-shell, and the table has 6,402,821 rows. I am performing a group by on that table. MySQL takes 5.2 seconds, while the same query through Spark takes 21 seconds. Why is this happening?
I am also setting some configurations like partitionColumn, upperBound, lowerBound and numPartitions, but still no change.
I have also tried executing the query with 1, 2 and 4 cores, but the time taken by Spark is still the same, 21 seconds.
Does this problem occur because my MySQL database is on a single machine and all Spark nodes try to query data from that single machine?
Can anyone help me solve this issue?
The database has a table called demo_call_stats, and the query I am running against it is:
val jdbcDF = spark.read.format("jdbc").options(Map(
  "url" -> "jdbc:mysql://192.168.0.31:3306/cmanalytics?user=root&password=",
  "zeroDateTimeBehavior" -> "convertToNull",
  "dbtable" -> "cmanalytics.demo_call_stats",
  "fetchSize" -> "10000",
  "partitionColumn" -> "newpartition",
  "lowerBound" -> "0",
  "upperBound" -> "4",
  "numPartitions" -> "4")).load()
jdbcDF.createOrReplaceTempView("call_stats")
val sqlDF = sql("select Count(*), classification_id from call_stats where campaign_id = 77 group by classification_id")
sqlDF.show()
Any help will be most appreciated.
Thanks
There are a couple of things you should understand here:
Despite what you might have heard, Spark isn't 'faster than MySQL', simply because this kind of generality doesn't mean anything.
Spark is faster than MySQL for some queries, and MySQL is faster than Spark for others.
Generally speaking, MySQL is a relational database, meaning it has been conceived to serve
as a back-end for an application. It is optimized to access records efficiently as long as they are indexed.
When thinking about databases, I like to think of them as a library with one librarian to help you get the books you want
(I am speaking about a very old school library here, without any computer to help the librarian).
If you ask your librarian:
"I want to know how many books you have that are about Geopolitics",
the librarian can go to the Geopolitics shelf and count the number of books on that shelf.
If you ask your librarian:
"I want to know how many books you have that have at least 500 pages",
the librarian will have to look at every single book in the library to answer your query.
In SQL this is called a full table scan.
Of course you can have several librarians (processors) working on the query to go faster,
but you cannot have more than a few of them (let's say up to 16) inside your library (computer).
Now, Spark has been designed to handle large volumes of data, namely libraries that are so big that they won't fit into a single building, and even if they did, there would be so many books that even 16 librarians would take days to look through them all to answer your second query.
What makes Spark faster than MySQL is just this: if you put your books in several buildings,
you can have 16 librarians per building working on your answer.
You can also handle a larger number of books.
Also, since Spark is mostly made to answer the second type of query rather than queries like "Please bring me 'The Picture of Dorian Gray' by Oscar Wilde", it doesn't bother, at least by default, to sort your books in any particular way.
This means that if you want to find that particular book with spark, your librarians will have
to go through the entire library to find it.
Of course, Spark uses many other types of optimizations to perform some queries more efficiently, but indexing is not one of them (if you are familiar with the notion of a primary key in MySQL, there is no such thing in Spark).
Other optimizations include storage formats like Parquet and ORC, which allow you to read only the columns that are needed to answer your queries, and compression (e.g. Snappy), which is aimed at increasing the number of books you can fit in your library without having to push out the walls.
I hope this metaphor helped you, but please bear in mind that this is just a metaphor and
doesn't fit reality perfectly.
Now, to get back to the specifics of your question:
Assuming campaign_id is your primary key or you created an index on this column, MySQL will only have
to read the rows for which campaign_id = 77.
On the other hand, Spark will have to ask MySQL to send all the rows in that table to Spark.
If Spark is clever, it will only ask for the ones with campaign_id = 77, and maybe it will send multiple queries to MySQL to fetch ranges in parallel.
But this means that all the data that MySQL could just read and aggregate will have to be serialized, sent to Spark, and be aggregated by Spark.
I hope you see why this should take longer.
If you want Spark to answer your queries faster than MySQL, you should try copying your table into another format, like this:
// replace this line :
// jdbcDF.createOrReplaceTempView("call_stats")
// with :
jdbcDF.write.format("orc").saveAsTable("call_stats")
Another thing you could try is caching your data like this:
jdbcDF.cache().createOrReplaceTempView("call_stats")
Caching won't bring any improvement for the first query, as it caches the data while performing it, but if you keep querying the same view, subsequent runs should be faster.
But as I explained above, this doesn't mean Spark will be faster than mySQL for everything.
For small data and local deployments, you can also get a performance improvement by changing this configuration parameter: spark.sql.shuffle.partitions=4 (the default is 200).
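For example (a one-line sketch; 4 is just the illustrative value from above):
spark.conf.set("spark.sql.shuffle.partitions", "4")   // default is 200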
Hope this helps.

Spark SQL - READ and WRITE in sequence or pipeline?

I am working on a cost function for Spark SQL.
While modelling the TABLE SCAN behaviour, I cannot work out whether READ and WRITE are carried out in a pipeline or in sequence.
Let us consider the following SQL query:
SELECT * FROM table1 WHERE columnA = 'xyz';
Each task:
1. Reads a data block (either locally or from a remote node)
2. Filters out the tuples that do not satisfy the predicate
3. Writes the remaining tuples to disk
Are (1), (2) and (3) carried out in sequence or in a pipeline? In other words, is the data block completely read first (all the disk pages composing it), then filtered, and then rewritten to disk, or are these activities pipelined (i.e. while the (n+1)-th tuple is being read, the n-th tuple can be processed and written)?
Thanks in advance.
Whenever you submit a job, the first thing Spark does is create a DAG (directed acyclic graph) for it.
After creating the DAG, Spark knows which tasks it can run in parallel, which tasks depend on the output of a previous step, and so on.
So, in your case, Spark will read your data in parallel (which you can see per partition) and filter it within each partition.
Since saving requires the filtering, Spark will wait for the filtering of at least one partition to finish and then start saving it.
After some more digging I found out that Spark SQL uses a so-called "volcano style pull model".
According to this model, a simple scan-filter-write query is executed as a pipeline and is fully distributed.
In other words, while a partition (HDFS block) is being read, the filter can already be applied to the rows read so far; there is no need to read the whole block to kick off the filtering. Writing is performed in the same streaming fashion.
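A small way to see this for yourself (paths and the column name are placeholders): run explain() on the filtered scan and the physical plan shows the scan and the filter planned into the same stage, so rows stream through the filter as they are read instead of after a full read of the block.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("scan-filter-write").getOrCreate()
import spark.implicits._

val filtered = spark.read.parquet("/data/table1").filter($"columnA" === "xyz")
filtered.explain()                                  // inspect the physical plan
filtered.write.parquet("/data/table1_filtered")     // each task writes its partition as it finishes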

Inner Join in cassandra CQL

How do I write subqueries/nested queries in Cassandra? Is this facility provided in CQL?
Example I tried:
cqlsh:testdb> select itemname from item where itemid = (select itemid from orders where customerid=1);
It just throws the following error -
Bad Request: line 1:87 no viable alternative at input ';'
Because of its distributed nature, Cassandra has no support for RDBMS style joins. You have a few options for when you want something like a join.
One option is to perform separate queries and then have your application join the data itself. This makes sense if the data is relatively small and you only have to perform a small number of queries. Based on the example you gave above, this would probably be a good solution for you (see the sketch below).
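In CQL terms, a sketch of that approach: your example becomes two separate statements, with the application feeding the result of the first into the second (the column names follow your statement; the schemas are assumed):
-- step 1: fetch the item ids for the customer
SELECT itemid FROM orders WHERE customerid = 1;

-- step 2: for each itemid returned, fetch the item name
SELECT itemname FROM item WHERE itemid = ?;   -- bind each id from step 1 in the application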
For more complicated joins, the usual strategy is to denormalize the data and store a materialized view of the join. The advantage is that fetching this data will be much faster than having to build the join in your application every time you need it. The cost is that you now have multiple places storing the same data, and you will need to keep them all in sync. You can either update all your views when new data comes into the system, or have a periodic batch job that rebuilds them.
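A sketch of what such a denormalized table might look like (the table name and column types are assumptions for illustration):
CREATE TABLE items_by_customer (
    customerid int,
    itemid     int,
    itemname   text,
    PRIMARY KEY (customerid, itemid)
);

-- the former "join" becomes a single partition read:
SELECT itemid, itemname FROM items_by_customer WHERE customerid = 1;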
You might find this article useful: Do You Really Need SQL to Do It All in Cassandra? It's a bit old, but its principles still apply.
