As I am just starting out in the Big Data field, I am looking for advice on the most efficient way to get some data into Spark in order to analyze it.
The SQL query is rather large, with multiple sub-queries, each with its own "when", "group by", etc.
The final data would have somewhere between 1 million and 20 million rows.
Is it the same thing (performance-wise) if I run a Spark SQL query and save it into a dataframe using pyspark, or if I extract each subquery into a different Spark dataframe and use Spark to do the grouping / filtering / etc.?
For example, are these two methods equivalent in the amount of resources / time they use to process my data?
method 1:
df_final = spark.sql("""
WITH subquery_1 AS (...),
subquery_2 AS (...),
subquery_3 AS (...),
...
SELECT * FROM subquery_n
""")
method 2:
df1 = spark.sql("subquery 1")
df2 = spark.sql("subquery 2")
...
df_final = <spark manipulation of the dataframes here>
I would appreciate any advice. Thanks
Spark will create a DAG that should be equivalent in both cases, so performance should be the same.
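One way to check this yourself is to compare the plans Spark generates for the two approaches with explain (a small sketch in Scala, assuming an existing SparkSession named spark; the same explain() call is available on PySpark dataframes):
import org.apache.spark.sql.functions._
// Build the same aggregation once via SQL with a CTE and once via the DataFrame API.
val viaSql = spark.sql("""
  WITH sub AS (SELECT id, id % 10 AS grp FROM range(100))
  SELECT grp, count(id) AS cnt FROM sub GROUP BY grp
""")
val viaDf = spark.range(100)
  .withColumn("grp", col("id") % 10)
  .groupBy("grp")
  .agg(count(col("id")).as("cnt"))
// If the optimized / physical plans printed here match, the two styles
// will consume the same resources at runtime.
viaSql.explain(true)
viaDf.explain(true)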
Related
I want to group all items in a source based on a specified, pre-defined category. The number of items per category could be in the order of millions. groupBy helps me achieve this, but I want to understand whether repartitioning on the product type before grouping would be more efficient.
The source for the Spark jobs is Hive tables. The Spark version is the latest 2.4.4. The problem statement for me is that I want to run a customised similarity algorithm for every item against every other item in a given category. So, by the end of this operation, for every item I would have the 10 most similar items to it.
Since this involves a groupBy operation, and since groupBy involves shuffling of data, I thought I would first repartition the data based upon the category. I can even set the number of partitions to the number of categories that I have (in the magnitude of 100s).
Once the data is re-partitioned and sent to individual workers, running groupBy should be a local operation, if I do the groupBy on the same type. Is this assumption correct?
// For demo, I am reading from CSV. The final source is a hive table
Dataset<Row> rows = spark.read().option("sep", "\t")
        .csv("<some path>")
        .repartition(20, new Column("category"))
        .cache();
Dataset<Row> ids_grouped_by_category = rows.map((MapFunction<Row, Row>) items -> {
            // Some transformation returns a row in the format I need.
            return newRow; // placeholder for the transformed row
        }, <encoder>)
        .groupBy(functions.col("category"))
        .agg(functions.collect_list("category").as("ids"));
At the end of this operation, I have been able to group all item-ids for a given category into a list. Something like this:
+---------------------------+------------------------------------------+
|category | ids |
+---------------------------+------------------------------------------+
|category-1 | [id1, id2...] |
|category-2 | [idx, idy...] |
+---------------------------+------------------------------------------+
I have been able to get the data in the format I need, but I wanted to understand: is this way of doing a group-by correct?
Also, what are the implications of doing a collect_list operation? Does it load everything in memory?
I am using Hive with Spark 1.6.3
I have a large dataset (40,000 rows, 20 columns or so, and each column contains maybe 500 bytes - 3 KB of data).
The query is a join across 3 datasets.
I wish to be able to page the final joined dataset, and I have found that I can use row_number() OVER (ORDER BY 1) to generate a unique row number for each row in the dataset.
After this I can do
SELECT * FROM dataset WHERE row between 1 AND 100
However, there are resources which advise not to use ORDER BY, as it puts all the data into one partition (I can see this is the case in the logs, where the shuffle is moving the data to one partition), and when this happens I get out-of-memory exceptions.
How would I go about paging through the dataset in a more efficient way?
I have enabled persist with MEMORY_AND_DISK, so that if a partition is too large it will spill to disk (and for some of the transformations I can see that at least some of the data spills to disk when I am not using row_number()).
One strategy could be to select only the unique key of the dataset first and apply the row_number function on that alone. Since you are selecting a single column from a large dataset, the chances are higher that it will fit in a single partition.
val dfKey = df.select("uniqueKey")
dfKey.createOrReplaceTempView("dfKey")
val dfWithRowNum = spark.sql("select uniqueKey, row_number() over (order by 1) as row_number from dfKey")
// save dfWithRowNum
After completing the row_number operation on the uniqueKey, save that dataframe. Now, in the next stage, join this dataframe with the bigger dataframe and append the row_number column to it.
dfOriginal.createOrReplaceTempView("dfOriginal")
dfWithRowNum.createOrReplaceTempView("dfWithRowNum")
val joined = spark.sql("select dfOriginal.*, dfWithRowNum.row_number from dfOriginal join dfWithRowNum on dfOriginal.uniqueKey = dfWithRowNum.uniqueKey")
// save joined
Now you can query
SELECT * FROM joineddataset WHERE row_number BETWEEN 1 AND 100
For the persist, with MEMORY_AND_DISK I found that it occasionally fails with insufficient memory. I would rather use DISK_ONLY, where performance is penalized but execution is guaranteed.
Well, you can apply this method to your final joined dataframe.
You should also persist the dataframe to a file to guarantee the ordering, as re-evaluation could create a different order.
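A minimal sketch of both points, reusing the joined dataframe from above (the output path is just a placeholder):
import org.apache.spark.storage.StorageLevel
// DISK_ONLY trades speed for predictability: nothing is kept in executor memory.
joined.persist(StorageLevel.DISK_ONLY)
// Writing the numbered result out freezes the row_number assignment, so later
// paging queries read a stable ordering instead of a re-evaluated one.
joined.write.mode("overwrite").parquet("<some path>/joined_with_row_number")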
Suppose there is a dataset with some number of rows.
I need to find out the heterogeneity, i.e.
the number of distinct rows divided by the total number of rows.
Please help me with a Spark query to compute this.
Dataset and DataFrame support the distinct function, which finds the distinct rows in the dataset.
So essentially you need to do
val heterogeneity = dataset.distinct.count.toDouble / dataset.count
The only thing is that if the dataset is big, the distinct could be expensive and you might need to set the Spark shuffle partitions correctly.
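For example (a sketch, assuming an existing SparkSession named spark; 400 is just an illustrative value, the default is 200):
// Increase shuffle parallelism before the distinct, which triggers a shuffle.
spark.conf.set("spark.sql.shuffle.partitions", "400")
val heterogeneity = dataset.distinct.count.toDouble / dataset.count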
I am working on a Spark Streaming application, where I partition the data as per a certain id in the data.
For example: partition 0 -> contains all data with id 100
partition 1 -> contains all data with id 102
Next, I want to execute a query on the whole dataframe for the final result, but my query is specific to each partition.
For example, I need to run
select(col1 * 4) in the case of partition 0
while
select(col1 * 10) in the case of partition 1.
I have looked into the documentation but didn't find any clue. One solution I have is to create different RDDs / dataframes for the different ids in the data, but that is not scalable in my case.
Any suggestion on how to run a query on a dataframe where the query can be specific to each partition?
Thanks
I think you should not couple your business logic with Spark's way of partitioning your data (you won't be able to repartition your data if required). I would suggest adding an artificial column to your DataFrame that equals the partition id value.
In any case, you can always do
df.rdd.mapPartitionsWithIndex { (partId, iter: Iterator[Row]) => ... }
See also the docs.
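A sketch of the artificial-column idea using the built-in spark_partition_id function (the column name "pid", the result column, and the multipliers are taken from the example in the question or chosen for illustration):
import org.apache.spark.sql.functions._
// Materialize the current partition id as an ordinary column...
val tagged = df.withColumn("pid", spark_partition_id())
// ...then express the per-partition logic as column expressions, keeping the
// business rule out of the physical partitioning itself.
val result = tagged.withColumn("col1_scaled",
  when(col("pid") === 0, col("col1") * 4)
    .when(col("pid") === 1, col("col1") * 10)
    .otherwise(col("col1")))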
Dataframe A (millions of records): among its columns are create_date and modified_date.
Dataframe B (500 records) has start_date and end_date.
Current approach:
Select a.*, b.* from a join b on a.create_date between start_date and end_date
The above job takes half an hour or more to run.
How can I improve the performance?
The DataFrame API currently doesn't have an approach for direct joins like that. It will fully read both tables before performing the join.
https://issues.apache.org/jira/browse/SPARK-16614
You can use the RDD API to take advantage of the joinWithCassandraTable function
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
As others have suggested, one approach is to broadcast the smaller dataframe. This can also be done automatically by configuring the parameter below.
spark.sql.autoBroadcastJoinThreshold
If the dataframe's size is smaller than the value specified here, Spark automatically broadcasts it and performs a broadcast join instead of a shuffle join. You can read more about this here.
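For example (a sketch; dfA and dfB stand in for dataframes A and B from the question, and the 50 MB threshold is only illustrative):
import org.apache.spark.sql.functions.broadcast
// Raise the automatic broadcast threshold (in bytes); the default is 10 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)
// Or force a broadcast join explicitly with a hint on the small dataframe.
val result = dfA.join(
  broadcast(dfB),
  dfA("create_date").between(dfB("start_date"), dfB("end_date")))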