Select specific columns in a PySpark dataframe to improve performance - apache-spark

Working with Spark dataframes imported from Hive, sometimes I end up with several columns that I don't need. Supposing that I don't want to filter them with
df = sqlContext.sql('select cols from mytable')
and I'm importing the entire table with
df = sqlContext.table(mytable)
does a select and subsequent cache improve performance/decrease memory usage, like
df = df.select('col_1', 'col_2', 'col_3')
df.cache()
df.count()
or is it just a waste of time? I will do lots of operations and data manipulations on df, like avg, withColumn, etc.

IMO it makes sense to filter them beforehand:
df = SqlContext.sql('select col_1, col_2, col_3 from mytable')
so you won't waste resources.
If you can't do it that way, then doing it as you did is fine.

It is certainly a good practice, but it is rather unlikely to result in a performance boost unless you try to pass data through a Python RDD or do something similar. If certain columns are not required to compute the output, the optimizer should automatically infer the projections and push them as early as possible into the execution plan.
Also, it is worth noting that using df.count() after df.cache() will be useless most of the time (if not always). In general, count is rewritten by the optimizer as
SELECT SUM(1) FROM table
so what is typically requested from the source is:
SELECT 1 FROM table
Long story short, there is nothing useful to cache here.
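If you want to see that projection pushdown happen, you can check the plan yourself. A minimal sketch, reusing the table and column names from the question and assuming col_2 is numeric:
df = sqlContext.table("mytable").select('col_1', 'col_2', 'col_3')
df.groupBy('col_1').avg('col_2').explain()
# in the physical plan, the scan of mytable should only read the columns that are actually needed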

Related

Need to release the memory used by unused spark dataframes

I am not caching or persisting the Spark dataframe. If I have to do many additional things in the same session, aggregating and modifying the content of the dataframe as part of the process, when and how would the initial dataframe be released from memory?
Example:
I load a dataframe DF1 with 10 million records. Then I do some transformation on the dataframe, which creates a new dataframe DF2. Then there is a series of 10 steps I do on DF2. Throughout all of this, I do not need DF1 anymore. How can I be sure that DF1 is no longer held in memory, hampering performance? Is there any approach by which I can directly remove DF1 from memory? Or does DF1 get automatically removed based on a Least Recently Used (LRU) approach?
That's not how Spark works. Dataframes are lazy: the only things stored in memory are the structure and the list of transformations you have applied to your dataframes. The data itself is not stored in memory (unless you cache it and apply an action).
Therefore, I do not see any problem in your question.
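That said, if DF1 had been explicitly cached at some point, it can be released just as explicitly. A minimal sketch (df1 stands in for the DF1 from the question):
df1.unpersist()             # drop this dataframe's cached blocks from memory/disk
spark.catalog.clearCache()  # or drop everything cached in the current session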
Prompted by a question from A Pantola in the comments, I'm returning here to post a better answer to this question. Note there are MANY possible correct answers for how to optimize RAM usage, and the right one depends on the work being done!
First, write the dataframe to DBFS, something like this:
spark.createDataFrame(data=[('A', 0)], schema=['LETTERS', 'NUMBERS']) \
    .repartition("LETTERS") \
    .write.partitionBy("LETTERS") \
    .parquet(f"/{tmpdir}", mode="overwrite")
Now,
df = spark.read.parquet(f"/{tmpdir}")
Assuming you don't set up any caching on the above df, then each time Spark finds a reference to df it will read the parquet data in parallel and compute whatever is specified.
Note that the above solution will minimize RAM usage, but may require more CPU and I/O on every read. It also pays the one-time cost of writing the data out to parquet.
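As a quick sanity check that the re-read dataframe is indeed uncached (the assumption the approach above relies on), you can inspect its storage level, assuming Spark 2.1+ where DataFrame.storageLevel is available:
df = spark.read.parquet(f"/{tmpdir}")
print(df.storageLevel)   # an all-False storage level means nothing is cached
df.count()               # each action re-reads the parquet files in parallel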

Spark dataframe select using SQL without createOrReplaceTempView

I'd like to use SQL-like syntax on a Spark dataframe df.
Let's say I need a calculation
cal_col = 113.4*col1 +41.4*col2....
What I do at the moment is either:
1/ Broadcasting as temp view:
df.createOrReplaceTempView("df_view")
df = spark.sql("select *, 113.4*col1 +41.4*col2... AS cal_col from df_view")
Question: Is there a lot of overhead in broadcasting a big df as a view? If yes, at which point does it no longer make sense? Let's say df has 250 columns and 15 million records.
2/ PySpark dataframe syntax, which is a bit more difficult to read and needs the formula to be rewritten:
df = df.withColumn("cal_col", 113.4*F.col("col1") + 41.4*F.col("col2")+...)
The formula may be lengthy and become difficult to read.
Question: Is there a way to write this in SQL-like syntax, without F.col?
Something along the line
df = df.select("*, (113.4*col1 +41.4*col2...) as cal_col")
You can use df.selectExpr(...) to write SQL-like expressions against your dataframe. Note that each argument must be a single expression, so the star and the computed column go in separate strings:
df.selectExpr("*", "113.4*col1 + 41.4*col2... AS cal_col")
Also, instead of creating a view, a better way to do what you want is to df.persist() before your logic (keeping the dataframe in memory and spilling to disk by default) and then run your selectExpr on it.
Link: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.selectExpr
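For reference, a minimal runnable example of the selectExpr route (the toy data and column names are made up); F.expr gives you the same SQL parsing inside withColumn:
from pyspark.sql import functions as F

df = spark.createDataFrame([(1.0, 2.0)], ["col1", "col2"])
out_sql = df.selectExpr("*", "113.4*col1 + 41.4*col2 AS cal_col")
out_expr = df.withColumn("cal_col", F.expr("113.4*col1 + 41.4*col2"))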

Efficient pyspark join

I've read a lot about how to do efficient joins in pyspark. The ways to achieve efficient joins I've found are basically:
Use a broadcast join if you can. (I usually can't because the dataframes are too large)
Consider using a very large cluster. (I'd rather not because of $$$).
Use the same partitioner.
The last one is the one I'd rather try, but I can't find a way to do it in PySpark. I've tried:
df.repartition(numberOfPartitions, 'partition_col1', 'partition_col2')
but it doesn't help: it still takes way too long, until I stop it, because Spark gets stuck in the last few jobs.
So, how can I use the same partitioner in PySpark and speed up my joins, or even get rid of the shuffles that take forever? Which code do I need to use?
PS: I've checked other articles, even on Stack Overflow, but I still can't see code.
You can also use a two-pass approach, in case it suits your requirement. First, re-partition the data and persist it using partitioned tables (dataframe.write.partitionBy()). Then, join the sub-partitions serially in a loop, "appending" to the same final result table.
It was nicely explained by Sim; see the link below:
two pass approach to join big dataframes in pyspark
Based on the case explained above, I was able to join sub-partitions serially in a loop and then persist the joined data to a Hive table.
Here is the code.
from pyspark.sql.functions import *
emp_df_1.withColumn("par_id",col('emp_id')%5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_1")
emp_df_2.withColumn("par_id",col('emp_id')%5).repartition(5, 'par_id').write.format('orc').partitionBy("par_id").saveAsTable("UDB.temptable_2")
So, if you are joining on an integer emp_id, you can partition by the ID modulo some number. This way you redistribute the load across the Spark partitions, and records with similar keys are grouped together and reside on the same partition.
You can then loop over each sub-partition, read its data, join the two dataframes, and persist them together.
counter = 0
partition_count = 4
while counter <= partition_count:
    query1 = "SELECT * FROM UDB.temptable_1 WHERE par_id={}".format(counter)
    query2 = "SELECT * FROM UDB.temptable_2 WHERE par_id={}".format(counter)
    EMP_DF1 = spark.sql(query1)
    EMP_DF2 = spark.sql(query2)
    df1 = EMP_DF1.alias('df1')
    df2 = EMP_DF2.alias('df2')
    innerjoin_EMP = df1.join(df2, df1.emp_id == df2.emp_id, 'inner').select('df1.*')
    innerjoin_EMP.show()
    innerjoin_EMP.write.format('orc').insertInto("UDB.temptable")
    counter = counter + 1
I have tried this and it works fine. This is just an example to demo the two-pass approach; your join conditions may vary, and the number of partitions will depend on your data size.
Thank you #vikrantrana for your answer, I will try it if I ever need it. I say this because I found out the problem wasn't the 'big' joins; the problem was the amount of calculation prior to the join. Imagine this scenario:
I read a table and store it in a dataframe called df1. I read another table and store it in df2. Then I perform a huge amount of calculations and joins on both, and I end up with a join between df1 and df2. The problem here wasn't the size; the problem was that Spark's execution plan was huge, and it couldn't keep all the intermediate tables in memory, so it started writing to disk and took a very long time.
The solution that worked for me was to persist df1 and df2 to disk before the join (I also persisted other intermediate dataframes that were the result of big and complex calculations).
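A minimal sketch of that approach, assuming df1 and df2 already hold the expensive intermediate results and that emp_id is the join key (both names are placeholders). Persisting to disk means the heavy upstream calculations are computed once, and the join then reads the materialized partitions instead of re-running them:
from pyspark import StorageLevel

df1 = df1.persist(StorageLevel.DISK_ONLY)
df2 = df2.persist(StorageLevel.DISK_ONLY)
df1.count()   # actions to materialize the persisted data
df2.count()
result = df1.join(df2, on='emp_id', how='inner')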

Spark very slow performance with wide dataset

I have a small parquet file (7.67 MB) in HDFS, compressed with snappy. The file has 1300 rows and 10500 columns, all double values. When I create a data frame from the parquet file and perform a simple operation like count, it takes 18 seconds.
scala> val df = spark.read.format("parquet").load("/path/to/parquet/file")
df: org.apache.spark.sql.DataFrame = [column0_0: double, column1_1: double ... 10498 more fields]
scala> df.registerTempTable("table")
scala> spark.time(sql("select count(1) from table").show)
+--------+
|count(1)|
+--------+
| 1300|
+--------+
Time taken: 18402 ms
Can anything be done to improve performance of wide files?
Hey, glad you are here on the community.
count and show are costly operations in Spark, as they run over each and every record, so using them will always take a lot of time. Instead, you can write the results back to a file or a database to make things fast; if you just want to check the structure of the result, you can use df.printSchema().
A simple way to check whether a dataframe has rows is to do a Try(df.head). If it is a Success, there's at least one row in the dataframe; if it is a Failure, the dataframe is empty.
When operating on the data frame, you may want to consider selecting only those columns that are of interest to you (i.e. df.select(columns...)) before performing any aggregation. This may trim down the size of your set considerably. Also, if any filtering needs to be done, do that first as well.
I found this answer, which may be helpful to you:
Spark SQL is not well suited to processing wide data (more than about 1K columns). If possible, you can use a vector or map column to solve this problem.
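A hedged sketch of the "vector column" suggestion, assuming all of the columns are numeric (they are all doubles in the question) and using a made-up output path: packing them into a single vector column leaves Spark SQL with one column to plan around instead of ~10,500.
from pyspark.ml.feature import VectorAssembler

wide_df = spark.read.parquet("/path/to/parquet/file")
assembler = VectorAssembler(inputCols=wide_df.columns, outputCol="features")
narrow_df = assembler.transform(wide_df).select("features")
narrow_df.write.mode("overwrite").parquet("/path/to/narrow/file")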

Spark: Apply multiple transformations without recalculating or caching

Is it possible to take the output of a transformation (RDD/Dataframe) and feed it to two independent transformations without recalculating the first transformation and without caching the whole dataset?
Long version
Consider the case.
I have a very large dataset that doesn't fit in memory. Now I do some transformations on it which prepare the data to be worked on efficiently (grouping, filtering, sorting....):
DATASET --(TF1: transformation with group by, etc)--> DF1
DF1 --(TF2: more_transformations_some_columns)--> output
DF1 --(TF3: more_transformations_other_columns)--> output2
I was wondering if there is any way (existing or planned) to tell Spark that, after TF1, it must reuse the same results (at the partition level, without caching everything!) to serve both TF2 and TF3.
This can be conceptually imagined as a cache() on each partition, with an automatic unpersist() once the partition has been consumed by the downstream transformations.
I searched for a long time but couldn't find any way of doing it.
My attempt:
DF1 = spark.read()... .groupBy().agg()...
DF2 = DF1.select("col1").cache() # col1 fits in mem
DF3 = DF1.select("col1", transformation(other_cols)).write()... # Force evaluation of col1
Unfortunately, DF3 cannot figure out that it could reuse the cached col1. So apparently it isn't possible to ask Spark to cache only a few columns, which would already alleviate the problem.
Any ideas?
I don't think it is possible to cache just some of the columns,
but will this solve your problem?
DF1 = spark.read()... .groupBy().agg()...
DF3 = DF1.select("col1", transformation(other_cols)).cache()
DF3.write()
DF2 = DF3.select("col1")
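For completeness, a runnable sketch of that pattern with hypothetical column names and paths: the wider DF3 is cached once, written out (which materializes the cache), and DF2 is then derived from it without recomputing the groupBy:
from pyspark.sql import functions as F

df1 = (spark.read.parquet("/path/to/dataset")
       .groupBy("key")
       .agg(F.sum("col1").alias("col1"),
            F.collect_list("other_col").alias("other_cols")))
df3 = df1.select("col1", F.size("other_cols").alias("derived")).cache()
df3.write.mode("overwrite").parquet("/path/to/output2")  # materializes the cache
df2 = df3.select("col1")                                 # served from the cached df3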

Resources