Does Spark optimize the storage when a nested column is cached - apache-spark

I read a DataFrame from parquet and I want to cache it after selecting some nested structures.
df.select($"a.b.c" as "c").cache()
I know that the whole a column will be read from the input (Spark 2.5 should solve that: SPARK-17636), but I'm wondering whether the storage will be clever enough to keep only the result of the selection (and not the whole a).

Yes, only the result of the selection is cached, once an action is taken. The select statement returns a new DataFrame, and that DataFrame is what gets cached.
Note that in your code caching has not yet occurred, because no action has been taken. You would need to perform some action to populate the cache, for example:
df.select($"a.b.c" as "c").cache().count()

Related

Spark: Is this wrong way to cache temp view?

I've seen the following code and I think it is the wrong way to cache a temp view in Spark. What do you think?
spark.sql(
  s"""
     |...
   """.stripMargin).createOrReplaceTempView("temp_view")
spark.table("temp_view").cache()
In my opinion, this code caches the DataFrame that I create with spark.table("temp_view"), but not the original temp view.
Am I right?
IMO yes, you are caching what you read from this table, but if, for example, you read it again on the next line, you will end up with a second scan.
I think you could try using CACHE TABLE within your SQL instead:
https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-cache-table.html
CACHE TABLE statement caches contents of a table or output of a query
with the given storage level. If a query is cached, then a temp view
will be created for this query. This reduces scanning of the original
files in future queries.
To me this seems promising.
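A rough sketch of that suggestion (the SELECT body is a placeholder for the real query):
// Register the view as before.
spark.sql(
  s"""
     |SELECT id, amount FROM source_table
   """.stripMargin).createOrReplaceTempView("temp_view")
// CACHE TABLE is eager by default: it materializes the view's output right away.
spark.sql("CACHE TABLE temp_view")
spark.table("temp_view").count()   // served from the cache, no second scan
// When done, release the memory:
spark.sql("UNCACHE TABLE temp_view")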
I think the caching in your example will actually work. Spark does not cache instances of DataFrame; instead, it uses logical plans as the cache key, and the view is transparent for that purpose. For example, here's the code I've just tried using a local table I have:
val df = spark.table("mart.dim_region")
df.createOrReplaceTempView("dim_region")
spark.table("dim_region").cache()
Even though cache is applied to the view, if I repeatedly invoke df.show, the execution plan contains InMemoryTableScan - which is precisely the effect of caching.
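A quick way to see this, extending the snippet above (table name as in the answer):
val df = spark.table("mart.dim_region")
df.createOrReplaceTempView("dim_region")
spark.table("dim_region").cache()
spark.table("dim_region").count()    // materialize the cache
// Both queries resolve to the same logical plan, so both physical plans
// should contain InMemoryTableScan even though cache() was called on the view:
df.explain()
spark.table("dim_region").explain()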

Spark dataframe is being re-evaluated after a cache

I am running into some issues using cache on a Spark DataFrame. My expectation is that after a cache on a DataFrame, the DataFrame is created and cached the first time it is needed, and any further uses of the DataFrame should read from the cache.
Here's my code:
val mydf = spark.sql("read about 400 columns from a hive table")
  .withColumn("newcol", someudf($"existingcol"))
  .cache()
To test, I ran mydf.count() twice. I would expect the first count to take some time since the data is being cached, but shouldn't the second one be instantaneous?
What I am actually seeing is that both counts take the same time. The first one comes back pretty quickly, which I think tells me that the data was not cached. If I remove the withColumn part of the code and just cache the raw data, the second count is instantaneous.
Am I doing something wrong? How can I load raw data from Hive, add columns, and then cache the DataFrame for further use? I am using Spark 2.3.
Any help will be great!
The problem with your case is that mydf.count() does not actually materialize the DataFrame (i.e. not all columns are read, and your UDF will not be called). That is because count() is highly optimized.
To make sure the entire DataFrame is cached into memory, you should repeat your experiment with mydf.rdd.count() or another query (e.g. one using sorting and/or aggregation).
See e.g. this SO question
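For example, reusing mydf from the question (a sketch, not a definitive recipe):
// count() alone may be answered from a column-pruned scan; converting to an RDD
// forces every column, including the UDF, to be evaluated and fills the cache.
mydf.rdd.count()
// Subsequent queries are now served from the in-memory copy.
mydf.count()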
As you are caching a Dataset/DataFrame, see the documented default behavior:
def cache(): Dataset.this.type
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
So for your case you can try persist(MEMORY_ONLY)
def persist(newLevel: StorageLevel): Dataset.this.type
Persist this Dataset with the given storage level.
newLevel One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
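A hedged sketch of that suggestion, reusing the placeholder query and UDF from the question:
import org.apache.spark.storage.StorageLevel
// Same pipeline as in the question, but with an explicit MEMORY_ONLY level
// instead of the MEMORY_AND_DISK default that cache() uses.
val mydf = spark.sql("select /* ~400 columns */ * from the_hive_table")
  .withColumn("newcol", someudf($"existingcol"))
  .persist(StorageLevel.MEMORY_ONLY)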
If it's relevant:
cache/persist is lazily evaluated; to force eager caching you can use Spark SQL's CACHE TABLE, which can switch from lazy to eager.
CACHE [ LAZY ] TABLE table_identifier
[ OPTIONS ( 'storageLevel' [ = ] value ) ] [ [ AS ] query ]
Unless LAZY is specified, it runs in eager mode; you need to register a temp view before running it.
In code:
df.createOrReplaceTempView("dummyTbl")
spark.sql("cache table dummyTbl")
More in the documentation: https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-cache-table.html

Spark 1.6 Dataframe cache not working correctly

My understanding is that if I cache() a DataFrame and trigger an action like df.take(1) or df.count(), it should compute the DataFrame and save it in memory, and whenever that cached DataFrame is used later in the program, it should come from the cache rather than being recomputed.
But that is not how my program is working.
I have a DataFrame like the one below, which I am caching, and then I immediately run a df.count action.
val df = inputDataFrame.select().where().withColumn("newcol", "").cache()
df.count
When I run the program, in the Spark UI I see that the first line runs for 4 minutes, and when it comes to the second line it runs for another 4 minutes - basically the first line is computed twice.
Shouldn't the first line be computed and cached when the second line triggers the action?
How do I resolve this behavior? I am stuck, please advise.
My understanding is that if I cache() a DataFrame and trigger an action like df.take(1) or df.count(), it should compute the DataFrame and save it in memory
That is not correct. A simple cache and count (take wouldn't work on RDD either) is a valid method for RDDs, but it is not the case with Datasets, which use much more advanced optimizations. With the query:
df.select(...).where(...).withColumn("newcol", "").count()
any column that is not used in the where clause can be ignored.
There is an important discussion on the developer list and quoting Sean Owen
I think the right answer is "don't do that" but if you really had to you could trigger a Dataset operation that does nothing per partition. I presume that would be more reliable because the whole partition has to be computed to make it available in practice. Or, go so far as to loop over every element.
Translated to code:
df.foreach(_ => ())
There is
df.registerTempTable("df")
sqlContext.sql("CACHE TABLE df")
which is eager, but it is no longer documented (Spark 2 and forward) and should be avoided.
No, if you call cache on a DataFrame it is not cached at that moment; it is only "marked" for potential future caching. The actual caching only happens when an action follows later. You can also see your cached DataFrame in the Spark UI under "Storage".
Another problem in your code is that count on a DataFrame does not compute the entire DataFrame, because not all columns need to be computed for that. You can use df.rdd.count() to force the entire evaluation (see How to force DataFrame evaluation in Spark).
The question is why your first operation takes so long even though no action is called. I think this is related to the caching logic (e.g. size estimations etc.) being computed when calling cache (see e.g. Why is rdd.map(identity).cache slow when rdd items are big?).
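A small self-contained illustration of both points (synthetic data, not the asker's Hive table):
import org.apache.spark.sql.functions.lit
// cache() only marks the plan; nothing appears under Storage yet.
val df = spark.range(1000000).withColumn("newcol", lit("x")).cache()
df.explain()      // the plan already mentions InMemoryRelation, but it is not populated
// The first action that reads every column fills the cache:
df.rdd.count()
// This count is now answered from memory, and the data shows up under Storage.
df.count()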

switch from collect() to avoid java OOM when loading large table from Oracle to Hadoop

BACKGROUND:
I was trying to load a huge table from Oracle into Hadoop and store it in Parquet format.
Since the table takes too long to load, the Oracle session expired. So my new approach is to load the table in parallel, using ranges of the index column in the where clause.
I used the following code to create an array of lists to be used in loops later, but it seems there will be an OOM when the index column "indexUsed" has too many records. Is there any way to avoid using collect()?
distinctKeyValDf.sort("indexUsed").collect().map(_(0)).toList.sliding(recordNoInBatch, recordNoInBatch).toArray
This is what I am working on now, but it seems the type is List[Nothing] instead of List[Any]:
df.select(collect_list("indexUsed")).sort().first().getList(0).sliding(recordNoInBatch, recordNoInBatch).toArray
Latest update:
I was looking for a sliding function for DataFrames but nothing came up, so I used collect() to turn the DataFrame into an array. That overloaded the Java heap, so I switched to using collect_list on the DataFrame, but the result comes back as a list of Nothing, which is not what I am looking for. I ended up registering a temp table for the index column and writing a for-loop plus a do-while loop as a workaround for the sliding function.

How to Identify the list of available RDDs?

I am using the command below to get the list of available registered temp tables:
sqlContext.sql("show tables").collect().foreach(println)
Is there any similar command to get the list of available RDDs?
Here is my requirement (using Scala):
1. Create some RDDs on the fly
2. Identify the list of available RDDs
3. Remove/delete/clear the unwanted RDDs and move forward
How to delete an RDD in PySpark for the purpose of releasing resources?
An additional note: I went through this link, but it doesn't answer all my questions. I also tried the code below, but I don't see any difference before and after unpersist, so I am not sure how to confirm that my RDD has actually released its memory.
val tempRDD1 = RDD1.reduceByKey((acc, value) => acc + value)
tempRDD1.collect.foreach(println)
tempRDD1.unpersist()
tempRDD1.collect.foreach(println)
RDD data is not saved until it is (1) persisted (cached) and (2) an action occurs that forces the preceding transformations to run. If either of these does not happen, no data will be stored. Any RDD that appears to be "created" just records a plan to produce the data if it is needed later. This model is called lazy evaluation.
In your example, no RDD is ever cached, so no data is ever stored in memory, and the unpersist call has no effect.
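If the goal is to see which RDDs are currently persisted, SparkContext.getPersistentRDDs returns exactly that map (it does not track every RDD ever created, only persisted ones). A small sketch, assuming a SparkSession named spark:
val sc = spark.sparkContext
val tempRDD1 = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .reduceByKey(_ + _)
  .cache()
tempRDD1.count()                       // action that actually populates the cache
// id -> RDD for every RDD that is currently persisted
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"RDD $id: storage level = ${rdd.getStorageLevel}")
}
tempRDD1.unpersist()                   // releases the cached blocks
println(sc.getPersistentRDDs.size)     // 0 again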
