Spark HiveContext: INSERT OVERWRITE the same table it is read from

I want to apply SCD1 and SCD2 using PySpark with HiveContext. In my approach, I read the incremental data and the target table, then join them to perform an upsert. I call registerTempTable on all the source DataFrames. When I try to write the final dataset into the target table, I hit the problem that INSERT OVERWRITE is not allowed on a table that is also being read from.
Please suggest a solution for this. I do not want to write intermediate data into a physical table and read it back again.
Is there any property or other way to store the final dataset without keeping the dependency on the table it is read from? That way, it might be possible to overwrite the table.

You should never overwrite a table from which you are reading. It can result in anything between data corruption and complete data loss in case of failure.
It is also important to point out that a correctly implemented SCD2 should never overwrite a whole table and can be implemented as a (mostly) append operation. As far as I am aware, SCD1 cannot be efficiently implemented without mutable storage and is therefore not a good fit for Spark.
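To illustrate the "mostly append" idea, here is a minimal sketch in plain Python rather than Spark. The column names (customer_id, city, effective_from) and both helper functions are made up for illustration. It assumes that the current version of a record can be derived at read time as the row with the latest effective_from, so a change only ever adds a row.

```python
from datetime import date

# Hypothetical append-only history: each change is a new row, nothing is overwritten.
history = [
    {"customer_id": 1, "city": "Berlin", "effective_from": date(2020, 1, 1)},
    {"customer_id": 2, "city": "Paris", "effective_from": date(2020, 1, 1)},
]


def current_versions(rows):
    """Derive the current row per key as the one with the latest effective_from."""
    latest = {}
    for row in rows:
        key = row["customer_id"]
        if key not in latest or row["effective_from"] > latest[key]["effective_from"]:
            latest[key] = row
    return latest


def apply_increment(rows, increment):
    """Return the history plus only those incoming rows that are new or changed."""
    current = current_versions(rows)
    new_rows = [
        row for row in increment
        if row["customer_id"] not in current
        or current[row["customer_id"]]["city"] != row["city"]
    ]
    return rows + new_rows


increment = [
    {"customer_id": 1, "city": "Munich", "effective_from": date(2021, 6, 1)},  # changed
    {"customer_id": 2, "city": "Paris", "effective_from": date(2021, 6, 1)},   # unchanged
]
updated = apply_increment(history, increment)
```

In Spark terms the appended rows would be written with an append-mode write, so the target table is never overwritten while it is being read. If you need explicit end dates on old versions, that part still requires an update, which is why the operation is only "mostly" append.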

I was going through the Spark documentation and an idea occurred to me when I came across one of the properties there.
As my table was Parquet, I set this property to false so that the data is read through the Hive metastore SerDe instead of Spark's built-in Parquet support.
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
This solution is working fine for me.

Related

Spark: Is this wrong way to cache temp view?

I've seen the following code, and I think it is the wrong way to cache a temp view in Spark. What do you think?
spark.sql(
s"""
|...
""".stripMargin).createOrReplaceTempView(s"temp_view")
spark.table(s"temp_view").cache()
In my opinion, this code caches the DataFrame that I create with spark.table("temp_view"), but not the original temp view.
Am I right?
IMO yes, you are caching what you read from this table, but if, for example, you read it again on the next line, you will end up with a second scan.
Maybe you can try using CACHE TABLE within your SQL:
https://spark.apache.org/docs/latest/sql-ref-syntax-aux-cache-cache-table.html
CACHE TABLE statement caches contents of a table or output of a query
with the given storage level. If a query is cached, then a temp view
will be created for this query. This reduces scanning of the original
files in future queries.
It seems promising to me.
I think the caching in your example will actually work. Spark does not cache instances of DataFrame. Instead, it uses logical plans as the cache key, and the view is transparent for that purpose. For example, here's the code I've just tried using a local table I have:
val df = spark.table("mart.dim_region")
df.createOrReplaceTempView("dim_region")
spark.table("dim_region").cache()
Even though cache is applied to the view, if I repeatedly invoke df.show, the execution plan contains InMemoryTableScan, which is precisely the effect of caching.

Apply ForEach (Structured Streaming Programming) to move data from one delta table to another

Let's say I have a Delta table (call it Table1) that was produced by a stream using foreachBatch to apply transformations.
However, due to some requirements, the data in this table needs to be merged or appended into another Delta table (Table2), which is being updated by another stream.
My question is: how can I use foreach instead of foreachBatch in the new stream to save that data into Table2? Our requirements mean we need to append the data from Table1 to Table2 record by record, because with foreachBatch a failure in the process generates duplicate data and ends up breaking the stream.
Or is there another way to approach the problem without using it?
It is important to note that each table is a streaming table.
We have tried to implement this idea with two streams writing via foreachBatch, but we got errors and duplicates in different scenarios. First, because we need a surrogate key (an identity), the two streams fail.
We worked around that by using a staging table without the identity and applying it later, but if the stream fails at certain moments, foreachBatch generates duplicated data and breaks the whole process.
That's why we thought we could use foreach to append the data to Table2, but we have no idea how it works or how to implement it, because it must be record by record and we haven't found an example or anything else about how to implement it.
So any help would be appreciated.
If code is needed, I can try to provide it.
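For what it's worth, the duplicates described here are what you would expect from foreachBatch, since it only gives at-least-once guarantees and a micro-batch can be replayed after a failure. The function passed to foreachBatch also receives a batch id, which can be used to make the write idempotent. Below is a toy sketch of that idea in plain Python, not Spark or Delta code. target_rows, committed_batches and write_batch are hypothetical stand-ins for the target table, a commit log and the batch function.

```python
# Toy stand-ins for a target table and a small commit log (both hypothetical).
target_rows = []
committed_batches = set()


def write_batch(rows, batch_id):
    """Append a micro-batch once; a replay of the same batch_id is skipped."""
    if batch_id in committed_batches:
        return False  # this batch was already applied before the failure
    target_rows.extend(rows)
    committed_batches.add(batch_id)
    return True


first = write_batch([{"id": 1}, {"id": 2}], batch_id=0)
replayed = write_batch([{"id": 1}, {"id": 2}], batch_id=0)  # retry after a crash
second = write_batch([{"id": 3}], batch_id=1)
```

The sketch assumes that the data write and the record of the batch id happen atomically. In a real pipeline that has to be guaranteed by the sink (for example by committing both in one transaction), otherwise a crash between the two steps can still produce duplicates.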

How to solve the maximum view depth error in Spark?

I have a very long task that creates a bunch of views using Spark SQL and I get the following error at some step: pyspark.sql.utils.AnalysisException: The depth of view 'foobar' exceeds the maximum view resolution depth (100).
I have been searching in Google and SO and couldn't find anyone with a similar error.
I have tried caching the view foobar, but that doesn't help. I'm thinking of creating temporary tables as a workaround, as I would like not to change the current Spark Configuration if possible, but I'm not sure if I'm missing something.
UPDATE:
I tried creating tables in parquet format to reference tables and not views, but I still get the same error. I applied that to all the input tables to the SQL query that causes the error.
If it makes a difference, I'm using ANSI SQL, not the Python API.
Solution
Using Parquet tables worked for me after all. I spotted that I had still missed one table that needed to be persisted, which is why it wouldn't work.
So I changed my SQL statements from this:
CREATE OR REPLACE TEMPORARY VIEW `VIEW_NAME` AS
SELECT ...
To:
CREATE TABLE `TABLE_NAME` USING PARQUET AS
SELECT ...
This moves all the critical views to Parquet tables under spark_warehouse/ - or whatever you have configured.
Note:
This will write the table on the master node's disk. Make sure you have enough disk or consider dumping in an external data store like s3 or what have you. Read this as an alternative - and now preferred - solution using checkpoints.

Write to a datepartitioned Bigquery table using the beam.io.gcp.bigquery.WriteToBigQuery module in apache beam

I'm trying to write a dataflow job that needs to process logs located on storage and write them in different BigQuery tables. Which output tables are going to be used depends on the records in the logs. So I do some processing on the logs and yield them with a key based on a value in the log. After which I group the logs on the keys. I need to write all the logs grouped on the same key to a table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument as described in the documentation here
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter 2 main problems:
The CREATE_IF_NEEDED gives an error because it has to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step and if not create them.
If I load older data I get the following error:
"The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."
This seems like a limitation of streaming inserts. Is there any way to do batch inserts?
Maybe I'm approaching this wrong and should use another method.
Any guidance on how to tackle these issues is appreciated.
I'm using Python 3.5 and apache-beam==2.13.0.
That error message can be logged when one mixes the use of an ingestion-time partitioned table and a column-partitioned table (see this similar issue). Summarizing from the link, it is not possible to use column-based partitioning (as opposed to ingestion-time partitioning) and write to tables with partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the partition decorator when selecting the table (use "[prefix]_YYYYMMDD") and then make each individual table column-partitioned.
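As a rough sketch of the table selection, a callable like the one below could be passed as the table argument of WriteToBigQuery, which calls it with each element. The record fields (log_type, timestamp), the timestamp format and the project and dataset names are all assumptions for illustration. It uses str.format rather than f-strings because of Python 3.5.

```python
from datetime import datetime


def table_for(record, project="my-project", dataset="logs"):
    """Return a '[prefix]_YYYYMMDD' table spec with no '$' partition decorator."""
    # Assumed timestamp layout; adjust to whatever the logs actually contain.
    parsed = datetime.strptime(record["timestamp"], "%Y-%m-%dT%H:%M:%S")
    day = parsed.strftime("%Y%m%d")
    return "{}:{}.{}_{}".format(project, dataset, record["log_type"], day)


example = {"log_type": "access", "timestamp": "2019-03-22T10:15:00", "message": "GET /"}
table_name = table_for(example)
```

For the example record this yields my-project:logs.access_20190322, that is, a plain table name per prefix and day instead of a partition decorator such as table_name_x$20190322.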

Does registerTempTable cause the table to get cached?

I have a SQL query that does a GROUP BY on many fields. The table it uses is also big (4 TB in size). I'm registering the table as a temp table. However, I don't know whether the table gets cached when I register it as a temp table. I also don't know whether it would be more performant to convert my query into Scala function calls (e.g. df.groupby().aggr()...) rather than keeping it as a SQL statement. Any help on that?
SQL is most likely going to be the fastest by far (see the Databricks blog).
Did you try to partition/repartition your dataframe as well to see whether it improves the performance?
Regarding registerTempTable: it only registers the table within a spark context. You can check with the UI.
val test = List((1,2,3),(4,5,6)).toDF("bla","blb","blc")
test.createOrReplaceTempView("test")
test.show()
The Storage tab is blank
vs
val test = List((1,2,3),(4,5,6)).toDF("bla","blb","blc")
test.createOrReplaceTempView("test")
// createOrReplaceTempView returns Unit, so cache() must be called on the DataFrame itself
test.cache()
test.show()
By the way, registerTempTable is deprecated in Spark 2.0 and has been replaced by createOrReplaceTempView.
I have a SQL query that does a GROUP BY on many fields. The table it uses is also big (4 TB in size). I'm registering the table as a temp table. However, I don't know whether the table gets cached when I register it as a temp table.
Neither registerTempTable nor createOrReplaceTempView caches the data in memory or on disk by itself; that only happens if you call cache().
I also don't know whether it would be more performant to convert my query into Scala function calls (e.g. df.groupby().aggr()...) rather than keeping it as a SQL statement. Any help on that?
Keep in mind that the terms in a SQL query ultimately call the same functions internally, so it doesn't matter whether you use a SQL query or the functions available in code. They are the same thing.
