How to solve the maximum view depth error in Spark? - apache-spark

I have a very long task that creates a bunch of views using Spark SQL and I get the following error at some step: pyspark.sql.utils.AnalysisException: The depth of view 'foobar' exceeds the maximum view resolution depth (100).
I have been searching on Google and SO and couldn't find anyone with a similar error.
I have tried caching the view foobar, but that doesn't help. I'm thinking of creating temporary tables as a workaround, as I would prefer not to change the current Spark configuration if possible, but I'm not sure if I'm missing something.
UPDATE:
I tried creating Parquet tables so that the query references tables instead of views, but I still get the same error. I applied that to all the input tables of the SQL query that causes the error.
If it makes a difference, I'm using ANSI SQL, not the Python API.

Solution
Using Parquet tables worked for me after all. I spotted that I was still missing one table to persist, which is why it didn't work at first.
So I changed my SQL statements from this:
CREATE OR REPLACE TEMPORARY VIEW `VIEW_NAME` AS
SELECT ...
To:
CREATE TABLE `TABLE_NAME` USING PARQUET AS
SELECT ...
This moves all the critical views to Parquet tables under spark_warehouse/ (or whatever warehouse directory you have configured).
Note:
This will write the tables to the master node's disk. Make sure you have enough disk space, or consider dumping to an external data store such as S3. As an alternative, and now preferred, solution, use checkpoints instead; a minimal sketch follows.
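A minimal PySpark sketch of that checkpoint-based approach, assuming a deeply nested view named some_deep_view and a scratch checkpoint directory (both illustrative, not from the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # or an s3:// path

deep_df = spark.sql("SELECT * FROM some_deep_view")  # illustrative deep view
# checkpoint() materializes the data and truncates the logical plan, so views
# built on top of it no longer count the whole upstream chain of views.
checkpointed = deep_df.checkpoint(eager=True)
checkpointed.createOrReplaceTempView("VIEW_NAME")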

Related

spark sql databricks - transaction log error after optimize

I have two tables, written like this:
f_em.write.format('delta').mode("overwrite").saveAsTable('rens.f_em')
f_dial.write.format('delta').mode("overwrite").saveAsTable('rens.f_dial')
These tables work fine. I can query them. However, they are large (ca. 11 billion rows), so to enhance performance, I want to optimize them.
%sql
optimize rens.f_em
zorder by (RKNR)
and
%sql
optimize rens.f_dial
zorder by (rknr)
I have no clue how optimize exactly works or what zorder by exactly does. I used the optimize function before on another table, and simply put the attribute I use most for linking/joining in the zorder by clause. That enhanced performance significantly, so I tried the same approach here.
After running the optimize statement, I cannot query from the tables any longer:
For one of the tables I receive this error after a simple select statement
You are trying to read from `dbfs:/user/hive/warehouse/rens.db/f_em` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to read from the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://learn.microsoft.com/azure/databricks/delta/index
;
and for the other table, this error:
Error in SQL statement: FileNotFoundException: dbfs:/user/hive/warehouse/rens.db/f_dial/_delta_log/00000000000000000000.json: Unable to reconstruct state at version 2 as the transaction log has been truncated due to manual deletion or the log retention policy (delta.logRetentionDuration=30 days) and checkpoint retention policy (delta.checkpointRetentionDuration=2 days)
Just guessing: check the path the new query resolves. It looks for the data in the path
dbfs:/user/hive/warehouse/rens.db/f_em
But most likely you saved the table to:
dbfs:/user/hive/warehouse/rens.f_em
This might be due to the dot notation in your saveAsTable('rens.f_em').
In the SQL query, the dot is interpreted by the SQL API as a database qualifier (database rens, table f_em), not as part of a Delta table name rens.f_em.
EDIT: Given your reply, I would like to propose the following workaround, which I personally always use and favor for its robustness.
table_dir = "/path/to/table"
f_em.write.format('delta').mode("overwrite").save(f"{table_dir}/f_em")
spark.sql("CREATE DATABASE IF NOT EXISTS rens")
spark.sql(f"CREATE TABLE rens.f_em USING DELTA LOCATION '{table_dir}/f_em'")
spark.sql(f"OPTIMIZE delta.`{table_dir}/f_em` ZORDER BY (RKNR)")

Chaining spark sql queries over temporary views?

We're exploring the possibility of using temporary views in Spark and relating them to actual file storage, or to other temporary views. We want to achieve something like:
1. Some user uploads data to some S3/HDFS file storage.
2. A (temporary) view is defined so that Spark SQL queries can be run against the data.
3. Some other temporary view (referring to some other data) is created.
4. A third temporary view is created that joins data from (2) and (3).
5. By selecting from the view from (4), the user gets a table that reflects the joined data from (2) and (3). This result can be further processed and stored in a new temporary view, and so on.
So we end up with a tree of temporary views, each querying its parent temporary views until they finally load data from the filesystem. Basically, we want to store transformation steps (selecting, joining, filtering, modifying, etc.) on the data without storing new versions. The Spark SQL support and temporary views seem like a good fit.
We did some successful testing. The idea is to store the specification of these temporary views in our application and recreate them during startup (as temporary or global views).
Not sure if this is a viable solution? One problem is that we need to know how the temporary views are related (which one queries which). We create them like:
sparkSession.sql("select * from other_temp_view").createTempView(name)
So, when this is run, we have to make sure that other_temp_view has already been created in the session. Not sure how this can be achieved. One idea is to store a timestamp and recreate them in the same order. This could be OK since our views will most likely have to be "immutable". We're not allowed to change a query that other queries rely on.
Any thoughts would be most appreciated.
I would definitely go with the SessionCatalog object:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-SessionCatalog.html
You can access it with spark.sessionState.catalog
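From PySpark, the public Catalog wrapper (spark.catalog) covers the common case of checking which temporary views already exist before recreating a dependent one. A minimal sketch, with illustrative view names and paths:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def temp_view_exists(name):
    # listTables() also returns temporary views, flagged with isTemporary=True
    return any(t.name == name and t.isTemporary for t in spark.catalog.listTables())

# Parent view backed by files (path is illustrative).
spark.read.parquet("/data/uploads/source_a").createOrReplaceTempView("source_a")

# Child view: only recreate it once its parent is resolvable.
if temp_view_exists("source_a"):
    spark.sql("SELECT * FROM source_a").createOrReplaceTempView("source_a_latest")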

Write to a datepartitioned Bigquery table using the beam.io.gcp.bigquery.WriteToBigQuery module in apache beam

I'm trying to write a dataflow job that needs to process logs located on storage and write them in different BigQuery tables. Which output tables are going to be used depends on the records in the logs. So I do some processing on the logs and yield them with a key based on a value in the log. After which I group the logs on the keys. I need to write all the logs grouped on the same key to a table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery transform with a callable as the table argument, as described in the documentation.
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter 2 main problems:
CREATE_IF_NEEDED gives an error because it has to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step, creating them if they don't.
If I load older data, I get the following error:
The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."
This seems like a limitation of streaming inserts; is there any way to do batch inserts?
Maybe I'm approaching this wrong, and should use another method.
Any guidance on how to tackle these issues is appreciated.
I'm using Python 3.5 and apache-beam==2.13.0.
That error message can be logged when one mixes the use of an ingestion-time partitioned table with a column-partitioned table (see this similar issue). Summarizing from the link, it is not possible to use column-based partitioning (as opposed to ingestion-time partitioning) and write to tables with partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the partition decorator when selecting the table (use "[prefix]_YYYYMMDD" rather than "[prefix]$YYYYMMDD") and have each individual table be column-partitioned.
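A hedged sketch of that approach (the project, dataset, bucket, schema, and field names are illustrative; newer Beam releases also expose a method='FILE_LOADS' argument on WriteToBigQuery that sidesteps the streaming-insert window, but check whether your SDK version supports it):

import json
import apache_beam as beam

def pick_table(record):
    # Assumes each record carries a 'log_date' field like '2019-03-22'.
    return 'my-project:my_dataset.logs_' + record['log_date'].replace('-', '')

with beam.Pipeline() as p:
    (p
     | 'ReadLogs' >> beam.io.ReadFromText('gs://my-bucket/logs/*.json')
     | 'Parse' >> beam.Map(json.loads)
     | 'Write' >> beam.io.WriteToBigQuery(
           table=pick_table,  # date-suffixed table name, not a $ decorator
           schema='log_date:DATE,message:STRING',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))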

Spark HiveContext : Insert Overwrite the same table it is read from

I want to apply SCD1 and SCD2 using PySpark with HiveContext. In my approach, I read the incremental data and the target table. After reading, I join them for the upsert. I call registerTempTable on all the source DataFrames. When I try to write the final dataset into the target table, I face the issue that an insert overwrite is not possible into the table it is read from.
Please suggest some solution for this. I do not want to write intermediate data into a physical table and read it again.
Is there any property or way to store the final dataset without keeping the dependency on the table it is read from? That way, it might be possible to overwrite the table.
Please suggest.
You should never overwrite a table from which you are reading. It can result in anything between data corruption and complete data loss in case of failure.
It is also important to point out that a correctly implemented SCD2 should never overwrite a whole table and can be implemented as a (mostly) append operation. As far as I am aware, SCD1 cannot be efficiently implemented without mutable storage and is therefore not a good fit for Spark.
I was going through the Spark documentation and an idea clicked when I was checking one property there.
As my table was Parquet, I made Spark read the data through the Hive metastore by setting this property to false:
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "false")
This solution is working fine for me.
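For reference, a hedged PySpark sketch of how that property fits into the upsert flow described in the question, shown with the SparkSession API rather than the older HiveContext (table and column names are illustrative, and the join is a simplified SCD1-style upsert; the warning above about overwriting a source table still applies):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Read the Parquet table through the Hive SerDe instead of Spark's native reader.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

target = spark.table("db.scd_target")            # illustrative table names
incremental = spark.table("db.scd_incremental")

# Simplified upsert: all incremental rows, plus target rows with no match.
merged = incremental.unionByName(
    target.join(incremental.select("business_key"), "business_key", "left_anti"))

merged.createOrReplaceTempView("merged_result")
spark.sql("INSERT OVERWRITE TABLE db.scd_target SELECT * FROM merged_result")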

Cassandra Database Problem

I am using the Cassandra database for a large-scale application, and I am new to it. I have a database schema for a particular keyspace, for which I have created columns using the Cassandra Command Line Interface (CLI). Now, when I copied a dataset into the folder /var/lib/cassandra/data/, I was not able to access the values using the key of a particular column; I get a message that zero rows are present. But the files are there, with names like XXXX-Data.db, XXXX-Filter.db, and XXXX-Index.db. Can anyone tell me how to access the columns for existing datasets?
(a) Cassandra doesn't expect you to move its data files around out from underneath it. You'll need to restart if you do any manual surgery like that.
(b) if you didn't also copy the schema definition it will ignore data files for unknown column families.
For what you are trying to achieve, it would probably be better to export and import your SSTables.
You should have a look at bin/sstable2json and bin/json2sstable.
Documentation is there (near the end of the page): Cassandra Operations
