Write to a date-partitioned BigQuery table using the beam.io.gcp.bigquery.WriteToBigQuery module in Apache Beam - python-3.x

I'm trying to write a Dataflow job that processes logs located on storage and writes them to different BigQuery tables. Which output tables are used depends on the records in the logs, so I do some processing on the logs and yield them with a key based on a value in the log. I then group the logs by key, and I need to write all the logs grouped under the same key to one table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument, as described in the documentation.
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter two main problems:
1. CREATE_IF_NEEDED gives an error because it has to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step, creating them if they do not.
2. If I load older data, I get the following error:
"The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."
This seems like a limitation of streaming inserts. Is there any way to do batch inserts?
Maybe I'm approaching this wrong and should use another method.
Any guidance on how to tackle these issues is appreciated.
I'm using Python 3.5 and apache-beam==2.13.0.

That error message can be logged when one mixes the use of an ingestion-time partitioned table and a column-partitioned table (see this similar issue). Summarizing from the link: with column-based partitioning (as opposed to ingestion-time partitioning), it is not possible to write to tables using partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the partition decorator when selecting the table (use "[prefix]_YYYYMMDD") and make each individual table column-partitioned.
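A minimal sketch of that advice, assuming a Beam release whose WriteToBigQuery accepts the method and additional_bq_parameters arguments (the 2.13.0 mentioned in the question may predate them, so check the version in use). The project, dataset, prefix, schema, and the log_type and event_date field names are hypothetical. The callable returns a plain table name with no $YYYYMMDD decorator, and FILE_LOADS asks for batch load jobs rather than streaming inserts.

```python
# Hypothetical destination; replace with real project, dataset and prefix.
PROJECT = "my-project"
DATASET = "logs"
PREFIX = "table_name"


def table_for(record):
    """Return a plain table spec (no '$YYYYMMDD' decorator) for one record."""
    return "{}:{}.{}_{}".format(PROJECT, DATASET, PREFIX, record["log_type"])


def build_write():
    """Build the write transform; Beam is imported here so that
    table_for can be unit-tested without Beam installed."""
    import apache_beam as beam

    return beam.io.WriteToBigQuery(
        table=table_for,
        # Hypothetical schema; event_date is the partitioning column.
        schema="log_type:STRING,event_date:DATE,message:STRING",
        # Batch load jobs instead of streaming inserts (needs a GCS temp location).
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        # Passed through to the load job so new tables are column-partitioned.
        additional_bq_parameters={
            "timePartitioning": {"type": "DAY", "field": "event_date"}
        },
    )
```

In a pipeline this would be applied as `records | build_write()`, where each record is a dict matching the schema.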

How to solve the maximum view depth error in Spark?

I have a very long task that creates a bunch of views using Spark SQL and I get the following error at some step: pyspark.sql.utils.AnalysisException: The depth of view 'foobar' exceeds the maximum view resolution depth (100).
I have been searching in Google and SO and couldn't find anyone with a similar error.
I have tried caching the view foobar, but that doesn't help. I'm thinking of creating temporary tables as a workaround, as I would like not to change the current Spark Configuration if possible, but I'm not sure if I'm missing something.
UPDATE:
I tried creating tables in parquet format to reference tables and not views, but I still get the same error. I applied that to all the input tables to the SQL query that causes the error.
If it makes a difference, I'm using ANSI SQL, not the python API.
Solution
Using Parquet tables worked for me after all. I spotted that I was still missing one table to persist, which is why it wouldn't work.
So I changed my SQL statements from this:
CREATE OR REPLACE TEMPORARY VIEW `VIEW_NAME` AS
SELECT ...
To:
CREATE TABLE `TABLE_NAME` USING PARQUET AS
SELECT ...
This moves all the critical views to Parquet tables under spark_warehouse/, or whatever location you have configured.
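If there are many such statements, the rewrite can be done mechanically. The helper below is a hypothetical sketch (its name and regular expression are not from the original post) that turns a temporary-view statement into the Parquet CREATE TABLE form shown above.

```python
import re

# Matches "CREATE [OR REPLACE] TEMP[ORARY] VIEW <name> AS" and captures the name.
_TEMP_VIEW = re.compile(
    r"CREATE\s+(?:OR\s+REPLACE\s+)?TEMP(?:ORARY)?\s+VIEW\s+(`[^`]+`|\S+)\s+AS\b",
    re.IGNORECASE,
)


def view_to_parquet_table(statement):
    """Rewrite the first temporary-view header in a statement as a Parquet CTAS."""
    return _TEMP_VIEW.sub(r"CREATE TABLE \1 USING PARQUET AS", statement, count=1)
```

The result can then be passed to `spark.sql(...)`. Unlike CREATE OR REPLACE TEMPORARY VIEW, a plain CREATE TABLE fails if the table already exists, so a rerun may need to drop the table first.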
Note:
This will write the table to the master node's disk. Make sure you have enough disk space, or consider writing to an external data store such as S3. An alternative, and now preferred, solution uses checkpoints.

spark sql databricks - transaction log error after optimize

I have two tables, written like this:
f_em.write.format('delta').mode("overwrite").saveAsTable('rens.f_em')
f_dial.write.format('delta').mode("overwrite").saveAsTable('rens.f_dial')
These tables work fine. I can query them. However, they are large (ca. 11 billion rows), so to enhance performance, I want to optimize them.
%sql
optimize rens.f_em
zorder by (RKNR)
and
%sql
optimize rens.f_dial
zorder by (rknr)
I don't know exactly how optimize works or what zorder by does. I used optimize before on another table, passing the attribute I use most for linking/joining in the zorder by statement. This improved performance significantly, so I tried the same approach here.
After running the optimize statement, I cannot query from the tables any longer:
For one of the tables I receive this error after a simple select statement
You are trying to read from `dbfs:/user/hive/warehouse/rens.db/f_em` using Databricks Delta, but there is no
transaction log present. Check the upstream job to make sure that it is writing
using format("delta") and that you are trying to read from the table base path.
To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://learn.microsoft.com/azure/databricks/delta/index
;
and for the other table, this error:
Error in SQL statement: FileNotFoundException: dbfs:/user/hive/warehouse/rens.db/f_dial/_delta_log/00000000000000000000.json: Unable to reconstruct state at version 2 as the transaction log has been truncated due to manual deletion or the log retention policy (delta.logRetentionDuration=30 days) and checkpoint retention policy (delta.checkpointRetentionDuration=2 days)
Just guessing: Check the filename for the new query. It looks for the data in the path
dbfs:/user/hive/warehouse/rens.db/f_em
But most likely you saved the table to:
dbfs:/user/hive/warehouse/rens.f_em .
This might be due to the dot notation in your saveAsTable('rens.f_em').
In the SQL query, the dot is interpreted by the SQL API as a database, not as a delta table called rens.f_em.
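EDIT: Given your reply, I would like to propose the following workaround, which I personally always use and favor for its robustness.
table_dir = "/path/to/table"
# Write the Delta files to an explicit path instead of relying on saveAsTable.
f_em.write.format('delta').mode("overwrite").save(f"{table_dir}/f_em")
# Register the path as table f_em in database rens.
spark.sql("CREATE DATABASE IF NOT EXISTS rens")
spark.sql(f"CREATE TABLE IF NOT EXISTS rens.f_em USING DELTA LOCATION '{table_dir}/f_em'")
# Optimize by path, so there is no ambiguity about which files are meant.
spark.sql(f"OPTIMIZE delta.`{table_dir}/f_em` ZORDER BY (RKNR)")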

How to persist data to Hive from PySpark - Avoiding duplicates

I am working with graphframes, pyspark, and hive to work with graph data. As I process data I will be building a graph and eventually will be persisting this data into a Hive table, where I will not update it ever again.
Subsequent runs may have relationships to nodes from previous runs, so I will want to ensure I don't duplicate data.
For example, run #1 might find nodes: A, B, C. Run #2 might re-find node A, and also find new nodes X, Y, Z. I do not want A to appear twice in my table.
I am looking for the best way to handle this and would like to address the following issues:
I will need to track the status of the node as I process metadata associated with it. I will only want to persist the node's data to Hive after I have finished this processing.
I want to ensure that I don't create duplicate data when I encounter the same node (e.g. when I re-find node A above, I don't want to insert another row into Hive).
I am currently tinkering with the best way to do this. I know hive supports ACID transactions now, but it does not appear as though pyspark currently supports CRUD type operations. So here is what I'm planning on:
1. On each run, create a dataframe to store the nodes I have found.
2. When a new node is found, check whether the node already exists in Hive (e.g. sqlContext.sql("SELECT * FROM existingTable WHERE name = '<NAME>'")). If it does not exist, update the dataframe with x = vertices.withColumn("name", F.when(F.col("id")=="a", "<THE-NEW-NAME>").otherwise(F.col("name"))) to add it to our dataframe.
3. Once all the nodes have finished processing, create a temporary view: x.createOrReplaceTempView("myTmpView")
4. Finally, insert data from my temporary view into an existing table with sqlContext.sql("INSERT INTO TABLE existingTable SELECT * FROM myTmpView")
I think this will work, but it seems extremely hacky. I'm not sure if this is a function of my lack of understanding of Hive/Spark, or if this is just the nature of the tech stack. Is there a better way to do this? Is there a performance cost to handling it in this way?
The Delta Lake API supports upserts (MERGE) from both Scala and Python, which is exactly what you are trying to implement.
https://docs.delta.io/latest/delta-update.html#merge-examples
Here is an alternative solution:
1. Have a column updated_time (timestamp) in your table.
2. Union prev_run_results and current_run_results.
3. Group by 'node' and select the latest timestamp.
4. Save the results.
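The alternative above could look roughly like this in PySpark. This is a sketch, assuming both dataframes share the same columns, including node and updated_time as named in the answer. The function name is hypothetical, and a window with row_number stands in for the "group by and select the latest" step so that the other columns of the winning row are kept.

```python
def keep_latest_per_node(prev_run_results, current_run_results):
    """Union both runs and keep only the most recent row for each node."""
    # Imported lazily so the module loads even where PySpark is absent.
    from pyspark.sql import Window
    from pyspark.sql import functions as F

    combined = prev_run_results.unionByName(current_run_results)
    # Number the rows of each node from newest to oldest.
    newest_first = Window.partitionBy("node").orderBy(F.col("updated_time").desc())
    return (
        combined
        .withColumn("_rank", F.row_number().over(newest_first))
        .filter(F.col("_rank") == 1)  # newest row per node
        .drop("_rank")
    )
```

The returned dataframe is what would be saved in the last step. If two rows for a node share the same updated_time, one of them is kept arbitrarily.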

Incremental and parallelism read from RDBMS in Spark using JDBC

I'm working on a project that involves reading data from an RDBMS using JDBC, and I have succeeded in reading the data. This is something I will be doing regularly, every week, so I've been trying to come up with a way to ensure that after the initial read, subsequent reads pull only updated records instead of the entire table.
I can do this with sqoop incremental import by specifying the three parameters (--check-column, --incremental last-modified/append and --last-value). However, I don't want to use sqoop for this. Is there a way I can replicate the same in Spark with Scala?
Secondly, some of the tables do not have a unique column that can be used as partitionColumn, so I thought of using a row-number function to add a unique column to these tables and then taking the MIN and MAX of that column as lowerBound and upperBound respectively. My challenge now is how to dynamically pass these values into the read statement, like below:
val queryNum = "select a1.*, row_number() over (order by sales) as row_nums from (select * from schema.table) a1"
val df = spark.read.format("jdbc").
option("driver", driver).
option("url",url ).
option("partitionColumn",row_nums).
option("lowerBound", min(row_nums)).
option("upperBound", max(row_nums)).
option("numPartitions", some value).
option("fetchsize",some value).
option("dbtable", queryNum).
option("user", user).
option("password",password).
load()
I know the above code is not right and might be missing a whole lot of processes but I guess it'll give a general overview of what I'm trying to achieve here.
It's surprisingly complicated to handle incremental JDBC reads in Spark. IMHO, it severely limits the ease of building many applications and may not be worth your trouble if Sqoop is doing the job.
However, it is doable. See this thread for an example using the dbtable option:
Apache Spark selects all rows
To keep this job idempotent, you'll need to read in the max row of your prior output, either directly by loading all data files or via a log file that you write out each time. If your data files are massive you may need to use the log file; if they are smaller you could potentially load them directly.
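One way this might look, sketched in PySpark rather than Scala (the JDBC option names are the same in both). The function names, the incr and bounds aliases, and all parameters are hypothetical. last_value is whatever the previous run recorded, and the partition column is assumed to be numeric. The filter is pushed down to the database through the dbtable subquery, and a first small query fetches the bounds that the question wanted to pass in dynamically. A driver option can be added to the reader in the same way as url.

```python
def incremental_dbtable(table, check_column, last_value):
    """Build a pushed-down subquery selecting rows newer than last_value.

    last_value is interpolated as a quoted literal, so it must come from a
    trusted source such as the job's own log file.
    """
    return "(SELECT * FROM {t} WHERE {c} > '{v}') incr".format(
        t=table, c=check_column, v=last_value
    )


def read_incremental(spark, url, user, password, table, check_column,
                     last_value, partition_column, num_partitions=8):
    """Read only the new rows, split across partitions by partition_column."""
    dbtable = incremental_dbtable(table, check_column, last_value)

    def reader(query):
        return (spark.read.format("jdbc")
                .option("url", url)
                .option("user", user)
                .option("password", password)
                .option("dbtable", query))

    # A first, small query returns one row holding the bounds of the new data.
    bounds_query = "(SELECT MIN({p}) AS lo, MAX({p}) AS hi FROM {q}) bounds".format(
        p=partition_column, q=dbtable
    )
    lo, hi = reader(bounds_query).load().first()
    if lo is None:
        # No new rows: an unpartitioned read returns an empty dataframe.
        return reader(dbtable).load()

    return (reader(dbtable)
            .option("partitionColumn", partition_column)
            .option("lowerBound", str(lo))
            .option("upperBound", str(hi))
            .option("numPartitions", str(num_partitions))
            .load())
```

After a successful run, the maximum value of the check column in the returned data would be written to the log file and used as last_value next time.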

How to update or even reset rows in persistent table given multiple simultaneous readers?

I have an exchangeRates table that gets updated in batch once per week. It is to be used by other batch and streaming jobs, across different clusters, so I want to save it as a persistent, shared table for all jobs to share.
allExchangeRatesDF.write.saveAsTable("exchangeRates")
How best, then (for the batch job that manages this data), to gracefully update the table contents (actually overwrite them completely), considering the various Spark jobs that consume it, and particularly given its use in some 24/7 Structured Streaming streams?
I've checked the APIs; maybe I am missing something obvious! Very likely.
Thanks!
I think you expect some kind of transaction support from Spark so when there's saveAsTable in progress Spark would hold all writes until the update/reset has finished.
I think that the best way to deal with the requirement is to append new records (using insertInto) with the batch id that would denote the rows that belong to a "new table".
insertInto(tableName: String): Unit Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
You'd then use the batch id to deal with the rows as if they were the only rows in the dataset.
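A rough PySpark sketch of the batch-id idea, under some assumptions: the exchangeRates table already exists with a batch_id column as its last column (insertInto matches columns by position, not by name), batch ids increase over time, and the function names are hypothetical.

```python
def append_batch(all_exchange_rates_df, batch_id, table="exchangeRates"):
    """Append one weekly batch, tagging every row with its batch id."""
    # Imported lazily so the module loads even where PySpark is absent.
    from pyspark.sql import functions as F

    (all_exchange_rates_df
        .withColumn("batch_id", F.lit(batch_id))  # must land in the last column
        .write
        .insertInto(table))


def latest_batch(spark, table="exchangeRates"):
    """Return only the rows of the newest batch, as if they were the whole table."""
    from pyspark.sql import functions as F

    rates = spark.table(table)
    newest = rates.agg(F.max("batch_id")).first()[0]
    return rates.filter(F.col("batch_id") == newest)
```

Consumers would call latest_batch instead of reading the table directly. A reader that runs while a batch is still being appended may see that batch only partially, so this does not by itself provide the transactional behavior discussed above.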
