How to persist data to Hive from PySpark - Avoiding duplicates - apache-spark

I am working with graphframes, pyspark, and hive to work with graph data. As I process data I will be building a graph and eventually will be persisting this data into a Hive table, where I will not update it ever again.
Subsequent runs may have relationships to nodes from previous runs, so I will want to ensure I don't duplicate data.
For example, run #1 might find nodes: A, B, C. Run #2 might re-find node A, and also find new nodes X, Y, Z. I do not want A to appear twice in my table.
I am looking for the best way to handle this and would like to address the following issues:
I will need to track the status of the node as I process metadata associated with it. I will only want to persist the node's data to Hive after I have finished this processing.
I want to ensure that I don't create duplicate data when I encounter the same node (e.g. when I re-find A node above, I don't want to insert another row into Hive)
I am currently tinkering with the best way to do this. I know hive supports ACID transactions now, but it does not appear as though pyspark currently supports CRUD type operations. So here is what I'm planning on:
On each run, create a dataframe to store the nodes I have found.
When a new node is found: Check if the node already exists in Hive (e.g. sqlContext.sql("SELECT * FROM existingTable WHERE name="<NAME>"). If it does not exist update the dataframe with x = vertices.withColumn("name", F.when(F.col("id")=="a", "<THE-NEW-NAME>").otherwise(F.col("name"))) to add it to our Dataframe.
Once all the nodes have finished processing, create a temporary view: x.createOrReplaceTempView("myTmpView")
Finally, insert data from my temporary view into an existing table with sqlContext.sql("INSERT INTO TABLE existingTable SELECT * FROM myTmpView")
I think this will work, but it seems extremely hacky. I'm not sure if this is a function of my lack of understanding of Hive/Spark, or if this is just the nature of the tech stack. Is there a better way to do this? Is there a performance cost to handling it in this way?

In deltalake api, upserts(Merge) are supported using scala and also python. Which is exactly you are trying to implement.
https://docs.delta.io/latest/delta-update.html#merge-examples
Here is an alternate solution
Have a column updated_time timestamp in your table
union prev_run_results and current_run_results
group by 'node', select the latest timestamp
save the results

Related

In Azure Synapse, is it true that creating a temporary table is discouraged?

I had a good discussion with one of my colleagues and he mentioned creating a temporary table degrades the performance in Azure Synapse because Synapse creates the temporary table first in the master node then distribute them to child node. Is it true? He recommended me to create create permanent table instead of temporary table.
That’s not correct. Temp tables don’t necessarily funnel through the control node. Let’s say you are selecting from a table distributed on ProductKey and loading it into a #temp table distributed on ProductKey. The data will never leave each compute node since it’s a distribution compatible insert.
On the other hand, if you run a query that uses a ROW_NUMBER function, for example, that would have to be calculated on the control node and then the data would be sent back to the compute nodes to be stored in the distributed temp table. But that only happens in the presence of some types of functions and some types of queries. It is not the norm. If you are worried about a particular query then add the word EXPLAIN to the front of it and paste the explain plan XML into your question so we can help you interpret it.
If you load a #temp table with a SELECT INTO statement you can’t specify the table geometry so it will be a round robin distributed columnstore. Usually this isn’t ideal since it takes extra time and memory to compress a columnstore and because round robin distribution isn’t ideal unless there is no good distribution key. Usually the next query which uses the round robin distributed temp table will just reshuffle it so it’s best to properly hash distribute a temp table initially. To do this do a CTAS statement as described here.

how to avoid errors when querying hive table being loaded by Spark at the same time

We have a use case where we run an ETL written in spark on top of some streaming data, the ETL writes results to the target hive table every hour, but users are commonly running queries to the target table and we have faced cases of having query errors due to spark loading the table at the same time. What alternatives do we have to avoid or minimize this errors? Any property to the spark job(or to the hive table)? or something like creating a temporary table?
The error is:
java.io.FileNotFoundException: File does not exist [HDFS PATH]
Which i think happens because the metadata says there is a file A that gets deleted during the job execution.
The table is partitioned by year, month, day(using HDFS as storage) and every time the ETL runs it updates(via a partition overwrite) only current date partition. Currently no "transactional" tables are enabled in the cluster(even if they were i tested the use case on a test cluster without luck)
The easy option is to use a table format thats designed to handle concurrent reads and writes like hudi or delta lake. The more complicated version involves using a partitioned append only table that the writer writes to. On completion the writer updates a view to point to the new data. Another possible option is to partition the table on insert time.
Have a set of two tables and a view over them:
CREATE TABLE foo_a (...);
CREATE TABLE foo_b (...);
CREATE VIEW foo AS SELECT x, y, z, ... FROM foo_a;
First iteration of ETL process needs to:
Synchronize foo_a -> foo_b
Do the work on foo_b
Drop view foo and recreate it pointing to foo_b
Until step 3 user queries run against table foo_a. From the moment of switch they run against foo_b. Next iteration of ETL will work in the opposite way.
This is not perfect. You need double storage and some extra complexity in the ETL. And anyway this approach might fail if:
user is unlucky enough to hit a short time between dropping and
recreating the view
user submits a query that's heavy enough to run across two iterations of ETL
not sure but check it out
CREATE TABLE foo_a (...);
CREATE TABLE foo_b (...);

Cassandra 3.7 CDC / incremental data load

I'm very new to the ETL world and I wish to implement Incremental Data Loading with Cassandra 3.7 and Spark. I'm aware that later versions of Cassandra do support CDC, but I can only use Cassandra 3.7. Is there a method through which I can track the changed records only and use spark to load them, thereby performing incremental data loading?
If it can't be done on the cassandra end, any other suggestions are also welcome on the Spark side :)
It's quite a broad topic, and efficient solution will depend on the amount of data in your tables, table structure, how data is inserted/updated, etc. Also, specific solution may depend on the version of Spark available. One downside of Spark-only method is you can't easily detect deletes of the data, without having a complete copy of previous state, so you can generate a diff between 2 states.
In all cases you'll need to perform full table scan to find changed entries, but if your table is organized specifically for this task, you can avoid reading of all data. For example, if you have a table with following structure:
create table test.tbl (
pk int,
ts timestamp,
v1 ...,
v2 ...,
primary key(pk, ts));
then if you do following query:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("tbl", "test").load()
val filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
then Spark Cassandra Connector will push this query down to the Cassandra, and will read only data where ts is in the given time range - you can check this by executing filtered.explain and checking that both time filters are marked with * symbol.
Another way to detect changes is to retrieve the write time from Cassandra, and filter out the changes based on that information. Fetching of writetime is supported in RDD API for all recent versions of SCC, and is supported in the Dataframe API since release of SCC 2.5.0 (requires at least Spark 2.4, although may work with 2.3 as well). After fetching this information, you can apply filters on the data & extract changes. But you need to keep in mind several things:
there is no way to detect deletes using this method
write time information exists only for regular & static columns, but not for columns of primary key
each column may have its own write time value, in case if there was a partial update of the row after insertion
in most versions of Cassandra, call of writetime function will generate error when it's done for collection column (list/map/set), and will/may return null for column with user-defined type
P.S. Even if you had CDC enabled, it's not a trivial task to use it correctly:
you need to de-duplicate changes - you have RF copies of the changes
some changes could be lost, for example, when node was down, and then propagated later, via hints or repairs
TTL isn't easy to handle
...
For CDC you may look for presentations from 2019th DataStax Accelerate conference - there were several talks on that topic.

Incremental and parallelism read from RDBMS in Spark using JDBC

I'm working on a project that involves reading data from RDBMS using JDBC and I succeeded reading the data. This is something I will be doing fairly constantly, weekly. So I've been trying to come up with a way to ensure that after the initial read, subsequent ones should only pull updated records instead of pulling the entire data from the table.
I can do this with sqoop incremental import by specifying the three parameters (--check-column, --incremental last-modified/append and --last-value). However, I dont want to use sqoop for this. Is there a way I can replicate same in Spark with Scala?
Secondly, some of the tables do not have unique column which can be used as partitionColumn, so I thought of using a row-number function to add a unique column to these table and then get the MIN and MAX of the unique column as lowerBound and upperBound respectively. My challenge now is how to dynamically parse these values into the read statement like below:
val queryNum = "select a1.*, row_number() over (order by sales) as row_nums from (select * from schema.table) a1"
val df = spark.read.format("jdbc").
option("driver", driver).
option("url",url ).
option("partitionColumn",row_nums).
option("lowerBound", min(row_nums)).
option("upperBound", max(row_nums)).
option("numPartitions", some value).
option("fetchsize",some value).
option("dbtable", queryNum).
option("user", user).
option("password",password).
load()
I know the above code is not right and might be missing a whole lot of processes but I guess it'll give a general overview of what I'm trying to achieve here.
It's surprisingly complicated to handle incremental JDBC reads in Spark. IMHO, it severely limits the ease of building many applications and may not be worth your trouble if Sqoop is doing the job.
However, it is doable. See this thread for an example using the dbtable option:
Apache Spark selects all rows
To keep this job idempotent, you'll need to read in the max row of your prior output either directly from loading all data files or via a log file that you write out each time. If your data files are massive you may need to use the log file, if smaller you could potentially load.

Write to a datepartitioned Bigquery table using the beam.io.gcp.bigquery.WriteToBigQuery module in apache beam

I'm trying to write a dataflow job that needs to process logs located on storage and write them in different BigQuery tables. Which output tables are going to be used depends on the records in the logs. So I do some processing on the logs and yield them with a key based on a value in the log. After which I group the logs on the keys. I need to write all the logs grouped on the same key to a table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument as described in the documentation here
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter 2 main problems:
The CREATE_IF_NEEDED gives an error because it has to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step and if not create them.
If i load older data I get the following error:
The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."
This seems like a limitation of streaming inserts, any way to do batch inserts ?
Maybe I'm approaching this wrong, and should use another method.
Any guidance as how to tackle these issues are appreciated.
Im using python 3.5 and apache-beam=2.13.0
That error message can be logged when one mixes the use of an ingestion-time partitioned table a column-partitioned table (see this similar issue). Summarizing from the link, it is not possible to use column-based partitioning (not ingestion-time partitioning) and write to tables with partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the use of the partition decorator when selecting which table (use "[prefix]_YYYYMMDD") and then have each individual table be column-based partitioned.

Resources