Pyspark - read periodically from an incremental hive table - apache-spark

I am working on a use case with pyspark.
My pyspark job should read from Hive tables periodically and apply some aggregations and transformations on top of it.
But I can't read the full table each time, as I need to append the output to another table. Can anyone please suggest any ideas? One approach I am considering is to keep track of the rowId or rownum of the Hive table after each run.
P.S.: this is not a streaming use case.
Note: I am new to spark.
Thanks,
Albin

Let's break the problem down.
Create two tables to replace the existing one: a Base table and a Delta table.
Create a view that is the union of both tables. (This gives you a complete view of all data as of 'now'. Exclude data tagged "Processing" from the view; I'll explain why later.)
When data is added, it's added to the Delta table.
When it's time to start processing data, tag the data in the Delta table as "Processing".
Copy the "Processing" data to the Base table, and complete any required processing/updates to the aggregations.
Delete the "Processing" data from the Delta table once you have completed your calculations.
It's hopefully now clear why you'd exclude data tagged "Processing" from your view of 'now'.
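A minimal sketch of that pattern using Spark SQL from PySpark, assuming illustrative names (events_base, events_delta, events_now), that both tables share the same schema including a status column used as the "Processing" tag, and a storage format that supports row-level UPDATE/DELETE from Spark (for example Delta Lake); with plain Hive tables you would typically rewrite the Delta table instead of deleting rows:
# Illustrative sketch only; table, view, and column names are assumptions.
# A view of 'now': everything in the Base table plus Delta rows not being processed.
spark.sql("""
    CREATE OR REPLACE VIEW events_now AS
    SELECT * FROM events_base
    UNION ALL
    SELECT * FROM events_delta WHERE status <> 'Processing'
""")

# One processing cycle: tag, copy, aggregate, then clean up.
spark.sql("UPDATE events_delta SET status = 'Processing' WHERE status = 'New'")
spark.sql("INSERT INTO events_base SELECT * FROM events_delta WHERE status = 'Processing'")
# ... run the aggregations/transformations over the newly copied rows here ...
spark.sql("DELETE FROM events_delta WHERE status = 'Processing'")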

Related

Delta tables in Databricks and into Power BI

I am connecting to a delta table in Azure gen 2 data lake by mounting in Databricks and creating a table ('using delta'). I am then connecting to this in Power BI using the Databricks connector.
Firstly, I am unclear as to the relationship between the data lake and the Spark table in Databricks. Is it correct that the Spark table retrieves the latest snapshot from the data lake (delta lake) every time it is itself queried? Is it also the case that it is not possible to effect changes in the data lake via operations on the Spark table?
Secondly, what is the best way to reduce the columns in the Spark table (ideally before it is read into Power BI)? I have tried creating the Spark table with a specified subset of columns but get a 'cannot change schema' error. Instead I can create another Spark table that selects from the first Spark table, but this seems pretty inefficient and (I think) will need to be recreated frequently in line with the refresh schedule of the Power BI report. I don't know if it's possible to have a Spark Delta table that references another Spark Delta table, so that the former is also always the latest snapshot when queried?
As you can tell, my understanding of this is limited (as is the documentation!) but any pointers very much appreciated.
Thanks in advance and for reading!
A table in Spark is just metadata that specifies where the data is located. So when you read the table, Spark under the hood looks up in the metastore where the data is stored, what the schema is, etc., and accesses that data. Changes made on ADLS will also be reflected in the table. It's also possible to modify the table from these tools, but it depends on what access rights are available to the Spark cluster that processes the data - you can set permissions either at the ADLS level, or using table access control.
For the second part, you just need to create a view over the original table that selects only a limited set of columns - the data is not copied, and the latest updates in the original table will always be available for querying. Something like:
CREATE OR REPLACE VIEW myview
AS SELECT col1, col2 FROM mytable
P.S. If you're only accessing the data via Power BI or other BI tools, you may want to look at Databricks SQL (once it's in public preview), which is heavily optimized for BI use cases.

Insert or Update a delta table from a dataframe in Pyspark

I currently have a pyspark dataframe from which I initially created a delta table using the code below -
df.write.format("delta").saveAsTable("events")
Now, since in my requirement the above dataframe populates data on a daily basis, for appending new records into the delta table I used the syntax below -
df.write.format("delta").mode("append").saveAsTable("events")
Now, this whole thing I did in Databricks on my cluster. I want to know how I can write generic pyspark code in Python that will create the delta table if it does not exist and append records if the delta table exists. I want to do this because if I give my Python package to someone, they will not have the same delta table in their environment, so it should get created dynamically from the code.
If you don't have the Delta table yet, then it will be created when you're using the append mode. So you don't need to write any special code to handle the case when the table doesn't exist yet versus when it already exists.
P.S. You'll need such code only if you're performing a merge into the table, not an append. In that case the code will look like this:
if table_exists:
    do_merge
else:
    df.write....
P.S. here is a generic implementation of that pattern
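For illustration, a minimal sketch of that check-then-merge flow, assuming the Delta Lake Python API (delta-spark) is available on the cluster, Spark 3.3+ for spark.catalog.tableExists, and a hypothetical key column id:
from delta.tables import DeltaTable

if spark.catalog.tableExists("events"):           # table already exists: merge (upsert)
    target = DeltaTable.forName(spark, "events")
    (target.alias("t")
           .merge(df.alias("s"), "t.id = s.id")   # "id" is an assumed key column
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())
else:                                             # first run: create the table
    df.write.format("delta").saveAsTable("events")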
There are essentially two operations available with Spark:
saveAsTable: creates the table (or replaces it, depending on the save mode) with the current DataFrame, whether or not it already exists.
insertInto: succeeds only if the table is present, and performs the operation based on the mode ('overwrite' or 'append'). It requires the table to already be available in the database.
Calling .saveAsTable("events") rewrites the table every time you call it, which means that whether or not a table was present earlier, it will be replaced with the current DataFrame's contents. Instead, you can perform the operation below to be on the safer side:
Step 1: Create the table only if it is not already present (the WHERE 1=2 filter creates it empty, with the DataFrame's schema), then append the new DataFrame records.
df.createOrReplaceTempView('df_table')
spark.sql("create table IF NOT EXISTS table_name using delta select * from df_table where 1=2")
df.write.format("delta").mode("append").insertInto("events")
So every time it runs, it will check whether the table is available; if not, it will create the table and move to the next step. If the table is available, it will simply append the data into it.

Azure Data Factory DataFlow Filter is taking a lot of time

I have an ADF Pipeline which executes a Data Flow.
The Data Flow has a source (table A) with around 1 million rows,
a Filter with a condition that selects only yesterday's records from the source table,
Alter Row settings which use upsert,
and a Sink, which is the archival table into which the records are upserted.
This whole pipeline takes around 2 hours or so, which is not acceptable, even though only around 3,000 records are actually transferred/upserted.
The core count is 16. I tried partitioning with round robin and 20 partitions.
A similar archival run takes no more than 15 minutes for another table with around 100K records.
I thought of creating a source that selects only yesterday's records, but in the dataset we can only select a table.
Please suggest if I am missing anything to optimize it.
The table of the Data Set really doesn't matter. Whichever activity you use to access that Data Set can be toggled to use a query instead of the whole table, so that you can pass in a value to select only yesterday's data from the database.
Of course, if you have the ability to create a stored procedure on the source, you could also do that.
When migrating really large sets of data, you'll get much better performance using a Copy activity to stage the data in an Azure Storage Blob before using another Copy activity to pull from that Blob into the destination. But for what you're describing here, that doesn't seem necessary.

pyspark: insert into dataframe if key not present or row.timestamp is more recent

I have a Kudu database with a table in it. Every day, I launch a batch job which receives new data to ingest (an ETL pipeline).
I would like to insert the new data if:
the key is not present
if the key is present, update the row only if the timestamp column of the new row is more recent
I think what you need is a left outer join of the new data with the existing table, the result of which you first have to save into a temporary table, and then move it to the original table, with SaveMode.Append.
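A rough sketch of that join, assuming the existing Kudu table has already been loaded into a DataFrame called existing, the new batch is new_df, and the key and timestamp columns are hypothetically named id and ts:
from pyspark.sql import functions as F

# Rename the columns taken from the existing table so they don't clash after the join.
old = existing.select(F.col("id").alias("old_id"), F.col("ts").alias("old_ts"))

to_write = (
    new_df.join(old, new_df["id"] == old["old_id"], "left_outer")
          # keep rows whose key is absent from the table, or whose timestamp is newer
          .where(F.col("old_id").isNull() | (F.col("ts") > F.col("old_ts")))
          .drop("old_id", "old_ts")
)

# Save the result into a temporary table first, then append it to the original
# table (SaveMode.Append), as described above.
to_write.write.mode("append").saveAsTable("tmp_upserts")  # placeholder table name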
You might also be interested in using Spark Structured Streaming or Kafka instead of batch jobs. I even found an example on GitHub (didn't check how well it works, though, and whether it takes existing data into account).

How to quickly migrate from one table into another one with different table structure in the same/different cassandra?

I have one table with more than 10,000,000 records in Cassandra, and for some reason I want to build another Cassandra table with the same fields plus several additional fields, then migrate the previous data into it. For now, the two tables are in the same Cassandra cluster.
How can I finish this task in the shortest time?
And if my new table is in a different Cassandra cluster, how would I do it?
Any advice will be appreciated!
If you just need to add blank fields to a table, then the best thing to do is use the alter table command to add the fields to the existing table. Then no copying of the data would be needed and the new fields would show up as null in the existing rows until you set them to something.
If you want to change the structure of the data in the new table, or write it to a different cluster, then you'd probably need to write an application to read each row of the old table, transform the data as needed, and then write each row to the new location.
You could also do this by exporting the data to a csv file, write a program to restructure the csv file as needed, then import the csv file into the new location.
Another possible method would be to use Apache Spark. You'd read the existing table into an RDD, transform and filter the data into a new RDD, then save the transformed RDD to the new table. That would only work within the same cluster and would be fairly complex to set up.
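If you go the Spark route, here is a hedged sketch using the DataFrame API of the spark-cassandra-connector (same idea as the RDD approach described above; keyspace, table, and column names are placeholders):
from pyspark.sql import functions as F

# Read the old table through the Cassandra connector.
old_df = (spark.read
               .format("org.apache.spark.sql.cassandra")
               .options(keyspace="my_ks", table="old_table")
               .load())

# Transform as needed, e.g. add the extra columns (filled with nulls here).
new_df = old_df.withColumn("extra_col", F.lit(None).cast("string"))

# Write into the new table (it must already exist with a matching schema).
(new_df.write
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="my_ks", table="new_table")
       .mode("append")
       .save())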
