Incremental batch processing in pyspark - apache-spark

In our Spark application we run multiple batch processes every day. The sources for these batch processes differ: Oracle, MongoDB, files. We store a different value for incremental processing depending on the source, such as the latest timestamp for some Oracle tables, an ID for other Oracle tables, or a file list for some file systems, and we use those values for the next incremental run.
Currently the calculation of these offset values depends on the source type, and we have to customize the code to store this value every time we add a new source type.
Is there a generic way to solve this, similar to checkpointing in streaming?

I always like to look into the destination for the last written partition, or get some max(primary_key), and then based on that value select the data to write from the source database during the current run.
There would be no need to store anything; you would just need to supply the table name, source type, and primary key/timestamp column to your batch processing algorithm. The algorithm would then find the latest value you already have.
It really depends on your load philosophy and how your storage is divided, for example whether you have raw/source/prepared layers. It is a good idea to load data in a raw format that can be easily compared to the original source in order to do what I described above.
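A minimal PySpark sketch of that approach, assuming a hypothetical Oracle source table and a Parquet destination with a numeric id column as the watermark (the JDBC URL, paths, and column names are placeholders, and first-run handling is omitted):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Find the highest watermark already present in the destination.
dest = spark.read.parquet("s3://bucket/raw/my_table")
last_id = dest.agg(F.max("id")).collect()[0][0] or 0

# Read only the newer rows from the source; the subquery is pushed down to Oracle.
src = (spark.read.format("jdbc")
       .option("url", "jdbc:oracle:thin:@//host:1521/service")
       .option("driver", "oracle.jdbc.OracleDriver")
       .option("dbtable", "(select * from my_schema.my_table where id > {0}) t".format(last_id))
       .load())

src.write.mode("append").parquet("s3://bucket/raw/my_table")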
Alternatives include:
Writing a file which contains that primary column and the latest value; your batch job would read this file to determine what to read next.
Updating the job execution configuration with an argument corresponding to the latest value, so on the next run the latest value is passed to your algorithm.

Related

Spark SQL - Identifying What Partitions Were Output by an INSERT Statement

I am trying to find out if there is a way for a Spark SQL job to identify what partition values were output in both static and dynamic partition scenarios.
I am already familiar with ApplicationListeners and QueryListeners and have found that, combined, they give you access to the exact static partition values that were written. When a dynamic partition write is performed, however, you only get the partition keys and not the corresponding values.
For these jobs there is a constraint that they run only SQL, so I cannot use some sort of accumulator or broadcast variable.
e.g.
spark.sql("INSERT OVERWRITE .....")
As an example, if I have a job write out to the base path
s3://bucket/table
With partition columns (part_a String, part_b String)
I want to be able to, as a post process, see the partition values and/or paths I've written to.
If I wrote to part_a=a and part_b=b (via SQL), and then to part_a=1 and part_b=2, resulting in the paths below:
s3://bucket/table/part_a=a/part_b=b
s3://bucket/table/part_a=1/part_b=2
How can I obtain the above paths post-run, or at a bare minimum the dynamic partition values? I have used AspectJ to this effect to catch the actual filesystem paths being passed on write, but this is very brittle with respect to the Spark version itself and essentially breaks on every upgrade.

Cassandra 3.7 CDC / incremental data load

I'm very new to the ETL world and I wish to implement incremental data loading with Cassandra 3.7 and Spark. I'm aware that later versions of Cassandra do support CDC, but I can only use Cassandra 3.7. Is there a method through which I can track only the changed records and use Spark to load them, thereby performing incremental data loading?
If it can't be done on the Cassandra end, any other suggestions are also welcome on the Spark side :)
It's quite a broad topic, and an efficient solution will depend on the amount of data in your tables, the table structure, how data is inserted/updated, etc. Also, the specific solution may depend on the version of Spark available. One downside of a Spark-only method is that you can't easily detect deletes without keeping a complete copy of the previous state, so you can generate a diff between the two states.
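For the delete-detection point, here is a rough PySpark sketch of diffing two full snapshots on the primary key (the snapshot paths and the pk column name are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

prev = spark.read.parquet("/snapshots/tbl/2019-03-09")   # previous full copy
curr = spark.read.parquet("/snapshots/tbl/2019-03-10")   # current full copy

# Rows whose primary key existed before but is gone now, i.e. deletes.
deleted = prev.join(curr, on="pk", how="left_anti")

# Rows that are new or whose values changed since the previous snapshot.
changed = curr.subtract(prev)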
In all cases you'll need to perform a full table scan to find changed entries, but if your table is organized specifically for this task, you can avoid reading all the data. For example, if you have a table with the following structure:
create table test.tbl (
    pk int,
    ts timestamp,
    v1 ...,
    v2 ...,
    primary key(pk, ts));
then if you execute the following query:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("tbl", "test").load()
val filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")
then the Spark Cassandra Connector will push this query down to Cassandra and read only the data where ts is in the given time range - you can check this by executing filtered.explain and verifying that both time filters are marked with the * symbol.
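A rough PySpark equivalent of the same pushed-down range query, assuming the connector is on the classpath and the Cassandra connection is already configured (the timestamps are just the example values above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = (spark.read.format("org.apache.spark.sql.cassandra")
        .options(table="tbl", keyspace="test")
        .load())

filtered = data.filter("""ts >= cast('2019-03-10T14:41:34.373+0000' as timestamp)
    AND ts <= cast('2019-03-10T19:01:56.316+0000' as timestamp)""")

filtered.explain()   # the pushed-down time filters should be marked with *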
Another way to detect changes is to retrieve the write time from Cassandra and filter the changes based on that information. Fetching the writetime is supported in the RDD API for all recent versions of the SCC, and in the Dataframe API since the release of SCC 2.5.0 (which requires at least Spark 2.4, although it may work with 2.3 as well). After fetching this information you can apply filters on the data and extract the changes. But you need to keep several things in mind:
there is no way to detect deletes using this method
write time information exists only for regular & static columns, but not for columns of primary key
each column may have its own write time value if there was a partial update of the row after insertion
in most versions of Cassandra, calling the writetime function generates an error when it's done on a collection column (list/map/set), and will/may return null for a column with a user-defined type
P.S. Even if you had CDC enabled, it's not a trivial task to use it correctly:
you need to de-duplicate changes - you have RF copies of the changes
some changes could be lost, for example when a node was down, and then propagated later via hints or repairs
TTL isn't easy to handle
...
For CDC you may look for presentations from the 2019 DataStax Accelerate conference - there were several talks on that topic.

Write to a datepartitioned Bigquery table using the beam.io.gcp.bigquery.WriteToBigQuery module in apache beam

I'm trying to write a Dataflow job that needs to process logs located on storage and write them to different BigQuery tables. Which output tables are used depends on the records in the logs, so I do some processing on the logs and yield them with a key based on a value in the log, after which I group the logs by key. I need to write all the logs grouped under the same key to a table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument as described in the documentation here
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter 2 main problems:
CREATE_IF_NEEDED gives an error because it has to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step and, if not, creating them there.
If I load older data I get the following error:
The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date.
This seems like a limitation of streaming inserts; is there any way to do batch inserts?
Maybe I'm approaching this wrong and should use another method.
Any guidance on how to tackle these issues is appreciated.
I'm using Python 3.5 and apache-beam 2.13.0.
That error message can be logged when one mixes the use of an ingestion-time partitioned table with a column-partitioned table (see this similar issue). Summarizing from the link, it is not possible to use column-based partitioning (as opposed to ingestion-time partitioning) and write to tables with partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the use of the partition decorator when selecting which table (use "[prefix]_YYYYMMDD") and then have each individual table be column-based partitioned.
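A hedged sketch of that setup with beam.io.gcp.bigquery.WriteToBigQuery, using a callable that routes each record to a table name without a $ decorator (the project, dataset, routing field, and dispositions are placeholders; each destination table is assumed to already exist and be column-partitioned on a date field):

import json
import apache_beam as beam
from apache_beam.io.gcp.bigquery import WriteToBigQuery, BigQueryDisposition

def route_to_table(element):
    # Pick the destination table from a value in the log record; note there is no "$YYYYMMDD" suffix.
    return "my-project:my_dataset.logs_{0}".format(element["log_type"])

with beam.Pipeline() as p:
    (p
     | "ReadLogs" >> beam.io.ReadFromText("gs://my-bucket/logs/*.json")
     | "Parse" >> beam.Map(json.loads)
     | "Write" >> WriteToBigQuery(
           table=route_to_table,
           write_disposition=BigQueryDisposition.WRITE_APPEND,
           create_disposition=BigQueryDisposition.CREATE_NEVER))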

Spark tagging file names for purpose of possible later deletion/rollback?

I am using Spark 2.4 in AWS EMR.
I am using Pyspark and SparkSQL for my ELT/ETL and using DataFrames with Parquet input and output on AWS S3.
As of Spark 2.4, as far as I know, there is no way to tag or customize the file names of output files (Parquet). Please correct me if I'm wrong.
When I store parquet output files on S3 I end up with file names which look like this:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
The middle part of the file name looks like an embedded GUID/UUID: 4fb6c57e-d43b-42bd-afe5-3970b3ae941c
I would like to know whether I can obtain this GUID/UUID value from PySpark or Spark SQL at run time, in order to log/save/display it in a text file.
I need to log this GUID/UUID value because I may later need to remove the files that carry this value in their names, for manual rollback purposes (for example, I may discover a day or a week later that this data is somehow corrupt and needs to be deleted, so all files tagged with that GUID/UUID can be identified and removed).
I know that I can partition the table manually on a GUID column but then I end up with too many partitions, so it hurts performance. What I need is to somehow tag the files, for each data load job, so I can identify and delete them easily from S3, hence GUID/UUID value seems like one possible solution.
Open for any other suggestions.
Thank you
Is this with the new "s3a specific committer"? If so, it means that they're using Netflix's code/trick of using a GUID on each file written so as to avoid eventual consistency problems. That doesn't help much, though.
Consider offering a patch to Spark which lets you add a specific prefix to the file name.
Or, for Apache Hadoop and Spark (i.e. not EMR), an option for the S3A committers to put that prefix in when they generate temporary filenames.
Short term: you can always list the before-and-after state of the directory tree (tip: use FileSystem.listFiles(path, recursive) for speed), and either remember the new files or rename them (which will be slow; remembering the new filenames is better).
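A rough PySpark sketch of that before-and-after listing through the JVM Hadoop FileSystem API via py4j (the output path is a placeholder, and it is assumed to already exist before the first listing):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def list_files(path_str):
    # Recursively list file paths under path_str using FileSystem.listFiles(path, recursive).
    jvm = spark.sparkContext._jvm
    conf = spark.sparkContext._jsc.hadoopConfiguration()
    path = jvm.org.apache.hadoop.fs.Path(path_str)
    fs = path.getFileSystem(conf)
    it = fs.listFiles(path, True)
    files = set()
    while it.hasNext():
        files.add(it.next().getPath().toString())
    return files

out = "s3://bucket/table/"
df = spark.read.parquet("s3://bucket/staging/")   # whatever this load produced
before = list_files(out)
df.write.mode("append").parquet(out)
new_files = list_files(out) - before              # log these names for a possible rollback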
Spark already writes files with a UUID in their names. Instead of creating too many partitions you can set up custom file naming (e.g. add some id). Maybe this is a solution for you - https://stackoverflow.com/a/43377574/1251549
Not tried yet (but planning) - https://github.com/awslabs/amazon-s3-tagging-spark-util
In theory, you can tag the files with the job id (or whatever) and then run something later to find and delete them.
Both solutions lead to performing multiple S3 list-objects API requests, checking tags/filenames, and deleting the files one by one.
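For example, a small boto3 sketch of the filename-based cleanup (the bucket, prefix, and captured UUID are placeholders; the UUID would be whatever was logged for the load being rolled back):

import boto3

s3 = boto3.client("s3")
bad_uuid = "4fb6c57e-d43b-42bd-afe5-3970b3ae941c"   # value logged at load time

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="bucket", Prefix="table/"):
    for obj in page.get("Contents", []):
        if bad_uuid in obj["Key"]:
            s3.delete_object(Bucket="bucket", Key=obj["Key"])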

Incremental load in Azure Data Lake

I have a big blob storage full of log files organized according to their identifiers at a number of levels: repository, branch, build number, build step number.
These are JSON files that contain an array of objects, each object has a timestamp and an entry value. I've already implemented a custom extractor (extending IExtractor) that takes an input stream and produces a number of plain-text lines.
Initial load
Now I am trying to load all of that data to ADL Store. I created a query that looks similar to this:
@entries =
    EXTRACT repo string,
            branch string,
            build int,
            step int,
            Line int,
            Entry string
    FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
    USING new MyJSONExtractor();
When I run this extraction query I get a compiler error - it exceeds the limit of 25 minutes of compilation time. My guess is that there are too many files. So I add a WHERE clause to the INSERT INTO query:
INSERT INTO Entries
    (Repo, Branch, Build, Step, Line, Entry)
SELECT * FROM @entries
WHERE (repo == "myRepo") AND (branch == "master");
Still no luck - compiler times out.
(It does work, however, when I process a single build, leaving {step} as the only wildcard and hard-coding the rest of the names.)
Question: Is there a way to perform a load like that in a number of jobs - but without the need to explicitly (manually) "partition" the list of input files?
Incremental load
Let's assume for a moment that I succeeded in loading those files. However, a few days from now I'll need to perform an update - how am I supposed to specify the list of files? I have a SQL Server database where all the metadata is kept, and I could extract exact log file paths - but U-SQL's EXTRACT query forces me to provide a static string that specifies the input data.
A straightforward scenario would be to define a top-level directory for each date and process them day by day. But the way the system is designed makes this very difficult, if not impossible.
Question: Is there a way to identify files by their creation time? Or maybe there is a way to combine a query to a SQL Server database with the extraction query?
For your first question: it sounds like your FileSet pattern is generating a very large number of input files. To deal with that, you may want to try the FileSets v2 preview, which is documented under the U-SQL Preview Features section in:
https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_04_24/USQL_Release_Notes_2017_04_24.md
Input File Set scales orders of magnitudes better (opt-in statement is now provided).
Previously, U-SQL's file set pattern on EXTRACT expressions ran into compile time time-outs around 800 to 5000 files.
U-SQL's file set pattern now scales to many more files and generates more efficient plans.
For example, a U-SQL script querying over 2500 files in our telemetry system previously took over 10 minutes to compile and now compiles in 1 minute, and the script now executes in 9 minutes instead of over 35 minutes using a lot less AUs. We also have compiled scripts that access 30'000 files.
The preview feature can be turned on by adding the following statement to your script:
SET @@FeaturePreviews = "FileSetV2Dot5:on";
If you wanted to generate multiple extract statements based on partitions of your filepaths, you'd have to do it with some external code that generates one or more U-SQL scripts.
I don't have a good answer to your second question so I will get a colleague to respond. Hopefully the first part can get you unblocked for now.
To address your second question:
You could read your data from the SQL Server database using a federated query, and then use the information in a join with the virtual columns that you create from the fileset. The problem with that is that the values are only known at execution time and not at compile time, so you would not get the reduction in the accessed files.
Alternatively, you could write a SQL query that gets you the data you need and then parameterize your U-SQL script so you can pass that information into the U-SQL script.
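A hedged Python sketch of that external-driver approach: query the metadata database for what still needs to be loaded, substitute those values into a U-SQL template, and emit one script per batch (the connection string, table names, and template are placeholders, and job submission is omitted):

import pyodbc

# Fetch the repo/branch combinations that still need to be loaded.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=logmeta;Trusted_Connection=yes")
rows = conn.cursor().execute(
    "SELECT DISTINCT repo, branch FROM dbo.PendingBuilds").fetchall()

# {build} and {step} stay as U-SQL fileset wildcards, so they are escaped as {{build}}/{{step}}.
TEMPLATE = """
@entries =
    EXTRACT build int, step int, Line int, Entry string
    FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{{build}}/{{step}}.json"
    USING new MyJSONExtractor();

INSERT INTO Entries (Repo, Branch, Build, Step, Line, Entry)
SELECT "{repo}" AS Repo, "{branch}" AS Branch, build, step, Line, Entry FROM @entries;
"""

for repo, branch in rows:
    script = TEMPLATE.format(repo=repo, branch=branch)
    with open("load_{0}_{1}.usql".format(repo, branch), "w") as f:
        f.write(script)
    # Submit each generated script with the ADLA SDK or CLI of your choice.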
As to the ability to select files based on their creation time: this is a feature on our backlog. I would recommend upvoting the following feature request and adding a comment that you also want to query on file properties over a fileset: https://feedback.azure.com/forums/327234-data-lake/suggestions/10948392-support-functionality-to-handle-file-properties-fr
