Incremental load in Azure Data Lake

I have a big blob storage full of log files organized according to their identifiers at a number of levels: repository, branch, build number, build step number.
These are JSON files that contain an array of objects, each object has a timestamp and an entry value. I've already implemented a custom extractor (extending IExtractor) that takes an input stream and produces a number of plain-text lines.
Initial load
Now I am trying to load all of that data to ADL Store. I created a query that looks similar to this:
@entries =
    EXTRACT
        repo string,
        branch string,
        build int,
        step int,
        Line int,
        Entry string
    FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
    USING new MyJSONExtractor();
When I run this extraction query I get a compiler error - it exceeds the limit of 25 minutes of compilation time. My guess is: too many files. So I add a WHERE clause to the INSERT INTO query:
INSERT INTO Entries
(Repo, Branch, Build, Step, Line, Entry)
SELECT * FROM #entries
WHERE (repo == "myRepo") AND (branch == "master");
Still no luck - the compiler times out.
(It does work, however, when I process a single build, leaving {step} as the only wildcard and hard-coding the rest of the names.)
Question: Is there a way to perform a load like that in a number of jobs - but without the need to explicitly (manually) "partition" the list of input files?
Incremental load
Let's assume for a moment that I succeeded in loading those files. However, a few days from now I'll need to perform an update - how am I supposed to specify the list of files? I have a SQL Server database where all the metadata is kept, and I could extract exact log file paths - but U-SQL's EXTRACT query forces me to provide a static string that specifies the input data.
A straightforward scenario would be to define a top-level directory for each date and process them day by day. But the way the system is designed makes this very difficult, if not impossible.
Question: Is there a way to identify files by their creation time? Or maybe there is a way to combine a query to a SQL Server database with the extraction query?

For your first question: Sounds like your FileSet pattern is generating a very large number of input files. To deal with that you may want to try the FileSets v2 preview, which is documented under the U-SQL Preview Features section in:
https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_04_24/USQL_Release_Notes_2017_04_24.md
Input File Set scales orders of magnitude better (an opt-in statement is now provided)
Previously, U-SQL's file set pattern on EXTRACT expressions ran into compile-time time-outs around 800 to 5,000 files.
U-SQL's file set pattern now scales to many more files and generates more efficient plans.
For example, a U-SQL script querying over 2,500 files in our telemetry system previously took over 10 minutes to compile and now compiles in 1 minute, and the script now executes in 9 minutes instead of over 35 minutes, using far fewer AUs. We have also compiled scripts that access 30,000 files.
The preview feature can be turned on by adding the following statement
to your script:
SET @@FeaturePreviews = "FileSetV2Dot5:on";
If you wanted to generate multiple extract statements based on partitions of your filepaths, you'd have to do it with some external code that generates one or more U-SQL scripts.
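One hedged way that external code could look, as a small Python sketch: it emits one U-SQL script per (repo, branch) pair so each submitted job only enumerates a small slice of the blob file set. The partition list, output directory, and script template below are illustrative, not part of the original setup.
# Hypothetical driver script: generate one U-SQL load script per (repo, branch)
# partition; the partition list would typically come from the metadata database.
from pathlib import Path

USQL_TEMPLATE = """\
@entries =
    EXTRACT build int, step int, Line int, Entry string
    FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{{build}}/{{step}}.json"
    USING new MyJSONExtractor();

INSERT INTO Entries (Repo, Branch, Build, Step, Line, Entry)
SELECT "{repo}" AS Repo, "{branch}" AS Branch, build, step, Line, Entry
FROM @entries;
"""

def generate_scripts(partitions, out_dir="usql_jobs"):
    """partitions: iterable of (repo, branch) pairs, e.g. read from the metadata DB."""
    Path(out_dir).mkdir(exist_ok=True)
    for repo, branch in partitions:
        script = USQL_TEMPLATE.format(repo=repo, branch=branch)
        Path(out_dir, f"load_{repo}_{branch}.usql").write_text(script)

generate_scripts([("myRepo", "master"), ("myRepo", "develop")])
Each generated script hard-codes repo and branch in the path and keeps only {build} and {step} as wildcards, which is the case the compiler already handles well.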
I don't have a good answer to your second question so I will get a colleague to respond. Hopefully the first part can get you unblocked for now.

To address your second question:
You could read your data from the SQL Server database using a federated query, and then use the information in a join with the virtual columns that you create from the fileset. The problem with that is that the values are only known at execution time and not at compile time, so you would not get the reduction in the accessed files.
Alternatively, you could write a SQL query that gets you the data you need and then parameterize your U-SQL script so you can pass that information into it.
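A hedged sketch of that parameterization route (the server, metadata table, template file, and submission command below are assumptions, not anything from the original setup): read the parameters from SQL Server with pyodbc and prepend them to the U-SQL script as DECLARE statements before submitting the job.
# Hypothetical driver: pull parameters from the metadata DB and splice them
# into a U-SQL script as DECLARE statements before submission.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=metadata-host;DATABASE=Builds;Trusted_Connection=yes"
)
row = conn.cursor().execute(
    "SELECT TOP 1 Repo, Branch FROM dbo.PendingLoads WHERE Processed = 0"
).fetchone()

# The template is assumed to reference @repo and @branch as scalar variables.
header = f'DECLARE @repo string = "{row.Repo}";\nDECLARE @branch string = "{row.Branch}";\n'
with open("load_template.usql") as f:
    script = header + f.read()

# Submit `script` with your tooling of choice, e.g. the Azure CLI or the
# Data Lake Analytics SDK.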
As to the ability to select files based on their creation time: this is a feature on our backlog. I would recommend upvoting the following feature request: https://feedback.azure.com/forums/327234-data-lake/suggestions/10948392-support-functionality-to-handle-file-properties-fr and adding a comment saying that you also want to query on those properties over a file set.

Related

Incremental batch processing in PySpark

In our Spark application, we run multiple batch processes every day. The sources for these batch processes differ: Oracle, MongoDB, files. We store a different value for incremental processing depending on the source, such as the latest timestamp for some Oracle tables, an ID for other Oracle tables, and a file list for some file systems, and we use those values for the next incremental run.
Currently the calculation of these offset values depends on the source type, and we need to customize the code to store this value every time we add a new source type.
Is there a generic way to resolve this issue, like checkpointing in streaming?
I always like to look at the destination for the last written partition, or get some max(primary_key), and then, based on that value, select the data to write from the source database during the current run.
There would be no need to store anything; you would just supply the table name, source type, and primary key/timestamp column to your batch processing algorithm, and the algorithm would then find the latest value you already have.
It really depends on your load philosophy and how your storage is divided, for example whether you have raw/source/prepared layers. It is a good idea to load data in a raw format that can easily be compared to the original source in order to do what I described above.
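A minimal PySpark sketch of that idea, assuming a Parquet destination and a JDBC source (all paths, table names, and credentials are placeholders):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

# 1. Find the highest key already present in the destination
#    (assumes the destination already exists; handle the very first run separately).
dest = spark.read.parquet("s3://my-bucket/prepared/orders")
last_key = dest.agg(F.max("order_id")).first()[0] or 0

# 2. Read only the newer rows from the source; the filter is pushed down to the
#    database where possible.
src = (spark.read.format("jdbc")
       .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
       .option("dbtable", "ORDERS")
       .option("user", "etl").option("password", "***")
       .load()
       .where(F.col("ORDER_ID") > last_key))

# 3. Append the new slice to the destination.
src.write.mode("append").parquet("s3://my-bucket/prepared/orders")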
Alternatives include:
Writing a file which contains the primary key column and its latest value; your batch job would read this file to determine what to read next (see the sketch after this list).
Updating the job execution configuration with an argument corresponding to the latest value, so that on the next run the latest value is passed to your algorithm.
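The checkpoint-file sketch referred to above; the path and key names are placeholders, and on a cluster this file would live on shared storage such as S3 or HDFS rather than a local disk:
import json, os

CHECKPOINT = "checkpoints/orders.json"   # placeholder path

def read_last_value(default=0):
    # Return the last processed value, or a default on the very first run.
    if not os.path.exists(CHECKPOINT):
        return default
    with open(CHECKPOINT) as f:
        return json.load(f)["last_value"]

def write_last_value(value):
    os.makedirs(os.path.dirname(CHECKPOINT), exist_ok=True)
    with open(CHECKPOINT, "w") as f:
        json.dump({"key_column": "order_id", "last_value": value}, f)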

Azure Data Factory split file by file size

This is coming out of my two weeks of Azure experience. I want to split files based on size. For example, there is a table with 200k rows, and I would like to set a parameter to split that table into multiple files with a limit of 100 MB per file (if that makes sense). It would return N files depending on the table size, named something like:
my_file_1ofN.csv
I walked through the documentation, blogs, and videos and managed to build some POCs with Azure Functions, Azure Batch, and Databricks with a Python script in my personal account. The problem is that the company doesn't let me use any of these approaches.
So I split the file using a number of partitions, but these files come out with different sizes depending on the table and the partition.
Is there a way to accomplish this? I'm experimenting with Lookup and ForEach activities in the pipeline now, but without good results.
Any idea or clue will be welcome. Thanks!!
I haven't been able to figure this out by size, but if you can get a total row count, you can use DataFlow to output a rough approximation based on row count.
IN THE PIPELINE:
In this example, I am reading data out of an Azure Synapse SQL Pool, so I'm running a Lookup to calculate the number of "Partitions" based on 8,000,000 rows per partition:
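That Lookup boils down to something like a ceiling division; in Python terms (the row count here is purely illustrative):
import math

total_rows = 120_000_000          # illustrative value returned by the Lookup
rows_per_partition = 8_000_000    # target rows per output file
partition_count = math.ceil(total_rows / rows_per_partition)   # -> 15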
I then capture the result as a variable:
Next, pass the variable to the DataFlow:
NOTE: the #int cast is there because DataFlow supports int but pipelines do not, so in the pipeline the value is stored in a string variable.
IN THE DATAFLOW:
Create an int parameter for "partitionCount", which is passed in from the pipeline:
SOURCE:
In the Optimize tab you can control how the source data is partitioned on read. For this purpose, switch to "Set Partitioning" and select Round Robin based on the partitionCount parameter:
This will split the incoming data into X number of buckets based on the parameter.
SINK:
Under the Settings tab, experiment with the "File name option" settings to control the output name. The options are a bit limited, so you may have trouble getting exactly what you want:
Since you have already partitioned the data at the source, just keep the default Optimize settings on the sink:
RESULT:
This will produce X number of files with a numbered naming scheme and consistent file size:

Spark tagging file names for purpose of possible later deletion/rollback?

I am using Spark 2.4 in AWS EMR.
I am using Pyspark and SparkSQL for my ELT/ETL and using DataFrames with Parquet input and output on AWS S3.
As of Spark 2.4, as far as I know, there is no way to tag or customize the file names of output (Parquet) files. Please correct me if I'm wrong.
When I store parquet output files on S3 I end up with file names which look like this:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
The middle part of the file name looks like it has an embedded GUID/UUID:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
I would like to know if I can obtain this GUID/UUID value from PySpark or Spark SQL at run time, to log/save/display it in a text file.
I need to log this GUID/UUID value because I may need to remove the files with this value in their names later, for manual rollback purposes (for example, I may discover a day or a week later that this data is somehow corrupt and needs to be deleted, so all files tagged with that GUID/UUID can be identified and removed).
I know that I can partition the table manually on a GUID column, but then I end up with too many partitions, which hurts performance. What I need is to somehow tag the files for each data load job, so I can identify and delete them easily from S3; hence the GUID/UUID value seems like one possible solution.
Open for any other suggestions.
Thank you
Is this with the new "S3A-specific committer"? If so, it means it is using Netflix's trick of putting a GUID in each file written so as to avoid eventual-consistency problems. That doesn't help much here, though.
Consider offering a patch to Spark which lets you add a specific prefix to a file name.
Or, for Apache Hadoop & Spark (i.e. not EMR), an option for the S3A committers to put that prefix in when they generate temporary filenames.
Short term: you can always list the before-and-after state of the directory tree (tip: use FileSystem.listFiles(path, recursive) for speed), and either remember the new files or rename them (which will be slow; remembering the new filenames is better).
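The answer above points at the Hadoop FileSystem API; an equivalent hedged sketch using boto3 directly against S3 (bucket, prefix, and manifest file name are placeholders) could look like this:
# Record which objects a load job added by diffing S3 listings taken before
# and after the Spark write; keep the manifest for a possible rollback.
import boto3

s3 = boto3.client("s3")

def list_keys(bucket, prefix):
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.update(obj["Key"] for obj in page.get("Contents", []))
    return keys

before = list_keys("my-bucket", "warehouse/orders/")
# ... run the Spark write here ...
after = list_keys("my-bucket", "warehouse/orders/")

new_files = sorted(after - before)
with open("load_manifest.txt", "w") as f:
    f.write("\n".join(new_files))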
Spark already writes files with a UUID in their names. Instead of creating too many partitions, you can set up custom file naming (e.g. add some ID). Maybe this is a solution for you: https://stackoverflow.com/a/43377574/1251549
Not tried yet (but planning to): https://github.com/awslabs/amazon-s3-tagging-spark-util
In theory, you can tag each file with the job ID (or whatever you like) and then run something like the sketch below to find and delete them.
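For illustration, a hedged boto3 sketch of that cleanup; the bucket, prefix, and the job_id tag key are assumptions about how the files were tagged:
# Delete every object under a prefix whose S3 tag matches a given job id.
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX, JOB_ID = "my-bucket", "warehouse/orders/", "2019-05-01-run-42"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        tags = s3.get_object_tagging(Bucket=BUCKET, Key=obj["Key"])["TagSet"]
        if any(t["Key"] == "job_id" and t["Value"] == JOB_ID for t in tags):
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])  # roll back this load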
Both solutions lead to multiple S3 ListObjects API requests, checking tags or filenames, and deleting the files one by one.

SQLyog Job Agent (SJA) for Linux: how to write something like 'all tables but a single one'

I am trying to learn SQLyog Job Agent (SJA).
I am on a Linux machine and use SJA from within a bash script with the command line: ./sja myschema.xml
I need to sync a database of almost 100 tables with its local clone.
Since a single table stores a bit of config data, which I do not wish to sync, it seems I need to write a myschema.xml where I list all of the remaining 99 tables.
The question is: is there a way to say "sync all the tables but a single one"?
I hope my question is clear. I appreciate your help.
If you are using the latest version of SQLyog, you are given the table selection shown below, and the option to generate an XML job file at the end of the database synchronisation wizard reflecting the operation you've opted to perform. This will in effect list the other 99 tables in the XML file itself, but it will give you what you are looking for. I don't think you would be doing anything in particular with an individual table anyway, since you are specifying all tables in a single element.

Best way to handle Flat File Import to SQL Server using C#.Net

I wrote a console application that reads a list of flat files, parses the data types on a row basis, and inserts the records one after another into the respective tables.
There are a few flat files which contain about 63k records (rows). For such files, my program takes about 6 hours to complete a single file of 63k records.
This is a test data file; in production I have to deal with 100 times the load.
I am worried whether I can do this any better to speed it up. Can anyone suggest the best way to handle this job?
The workflow is as follows:
Read the flat file from the local machine using File.ReadAllLines("location").
Create a record entity object after parsing each field of the row.
Insert the current row into the respective table via the entity.
The purpose of making this a console application is that it should run as a scheduled application on a weekly basis, and there is conditional logic in it: based on some variable there will be either a full table replace, an update of an existing table, or deletion of records in a table.
You can try to use a bulk insert operation for loading large amounts of data into the database instead of inserting the records one by one; in C# the usual tool is SqlBulkCopy, or the T-SQL BULK INSERT statement if the file is accessible to the server.
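As a rough, hedged illustration of the batching idea (sketched here in Python with pyodbc; the connection string, file name, delimiter, and table are all placeholders), the key point is to send the rows in large batches rather than issuing one INSERT round trip per row:
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=Staging;Trusted_Connection=yes"
)
cur = conn.cursor()
cur.fast_executemany = True   # send parameters in large batches instead of row-by-row round trips

with open("flatfile.txt") as f:
    rows = [tuple(line.rstrip("\n").split("|")) for line in f]

cur.executemany("INSERT INTO dbo.MyTable (Col1, Col2, Col3) VALUES (?, ?, ?)", rows)
conn.commit()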
