I'm working from only two weeks of Azure experience. I want to split files based on size. For example, for a table with 200k rows I would like to set a parameter that splits that table into multiple files with a limit of 100 MB per file (if that makes sense). It would return N files depending on the table size, something like:
my_file_1ofN.csv
I have been walking through the documentation, blogs, and videos, and was able to do some POCs with Azure Functions, Azure Batch, and Databricks with a Python script in my personal account. The problem is that the company doesn't let me use any of these approaches.
So far I have split the file using the number of partitions, but the resulting files have different sizes depending on the table and the partition.
Is there a way to accomplish this? I'm experimenting with Lookup and ForEach activities in the pipeline now, but without good results.
Any idea or clue will be welcome. Thanks!!
I haven't been able to figure this out by size, but if you can get a total row count, you can use DataFlow to output a rough approximation based on row count.
IN THE PIPELINE:
In this example, I am reading data out of an Azure Synapse SQL Pool, so I'm running a Lookup to calculate the number of "Partitions" based on 8,000,000 rows per partition:
I then capture the result as a variable:
Next, pass the variable to the DataFlow:
NOTE: the int() cast is there because Data Flow parameters support int but pipeline variables do not, so in the pipeline the value is stored in a string variable.
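For reference, the arithmetic behind that Lookup is just a ceiling division of the table's row count by the rows-per-partition target. A minimal Python sketch of the same calculation (the numbers are hypothetical; in the pipeline this runs as a SQL query against the pool):

import math

# Hypothetical figures: tune rows_per_partition so one partition lands near
# your target file size (e.g. ~100 MB).
rows_per_partition = 8_000_000
total_rows = 200_000              # value returned by the Lookup's COUNT query

# Same result as e.g. SELECT CEILING(COUNT_BIG(*) / 8000000.0) in the Lookup.
partition_count = max(1, math.ceil(total_rows / rows_per_partition))
print(partition_count)            # stored as a string variable in the pipeline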
IN THE DATAFLOW:
Create an int parameter for "partitionCount", which is passed in from the pipeline:
SOURCE:
In the Optimize tab you can control how the source data is partitioned on read. For this purpose, switch to "Set Partitioning" and select Round Robin based on the partitionCount parameter:
This will split the incoming data into X number of buckets based on the parameter.
SINK:
Under the Settings tab, experiment with the "File name option" settings to control the output name. The options are a bit limited, so you may have trouble getting exactly what you want:
Since you have already partitioned the data at the source, just use the default partitioning option in the sink's Optimize tab:
RESULT:
This will produce X number of files with a numbered naming scheme and consistent file size:
I followed the example below, and all is going well.
https://learn.microsoft.com/en-gb/azure/data-factory/tutorial-data-flow
Below is what the tutorial says about the output files and rows:
If you followed this tutorial correctly, you should have written 83
rows and 2 columns into your sink folder.
Below is the result from my example, which is correct, with the same number of rows and columns.
Below is my output. Please note that the total number of files is 77 (not 83, and not 1).
Question: Is it correct to have so many CSV files (77 items)?
Question: How can I combine all the files into one file without slowing down the process?
I can create one file by following the link below, but it warns that doing so slows down the process.
How to remove extra files when sinking CSV files to Azure Data Lake Gen2 with Azure Data Factory data flow?
The number of files generated from the process is dependent upon a number of factors. If you've set the default partitioning in the optimize tab on your sink, that will tell ADF to use Spark's current partitioning mode, which will be based on the number of cores available on the worker nodes. So the number of files will vary based upon how your data is distributed across the workers. You can manually set the number of partitions in the sink's optimize tab. Or, if you wish to name a single output file, you can do that, but it will result in Spark coalescing to a single partition, which is why you see that warning. You may find it takes a little longer to write that file because Spark has to coalesce existing partitions. But that is the nature of a big data distributed processing cluster.
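Under the hood, these Optimize choices map roughly onto plain Spark partitioning behaviour. A pyspark sketch of the two extremes (paths are placeholders; this illustrates the concept, not ADF's actual implementation):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/staging/my_table")   # placeholder input

# "Set partitioning" with a fixed count -> roughly N similarly sized files
df.repartition(8).write.mode("overwrite").csv("/output/eight_files")

# Naming a single output file -> coalesce to one partition, which removes
# write parallelism and is the reason ADF shows the slowdown warning
df.coalesce(1).write.mode("overwrite").csv("/output/single_file")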
I've seen many answers and blog posts suggesting that:
df.repartition('category').write.partitionBy('category')
Will output one file per category, but this doesn't appear to be true if the number of unique 'category' values in df is less than the number of default partitions (usually 200).
When I use the above code on a file with 100 categories, I end up with 100 folders each containing between 1 and 3 "part" files, rather than having all rows with a given "category" value in the same "part". The answer at https://stackoverflow.com/a/42780452/529618 seems to explain this.
What is the fastest way to get exactly one file per partition value?
Things I've tried
I've seen many claims that
df.repartition(1, 'category').write.partitionBy('category')
df.repartition(2, 'category').write.partitionBy('category')
Will create "exactly one file per category" and "exactly two files per category" respectively, but this doesn't appear to be how this parameter works. The documentation makes it clear that the numPartitions argument is the total number of partitions to create, not the number of partitions per column value. Based on that documentation, specifying this argument as 1 should (accidentally) output a single file per partition when the file is written, but presumably only because it removes all parallelism and forces your entire RDD to be shuffled / recalculated on a single node.
required_partitions = df.select('category').distinct().count()
df.repartition(required_partitions, 'category').write.partitionBy('category')
The above seems like a workaround based on the documented behaviour, but one that would be costly for several reasons. For one, a separate count is expensive if df is not cached (and/or is so big that it would be wasteful to cache it just for this purpose), and any repartitioning of a DataFrame can cause unnecessary shuffling in a multi-stage workflow that has various DataFrame outputs along the way.
The "fastest" way probably depends on the actual hardware set-up and the actual data (in case it is skewed). From my experience, I also agree that df.repartition('category').write.partitionBy('category') will not solve your problem.
We faced a similar problem in our application, but instead of first doing a count and then the repartition, we separated writing the data and the requirement of having only a single file per partition into two different Spark jobs. The first job is optimized to write the data. The second job just iterates over the partitioned folder structure, reads the data per folder/partition, coalesces it to a single partition, and overwrites it back. I cannot tell whether that is also the fastest way in your environment, but for us it did the trick.
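A rough pyspark sketch of that second (compaction) job, assuming the first job wrote Parquet partitioned by a column called category (paths and names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
base_path = "/data/output"        # partitioned output of the first job

# Compact each category folder down to a single file. Writing to a sibling
# tree here; overwriting the source folders in place would need an extra
# temp location, since Spark cannot safely overwrite a path it is reading.
categories = [row["category"] for row in
              spark.read.parquet(base_path).select("category").distinct().collect()]

for value in categories:
    folder = f"{base_path}/category={value}"
    (spark.read.parquet(folder)
          .coalesce(1)
          .write.mode("overwrite")
          .parquet(f"{base_path}_compacted/category={value}"))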
Doing some research on this topic led me to the Auto Optimize Writes feature on Databricks for writing to a Delta table. There, they use a similar approach: first write the data, then run a separate OPTIMIZE job to compact the files. In the linked documentation you will find this explanation:
"After an individual write, Azure Databricks checks if files can further be compacted, and runs an OPTIMIZE job [...] to further compact files for partitions that have the most number of small files."
As a side note: Make sure to keep the configuration spark.sql.files.maxRecordsPerFile to 0 (default value) or to a negative number. Otherwise, this configuration alone could lead to multiple files for data with the same value in the column "category".
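For example, assuming an existing SparkSession named spark (a sketch):

# 0 (the default) disables the per-file record cap, so it cannot re-split a
# partition into several output files on write.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 0)
print(spark.conf.get("spark.sql.files.maxRecordsPerFile"))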
You can try coalesce(n); coalesce is used to decrease the number of partitions and, unlike repartition, avoids a full shuffle.
n = the number of output partitions you want.
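For example (the DataFrame and output path are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)       # stand-in for your real DataFrame

n = 4                             # desired number of output files
df.coalesce(n).write.mode("overwrite").csv("/output/coalesced")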
In our Spark application, we run multiple batch processes every day. The sources for these batch processes are different, e.g., Oracle, MongoDB, and files. We store a different value for incremental processing depending on the source, such as the latest timestamp for some Oracle tables, an ID for other Oracle tables, and a file list for some file systems, and we use those values for the next incremental run.
Currently, the calculation of these offset values depends on the source type, and we need to customize the code to store this value every time we add a new source type.
Is there any generic way to solve this, like checkpointing in streaming?
I always like to look at the destination for the last written partition, or get some max(primary_key), and then, based on that value, select the data to write from the source database during the current run.
There would be no need to store anything; you would just need to supply the table name, source type, and primary key/timestamp column to your batch processing algorithm. The algorithm would then find the latest value you already have.
It really depends on your load philosophy and how your storage is divided, e.g., whether you have raw/source/prepared layers. It is a good idea to load data in a raw format that can easily be compared to the original source, so you can do what I described above.
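A minimal pyspark sketch of that idea, assuming a JDBC source and hypothetical table, column, and path names (the "query" option needs Spark 2.4+):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Find the latest value already present in the destination (raw layer).
destination = spark.read.parquet("/raw/orders")                # placeholder path
last_seen = destination.agg(F.max("order_id")).collect()[0][0] or 0

# 2. Pull only the newer rows from the source for this run.
incremental = (spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")  # placeholder
    .option("query", f"SELECT * FROM orders WHERE order_id > {last_seen}")
    .option("user", "etl_user")
    .option("password", "secret")
    .load())

incremental.write.mode("append").parquet("/raw/orders")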
Alternatives include:
Writing a file that contains the primary column and its latest value; your batch job would read this file to determine what to read next.
Updating the job execution configuration with an argument corresponding to the latest value, so on the next run the latest value is passed to your algorithm.
I'm trying to write a Dataflow job that needs to process logs located in storage and write them to different BigQuery tables. Which output tables are used depends on the records in the logs, so I do some processing on the logs and yield them with a key based on a value in the log. I then group the logs by key; I need to write all the logs grouped under the same key to one table.
I'm trying to use the beam.io.gcp.bigquery.WriteToBigQuery module with a callable as the table argument as described in the documentation here
I would like to use a date-partitioned table as this will easily allow me to write_truncate on the different partitions.
Now I encounter 2 main problems:
The CREATE_IF_NEEDED disposition gives an error because it has to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step, creating them there if necessary.
If I load older data I get the following error:
The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date."
This seems like a limitation of streaming inserts; is there any way to do batch inserts?
Maybe I'm approaching this wrong, and should use another method.
Any guidance on how to tackle these issues is appreciated.
I'm using Python 3.5 and apache-beam==2.13.0.
That error message can be logged when one mixes the use of an ingestion-time partitioned table with a column-partitioned table (see this similar issue). Summarizing from the link: it is not possible to use column-based partitioning (as opposed to ingestion-time partitioning) and write to tables with partition suffixes.
In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the use of the partition decorator when selecting which table (use "[prefix]_YYYYMMDD") and then have each individual table be column-based partitioned.
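One way to read that advice in code: have the table callable return a plain table name (no $ decorator) and rely on column-based partitioning inside each table. A rough sketch — the bucket, dataset, field names, and parser are hypothetical, and the additional_bq_parameters/FILE_LOADS options shown here need a newer Beam SDK than the 2.13.0 mentioned in the question:

import json
import apache_beam as beam

def parse_log(line):
    # Hypothetical parser: one JSON object per line.
    return json.loads(line)

def route_to_table(element):
    # Pick a per-type table WITHOUT a $YYYYMMDD partition decorator; the
    # partitioning happens on a timestamp column inside each table instead.
    return "my_project:my_dataset.logs_{}".format(element["log_type"])

with beam.Pipeline() as p:
    (p
     | "ReadLogs" >> beam.io.ReadFromText("gs://my-bucket/logs/*.json")
     | "Parse" >> beam.Map(parse_log)
     | "Write" >> beam.io.WriteToBigQuery(
           table=route_to_table,
           # Tables created ahead of time (as in your workaround), each one
           # column-partitioned on its timestamp field.
           create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           # Column-based (not ingestion-time) partitioning per table.
           additional_bq_parameters={
               "timePartitioning": {"type": "DAY", "field": "timestamp"}},
           # Batch loads sidestep the streaming-insert partition-age limits.
           method=beam.io.WriteToBigQuery.Method.FILE_LOADS))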
I have a big blob storage full of log files organized according to their identifiers at a number of levels: repository, branch, build number, build step number.
These are JSON files that contain an array of objects, each object has a timestamp and an entry value. I've already implemented a custom extractor (extending IExtractor) that takes an input stream and produces a number of plain-text lines.
Initial load
Now I am trying to load all of that data to ADL Store. I created a query that looks similar to this:
@entries =
EXTRACT
repo string,
branch string,
build int,
step int,
Line int,
Entry string
FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{build}/{step}.json"
USING new MyJSONExtractor();
When I run this extraction query I get a compiler error - it exceeds the limit of 25 minutes of compilation time. My guess is: too many files. So I add a WHERE clause in the INSERT INTO query:
INSERT INTO Entries
(Repo, Branch, Build, Step, Line, Entry)
SELECT * FROM #entries
WHERE (repo == "myRepo") AND (branch == "master");
Still no luck - compiler times out.
(It does work, however, when I process a single build, leaving {step} as the only wildcard, and hard-coding the rest of names.)
Question: Is there a way to perform a load like that in a number of jobs - but without the need to explicitly (manually) "partition" the list of input files?
Incremental load
Let's assume for a moment that I succeeded in loading those files. However, a few days from now I'll need to perform an update - how am I supposed to specify the list of files? I have a SQL Server database where all the metadata is kept, and I could extract exact log file paths - but U-SQL's EXTRACT query forces me to provide a static string that specifies the input data.
A straightforward scenario would be to define a top-level directory for each date and process them day by day. But the way the system is designed makes this very difficult, if not impossible.
Question: Is there a way to identify files by their creation time? Or maybe there is a way to combine a query to a SQL Server database with the extraction query?
For your first question: it sounds like your FileSet pattern is generating a very large number of input files. To deal with that, you may want to try the FileSets v2 preview, which is documented under the U-SQL Preview Features section in:
https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_04_24/USQL_Release_Notes_2017_04_24.md
Input File Set scales orders of magnitude better (an opt-in statement is now provided)
Previously, U-SQL's file set pattern on EXTRACT expressions ran into compile-time time-outs around 800 to 5,000 files.
U-SQL's file set pattern now scales to many more files and generates more efficient plans.
For example, a U-SQL script querying over 2,500 files in our telemetry system previously took over 10 minutes to compile and now compiles in 1 minute, and the script now executes in 9 minutes instead of over 35 minutes, using far fewer AUs. We have also compiled scripts that access 30,000 files.
The preview feature can be turned on by adding the following statement to your script:
SET @@FeaturePreviews = "FileSetV2Dot5:on";
If you wanted to generate multiple extract statements based on partitions of your filepaths, you'd have to do it with some external code that generates one or more U-SQL scripts.
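For example, a small Python sketch that stamps out one script per (repo, branch) from a template, so each job only compiles over a subset of the files (the template, partition list, and file names are made up):

# Generate one U-SQL script per (repo, branch) partition.
TEMPLATE = """\
@entries =
    EXTRACT repo string, branch string, build int, step int, Line int, Entry string
    FROM @"wasb://my.blob.core.windows.net/{repo}/{branch}/{{build}}/{{step}}.json"
    USING new MyJSONExtractor();

INSERT INTO Entries (Repo, Branch, Build, Step, Line, Entry)
SELECT * FROM @entries;
"""

partitions = [("myRepo", "master"), ("myRepo", "develop")]  # e.g. from your metadata DB

for repo, branch in partitions:
    with open(f"load_{repo}_{branch}.usql", "w") as f:
        # {{build}} and {{step}} stay as U-SQL file set wildcards in the output.
        f.write(TEMPLATE.format(repo=repo, branch=branch))

Each generated script can then be submitted as its own Data Lake Analytics job through whatever submission mechanism you already use.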
I don't have a good answer to your second question so I will get a colleague to respond. Hopefully the first part can get you unblocked for now.
To address your second question:
You could read your data from the SQL Server database using a federated query, and then use the information in a join with the virtual columns that you create from the fileset. The problem with that is that the values are only known at execution time and not at compile time, so you would not get the reduction in the accessed files.
Alternatively, you could write a SQL query that gets you the data you need and then parameterize your U-SQL script so you can pass that information into the U-SQL script.
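A sketch of that parameterization from Python with pyodbc; the connection string, metadata query, and parameter name are placeholders:

import pyodbc

# Look up the cutoff (or the exact file paths) from the metadata database.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=myserver;DATABASE=metadata;UID=etl;PWD=secret")
cursor = conn.cursor()
cursor.execute("SELECT MAX(LoadDate) FROM dbo.LogFiles")
cutoff = cursor.fetchone()[0]

# Prepend the value as a U-SQL DECLARE so the script can filter on it.
header = 'DECLARE @cutoff DateTime = DateTime.Parse("{:%Y-%m-%d}");\n'.format(cutoff)
with open("incremental_load.usql") as template, \
     open("incremental_load_generated.usql", "w") as out:
    out.write(header + template.read())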
As to the ability to select files based on their creation time: this is a feature on our backlog. I would recommend upvoting and commenting on the following feature request, noting that you also want to query on file properties over a file set: https://feedback.azure.com/forums/327234-data-lake/suggestions/10948392-support-functionality-to-handle-file-properties-fr