Azure Data Factory: dynamically partition a CSV/TXT file based on row count

I am using an Azure Data Factory data flow to transform delimited files (CSV/TXT) to JSON. I want to split the output dynamically based on a maximum row count of 5,000, because I will not know the row count in advance. So if I have a CSV file with 10,000 rows, the pipeline should output two equal JSON files, file1.json and file2.json. What is the best way to actually get the row count of my sources and derive the correct number of partitions from that row count within Azure Data Factory?

One way to achieve this is with the mod (%) operator.
To start with, add a surrogate key to the CSV file, or use any sequential key already in the data.
Add an Aggregate step with a group-by clause of your key % row count.
Set the aggregate function to collect().
Your output should now be an array of rows with the expected count in each group.
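
To make that concrete, here is a rough PySpark equivalent of the same steps (sequential key, then key % number of buckets, then collect into arrays); the file paths, header option, and the 5,000-row cap are assumptions for illustration, not part of the original answer.

from math import ceil
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input; the 5,000-row cap comes from the question
df = spark.read.option("header", True).csv("source.csv")

# Sequential key, playing the role of the surrogate key step
w = Window.orderBy(F.monotonically_increasing_id())
df = df.withColumn("key", F.row_number().over(w))

# Number of buckets needed so no bucket exceeds 5,000 rows
num_buckets = ceil(df.count() / 5000)

# Group by key % num_buckets and collect each group into an array of rows,
# mirroring the Aggregate step with collect()
buckets = (df.groupBy((F.col("key") % num_buckets).alias("bucket"))
             .agg(F.collect_list(F.struct(*[c for c in df.columns if c != "key"]))
                  .alias("rows")))

buckets.write.json("output_json")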

We can't tell the data flow to split the CSV file at a specific row count. The closest workaround is to specify the partitioning of the sink.
For example, I have a CSV file that contains 700 rows of data, and I successfully copied it into two equal JSON files.
My source CSV data is in Blob storage.
Sink settings: each partition outputs a new file, json1.json and json2.json.
Optimize settings:
Partition option: Set partitioning
Partition type: Dynamic partition
Number of partitions: 2 (split the CSV data into 2 partitions)
Stored ranges in columns: id (split based on the id column)
Run the data flow and the CSV file will be split into two JSON files, each containing 350 rows of data.
For your situation, a CSV file with 10,000 rows will produce two equal JSON files (each containing 5,000 rows of data).
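
The number of partitions in this example is hard-coded to 2; to derive it from an unknown row count you need ceil(row count / 5,000), which could be computed in a preceding pipeline step and passed to the data flow as a parameter. A minimal Python sketch of the calculation (the sample row counts are made up):

from math import ceil

MAX_ROWS_PER_FILE = 5000  # the cap from the question

for row_count in (10_000, 12_345):
    partitions = ceil(row_count / MAX_ROWS_PER_FILE)
    print(row_count, "rows ->", partitions, "partition(s)")

# 10,000 rows -> 2 partitions, matching the two 5,000-row JSON files above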

Related

I am getting an OOM "array size exceeds VM limit" error when I run my PySpark job

I have a cluster of 10 workers, each with 4 CPUs and 16 GB of memory; executor memory is 10 GB and driver memory is 10 GB.
About the data:
The input is a huge text file of about 1.5 GB, and inside the text file each record is separated by 1%% as a delimiter.
Each record is itself a file (a record may contain anywhere from 10 to 60,000+ lines) with a unique report id; the report_id should be extracted from the first line of the record.
Let's say record number 3 alone is about 1 GB.
When I run regexp_replace, I get the Java "array size exceeds VM limit" error. I was trying to replace all of the "\n" characters with "\r\n".
I have repartitioned the data into 30 partitions.
I figured out a way around it; it is specific to my use case.
My input is not a collection of rows and columns separated by a comma or tab. It contains 100 text files, each of which ends with 1%% marking the end of the file.
Simply put, I just want to parse each of the 100 files and write out output for each file in txt format.
So my df contains one column, named value: row 1 in the value column contains file 1, row 2 contains file 2, and so on.
The OOM error happened because one particular file was over 1 GB in size. So I created two jobs:
the first job maps each file to a row, creates an index column, and writes the output partitioned by that index;
the second job then reads the partial output generated by the first job page by page instead of a whole file at a time.
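
A rough sketch of what that first job might look like in PySpark; the paths, column names, and the use of the text reader's lineSep option (to treat 1%% as the record separator) are assumptions based on the description above, not the poster's actual code, and lineSep requires a reasonably recent Spark version.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Job 1: read with "1%%" as the record separator so each row holds one whole file
records = spark.read.option("lineSep", "1%%").text("input/huge_file.txt")

# Add an index per record and pull the report_id from the record's first line
w = Window.orderBy(F.monotonically_increasing_id())
indexed = (records
           .withColumn("index", F.row_number().over(w))
           .withColumn("report_id", F.split(F.col("value"), "\n").getItem(0)))

# Write partitioned by index; Job 2 can then process one record's partition at a time
indexed.write.partitionBy("index").parquet("stage/records_by_index")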

How does Spark store a Parquet table?

I have a fact table stored as Parquet that is 10 TB and contains 100+ columns. I created another table with just 10 of those columns, and its size is 2 TB.
I was expecting the size to be in the GB range because I am storing just a few (10) columns.
My question is: when we have more columns, does the Parquet format store them more efficiently?
Parquet is column-based storage. Say I have a table with the fields userId, name, address, state, and phone number.
In row-based (non-Parquet) storage, if I run select * where state = "TN", the engine goes through every record in the table (i.e. all the columns of each row) and outputs the records that match the where condition. In Parquet, the values of each column are stored together, so the query doesn't need to read all the other columns: the same select goes straight to the state column and outputs the matching records. Parquet is good for faster retrieval, and it doesn't matter how many columns are present in total.
Parquet commonly uses Snappy compression, and because the values of a column are stored together, compression is very effective.
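
A small PySpark illustration of the column pruning described above; the path and column names are hypothetical.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

users = spark.read.parquet("warehouse/users.parquet")  # hypothetical path

# Only the selected column chunks are read from the Parquet files;
# the remaining columns are skipped entirely
tn_users = users.select("userId", "name", "state").where(F.col("state") == "TN")

tn_users.explain()  # ReadSchema in the plan lists just the pruned columns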

Delimited file with a varying number of columns per row - Azure Data Factory

I have a hash-delimited file that looks somewhat like this:
value#value#value#value#value#value##value
value#value#value#value##value#####value#####value
value#value#value#value###value#value####value##value
As you can see, when split on the hashes, there are more columns in the 2nd and 3rd rows than in the first. I want to ingest this into a database using an ADF data flow after some transformations. However, whenever I try to do any kind of mapping, I only ever see 7 columns (the number of columns in the first row).
Is there any way to get all of the values, i.e. as many columns as there are in the row with the most items? I do not mind the nulls.
Note: I do not have a header row for this.
Azure Data Factory will not directly import a schema from the row with the maximum number of columns. Hence, it is important to make sure you have the same number of columns in every row of your file.
You can use Azure Functions to validate your file and update it so that all rows have an equal number of columns.
You could try keeping a local sample file whose first row has the maximum number of columns and import the schema from that file; otherwise you have to go with Azure Functions, converting the file and then triggering the pipeline.
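
A minimal sketch of the kind of pre-processing such an Azure Function could do, padding every row out to the widest row's column count; the file names are made up, and the hash delimiter comes from the question.

# Pad every hash-delimited row to the maximum column count so ADF can
# import a single consistent schema (missing values become empty fields)
def pad_rows(lines, delimiter="#"):
    rows = [line.rstrip("\r\n").split(delimiter) for line in lines]
    width = max(len(row) for row in rows)
    return [delimiter.join(row + [""] * (width - len(row))) for row in rows]

with open("input.txt") as src:
    padded = pad_rows(src)

with open("normalized.txt", "w") as dst:
    dst.write("\n".join(padded) + "\n")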

Parquet Format - split columns in different files

The Parquet documentation explicitly mentions that the design supports splitting the metadata and data into different files, including the possibility that different column groups can be stored in different files.
However, I could not find any instructions on how to achieve that. In my use case I would like to store the metadata in one file, the data for columns 1-100 in one file, and columns 101-200 in a second file.
Any idea how to achieve this?
If you are using PySpark, it's as easy as this:
df = spark.createDataFrame(...)
df.write.parquet('file_name.parquet')
and it will create a folder called file_name.parquet in the default location in HDFS. You can simply create two dataframes, one with columns 1-100 and the other with columns 101-200, and save them separately. It will automatically save the metadata, if by metadata you mean the dataframe schema.
You can select a range of columns like this:
df_first_hundred = df.select(df.columns[:100])
df_second_hundred = df.select(df.columns[100:])
Save them as separate files:
df_first_hundred.write.parquet('df_first_hundred')
df_second_hundred.write.parquet('df_second_hundred')
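
One thing the answer leaves implicit: Parquet files written separately carry no row-order guarantee, so if you ever need to stitch the two halves back into one table, add a shared key before splitting. A hedged variant of the snippet above (row_id is a made-up column name):

from pyspark.sql import functions as F

# Add a stable key before splitting so the halves can be joined back later
# (caching df helps both writes see the same generated ids)
cols = df.columns
df = df.withColumn("row_id", F.monotonically_increasing_id()).cache()

df.select(["row_id"] + cols[:100]).write.parquet("df_first_hundred")
df.select(["row_id"] + cols[100:]).write.parquet("df_second_hundred")

# Later, reassemble the full table on the shared key
full = (spark.read.parquet("df_first_hundred")
             .join(spark.read.parquet("df_second_hundred"), "row_id"))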

How does the Parquet file size change with the row count of a Spark Dataset?

I came across a scenario where I had a Spark dataset with 24 columns; I was grouping by the first 22 columns and summing up the last two.
I removed the group by from the query, so all 24 columns are now selected as-is.
The initial count of the dataset was 79,304.
After I removed the group by, the count increased to 138,204, which is understandable because the aggregation no longer collapses rows.
What I don't understand is that the initial size of the Parquet file was 2.3 MB, but afterwards it was reduced to 1.5 MB. Can anyone please help me understand this?
Also, the size doesn't shrink every time.
I had a similar scenario with 22 columns,
where the count before was 35,298,226 and after removing the group by was 59,874,208,
and there the size increased from 466.5 MB to 509.8 MB.
When dealing with Parquet sizes, it's not about the number of rows, it's about the data itself.
Parquet is a column-oriented format, so it stores and compresses the data column by column. Therefore it's not the number of rows that matters but rather the diversity of values within each column.
How well Parquet compresses is limited by the most diverse column in the table. So even a one-column dataframe will compress well as long as the values in that column are close to each other (low diversity).
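
A quick way to see this effect; the data is made up purely to contrast low- and high-diversity columns, and the size check assumes Spark is writing to the local filesystem.

import os
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

base = spark.range(1_000_000)

# Low diversity: five repeated values compress extremely well
base.select((F.col("id") % 5).alias("value")) \
    .write.mode("overwrite").parquet("/tmp/low_diversity")

# High diversity: pseudo-random 64-bit values leave little for the encoder to exploit
base.select(F.xxhash64("id").alias("value")) \
    .write.mode("overwrite").parquet("/tmp/high_diversity")

def folder_size(path):
    return sum(os.path.getsize(os.path.join(path, f)) for f in os.listdir(path))

print("low diversity :", folder_size("/tmp/low_diversity"), "bytes")
print("high diversity:", folder_size("/tmp/high_diversity"), "bytes")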
