pyspark csv format - mergeschema - apache-spark

I have a large dump of data that spans multiple TBs. The files contain daily activity data.
Day 1 can have 2 columns and Day 2 can have 3 columns. The dump is in csv format. Now I need to read all these files and load them into a table. The problem is that the format is csv and I am not sure how to merge the schemas so that no columns are lost. I know this can be achieved in parquet through mergeSchema, but I can't convert these files one by one into parquet because the data is huge. Is there any way to merge schemas when the format is csv?
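One possible approach, sketched below under the assumption that you are on Spark 3.1+ and can list the daily paths (the paths and options here are placeholders): read each day's csv separately and combine the DataFrames with unionByName(allowMissingColumns=True), which pads the columns a given day is missing with nulls and so effectively merges the schemas.
# Sketch only: paths, header/inferSchema options and the output location are assumptions.
from functools import reduce
daily_paths = ["/data/activity/day1/*.csv", "/data/activity/day2/*.csv"]  # hypothetical
daily_dfs = [
    spark.read.option("header", "true").option("inferSchema", "true").csv(p)
    for p in daily_paths
]
# unionByName with allowMissingColumns=True (Spark 3.1+) fills absent columns with nulls,
# so no column from any day is lost.
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), daily_dfs)
merged.write.mode("overwrite").saveAsTable("activity")  # table name is an assumption; write.parquet(...) works too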

Related

Parquet Format - split columns in different files

The parquet documentation explicitly mentions that the design supports splitting the metadata and the data into different files, including the possibility of storing different column groups in different files.
However, I could not find any instructions on how to achieve that. In my use case I would like to store the metadata in one file, store the data for columns 1-100 in one file and columns 101-200 in a second file.
Any idea how to achieve this?
If you are using PySpark, it's as easy as this:
df = spark.createDataFrame(...)
df.write.parquet('file_name.parquet')
and it will create a folder called file_name.parquet in the default location in HDFS. You can just create two dataframes, one with columns 1-100 and the other with columns 101-200, and save them separately. It will automatically save the metadata, if by metadata you mean the DataFrame schema.
You can select a range of columns like this:
df_first_hundred = df.select(df.columns[:100])
df_second_hundred = df.select(df.columns[100:])
Save them as separate files:
df_first_hundred.write.parquet('df_first_hundred')
df_second_hundred.write.parquet('df_second_hundred')
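Reading them back later works like any other parquet folder (folder names as above). If the two halves ever need to be lined up row by row again, it is safer to carry an explicit key column in both halves than to rely on row order.
df_first_hundred = spark.read.parquet('df_first_hundred')
df_second_hundred = spark.read.parquet('df_second_hundred')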

Pyspark: how to filter by date and read parquet files which are partitioned by date

I have a huge dataset of partitioned parquet files stored in AWS S3 in data-store/year=<>/month=<>/day=<>/hour=<>/ folder format, e.g. data-store/year=2020/month=06/day=01/hour=05.
I want to read the files only for a specific date range, e.g. 2020/06/01 to 2020/08/30, or something like all the records with a date greater than or equal to 2020/06/01.
How do I do this efficiently so that only the required data is loaded into Spark memory?
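A hedged sketch of the usual approach (the bucket name and prefix are placeholders): point the reader at the partition root and filter on the partition columns, so Spark can prune the year/month/day directories instead of scanning everything.
from pyspark.sql import functions as F
# Sketch: the s3 path is an assumption; year/month/day come from the partition folders.
df = spark.read.parquet("s3://my-bucket/data-store/")
# Everything on or after 2020-06-01 (month=06 starts at day=01, so day is not needed here).
filtered = df.filter(
    (F.col("year").cast("int") > 2020) |
    ((F.col("year").cast("int") == 2020) & (F.col("month").cast("int") >= 6))
)
filtered.explain() should show the condition under PartitionFilters in the FileScan node; if it appears there, only the matching directories are listed and read.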

Azure Data Factory DYNAMICALLY partition a csv/txt file based on rowcount

I am using Azure data flow to transform delimited files (csv/txt) to json. But I want to split the output files dynamically based on a max row count of 5,000, because I will not know the row count every time. So if I have a csv file with 10,000 rows, the pipeline will output two equal json files, file1.json and file2.json. What is the best way to get the row count of my sources and derive the correct number of partitions from that row count within Azure Data Factory?
One way to achieve this is to use the mod (%) operator.
To start with, set a surrogate key on the CSV file, or use any sequential key in the data.
Add an aggregate step with a group-by clause of your key % row count.
Set the aggregate function to collect().
Your output should now be an array of rows with the expected count in each.
We can't specify a row count at which to split the csv file. The closest workaround is to specify the partitioning of the sink.
For example, I have a csv file that contains 700 rows of data, and I successfully copied it into two equal json files.
My source csv data in Blob storage:
Sink settings: each partition outputs a new file, json1.json and json2.json:
Optimize:
Partition operation: Set partition
Partition type: Dynamic partition
Number of partitions: 2 (meaning the csv data is split into 2 partitions)
Stored ranges in columns: id (split based on the id column)
Run the data flow and the csv file will be split into two json files, each containing 350 rows of data.
For your situation, with a csv file of 10,000 rows, the pipeline will output two equal json files (each containing 5,000 rows of data).
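This is ADF-specific, but for comparison, a rough PySpark sketch of the same idea (paths are placeholders, not part of the question): count the rows, derive the number of roughly-5,000-row partitions, and let a round-robin repartition spread the rows before writing json.
from math import ceil
df = spark.read.option("header", "true").csv("input.csv")   # hypothetical input path
max_rows = 5000
num_parts = max(1, ceil(df.count() / max_rows))
# repartition() without a column does a round-robin shuffle, so the output
# files end up roughly equal in size (about max_rows rows each).
df.repartition(num_parts).write.mode("overwrite").json("output_json")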

Converting data from .dat to parquet using Pyspark

Why is the number of rows different after converting from .dat to parquet format using pyspark? Even when I repeat the conversion on the same file multiple times, I get a different result (slightly more than, slightly fewer than, or equal to the original row count)!
I am using my MacBook Pro with 16 GB of RAM.
The .dat file size is 16.5 GB.
spark-2.3.2-bin-hadoop2.7.
I already have the row count from my data provider (45 million rows).
First I read the .dat file:
df_2011 = spark.read.text(filepath)
Second, I convert it to parquet, a process that takes about two hours:
df_2011.write.option("compression", "snappy").mode("overwrite").save("2011.parquet")
Afterwards, I read the converted parquet file:
de_parq = spark.read.parquet("2011.parquet")
Finally, I use count() to get the number of rows:
de_parq.count()
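As a hedged sanity check (same placeholder paths and variable names as above): compare the count taken straight from the text read with the count after the parquet round trip, so you can tell whether the difference comes from the read or from the write.
raw_count = df_2011.count()                                   # rows as read from the .dat file
parquet_count = spark.read.parquet("2011.parquet").count()    # rows after the round trip
print(raw_count, parquet_count)                               # a write/read round trip should not change the count
If raw_count itself varies between runs on the same file, the discrepancy is happening at read time (for example, the source is changing or being listed differently), not in the parquet conversion.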

parquet column pruning in spark

I know parquet supports reading only the user-selected columns. But when I use
spark.read.parquet(path).select(*cols)
to read the data, it looks like it still reads the whole file. Is there any way in Spark to read only the selected column chunks?
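One way to check whether the pruning is actually happening is to look at the physical plan: the FileScan node's ReadSchema should list only the selected columns. A small sketch with a placeholder path and column names:
df = spark.read.parquet("/path/to/data.parquet").select("col_a", "col_b")
df.explain()  # the FileScan ... ReadSchema: struct<col_a:...,col_b:...> line confirms column pruning
Even with pruning, every file's footer is still read, so some I/O on all files is expected; the savings come from skipping the unselected column chunks.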
