Converting data from .dat to parquet using Pyspark - apache-spark

Why is the number of rows different after converting from .dat to the parquet format using PySpark? Even when I repeat the conversion on the same file multiple times, I get a different result (slightly more than, slightly less than, or equal to the original row count)!
I am using my MacBook Pro with 16 GB of RAM.
The .dat file size is 16.5 GB.
Spark version: spark-2.3.2-bin-hadoop2.7.
I already have the row count from my data provider (45 million rows).
First, I read the .dat file:
df_2011 = spark.read.text(filepath)
Second, I convert it to parquet, a process that takes about two hours:
df_2011.write.option("compression", "snappy").mode("overwrite").save("2011.parquet")
Afterwards, I read the converted parquet file:
de_parq = spark.read.parquet("2011.parquet")
Finally, I use count to get the number of rows:
de_parq.count()
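For reference, a minimal sketch of the same pipeline with the count taken on the source text and again on the parquet read back, so both numbers come from a single run; the path is a placeholder and the variable names follow the question.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dat-to-parquet").getOrCreate()
filepath = "/data/2011.dat"  # placeholder path to the 16.5 GB .dat file

df_2011 = spark.read.text(filepath)          # one row per line of the .dat file
source_count = df_2011.count()               # row count of the source text

(df_2011.write
    .option("compression", "snappy")
    .mode("overwrite")
    .parquet("2011.parquet"))                # same write as above, parquet format

de_parq = spark.read.parquet("2011.parquet")
parquet_count = de_parq.count()              # row count after conversion

print(source_count, parquet_count)           # these two numbers should match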

Related

I am getting an OOM "array size exceeds VM limit" error when I run my PySpark job

I have a cluster of 10 workers, each with 4 CPUs and 16 GB of RAM; executor memory is 10 GB and driver memory is 10 GB.
About the data:
the input is a huge text file of about 1.5 GB, and inside the text file each record is separated by 1%% as a delimiter.
Each record is effectively another file (each may contain anywhere from 10 to 60,000+ lines) with a unique report ID; the report_id should be extracted from the first line of the record.
And let's say record number 3 corresponds to a file of 1 GB.
When I try regexp_replace (I was trying to replace every "\n" with "\r\n"), I get the Java "array size exceeds VM limit" error.
I have repartitioned the data into 30 partitions.
I figured out a workaround; it is specific to my use case.
My input is not a collection of rows and columns separated by a comma or tab. It consists of 100 text files, each ending with 1%% to mark the end of the file.
Simply put, I just want to parse each of the 100 files and write out the output for each file in txt format.
So my df contains one column named value: row 1 in the value column contains file 1, row 2 contains file 2, and so on.
The OOM error happened because the size of a particular file was over 1 GB, so I created two jobs:
the first job maps each file to a row, creates an index column, and writes out the output partitioned by index;
the second job then reads page by page, instead of file by file, from the partial output generated by the first job.
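For reference, a rough sketch of an alternative that avoids running regexp_replace over gigabyte-sized values, assuming the 1%% markers really are the record boundaries: change the Hadoop record delimiter so Spark splits the input on 1%% at read time. The paths and the report-id extraction below are placeholder assumptions.

# Read one element per 1%%-delimited record instead of one element per line.
rdd = spark.sparkContext.newAPIHadoopFile(
    "/data/huge_input.txt",                                   # placeholder path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "1%%"},
)

# Drop the byte-offset key, keep the record text, and skip empty trailing records.
records = rdd.map(lambda kv: kv[1]).filter(lambda rec: rec.strip())

# The report id is assumed to be the first non-empty line of each record.
with_id = records.map(lambda rec: (rec.strip().split("\n", 1)[0].strip(), rec))

df = with_id.toDF(["report_id", "value"])
df.write.mode("overwrite").partitionBy("report_id").text("/data/per_report/")  # placeholder output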

pyspark csv format - mergeschema

I have a large dump of data that spans TBs. The files contain activity data on a daily basis.
Day 1 can have 2 columns and Day 2 can have 3 columns. The dump is in CSV format. Now I need to read all these files and load them into a table. The problem is that the format is CSV and I am not sure how to merge the schemas so as not to lose any columns. I know this can be achieved for parquet through mergeSchema, but I can't convert these files one by one into parquet because the data is huge. Is there any way to merge schemas with CSV as the format?
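One possible sketch, assuming Spark 3.1 or later (for allowMissingColumns) and that every file has a header row: read each day's CSV with its own inferred schema, then union by column name so days with fewer columns simply get nulls in the missing ones. The paths and table name below are placeholders.

from functools import reduce

day_paths = ["/data/activity/day1/", "/data/activity/day2/"]  # placeholder paths

daily_dfs = [
    spark.read.option("header", "true").option("inferSchema", "true").csv(p)
    for p in day_paths
]

# Union by name, filling columns that a given day lacks with nulls.
merged = reduce(
    lambda left, right: left.unionByName(right, allowMissingColumns=True),
    daily_dfs,
)

merged.write.mode("overwrite").saveAsTable("activity_merged")  # placeholder table name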

Dataframe saving in Python

I have split a string into date and time, and I am satisfied with how it was done.
The output is the data split into 4 columns.
Now I would like to save the dataframe to a CSV file, but every time I do, the old/original data format is stored (with 2 columns).
Splitting a string into date and time (4 columns)
Unable to save the changes I made to the data (2 columns)
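A likely cause, assuming a pandas DataFrame (the question does not say which library is used): the split columns have to be assigned back to the DataFrame that gets written, otherwise to_csv writes the original two-column frame. A minimal sketch with placeholder file and column names:

import pandas as pd

df = pd.read_csv("input.csv")  # placeholder input with a combined "timestamp" column

# Assign the split result back to the DataFrame instead of discarding it.
df[["date", "time"]] = df["timestamp"].str.split(" ", n=1, expand=True)
df = df.drop(columns=["timestamp"])

# Write the modified DataFrame, not the original one.
df.to_csv("output.csv", index=False)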

Specify max file size while writing a dataframe as parquet

When I try to write a dataframe as parquet, the file sizes are non-uniform. I don't need the files to be uniform, but I do want to set a maximum size for each file.
I can't afford to repartition the data because the dataframe is sorted (as I understand it, repartitioning a sorted dataframe can distort the ordering).
Any help would be appreciated.
I have come across maxRecordsPerFile, but I don't want to limit the number of rows, and I might not have full information about the columns (their total number and types), so it's difficult to estimate file size from a row count.
I have read about the parquet block size as well, and I don't think that helps.
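For what it's worth, a rough sketch of one workaround, since as far as I know there is no direct max-file-size option for the parquet writer: write a small sample, measure its on-disk size, and derive a maxRecordsPerFile value for a target size. The sample fraction, the 128 MB target, and the local paths are placeholder assumptions, and df stands for the sorted dataframe from the question.

import os

target_bytes = 128 * 1024 * 1024           # assumed target file size
sample_path = "/tmp/size_probe.parquet"    # placeholder local path

sample = df.sample(fraction=0.001, seed=42)
sample_rows = sample.count()
sample.write.mode("overwrite").parquet(sample_path)

# Sum the bytes of the sample output (local filesystem only; use the Hadoop
# FileSystem API for HDFS or S3 instead).
sample_bytes = sum(
    os.path.getsize(os.path.join(root, name))
    for root, _, names in os.walk(sample_path)
    for name in names
)

rows_per_file = max(1, int(sample_rows * target_bytes / sample_bytes))

(df.write
   .option("maxRecordsPerFile", rows_per_file)
   .mode("overwrite")
   .parquet("/data/output.parquet"))       # placeholder output path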

What's the best way to store a number (<= 10k) of huge csv files (~200k lines) in Cassandra?

I tried storing each file as a whole string in a text column, but I get errors when saving huge strings (over 45 million characters). Are there any better options than creating a long row_index column as a clustering key and storing the .csv line by line (each CSV record becomes a row in the table)?
Write/read constraints: there is only one write (as soon as the .csv file is created) and a small number (<= 10) of reads per file. We'd probably like to optimize for fast writes, if possible.
Thanks.
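For reference, a rough sketch of the line-per-row layout described in the question, assuming the Python cassandra-driver; the contact point, keyspace, table, file, and column names are placeholders.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])           # placeholder contact point
session = cluster.connect("my_keyspace")   # placeholder keyspace

# One partition per file, one row per CSV line, ordered by row_index.
session.execute("""
    CREATE TABLE IF NOT EXISTS csv_files (
        file_id   text,
        row_index int,
        line      text,
        PRIMARY KEY ((file_id), row_index)
    )
""")

insert = session.prepare(
    "INSERT INTO csv_files (file_id, row_index, line) VALUES (?, ?, ?)"
)

with open("report_0001.csv") as f:         # placeholder file
    for i, line in enumerate(f):
        session.execute(insert, ("report_0001", i, line.rstrip("\n")))

# Reading one file back, in line order:
rows = session.execute(
    "SELECT line FROM csv_files WHERE file_id = %s", ("report_0001",)
)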
