Getting "OutOfMemoryError: Requested array size exceeds VM limit" when I run my PySpark job - apache-spark

I have a cluster of 10 workers, each with 4 CPUs and 16 GB of memory; executor memory is 10 GB and driver memory is 10 GB.
About the data:
The input is a huge text file of about 1.5 GB, in which each record is terminated by 1%% as a delimiter.
Each record is itself another file (a record may contain anywhere from 10 to 60,000+ lines) with a unique report ID; the report_id has to be extracted from the first line of the record.
And let's say record number 3 is a file of about 1 GB.
When I try regexp_replace to replace every "\n" with "\r\n", I get the Java "array size exceeds VM limit" error.
I have repartitioned the data into 30 partitions.
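For reference, one common way to split input like this on a custom record delimiter such as 1%% (a sketch with a placeholder path, not necessarily the exact code used here) is Hadoop's textinputformat.record.delimiter:

# Sketch: read the input so that each 1%%-terminated record becomes one element,
# instead of one element per newline-terminated line.
conf = {"textinputformat.record.delimiter": "1%%"}
records_rdd = spark.sparkContext.newAPIHadoopFile(
    "path/to/input.txt",  # placeholder path
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf=conf,
).map(lambda kv: kv[1])  # keep the record text, drop the byte offset
records_df = spark.createDataFrame(records_rdd.map(lambda r: (r,)), ["value"])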

I figured out a workaround; it is specific to my use case.
My input is not a collection of rows and columns separated by a comma or tab. It contains 100 text files, and each file ends with 1%% marking the end of that file.
Simply put, I just want to parse each of the 100 files and write out an output for each one in txt format.
So my DataFrame contains one column named value: row 1 of the value column contains file 1, row 2 contains file 2, and so on.
The OOM error happened because one particular file was over 1 GB in size. So I created two jobs:
the first job maps each file to a row, creates an index column, and writes the output partitioned by that index;
the second job then reads the partial output generated by the first job page by page instead of a whole file at a time.
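A rough sketch of the first job (column names and paths are placeholders, and it assumes a records_df with a single value column holding one embedded file per row, as described above):

from pyspark.sql import functions as F

# Job 1 (sketch): add an index per row (per embedded file) and write the output
# partitioned by that index, so each file's text lands in its own directory.
indexed = records_df.withColumn("idx", F.monotonically_increasing_id())
(indexed
 .select("idx", "value")
 .write
 .partitionBy("idx")
 .mode("overwrite")
 .text("output/by_index"))

# Job 2 (sketch) can then read one partition (one original file) at a time:
one_file = spark.read.text("output/by_index/idx=0")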

Related

Azure Data Factory DYNAMICALLY partition a csv/txt file based on rowcount

I am using an Azure Data Factory data flow to transform delimited files (csv/txt) to JSON. I want to split the output files dynamically based on a maximum row count of 5,000, because I will not know the row count every time. So if I have a CSV file with 10,000 rows, the pipeline should output two equal JSON files, file1.json and file2.json. What is the best way to get the row count of my sources and derive the correct number n of partitions from that row count within Azure Data Factory?
One way to achieve this is to use the mod (%) operator.
To start with, set a surrogate key on the CSV file, or use any sequential key in the data.
Add an Aggregate step with a group-by clause of key % row count.
Set the aggregate function to collect().
Your output should now be an array of rows with the expected count in each.
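As a plain-Python illustration of the same bucketing idea (pandas here rather than ADF; the key and file names are made up for the example), each bucket ends up with at most the target row count:

import pandas as pd

# Illustration only: round-robin rows into ceil(total / max_rows) buckets
# via key % n_files, then write one JSON file per bucket.
max_rows = 5000
df = pd.DataFrame({"key": range(10000), "value": "x"})
n_files = -(-len(df) // max_rows)  # ceil(10000 / 5000) = 2
df["bucket"] = df["key"] % n_files
for bucket, part in df.groupby("bucket"):
    part.drop(columns="bucket").to_json(f"file{bucket + 1}.json", orient="records")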
We can't specify the row number at which to split the CSV file. The closest workaround is to specify the partitioning of the sink.
For example, I have a CSV file containing 700 rows of data, and I can successfully copy it into two equal JSON files.
My source CSV data is in Blob storage.
Sink settings: each partition outputs a new file, json1.json and json2.json.
Optimize:
Partition operation: Set partition
Partition type: Dynamic partition
Number of partitions: 2 (split the CSV data into 2 partitions)
Stored ranges in columns: id (split based on the id column)
Run the data flow and the CSV file will be split into two JSON files, each containing 350 rows of data.
For your situation, with a CSV file of 10,000 rows, the pipeline will output two equal JSON files (each containing 5,000 rows).

Create dataframe from text file based on certain criteria

I have a text file that is around 3.3GB. I am only interested in 2 columns in this text file (out of 47). From these 2 columns, I only need rows where col2=='text1'. For example, consider my text file to have values such as:
text file:
col1~col2~~~~~~~~~~~~
12345~text1~~~~~~~~~~~~
12365~text1~~~~~~~~~~~~
25674~text2~~~~~~~~~~~~
35458~text3~~~~~~~~~~~~
44985~text4~~~~~~~~~~~~
I want to create a df where col2=='text1'. What I have done so far is try to load the entire text file into my df and then filter out the rows I need. However, since this is a large text file, creating the df takes more than 45 minutes. I believe loading only the necessary rows (if possible) would be ideal, as the df would be considerably smaller and I wouldn't run into memory issues.
My code:
import pandas as pd
df = pd.read_csv('myfile.txt', low_memory=False, sep='~', usecols=['col1', 'col2'], dtype={'col2': str})
df1 = df[df['col2'] == 'text1']
In short, can I filter on a column, based on a criterion, while loading the text file into the DataFrame, so as to 1) reduce loading time and 2) reduce the memory footprint of the df?
Okay, so I came up with a solution. Basically it comes down to loading the data in chunks and filtering each chunk for col2=='text1'. This way, only one chunk is in memory at a time, and my final df only holds the data I need.
Code:
final = pd.DataFrame()
# Read in chunks of 100,000 rows and keep only the matching rows from each chunk
df = pd.read_csv('myfile.txt', low_memory=False, sep='~',
                 usecols=['col1', 'col2'], dtype={'col2': str}, chunksize=100000)
for chunk in df:
    a = chunk[chunk['col2'] == 'text1']
    final = pd.concat([final, a], axis=0)
Better alternatives, if any, will be most welcome!
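One small refinement of the same approach (a sketch): collect the filtered chunks in a list and concatenate once at the end, since calling pd.concat inside the loop re-copies the accumulated frame on every iteration.

import pandas as pd

pieces = []
for chunk in pd.read_csv('myfile.txt', low_memory=False, sep='~',
                         usecols=['col1', 'col2'], dtype={'col2': str},
                         chunksize=100000):
    pieces.append(chunk[chunk['col2'] == 'text1'])  # keep only matching rows
final = pd.concat(pieces, ignore_index=True)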

Partitioning the data into equal number of records for each group in spark data frame

We have one month of data, and each day's data ranges from 10 to 100 GB in size. We write this data set out in a partitioned manner: we have a DATE column by which we partition the DataFrame (partitionBy("DATE")), and we also apply repartition to control how many files are created. If we repartition to 1, it creates one file per partition; if we set it to 5, it creates five files per partition, which is good.
But what we are trying to do here is make sure each group (each date's partitioned data) is written out as files of equal size (either by number of records or by file size).
We have used the DataFrame writer option "maxRecordsPerFile", set to 10 million records, and it works as expected. But for 10 days of data, doing this in one go eats up the execution time, as it collects all 10 days of data and tries to do some distribution.
If I don't set this option and don't repartition to 1, the job completes in 5 minutes; but with just partitionBy("DATE") and the maxRecordsPerFile option it takes almost an hour.
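For reference, the write being described corresponds roughly to this (a sketch; the output path is a placeholder):

(df
 .write
 .option("maxRecordsPerFile", 10000000)  # cap of 10 million records per output file
 .partitionBy("DATE")
 .mode("overwrite")
 .parquet("/output/path"))               # placeholder path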
Looking forward to some help on this!
~Krish

Converting data from .dat to parquet using Pyspark

Why is the number of rows different after converting from .dat to Parquet using PySpark? Even when I repeat the conversion on the same file multiple times, I get a different result (slightly more than, slightly fewer than, or equal to the original row count)!
I am using my MacBook Pro with 16 GB of RAM.
The .dat file is 16.5 GB.
Spark version: spark-2.3.2-bin-hadoop2.7.
I already have the row count from my data provider (45 million rows).
First, I read the .dat file:
df_2011 = spark.read.text(filepath)
Second, I convert it to Parquet, a process that takes about two hours:
df_2011.write.option("compression", "snappy").mode("overwrite").save("2011.parquet")
Afterwards, I read the converted Parquet file:
de_parq = spark.read.parquet(filepath)
Finally, I use count to get the number of rows:
de_parq.count()
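A quick sanity-check sketch (paths are placeholders): count the raw text DataFrame before writing, so the comparison is raw-vs-Parquet rather than only Parquet-vs-provider.

raw_df = spark.read.text(filepath)
raw_count = raw_df.count()  # compare this with the provider's 45 million first

raw_df.write.option("compression", "snappy").mode("overwrite").parquet("2011.parquet")
parq_count = spark.read.parquet("2011.parquet").count()
print(raw_count, parq_count)  # these two should match exactly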

Cassandra trigger on [COPY FROM MAXBATCHSIZE]

When I run the following CQL, I find that the Cassandra trigger is fired not once per record but once per batch.
COPY XXX_Table FROM 'xxxx.csv' WITH MAXBATCHSIZE=10
For example, I have a CSV file with 2,000,000 records; after the above CQL has run, there are 2,000,000 records in Cassandra, but the trigger fired only about 200,000 times.
Why?
It's because some rows in your CSV file share the same partition key, so they are inserted as batches rather than one by one. With MAXBATCHSIZE=10, 2,000,000 rows can collapse into roughly 200,000 batches, which would explain the roughly 200,000 trigger invocations you saw.
When importing data, the parent process reads chunks of CHUNKSIZE rows from the input file(s) and sends each chunk to a worker process. Each worker process then analyses a chunk for rows with common partition keys. If at least 2 rows with the same partition key are found, they are batched and sent to a replica that owns the partition. You can control the minimum number of rows with a new option, MINBATCHSIZE, but it is advisable to leave it set to 2. Rows that do not share any common partition key get batched with other rows whose partition keys belong to a common replica. These rows are then split into batches of size MAXBATCHSIZE, currently 20 rows. These batches are sent to the replicas where the partitions are located. Batches are of type UNLOGGED in both cases.
Based on:
link
