Pyspark write.csv() killed by YARN for exceeding memory limits - apache-spark

Premise: I'm not in control of my cluster and am working on the assumption that the problem is in my code, not the setup my school is using. Maybe that is wrong, but it's an assumption that underlies this question.
Why does write.csv() cause my pyspark/slurm job to exceed memory limits, when many previous operations on larger versions of the data have succeeded, and what can I do about it?
The error I'm getting is (many iterations of...):
18/06/02 16:13:41 ERROR YarnScheduler: Lost executor 21 on server.name.edu: Container killed by YARN for exceeding memory limits. 7.0 GB of 7 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I know I can change the memory limits, but I've already increased them several times with no change in outcome, and I'm pretty convinced I shouldn't be using anywhere near this amount of memory anyway. For reference, my slurm call is:
spark-submit \
--master yarn \
--num-executors 100 \
--executor-memory 6g \
3main.py
So what exactly am I trying to write? Well, I've read a 39 GB .bz2 JSON file into an RDD,
allposts = ss.read.json(filename)
filtered a bunch, counted words, grouped the RDD, done some calculations, filtered more, and in the end I have these two print statements to give an idea of what's left...
abscounts = calculatePosts2(postRDD, sc, spark)
abscounts.printSchema()
print(abscounts.count())
These print statements work (output below). The resulting RDD is about 60 columns by roughly 2000 rows. Those 60 columns include one string the length of a subreddit name and 59 doubles.
root
|-- subreddit: string (nullable = true)
|-- count(1): long (nullable = false)
|-- sum(wordcount): long (nullable = true)
|-- ingestfreq: double (nullable = true)
|-- causefreq: double (nullable = true)
|-- insightfreq: double (nullable = true)
|-- cogmechfreq: double (nullable = true)
|-- sadfreq: double (nullable = true)
|-- inhibfreq: double (nullable = true)
|-- certainfreq: double (nullable = true)
|-- tentatfreq: double (nullable = true)
|-- discrepfreq: double (nullable = true)
|-- spacefreq: double (nullable = true)
|-- timefreq: double (nullable = true)
|-- exclfreq: double (nullable = true)
|-- inclfreq: double (nullable = true)
|-- relativfreq: double (nullable = true)
|-- motionfreq: double (nullable = true)
|-- quantfreq: double (nullable = true)
|-- numberfreq: double (nullable = true)
|-- swearfreq: double (nullable = true)
|-- functfreq: double (nullable = true)
|-- absolutistfreq: double (nullable = true)
|-- ppronfreq: double (nullable = true)
|-- pronounfreq: double (nullable = true)
|-- wefreq: double (nullable = true)
|-- ifreq: double (nullable = true)
|-- shehefreq: double (nullable = true)
|-- youfreq: double (nullable = true)
|-- ipronfreq: double (nullable = true)
|-- theyfreq: double (nullable = true)
|-- deathfreq: double (nullable = true)
|-- biofreq: double (nullable = true)
|-- bodyfreq: double (nullable = true)
|-- hearfreq: double (nullable = true)
|-- feelfreq: double (nullable = true)
|-- perceptfreq: double (nullable = true)
|-- seefreq: double (nullable = true)
|-- fillerfreq: double (nullable = true)
|-- healthfreq: double (nullable = true)
|-- sexualfreq: double (nullable = true)
|-- socialfreq: double (nullable = true)
|-- familyfreq: double (nullable = true)
|-- friendfreq: double (nullable = true)
|-- humansfreq: double (nullable = true)
|-- affectfreq: double (nullable = true)
|-- posemofreq: double (nullable = true)
|-- negemofreq: double (nullable = true)
|-- anxfreq: double (nullable = true)
|-- angerfreq: double (nullable = true)
|-- assentfreq: double (nullable = true)
|-- nonflfreq: double (nullable = true)
|-- verbfreq: double (nullable = true)
|-- articlefreq: double (nullable = true)
|-- pastfreq: double (nullable = true)
|-- auxverbfreq: double (nullable = true)
|-- futurefreq: double (nullable = true)
|-- presentfreq: double (nullable = true)
|-- prepsfreq: double (nullable = true)
|-- adverbfreq: double (nullable = true)
|-- negatefreq: double (nullable = true)
|-- conjfreq: double (nullable = true)
|-- homefreq: double (nullable = true)
|-- leisurefreq: double (nullable = true)
|-- achievefreq: double (nullable = true)
|-- workfreq: double (nullable = true)
|-- religfreq: double (nullable = true)
|-- moneyfreq: double (nullable = true)
...
2026
After that the only remaining line in my code is:
abscounts.write.csv('bigoutput.csv', header=True)
And this crashes with memory errors. This absolutely should not take up gigs of space... What am I doing wrong here?
Thanks for your help.
If you're curious/it helps, the entirety of my code is on GitHub.

First of all, spark.yarn.executor.memoryOverhead is not the same as executor-memory, as you can see here.
With PySpark, memoryOverhead is important because it controls the extra memory that may be needed by Python to perform some actions (see here); in your case, collecting and saving a CSV file per partition.
To help Python, you may also consider using coalesce before writing.
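For example, you could pass --conf spark.yarn.executor.memoryOverhead=2048 to your existing spark-submit call (2048 MB is only an illustrative starting value, not a tuned recommendation) and collapse the output to a single partition before the write. A minimal sketch of the coalesce step, reusing the abscounts DataFrame from the question:
# With only ~2000 result rows, one partition is plenty; coalesce avoids a
# full shuffle and keeps each Python CSV task (and its memory overhead) small.
abscounts.coalesce(1).write.csv('bigoutput.csv', header=True)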

Related

Pyspark structured streaming - Union data from 2 nested JSON

I have 2 Kafka streaming dataframes. The Spark schemas look like this:
root
|-- key: string (nullable = true)
|-- pmudata1: struct (nullable = true)
| |-- pmu_id: byte (nullable = true)
| |-- time: timestamp (nullable = true)
| |-- stream_id: byte (nullable = true)
| |-- stat: string (nullable = true)
and
root
|-- key: string (nullable = true)
|-- pmudata2: struct (nullable = true)
| |-- pmu_id: byte (nullable = true)
| |-- time: timestamp (nullable = true)
| |-- stream_id: byte (nullable = true)
| |-- stat: string (nullable = true)
How can I union all rows from both streams as they arrive, per specific batch window? The positions of the columns in both streams are the same.
Each stream has a different pmu_id value, so I can differentiate records by that value.
unionByName or union produces a stream from a single dataframe.
I would need to explode the column names, I guess, something like this, but that is for Scala.
Is there a way to automatically explode the whole JSON into columns and union them?
You can use the explode function only with array and map types. In your case, the column pmudata2 has type StructType, so simply use a star * to select all sub-fields, like this:
df1 = df.selectExpr("key", "pmudata2.*")
#root
#|-- key: string (nullable = true)
#|-- pmu_id: byte (nullable = true)
#|-- time: timestamp (nullable = true)
#|-- stream_id: byte (nullable = true)
#|-- stat: string (nullable = true)
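Flattening both streams the same way should then let a plain unionByName line the rows up (pmu_id still distinguishes the two sources). A minimal sketch, where stream1 and stream2 stand for the two raw streaming DataFrames from the question:
# Flatten each stream so both expose the same top-level columns, then union them.
flat1 = stream1.selectExpr("key", "pmudata1.*")
flat2 = stream2.selectExpr("key", "pmudata2.*")
combined = flat1.unionByName(flat2)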

How to add an optional column inside struct field with pyspark

I have created a struct field in this way:
df = df.withColumn('my_struct', struct(
    col('id').alias('id_test'),
    col('value').alias('value_test')
).alias('my_struct'))
The thing is that now I need to add an extra field to my_struct called "optional". This field must be there when it exists and be removed when it's not. Sadly, values like null/None are not an option.
So far I have two different dataframes: one with the desired value and the column, keyed by id, and another one without the value/column but with all the information.
df_optional = df_optional.select('id','optional')
df = df.select('id','value','my_struct')
I want to add the optional value into df.my_struct when df_optional.id matches df.id, and keep the rest unchanged.
Up to this point I have this:
df_with_option = df.join(df_optional, on=['id'], how='inner') \
    .withColumn('my_struct', struct(
        col('id').alias('id_test'),
        col('value').alias('value_test'),
        col('optional').alias('optional')
    ).alias('my_struct')).drop('optional')
df_without = df.join(df_optional,on=['id'],how='leftanti') # it already have my_struct
But a union requires matching columns, so my code breaks.
df_result = df_without.unionByName(df_with_option)
I want to union both dataframes because at the end I write JSON files partitioned by id:
df_result.repartitionByRange(df_result.count(), 'id').write.format('json').mode('overwrite').save('my_path')
Those JSON files should have the 'optional' column when it has values; otherwise it should be absent from the schema.
Any help will be appreciated.
--ADDITIONAL INFO.
Schema input:
df_optional
|-- id: string (nullable = true)
|-- optional: string (nullable = true)
df
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: string (nullable = true)
Schema output:
df_result
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: string (nullable = true)
| |-- optional: string (nullable = true) (*)
(*) Only when it exists.
--UPDATE
I think that it's just not possible that way. I probably need to keep both dataframes apart and just write them twice. Something like this:
df_without.repartitionByRange(df_without.count(), 'id').write.format('json').mode('overwrite').save('my_path')
df_with_option.repartitionByRange(df_with_option.count(), 'id').write.format('json').mode('append').save('my_path')
Then I would have the files in my path, each written in its own way.

Apache Spark byte size per avro record

I have several hundred gigabytes' worth of Avro files, each containing thousands of records pertaining to a mobile application and its usage. One of the keys in the schema is the app version ID, and I need to return the byte size of each record, grouped by the version ID. If the schema is set up something like this...
root
|-- useId: string (nullable = true)
|-- useTime: double (nullable = true)
|-- appVersion: string (nullable = true)
|-- useDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- type: integer (nullable = true)
| | |-- something: double (nullable = true)
| | |-- somethingElse: double (nullable = true)
.
.
.
...then I want to essentially do something like select appVersion, sum(bytesPerRecord) from df group by appVersion in order to gauge the payload sizes (or char count, even) per released version of the app. I haven't found any off-the-shelf solutions for this, and I am not a spark savant either. Is this possible?
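No answer is included here, but one hedged sketch of an approach (names and the path are placeholders; it approximates size via each row's JSON-serialized length rather than the true Avro on-disk bytes, and assumes the built-in avro reader of Spark 2.4+ or the spark-avro package plus an existing spark session):
from pyspark.sql import functions as F

# Approximate each record's size by the length of its JSON serialization,
# then sum the approximate sizes per appVersion.
df = spark.read.format("avro").load("/path/to/avro")
per_version = (
    df.withColumn("approx_bytes", F.length(F.to_json(F.struct(*df.columns))))
      .groupBy("appVersion")
      .agg(F.sum("approx_bytes").alias("total_approx_bytes"))
)
per_version.show()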

Spark ML: How does DecisionTreeClassificationModel know about the tree weights?

I'd like to get the weights for the tree nodes from a saved (or unsaved) DecisionTreeClassificationModel. However, I can't find anything remotely resembling that.
How does the model actually perform the classification without knowing any of those? Below are the Params that are saved in the model:
{"class":"org.apache.spark.ml.classification.DecisionTreeClassificationModel"
"timestamp":1551207582648
"sparkVersion":"2.3.2"
"uid":"DecisionTreeClassifier_4ffc94d20f1ddb29f282"
"paramMap":{
"cacheNodeIds":false
"maxBins":32
"minInstancesPerNode":1
"predictionCol":"prediction"
"minInfoGain":0.0
"rawPredictionCol":"rawPrediction"
"featuresCol":"features"
"probabilityCol":"probability"
"checkpointInterval":10
"seed":956191873026065186
"impurity":"gini"
"maxMemoryInMB":256
"maxDepth":2
"labelCol":"indexed"
}
"numFeatures":1
"numClasses":2
}
By using treeWeights:
treeWeights
Return the weights for each tree
New in version 1.5.0.
So, as for "How does the model actually perform the classification without knowing any of those?":
The weights are stored, just not as part of the metadata. If you have a model
from pyspark.ml.classification import RandomForestClassificationModel
model: RandomForestClassificationModel = ...
and save it to disk
path: str = ...
model.save(path)
you'll see that the writer creates a treesMetadata subdirectory. If you load its content (the default writer uses Parquet):
import os
trees_metadata = spark.read.parquet(os.path.join(path, "treesMetadata"))
you'll see the following structure:
trees_metadata.printSchema()
root
|-- treeID: integer (nullable = true)
|-- metadata: string (nullable = true)
|-- weights: double (nullable = true)
where the weights column contains the weight of the tree identified by treeID.
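For a quick look at the weights themselves (using the column names from the schema above):
trees_metadata.select("treeID", "weights").orderBy("treeID").show()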
Similarly, node data is stored in the data subdirectory (see, for example, Extract and Visualize Model Trees from Sparklyr):
spark.read.parquet(os.path.join(path, "data")).printSchema()
root
|-- id: integer (nullable = true)
|-- prediction: double (nullable = true)
|-- impurity: double (nullable = true)
|-- impurityStats: array (nullable = true)
| |-- element: double (containsNull = true)
|-- gain: double (nullable = true)
|-- leftChild: integer (nullable = true)
|-- rightChild: integer (nullable = true)
|-- split: struct (nullable = true)
| |-- featureIndex: integer (nullable = true)
| |-- leftCategoriesOrThreshold: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- numCategories: integer (nullable = true)
Equivalent information (minus tree data and tree weights) is available for DecisionTreeClassificationModel as well.

How to sum the values of a column in pyspark dataframe

I am working in Pyspark and I have a data frame with the following columns.
Q1 = spark.read.csv("Q1final.csv",header = True, inferSchema = True)
Q1.printSchema()
root
|-- index_date: integer (nullable = true)
|-- item_id: integer (nullable = true)
|-- item_COICOP_CLASSIFICATION: integer (nullable = true)
|-- item_desc: string (nullable = true)
|-- index_algorithm: integer (nullable = true)
|-- stratum_ind: integer (nullable = true)
|-- item_index: double (nullable = true)
|-- all_gm_index: double (nullable = true)
|-- gm_ra_index: double (nullable = true)
|-- coicop_weight: double (nullable = true)
|-- item_weight: double (nullable = true)
|-- cpih_coicop_weight: double (nullable = true)
I need the sum of all the elements in the last column (cpih_coicop_weight) to use as a Double in other parts of my program. How can I do it?
Thank you very much in advance!
If you want just a double or int returned, the following function will work:
from pyspark.sql import functions as F

def sum_col(df, col):
    return df.select(F.sum(col)).collect()[0][0]
Then
sum_col(Q1, 'cpih_coicop_weight')
will return the sum.
I am new to pyspark so I am not sure why such a simple method of a column object is not in the library.
Try this:
from pyspark.sql import functions as F
total = Q1.groupBy().agg(F.sum("cpih_coicop_weight")).collect()
The variable total will hold your result; since collect() returns a list of Rows, the value itself is total[0][0].
This can also be tried:
total = Q1.agg(F.sum("cpih_coicop_weight")).collect()
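If you want the value directly as a Python float rather than a list of Rows, a small variant of the same aggregation (same column name as above) is:
total_value = Q1.agg(F.sum("cpih_coicop_weight")).first()[0]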
