I am working in Pyspark and I have a data frame with the following columns.
Q1 = spark.read.csv("Q1final.csv",header = True, inferSchema = True)
Q1.printSchema()
root
|-- index_date: integer (nullable = true)
|-- item_id: integer (nullable = true)
|-- item_COICOP_CLASSIFICATION: integer (nullable = true)
|-- item_desc: string (nullable = true)
|-- index_algorithm: integer (nullable = true)
|-- stratum_ind: integer (nullable = true)
|-- item_index: double (nullable = true)
|-- all_gm_index: double (nullable = true)
|-- gm_ra_index: double (nullable = true)
|-- coicop_weight: double (nullable = true)
|-- item_weight: double (nullable = true)
|-- cpih_coicop_weight: double (nullable = true)
I need the sum of all the elements in the last column (cpih_coicop_weight) to use as a Double in other parts of my program. How can I do it?
Thank you very much in advance!
If you want just a double or int as the return value, the following function will work (note the functions import it relies on):
from pyspark.sql import functions as F

def sum_col(df, col):
    return df.select(F.sum(col)).collect()[0][0]
Then
sum_col(Q1, 'cpih_coicop_weight')
will return the sum.
I am new to PySpark, so I am not sure why such a simple method is not already part of the Column API.
Try this:
from pyspark.sql import functions as F
total = Q1.groupBy().agg(F.sum("cpih_coicop_weight")).collect()
total will hold your result as a one-element list of Rows; index with total[0][0] to get the bare double.
This shorter form also works:
total = Q1.agg(F.sum("cpih_coicop_weight")).collect()
I have 2 Kafka streaming dataframes. The Spark schemas look like this:
root
|-- key: string (nullable = true)
|-- pmudata1: struct (nullable = true)
| |-- pmu_id: byte (nullable = true)
| |-- time: timestamp (nullable = true)
| |-- stream_id: byte (nullable = true)
| |-- stat: string (nullable = true)
and
root
|-- key: string (nullable = true)
|-- pmudata2: struct (nullable = true)
| |-- pmu_id: byte (nullable = true)
| |-- time: timestamp (nullable = true)
| |-- stream_id: byte (nullable = true)
| |-- stat: string (nullable = true)
How can I union all rows from both streams as they arrive, within a specific batch window? The column positions in both streams are the same.
Each stream has a different pmu_id value, so I can differentiate records by that value.
unionByName or union only produces a stream from a single dataframe.
I guess I would need to explode the column names, something like this, but that example is for Scala.
Is there a way to automatically explode whole JSON in columns and union them?
You can use the explode function only with array and map types. In your case, the column pmudata2 has type StructType, so simply use the star * to select all its sub-fields, like this:
df1 = df.selectExpr("key", "pmudata2.*")
#root
#|-- key: string (nullable = true)
#|-- pmu_id: byte (nullable = true)
#|-- time: timestamp (nullable = true)
#|-- stream_id: byte (nullable = true)
#|-- stat: string (nullable = true)
I currently create a struct field this way:
df = df.withColumn('my_struct', struct(
    col('id').alias('id_test'),
    col('value').alias('value_test')
).alias('my_struct'))
The thing is that now I need to add an extra field to my_struct called "optional". This field must be there when it exists and be absent when it is not. Sadly, values like null/None are not an option.
So far I have two different dataframes, one with the desired value and the column by id and another one without the value/column and all the information.
df_optional = df_optional.select('id','optional')
df = df.select('id','value','my_struct')
I want to add the optional value into df.my_struct for the rows where df_optional.id matches df.id, and keep the rest as-is.
Till this point I have this:
df_with_option = df.join(df_optional, on=['id'], how='inner') \
    .withColumn('my_struct', struct(
        col('id').alias('id_test'),
        col('value').alias('value_test'),
        col('optional')
    ).alias('my_struct')).drop('optional')
df_without = df.join(df_optional,on=['id'],how='leftanti') # it already have my_struct
But a union needs matching columns, so my code breaks:
df_result = df_without.unionByName(df_with_option)
I want to union both dataframes because at the end I write a json file partitioned by id:
df_result.repartitionByRange(df_result.count(),df['id']).write.format('json').mode('overwrite').save('my_path')
Those json files should have the 'optional' column when it has values, otherwise it should be out of the schema.
Any help will be appreciated.
--ADDITIONAL INFO
Schema input:
df_optional
|-- id: string (nullable = true)
|-- optional: string (nullable = true)
df
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: string (nullable = true)
Schema output:
df_result
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- my_struct: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- value: string (nullable = true)
| |-- optional: string (nullable = true) (*)
(*) Only when it exists.
--UPDATE
I think it is just not possible that way. I probably need to keep both dataframes apart and write them out twice. Something like this:
df_without.repartitionByRange(df_without.count(), df['id']).write.format('json').mode('overwrite').save('my_path')
df_with_option.repartitionByRange(df_with_option.count(), df['id']).write.format('json').mode('append').save('my_path')
Then I will have the files in my path, each written its own way.
I'd like to get the weights for the tree nodes from a saved (or unsaved) DecisionTreeClassificationModel, but I can't find anything remotely resembling that.
How does the model actually perform the classification without knowing any of those? Below are the Params that are saved with the model:
{"class":"org.apache.spark.ml.classification.DecisionTreeClassificationModel",
"timestamp":1551207582648,
"sparkVersion":"2.3.2",
"uid":"DecisionTreeClassifier_4ffc94d20f1ddb29f282",
"paramMap":{
"cacheNodeIds":false,
"maxBins":32,
"minInstancesPerNode":1,
"predictionCol":"prediction",
"minInfoGain":0.0,
"rawPredictionCol":"rawPrediction",
"featuresCol":"features",
"probabilityCol":"probability",
"checkpointInterval":10,
"seed":956191873026065186,
"impurity":"gini",
"maxMemoryInMB":256,
"maxDepth":2,
"labelCol":"indexed"
},
"numFeatures":1,
"numClasses":2
}
By using treeWeights:
treeWeights
Return the weights for each tree
New in version 1.5.0.
So, to the question:
How does the model actually perform the classification without knowing any of those?
The weights are stored, just not as part of the metadata. If you have a model
from pyspark.ml.classification import RandomForestClassificationModel
model: RandomForestClassificationModel = ...
and save it to disk
path: str = ...
model.save(path)
you'll see that the writer creates a treesMetadata subdirectory. If you load its content (the default writer uses Parquet):
import os
trees_metadata = spark.read.parquet(os.path.join(path, "treesMetadata"))
you'll see the following structure:
trees_metadata.printSchema()
root
|-- treeID: integer (nullable = true)
|-- metadata: string (nullable = true)
|-- weights: double (nullable = true)
where the weights column contains the weight of the tree identified by treeID.
Similarly, node data is stored in the data subdirectory (see for example Extract and Visualize Model Trees from Sparklyr):
spark.read.parquet(os.path.join(path, "data")).printSchema()
root
|-- id: integer (nullable = true)
|-- prediction: double (nullable = true)
|-- impurity: double (nullable = true)
|-- impurityStats: array (nullable = true)
| |-- element: double (containsNull = true)
|-- gain: double (nullable = true)
|-- leftChild: integer (nullable = true)
|-- rightChild: integer (nullable = true)
|-- split: struct (nullable = true)
| |-- featureIndex: integer (nullable = true)
| |-- leftCategoriesOrThreshold: array (nullable = true)
| | |-- element: double (containsNull = true)
| |-- numCategories: integer (nullable = true)
Equivalent information (minus tree data and tree weights) is available for DecisionTreeClassificationModel as well.
I am trying to move data from greenplum to HDFS using Spark. I can read the data successfully from the source table, and the Spark-inferred schema of the dataframe (from the greenplum table) is:
DataFrame Schema:
je_header_id: long (nullable = true)
je_line_num: long (nullable = true)
last_updated_by: decimal(15,0) (nullable = true)
last_updated_by_name: string (nullable = true)
ledger_id: long (nullable = true)
code_combination_id: long (nullable = true)
balancing_segment: string (nullable = true)
cost_center_segment: string (nullable = true)
period_name: string (nullable = true)
effective_date: timestamp (nullable = true)
status: string (nullable = true)
creation_date: timestamp (nullable = true)
created_by: decimal(15,0) (nullable = true)
entered_dr: decimal(38,20) (nullable = true)
entered_cr: decimal(38,20) (nullable = true)
entered_amount: decimal(38,20) (nullable = true)
accounted_dr: decimal(38,20) (nullable = true)
accounted_cr: decimal(38,20) (nullable = true)
accounted_amount: decimal(38,20) (nullable = true)
xx_last_update_log_id: integer (nullable = true)
source_system_name: string (nullable = true)
period_year: decimal(15,0) (nullable = true)
period_num: decimal(15,0) (nullable = true)
The corresponding schema of the Hive table is:
je_header_id:bigint|je_line_num:bigint|last_updated_by:bigint|last_updated_by_name:string|ledger_id:bigint|code_combination_id:bigint|balancing_segment:string|cost_center_segment:string|period_name:string|effective_date:timestamp|status:string|creation_date:timestamp|created_by:bigint|entered_dr:double|entered_cr:double|entered_amount:double|accounted_dr:double|accounted_cr:double|accounted_amount:double|xx_last_update_log_id:int|source_system_name:string|period_year:bigint|period_num:bigint
Using the Hive table schema mentioned above, I created a StructType using the logic below:
def convertDatatype(datatype: String): DataType = {
  val convert = datatype match {
    case "string" => StringType
    case "bigint" => LongType
    case "int" => IntegerType
    case "double" => DoubleType
    case "date" => TimestampType
    case "boolean" => BooleanType
    case "timestamp" => TimestampType
  }
  convert
}
Prepared Schema:
je_header_id: long (nullable = true)
je_line_num: long (nullable = true)
last_updated_by: long (nullable = true)
last_updated_by_name: string (nullable = true)
ledger_id: long (nullable = true)
code_combination_id: long (nullable = true)
balancing_segment: string (nullable = true)
cost_center_segment: string (nullable = true)
period_name: string (nullable = true)
effective_date: timestamp (nullable = true)
status: string (nullable = true)
creation_date: timestamp (nullable = true)
created_by: long (nullable = true)
entered_dr: double (nullable = true)
entered_cr: double (nullable = true)
entered_amount: double (nullable = true)
accounted_dr: double (nullable = true)
accounted_cr: double (nullable = true)
accounted_amount: double (nullable = true)
xx_last_update_log_id: integer (nullable = true)
source_system_name: string (nullable = true)
period_year: long (nullable = true)
period_num: long (nullable = true)
When I try to apply my newSchema to the dataframe, I get an exception:
java.lang.RuntimeException: java.math.BigDecimal is not a valid external type for schema of bigint
I understand that it is trying to convert a BigDecimal to a bigint and fails, but could anyone tell me how to cast the BigDecimal into a Spark-compatible datatype?
If not, how can I modify my logic to give the proper datatypes in the case statement for this bigint/bigdecimal problem?
From your question, it seems like you are trying to convert a BigDecimal value to bigint. A BigDecimal is a decimal with a fixed precision (the maximum number of digits) and scale (the number of digits to the right of the dot), while your values look like plain longs.
Instead of keeping the BigDecimal datatype, try casting those columns with LongType to convert the bigint values correctly, and see if this solves your problem.
Premise: I'm not in control of my cluster, and I'm working on the premise that the problem is in my code, not the setup my school is using. Maybe that's wrong, but it's an assumption underlying this question.
Why does write.csv() cause my pyspark/slurm job to exceed memory limits, when many previous operations on larger versions of the data have succeeded, and what can I do about it?
The error I'm getting is (many iterations of...):
18/06/02 16:13:41 ERROR YarnScheduler: Lost executor 21 on server.name.edu: Container killed by YARN for exceeding memory limits. 7.0 GB of 7 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I know I can change the memory limits, but I've already increased them several times with no change in outcome, and I'm pretty convinced I shouldn't be using anywhere near this amount of memory anyway. For reference, my slurm call is:
spark-submit \
--master yarn \
--num-executors 100 \
--executor-memory 6g \
3main.py
So what exactly am I trying to write? Well, I've read a 39 GB .bz2 JSON file into a dataframe,
allposts = ss.read.json(filename)
filtered a bunch, counted words, grouped the data, done some calculations, filtered more, and in the end I have these two print statements to give an idea of what's left...
abscounts = calculatePosts2(postRDD, sc, spark)
abscounts.printSchema()
print(abscounts.count())
These print statements work (output below). The resulting dataframe is about 60 columns by roughly 2000 rows. Those 60 columns include one string the length of a subreddit name, and 59 doubles.
root
|-- subreddit: string (nullable = true)
|-- count(1): long (nullable = false)
|-- sum(wordcount): long (nullable = true)
|-- ingestfreq: double (nullable = true)
|-- causefreq: double (nullable = true)
|-- insightfreq: double (nullable = true)
|-- cogmechfreq: double (nullable = true)
|-- sadfreq: double (nullable = true)
|-- inhibfreq: double (nullable = true)
|-- certainfreq: double (nullable = true)
|-- tentatfreq: double (nullable = true)
|-- discrepfreq: double (nullable = true)
|-- spacefreq: double (nullable = true)
|-- timefreq: double (nullable = true)
|-- exclfreq: double (nullable = true)
|-- inclfreq: double (nullable = true)
|-- relativfreq: double (nullable = true)
|-- motionfreq: double (nullable = true)
|-- quantfreq: double (nullable = true)
|-- numberfreq: double (nullable = true)
|-- swearfreq: double (nullable = true)
|-- functfreq: double (nullable = true)
|-- absolutistfreq: double (nullable = true)
|-- ppronfreq: double (nullable = true)
|-- pronounfreq: double (nullable = true)
|-- wefreq: double (nullable = true)
|-- ifreq: double (nullable = true)
|-- shehefreq: double (nullable = true)
|-- youfreq: double (nullable = true)
|-- ipronfreq: double (nullable = true)
|-- theyfreq: double (nullable = true)
|-- deathfreq: double (nullable = true)
|-- biofreq: double (nullable = true)
|-- bodyfreq: double (nullable = true)
|-- hearfreq: double (nullable = true)
|-- feelfreq: double (nullable = true)
|-- perceptfreq: double (nullable = true)
|-- seefreq: double (nullable = true)
|-- fillerfreq: double (nullable = true)
|-- healthfreq: double (nullable = true)
|-- sexualfreq: double (nullable = true)
|-- socialfreq: double (nullable = true)
|-- familyfreq: double (nullable = true)
|-- friendfreq: double (nullable = true)
|-- humansfreq: double (nullable = true)
|-- affectfreq: double (nullable = true)
|-- posemofreq: double (nullable = true)
|-- negemofreq: double (nullable = true)
|-- anxfreq: double (nullable = true)
|-- angerfreq: double (nullable = true)
|-- assentfreq: double (nullable = true)
|-- nonflfreq: double (nullable = true)
|-- verbfreq: double (nullable = true)
|-- articlefreq: double (nullable = true)
|-- pastfreq: double (nullable = true)
|-- auxverbfreq: double (nullable = true)
|-- futurefreq: double (nullable = true)
|-- presentfreq: double (nullable = true)
|-- prepsfreq: double (nullable = true)
|-- adverbfreq: double (nullable = true)
|-- negatefreq: double (nullable = true)
|-- conjfreq: double (nullable = true)
|-- homefreq: double (nullable = true)
|-- leisurefreq: double (nullable = true)
|-- achievefreq: double (nullable = true)
|-- workfreq: double (nullable = true)
|-- religfreq: double (nullable = true)
|-- moneyfreq: double (nullable = true)
...
2026
After that the only remaining line in my code is:
abscounts.write.csv('bigoutput.csv', header=True)
And this crashes with memory errors. It absolutely should not take gigabytes of memory to write a table this small... What am I doing wrong here?
Thanks for your help.
If you're curious/it helps, the entirety of my code is on github
First of all, executor.memoryOverhead is not the same as executor-memory, as you can see here.
With PySpark, memoryOverhead is important because it controls the extra memory that may be needed by Python to perform some actions (see here); in your case, collecting and saving a CSV file per partition.
To help Python, you may also consider using coalesce before writing.