I'm trying to do a very simple LinearRegression in PySpark using a housing data set I found on Kaggle. There are a bunch of columns, but to keep this as simple as possible I'm retaining just two of them (after starting with all of them), and I still have no luck getting a model trained. Here is what the data frame looks like before going through the regression step:
2016-09-07 17:12:08,804 root INFO [Row(price=78000.0, sqft_living=780.0, sqft_lot=16344.0, features=DenseVector([780.0, 16344.0])), Row(price=80000.0, sqft_living=430.0, sqft_lot=5050.0, features=DenseVector([430.0, 5050.0])), Row(price=81000.0, sqft_living=730.0, sqft_lot=9975.0, features=DenseVector([730.0, 9975.0])), Row(price=82000.0, sqft_living=860.0, sqft_lot=10426.0, features=DenseVector([860.0, 10426.0])), Row(price=84000.0, sqft_living=700.0, sqft_lot=20130.0, features=DenseVector([700.0, 20130.0])), Row(price=85000.0, sqft_living=830.0, sqft_lot=9000.0, features=DenseVector([830.0, 9000.0])), Row(price=85000.0, sqft_living=910.0, sqft_lot=9753.0, features=DenseVector([910.0, 9753.0])), Row(price=86500.0, sqft_living=840.0, sqft_lot=9480.0, features=DenseVector([840.0, 9480.0])), Row(price=89000.0, sqft_living=900.0, sqft_lot=4750.0, features=DenseVector([900.0, 4750.0])), Row(price=89950.0, sqft_living=570.0, sqft_lot=4080.0, features=DenseVector([570.0, 4080.0]))]
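For reference, the features column above was built by assembling the two retained columns with something along the lines of a VectorAssembler step (a rough sketch of that preprocessing; raw_data stands in for the original DataFrame):
from pyspark.ml.feature import VectorAssembler

# Assemble the two retained numeric columns into the 'features' vector column
# (column names match the rows shown above; raw_data is a placeholder name).
assembler = VectorAssembler(inputCols=['sqft_living', 'sqft_lot'],
                            outputCol='features')
data = assembler.transform(raw_data)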
I'm using the following code to train the model:
standard_scaler = StandardScaler(inputCol='features',
                                 outputCol='scaled')
lr = LinearRegression(featuresCol=standard_scaler.getOutputCol(), labelCol='price', weightCol=None,
                      maxIter=100, tol=1e-4)
pipeline = Pipeline(stages=[standard_scaler, lr])
grid = (ParamGridBuilder()
        .baseOn({lr.labelCol: 'price'})
        .addGrid(lr.regParam, [0.1, 1.0])
        .addGrid(lr.elasticNetParam, elastic_net_params or [0.0, 1.0])
        .build())
ev = RegressionEvaluator(metricName="rmse", labelCol='price')
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=grid,
                    evaluator=ev,
                    numFolds=5)
model = cv.fit(data).bestModel
The error I'm getting is:
2016-09-07 17:12:08,805 root INFO Training regression model...
2016-09-07 17:12:09,530 root ERROR An error occurred while calling o60.fit.
: java.lang.NullPointerException
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:164)
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:70)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
Any thoughts?
You cannot use a Pipeline in this case. When you call pipeline.fit it translates to (roughly)
standard_scaler_model = standard_scaler.fit(dataframe)
lr_model = lr.fit(dataframe)
But you actually need
standard_scaler_model = standard_scaler.fit(dataframe)
dataframe = standard_scaler_model.transform(dataframe)
lr_model = lr.fit(dataframe)
The error is because your lr.fit cannot find the output (i.e., the result of the transformation) of your StandardScaler model.
Related
I'm trying to run a GLM with the Poisson family and a log link function and am getting the following error:
2022-01-11 15:56:55,143 root ERROR An error occurred while calling o266.fit.
: java.lang.NullPointerException
at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:51)
at scala.collection.immutable.StringOps.length(StringOps.scala:51)
at scala.collection.IndexedSeqOptimized.isEmpty(IndexedSeqOptimized.scala:30)
at scala.collection.IndexedSeqOptimized.isEmpty$(IndexedSeqOptimized.scala:30)
at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:33)
at scala.collection.TraversableOnce.nonEmpty(TraversableOnce.scala:114)
at scala.collection.TraversableOnce.nonEmpty$(TraversableOnce.scala:114)
at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:33)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema(Predictor.scala:58)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema$(Predictor.scala:47)
at org.apache.spark.ml.regression.GeneralizedLinearRegression.org$apache$spark$ml$regression$GeneralizedLinearRegressionBase$$super$validateAndTransformSchema(GeneralizedLinearRegression.scala:246)
at org.apache.spark.ml.regression.GeneralizedLinearRegressionBase.validateAndTransformSchema(GeneralizedLinearRegression.scala:213)
at org.apache.spark.ml.regression.GeneralizedLinearRegressionBase.validateAndTransformSchema$(GeneralizedLinearRegression.scala:188)
at org.apache.spark.ml.regression.GeneralizedLinearRegression.validateAndTransformSchema(GeneralizedLinearRegression.scala:246)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:177)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:133)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:115)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
This is how I'm building the GLM:
data = getattr(self, data_name)
train, test = data.randomSplit(
    [train_fraction, 1 - train_fraction], seed=42)
self.store_metadata({"training_samples": train.count()})
self.store_metadata({"test_samples": test.count()})
std_scaler = StandardScaler(inputCol=feature_col,
                            outputCol="scaled_features")
glm = GeneralizedLinearRegression(labelCol=label_col,
                                  featuresCol="scaled_features",
                                  offsetCol=offset_col,
                                  family=family,
                                  link=link,
                                  maxIter=max_iter,
                                  tol=tol,
                                  weightCol=weight_col)
pipeline = Pipeline(stages=[std_scaler, glm])
logging.info(pipeline)
grid = (ParamGridBuilder()
        .addGrid(glm.regParam, reg_params or [1, 0.1])
        .build())
ev = RegressionEvaluator(labelCol=label_col,
                         metricName=metric_name)
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=ev,
                    numFolds=k_folds, seed=seed)
logging.debug(cv.explainParams())
cv_model = cv.fit(train)
The dataset is insurance claim frequencies: offsetCol is the exposure, a floating-point number of months, and the label is the number of claims in that period. The features are a 100-dimensional vector of floats. Any suggestions on what to look into here?
It turns out I ran into this issue with Spark 5 years ago and forgot about it. In fact, I even posted a bug report that was never fixed (because supposedly it is the "correct" behavior). The issue is that passing weightCol=None causes an error, whereas simply leaving the argument out is fine, even though the default according to the API docs is weightCol=None. Go figure. For reference, here is the bug report I filed about this in 2016:
https://issues.apache.org/jira/browse/SPARK-17508
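One way to work around it is to only pass those parameters when they are actually set, e.g. by building the keyword arguments conditionally. A rough sketch using the variable names from the GLM snippet above:
# Only include weightCol/offsetCol in the constructor call when they are set;
# passing weightCol=None explicitly is what triggers the NullPointerException.
glm_kwargs = dict(labelCol=label_col,
                  featuresCol="scaled_features",
                  family=family,
                  link=link,
                  maxIter=max_iter,
                  tol=tol)
if weight_col is not None:
    glm_kwargs["weightCol"] = weight_col
if offset_col is not None:
    glm_kwargs["offsetCol"] = offset_col

glm = GeneralizedLinearRegression(**glm_kwargs)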
I have two dataframes with exactly the same join keys.
Spark config:
spark_config = {
    'spark.driver.memory': '1g',
    'spark.driver.maxResultSize': '1g',
    'spark.executor.memory': '1g',
    'spark.executor.memoryOverhead': '1g',
    'spark.kryoserializer.buffer.max': 1024,
    'spark.executor.instances': 2,
    'spark.executor.cores': 1,
    ...
}
The inner join is fine:
df1 = spark.read.parquet('/data/pipeline-workspace/91_active/3d2e21c0-3d0f-11ec-a384-b9dfba307662/segment')
df2 = spark.read.parquet('/data/pipeline-workspace/91_active/b0fe56e6-01d8-4e18-a8a8-d53e4e363fe8/daily_merged2')
df3 = df1.drop('__uuid__', '__timestamp__').join(df2, on="__label_id__", how='inner')
df3.write.parquet('/data/test', mode='overwrite')
Counts:
>>> df1.count()
102301
>>> df2.count()
102301
>>> df3.count()
102301
But the left join gets an OOM:
df3 = df1.drop('__uuid__', '__timestamp__').join(df2, on="__label_id__", how='left')
df3.write.parquet('/data/test', mode='overwrite')
logs:
Py4JJavaError: An error occurred while calling o374.parquet.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:202)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:136)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:160)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:157)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:696)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:696)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:305)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:291)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:249)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:586)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:122)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withExecutionId$1.apply(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:100)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:75)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Setting spark.sql.autoBroadcastJoinThreshold=-1 does solve the OOM, but it doesn't feel like the right fix: we still want small tables to be auto-broadcast, so I'd rather not disable the feature entirely.
I tried df2.collect() and got an OOM too,
so the problem becomes:
why does Spark think the size of df2 is below the 10 MB broadcast threshold?
Can I mark df2 with some kind of "do not auto-broadcast" tag so that I don't need to set spark.sql.autoBroadcastJoinThreshold=-1?
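Not a full answer to why Spark underestimates df2, but you can at least confirm what it is planning, and (if you are on Spark 3.0+) steer this one join away from broadcasting without touching the global threshold. A rough sketch:
# Inspect the physical plan: if it shows BroadcastHashJoin / BroadcastExchange,
# Spark has decided df2 fits under spark.sql.autoBroadcastJoinThreshold.
df3 = df1.drop('__uuid__', '__timestamp__').join(df2, on="__label_id__", how='left')
df3.explain()

# On Spark 3.0+ a join strategy hint forces a sort-merge join for just this join,
# leaving auto-broadcast enabled everywhere else.
df3 = df1.drop('__uuid__', '__timestamp__') \
         .join(df2.hint("merge"), on="__label_id__", how='left')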
So, for a personal project I tried using this PySpark SMOTE Implementation to handle an imbalanced flight dataset that looks like this:
CRS_DEP_TIME | DEP_TIME | CRS_ARR_TIME | DISTANCE | YEAR_IDX | DAY_OF_WEEK_IDX | DEST_IDX | ORIGIN_IDX | MONTH_OF_YEAR_IDX | OP_CARRIER_IDX | y
         715 |      709 |         1014 |      630 |        2 |               1 |       15 |         30 |                 0 |              1 | 0
        2141 |     2153 |         2234 |      630 |        2 |               1 |       30 |         15 |                 0 |              1 | 1
I got the data from this Kaggle dataset and, after the split, the training data is around 2.1 GB (roughly 34M rows) with probably around a 1:7 class imbalance. I know it is not that big of an imbalance, but I just want to try the implementation.
The main goal is to make a classification model that can predict whether a flight will arrive late or not, with 1 as late and 0 as not in the label column y.
I've tried running this on Google Colab. Note that both the smote and pre_smote_df_process functions are the ones from the PySpark SMOTE Implementation.
class SmoteConfig:
    def __init__(self, seed, bucketLength, k, multiplier):
        self.seed = seed
        self.bucketLength = bucketLength
        self.k = k
        self.multiplier = multiplier

config = SmoteConfig(76, 200, 3, 2)

train_smoted = smote(pre_smote_df_process(train,
                                          ['CRS_DEP_TIME', 'DEP_TIME', 'CRS_ARR_TIME', 'DISTANCE'],
                                          ['YEAR_IDX', 'DAY_OF_WEEK_IDX', 'DEST_IDX', 'ORIGIN_IDX', 'MONTH_OF_YEAR_IDX', 'OP_CARRIER_IDX'],
                                          'y',
                                          index_suffix="_index"),
                     config)
train_smoted.show()
When I tried using a fraction of the data (up to 1%) it works within a reasonable time. However, when I ran it on 10% of the data it took more than 14 hours to show the result, and on 100% of the data the following error came up.
Py4JJavaError Traceback (most recent call last)
<ipython-input-14-6919ba1953f2> in <module>()
----> 1 disassembled.coalesce(1).write.format('com.databricks.spark.csv').save("/content/drive/MyDrive/Tugas/Tugas Akhir/Eksperimen/lsh_smoted_18_05_21.csv",header = 'true')
3 frames
/usr/local/lib/python3.7/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o12989.save.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:293)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 28 in stage 745.0 failed 1 times, most recent failure: Lost task 28.0 in stage 745.0 (TID 34904) (55fc2b589d8c executor driver): java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.writeBytes(Native Method)
at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at net.jpountz.lz4.LZ4BlockOutputStream.finish(LZ4BlockOutputStream.java:263)
at net.jpountz.lz4.LZ4BlockOutputStream.close(LZ4BlockOutputStream.java:193)
at java.base/java.io.FilterOutputStream.close(FilterOutputStream.java:188)
at java.base/java.io.FilterOutputStream.close(FilterOutputStream.java:188)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$1.close(UnsafeRowSerializer.scala:96)
at org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:175)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:163)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Suppressed: java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.writeBytes(Native Method)
at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at net.jpountz.lz4.LZ4BlockOutputStream.flush(LZ4BlockOutputStream.java:243)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:143)
at java.base/java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$1.flush(UnsafeRowSerializer.scala:91)
at org.apache.spark.storage.DiskBlockObjectWriter.commitAndGet(DiskBlockObjectWriter.scala:173)
at org.apache.spark.storage.DiskBlockObjectWriter.$anonfun$close$1(DiskBlockObjectWriter.scala:156)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.storage.DiskBlockObjectWriter.close(DiskBlockObjectWriter.scala:158)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:164)
... 10 more
Suppressed: java.io.IOException: No space left on device
at java.base/java.io.FileOutputStream.writeBytes(Native Method)
at java.base/java.io.FileOutputStream.write(FileOutputStream.java:354)
at org.apache.spark.storage.TimeTrackingOutputStream.write(TimeTrackingOutputStream.java:59)
at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)
at java.base/java.io.BufferedOutputStream.flush(BufferedOutputStream.java:142)
at java.base/java.io.FilterOutputStream.close(FilterOutputStream.java:182)
at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.org$apache$spark$storage$DiskBlockObjectWriter$ManualCloseOutputStream$$super$close(DiskBlockObjectWriter.scala:108)
at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseOutputStream.manualClose(DiskBlockObjectWriter.scala:65)
at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseOutputStream.manualClose$(DiskBlockObjectWriter.scala:64)
at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.manualClose(DiskBlockObjectWriter.scala:108)
at org.apache.spark.storage.DiskBlockObjectWriter.$anonfun$closeResources$1(DiskBlockObjectWriter.scala:135)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.storage.DiskBlockObjectWriter.closeResources(DiskBlockObjectWriter.scala:136)
at org.apache.spark.storage.DiskBlockObjectWriter.$anonfun$close$2(DiskBlockObjectWriter.scala:158)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1448)
... 12 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
at org.apache.spark.SparkContext.runJob(SparkContex
On the resource monitor, disk usage constantly creeps up until it is full, at which point the error above occurs.
Running it locally shows the same error. Is there anything I'm doing wrong? Is the code unoptimized? Or might there be something wrong with the data? Any insight will be much appreciated. Thanks.
Since you say it works fine when you use a fraction of the data, and considering the error I see here, it seems like a memory problem. I haven't used this code; I used SMOTE() from the imblearn library and that works fine with my PySpark project (maybe consider that too). But for this kind of error, increasing memory worked for me.
You can increase the memory in this way:
I used Google Colab like you, and when I created the SparkSession instance I changed it to:
spark = SparkSession.builder.master("local") \
    .appName("your_project_Name") \
    .config("spark.ui.port", '4050') \
    .config("spark.driver.memory", "32G") \
    .config("spark.executor.memory", "32G") \
    .getOrCreate()
It worked for me. I hope it helps you and others too.
But more generally, if the data is still too big, and even if you don't want to rely on undersampling alone, you can combine undersampling with other methods (basic oversampling or others); that way you can reduce some of the memory problems, as in the sketch below.
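For example, a rough sketch of undersampling the majority class before any oversampling (assuming here that y == 1 is the minority class and that keeping roughly twice as many majority rows as minority rows is acceptable; adjust to taste):
from pyspark.sql import functions as F

# Split the training data by class; y == 1 (late) is assumed to be the minority.
majority = train.filter(F.col("y") == 0)
minority = train.filter(F.col("y") == 1)

# Downsample the majority class to roughly 2x the minority size,
# then hand the smaller, more balanced DataFrame to SMOTE or plain oversampling.
ratio = minority.count() / majority.count()
majority_down = majority.sample(withReplacement=False,
                                fraction=min(1.0, 2 * ratio),
                                seed=76)
train_balanced = majority_down.unionByName(minority)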
I was trying K-means from Spark ML on some scaled features, which are a column of a Spark DataFrame. It works as expected. However, as soon as I apply PCA on the same feature set and try K-means on that, I get the error
java.lang.IllegalArgumentException: requirement failed
My original data before PCA has 108 features, which have been indexed and assembled. K-means runs fine on this and there are no nulls/NaNs. After this, I set the number of principal components to 6 and run PCA. Running K-means on the PCA output gives the above error.
A look at the stack trace shows that the error is in fastSquaredDistance in MLUtils:
.......at scala.Predef$.require(Predef.scala:212)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:486)
at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:589)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:563)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:557)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:557)
at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:580)
at org.apache.spark.mllib.clustering.KMeansModel$$anonfun$computeCost$1.apply(KMeansModel.scala:88)
at org.apache.spark.mllib.clustering.KMeansModel$$anonfun$computeCost$1.apply(KMeansModel.scala:88)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.fold(TraversableOnce.scala:212)
at scala.collection.AbstractIterator.fold(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1087)
at org.apache.spark.rdd.RDD$$anonfun$fold$1$$anonfun$20.apply(RDD.scala:1087)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2120)
at org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:2120)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This throws an exception when the dimensions of the two vectors are not the same, or when a vector norm is negative.
I checked the size of all the vectors in the PCA column and all of them have size 6 (as expected). As for a negative norm, I can't figure out how that's even possible.
Does anyone have any leads on what could be happening?
If my question isn't formatted well, I apologize; it's my first time asking a question here.
The code that's throwing the error is below
....import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans().setK(6).setSeed(42L)
val model = kmeans.fit(scaledDataPcaNN).setFeaturesCol("pcaFeatures")
val predictions = model.transform(scaledDataPca)
val WSSSE = model.computeCost(scaledDataPca)
Can someone help me convert the data from a Cassandra row to a Vector using Java?
The code snippet where it's failing is below.
I am reading data from a Cassandra table and trying to create a RandomForestClassifier model as below.
JavaRDD<VibrationData1> vibrationReadRdd = CassandraJavaUtil
        .javaFunctions(sc)
        .cassandraTable("vibration_ks", "vibration_data_feature",
                CassandraJavaUtil.mapRowTo(VibrationData1.class))
        .select("classifier", "features");
System.out.println("vibrationRdd.count() --------- "
        + vibrationReadRdd.count());

SQLContext sqlContext = new SQLContext(sc);
DataFrame data = sqlContext.createDataFrame(vibrationReadRdd,
        VibrationData1.class);

StringIndexerModel labelIndexer = new StringIndexer()
        .setInputCol("classifier").setOutputCol("indexedLabel")
        .fit(data);

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as
// continuous.
VectorIndexerModel featureIndexer = new VectorIndexer()
        .setInputCol("features").setOutputCol("indexedFeatures")
        .setMaxCategories(4).fit(data);
The error is observed at this line:
VectorIndexerModel featureIndexer = new VectorIndexer().setInputCol("features"). setOutputCol("indexedFeatures").setMaxCategories(4).fit(data);
The error message is below:
17/01/18 13:15:49 INFO DAGScheduler: Job 2 finished: countByValue at StringIndexer.scala:86, took 3.331207 s
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.mllib.linalg.VectorUDT#f71b0bce but was actually StringType.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:134)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:68)
at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:112)
at com.edge.database.test.VibrationDBUtil.createSchema(VibrationDBUtil.java:161)
at com.edge.database.test.VibrationDBUtil.run(VibrationDBUtil.java:95)
at com.edge.database.test.VibrationDBUtil.main(VibrationDBUtil.java:217)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The Cassandra table is as below:
cqlsh:vibration_ks> select * from vibration_data_feature;
uid | classifier | features
-----+------------+-----------------------------------------
15 | 0 | 10:0.0092 11:0.0093 20:0.0094 21:0.0095
17 | 1 | 10:0.192 11:0.3093 20:0.8094 21:0.9095
Thanks in advance.