Jupyter notebook on spark standalone cluster cache memory issue - apache-spark

My code was working fine earlier, but now it is showing a cache memory issue. My program involves DataFrame loading, transformation, and processing, running in a Jupyter notebook connected to the PySpark shell.
I don't understand what the main issue is or how to tackle it. Any help is highly appreciated.
My code is:
import time
start = time.time()

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('spark://172.16.12.200:7077').appName('new').getOrCreate()

# Load the reviews and keep only the columns needed for ALS
ndf = spark.read.json("Musical_Instruments.json")
pd = ndf.select(ndf['asin'], ndf['overall'], ndf['reviewerID'])

spark.sparkContext.setCheckpointDir("/home/npproject/jupyter_files/checkpoints")

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder
from pyspark.ml.feature import StringIndexer
from pyspark.ml import Pipeline
from pyspark.sql.functions import col

# Index every string column except the rating column 'overall'
indexer = [StringIndexer(inputCol=column, outputCol=column + "_index")
           for column in list(set(pd.columns) - set(['overall']))]
pipeline = Pipeline(stages=indexer)
transformed = pipeline.fit(pd).transform(pd)

(training, test) = transformed.randomSplit([0.8, 0.2])

als = ALS(maxIter=5, regParam=0.09, rank=25,
          userCol="reviewerID_index", itemCol="asin_index", ratingCol="overall",
          coldStartStrategy="drop", nonnegative=True)
model = als.fit(training)

evaluator = RegressionEvaluator(metricName="rmse", labelCol="overall", predictionCol="prediction")
predictions = model.transform(test)
rmse = evaluator.evaluate(predictions)
print("RMSE=" + str(rmse))
print("Rank: ", model.rank)
print("MaxIter: ", model._java_obj.parent().getMaxIter())
print("RegParam: ", model._java_obj.parent().getRegParam())

user_recs = model.recommendForAllUsers(10).show(20)

end = time.time()
print("execution time", end - start)
The error is:
Py4JJavaError: An error occurred while calling o40.json.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 5, 172.16.12.208, executor 1): java.io.FileNotFoundException: File file:/home/npproject/jupyter_files /Musical_Instruments.json does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
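For what it's worth, here is a minimal sketch of the two remedies the message itself mentions, plus a note on the path; the table name below is hypothetical, not something from the original code:

# Recreate the DataFrame so Spark re-lists the underlying files
ndf = spark.read.json("Musical_Instruments.json")

# Or invalidate Spark's cached file listing for the path explicitly
spark.catalog.refreshByPath("/home/npproject/jupyter_files/Musical_Instruments.json")

# If the data had been registered as a table, the SQL form would be
# (hypothetical table name):
spark.sql("REFRESH TABLE musical_instruments")

# Note: the FileNotFoundException is raised on executor 172.16.12.208, so on a
# standalone cluster a plain local path must also exist at the same location on
# every worker node (or be replaced with a shared HDFS/S3 path).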

Related

SparkContext conflict with spark udf

Good morning
When running:
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
class ETL:
    def addone(x):
        return x + 1

    def job_run():
        df = spark.sql('SELECT 1 one').withColumn('AddOne', udf_addone(F.col('one')))
        df.show()

if (__name__ == '__main__'):
    udf_addone = F.udf(lambda x: ETL.addone(x), returnType=IntegerType())
    ETL.job_run()
I get the following error message:
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
I have reviewed the answers given at "ERROR: SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063" and at "Spark: Broadcast variables: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation", with no success. I'd like to stick to using a Spark UDF in my script.
Any help on this is appreciated.
Many thanks!
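A minimal sketch of one commonly suggested workaround, assuming the problem is that the lambda's closure drags the ETL class (and, through it, the SparkSession/SparkContext) into the serialized UDF: keep the UDF's Python function a plain module-level function that references neither the class nor the session.

from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def addone(x):
    # plain function: nothing in its closure refers to the SparkSession
    return x + 1

udf_addone = F.udf(addone, returnType=IntegerType())

def job_run():
    df = spark.sql('SELECT 1 one').withColumn('AddOne', udf_addone(F.col('one')))
    df.show()

if __name__ == '__main__':
    job_run()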

Erratic occurrence of "Container killed by YARN for exceeding memory limits."

ErrorMessage': 'An error occurred while calling o103.pyWriteDynamicFrame.
Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most
recent failure: Lost task 0.3 in stage 5.0
(TID 131, ip-1-2-3-4.eu-central-1.compute.internal, executor 20):
ExecutorLostFailure (executor 20 exited caused by one of the running tasks)
Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of
5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead or disabling
yarn.nodemanager.vmem-check-enabled because of YARN-4714.
The job is doing this (pseudo code):
- Reads the CSV into a DynamicFrame dynf
- dynf.toDF().repartition(100)
- Map.apply(dynf, tf)  # tf being a function applied to every row
- dynf.toDF().coalesce(10)
- Writes dynf as parquet to S3
This job has been executed dozens of times with an identical Glue setup (Standard worker with MaxCapacity of 10.0) and succeeded, and re-running it on a CSV it failed on is usually successful without any adjustments. Meaning: it works. Not just that, the job has even run successfully with much larger CSVs than the ones it failed on.
That's what I mean by erratic: I don't see a pattern like "if the CSV is larger than X, I need more workers".
Does somebody have an idea what might be the cause of this error, which occurs somewhat randomly?
The relevant part of the code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

# s3://bucket/path/object
args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'SOURCE_BUCKET',  # "bucket"
    'SOURCE_PATH',    # "path/"
    'OBJECT_NAME',    # "object"
    'TARGET_BUCKET',  # "bucket"
    'TARGET_PATH',    # "path/"
    'PARTS_LOAD',
    'PARTS_SAVE'
])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

data_DYN = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": [
            "s3://{sb}/{sp}{on}".format(
                sb=args['SOURCE_BUCKET'],
                sp=args['SOURCE_PATH'],
                on=args['OBJECT_NAME']
            )
        ]
    },
    format_options={
        "withHeader": True,
        "separator": ","
    }
)

data_DF = data_DYN.toDF().repartition(int(args["PARTS_LOAD"]))
data_DYN = DynamicFrame.fromDF(data_DF, glueContext, "data_DYN")

def tf(rec):
    # functions applied to elements of rec
    return rec

data_DYN_2 = Map.apply(data_DYN, tf)

cols = [
    'col1', 'col2', ...
]

data_DYN_3 = SelectFields.apply(data_DYN_2, cols)
data_DF_3 = data_DYN_3.toDF().cache()
data_DF_4 = data_DF_3.coalesce(int(args["PARTS_SAVE"]))
data_DYN_4 = DynamicFrame.fromDF(data_DF_4, glueContext, "data_DYN_4")

datasink = glueContext.write_dynamic_frame.from_options(
    frame=data_DYN_4,
    connection_type="s3",
    connection_options={
        "path": "s3://{tb}/{tp}".format(tb=args['TARGET_BUCKET'], tp=args['TARGET_PATH']),
        "partitionKeys": ["col_x", "col_y"]
    },
    format="parquet",
    transformation_ctx="datasink"
)

job.commit()
I would suspect .coalesce(10) to be the culprit, due to the 100 -> 10 reduction in the number of partitions without rebalancing the data across them. Doing .repartition(10) instead might fix it, at the expense of an extra shuffle.
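A sketch of that change against the job code above (same partition-count argument, but repartition performs a full shuffle that rebalances the rows):

# instead of data_DF_3.coalesce(int(args["PARTS_SAVE"]))
data_DF_4 = data_DF_3.repartition(int(args["PARTS_SAVE"]))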

Error while using dataframe show method in pyspark

I am trying to read data from BigQuery using pandas and pyspark. I am able to get the data, but I get the error below while converting it into a Spark DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o28.showString.
: java.lang.IllegalStateException: Could not find TLS ALPN provider; no working netty-tcnative, Conscrypt, or Jetty NPN/ALPN available
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.defaultSslProvider(GrpcSslContexts.java:258)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.configure(GrpcSslContexts.java:171)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.GrpcSslContexts.forClient(GrpcSslContexts.java:120)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.buildTransportFactory(NettyChannelBuilder.java:401)
at com.google.cloud.spark.bigquery.repackaged.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:444)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel(InstantiatingGrpcChannelProvider.java:223)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel(InstantiatingGrpcChannelProvider.java:169)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel(InstantiatingGrpcChannelProvider.java:156)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.rpc.ClientContext.create(ClientContext.java:157)
The environment details are as follows:
Python version : 3.7
Spark version : 2.4.3
Java version : 1.8
The code is as follows:
import google.auth
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession , SQLContext
from google.cloud import bigquery
# Currently this only supports queries which have at least 10 MB of results
QUERY = """ SELECT * FROM test limit 1 """
#spark = SparkSession.builder.appName('Query Results').getOrCreate()
sc = pyspark.SparkContext()
bq = bigquery.Client()
print('Querying BigQuery')
project_id = ''
query_job = bq.query(QUERY,project=project_id)
# Wait for query execution
query_job.result()
df = SQLContext(sc).read.format('bigquery') \
    .option('dataset', query_job.destination.dataset_id) \
    .option('table', query_job.destination.table_id) \
    .option("type", "direct") \
    .load()
df.show()
I am looking for some help to solve this issue.
I managed to find a better solution by referencing this link; below is my working code.
Install the pandas_gbq package (e.g. pip install pandas-gbq) before running the code below.
import pandas_gbq
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
project_id = "<your-project-id>"
query = """ SELECT * from testSchema.testTable"""
athletes = pandas_gbq.read_gbq(query=query, project_id=project_id,dialect = 'standard')
# Get a reference to the Spark Session
sc = SparkContext()
spark = SparkSession(sc)
# convert from Pandas to Spark
sparkDF = spark.createDataFrame(athletes)
# perform an operation on the DataFrame
print(sparkDF.count())
sparkDF.show()
Hope it helps someone! Keep pysparking :)

PCA in Spark 2.3.0 Examples with PySpark

I have a Spark dataframe I would like to use to run a simple PCA example. I have looked at this example and noticed that it works because they transpose the features into vectors:
from pyspark.ml.linalg import Vectors
>>> data = [(Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
... (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
... (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
>>> df = spark.createDataFrame(data,["features"])
>>> pca = PCA(k=2, inputCol="features", outputCol="pca_features")
I am trying to reproduce the same kind of simple PCA using a Spark DataFrame I have created myself. How would I transform my Spark DataFrame into a similar form, so I can run it with one input column and one output column?
I looked into using RowMatrix as shown here, but I am not sure whether this is the way to go (see the error below).
>>> from pyspark.mllib.linalg import Vectors
>>> from pyspark.mllib.linalg.distributed import RowMatrix
>>> from pyspark.ml.feature import PCA
>>> master = pd.read_parquet('master.parquet', engine='fastparquet')
>>> A = sc.parallelize(master)
>>> mat = RowMatrix(A)
>>> pc = mat.computePrincipalComponents(4)
Py4JJavaError: An error occurred while calling
o382.computePrincipalComponents. : org.apache.spark.SparkException:
Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times,
most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost,
executor driver): org.apache.spark.api.python.PythonException:
Traceback (most recent call last)
In PySpark's ML library you need to combine all the features into a single feature vector.
You can do that with a VectorAssembler:
https://spark.apache.org/docs/latest/ml-features.html#vectorassembler
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=inputColumnsList, outputCol='features')
output = assembler.transform(df)
where inputColumnsList is a list of all the feature columns you want to use.
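A minimal end-to-end sketch combining the two steps; the column names in input_cols are placeholders for your own numeric feature columns, and df is the Spark DataFrame you created:

from pyspark.ml.feature import VectorAssembler, PCA

input_cols = ["col1", "col2", "col3", "col4"]  # placeholder feature columns

# 1. Assemble the individual feature columns into a single vector column
assembler = VectorAssembler(inputCols=input_cols, outputCol="features")
assembled = assembler.transform(df)

# 2. Run PCA on the assembled vector column
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
model = pca.fit(assembled)
model.transform(assembled).select("pca_features").show(truncate=False)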

Simple PySpark regression fails because of scala.MatchError on Spark 2.0? [duplicate]

I am getting a scala.MatchError when using a ParamGridBuilder in Spark 1.6.1 and 2.0
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.fitIntercept)
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()
The error is:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 57.0 failed 1 times, most recent failure: Lost task 0.0 in stage 57.0 (TID 257, localhost):
scala.MatchError: [280000,1.0,[2400.0,9373.0,3.0,1.0,1.0,0.0,0.0,0.0]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Full code
The question is how I should use ParamGridBuilder in this case.
The problem here is the input schema, not ParamGridBuilder. The price column is loaded as an integer, while LinearRegression expects a double. You can fix it by explicitly casting the column to the required type:
val houses = sqlContext.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(...)
  .withColumn("price", $"price".cast("double"))
