PySpark Job with Dataproc on GCP

I'm trying to run a PySpark job, but it keeps failing with this message:
Google Cloud Dataproc Agent reports job failure. If logs are available, they can be found at: https://console.cloud.google.com/dataproc/jobs/f8f8e95794e0457d80ea1b0c4df8d815?project=long-state-352923&region=us-central1 gcloud dataproc jobs wait 'f8f8e95794e0457d80ea1b0c4df8d815' --region 'us-central1' --project 'long-state-352923' ...
Here is the code for the job:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('spark_hdfs_to_hdfs') \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("WARN")

MASTER_NODE_INSTANCE_NAME = "cluster-d687-m"

log_files_rdd = sc.textFile('hdfs://{}/data/logs_example/*'.format(MASTER_NODE_INSTANCE_NAME))

splitted_rdd = log_files_rdd.map(lambda x: x.split(" "))
selected_col_rdd = splitted_rdd.map(lambda x: (x[0], x[3], x[5], x[6]))

columns = ["ip", "date", "method", "url"]
logs_df = selected_col_rdd.toDF(columns)
logs_df.createOrReplaceTempView('logs_df')

sql = """
    SELECT
        url,
        count(*) as count
    FROM logs_df
    WHERE url LIKE '%/article%'
    GROUP BY url
    """

article_count_df = spark.sql(sql)

print(" ### Get only articles and blogs records ### ")
article_count_df.show(5)
I don't understand why it's failing.
Is there a problem with the code?

Related

How to write a streaming dataframe into another Kafka Topic after doing some transformations?

I am trying to read data from a Kafka topic, join it with another dataframe from a Hive table, and save the result to another Kafka topic.
Below is the code I have written.
import sys
import traceback

from pyspark.sql import DataFrame
from pyspark.sql import functions as F
from pyspark.sql.functions import substring

# Returns a dataframe after reading Kafka topic.
kafka_df = kafka_data(spark=spark, kafkaconfig=kafkaconfig, tableconfig=table_config, source_type='kafka', where_clause='', objectname='object_name')

# Write the dataframe returned from the above step into another Kafka topic.
write_batches(kafka_df)

def write_batches(kafka_df):
    table_config = po_header_config
    kafka_config = kafkaconfig
    jaas_config = kafka_config['jaas_config']
    oauth_client = f" oauth.client.id='{kafka_config['client_id']}'"
    oauth_secret = f" oauth.client.secret='{kafka_config['client_secret']}'"
    oauth_token_endpoint_uri = f" oauth.token.endpoint.uri='{kafka_config['endpoint_uri']}'"
    oauth_config = jaas_config + oauth_client + oauth_secret + oauth_token_endpoint_uri + " oauth.max.token.expiry.seconds='30000' ;"

    kafka_df.writeStream \
        .option('checkpointLocation', table_config['checkpoint_location']) \
        .option('kafka.bootstrap.servers', kafka_config['kafka_broker']) \
        .option('topic', kafka_config['topic_name']) \
        .format('kafka') \
        .foreachBatch(join_kafka_streams_final_table_test) \
        .outputMode("append") \
        .trigger(processingTime="300 seconds") \
        .start().awaitTermination()

def join_kafka_streams_final_table_test(kafka_df, batch_id):
    try:
        table_config = config
        filters = data_filter(kafka_df=kafka_df)
        query = f'select * from DB.TABLE where {filters}'
        main_df = spark.sql(query)
        print(f'Joining kafka dataframe with final_table table')
        joined_df = join_remove_duplicate_col(kafka_df=kafka_df, final_table=main_df, table_config=table_config)
    except Exception as error:
        print(f'Join failed with the exception: {error}')
        traceback.print_exc()
        print('Stopping the application')
        sys.exit(1)

def join_remove_duplicate_col(kafka_df, final_table: DataFrame, table_config: dict):
    try:
        df = kafka_df.join(final_table, on=table_config['join_keys'], how='left_outer')
        print('Join Successful.')
        repeated_columns = [c for c in kafka_df.columns if c in final_table.columns]
        for column in repeated_columns:
            df = df.drop(final_table[column])
        return df
    except Exception as error:
        print(f'Unable to join kafka_df & final_table table with the exception: {error}')
        traceback.print_exc()
        sys.exit(1)

def data_filter(kafka_df):
    try:
        print('Preparing filters for final_table table')
        lst = []
        distinct_partitions = kafka_df.select('main_part', 'create_dt').withColumn('month_part', substring('create_dt', 1, 7)).drop('create_dt').distinct()
        filters = distinct_partitions.groupby('main_part').agg(F.concat_ws("', '", F.collect_list(distinct_partitions.month_part))).rdd.map(lambda row: (row[0], row[1])).collectAsMap()
        for key, value in filters.items():
            s = "'" + value + "'"
            lst.append(f"(super_main_part = '{key}' and month_part in ({s}))")
        datafilter = ' or '.join(lst)
        return datafilter
    except Exception as error:
        print(f'Unable to form filter for final_table table with the exception: {error}')
        traceback.print_exc()
        print('Stopping the application')
        sys.exit(1)
The problem is that when I invoke write_batches with my Kafka dataframe, I don't see any of the print statements from the methods inside join_kafka_streams_final_table_test, which is executed via foreachBatch.
The stream just loads and does nothing.
Could anyone let me know whether my syntax is correct? If not, what mistake did I make, and how can I correct it?
Two things came to mind:
In the current code, you are using format("kafka") and foreachBatch(join_kafka_streams_final_table_test) at the same time. Typically, you would only use one of them.
The method join_kafka_streams_final_table_test does not contain any action, such as writing, so it will never be executed (see the sketch below).
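For illustration, here is a minimal sketch of a foreachBatch handler whose body ends in a batch write, i.e. the kind of action the second point is about. The output and checkpoint paths are placeholders of my own, and kafka_df stands for the streaming dataframe from the question:

from pyspark.sql import DataFrame

def process_batch(batch_df: DataFrame, batch_id: int):
    # Transformations alone are lazy; the batch is only materialised
    # because the handler ends with an action (a batch write).
    deduped_df = batch_df.dropDuplicates()
    deduped_df.write.mode("append").parquet("/tmp/illustration_output")  # placeholder path

# Note that format('kafka') is not set when foreachBatch does the writing.
kafka_df.writeStream \
    .foreachBatch(process_batch) \
    .option("checkpointLocation", "/tmp/illustration_checkpoint") \
    .trigger(processingTime="300 seconds") \
    .start() \
    .awaitTermination()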
Looking at the overall code, I really recommend getting familiar with the Structured Streaming Programming Guide. As I am not completely familiar with Python I can only guess what you are trying to achieve, but the programming model of a structured streaming application already allows you to handle dataframes within a batch. So instead of explicitly calling foreachBatch you could do something like the following:
table_config = po_header_config
kafka_config = kafkaconfig

# Returns a dataframe after reading Kafka topic.
kafka_df = kafka_data(spark=spark, kafkaconfig=kafkaconfig, tableconfig=table_config, source_type='kafka', where_clause='', objectname='object_name')

table_config = config
filters = data_filter(kafka_df=kafka_df)
query = f'select * from DB.TABLE where {filters}'
main_df = spark.sql(query)

df = kafka_df.join(main_df, on=table_config['join_keys'], how='left_outer')

df.writeStream \
    .option('checkpointLocation', table_config['checkpoint_location']) \
    .option('kafka.bootstrap.servers', kafka_config['kafka_broker']) \
    .option('topic', kafka_config['topic_name']) \
    .format('kafka') \
    .outputMode("append") \
    .trigger(processingTime="300 seconds") \
    .start().awaitTermination()
Again, not completely familiar with the Python syntax, but I hope you get the idea.

PySpark - Error accessing broadcast variable in UDF while running in standalone cluster mode

import datetime

import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType


@f.pandas_udf(returnType=DoubleType())
def square(r: pd.Series) -> pd.Series:
    print('In pandas Udf square')
    offset_value = offset.value
    return (r * r) + 10

if __name__ == "__main__":
    spark = SparkSession.builder.appName("Spark").getOrCreate()
    sc = spark.sparkContext
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    offset = sc.broadcast(10)

    x = pd.Series(range(0, 100))
    df = spark.createDataFrame(pd.DataFrame(x, columns=["x"]))
    df = df.withColumn('sq', square(df.x)).withColumn('sqsq', square(f.col('sq')))

    start_time = datetime.datetime.now()
    df.show()

    offset.unpersist()
    offset.destroy()
    spark.stop()
The above code works well if I run the submit command in local mode:
Submit.cmd --master local[*] test.py
If I run the same code in standalone cluster mode, i.e.
Submit.cmd --master spark://xx.xx.0.24:7077 test.py
I get an error while accessing the broadcast variable in the UDF:
java.io.IOException: Failed to delete original file 'C:\Users\xxx\AppData\Local\Temp\spark-bf6b4553-f30f-4e4a-a7f7-ef117329985c\executor-3922c28f-ed1e-4348-baa4-4ed08e042b76\spark-b59e518c-a20a-4a11-b96b-b7657b1c79ea\broadcast6537791588721535439' after copy to 'C:\Users\xxx\AppData\Local\Temp\spark-bf6b4553-f30f-4e4a-a7f7-ef117329985c\executor-3922c28f-ed1e-4348-baa4-4ed08e042b76\blockmgr-ee27f0f0-ee8b-41ea-86d6-8f923845391e\37\broadcast_0_python'
at org.apache.commons.io.FileUtils.moveFile(FileUtils.java:2835)
at org.apache.spark.storage.DiskStore.moveFileToBlock(DiskStore.scala:133)
at org.apache.spark.storage.BlockManager$TempFileBasedBlockStoreUpdater.saveToDiskStore(BlockManager.scala:424)
at org.apache.spark.storage.BlockManager$BlockStoreUpdater.$anonfun$save$1(BlockManager.scala:343)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
Without accessing the broadcast variable in the UDF, this code works fine.
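As a side note on the pattern itself, below is a minimal, self-contained sketch of handing the broadcast handle to the pandas UDF through a closure rather than through a module-level global. All names are illustrative, pyarrow is assumed to be installed, and this is only a sketch of the pattern, not a confirmed fix for the file-deletion error above.

import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType

def make_square(offset_bc):
    # offset_bc is the Broadcast object, captured in the UDF's closure.
    @f.pandas_udf(returnType=DoubleType())
    def square(r: pd.Series) -> pd.Series:
        return ((r * r) + offset_bc.value).astype("float64")
    return square

if __name__ == "__main__":
    spark = SparkSession.builder.appName("BroadcastClosureSketch").getOrCreate()
    offset = spark.sparkContext.broadcast(10)
    square = make_square(offset)

    df = spark.createDataFrame(pd.DataFrame({"x": range(100)}))
    df.withColumn("sq", square(f.col("x"))).show()

    offset.unpersist()
    spark.stop()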

Spark2 reads ORC files much slower than Spark1

I found that Spark 2 loads ORC files much more slowly than Spark 1, and I tried some methods to speed up Spark 2, but with no success. The code is shown below:
Spark 1.5
val conf = new SparkConf().setAppName("LoadOrc")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.akka.frameSize", "512")
  .set("spark.akka.timeout", "800s")
  .set("spark.storage.blockManagerHeartBeatMs", "300000")
  .set("spark.kryoserializer.buffer.max", "1024m")
  .set("spark.executor.extraJavaOptions", "-Djava.util.Arrays.useLegacyMergeSort=true")
val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

val start = System.nanoTime()
val ret = hiveContext.read.orc(args(0)).count()
val end = System.nanoTime()

println(s"count: $ret")
println(s"Time taken: ${(end - start) / 1000 / 1000} ms")
sc.stop()
Spark UI: [Spark1 UI screenshot]
Results:
count: 2290811187
Time taken: 401063 ms
Spark 2
val spark = SparkSession.builder()
  .appName("LoadOrc")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .config("spark.akka.frameSize", "512")
  .config("spark.akka.timeout", "800s")
  .config("spark.storage.blockManagerHeartBeatMs", "300000")
  .config("spark.kryoserializer.buffer.max", "1024m")
  .config("spark.executor.extraJavaOptions", "-Djava.util.Arrays.useLegacyMergeSort=true")
  .enableHiveSupport()
  .getOrCreate()

println(spark.time(spark.read.format("org.apache.spark.sql.execution.datasources.orc")
  .load(args(0)).count()))

spark.close()
Spark UI: [Spark2 UI screenshot]
Results:
Time taken: 1384464 ms
2290811187

Execute PySpark code from a Java/Scala application

Is there a way to execute PySpark code from a Java/Scala application on an existing SparkSession?
Specifically, given PySpark code that receives and returns a PySpark dataframe, is there a way to submit it to a Java/Scala SparkSession and get back the output dataframe:
String pySparkCode = "def my_func(input_df):\n" +
    " from pyspark.sql.functions import *\n" +
    " return input_df.selectExpr(...)\n" +
    "     .drop(...)\n" +
    "     .withColumn(...)\n";

SparkSession spark = SparkSession.builder().master("local").getOrCreate();
Dataset inputDF = spark.sql("SELECT * from my_table");
Dataset outputDf = spark.<SUBMIT_PYSPARK_METHOD>(pySparkCode, inputDF);

Issue while running Spark application on Yarn

I have a test Spark environment (single node) running on AWS. I executed a few ad hoc queries in the PySpark shell and everything went as expected; however, when I run the application using spark-submit, I get an error.
Below is the code:
from __future__ import print_function
from pyspark import SparkContext, SparkConf
from pyspark.sql.session import SparkSession
from pyspark.sql import SQLContext as sql

conf = SparkConf().setAppName("myapp")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

if __name__ == "__main__":
    #inp_data = loaded data from db
    df = inp_data.select('Id','DueDate','Principal','delay','unpaid_emi','future_payment')
    filterd_unpaid_emi = df.filter(df.unpaid_emi == 1)
    par = filterd_unpaid_emi.groupBy('Id').sum('Principal').withColumnRenamed('sum(Principal)' , 'par')
    temp_df = df.filter(df.unpaid_emi == 1)
    temp_df_1 = temp_df.filter(temp_df.future_payment == 0)
    temp_df_1.registerTempTable("mytable")
    bucket_df_1 = sql("""select *, case
                            when delay<0 and delay ==0 then '9999'
                            when delay>0 and delay<7 then '9'
                            when delay>=7 and delay<=14 then '8'
                            when delay>=15 and delay<=29 then '7'
                            when delay>=30 and delay<=59 then '6'
                            when delay>=60 and delay<=89 then '5'
                            when delay>=90 and delay<=119 then '4'
                            when delay>=120 and delay<=149 then '3'
                            when delay>=150 and delay<=179 then '2'
                            else '1'
                        end as bucket
                        from mytable""")
    bucket_df_1 = bucket_df_1.select(bucket_df_1.Id,bucket_df_1.Principal,bucket_df_1.delay,bucket_df_1.unpaid_emi,bucket_df_1.future_payment,bucket_df_1.bucket.cast("int").alias('buckets'))
    min_bucket = bucket_df_1.groupBy('Id').min('buckets').withColumnRenamed('min(buckets)' , 'max_delay')
    joinedDf = par.join(min_bucket, ["Id"])
    #joinedDf.printSchema()
And below is the command to submit the application:
spark-submit \
--master yarn \
--driver-class-path /path to/mysql-connector-java-5.0.8-bin.jar \
--jars /path to/mysql-connector-java-5.0.8-bin.jar \
/path to/mycode.py
ERROR:
17/11/10 10:00:34 INFO SparkSqlParser: Parsing command: mytable
Traceback (most recent call last):
File "/path to/mycode.py", line 36, in <module>
from mytable""")
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/context.py", line 73, in __init__
AttributeError: 'str' object has no attribute '_jsc'
17/11/10 10:00:34 INFO SparkContext: Invoking stop() from shutdown hook
17/11/10 10:00:34 INFO SparkUI: Stopped Spark web UI at ........
I'm quite new to Spark, so can someone please point out the mistake(s) I'm making?
Also, any feedback on improving my coding style would be appreciated!
Spark version: 2.2
You are using the imported SQLContext (aliased as sql), which is not bound to any Spark instance, to query your temp table, instead of spark.sql from the initialized Spark session. I also changed some of your imports and code.
from __future__ import print_function
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

if __name__ == "__main__":

    # move the initializations within the main
    conf = SparkConf().setAppName("myapp")

    # create the session
    spark = SparkSession.builder.config(conf=conf) \
        .getOrCreate()

    # load your data and do what you need to do
    #inp_data = loaded data from db
    df = inp_data.select('Id','DueDate','Principal','delay','unpaid_emi','future_payment')
    filterd_unpaid_emi = df.filter(df.unpaid_emi == 1)
    par = filterd_unpaid_emi.groupBy('Id').sum('Principal').withColumnRenamed('sum(Principal)' , 'par')
    temp_df = df.filter(df.unpaid_emi == 1)
    temp_df_1 = temp_df.filter(temp_df.future_payment == 0)
    temp_df_1.registerTempTable("mytable")

    # use spark.sql to query your table
    bucket_df_1 = spark.sql("""select *, case
                                  when delay<0 and delay ==0 then '9999'
                                  when delay>0 and delay<7 then '9'
                                  when delay>=7 and delay<=14 then '8'
                                  when delay>=15 and delay<=29 then '7'
                                  when delay>=30 and delay<=59 then '6'
                                  when delay>=60 and delay<=89 then '5'
                                  when delay>=90 and delay<=119 then '4'
                                  when delay>=120 and delay<=149 then '3'
                                  when delay>=150 and delay<=179 then '2'
                                  else '1'
                              end as bucket
                              from mytable""")
    bucket_df_1 = bucket_df_1.select(bucket_df_1.Id,bucket_df_1.Principal,bucket_df_1.delay,bucket_df_1.unpaid_emi,bucket_df_1.future_payment,bucket_df_1.bucket.cast("int").alias('buckets'))
    min_bucket = bucket_df_1.groupBy('Id').min('buckets').withColumnRenamed('min(buckets)' , 'max_delay')
    joinedDf = par.join(min_bucket, ["Id"])
    #joinedDf.printSchema()
Hope this helps, good luck!
