Spark Performance Issue - Writing to S3 - python-3.x

I have an AWS Glue job in which I use PySpark to read a large (30 GB) CSV file from S3 and save it back to S3 as Parquet. The job ran for more than 3 hours, after which I killed it. I'm not sure why converting the file format would take so long. Is Spark the right tool for this conversion? Below is the code I am using:
import logging
import sys
from datetime import datetime
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
import boto3
import time
if __name__ == "__main__":
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    sqc = SQLContext(sc)
    # Split each line into records on "END", then into fields on "|",
    # and drop records that contain no "|" separator.
    rdd = (sc
           .textFile("s3://my-bucket/data.txt")
           .flatMap(lambda line: line.split("END"))
           .map(lambda x: x.split("|"))
           .filter(lambda x: len(x) > 1))
    df = sqc.createDataFrame(rdd)
    # print(df.head(10))
    print(f'df.rdd.getNumPartitions() - {df.rdd.getNumPartitions()}')
    df.write.mode('overwrite').parquet('s3://my-bucket/processed')
    job.commit()
Any suggestions for reducing the run time?
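For reference, here is a minimal plain-Python sketch of what the three RDD transformations in the job do to a single input line. The function name and the sample line are hypothetical and only illustrate the logic. In the job itself these lambdas run row by row in Python worker processes, and every resulting field is a string, which may be part of why the conversion is slow.

```python
def parse_line(line):
    """Mimic flatMap(split on "END") -> map(split on "|") -> filter(len > 1) for one line."""
    records = line.split("END")                        # one input line can hold several records
    rows = [record.split("|") for record in records]   # each record becomes a list of string fields
    return [row for row in rows if len(row) > 1]       # drop records without any "|" separator


# Hypothetical sample line: two records, each terminated by "END".
sample = "a|b|cENDd|e|fEND"
parsed = parse_line(sample)
print(parsed)
```

The trailing empty string produced by the final "END" is split into a one-element list and removed by the length filter, which is what the filter step in the job is for.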

Related

AWS Glue job throwing java.lang.OutOfMemoryError while writing to s3

I have a Glue job that reads from RDS and writes to S3 in Parquet format with partitions.
The script is below. The data loads into the data frame, and the count and schema are printed in the log, but the job fails with an error while writing the Parquet files to S3.
I have mostly used Spark's DataFrame API since I was just testing Glue out.
The size of the data is around 150 GB.
import json
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import col, udf, year, month
from pyspark.sql.functions import to_json, struct
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
POSTGRES_USER = "postgres"
POSTGRES_PASSWORD = "postgres"
POSTGRES_JDBC_URL = "<host>"
df = spark.read.format("jdbc").option("url", POSTGRES_JDBC_URL) \
.option("driver", "org.postgresql.Driver").option("dbtable", "log") \
.option("user", POSTGRES_USER).option("password", POSTGRES_PASSWORD).load()
print("Log Count : ", df.count())
print("Log Schema : ", df.printSchema())
df_with_year_and_month = df.withColumn("year", year(col("created_at"))).withColumn("month", month(col("created_at")))
print("Log count : ", df_with_year_and_month.count())
print("Log schema : ", df_with_year_and_month.printSchema())
df_with_year_and_month.write.partitionBy("year", "month").parquet("s3a://datalake-bucket/slogs-parquet",mode="append")
job.commit()
It's hard to tell why you got the OOM without more information about the data (such as how big it is). However, I can tell that you're reading from Postgres three times: twice for the two counts and once for the write. So if you had 1 TB of data in Postgres, your Spark job would have to process 3 TB. That might be the reason. If it still doesn't work, update your question with more insight.
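Beyond dropping the two diagnostic count() calls, one thing that may help is splitting the JDBC read across several connections instead of a single one. The sketch below only builds the option dictionary. partitionColumn, lowerBound, upperBound, numPartitions and fetchsize are standard Spark JDBC options, but the helper name, the numeric id column, the bounds and the sizes are assumptions, not values taken from the question.

```python
def jdbc_read_options(url, table, user, password,
                      partition_column, lower_bound, upper_bound,
                      num_partitions=16, fetch_size=10000):
    """Build options for a partitioned Spark JDBC read (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if lower_bound >= upper_bound:
        raise ValueError("lower_bound must be smaller than upper_bound")
    return {
        "url": url,
        "driver": "org.postgresql.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
        # The partition column must be numeric, date, or timestamp.
        "partitionColumn": partition_column,
        # The bounds only decide the partition stride; they do not filter rows.
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
        # Rows fetched per round trip by the JDBC driver.
        "fetchsize": str(fetch_size),
    }


# Assumed values for illustration only.
opts = jdbc_read_options("<host>", "log", "postgres", "postgres",
                         partition_column="id", lower_bound=1, upper_bound=1_000_000)
print(opts["numPartitions"])

# In the Glue job (not executed here), the dictionary could be used as:
#   df = spark.read.format("jdbc").options(**opts).load()
```

If the counts are still wanted for logging, persisting the data frame before the first action would avoid re-reading the table for each one, at the cost of memory or disk on the executors.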

Unable to add/import additional python library datacompy in aws glue

I am trying to import an additional Python library, datacompy, into a Glue job that uses Glue version 2.0, with the steps below:
Open the AWS Glue console.
Under Job parameters, add the following:
For Key, add --additional-python-modules.
For Value, add datacompy==0.7.3, s3://python-modules/datacompy-0.7.3.whl.
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import datacompy
from py4j.java_gateway import java_import
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
## #params: [JOB_NAME, URL, ACCOUNT, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'ACCOUNT', 'WAREHOUSE', 'DB', 'SCHEMA','additional-python-modules'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
but the job returns the error
ModuleNotFoundError: No module named 'datacompy'
How can I resolve this issue?
With Spark 2.4, Python 3 (Glue Version 2.0)
I set the following Job Parameter
Then I can import it in my job like so
import pandas as pd
import numpy as np
import datacompy
df1 = pd.DataFrame(np.random.randn(10,2), columns=['a','b'])
df2 = pd.DataFrame(np.random.randn(10,2), columns=['a','b'])
compare = datacompy.Compare(df1, df2, join_columns='a')
print(compare.report())
and when I check the CW Log for the Job Run
If you're using a Python Shell job, try the following:
Create a datacompy .whl file, or download it from PyPI.
Upload that file to an S3 bucket.
Then enter the S3 path to the .whl file in the Python library path box:
s3://my-bucket/datacompy-0.8.0-py3-none-any.whl
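For the --additional-python-modules job parameter discussed above, a small sketch of how its value could be assembled when the job is configured from code rather than the console. The helper is hypothetical. It assumes the value is a comma-separated list of module specifiers and strips stray whitespace around each entry; the resulting dictionary has the shape of a job's default arguments map.

```python
def additional_modules_argument(modules):
    """Build the --additional-python-modules job argument (hypothetical helper)."""
    cleaned = [m.strip() for m in modules if m.strip()]
    if not cleaned:
        raise ValueError("at least one module specifier is required")
    # Join without spaces so each entry is a clean pip specifier or S3 path.
    return {"--additional-python-modules": ",".join(cleaned)}


job_args = additional_modules_argument(["datacompy==0.7.3"])
print(job_args)
```

Listing a package either as a pinned specifier or as a wheel path should be enough; passing both for the same package, as in the question, is probably redundant.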

PySpark import statements running for a very long time

My PySpark code is below. The first cell, which contains the import statements, takes a very long time to run in Jupyter. In fact, it had not finished after 5-6 hours, and it later showed a "Time limit exceeded" error.
I have tried everything: restarting Jupyter, uninstalling and reinstalling Anaconda, and uninstalling and reinstalling both Spark and PySpark. I even removed Python completely and installed it again, but the problem was never solved.
Edit 1: I realized that the problem is with the line spark = init_spark(). It takes a very long time to run (in fact, it has not finished even after 4-5 hours).
Please help me with this.
import os
import sys
import pyspark
from pyspark.rdd import RDD
from pyspark.sql import Row
from pyspark.sql import DataFrame
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
from pyspark.sql import functions
from pyspark.sql.functions import lit, desc, col, size
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
from IPython.core.interactiveshell import InteractiveShell
import matplotlib
from pylab import *
import scipy.stats as stats
# This helps auto-print the items without explicitly using 'print'
InteractiveShell.ast_node_interactivity = "all"
# Initialize a spark session.
conf = pyspark.SparkConf().setMaster("local[*]")
def init_spark():
    spark = SparkSession \
        .builder \
        .appName("Statistical Inferences with Pyspark") \
        .config(conf=conf) \
        .getOrCreate()
    return spark
spark = init_spark()
# Raw string so the backslashes in the Windows path are not treated as escape sequences.
filename_data = r'D:\Subjects\ARTIFICIAL INTELLIGENCE\SEMESTER - 5\Big Data and DataBase Management\End Sem Project\endomondoHR_proper.json'
df = spark.read.json(filename_data, mode="DROPMALFORMED")
# Load meta data file into pyspark data frame as well
print('Data frame type: {}'.format(type(df)))

getting error while trying to read athena table in spark

I have the following code snippet in pyspark:
import pandas as pd
from pyspark import SparkContext, SparkConf
from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext, SparkSession
import pyspark.sql.dataframe
def validate_data():
    conf = SparkConf().setAppName("app")
    spark = SparkContext(conf=conf)
    config = {
        "val_path" : "s3://forecasting/data/validation.csv"
    }
    data1_df = spark.read.table("db1.data_dest")
    data2_df = spark.read.table("db2.data_source")
    print(data1_df.count())
    print(data2_df.count())
if __name__ == "__main__":
    validate_data()
This code works fine when run in a Jupyter notebook on SageMaker (connected to EMR),
but when we run it as a Python script from the terminal, it throws this error:
Error message
AttributeError: 'SparkContext' object has no attribute 'read'
We have to automate these notebooks, so we are trying to convert them to Python scripts.
You can only call read on a Spark Session, not on a Spark Context.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("app")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
Or you can convert the Spark context to a Spark session
conf = SparkConf().setAppName("app")
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

set number of file write attempts in spark context

I'm running pyspark inside of aws glue jobs. As part of my pyspark script I write pyspark dataframes to a directory as parquet files. I would like to modify my spark context so that it will try to write each parquet file to the directory at least 20 times before failing the whole dataframe write attempt. The original version I have of starting my code is below. I've updated the "updated" version below as I think I'm supposed to in order to modify the spark context and use it with the glue context. Can someone please tell me if I've done this correctly or let me know how to fix it? Thanks
Original:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
updated:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
sc = SparkContext()
sc._jsc.hadoopConfiguration().set("fs.s3.maxretries", "20")
glueContext = GlueContext(sc.getOrCreate())
spark = glueContext.spark_session
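Your updated code looks right.
You can check that the property is set by reading it back from the same Hadoop configuration you wrote it to (sc.getConf().getAll() lists Spark properties, so a value set directly on the Hadoop configuration does not show up there):
sc._jsc.hadoopConfiguration().get("fs.s3.maxretries")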
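An alternative, sketched here under assumptions, is to pass Hadoop properties as Spark properties with the spark.hadoop. prefix. Spark copies such properties into the Hadoop configuration when the context is created, and set that way they also appear in sc.getConf().getAll(). The helper name is hypothetical, and the sketch does not verify that fs.s3.maxretries is the property that governs write retries in a given Glue environment.

```python
def as_spark_hadoop_conf(hadoop_props):
    """Prefix Hadoop property names with "spark.hadoop." (hypothetical helper)."""
    return {f"spark.hadoop.{key}": str(value) for key, value in hadoop_props.items()}


spark_props = as_spark_hadoop_conf({"fs.s3.maxretries": 20})
print(spark_props)

# In the job (not executed here), the pairs could be applied before the context exists:
#   conf = SparkConf().setAll(list(spark_props.items()))
#   sc = SparkContext(conf=conf)
#   glueContext = GlueContext(sc)
```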
