Unable to add/import additional python library datacompy in aws glue - python-3.x

I am trying to import an additional Python library, datacompy, into a Glue job (Glue version 2.0) with the steps below:
Open the AWS Glue console.
Under Job parameters, add the following:
For Key, add --additional-python-modules.
For Value, add datacompy==0.7.3, s3://python-modules/datacompy-0.7.3.whl.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import datacompy
from py4j.java_gateway import java_import
SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
## #params: [JOB_NAME, URL, ACCOUNT, WAREHOUSE, DB, SCHEMA, USERNAME, PASSWORD]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'URL', 'ACCOUNT', 'WAREHOUSE', 'DB', 'SCHEMA','additional-python-modules'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
but the job returns the error:
ModuleNotFoundError: No module named 'datacompy'
How do I resolve this issue?

With Spark 2.4, Python 3 (Glue Version 2.0), I set the --additional-python-modules job parameter to install datacompy.
Then I can import it in my job like so:
import pandas as pd
import numpy as np
import datacompy
df1 = pd.DataFrame(np.random.randn(10,2), columns=['a','b'])
df2 = pd.DataFrame(np.random.randn(10,2), columns=['a','b'])
compare = datacompy.Compare(df1, df2, join_columns='a')
print(compare.report())
and when I check the CloudWatch log for the job run, I can see the module being installed and the comparison report printed.
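If you want the log itself to confirm that the module is available, a minimal sketch (not from the original answer) is to print the version near the top of the script so it shows up in the CloudWatch log:
import datacompy
# this line lands in the CloudWatch log for the job run if the module was installed
print("datacompy version:", getattr(datacompy, "__version__", "unknown"))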
If you're using a Python Shell job, try the following:
Create a datacompy wheel (.whl) file, or download one from PyPI.
Upload that file to an S3 bucket.
Then enter the path to the S3 .whl file in the Python library path box:
s3://my-bucket/datacompy-0.8.0-py3-none-any.whl
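Either way, you can confirm that the parameter was actually saved on the job definition; here is a small sketch using boto3 (the job name is a placeholder):
import boto3
glue = boto3.client("glue")
# "my-glue-job" is a placeholder; --additional-python-modules should appear in DefaultArguments if it was saved
response = glue.get_job(JobName="my-glue-job")
print(response["Job"].get("DefaultArguments", {}))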

Related

AWS --extra-py-files throwing ModuleNotFoundError: No module named 'pg8000'

I am trying to use pg8000 in my Glue script; the following are the parameters in the Glue job:
--extra-py-files s3://mybucket/pg8000libs.zip  // NOTE: my zip contains __init__.py
Some insights into the code:
import sys
import os
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import boto3
from pyspark.sql import Row
from datetime import datetime, date
zip_path = os.path.join('/tmp', 'pg8000libs.zip')
sys.path.insert(0, zip_path)
def dump_python_path():
    print("python path:", sys.path)
    for path in sys.path:
        if os.path.isdir(path):
            print(f"dir: {path}")
            print("\t" + str(os.listdir(path)))
        print(path)

print(os.listdir('/tmp'))
dump_python_path()
# Import the library
import pg8000
Dump in CloudWatch:
python path: ['/tmp/pg8000libs.zip', '/opt/amazon/bin', '/tmp/pg8000libs.zip', '/opt/amazon/spark/jars/spark-core_2.12-3.1.1-amzn-0.jar', '/opt/amazon/spark/python/lib/pyspark.zip', '/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip', '/opt/amazon/lib/python3.6/site-packages', '/usr/lib64/python37.zip', '/usr/lib64/python3.7', '/usr/lib64/python3.7/lib-dynload', '/home/spark/.local/lib/python3.7/site-packages', '/usr/lib64/python3.7/site-packages', '/usr/lib/python3.7/site-packages']
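One extra check that can help with the zip approach (a small sketch, not from the original post): the zip only works on sys.path if the pg8000 package directory sits at the root of the archive, so you can list the archive's top-level entries from inside the job:
import zipfile
# the top-level entries should include pg8000/ (plus any dependencies it needs),
# not a wrapper folder such as pg8000libs/pg8000/
with zipfile.ZipFile('/tmp/pg8000libs.zip') as zf:
    print(sorted({name.split('/')[0] for name in zf.namelist()}))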

Spark Performance Issue - Writing to S3

I have an AWS Glue job in which I am using PySpark to read a large (30 GB) CSV file on S3 and then save it as Parquet on S3. The job ran for more than 3 hours, after which I killed it. I am not sure why converting the file format would take so long. Is Spark the right tool for this conversion? Below is the code I am using:
import logging
import sys
from datetime import datetime
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import SQLContext
import boto3
import time
if __name__ == "__main__":
    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    sqc = SQLContext(sc)
    rdd = (sc.textFile("s3://my-bucket/data.txt")
           .flatMap(lambda line: line.split("END"))
           .map(lambda x: x.split("|"))
           .filter(lambda x: len(x) > 1))
    df = sqc.createDataFrame(rdd)
    # print(df.head(10))
    print(f'df.rdd.getNumPartitions() - {df.rdd.getNumPartitions()}')
    df.write.mode('overwrite').parquet('s3://my-bucket/processed')
    job.commit()
Any suggestions for reducing the run time?
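For reference, a minimal sketch of one commonly suggested tweak (not from the original thread): control the number of output partitions explicitly before the write, since the partitioning produced by textFile plus flatMap may not write efficiently to S3. The value below is illustrative and should be tuned to your data and worker count:
# repartition before writing; 200 output files is an illustrative starting point
df = df.repartition(200)
df.write.mode('overwrite').parquet('s3://my-bucket/processed')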

set number of file write attempts in spark context

I'm running PySpark inside AWS Glue jobs. As part of my PySpark script I write PySpark dataframes to a directory as Parquet files. I would like to modify my Spark context so that it tries to write each Parquet file to the directory at least 20 times before failing the whole dataframe write attempt. The original version of my code is below, followed by the "updated" version with the changes I think I'm supposed to make in order to modify the Spark context and use it with the Glue context. Can someone please tell me if I've done this correctly, or let me know how to fix it? Thanks.
Original:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
updated:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
sc = SparkContext()
sc._jsc.hadoopConfiguration().set("fs.s3.maxretries", "20")
glueContext = GlueContext(sc.getOrCreate())
spark = glueContext.spark_session
Your updated code looks right.
You can validate whether the property is set by printing the values returned by the method below:
sc.getConf().getAll()
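As an additional check (a small sketch, not from the original answer): because the property was set on the Hadoop configuration rather than on the Spark conf, you can also read it back directly from there:
# read the value back from the Hadoop configuration it was written to
print(sc._jsc.hadoopConfiguration().get("fs.s3.maxretries"))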

convert spark dataframe to aws glue dynamic frame

I tried converting my Spark dataframes to dynamic frames to output them as glueparquet files, but I'm getting the error:
'DataFrame' object has no attribute 'fromDF'
My code uses Spark dataframes heavily. Is there a way to convert from a Spark dataframe to a dynamic frame so I can write out as glueparquet? If so, could you please provide an example and point out what I'm doing wrong below?
code:
# importing libraries
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
# updated 11/19/19 for error caused in error logging function
spark = glueContext.spark_session
from pyspark.sql import Window
from pyspark.sql.functions import col
from pyspark.sql.functions import first
from pyspark.sql.functions import date_format
from pyspark.sql.functions import lit,StringType
from pyspark.sql.types import *
from pyspark.sql.functions import substring, length, min,when,format_number,dayofmonth,hour,dayofyear,month,year,weekofyear,date_format,unix_timestamp
base_pth='s3://test/'
bckt_pth1=base_pth+'test_write/glueparquet/'
test_df=glueContext.create_dynamic_frame.from_catalog(
database='test_inventory',
table_name='inventory_tz_inventory').toDF()
test_df.fromDF(test_df, glueContext, "test_nest")
glueContext.write_dynamic_frame.from_options(frame = test_nest,
connection_type = "s3",
connection_options = {"path": bckt_pth1+'inventory'},
format = "glueparquet")
error:
'DataFrame' object has no attribute 'fromDF'
Traceback (most recent call last):
File "/mnt/yarn/usercache/livy/appcache/application_1574556353910_0001/container_1574556353910_0001_01_000001/pyspark.zip/pyspark/sql/dataframe.py", line 1300, in __getattr__
"'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'fromDF'
fromDF is a class function. Here's how you can convert a DataFrame to a DynamicFrame:
from awsglue.dynamicframe import DynamicFrame
DynamicFrame.fromDF(test_df, glueContext, "test_nest")
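For completeness, a short usage sketch that puts the two pieces from the question together (names follow the question's code):
from awsglue.dynamicframe import DynamicFrame
# convert the Spark DataFrame to a DynamicFrame, then write it out as glueparquet
test_nest = DynamicFrame.fromDF(test_df, glueContext, "test_nest")
glueContext.write_dynamic_frame.from_options(
    frame=test_nest,
    connection_type="s3",
    connection_options={"path": bckt_pth1 + 'inventory'},
    format="glueparquet")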
Just to consolidate the answers for Scala users too, here's how to transform a Spark DataFrame into a DynamicFrame (the fromDF method doesn't exist in the Scala API of DynamicFrame):
import com.amazonaws.services.glue.DynamicFrame
val dynamicFrame = DynamicFrame(df, glueContext)
I hope it helps!

aws glue dropping mostly null fields

I have a dataframe df. It has a couple of columns that are mostly null. I'm writing it to an S3 bucket using the code below, and then crawling the S3 bucket to get the table schema into the Data Catalog. I'm finding that when I crawl the data, the fields that are mostly null get dropped. I've checked the JSON output and found that some records have the field and others don't. Does anyone know what the issue might be? I would like to include the fields even if they are mostly null.
Code:
# importing libraries
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
from pyspark.sql.functions import col
from pyspark.sql.functions import first
from pyspark.sql.functions import date_format
from pyspark.sql.functions import lit,StringType
from pyspark.sql.types import *
from pyspark.sql.functions import to_date,format_number,dayofmonth,hour,dayofyear,month,year,weekofyear,date_format,unix_timestamp
from pyspark.sql.functions import *
# write to table
df.write.json('s3://path/table')
Why not use the AWS Glue write method instead of the Spark DF write?
glueContext.write_dynamic_frame.from_options
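A minimal sketch of what that could look like (assuming the df and glueContext from the code above; the S3 path is the same placeholder used in the question):
from awsglue.dynamicframe import DynamicFrame
# convert the Spark DataFrame to a DynamicFrame and write it with the Glue writer
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://path/table"},
    format="json")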
