convert spark dataframe to aws glue dynamic frame - apache-spark

I tried converting my Spark dataframes to dynamic frames so I can output them as glueparquet files, but I'm getting the error
'DataFrame' object has no attribute 'fromDF'
My code relies heavily on Spark dataframes. Is there a way to convert a Spark dataframe to a dynamic frame so I can write it out as glueparquet? If so, could you please provide an example and point out what I'm doing wrong below?
code:
# importing libraries
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
# updated 11/19/19 for error caused in error logging function
spark = glueContext.spark_session
from pyspark.sql import Window
from pyspark.sql.functions import col
from pyspark.sql.functions import first
from pyspark.sql.functions import date_format
from pyspark.sql.functions import lit,StringType
from pyspark.sql.types import *
from pyspark.sql.functions import substring, length, min,when,format_number,dayofmonth,hour,dayofyear,month,year,weekofyear,date_format,unix_timestamp
base_pth='s3://test/'
bckt_pth1=base_pth+'test_write/glueparquet/'
test_df = glueContext.create_dynamic_frame.from_catalog(
    database='test_inventory',
    table_name='inventory_tz_inventory').toDF()
test_df.fromDF(test_df, glueContext, "test_nest")
glueContext.write_dynamic_frame.from_options(frame = test_nest,
    connection_type = "s3",
    connection_options = {"path": bckt_pth1+'inventory'},
    format = "glueparquet")
error:
'DataFrame' object has no attribute 'fromDF'
Traceback (most recent call last):
File "/mnt/yarn/usercache/livy/appcache/application_1574556353910_0001/container_1574556353910_0001_01_000001/pyspark.zip/pyspark/sql/dataframe.py", line 1300, in __getattr__
"'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'fromDF'

fromDF is a classmethod of DynamicFrame, not of DataFrame. Here's how you can convert a DataFrame to a DynamicFrame:
from awsglue.dynamicframe import DynamicFrame
DynamicFrame.fromDF(test_df, glueContext, "test_nest")
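Putting that together with the write from your question, the corrected flow would look roughly like this (a sketch reusing the same catalog table, bucket path and glueparquet format from above):
from awsglue.dynamicframe import DynamicFrame

# read from the catalog as a Spark dataframe, as in the question
test_df = glueContext.create_dynamic_frame.from_catalog(
    database='test_inventory',
    table_name='inventory_tz_inventory').toDF()

# ... any dataframe transformations go here ...

# convert back to a DynamicFrame and write it out as glueparquet
test_nest = DynamicFrame.fromDF(test_df, glueContext, "test_nest")
glueContext.write_dynamic_frame.from_options(
    frame=test_nest,
    connection_type="s3",
    connection_options={"path": bckt_pth1 + 'inventory'},
    format="glueparquet")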

Just to consolidate the answers for Scala users too, here's how to transform a Spark DataFrame into a DynamicFrame (the fromDF method doesn't exist in the Scala API of DynamicFrame):
import com.amazonaws.services.glue.DynamicFrame
val dynamicFrame = DynamicFrame(df, glueContext)
I hope it helps!

Related

Does spark write dataframes asynchronously

I have two Spark dataframes, df1 and df2. I'm trying to write them out to two different file paths. Can someone tell me whether the writes occur synchronously or asynchronously? That is, since they're two different dataframes writing to two different paths, will the writes occur at the same time, or do I have to wait until df1 finishes writing before df2 starts?
Example code (updated to include the imported libraries):
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
# updated 11/19/19 for error caused in error logging function
spark = glueContext.spark_session
from pyspark.sql import Window
from pyspark.sql.functions import col
from pyspark.sql.functions import first
from pyspark.sql.functions import date_format
from pyspark.sql.functions import lit,StringType
from pyspark.sql.types import *
from pyspark.sql.functions import substring, length, min,when,format_number,dayofmonth,hour,dayofyear,month,year,weekofyear,date_format,unix_timestamp
import time
import math
df1.write.mode("overwrite").parquet(filepath1)
df2.write.mode("overwrite").parquet(filepath2)
If it's on a single thread it will write one at a time. You can use threading and share the SparkContext.
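A minimal sketch of that idea, assuming df1, df2, filepath1 and filepath2 from the question and the shared Spark session:
from concurrent.futures import ThreadPoolExecutor

def write_parquet(df, path):
    # each call launches its own Spark job on the shared SparkContext
    df.write.mode("overwrite").parquet(path)

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(write_parquet, df1, filepath1),
               pool.submit(write_parquet, df2, filepath2)]
    for f in futures:
        f.result()  # re-raise any exception from the worker threads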

How to use a scikit pickle model in spark structured streaming? [duplicate]

I'm trying to apply a scikit-learn model, loaded from a pickle, to every row of a structured streaming dataframe.
I've tried using pandas_udf (code version 1), and it gives me this error:
AttributeError: 'numpy.ndarray' object has no attribute 'isnull'
Code:
inputPath = "/FileStore/df_training/streaming_df_1_nh_nd/"
from pyspark.sql import functions as f
from pyspark.sql.types import *
data_schema = data_spark_ts.schema
import pandas as pd
from pyspark.sql.functions import col, pandas_udf, PandasUDFType # user-defined functions for pandas dataframes
from pyspark.sql.types import LongType
get_prediction = pandas_udf(lambda x: gb2.predict(x), IntegerType())
streamingInputDF = (
spark
.readStream
.schema(data_schema) # Set the schema of the JSON data
.option("maxFilesPerTrigger", 1) # Treat a sequence of files as a stream by picking one file at a time
.csv(inputPath)
.fillna(0)
.withColumn("prediction", get_prediction( f.struct([col(x) for x in data_spark.columns]) ))
)
display(streamingInputDF.select("prediction"))
I've also tried using a normal udf instead of the pandas_udf, and it gives me this error:
ValueError: Expected 2D array, got 1D array instead:
[.. ... .. ..]
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I don't know how to reshape my data.
The model I try to apply is retrieved this way:
#load the pickle
import pickle
gb2 = None
with open('pickle_modello_unico.p', 'rb') as fp:
gb2 = pickle.load(fp)
And its specification is this:
GradientBoostingClassifier(criterion='friedman_mse', init=None,
learning_rate=0.1, loss='deviance', max_depth=3,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=300,
n_iter_no_change=None, presort='auto', random_state=None,
subsample=1.0, tol=0.0001, validation_fraction=0.1,
verbose=0, warm_start=False)
Any help to solve this?
I solved the issue by returning a pd.Series from the pandas_udf: a scalar pandas_udf must hand Spark back a pandas Series of the declared return type, not the raw numpy array that gb2.predict produces.
Here is the working code:
inputPath = "/FileStore/df_training/streaming_df_1_nh_nd/"
from pyspark.sql import functions as f
from pyspark.sql.types import *
data_schema = data_spark_ts.schema
import pandas as pd
from pyspark.sql.functions import col, pandas_udf, PandasUDFType # user-defined functions for pandas dataframes
from pyspark.sql.types import LongType
get_prediction = pandas_udf(lambda x: pd.Series(gb2.predict(x)), StringType())
streamingInputDF = (
spark
.readStream
.schema(data_schema) # Set the schema of the JSON data
.option("maxFilesPerTrigger", 1) # Treat a sequence of files as a stream by picking one file at a time
.csv(inputPath)
.withColumn("prediction", get_prediction( f.struct([col(x) for x in data_spark.columns]) ))
)
display(streamingInputDF.select("prediction"))

aws glue dropping mostly null fields

I have a dataframe df. It has a couple of columns that are mostly null. I'm writing it to an s3 bucket using the code below, then crawling the s3 bucket to get the table schema into the Data Catalog. I'm finding that when I crawl the data, the fields that are mostly null get dropped. I've checked the JSON output and found that some records have the field and others don't. Does anyone know what the issue might be? I would like to include the fields even if they are mostly null.
Code:
# importing libraries
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
from pyspark.sql.functions import col
from pyspark.sql.functions import first
from pyspark.sql.functions import date_format
from pyspark.sql.functions import lit,StringType
from pyspark.sql.types import *
from pyspark.sql.functions import to_date,format_number,dayofmonth,hour,dayofyear,month,year,weekofyear,date_format,unix_timestamp
from pyspark.sql.functions import *
# write to table
df.write.json('s3://path/table')
Why not use the AWS Glue write method instead of the Spark DF write?
glueContext.write_dynamic_frame.from_options
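A sketch of what that could look like for the dataframe in the question, writing JSON to the same placeholder path via a DynamicFrame:
from awsglue.dynamicframe import DynamicFrame

# wrap the Spark dataframe in a DynamicFrame so Glue handles the write
dyf = DynamicFrame.fromDF(df, glueContext, "dyf")
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://path/table"},
    format="json")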

spark cannot read from aws s3

My Python 3 code fails with the following error:
Py4JJavaError: An error occurred while calling o45.load. :
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.s3a.S3AFileSystem not found
The code:
from pyspark.sql.functions import udf
from pyspark.sql.types import *
import os
from pandas.io.json import json_normalize
from pyspark.sql.types import *
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-memory 10g --packages org.elasticsearch:elasticsearch-hadoop:6.5.2,org.apache.hadoop:hadoop-aws:3.1.1 pyspark-shell'
from pyspark import SparkConf
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").config(conf=SparkConf()).config("spark.local.dir", "/Users/psuresh/spark_local_dir").enableHiveSupport().getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
hadoopConf = spark._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.access.key", "XXXXXX")
hadoopConf.set("fs.s3a.secret.key","XXXXXXXXXX")
from pyspark.sql.functions import input_file_name
from pyspark.sql.functions import regexp_extract, col
from pyspark.sql.functions import *
df= spark.read.format("parquet").load("s3a://a/b/c/d/dump/")

'RDD' object has no attribute '_jdf' pyspark RDD

I'm new to pyspark. I would like to perform some machine learning on a text file.
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # transform df to rdd
tr_data= td.map(lambda line: line.split()).map(lambda words: Row(label=words[0],words=words[1:]))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol ="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
and for my last command, I get the error:
AttributeError: 'RDD' object has no attribute '_jdf'
Can anyone help me please? Thank you.
You shouldn't be using an rdd with CountVectorizer. Instead, you should form the array of words in the dataframe itself, as
train_data = spark.read.text("20ng-train-all-terms.txt")
from pyspark.sql import functions as F
td= train_data.select(F.split("value", " ").alias("words")).select(F.col("words")[0].alias("label"), F.col("words"))
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(td)
And then it should work, so you can call the transform function as
vectorizer_transformer.transform(td).show(truncate=False)
Now, if you want to stick with the old style of converting to an rdd, then you have to modify certain lines of code. Following is the complete modified (working) version of your code:
from pyspark import Row
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark import SparkConf
sc = SparkContext
spark = SparkSession.builder.appName("ML").getOrCreate()
train_data = spark.read.text("20ng-train-all-terms.txt")
td = train_data.rdd  # transform df to rdd
tr_data= td.map(lambda line: line[0].split(" ")).map(lambda words: Row(label=words[0], words=words[1:])).toDF()
from pyspark.ml.feature import CountVectorizer
vectorizer = CountVectorizer(inputCol="words", outputCol="bag_of_words")
vectorizer_transformer = vectorizer.fit(tr_data)
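As before, you can then check the result by calling transform on the dataframe:
vectorizer_transformer.transform(tr_data).show(truncate=False)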
But I would suggest you stick with the dataframe way.
