I am new to AWS Glue. I am trying to read a CSV and transform it into a JSON object. As I understand it, the approach is to read the CSV via a crawler, convert it to a PySpark DataFrame, and then convert that to JSON.
So far I have converted the data to a JSON object. Now I need to write this JSON back to an S3 bucket.
Below is the code
#########################################
### IMPORT LIBRARIES AND SET VARIABLES
#########################################
#Import python modules
from datetime import datetime
#Import pyspark modules
from pyspark.context import SparkContext
import pyspark.sql.functions as f
#Import glue modules
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
import json
import boto3
#Initialize contexts and session
spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
session = glue_context.spark_session
s3_source = boto3.resource('s3')
#Parameters
glue_db = "umesh-db"
glue_tbl = "read"
#########################################
### EXTRACT (READ DATA)
#########################################
#Read movie data to Glue dynamic frame
dynamic_frame_read = glue_context.create_dynamic_frame.from_catalog(database = glue_db, table_name = glue_tbl)
#Convert dynamic frame to data frame to use standard pyspark functions
data_frame = dynamic_frame_read.toDF()
## Show DF data
print("Showing Df data")
data_frame.show()
### Convert the DF to the json
jsonContent = data_frame.toJSON()
jsonValue={}
arrraYObj=[]
for row in jsonContent.collect():
    print("Row data ", row)
    arrraYObj.append(row)
print("Array Obj",arrraYObj)
jsonValue['Employee']=arrraYObj
print("Json object ", jsonValue)
#Log end time
#dt_end = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
#print("Start time:", dt_end)
I would appreciate it if anyone can point me to the right approach.
Thanks
data_frame.write.format('json').save('s3://bucket/key')
Or write directly from the dynamic frame:
glue_context.write_dynamic_frame.from_options(frame = dynamic_frame_read,
    connection_type = "s3",
    connection_options = {"path": "s3://bucket/key"},
    format = "json")
I have a sample.json file and I want to dump the data into a DynamoDB table, but I keep getting this error: ValueError: Expected object or value.
Here is what I have done so far.
import boto3
import requests
import json
import re
import json_stream
import pandas as pd
from sys import getsizeof
from datetime import datetime
from botocore.exceptions import ClientError
from boto3.dynamodb.conditions import Key
from decimal import Decimal
dynamodb_client = boto3.resource("dynamodb", region_name="eu-central-1")
url_table = dynamodb_client.Table("URLConfig")
global_config = dynamodb_client.Table("GlobalConfigs")
chunks = pd.read_json("sample.json", lines=True, chunksize = 500)
for c in chunks:
    c = c.to_dict(orient="records")
    try:
        with url_table.batch_writer() as batch:
            for i in range(len(c)):
                c[i]["confidence"] = Decimal(str(c[i]["confidence"]))
                batch.put_item(Item=c[i])
    except Exception as e:
        print(e)
        # records which generated exceptions must be collected in a file.
        # as of now it is being ignored.
        continue
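That ValueError from pd.read_json(lines=True) usually means the file is not newline-delimited JSON (for example, it is a single JSON array). A minimal sketch under that assumption, loading the file with the json module and writing the items directly:
import json
from decimal import Decimal
import boto3
dynamodb = boto3.resource("dynamodb", region_name="eu-central-1")
url_table = dynamodb.Table("URLConfig")
# parse_float=Decimal converts floats while parsing, so no per-item conversion is needed
with open("sample.json") as fh:
    items = json.load(fh, parse_float=Decimal)
with url_table.batch_writer() as batch:
    for item in items:
        batch.put_item(Item=item)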
I need to read .parquet files into a Pandas DataFrame in Python on my local machine without downloading the files. The parquet files are stored in Azure blob storage with a hierarchical directory structure.
I am doing something like the following and am not sure how to proceed:
from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = blob_service_client.get_blob_client(container="abc", blob="/xyz/pqr/folder_with_parquet_files")
I have used dummy names here for privacy concerns. Assuming the directory "folder_with_parquet_files" contains n parquet files, how can I read them into a single Pandas DataFrame?
Hi, you could use pandas and read the parquet from a stream. It can be very helpful for a small data set, since a Spark session is not required here. It may be the fastest way, especially for testing purposes.
import pandas as pd
from io import BytesIO
from azure.storage.blob import ContainerClient
path = '/path_to_blob/..'
conn_string = <conn_string>
blob_name = f'{path}.parquet'
container = ContainerClient.from_connection_string(conn_str=conn_string, container_name=<name_of_container>)
blob_client = container.get_blob_client(blob=blob_name)
stream_downloader = blob_client.download_blob()
stream = BytesIO()
stream_downloader.readinto(stream)
processed_df = pd.read_parquet(stream, engine='pyarrow')
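Since the question asks about a whole folder of parquet files, here is a minimal sketch that extends this approach by listing every blob under the folder prefix and concatenating the results (conn_string as above; the container name and prefix are placeholders):
import pandas as pd
from io import BytesIO
from azure.storage.blob import ContainerClient
container = ContainerClient.from_connection_string(conn_str=conn_string, container_name='abc')
frames = []
# name_starts_with limits the listing to the folder's prefix
for blob in container.list_blobs(name_starts_with='xyz/pqr/folder_with_parquet_files/'):
    if not blob.name.endswith('.parquet'):
        continue
    stream = BytesIO()
    container.download_blob(blob.name).readinto(stream)
    frames.append(pd.read_parquet(stream, engine='pyarrow'))
df = pd.concat(frames, ignore_index=True)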
Here is a very similar solution, but slightly different in that it uses the newer method azure.storage.blob._download.StorageStreamDownloader.readall:
from io import BytesIO
import pandas as pd
from azure.storage.blob import BlobServiceClient
blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client(container="parquet")
# upload_name is the name/path of the parquet blob inside the container
downloaded_blob = container_client.download_blob(upload_name)
bytes_io = BytesIO(downloaded_blob.readall())
df = pd.read_parquet(bytes_io)
print(df.head())
The get_blob_to_bytes method can be used (it belongs to BlockBlobService in the legacy azure-storage-blob v2 SDK). Here the file is fetched from blob storage and held in memory, and pandas can then read this byte array as parquet.
from azure.storage.blob import BlockBlobService
import pandas as pd
from io import BytesIO
#Source account and key
source_account_name = 'testdata'
source_account_key ='****************'
SOURCE_CONTAINER = 'my-data'
eachFile = 'test/2021/oct/myfile.parq'
source_block_blob_service = BlockBlobService(account_name=source_account_name, account_key=source_account_key)
f = source_block_blob_service.get_blob_to_bytes(SOURCE_CONTAINER, eachFile)
df = pd.read_parquet(BytesIO(f.content))
print(df.shape)
I tried converting my Spark DataFrames to DynamicFrames to output them as glueparquet files, but I'm getting the error
'DataFrame' object has no attribute 'fromDF'
My code relies heavily on Spark DataFrames. Is there a way to convert from a Spark DataFrame to a DynamicFrame so I can write out as glueparquet? If so, could you please provide an example and point out what I'm doing wrong below?
code:
# importing libraries
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
# updated 11/19/19 for error caused in error logging function
spark = glueContext.spark_session
from pyspark.sql import Window
from pyspark.sql.functions import col
from pyspark.sql.functions import first
from pyspark.sql.functions import date_format
from pyspark.sql.functions import lit  # StringType comes from pyspark.sql.types, imported below
from pyspark.sql.types import *
from pyspark.sql.functions import substring, length, min,when,format_number,dayofmonth,hour,dayofyear,month,year,weekofyear,date_format,unix_timestamp
base_pth='s3://test/'
bckt_pth1=base_pth+'test_write/glueparquet/'
test_df=glueContext.create_dynamic_frame.from_catalog(
    database='test_inventory',
    table_name='inventory_tz_inventory').toDF()
test_df.fromDF(test_df, glueContext, "test_nest")
glueContext.write_dynamic_frame.from_options(frame = test_nest,
    connection_type = "s3",
    connection_options = {"path": bckt_pth1+'inventory'},
    format = "glueparquet")
error:
'DataFrame' object has no attribute 'fromDF'
Traceback (most recent call last):
File "/mnt/yarn/usercache/livy/appcache/application_1574556353910_0001/container_1574556353910_0001_01_000001/pyspark.zip/pyspark/sql/dataframe.py", line 1300, in __getattr__
"'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
AttributeError: 'DataFrame' object has no attribute 'fromDF'
fromDF is a class method. Here's how you can convert a DataFrame to a DynamicFrame:
from awsglue.dynamicframe import DynamicFrame
DynamicFrame.fromDF(test_df, glueContext, "test_nest")
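Putting that into the script from the question, a minimal sketch (assuming glueContext, test_df and bckt_pth1 as defined above):
from awsglue.dynamicframe import DynamicFrame
# convert the Spark DataFrame back into a DynamicFrame
test_nest = DynamicFrame.fromDF(test_df, glueContext, "test_nest")
# then write it out as glueparquet
glueContext.write_dynamic_frame.from_options(frame = test_nest,
    connection_type = "s3",
    connection_options = {"path": bckt_pth1 + 'inventory'},
    format = "glueparquet")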
Just to consolidate the answers for Scala users too, here's how to transform a Spark DataFrame into a DynamicFrame (the method fromDF doesn't exist in the Scala API of DynamicFrame):
import com.amazonaws.services.glue.DynamicFrame
val dynamicFrame = DynamicFrame(df, glueContext)
I hope it helps!
I have 6 files named Data_20190823101010, Data_20190823101112, Data_20190823101214, Data_20190823101310, Data_20190823101410, and Data_20190823101510.
These are daily files to be loaded into a SQL Server DB table.
Due to size and performance reasons, they need to be loaded one by one.
The Python code must pick one file at a time, process it, and load it into the DB table.
How should I write the code?
Thanks in advance.
import glob
import os
import pandas as pd
import time
from datetime import datetime
import numpy as np
folder_name = 'Data_Folder'
file_type = 'csv'
file_titles = ['C1','C2','C3','C4','C5']
df = pd.concat([pd.read_csv(f, header=None, skiprows=1, names=file_titles, low_memory=False) for f in glob.glob(folder_name + "//*Data_*")])
You can import those CSV files into DataFrames and then use the pandas to_sql function to connect and upload the data to the MS SQL Server DB:
from sqlalchemy import create_engine
import urllib
import pyodbc
import pandas as pd
import glob
connection= urllib.parse.quote_plus("DRIVER={SQL Server Native Client 11.0};SERVER=Server_name;DATABASE=DB Name")
engine = create_engine('mssql+pyodbc:///?odbc_connect={}'.format(connection))
path = r'C:\file_path' # local drive File path
all_csv_files = glob.glob(path + "/*.csv")
for filename in all_csv_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    # if_exists='append' lets each file add rows to the same table
    df.to_sql('Table_Name', schema='dbo', con = engine, if_exists='append')
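Since these are daily files that should go in one at a time, it may also help to sort the glob results so the files load in chronological order; the Data_YYYYMMDDHHMMSS suffix sorts correctly as a plain string. A small sketch (path and table name are placeholders, engine as created above):
import glob
import pandas as pd
path = r'C:\file_path'
# the timestamp suffix makes lexical order equal chronological order
for filename in sorted(glob.glob(path + "/Data_*.csv")):
    df = pd.read_csv(filename, header=0)
    df.to_sql('Table_Name', schema='dbo', con=engine, if_exists='append', index=False)
    print("Loaded", filename)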
I am using a Spark application to process text files that are dropped into the /home/user1/files/ folder on my system, mapping the comma-separated data in those files into a particular JSON format. I have written the following Python code using Spark to do this, but the output that arrives in Kafka looks like the following:
Row(Name=Priyesh,Age=26,MailId=priyeshkaratha#gmail.com,Address=AddressTest,Phone=112)
Python Code :
import findspark
findspark.init('/home/user1/spark')
from pyspark import SparkConf, SparkContext
from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.sql import Column, DataFrame, Row, SparkSession
from pyspark.streaming.kafka import KafkaUtils
import json
from kafka import SimpleProducer, KafkaClient
from kafka import KafkaProducer
producer = KafkaProducer(bootstrap_servers='server.kafka:9092')
def handler(message):
    records = message.collect()
    for record in records:
        producer.send('spark.out', str(record))
        print(record)
        producer.flush()

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 1)
    lines = ssc.textFileStream('/home/user1/files/')
    fields = lines.map(lambda l: l.split(","))
    udr = fields.map(lambda p: Row(Name=p[0],Age=int(p[3].split('#')[0]),MailId=p[31],Address=p[29],Phone=p[46]))
    udr.foreachRDD(handler)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
So how can I convert this Row form into JSON while pushing to the Kafka topic?
You can convert Spark Row objects to dicts, and then serialize those to JSON. For example, you could change this line:
producer.send('spark.out', str(record))
to this:
producer.send('spark.out', json.dumps(record.asDict()))
Alternatively, since you aren't using DataFrames in your example code, you could just create a dict to begin with instead of a Row, as sketched below.
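A minimal sketch of that alternative, reusing producer and fields from the question's code (the field positions are taken from the question, and the .encode call assumes a kafka-python version that requires bytes values):
import json
def to_payload(p):
    # build a plain dict instead of a Row
    return {"Name": p[0],
            "Age": int(p[3].split('#')[0]),
            "MailId": p[31],
            "Address": p[29],
            "Phone": p[46]}
def handler(message):
    for record in message.collect():
        # record is already a dict, so json.dumps works directly
        producer.send('spark.out', json.dumps(record).encode('utf-8'))
    producer.flush()
udr = fields.map(to_payload)
udr.foreachRDD(handler)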