Unable to parse file from AWS Glue dynamic_frame to PySpark DataFrame - python-3.x

I am new to AWS Glue.
I am facing an issue converting a Glue DynamicFrame to a PySpark DataFrame.
Below is the setup I created for reading the CSV file through the crawler-populated catalog:
glue_cityMapDB="csvDb"
glue_cityMapTbl="csv table"
datasource2 = glue_context.create_dynamic_frame.from_catalog(database = glue_cityMapDB, table_name = glue_cityMapTbl, transformation_ctx = "datasource2")
datasource2.show()
print("Show the data source2 city DF")
cityDF=datasource2.toDF()
cityDF.show()
Output:
Here I am getting output from the Glue DynamicFrame (datasource2.show() works).
But after converting to the PySpark DataFrame, I am getting the following error:
S3NativeFileSystem (S3NativeFileSystem.java:open(1208)) - Opening 's3://s3source/read/names.csv' for reading 2020-04-24 05:08:39,789 ERROR [Executor task launch worker for task
I would appreciate it if anybody could help with this.

Make sure the file is UTF-8 encoded. You can check the encoding with file, convert it with iconv, or use a text editor such as Sublime Text.
You can also read the file as a DataFrame using:
df = spark.read.csv('s3://s3source/read/names.csv')
and then convert it back to a DynamicFrame using fromDF().
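For reference, a minimal sketch of that round trip, assuming the glue_context from the question above (the name "cityDF" passed to fromDF is just a label):
from awsglue.dynamicframe import DynamicFrame
spark = glue_context.spark_session
# Read the CSV directly with Spark, bypassing the catalog table.
df = spark.read.csv('s3://s3source/read/names.csv', header=True)
df.show()
# Convert the plain DataFrame back into a Glue DynamicFrame if the rest of the job needs one.
dyf = DynamicFrame.fromDF(df, glue_context, "cityDF")
dyf.show()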

Related

Getting duplicate values for each key while querying parquet file using PySpark

We are getting duplicated values when querying data from a parquet file using PySpark, while Presto returns the correct data for the same file.
Spark Version: 3.1
Configuration setup so far:
from scbuilder.kubernetes import Kubernetes
kobj = Kubernetes(kubernetes = True)
kobj.setExecutorCores(5)
kobj.setExecutorMemory("5g")
kobj.addAdditionalConf("spark.driver.memory", "8g")
kobj.setNumberOfExecutor(2)
sc = kobj.buildSparkSession()
sc.getActiveSession()
sc.conf.set('sc.hadoopConfiguration.setClass',"mapreduce.input.pathFilter.class")
sc.conf.set("hive.convertMetastoreParquet",False)
sc.conf.set("hive.input.format","org.apache.hadoop.hive.ql.io.HiveInputFormat")
Actual data count: 17722
After querying from parquet file: 1036320
I need help understanding why the parquet file shows this behavior and how we can fix it.
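A hedged diagnostic sketch (not a fix) to narrow down whether the duplication comes from the files themselves or from the table definition, assuming sc is the SparkSession built above; the path and table name are placeholders:
raw_df = sc.read.parquet("s3://your-bucket/path/to/table/")   # placeholder path to the parquet files
table_df = sc.sql("SELECT * FROM your_db.your_table")         # placeholder metastore table
print("files:", raw_df.count(), "distinct:", raw_df.distinct().count())
print("table:", table_df.count(), "distinct:", table_df.distinct().count())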

How to resolve invalid column name on parquet file read itself in PySpark

I set up a standalone Spark and a standalone HDFS.
I installed PySpark and was able to create a Spark session.
I uploaded one parquet file to HDFS under /data: hdfs://localhost:9000/data
I tried to create a DataFrame out of this directory using PySpark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').appName("test").getOrCreate()
df = spark.read.parquet("hdfs://localhost:9000/data").withColumnRenamed("Wafer ID", "Wafer_ID")
I am getting "invalid column name" even with withColumnRenamed.
I tried the following code as well, but I got the same error:
from pyspark.sql.functions import col
df = spark.read.parquet("hdfs://localhost:9000/data").select(col("Wafer ID").alias("Wafer_ID"))
I have means to change the column names manually (pandas) or use a different file entirely, but I want to know if there is a way to solve this problem.
What am I doing wrong?
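One possible workaround, as a rough sketch only: if the file is small enough to load on the driver and pyarrow (or fastparquet) is installed, read it outside Spark, rename the offending column, then create the Spark DataFrame from the cleaned pandas frame. The local path below is a placeholder for a copy of the HDFS file.
import pandas as pd
pdf = pd.read_parquet("/tmp/data.parquet")            # placeholder local copy of the parquet file
pdf = pdf.rename(columns={"Wafer ID": "Wafer_ID"})    # rename outside Spark
df = spark.createDataFrame(pdf)
df.printSchema()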

Why is my Glue table being created with the wrong path?

I'm creating a table in AWS Glue using a Spark job orchestrated by Airflow. It reads from a JSON file and writes a table; the command I use within the job is the following:
spark.sql(s"CREATE TABLE IF NOT EXISTS $database.$table using PARQUET LOCATION '$path'")
The odd thing here is that I have other tables created using the same job (with different names) and they are created without problems, e.g. they have the location
s3://bucket_name/databases/my_db/my_perfectly_created_table
There is exactly one table that gets created with this location:
s3://bucket_name/databases/my_db/my_problematic_table-__PLACEHOLDER__
I don't know where that -__PLACEHOLDER__ is coming from. I already tried deleting the table and recreating it, but it always does the same thing for this exact table. The data is in parquet format in the path:
s3://bucket_name/databases/my_db/my_problematic_table
so I know the problem is just creating the table correctly, because all I get is a col (array<string>) when trying to query it in Athena (as there is no data in /my_problematic_table-__PLACEHOLDER__).
Have any of you guys dealt with this before?
Upon closer inspection in AWS Glue, this specific problematic table had the following config, which is specific to CSV files and custom delimiters:
Input Format: org.apache.hadoop.mapred.SequenceFileInputFormat
Output Format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Serde serialization library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
while my other tables had the config specific to parquet:
Input Format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
Output Format: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Serde serialization library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
I tried to create the table forcing the config for parquet with the following command:
val path = "s3://bucket_name/databases/my_db/my_problematic_table/"
val my_table = spark.read.format("parquet").load(path)
val ddlSchema = my_table.toDF.schema.toDDL
spark.sql(s"""
|CREATE TABLE IF NOT EXISTS my_db.manual_myproblematic_table($ddlSchema)
|ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
|STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
|OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
|LOCATION '$path'
|""".stripMargin
)
but it threw the following error:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<1:string,2:string,3:string>, column: problematic_column
so the problem was the naming of those columns, "1", "2" and "3", within that struct.
Given that this struct did not contain valuable info, I ended up dropping it and creating the table again. Now it works like a charm and has the correct (parquet) config in Glue.
Hope this helps anyone.

Lambda Function to convert csv to parquet from s3

I have a requirement:
1. To convert a parquet file present in S3 to CSV format and place it back in S3. The process should exclude the use of EMR.
2. The parquet file has more than 100 columns; I need to extract just 4 of them and create the CSV in S3.
Does anyone have a solution for this?
Note - Cannot use EMR or AWS Glue
Assuming you want to keep things easy within the AWS environment and not use Spark (Glue / EMR), you could use AWS Athena in the following way:
Let's say your parquet files are located in s3://bucket/parquet/.
You can create a table in the Data Catalog (i.e. using Athena or a Glue crawler) pointing to that parquet location, for example by running something like this in the Athena SQL console:
CREATE EXTERNAL TABLE parquet_table (
col_1 string,
...
col_100 string)
PARTITIONED BY (date string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION 's3://bucket/parquet/' ;
Once you can query your parquet_table table, which reads the parquet files, you should be able to create the CSV files in the following way, using Athena too and selecting only the 4 columns you're interested in:
CREATE TABLE csv_table
WITH (
format = 'TEXTFILE',
field_delimiter = ',',
external_location = 's3://bucket/csv/'
)
AS SELECT col_1, col_2, col_3, col_4
FROM parquet_table ;
After this, you can actually drop the csv temporary table and only use the CSV files under s3://bucket/csv/, and build on that, for example by having an S3-triggered Lambda function do something else with them.
Remember that all of this can be done from Lambda, interacting with Athena (example here), and also bear in mind that Athena has an ODBC connector and PyAthena to use it from Python, among other options, so using Athena through Lambda or the AWS Console is not the only way to automate this.
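As a rough sketch of that Lambda/Athena interaction using boto3 (the database name and output location below are placeholders, and the query assumes the parquet_table defined above):
import boto3

athena = boto3.client("athena")

def handler(event, context):
    # Kick off the SELECT that produces the result files; Athena runs it
    # asynchronously and writes the output to the given S3 location.
    response = athena.start_query_execution(
        QueryString="SELECT col_1, col_2, col_3, col_4 FROM parquet_table",
        QueryExecutionContext={"Database": "my_database"},            # placeholder database
        ResultConfiguration={"OutputLocation": "s3://bucket/csv/"},   # placeholder output location
    )
    return response["QueryExecutionId"]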
I hope this helps.
Additional edit, on Sept 25th, 2019:
Answering your question about doing this in pandas: I think the best way would be a Glue Python Shell job, but you mentioned you didn't want to use Glue. So, if you decide to, here is a basic example of how to do it:
import sys
import pandas as pd
import boto3
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv,
                          ['region',
                           's3_bucket',
                           's3_input_folder',
                           's3_output_folder'])
## #params and #variables: [JOB_NAME]
## Variables used for now. Job input parameters to be used.
s3Bucket = args['s3_bucket']
s3InputFolderKey = args['s3_input_folder']
s3OutputFolderKey = args['s3_output_folder']
## AWS job settings
s3_resource = boto3.resource('s3')
s3_client = boto3.client('s3')
s3_bucket = s3_resource.Bucket(s3Bucket)
for s3_object in s3_bucket.objects.filter(Prefix=s3InputFolderKey):
    s3_key = s3_object.key
    s3_file_name = s3_key.split('/')[-1]
    s3_file = s3_client.get_object(Bucket=s3Bucket, Key=s3_key)
    df = pd.read_csv(s3_file['Body'], sep=';')
    # partKey_variable, year_variable, month_variable and day_variable are
    # placeholders: derive them from the file name or its contents as needed.
    partitioned_path = 'partKey={}/year={}/month={}/day={}'.format(
        partKey_variable, year_variable, month_variable, day_variable)
    s3_output_file = '{}/{}/{}'.format(s3OutputFolderKey, partitioned_path, s3_file_name)
    # Write the new dataset back to S3 (put() expects bytes/str, not a DataFrame).
    put_response = s3_resource.Object(s3Bucket, s3_output_file).put(Body=df.to_csv(index=False))
Carlos.
It all depends on your business requirement and what sort of action you want to take: an asynchronous or a synchronous call.
You can trigger a Lambda (GitHub example) asynchronously when a parquet file arrives in the specified S3 bucket (AWS S3 docs).
You can also configure the S3 service to send a notification to SNS or SQS when an object is added to or removed from the bucket, which in turn can invoke a Lambda to process the file (Triggering a Notification).
You can run a Lambda asynchronously every 5 minutes by scheduling it with AWS CloudWatch Events; the finest resolution using a cron expression is one minute.
You can invoke a Lambda synchronously over HTTPS (REST API endpoint) using API Gateway.
Also worth checking how big your parquet file is, as a Lambda can run for at most 15 minutes, i.e. 900 seconds.
Worth checking this page as well: Using AWS Lambda with Other Services.
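Putting those pieces together, a hedged sketch of the handler itself for the original requirement, assuming a pandas + pyarrow layer is attached to the Lambda and the parquet file fits in memory; the column names and the S3 event shape for the output key are placeholders:
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")
COLUMNS = ["col_1", "col_2", "col_3", "col_4"]   # placeholder: the 4 columns to keep

def handler(event, context):
    # Triggered by an S3 put event for the new parquet object.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.read_parquet(io.BytesIO(body), columns=COLUMNS)
    # Write the selected columns back to S3 as CSV.
    out_key = key.rsplit(".", 1)[0] + ".csv"
    s3.put_object(Bucket=bucket, Key=out_key, Body=df.to_csv(index=False))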
It is also worth taking a look at CTAS queries in Athena: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
We can store the query results in a different format using CTAS.

Using Pyspark how to convert Text file to CSV file

I am a new learner of PySpark. I have a requirement in my project to read a JSON file with a schema and convert it to a CSV file.
Can someone help me with how to proceed with this using PySpark?
You can load JSON and write CSV with a SparkSession:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("ETL").getOrCreate()
df = spark.read.json(path_to_txt)    # path to the input JSON file
df.write.csv(path_to_csv)            # write is called on the DataFrame, not on spark
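Since the question mentions reading the JSON with a schema, a minimal sketch of that part; the field names and paths below are made-up placeholders:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),   # placeholder fields
    StructField("age", IntegerType(), True),
])
df = spark.read.schema(schema).json("s3://my-bucket/input/data.json")   # placeholder path
df.write.mode("overwrite").option("header", True).csv("s3://my-bucket/output/csv/")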
