I am using Spark SQL to retrieve records from Redshift and write them to S3. I am following this link: Spark-Redshift. I am able to print the schema of the DataFrame, but I am not able to write to S3. It throws this error:
error: S3ServiceException:The AWS Access Key Id you provided does not exist in our records.,Status 403,Error InvalidAccessKeyId
What do I need to do to write to S3? Is there any other way to save Redshift data to S3 using Spark?
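For what it's worth, that 403 means AWS does not recognize the access key that Hadoop is handing to S3, so one common fix is to set the S3 credentials on the Hadoop configuration before writing. A minimal sketch only; every <placeholder> below (keys, host, bucket, paths) is an assumption to replace with your own values:
# Sketch: configure S3 credentials, read from Redshift via spark-redshift, write to S3
sc._jsc.hadoopConfiguration().set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
sc._jsc.hadoopConfiguration().set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>") \
    .option("dbtable", "<schema>.<table>") \
    .option("tempdir", "s3a://<bucket>/redshift-temp/") \
    .load()
df.write.mode("overwrite").parquet("s3a://<bucket>/output/")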
I'm using Spark to load data into a Hive table through PySpark, and when I load data from a path in Amazon S3, the original file is getting wiped from the directory. The file is found and populates the table with data. I also tried adding the LOCAL clause, but that throws an error when looking for the file. The documentation doesn't explicitly state whether this is the intended behavior.
spark.sql("CREATE TABLE src (key INT, value STRING) STORED AS textfile")
spark.sql("LOAD DATA INPATH 's3://bucket/kv1.txt' OVERWRITE INTO TABLE src")
I would like to "create if not exists" an S3 bucket named in YYYY-MM-DD format and store my transformed parquet files there. How do you achieve this in PySpark? Should I use boto3, or does PySpark have something built in?
I am using the code below to read data from S3. I would like to create the bucket and put my transformed files there.
spark_context._jsc.hadoopConfiguration().set("fs.s3a.access.key", config.access_id)
spark_context._jsc.hadoopConfiguration().set("fs.s3a.secret.key", config.access_key)
spark.conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
It seems you just need to enable fs.s3.buckets.create.enabled, which lets EMRFS create the bucket if it does not already exist when writing.
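Alternatively, since you mention boto3: a minimal sketch of creating the dated bucket yourself and then writing to it. The bucket prefix, region and output path are placeholders, and df stands for your transformed DataFrame:
from datetime import date
import boto3
from botocore.exceptions import ClientError

bucket_name = "my-data-{}".format(date.today().strftime("%Y-%m-%d"))
s3 = boto3.client("s3", region_name="us-east-1")
try:
    s3.head_bucket(Bucket=bucket_name)   # raises ClientError if the bucket is missing
except ClientError:
    # outside us-east-1 you also need CreateBucketConfiguration={'LocationConstraint': <region>}
    s3.create_bucket(Bucket=bucket_name)

df.write.mode("overwrite").parquet("s3a://{}/transformed/".format(bucket_name))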
I am trying to use Azure Databricks in order to:
1- Insert rows into a table of an Azure SQL Database with Python 3. I cannot find documentation about inserting rows. (I have used this link to connect to the database: Doc, and it is working.)
2- Save a CSV file in my data lake
3- Create a table from a DataFrame, if possible
Thanks for your help, and sorry for my novice questions.
**1- Insert rows into a table of an Azure SQL Database with Python 3**
Azure Databricks comes with the SQL Server JDBC driver installed. We can use the JDBC driver to write a DataFrame to SQL Server. For more details, please refer to here.
For example:
jdbcHostname = "<hostname>"
jdbcDatabase = "<database>"
jdbcPort = 1433
jdbcUsername = "<username>"
jdbcPassword = "<password>"
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
    "user": jdbcUsername,
    "password": jdbcPassword,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
# Write a sample DataFrame to the "users" table
df = spark.createDataFrame([(1, "test1"), (2, "test2")], ["id", "name"])
df.write.jdbc(url=jdbcUrl, table="users", mode="overwrite", properties=connectionProperties)
# Read the table back to verify the write
df1 = spark.read.jdbc(url=jdbcUrl, table="users", properties=connectionProperties)
display(df1)
**2- Create a table from a DataFrame**
If you want to create a Databricks table from a DataFrame, you can use the method registerTempTable or saveAsTable.
registerTempTable (superseded by createOrReplaceTempView) registers the DataFrame as a temporary view scoped to the Spark session in which it was created; nothing is persisted to storage, so the view disappears when the session ends.
saveAsTable creates a permanent, physical table, stored by default as Parquet in the workspace's root storage. The table is accessible to all clusters, and the table metadata, including the location of the file(s), is stored within the Hive metastore.
For more details, please refer to here and here.
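As a rough illustration of the two approaches (the table and view names below are just placeholders):
# Temporary view: visible only to the current Spark session, nothing written to storage
df = spark.createDataFrame([(1, "test1"), (2, "test2")], ["id", "name"])
df.createOrReplaceTempView("users_temp_view")   # newer equivalent of registerTempTable
spark.sql("SELECT * FROM users_temp_view").show()
# Permanent table: data written as Parquet and registered in the Hive metastore
df.write.mode("overwrite").saveAsTable("users_permanent")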
I have read other questions and I am confused about the options. I want to read an Athena view in EMR Spark, and from searching on Google/Stack Overflow I realized that these views are somehow stored in S3, so I first tried to find the external location of the view through
Describe mydb.Myview
It provides the schema but not the external location, from which I assumed that I cannot read it as a DataFrame from S3.
What I have considered so far for reading an Athena view in Spark
I have considered the following options:
Make a new table out of this Athena VIEW using a CTAS (CREATE TABLE ... AS) statement, with PARQUET as the external format:
CREATE TABLE Temporary_tbl_from_view
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/views_to_parquet/'
) AS
SELECT * FROM "mydb"."myview";
Another option is based on this answer, which suggests:
When you start an EMR cluster (v5.8.0 and later) you can instruct it to connect to your Glue Data Catalog. This is a checkbox in the 'create cluster' dialog. When you check this option your Spark SqlContext will connect to the Glue Data Catalog, and you'll be able to see the tables in Athena.
but I am not sure how I can query this view (not a table) in PySpark. If the Athena tables/views are available through the Glue catalog in the Spark context, will a simple statement like this work?
sqlContext.sql("SELECT * FROM mydb.myview")
Question: what is the more efficient way to read this view in Spark? Does recreating a table using the WITH statement (external location) mean that I am storing this data in the Glue catalog or S3 twice? If yes, how can I read it directly through S3 or the Glue catalog?
Just to share the solution I followed with others: I created my cluster with the following option enabled:
Use AWS Glue Data Catalog for table metadata
Afterwards, I saw the database name from AWS Glue and was able to see the desired view among the table names, as below:
spark.sql("use my_db_name")
spark.sql("show tables").show(truncate=False)
+------------+---------------------------+-----------+
|database |tableName |isTemporary|
+------------+---------------------------+-----------+
| my_db_name|tabel1 |false |
| my_db_name|desired_table |false |
| my_db_name|tabel3 |false |
+------------+---------------------------+-----------+
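With the Glue Data Catalog attached, the view could then be queried like any other table, e.g. (names taken from the listing above):
df = spark.sql("SELECT * FROM my_db_name.desired_table")
df.show(5)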
I have a requirement:
1. To convert a parquet file present in S3 to CSV format and place it back in S3. The process should exclude the use of EMR.
2. The parquet file has more than 100 columns; I need to extract just 4 of them and create the CSV in S3.
Does anyone have a solution for this?
Note: I cannot use EMR or AWS Glue.
Assuming you want to keep things simple within the AWS environment, without using Spark (Glue/EMR), you could use AWS Athena in the following way:
Let's say your parquet files are located in s3://bucket/parquet/.
You can create a table in the Data Catalog (e.g. using Athena or a Glue Crawler) pointing to that parquet location, for example by running something like this in the Athena SQL console:
CREATE EXTERNAL TABLE parquet_table (
  col_1 string,
  ...
  col_100 string)
PARTITIONED BY (`date` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://bucket/parquet/';
Once you can query your parquet_table table, which reads the parquet files (for a partitioned table you will also need to load the partitions first, e.g. with MSCK REPAIR TABLE parquet_table), you should be able to create the CSV files in the following way, again using Athena and choosing only the 4 columns you're interested in:
CREATE TABLE csv_table
WITH (
format = 'TEXTFILE',
field_delimiter = ',',
external_location = 's3://bucket/csv/'
)
AS SELECT col_1, col_2, col_3, col_4
FROM parquet_table ;
After this, you can drop the csv_table helper table and just use the CSV files under s3://bucket/csv/, for example by having an S3-triggered Lambda function do further processing.
Remember that all of this can be driven from Lambda by interacting with Athena (example here). Also bear in mind that Athena has an ODBC connector and PyAthena for use from Python, among other options, so Lambda and the AWS Console are not the only ways to run this if you want to automate it differently.
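For example, a rough sketch of driving that CTAS query from a Lambda handler with boto3; the database name and result location are placeholders:
import boto3

athena = boto3.client("athena")

def handler(event, context):
    response = athena.start_query_execution(
        QueryString="""
            CREATE TABLE csv_table
            WITH (format = 'TEXTFILE',
                  field_delimiter = ',',
                  external_location = 's3://bucket/csv/')
            AS SELECT col_1, col_2, col_3, col_4 FROM parquet_table
        """,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://bucket/athena-query-results/"},
    )
    # Athena runs asynchronously; poll get_query_execution with this id if you need to wait
    return response["QueryExecutionId"]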
I hope this helps.
Additional edit, on Sept 25th, 2019:
Answering your question about doing this in pandas: I think the best way would be a Glue Python Shell job, but you mentioned you didn't want to use Glue. So, if you decide to, here is a basic example of how to:
import sys
import boto3
import pandas as pd
from awsglue.utils import getResolvedOptions

## Job input parameters
args = getResolvedOptions(sys.argv,
                          ['region',
                           's3_bucket',
                           's3_input_folder',
                           's3_output_folder'])
s3Bucket = args['s3_bucket']
s3InputFolderKey = args['s3_input_folder']
s3OutputFolderKey = args['s3_output_folder']

## AWS clients
s3_resource = boto3.resource('s3')
s3_client = boto3.client('s3')
s3_bucket = s3_resource.Bucket(s3Bucket)

for s3_object in s3_bucket.objects.filter(Prefix=s3InputFolderKey):
    s3_key = s3_object.key
    s3_file_name = s3_key.split('/')[-1]
    s3_file = s3_client.get_object(Bucket=s3Bucket, Key=s3_key)
    # Read the source file (adjust the reader to your input format)
    df = pd.read_csv(s3_file['Body'], sep=';')
    # partKey_variable, year_variable, month_variable and day_variable are
    # placeholders to be derived from your own data
    partitioned_path = 'partKey={}/year={}/month={}/day={}'.format(
        partKey_variable, year_variable, month_variable, day_variable)
    s3_output_file = '{}/{}/{}'.format(s3OutputFolderKey, partitioned_path, s3_file_name)
    # Write the new dataset to S3 as CSV
    put_response = s3_resource.Object(s3Bucket, s3_output_file).put(Body=df.to_csv(index=False))
Carlos.
It all depends on your business requirement and what sort of action you want to take, i.e. an asynchronous or a synchronous call.
You can trigger a Lambda (github example) asynchronously when a parquet file arrives in the specified bucket (aws s3 doc).
You can also configure S3 to send a notification to SNS or SQS when an object is added to or removed from the bucket, which in turn can invoke a Lambda to process the file (Triggering a Notification).
You can run a Lambda asynchronously every 5 minutes by scheduling it with AWS CloudWatch Events; the finest resolution using a cron expression is one minute.
You can invoke a Lambda synchronously over HTTPS (REST API endpoint) using API Gateway.
It is also worth checking how big your parquet file is, since a Lambda can run for at most 15 minutes, i.e. 900 seconds.
Worth checking this page as well: Using AWS Lambda with Other Services.
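For the S3-trigger route, here is a rough, hypothetical handler sketch that converts an incoming parquet object to CSV with only four columns, assuming pandas and pyarrow are available to the function (e.g. via a Lambda layer); the bucket layout and column names are placeholders:
import io
import boto3
import pandas as pd

s3 = boto3.client("s3")
WANTED_COLUMNS = ["col_1", "col_2", "col_3", "col_4"]

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Read the parquet object and load only the needed columns
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        df = pd.read_parquet(io.BytesIO(body), columns=WANTED_COLUMNS)
        # Write the CSV back to the same bucket under a csv/ prefix
        out_key = "csv/" + key.rsplit("/", 1)[-1].replace(".parquet", ".csv")
        s3.put_object(Bucket=bucket, Key=out_key, Body=df.to_csv(index=False))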
It is also worth taking a look at CTAS queries in Athena: https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html
We can store the query results in a different format using CTAS.