How do I write a query from Spark to Redshift? - apache-spark

I connected via SSH to a Dev Endpoint in Glue.
Spark 2.4.1 is running there.
I want to run a simple query: select * from pg_namespace;
After that, I also want to move data from S3 to Redshift using the COPY command.
How do I write that in a Spark console?
Thanks.

I am not sure whether you can use the COPY command directly, and I haven't tried it.
For moving data from S3 to Redshift, you can use the AWS Glue APIs. Please check here for sample code from AWS. Behind the scenes, I think AWS Glue uses COPY/UNLOAD commands to move data between S3 and Redshift.
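As a rough illustration of that Glue API route (not taken from the AWS samples), a PySpark sketch could look like the following; the S3 paths, Glue connection name, database, and table names are assumptions:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)

    # Read the source files from S3 into a DynamicFrame (path is a placeholder)
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/input/"]},
        format="csv",
        format_options={"withHeader": True},
    )

    # Write to Redshift through a pre-defined Glue connection; Glue stages the
    # data in redshift_tmp_dir and runs the COPY for you behind the scenes
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=dyf,
        catalog_connection="my-redshift-connection",  # name of your Glue connection (assumed)
        connection_options={"dbtable": "public.my_table", "database": "dev"},
        redshift_tmp_dir="s3://my-bucket/temp/",
    )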

You can use the AWS CLI and psql from your SSH terminal.
For psql, check https://docs.aws.amazon.com/redshift/latest/mgmt/connecting-from-psql.html
Then you can run the select and COPY commands from it.
But I would not recommend this: AWS Glue is a serverless service, so your cluster will be different every time.
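For reference, connecting and running both statements from the terminal looks roughly like this; the endpoint, user, database, port, bucket, and IAM role are all placeholders:

    psql -h my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com -U awsuser -d dev -p 5439

    -- inside the psql session
    select * from pg_namespace;

    copy public.my_table
    from 's3://my-bucket/input/'
    iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    format as csv;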

Related

Fb-Prophet, Apache Spark in Colab and AWS SageMaker/Lambda

I am using Google Colab to create a model with FbProphet, and I am trying to use Apache Spark in the Google Colab notebook itself. Can I upload this Google Colab notebook to AWS SageMaker/Lambda for free (without being charged for Apache Spark and only being charged for AWS SageMaker)?
In short, you can upload the notebook into SageMaker without any issue. A few things to keep in mind:
If you are using the pyspark library in Colab and running Spark locally, you should be able to do the same by installing the necessary pyspark libraries in SageMaker Studio kernels (see the sketch after this list). Here you will only pay for the underlying compute of the notebook instance. If you are just experimenting, I would recommend using https://studiolab.sagemaker.aws/ to create a free account and try things out.
If you had a separate Spark cluster set up, you may need a similar setup in AWS using EMR so that you can connect to the cluster to execute the job.
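As a rough sketch of the first scenario, running Spark locally inside a SageMaker (or Studio Lab) notebook kernel only needs the pyspark package installed; nothing below is SageMaker-specific, and the app name is arbitrary:

    # install the library into the notebook kernel first, e.g. `pip install pyspark`
    from pyspark.sql import SparkSession

    # local[*] runs Spark on the notebook instance itself,
    # so you only pay for that instance's compute
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("fbprophet-experiments")
        .getOrCreate()
    )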

Databricks Lakehouse JDBC and Docker

I'm pretty new to Databricks.
I have a requirement to access data in the Lakehouse using a JDBC driver. This works fine.
I now want to stub the Lakehouse with a Docker image for some tests I want to write. Is it possible to get a Databricks/Spark Docker image with a database in it? I would also want to bootstrap the database on startup to create a bunch of tables.
No. Databricks is not a database but a hosted service (PaaS). You could theoretically use OSS Spark with the Thrift server started on it, but the connection strings and other functionality would be very different, so it makes no sense to spend time on it (IMHO). The real solution would depend on the type of tests you want to run.
Regarding bootstrapping the database and creating a bunch of tables: just issue those commands, such as create database if not exists or create table if not exists, when your application starts up (see the documentation for the exact syntax).
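A minimal sketch of such a startup hook, assuming an existing Spark session and placeholder names:

    # idempotent bootstrap, run once when the application starts
    spark.sql("CREATE DATABASE IF NOT EXISTS test_db")
    spark.sql("""
        CREATE TABLE IF NOT EXISTS test_db.customers (
            id   BIGINT,
            name STRING
        )
    """)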

How to read/load local files in Databricks?

Is there any way of reading files located on my local machine other than navigating to 'Data' > 'Add Data' on Databricks?
In my past experience using Databricks with S3 buckets, I was able to read and load a DataFrame by just specifying the path, i.e.:
df = spark.read.format('delta').load('<path>')
Is there any way I can do something like this in Databricks to read local files?
If you use the Databricks Connect client library, you can read local files into memory on a remote Databricks Spark cluster. See details here.
The alternative is to use the Databricks CLI (or REST API) and push local data to a location on DBFS, where it can be read into Spark from within a Databricks notebook. A similar idea would be to use the AWS CLI to put local data into an S3 bucket that can be accessed from Databricks.
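A quick sketch of that DBFS route; the local and DBFS paths are placeholders:

    # on your local machine, push the file with the Databricks CLI:
    #   databricks fs cp ./local_data.csv dbfs:/tmp/local_data.csv
    # then, inside a Databricks notebook, read it as usual:
    df = spark.read.format("csv").option("header", "true").load("dbfs:/tmp/local_data.csv")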
It sounds like what you are looking for is Databricks Connect, which works with many popular IDEs.

Copy data into postgres from Redshift using Node.js

Is there an efficient way to copy a table from Redshift to Postgres using Node.js? I couldn't find any concrete examples.
There does not seem to be any pre-written utility. The process you must set up for anything more than just a few rows is:
Push the data to S3
Use the AWS COPY command (via the SDK) to copy from S3 to Redshift
Transform the data in Redshift (optional)
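For the second step, the statement you would issue against Redshift (whether through the AWS SDK or any Postgres-protocol client) looks roughly like this; the table, bucket, and IAM role are placeholders:

    copy public.my_table
    from 's3://my-bucket/exported-data/'
    iam_role 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    format as csv;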

Can Spark access DynamoDb without EMR

I have a set of AWS instances where an Apache Hadoop distribution along with Apache Spark is set up.
I am trying to access DynamoDB through Spark Streaming for reading from and writing to a table.
While writing the Spark-DynamoDB code, I learned that emr-ddb-hadoop.jar is required to get the DynamoDB InputFormat and OutputFormat, and that it is present only on an EMR cluster.
After checking a few blogs, it seems that this is accessible only with EMR Spark.
Is that correct?
However, I used the standalone Java SDK to access DynamoDB, and that worked fine.
I found the solution to the problem.
I downloaded the emr-ddb-hadoop.jar file from EMR and am using it in my environment.
Please note: to access DynamoDB this way, only the above jar is needed.
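For completeness, a rough PySpark sketch of reading a table through that connector once emr-ddb-hadoop.jar is on the driver and executor classpath (e.g. passed via --jars); the table name and region are placeholders, and the exact configuration keys may vary by connector version:

    # read a DynamoDB table as an RDD using the EMR DynamoDB connector classes
    conf = {
        "dynamodb.servicename": "dynamodb",
        "dynamodb.input.tableName": "my_table",  # placeholder
        "dynamodb.regionid": "us-east-1",        # placeholder
    }
    rdd = sc.hadoopRDD(
        inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
        conf=conf,
    )
    print(rdd.count())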
