I am trying to get metadata from a bucket/prefix on S3 using DataHub, but I am getting this error:
{logger:26} - Please set env variable SPARK_VERSION.
The DataHub S3 documentation mentions:
Profiles are computed with PyDeequ, which relies on PySpark. Therefore, for computing profiles, we currently require Spark 3.0.3 with Hadoop 3.2 to be installed and the SPARK_HOME and SPARK_VERSION environment variables to be set.
Where should I install Spark and set SPARK_VERSION? In the container? I have Spark installed locally.
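For profiling, those variables just have to be visible to whichever process actually runs the ingestion, so if the recipe runs inside a container, Spark needs to be installed and the variables set inside that container rather than only on the host. A minimal sketch, assuming the ingestion is driven from Python and Spark 3.0.3 lives at a hypothetical /opt/spark path:

import os

# Hypothetical install location -- point this at wherever Spark 3.0.3 / Hadoop 3.2
# actually lives in the environment (container or local) that runs the ingestion.
os.environ["SPARK_HOME"] = "/opt/spark"
# Depending on what PyDeequ expects, this may need to be just the major.minor
# version (e.g. "3.0") rather than the full "3.0.3" -- check the PyDeequ docs.
os.environ["SPARK_VERSION"] = "3.0"

# ... run the DataHub ingestion from this same process/environment ...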
I'm having trouble understanding how to connect Kafka and PySpark.
I have a Kafka installation on Windows 10 with a topic nicely streaming data.
I've installed PySpark, which runs properly: I'm able to create a test DataFrame without a problem.
But when I try to connect to the Kafka stream it gives me this error:
AnalysisException: Failed to find data source: kafka. Please deploy
the application as per the deployment section of "Structured Streaming-
Kafka Integration Guide".
Spark documentation is not really helpful - it says:
...
groupId = org.apache.spark
artifactId = spark-sql-kafka-0-10_2.12
version = 3.2.0
...
For Python applications, you need to add this above library and its dependencies when deploying your application. See the Deploying subsection below.
And then when you go to the Deploying section it says:
As with any Spark applications, spark-submit is used to launch your application. spark-sql-kafka-0-10_2.12 and its dependencies can be directly added to spark-submit using --packages, such as,
./bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 ...
I'm developing an app; I don't want to deploy it.
Where and how do I add these dependencies if I'm developing a PySpark app?
I've tried several tutorials and only ended up more confused.
I saw an answer saying that
"You need to add kafka-clients JAR to your --packages" (SO answer).
A few more steps would be useful, because for someone who is new this is unclear.
versions:
kafka 2.13-2.8.1
spark 3.1.2
java 11.0.12
All environment variables and paths are correctly set.
EDIT
I've loaded:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.apache.kafka:kafka-clients:2.8.1'
as suggested, but I'm still getting the same error.
I've triple-checked the Kafka, Scala and Spark versions and tried various combinations, but it didn't work; I'm still getting the same error:
AnalysisException: Failed to find data source: kafka. Please deploy
the application as per the deployment section of "Structured Streaming-Kafka Integration Guide".
EDIT 2
I installed the latest Spark 3.2.0 and Hadoop 3.3.1 and Kafka version kafka_2.12-2.8.1. I changed all the environment variables and tested Spark and Kafka; both work properly.
My environment variable looks like this now:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,org.apache.kafka:kafka-clients:2.8.1'
Still no luck, I get the same error :(
Spark documentation is not really helpful - it says ... artifactId = spark-sql-kafka-0-10_2.12 version = 3.2.0 ...
Yes, that is correct... but that is for the latest version of Spark.
versions:
spark 3.1.2
Have you tried looking at the version specific docs?
In other words, you want the matching spark-sql-kafka version of 3.1.2.
bin/spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2
Or in Python,
from pyspark.sql import SparkSession

scala_version = '2.12'
spark_version = '3.1.2'
# TODO: Ensure the above values match your installed Scala and Spark versions
packages = [
    f'org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}',
    'org.apache.kafka:kafka-clients:3.2.1'
]
spark = SparkSession.builder \
    .master("local") \
    .appName("kafka-example") \
    .config("spark.jars.packages", ",".join(packages)) \
    .getOrCreate()
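Once the session comes up with those packages resolved, a quick sanity check is to open a stream against your topic; if the data source still isn't found, the packages weren't picked up. A minimal sketch, assuming a broker on localhost:9092 and a hypothetical topic name my-topic:

# Assumes a local broker at localhost:9092 and a topic called "my-topic" -- adjust to your setup
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "my-topic") \
    .load()

# Kafka delivers key/value as binary, so cast them to strings to inspect the payload
query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()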
Or with an env-var
import os
spark_version = '3.1.2'
# This must be set before the SparkSession / SparkContext is created
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:{}'.format(spark_version)
# init spark here
need to add this above library and its dependencies
As you found in my previous answer, also append the kafka-clients package using a comma-separated list:
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2,org.apache.kafka:kafka-clients:2.8.1
I'm developing an app; I don't want to deploy it.
"Deploy" is Spark terminology. Running locally is still a "deployment"
I want to read data from a Cassandra node on my client node:
This is what I tried:
spark-shell --jars /my-dir/spark-cassandra-connector_2.11-2.3.2.jar
val df = spark.read.format("org.apache.spark.sql.cassandra").
  option("keyspace", "my_keyspace").
  option("table", "my_table").
  option("spark.cassandra.connection.host", "Hostname of my Cassandra node").
  option("spark.cassandra.connection.port", "9042").
  option("spark.cassandra.auth.password", "mypassword").
  option("spark.cassandra.auth.username", "myusername").
  load()
I'm getting this error: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.cassandra.DefaultSource$
and
java.lang.NoClassDefFoundError: org/apache/commons/configuration/ConfigurationException.
Am I missing any properties? What is this error about? How can I resolve it?
Spark version: 2.3.2, DSE version: 6.7.8
The Spark Cassandra Connector itself depends on a number of other dependencies that could be missing here; this happens because you're providing only one jar rather than all of the required dependencies.
Basically, in your case you have the following choices:
If you're running this on a DSE node, you can use the built-in Spark if the cluster has Analytics enabled. In this case all jars and properties are already provided, and you only need to supply a username and password when starting the Spark shell via dse -u user -p password spark.
If you're using external Spark, then it's better to use the so-called BYOS (bring your own Spark) jar, a special build of the Spark Cassandra Connector with all dependencies bundled inside; you can download that jar from DataStax's Maven repo and use it with --jars.
You can still use the open source Spark Cassandra Connector, but in that case it's better to use --packages com.datastax.spark:spark-cassandra-connector_2.11:2.3.2 so Spark will be able to fetch all dependencies automatically (see the sketch after this answer).
P.S. If you go with the open source Spark Cassandra Connector, I would recommend using version 2.5.1 or higher, even though it requires Spark 2.4.x (2.3.x may work). That version has improved support for DSE, plus a lot of new functionality not available in earlier versions. There is also an assembly artifact for that version that bundles all required dependencies, which you can use with --jars if your machine doesn't have access to the internet.
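To make the third option concrete, here is what it can look like from PySpark rather than spark-shell; a minimal sketch, assuming the open source connector 2.5.1 on Spark 2.4.x, with placeholder host and credentials:

from pyspark.sql import SparkSession

# Placeholder host and credentials -- replace with your own values
spark = SparkSession.builder \
    .appName("cassandra-read-example") \
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.11:2.5.1") \
    .config("spark.cassandra.connection.host", "cassandra-host") \
    .config("spark.cassandra.connection.port", "9042") \
    .config("spark.cassandra.auth.username", "myusername") \
    .config("spark.cassandra.auth.password", "mypassword") \
    .getOrCreate()

df = spark.read.format("org.apache.spark.sql.cassandra") \
    .option("keyspace", "my_keyspace") \
    .option("table", "my_table") \
    .load()

With --packages (or spark.jars.packages) Spark resolves the connector's transitive dependencies itself, which is exactly what was missing with the single --jars file above.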
I've recently been trying to load Parquet files in my local Spark environment. I recently updated my EMR cluster (previously on Spark 2.3.0) to Spark 2.4.4 on the latest EMR 5.29.
However, when I update the versions in my project and read the bucket using s3a, the following comes up while debugging:
org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on mybucket com.amazonaws.SdkClientException: Failed to connect to service endpoint: : Failed to connect to service endpoint:
These are the dependencies being used:
val spark = "2.4.4"
val awsSdk = "1.11.682"
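Not an answer to the version mismatch itself, but one common cause of that "Failed to connect to service endpoint" message when running s3a outside of EC2/EMR is the AWS SDK falling back to the EC2 instance-metadata endpoint for credentials, which isn't reachable locally. A minimal sketch of supplying credentials explicitly for a local run, assuming hadoop-aws and a matching aws-java-sdk are already on the classpath, with placeholder keys and path:

from pyspark.sql import SparkSession

# Placeholder credentials -- in practice prefer a profile or environment variables
spark = SparkSession.builder \
    .appName("s3a-local-read") \
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY") \
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY") \
    .getOrCreate()

# "mybucket" is the bucket from the error above; the prefix is a placeholder
df = spark.read.parquet("s3a://mybucket/path/to/parquet")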
Another issue is that when using the older version (Spark 2.3.0), I wasn't able to read newly generated data, as something in the serialization was updated (I was unable to parse a Scala Array or List into whatever new class Spark now uses internally).
I have a set of AWS instances where an Apache Hadoop distribution along with Apache Spark is set up.
I am trying to access DynamoDB through Spark Streaming for reading from and writing to a table, but
while writing the Spark-DynamoDB code, I learned that emr-ddb-hadoop.jar is required to get the DynamoDB InputFormat and OutputFormat, and that jar is present only on an EMR cluster.
After checking a few blogs, it seems that it is accessible only with EMR Spark.
Is that correct?
However, I use the standalone Java SDK to access DynamoDB, and that works fine.
I found the solution to the problem.
I downloaded the emr-ddb-hadoop.jar file from EMR and am using it in my environment.
Please note: to work with DynamoDB, we only need the above jar.
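For anyone else wiring this up outside EMR, here is a rough sketch of what using that jar from PySpark can look like. The class and property names below are the ones used by the emr-dynamodb-hadoop connector, but treat the jar path, table, region, and endpoint as placeholders and verify the property names against the connector's documentation:

from pyspark.sql import SparkSession

# Ship the jar copied from EMR with the application (placeholder path)
spark = SparkSession.builder \
    .appName("dynamodb-read-example") \
    .config("spark.jars", "/path/to/emr-ddb-hadoop.jar") \
    .getOrCreate()

# Placeholder table/region/endpoint; property names come from the emr-dynamodb-hadoop connector
conf = {
    "dynamodb.servicename": "dynamodb",
    "dynamodb.input.tableName": "my_table",
    "dynamodb.regionid": "us-east-1",
    "dynamodb.endpoint": "dynamodb.us-east-1.amazonaws.com",
}

rdd = spark.sparkContext.hadoopRDD(
    inputFormatClass="org.apache.hadoop.dynamodb.read.DynamoDBInputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.dynamodb.DynamoDBItemWritable",
    conf=conf,
)
# Turning DynamoDBItemWritable values into Python objects may additionally
# require key/value converters; the RDD above only sets up the read.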
I just set up a Spark cluster in Google Cloud using DataProc, and I have a standalone installation of Cassandra running on a separate VM. I would like to install the DataStax spark-cassandra-connector so I can connect to Cassandra from Spark. How can I do this?
The connector can be downloaded here:
https://github.com/datastax/spark-cassandra-connector
The instructions on building are here:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/12_building_and_artifacts.md
sbt is needed to build it.
Where can I find sbt in the DataProc installation?
Would it be under $SPARK_HOME/bin? Where is Spark installed for DataProc?
I'm going to follow up on the really helpful comment @angus-davis made not too long ago.
Where can I find sbt in the DataProc installation?
At present, sbt is not included on Cloud Dataproc clusters. The sbt documentation contains information on how to install sbt manually. If you need to re-install sbt on your clusters, I highly recommend you create an init action to install sbt when you create a cluster. After some research, it looks like SBT is covered under a BSD-3 license, which means we can probably (no promise) include it in Cloud Dataproc clusters.
Would it be under $SPARK_HOME/bin? Where is Spark installed for DataProc?
The answer to this is it depends on what you mean.
binaries - /usr/bin
config - /etc/spark/conf
spark_home - /usr/lib/spark
Importantly, this same pattern is used for other major OSS components installed on Cloud Dataproc clusters, like Hadoop and Hive.
I would like to install the DataStax spark-cassandra-connector so I can connect to Cassandra from Spark. How can I do this?
The Stack Overflow answer Angus sent is probably the easiest way, if it can be used as a Spark package. Based on what I can find, however, this is probably not an option. This means you will need to install sbt and build/install the connector manually.
You can use Cassandra along with the mentioned jar and connector from DataStax. You can simply download the jar and pass it to the Dataproc cluster. You can find a Google-provided template, which I contributed to, at this link [1]. It explains how you can use the template to connect to Cassandra using Dataproc.
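To make the "download the jar and pass it to the cluster" route concrete, here is a minimal sketch of a PySpark job you could submit to the Dataproc cluster; the jar location, connector version, Cassandra VM address, and keyspace/table names are all placeholders to match to your own setup:

from pyspark.sql import SparkSession

# Placeholder jar location and host -- an assembly/shaded connector build avoids
# missing transitive dependencies; connection.host is the Cassandra VM's IP.
spark = SparkSession.builder \
    .appName("dataproc-cassandra-example") \
    .config("spark.jars", "gs://my-bucket/spark-cassandra-connector-assembly_2.12-3.1.0.jar") \
    .config("spark.cassandra.connection.host", "10.128.0.5") \
    .getOrCreate()

df = spark.read.format("org.apache.spark.sql.cassandra") \
    .option("keyspace", "my_keyspace") \
    .option("table", "my_table") \
    .load()
df.show()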