I am able to read and write files from a Spark standalone cluster to S3 using the configuration below.
val spark = SparkSession.builder()
  .appName("Sample App")
  .config("spark.master", "spark://spark:7077")
  .config("spark.hadoop.fs.s3a.path.style.access", value = true)
  .config("fs.s3a.fast.upload", value = true)
  .config("fs.s3a.connection.ssl.enabled", value = false)
  .config("mapreduce.fileoutputcommitter.algorithm.version", value = 2)
  .config("spark.hadoop.fs.s3a.access.key", "Access Key Value")
  .config("spark.hadoop.fs.s3a.secret.key", "Secret Key Value")
  .config("spark.hadoop.fs.s3a.endpoint", "End-Point Value")
  .getOrCreate()
But my requirement is to reuse the connection to S3 instead of passing the S3 keys every time I create a SparkSession, like a mount point in Databricks.
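If the goal is just to stop hard-coding the keys, one common option (a sketch using standard S3A/Spark properties, not a Databricks-style mount) is to put the S3A settings into $SPARK_HOME/conf/spark-defaults.conf on the cluster, so every SparkSession picks them up without any .config() calls:

# spark-defaults.conf (placeholder values)
spark.hadoop.fs.s3a.endpoint                    End-Point Value
spark.hadoop.fs.s3a.access.key                  Access Key Value
spark.hadoop.fs.s3a.secret.key                  Secret Key Value
spark.hadoop.fs.s3a.path.style.access           true
spark.hadoop.fs.s3a.connection.ssl.enabled      false
spark.hadoop.fs.s3a.fast.upload                 true

The same properties (without the spark.hadoop. prefix) can instead go into Hadoop's core-site.xml, and Hadoop credential providers are another option if the keys should not sit in plain text.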
I have a Spark streaming application that reads multiple paths in a bucket.
Every path has a CSV with a specific schema. How can I set the schema according to the path Spark is reading?
Example:
# bucket structure: bucket_name/table_1/year/month/day/file.csv
schemas = {"table_1":"id INT, name STRING, status STRING", "table_2":"col1 STRING, col2 STRING"}
from pyspark.sql.functions import input_file_name, current_timestamp, row_number, lit
from pyspark.sql.window import Window

df_changes = spark.readStream\
    .format("csv")\
    .option("delimiter", "|")\
    .option("header", True)\
    .option("multiLine", True)\
    .option("ignoreLeadingWhiteSpace", True)\
    .option("ignoreTrailingWhiteSpace", True)\
    .option("escape", "\"")\
    .load(f"s3a://bucket_name/*/*/*/*/*/*.csv")\
    .withColumn("file_path", input_file_name())\
    .withColumn("raw_timestamp", current_timestamp())

def append_data(df, batchId):
    # Derive the target path from the file path of the first row in the batch
    file_path = df.first()['file_path']
    lista = file_path.split('/')[:-4]
    full_path = '/'.join(lista).replace('landing', 'raw') + '/cdc'
    # Number the rows in the batch before appending to the Delta table
    windowSpec = Window.partitionBy().orderBy(lit(None))
    df.withColumn("row_number", row_number().over(windowSpec)).write.format("delta").mode('append').save(full_path)

df_changes.writeStream \
    .foreachBatch(append_data) \
    .option("checkpointLocation", "/checkpoint/") \
    .start()
Is it possible to set the schema dynamically during the Spark load?
Something like: .load(f"s3a://bucket_name/*/*/*/*/*/*.csv", schemas=schemas[path-from-spark-splited])
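One possible approach (a sketch I am adding, not from the original post; it assumes the first folder under the bucket is the table name and matches a key in schemas): since a single load() takes only one schema, start one streaming query per table, each with its own schema string and checkpoint location:

# Sketch: one stream per table, each with its own DDL schema string
for table, schema in schemas.items():
    df = (spark.readStream
          .format("csv")
          .schema(schema)                      # e.g. "id INT, name STRING, status STRING"
          .option("delimiter", "|")
          .option("header", True)
          .option("multiLine", True)
          .option("escape", "\"")
          .load(f"s3a://bucket_name/{table}/*/*/*/*.csv")  # adjust the glob depth to the real layout
          .withColumn("file_path", input_file_name())
          .withColumn("raw_timestamp", current_timestamp()))

    df.writeStream \
      .foreachBatch(append_data) \
      .option("checkpointLocation", f"/checkpoint/{table}/") \
      .start()

This trades a single query for one query per table, but each stream gets the right schema at load time instead of patching it up inside foreachBatch.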
I am trying to write a data pipeline that reads a .tsv file from Azure Blob Storage and writes the data to a MySQL database. I have a sensor that looks for a file with a given prefix in my storage container, and then a SparkSubmitOperator which actually reads the data and writes it to the database.
The sensor works fine, and when I write the data from local storage to MySQL, that works fine as well. However, I am having quite a bit of trouble reading the data from Blob Storage.
This is the simple Spark job that I am trying to run:
from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .config("fs.azure", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
    .config("fs.azure.account.key.{}.blob.core.windows.net".format(blob_account_name), blob_account_key)
    .getOrCreate()
)

sc = spark.sparkContext
sc.setLogLevel("WARN")

# blob_* and mysql_* values come from the application_args passed by the operator
df_tsv = spark.read.csv("wasb://{}@{}.blob.core.windows.net/{}".format(blob_container, blob_account_name, blob_name), sep=r'\t', header=True)

mysql_url = 'jdbc:mysql://' + mysql_server
df_tsv.write.jdbc(url=mysql_url, table=mysql_table, mode="append",
                  properties={"user": mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver"})
This is my SparkSubmitOperator,
spark_job = SparkSubmitOperator(
    task_id="my-spark-app",
    application="path/to/my/spark/job.py",  # Spark application path created in airflow and spark cluster
    name="my-spark-app",
    conn_id="spark_default",
    verbose=1,
    conf={"spark.master": spark_master},
    application_args=[tsv_file, mysql_server, mysql_user, mysql_password, mysql_table],
    jars=azure_hadoop_jar + ", " + mysql_driver_jar,
    driver_class_path=azure_hadoop_jar + ", " + mysql_driver_jar,
    dag=dag)
I keep getting this error,
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
What exactly am I doing wrong?
I have both mysql-connector-java-8.0.27.jar and hadoop-azure-3.3.1.jar in my application. I have given the path to these in the driver_class_path and jars parameters. Is there something wrong with how I have done that here?
I have tried following the suggestions given here, Saving Pyspark Dataframe to Azure Storage, but they have not been helpful.
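One thing worth checking (an assumption on my part, not something confirmed in the post): spark-submit treats --jars as a comma-separated list with no spaces, while --driver-class-path is an ordinary classpath (colon-separated on Linux), so the ", " separator may keep hadoop-azure-3.3.1.jar off the classpath. A sketch of the adjusted arguments:

# Sketch: same operator, different separators for the jar arguments
spark_job = SparkSubmitOperator(
    task_id="my-spark-app",
    application="path/to/my/spark/job.py",
    name="my-spark-app",
    conn_id="spark_default",
    verbose=1,
    conf={"spark.master": spark_master},
    application_args=[tsv_file, mysql_server, mysql_user, mysql_password, mysql_table],
    jars=azure_hadoop_jar + "," + mysql_driver_jar,               # no space after the comma
    driver_class_path=azure_hadoop_jar + ":" + mysql_driver_jar,  # classpath separator
    dag=dag)

If the class is still not found after that, hadoop-azure's own dependencies (for example the azure-storage jar) may also need to be added alongside it.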
I have a Cassandra table that is created as follows (in cqlsh):
CREATE TABLE blog.session( id int PRIMARY KEY, visited text);
I write data to Cassandra and it looks like this
id | visited
1 | Url1-Url2-Url3
I then try to read it using the Spark Cassandra Connector (2.5.1):
val sparkSession = SparkSession.builder()
  .master("local")
  .appName("ReadFromCass")
  .config("spark.cassandra.connection.host", "localhost")
  .config("spark.cassandra.connection.port", "9042")
  .getOrCreate()

import sparkSession.implicits._
import org.apache.spark.sql.cassandra._ // provides cassandraFormat on DataFrameReader

val readSessions = sparkSession.sqlContext
  .read
  .cassandraFormat("table1", "keyspace1")
  .load()
readSessions.show()
However, it seems unable to read the visited column, since it is a text value with dashes between the words. The error is:
org.apache.spark.unsafe.types.UTF8String is not a valid external type for schema of string
Any ideas on why Spark is unable to read this and how to fix it?
The error turned out to be caused by the version of the spark-cassandra-connector. Instead of using "2.5.1", use "3.0.0-beta".
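For reference, the dependency change in an sbt build would look roughly like this (assuming sbt and a Scala 2.12 build; adjust the coordinates for Maven):

// build.sbt: bump the connector from 2.5.1 to the 3.0.0-beta release
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "3.0.0-beta"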
I connect to a remote Hive cluster using the following code, and I get the table data as expected:
val spark = SparkSession
  .builder()
  .appName("adhocattempts")
  .config("hive.metastore.uris", "thrift://<remote-host>:9083")
  .enableHiveSupport()
  .getOrCreate()

val seqdf = spark.sql("select * from anon_seq")
seqdf.show()
However, when I try to do this via HiveServer2, I get no data in my DataFrame. This table is based on a SequenceFile. Is that the issue, since I am actually trying to read it via JDBC?
val sparkJdbc = SparkSession.builder.appName("SparkHiveJob").getOrCreate
val sc = sparkJdbc.sparkContext
val sqlContext = sparkJdbc.sqlContext
val driverName = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driverName)
val df = sparkJdbc.read
  .format("jdbc")
  .option("url", "jdbc:hive2://<remote-host>:10000/default")
  .option("dbtable", "anon_seq")
  .load()
df.show()
Can someone help me understand the purpose of using HiveServer2 with JDBC and the relevant drivers in Spark 2?
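Not a full answer, but one detail worth checking (my assumption, not from the post): Class.forName only registers the driver in the submitting JVM, whereas Spark's JDBC source normally receives the class name through its own driver option. A sketch, though it may not by itself explain the empty result:

val df = sparkJdbc.read
  .format("jdbc")
  .option("url", "jdbc:hive2://<remote-host>:10000/default")
  .option("driver", "org.apache.hive.jdbc.HiveDriver") // pass the driver explicitly
  .option("dbtable", "anon_seq")
  .load()
df.show()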
So the read works fine as per documentation:
val cql = new org.apache.spark.sql.cassandra.CassandraSQLContext(sc)
cql.setConf("cluster-src/spark.cassandra.connection.host", "1.1.1.1")
cql.setConf("cluster-dst/spark.cassandra.connection.host", "2.2.2.2")
...
var df = cql.read.format("org.apache.spark.sql.cassandra")
  .option("table", "my_table")
  .option("keyspace", "my_keyspace")
  .option("cluster", "cluster-src")
  .load()
But it is not clear how to pass the destination cluster name to the save counterpart. This obviously does not work; it just tries to connect to the local Spark host:
df.write
  .format("org.apache.spark.sql.cassandra")
  .option("table", "my_table")
  .option("keyspace", "my_keyspace")
  .option("cluster", "cluster-dst")
  .save()
Update:
Found a workaround but it is kind of ugly. So instead of:
.option("cluster", "cluster-dst")
use:
.option("spark_cassandra_connection_host", cql.getConf("cluster-dst/spark.cassandra.connection.host")