Saving JSON DataFrame from Spark Standalone Cluster - apache-spark

I am running some tests on my local machine with a Spark Standalone Cluster (4 docker containers, 3 workers and 1 master). I'm trying to save a DataFrame using JSON format, like so:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName('sparkApp') \
.master("spark://172.19.0.2:7077") \
.getOrCreate()
data = spark.range(5, 10)
data.write.format("json") \
.mode("overwrite") \
.save("/home/alvaro/Alvaro/ICI/DeltaLake/SparkCluster/output11")
However, the folder created looks like the one in output9. I ran the same code with the master URL replaced by local[5] and it worked, resulting in the file output8.
What should I do to get the DataFrame JSON to be created on my local machine? Is it possible to do so using this kind of Spark cluster?
Thanks in advance!
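For what it's worth: with a standalone cluster and a plain local path, each executor typically writes its partition files inside its own container's filesystem, so the driver machine only ends up with the folder skeleton. A minimal sketch of one workaround for small results, if the goal is just to get the JSON onto the local machine running the driver (the output path is the one from the question, with a .json suffix added for illustration):
# Collect the rows to the driver as JSON strings; only reasonable for small
# DataFrames, since everything is pulled into driver memory.
local_rows = data.toJSON().collect()
with open("/home/alvaro/Alvaro/ICI/DeltaLake/SparkCluster/output11.json", "w") as f:
    f.write("\n".join(local_rows))
For larger data, pointing the write at storage that the driver and all workers share (a mounted volume, HDFS, S3, ...) is the usual approach.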

Related

How to fix "File file /tmp/delta-table does not exist in Delta Lake?

Hello dear programmers,
I am currently setting up Delta Lake with Apache Spark. For the Spark worker and master I am using the image docker.io/bitnami/spark:3.
What I am trying to do, via my Python application, is create a new Delta table through the Spark master/worker I set up. However, when I try to save the table I get the following error: File file:/tmp/delta-table/_delta_log/00000000000000000000.json does not exist.
This might have something to do with the worker/master containers not being able to access my local files, but I am not sure how to fix it. I also looked into using HDFS, but do I need to run a separate server for that, since it seems to be built into Delta Lake already?
The code of my applications looks as follows:
import pyspark
from delta import *
builder = pyspark.sql.SparkSession.builder.appName("MyApp") \
.master("spark://spark:7077") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.config("spark.jars.packages", "io.delta:delta-core_2.12:1.1.0") \
spark = configure_spark_with_delta_pip(builder).getOrCreate()
data = spark.range(0, 5)
data.write.mode("overwrite").format("delta").save("/tmp/delta-table")
df = spark.read.format("delta").load("/tmp/delta-table")
df.show()
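A common cause of this error is that /tmp/delta-table resolves to each container's own local filesystem, so the workers' data files and the _delta_log the driver looks for never land in the same place. A minimal sketch of one workaround, assuming a shared Docker volume mounted at the same (hypothetical) path /shared-data inside the driver and every worker container:
# /shared-data is a hypothetical volume mounted into the driver and all workers,
# so every process sees the same _delta_log directory.
data.write.mode("overwrite").format("delta").save("file:///shared-data/delta-table")
df = spark.read.format("delta").load("file:///shared-data/delta-table")
df.show()
HDFS or an object store works the same way; Delta Lake is a storage layer on top of whatever filesystem you point it at and does not ship its own storage server, so HDFS would still have to be run separately if you want it.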

Spark streaming aggregate stream is not updating when reading from new csv files

I'm trying to run some basic examples of Spark streaming via pyspark and the behavior of the latest version (3.0.1) is not as advertised and not as I remembered it from previous versions.
Specifically, I set up a streaming DF to read csv files from a folder. Each file contains two columns, stock and value, with a series of randomly generated stock values for 4 different stocks. For example:
stock     value
HPE       11.7014
NHPI      0.00672
NHPI      0.00714
NHPI      0.008232
TSLA      337.9674
I then groupBy the stock name and average the price.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
spark = SparkSession \
.builder \
.appName("StockTicker") \
.getOrCreate()
# Create a streaming DataFrame from incoming csv files
# Define schema
from pyspark.sql.types import StructType
schema = StructType().add('stock','string').add('value','double')
dfCSV = spark \
.readStream \
.option('header',True) \
.schema(schema) \
.option('maxFilesPerTrigger',1) \
.csv("stocks")
# Generate running average price
avgStockPrice = dfCSV.groupBy("stock").avg()
# Start running the query that prints the running averages to the console
query = avgStockPrice \
.writeStream \
.outputMode("complete") \
.format("console") \
.start()
When I start with more than one file in the folder it calculates over all of them as expected, running a batch for each file. But if I then add more files, a batch is triggered yet the results stay exactly the same, no matter how many new files I add.
I tried changing the avg to count to rule out the random values simply averaging out, but I got the same result: the row count increased for each of the initial files in the folder but did not budge when adding more files. A computation was triggered, but no new results.
Using Ubuntu 18.04, Python 3.7.10, Spark 3.0.1, PySpark (Jupyter notebook).
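Not a fix, but a small, hedged way to confirm whether the new files are actually being read: the StreamingQuery handle exposes per-batch progress, so you can check whether later micro-batches report any input rows at all.
import time
time.sleep(30)  # let the stream run a few micro-batches
print(query.status)  # current state of the query
for progress in query.recentProgress:
    # one entry per recent micro-batch; numInputRows should be > 0 whenever a new file was read
    print(progress["batchId"], progress["numInputRows"])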

What to set Spark Master address to when deploying on Kubernetes Spark Operator?

The official Spark documentation only has information on the spark-submit method for deploying code to a Spark cluster. It mentions we must prefix the address of the Kubernetes API server with k8s://. What should we do when deploying through the Spark Operator?
For instance, if I have a basic PySpark application that starts up like this, how do I set the master:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sc = SparkContext("local", "Big data App")
spark = SQLContext(sc)
spark_conf = SparkConf().setMaster('local').setAppName('app_name')
Here I have local; if I were running on a non-Kubernetes cluster I would set the master address with the spark:// prefix, or to yarn. Must I also use the k8s:// prefix when deploying through the Spark Operator?
If not, what should be used for the master parameter?
It's better not to use setMaster in the code; instead, specify the master when running the code via spark-submit, something like this (see the documentation for details):
./bin/spark-submit \
--master k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port> \
--deploy-mode cluster \
your_script.py
I haven't used the Spark Operator, but as I understand from its documentation, it should set the master automatically.
You also need to convert this code:
sc = SparkContext("local", "Big data App")
spark = SQLContext(sc)
spark_conf = SparkConf().setMaster('local').setAppName('app_name')
to the more modern form (see the docs):
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
as SQLContext is deprecated.
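If you want to confirm at runtime which master the launcher actually injected, a quick check (a small sketch, using the spark session created above) is:
# sc.master reflects whatever spark-submit or the operator passed in,
# so nothing needs to be hard-coded in the application.
print(spark.sparkContext.master)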
P.S. I recommend working through the first chapters of Learning Spark, 2nd edition, which is freely available from the Databricks site.

Pyspark crashing on Dataproc cluster for small dataset

I am running a Jupyter notebook on a GCP Dataproc cluster consisting of 3 worker nodes and 1 master node of type "n1-standard-2" (2 cores, 7.5 GB RAM) for my data science project. The dataset consists of ~0.4 million rows. I have called a groupBy function with a groupBy column containing only 10 unique values, so the output dataframe should consist of just 10 rows!
It's surprising that it crashes every time I call grouped_df.show() or grouped_df.toPandas(), where grouped_df is obtained after calling the groupBy() and sum() functions.
This should be a cakewalk for Spark, which was originally built for processing large datasets. I am attaching the Spark config that I am using, which I have defined in a function.
builder = SparkSession.builder \
.appName("Spark NLP Licensed") \
.master("local[*]") \
.config("spark.driver.memory", "40G") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.config("spark.kryoserializer.buffer.max", "2000M") \
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1") \
.config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
.config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
return builder.getOrCreate()
This is the error I am getting. Please help.
Setting the master's URL in setMaster() helped. Now I can load data as large as 20 GB and perform groupBy() operations on the cluster as well.
Thanks @mazaneicha.
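For reference, a minimal sketch of that change, assuming the cluster's default YARN resource manager (Dataproc runs Spark on YARN):
from pyspark.sql import SparkSession
builder = SparkSession.builder \
.appName("Spark NLP Licensed") \
.master("yarn")  # Dataproc's resource manager, replacing local[*]
# ... plus the other .config(...) options from the question
spark = builder.getOrCreate()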

How to distribute JDBC jar on Cloudera cluster?

I've just installed a new Spark 2.4 from CSD on my CDH cluster (28 nodes) and am trying to install a JDBC driver in order to read data from a database from within a Jupyter notebook.
I downloaded the driver and copied it to the /jars folder on one node; however, it seems that I have to do the same on each and every host (!). Otherwise I'm getting the following error from one of the workers:
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver
Is there any easy way (without writing bash scripts) to distribute the jar files across the whole cluster? I wish Spark could distribute them itself (or maybe it does and I just don't know how).
Spark has a jdbc format reader you can use.
Launch a Scala shell to confirm your MS SQL Server driver is on your classpath, for example:
Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")
If the driver class isn't found, make sure you place the jar on an edge node and include it in your classpath when you initialize your session, for example:
bin/spark-shell --driver-class-path postgresql-9.4.1207.jar --jars postgresql-9.4.1207.jar
Then connect to your MS SQL Server via the Spark jdbc reader, for example via PySpark:
# option1
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:postgresql:dbserver") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.load()
# option2
jdbcDF2 = spark.read \
.jdbc("jdbc:postgresql:dbserver", "schema.tablename",
properties={"user": "username", "password": "password"})
Specifics and additional ways to build connection strings can be found here:
https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
You mentioned Jupyter ... if you still cannot get the above to work, try setting some environment variables as described in this post (I cannot confirm whether this works, though):
https://medium.com/@thucnc/pyspark-in-jupyter-notebook-working-with-dataframe-jdbc-data-sources-6f3d39300bf6
At the end of the day, all you really need is the driver class placed on an edge node (the client where you launch Spark) and appended to your classpath. Then make the connection and partition your DataFrame to scale performance, since a plain JDBC read from an RDBMS is single-threaded and hence ends up in 1 partition.
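On that last point, a hedged sketch of a partitioned JDBC read (the URL, table name, and bounds below are placeholders; partitionColumn must be a numeric, date, or timestamp column):
jdbcDF = spark.read \
.format("jdbc") \
.option("url", "jdbc:sqlserver://dbserver:1433;databaseName=mydb") \
.option("dbtable", "schema.tablename") \
.option("user", "username") \
.option("password", "password") \
.option("partitionColumn", "id") \
.option("lowerBound", "1") \
.option("upperBound", "1000000") \
.option("numPartitions", "8") \
.load()
# Spark issues 8 parallel queries, each covering a slice of the id range,
# rather than one single-threaded read into 1 partition.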
