Pyspark crashing on Dataproc cluster for small dataset

I am running a Jupyter notebook on a GCP Dataproc cluster consisting of 3 worker nodes and 1 master node of type n1-standard-2 (2 vCPUs, 7.5 GB RAM) for my data science project. The dataset consists of ~0.4 million rows. I called groupBy() on a column with only 10 unique values, so the output dataframe should consist of just 10 rows!
Surprisingly, it crashes every time I call grouped_df.show() or grouped_df.toPandas(), where grouped_df is obtained by calling groupBy() followed by sum().
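For context, the aggregation in question looks roughly like this (the column names here are placeholders, not the actual ones):
grouped_df = df.groupBy("category").sum("amount")  # 10 unique categories -> 10 output rows
grouped_df.show()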
This should be a cakewalk for Spark, which was built for processing large datasets. I am attaching the Spark config I am using, which I have defined in a function:
from pyspark.sql import SparkSession

def create_spark_session():  # the config is defined inside a function, as mentioned above
    builder = SparkSession.builder \
        .appName("Spark NLP Licensed") \
        .master("local[*]") \
        .config("spark.driver.memory", "40G") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1") \
        .config("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
        .config("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
    return builder.getOrCreate()
This is the error I am getting. Please help.

Setting the master URL via setMaster() helped. Now I can load data as large as 20 GB and run groupBy() operations on the cluster as well.
Thanks @mazaneicha.
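For reference, the fix amounts to pointing the session at the cluster's resource manager instead of a single local JVM. On Dataproc the resource manager is YARN, so a minimal sketch of the corrected builder (rest of the config unchanged) looks like:
builder = SparkSession.builder \
    .appName("Spark NLP Licensed") \
    .master("yarn") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1")
spark = builder.getOrCreate()
Note also that with everything running in one local JVM, a 40G driver heap on a 7.5 GB machine cannot be backed by physical memory, which likely contributed to the crashes.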

Related

Saving JSON DataFrame from Spark Standalone Cluster

I am running some tests on my local machine with a Spark Standalone cluster (4 Docker containers: 3 workers and 1 master). I'm trying to save a DataFrame using the JSON format, like so:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('sparkApp') \
    .master("spark://172.19.0.2:7077") \
    .getOrCreate()

data = spark.range(5, 10)
data.write.format("json") \
    .mode("overwrite") \
    .save("/home/alvaro/Alvaro/ICI/DeltaLake/SparkCluster/output11")
However, the folder created looks like the one in output9. I ran the same code with the master URL replaced by local[5], and it worked, resulting in the file output8.
What should I do to get the DataFrame JSON to be created on my local machine? Is it possible to do so using this kind of Spark cluster?
Thanks in advance!
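A note on what is likely happening here: when the master points at a remote standalone cluster, each executor writes its output partitions to its own local filesystem, so the JSON part files land inside the worker containers rather than on the host machine; with local[5] everything runs in a single local process, which is why the files appeared. A small sketch of one common workaround, assuming /shared is a directory volume-mounted into the driver and every worker container at the same path (the mount itself is hypothetical):
# Assumes /shared is bind-mounted into the driver and all worker containers.
data = spark.range(5, 10)
data.write.format("json") \
    .mode("overwrite") \
    .save("/shared/output11")  # every executor can write here; results are visible on the host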

Best way to process Redshift data on Spark (EMR) via Airflow MWAA?

We have an Airflow MWAA cluster and a huge volume of data in our Redshift data warehouse. We currently process the data directly in Redshift (with SQL), but given the amount of data, this puts a lot of pressure on the data warehouse, and it is becoming less and less resilient.
A potential solution we found would be to decouple the data storage (Redshift) from the data processing (Spark). First of all, what do you think about this solution?
To do this, we would like to use Airflow MWAA and SparkSQL to:
Transfer data from Redshift to Spark
Run the SQL scripts that were previously executed in Redshift
Transfer the newly created table from Spark to Redshift
Is it a use case that someone here has already put in production?
What would, in your opinion, be the best way to interact with the Spark cluster: EmrAddStepsOperator, or PythonOperator + PySpark?
You can use one of these two connectors:
spark-redshift connector: an open-source connector developed and maintained by Databricks.
EMR spark-redshift connector: developed by AWS and based on the first one, but with some improvements (github).
To load data from Redshift into Spark, you can read a table and process it in Spark:
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .load()
Or take advantage of Redshift for part of your processing by reading from a query result (you can filter, join, or aggregate your data in Redshift before loading it into Spark):
df = sql_context.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("query", "select x, count(*) from my_table group by x") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .load()
You can do whatever you want with the loaded dataframe, and store the result in another data store if needed. You can use the same connector to load the result (or any other dataframe) into Redshift:
df.write \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table_copy") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .mode("error") \
    .save()
P.S.: the connector is fully supported by Spark SQL, so you can add the dependencies to your EMR cluster, then use the SparkSqlOperator to extract, transform and re-load your Redshift tables (SQL syntax example), or the SparkSubmitOperator if you prefer Python/Scala/Java jobs.
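To make the Airflow side concrete, here is a minimal sketch of submitting such a job from MWAA with EmrAddStepsOperator (the cluster ID, bucket, and script path are hypothetical, and the import path can vary with the Amazon provider version):
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator

# One EMR step that spark-submits a PySpark job doing the Redshift read/transform/write.
SPARK_STEP = [
    {
        "Name": "redshift_transform",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "s3://my-bucket/jobs/redshift_transform.py",  # hypothetical script location
            ],
        },
    }
]

add_step = EmrAddStepsOperator(
    task_id="add_spark_step",
    job_flow_id="j-XXXXXXXXXXXXX",  # hypothetical EMR cluster id
    steps=SPARK_STEP,
)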

Spark streaming aggregate stream is not updating when reading from new csv files

I'm trying to run some basic Spark Structured Streaming examples via pyspark, and the behavior of the latest version (3.0.1) is not as advertised, and not as I remember it from previous versions.
Specifically, I set up a streaming DF to read csv files from a folder. Each file contains two columns, stock and value, holding a series of randomly generated stock values for 4 different stocks. For example:
stock | value
----- | --------
HPE   | 11.7014
NHPI  | 0.00672
NHPI  | 0.00714
NHPI  | 0.008232
TSLA  | 337.9674
I then groupBy the stock name and average the price:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession \
    .builder \
    .appName("StockTicker") \
    .getOrCreate()

# Create a streaming DataFrame from incoming csv files
# Define the schema
schema = StructType().add('stock', 'string').add('value', 'double')
dfCSV = spark \
    .readStream \
    .option('header', True) \
    .schema(schema) \
    .option('maxFilesPerTrigger', 1) \
    .csv("stocks")

# Generate the running average price per stock
avgStockPrice = dfCSV.groupBy("stock").avg()

# Start the query that prints the running averages to the console
query = avgStockPrice \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
When I put more than one file in the folder to start with, it calculates over all the files as expected, running a batch for each file. But if I then add any more files, a batch is triggered, yet the results are exactly the same, no matter how many new files I add.
I tried changing avg to count to rule out the possibility that the random generator was simply producing the same average price, but I got the same behavior: the row count increased for each of the initial files in the folder but did not budge when adding more files. A computation was triggered, but no new results.
Using Ubuntu 18.04, python 3.7.10, spark 3.0.1, pyspark (jupyter notebook).
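For what it's worth, one way to check whether newly added files are actually being ingested is to inspect the query's progress after each batch; a minimal sketch using the standard StreamingQuery API:
# After dropping a new file into the folder, inspect the most recent micro-batch.
progress = query.lastProgress  # dict describing the latest completed batch (or None)
if progress is not None:
    print(progress["batchId"], progress["numInputRows"])
If numInputRows stays at 0 for the new batches, the file source is not picking up the new files at all, which narrows down the problem.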

Spark load from Elasticsearch: number of executor and partitions

I'm trying to load data from an Elasticsearch index into a dataframe in Spark. My machine has 12 CPUs with 1 core each. I'm using PySpark in a Jupyter notebook with the following Spark config:
pathElkJar = currentUserFolder+"/elasticsearch-hadoop-"+connectorVersion+"/dist/elasticsearch-spark-20_2.11-"+connectorVersion+".jar"
spark = SparkSession.builder \
    .appName("elastic") \
    .config("spark.jars", pathElkJar) \
    .enableHiveSupport() \
    .getOrCreate()
Now whether I do:
df = es_reader.load()
or:
df = es_reader.load(numPartitions=12)
I get the same output from the following prints:
print('Master: {}'.format(spark.sparkContext.master))
print('Number of partitions: {}'.format(df.rdd.getNumPartitions()))
print('Number of executors:{}'.format(spark.sparkContext._conf.get('spark.executor.instances')))
print('Partitioner: {}'.format(df.rdd.partitioner))
print('Partitions structure: {}'.format(df.rdd.glom().collect()))
Master: local[*]
Number of partitions: 1
Number of executors: None
Partitioner: None
I was expecting 12 partitions, which I can only obtain by doing a repartition() on the dataframe. Furthermore, I thought that the number of executors would by default equal the number of CPUs. But even by doing the following:
spark.conf.set("spark.executor.instances", "12")
I can't manually set the number of executors. It is true that I have 1 core for each of the 12 CPUs, but how should I go about it?
I modified the configuration by specifying the number of executors when creating the Spark session (modifying it after creation without restarting obviously leads to no changes):
spark = SparkSession.builder \
    .appName("elastic") \
    .config("spark.jars", pathElkJar) \
    .config("spark.executor.instances", "12") \
    .enableHiveSupport() \
    .getOrCreate()
I now correctly get 12 executors. Still, I don't understand why this doesn't happen automatically, and why the number of partitions when loading the dataframe is still 1. I would expect it to be 12, like the number of executors; am I right?
The problem regarding the executors and partitioning arose from the fact that I was using Spark in local mode, which allows for one executor at most. Using YARN or another resource manager such as Mesos solved the problem.
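As a side note on the partition count: with the elasticsearch-hadoop connector, the number of input partitions is normally determined by the source index (roughly one Spark partition per Elasticsearch shard), not by the number of executors, so a single-shard index loads as one partition regardless of cluster size. A minimal sketch of reading and then repartitioning explicitly (the host and index name are assumptions):
# Read from Elasticsearch; the partition count follows the index's shard layout.
df = spark.read \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "localhost:9200") \
    .load("my_index")  # hypothetical index name

df = df.repartition(12)  # spread work across the 12 cores explicitly
print(df.rdd.getNumPartitions())  # now 12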

Pyspark GroupBy and count too slow

I am running pyspark on a Dataproc cluster with 4 nodes, each node having 2 cores and 8 GB RAM.
I have a dataframe with a column containing a list of words. I exploded this column and counted the number of occurrences using:
df.groupBy("exploded_col").count()
Before exploding, there were ~78 million rows.
But running the above code takes too long (more than 4 hours). Why is Spark taking an unusually long time? I'm still new to Spark, so I'm not fully aware of the appropriate settings for dealing with huge data.
I have the following settings for the Spark session:
spark = SparkSession.builder \
    .appName("Spark NLP Licensed") \
    .master("yarn") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.1") \
    .getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 20)
spark.conf.set("spark.num.executors", 100)
spark.conf.set("spark.executor.cores", 1)
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
I even set "spark.sql.shuffle.partitions" to 2001, but that didn't work either.
Please help.
The main reason for the poor performance is that groupBy usually causes a data shuffle between the executors. You can use the built-in Spark function countDistinct in this manner:
from pyspark.sql.functions import countDistinct
df.agg(countDistinct("exploded_col"))
