I was encountering StackOverflowErrors while iteratively adding over 500 columns to my PySpark dataframe, so I added checkpoints. The checkpoints did not help, so I created the following toy application to test whether they were working correctly. All I do in this example is iteratively create columns by copying the original column over and over again. I persist, checkpoint and count every 10 iterations. I notice that my dataframe.rdd.isCheckpointed() always returns False, even though I can verify that the checkpoint folders are being created and populated on disk. I am running on Dataproc on Google Cloud.
Here is my code:
from pyspark import SparkContext, SparkConf
from pyspark import StorageLevel
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np
import sys
APP_NAME = "isCheckPointWorking"
spark = SparkSession\
    .builder\
    .appName(APP_NAME)\
    .getOrCreate()
sc = SparkContext.getOrCreate()
#set the checkpoint directory
#create a spark dataframe with one column containing numbers 1 through 9
df4 = spark.createDataFrame(pd.DataFrame(np.arange(1,10),columns = ["A"]))
#create a list of new columns to be added to the dataframe
numberList = np.arange(0,40)
colNewList = ['col'+str(x) for x in numberList]
iterCount = 0
for colName in colNewList:
    #copy column A in to the new column
    df4 = df4.withColumn(colName,df4.A)
    if (np.mod(iterCount,10) == 0):
        df4 = df4.persist(StorageLevel.MEMORY_AND_DISK)
        #checkpoint and count, as described above (the checkpoint result is not assigned)
        df4.checkpoint(eager=True)
        df4.count()
        #checking if underlying RDD is being checkpointed
        print("is data frame checkpointed "+str(df4.rdd.isCheckpointed()))
    iterCount +=1
It is unclear why df4.rdd.isCheckpointed() is returning False each time, when I can see that the checkpoint folder is being populated. Any thoughts?

The checkpoint method returns a new checkpointed Dataset; it does not modify the current Dataset. Assign the result back to the variable:
df4 = df4.checkpoint(eager=True)


Unable to read images simultaneously [in parallel] using pyspark

I have 10 jpeg images in a directory.
I want to read all of them simultaneously using pyspark.
I tried as follows.
import glob
import numpy as np
from PIL import Image
from pyspark import SparkContext, SparkConf
conf = SparkConf()
spark = SparkContext(conf=conf)
files = glob.glob("E:\\tests\\*.jpg")
files_ = spark.parallelize(files)
arrs = []
# toLocalIterator() brings the paths back to the driver, so this loop runs serially
for fi in files_.toLocalIterator():
    im = Image.open(fi)
    data = np.asarray(im)
    arrs.append(data)
img = np.array(arrs)
print(img.shape)
The code ended without error and printed out img.shape; however, it did not run in parallel.
Could you help me?
You can use rdd.map to load and transform the pictures in parallel and then collect the RDD into a Python list:
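files = glob.glob("E:\\tests\\*.jpg")
file_rdd = spark.parallelize(files)
def image_to_array(path):
    # runs on the executors: open one image and convert it to a numpy array
    im = Image.open(path)
    data = np.asarray(im)
    return data
array_rdd = file_rdd.map(lambda f: image_to_array(f))
result_list = array_rdd.collect()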
result_list is now a list with 10 elements, each of which is a numpy.ndarray.
The function image_to_array will be executed on different Spark executors in parallel. If you have a multi-node Spark cluster, you have to make sure that all nodes can access E:\\tests\\.
After collecting the arrays, processing can continue with
img = np.array(result_list, dtype=object)
My solution follows the same idea as werner's, but uses only Spark libraries:
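from pyspark.ml.image import ImageSchema
import numpy as np
# spark is assumed to be a SparkSession here; the load path is assumed to be the question's directory
df = (spark.read
    .format("image")
    .option("pathGlobFilter", "*.jpg")
    .load("E:\\tests\\"))
df = df.select('image.*')
# Pre-caching the required schema. If you remove this line an error will be raised.
ImageSchema.imageFields
# Transforming images to np.array
arrays = df.rdd.map(ImageSchema.toNDArray).collect()
img = np.array(arrays)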

How to add a column to a PySpark dataframe which contains the nth quantile of another column in the dataframe

I have a very large CSV file which has been imported as a PySpark dataframe: df. The dataframe contains many columns, including the column ireturn. I want to compute the 0.99 and 0.01 quantiles (the 99th and 1st percentiles) of this column and then add two columns to df, new_col_99 and new_col_01, which contain these values. I wrote the following code, which works for small dataframes, but I get errors when I apply it to my large dataframe.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("name of the file", inferSchema = True, header = True)
percentile_99 = df.selectExpr('percentile(val1, 0.99)').head(1)[0][0]
percentile_01 = df.selectExpr('percentile(val1, 0.01)').head(1)[0][0]
from pyspark.sql.functions import lit
df = df.withColumn("new_col_99", lit(percentile_99))
df = df.withColumn("new_col_01", lit(percentile_01))
I tried to replace head with collect, but it did not work either. I get this error:
Logging error ---
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (
Traceback (most recent call last):...
I have also tried the following:
percentile = df.approxQuantile('ireturn',[0.01,0.99],0.25)
df = df.withColumn("new_col_01", lit(percentile[0]))
df = df.withColumn("new_col_99", lit(percentile[1]))
The above code takes about 15-20 minutes to run, but the result is wrong: the values in column ireturn are all less than 1, yet it returns 6789.... as the 0.99 percentile.
Late, but hopefully answers your concerns. You can get the result this way:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("name of the file", inferSchema = True, header = True)
df = df.withColumn("new_col_99", F.expr('percentile(val1, 0.99) over()'))
df = df.withColumn("new_col_01", F.expr('percentile(val1, 0.01) over()'))
For large datasets percentile_approx may be better:
df = df.withColumn("new_col_99", F.expr('percentile_approx(val1, 0.99) over()'))
df = df.withColumn("new_col_01", F.expr('percentile_approx(val1, 0.01) over()'))
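To make clear what the window expression above produces, here is a rough pure-Python sketch (no Spark involved). It assumes that Spark's exact percentile interpolates linearly between the two nearest ranks; the helper names exact_percentile and add_constant_percentile_columns are hypothetical and exist only for this illustration. The point is that a single value is computed over the whole column and then repeated on every row.

```python
import math

def exact_percentile(values, p):
    """Percentile at fraction p (0..1) of a non-empty list, using linear
    interpolation between the two nearest ranks (assumed to match Spark)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * p
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(ordered[lower])
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def add_constant_percentile_columns(rows, column, percentiles):
    """Return new row dicts with one extra key per requested percentile.
    Every row receives the same value, like an aggregate over an empty window."""
    values = [row[column] for row in rows]
    extras = {name: exact_percentile(values, p) for name, p in percentiles.items()}
    return [{**row, **extras} for row in rows]

# Toy data standing in for the val1 column
rows = [{"val1": v} for v in [1.0, 2.0, 3.0, 4.0, 5.0]]
result = add_constant_percentile_columns(
    rows, "val1", {"new_col_99": 0.99, "new_col_01": 0.01}
)
for row in result:
    print(row)
```

An empty over() puts all rows into a single window, so on a very large dataframe the aggregate is still one global computation; that is why the approximate variant may be the more practical choice there.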

Convert a pandas dataframe to a PySpark dataframe [duplicate]

I have a script with the below setup.
I am using:
1) Spark dataframes to pull data in
2) Conversion to pandas dataframes after initial aggregation
3) Conversion back to Spark for writing to HDFS
The conversion from Spark --> Pandas was simple, but I am struggling with how to convert a Pandas dataframe back to spark.
Can you advise?
from pyspark.sql import SparkSession
import pyspark.sql.functions as sqlfunc
from pyspark.sql.types import *
import argparse, sys
from pyspark.sql import *
import pandas as pd
def create_session(appname):
    spark_session = SparkSession\
        .builder\
        .appName(appname)\
        .config("hive.metastore.uris", "thrift://")\
        .getOrCreate()
    return spark_session
### START MAIN ###
if __name__ == '__main__':
    spark_session = create_session('testing_files')
I've tried the below - no errors, just no data! To confirm, df6 does have data & is a pandas dataframe
df6 = df5.sort_values(['sdsf'], ascending=True)
sdf = spark_session.createDataFrame(df6)
Here we go:
# Spark to Pandas
df_pd = df.toPandas()
# Pandas to Spark
df_sp = spark_session.createDataFrame(df_pd)

Spark on Google Cloud Dataproc job failures on last stages

I work with a Spark cluster on Dataproc and my job fails at the end of processing.
My data source is text log files in CSV format on Google Cloud Storage (total volume is 3.5 TB, 5000 files).
The processing logic is as follows:
1) read the files into a DataFrame (schema ["timestamp", "message"]);
2) group all messages into windows of 1 second;
3) apply the pipeline [Tokenizer -> HashingTF] to every grouped message to extract words and their frequencies and build feature vectors;
4) save the feature vectors with timelines on GCS.
The issue I'm having is that on a small subset of the data (like 10 files) processing works well, but when I run it on all files it fails at the very end with an error like "Container killed by YARN for exceeding memory limits. 25.0 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead."
My cluster has 25 workers with n1-highmem-8 machines. So I googled this error and increased the "spark.yarn.executor.memoryOverhead" parameter to 6500MB.
Now my Spark job still fails, but with the error "Job aborted due to stage failure: Total size of serialized results of 4293 tasks (1920.0 MB) is bigger than spark.driver.maxResultSize (1920.0 MB)"
I'm new to Spark and I believe that I'm doing something wrong, either at the configuration level or in my code. If you can help me clean these things up, that would be great!
Here is my code for the spark task:
import logging
import string
from datetime import datetime
import pyspark
import re
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml import Pipeline
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType, TimestampType, ArrayType
from pyspark.sql import functions as F
logger = logging.getLogger(__name__)
# Constants
NOW = datetime.now().strftime("%Y%m%d%H%M%S")
START_DATE = '2016-01-01'
END_DATE = '2016-03-01'
sc = pyspark.SparkContext()
spark = SparkSession\
    .builder\
    .getOrCreate()
spark.conf.set('spark.sql.shuffle.partitions', 10000)
logger.info("Start log processing at {}...".format(NOW))
# Filenames to read/write locations
logs_fn = 'gs://databucket/csv/*'
vectors_fn = 'gs://databucket/vectors_out_{}'.format(NOW)
pipeline_fn = 'gs://databucket/pipeline_vectors_out_{}'.format(NOW)
model_fn = 'gs://databucket/model_vectors_out_{}'.format(NOW)
# CSV data schema to build DataFrame
schema = StructType([
StructField("timestamp", StringType()),
StructField("message", StringType())])
# Helpers to clean strings in log fields
def cleaning_string(s):
    try:
        # Replace ids with a tag (like: app[2352] -> appIDTAG)
        s = re.sub(r'\[.*\]', 'IDTAG', s)
        if s == '':
            s = 'EMPTY'
    except Exception as e:
        print("Skip string with exception {}".format(e))
    return s
def normalize_string(s):
    try:
        # Remove punctuation
        s = re.sub('[{}]'.format(re.escape(string.punctuation)), ' ', s)
        # Remove digits
        s = re.sub(r'\d+', '', s)
        # Remove extra spaces
        s = ' '.join(s.split())
    except Exception as e:
        print("Skip string with exception {}".format(e))
    return s
def line_splitter(line):
    line = line.split(',')
    timestamp = line[0]
    full_message = ' '.join(line[1:])
    full_message = normalize_string(cleaning_string(full_message))
    return [timestamp, full_message]
# Read line from csv, split to date|message
# Read CSV to DataFrame and clean its fields
logger.info("Read CSV to DF...")
logs_csv = sc.textFile(logs_fn)
logs_csv = logs_csv.map(lambda line: line_splitter(line)).toDF(schema)
# Keep only lines for our date interval
logger.info("Filter by dates...")
logs_csv = logs_csv.filter((logs_csv.timestamp>START_DATE) & (logs_csv.timestamp<END_DATE))
logs_csv = logs_csv.withColumn("timestamp", logs_csv.timestamp.cast("timestamp"))
# Helpers to join messages into window and convert sparse to dense
join_ = F.udf(lambda x: "| ".join(x), StringType())
asDense = F.udf(lambda v: v.toArray().tolist())
# Agg by time window
logger.info("Group log messages by time window...")
logs_csv = logs_csv.groupBy(F.window("timestamp", "1 second"))\
    .agg(join_(F.collect_list("message")).alias("messages"))
# Turn message to hashTF
tokenizer = Tokenizer(inputCol="messages", outputCol="message_tokens")
hashingTF = HashingTF(inputCol="message_tokens", outputCol="tokens_counts", numFeatures=1000)
pipeline_tf = Pipeline(stages=[tokenizer, hashingTF])
logger.info("Fit-Transform ML Pipeline...")
model_tf = pipeline_tf.fit(logs_csv)
logs_csv = model_tf.transform(logs_csv)
logger.info("Sparse vectors to Dense list...")
logs_csv = logs_csv.sort("window.start").select(["window.start", "tokens_counts"])\
    .withColumn("tokens_counts", asDense(logs_csv.tokens_counts))
# Save to disk
# Save Pipeline and Model
logger.info("Save models...")
# Save to GCS
logger.info("Save results to GCS...")
spark.driver.maxResultSize is an issue with the size of your driver, which in Dataproc runs on the master node.
By default, 1/4 of the master's memory is given to the driver, and 1/2 of that is assigned to spark.driver.maxResultSize (the largest RDD Spark will let you .collect()).
I'm guessing Tokenizer or HashingTF are moving "metadata" through the driver that is the size of your keyspace. To increase the allowable size you can increase spark.driver.maxResultSize, but you might also want to increase spark.driver.memory and/or use a larger master as well. Spark's configuration guide has more information.
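As a rough worked example, the arithmetic below takes the two fractions stated above (1/4 and 1/2) at face value and applies them to the numbers in the error message. The function names are hypothetical, and the actual defaults may differ between Dataproc image versions, so treat this as a back-of-the-envelope sketch rather than a description of Dataproc's configuration logic.

```python
def default_driver_settings_mb(master_memory_mb):
    """Apply the fractions from the answer: driver = 1/4 of master memory,
    maxResultSize = 1/2 of driver memory."""
    driver_memory_mb = master_memory_mb / 4
    max_result_size_mb = driver_memory_mb / 2
    return driver_memory_mb, max_result_size_mb

def master_memory_from_limit_mb(max_result_size_mb):
    """Invert the same fractions to estimate the master's memory."""
    return max_result_size_mb * 2 * 4

# Numbers taken from the error message in the question
reported_limit_mb = 1920.0
task_count = 4293

implied_master_mb = master_memory_from_limit_mb(reported_limit_mb)
driver_mb, limit_mb = default_driver_settings_mb(implied_master_mb)
avg_result_mb = reported_limit_mb / task_count

print("implied master memory (MB):", implied_master_mb)
print("implied driver memory (MB):", driver_mb)
print("average serialized result per task (MB):", round(avg_result_mb, 3))
```

Under these assumptions the limit in the error corresponds to a master with about 15 GB of memory, and each task returns a bit under half a megabyte to the driver. Either a larger master or an explicitly raised spark.driver.maxResultSize (together with spark.driver.memory) would lift the ceiling.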

Reading data from HDFS on a cluster

I am trying to read data from HDFS on an AWS EC2 cluster using Jupyter Notebook. The cluster has 7 nodes. I am using HDP 2.4 and my code is below. The table has millions of rows, but the code does not return any rows. The host in the HDFS URI is the server (ambari-server).
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)
demography = sqlContext.read.load("hdfs://", format="com.databricks.spark.csv", header="true", inferSchema="true")
print(demography.count())
But using sc.textFile, I get the correct number of rows:
data = sc.textFile("hdfs://")
schema = data.map(lambda x: x.split(",")).first() #get schema
header = data.first() # extract header
data = data.filter(lambda x: x != header) # filter out header
data = data.map(lambda x: x.split(","))
The answer by Indrajit given here solved my problem. The problem was with the spark-csv jar.
