Spark Accumulator confusion - apache-spark

I'm writing a Spark job that takes in data from multiple sources, filters bad input rows, and outputs a slightly modified version of the input. The job has two additional requirements:
I must keep track of the number of bad input rows per source to notify those upstream providers.
I must support an output limit per source.
The job seemed straightforward and I approached the problem using accumulators to keep track of the number of filtered rows per source. However, when I implemented the final .limit(N), my accumulator behavior changed. Here's some stripped-down sample code that triggers the behavior on a single source:
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import *
from random import randint

def filter_and_transform_parts(rows, filter_int, accum):
    for r in rows:
        if r[0] == filter_int:
            accum.add(1)
            continue
        yield r[0], r[1] + 1, r[2] + 1

def main():
    spark = SparkSession \
        .builder \
        .appName("Test") \
        .getOrCreate()
    sc = spark.sparkContext
    accum = sc.accumulator(0)

    # 20 inputs w/ tuple having 4 as first element
    inputs = [(4, randint(1, 10), randint(1, 10)) if x % 5 == 0
              else (randint(6, 10), randint(6, 10), randint(6, 10))
              for x in xrange(100)]

    rdd = sc.parallelize(inputs)
    # filter out tuples where 4 is first element
    rdd = rdd.mapPartitions(lambda r: filter_and_transform_parts(r, 4, accum))

    # if no limit, accumulator value is 20
    # if limit and limit_count <= 63, accumulator value is 0
    # if limit and limit_count >= 64, accumulator value is 20
    limit = True
    limit_count = 63

    if limit:
        rdd = rdd.map(lambda r: Row(r[0], r[1], r[2]))
        df_schema = StructType([StructField("val1", IntegerType(), False),
                                StructField("val2", IntegerType(), False),
                                StructField("val3", IntegerType(), False)])
        df = spark.createDataFrame(rdd, schema=df_schema)
        df = df.limit(limit_count)
        df.write.mode("overwrite").csv('foo/')
    else:
        rdd.saveAsTextFile('foo/')

    print "Accum value: {}".format(accum.value)

if __name__ == "__main__":
    main()
The problem is that my accumulator sometimes reports the number of filtered rows and sometimes doesn't, depending on the limit specified and the number of inputs for a source. However, in all cases the filtered rows don't make it into the output, meaning the filtering happened and the accumulator should have a value.
If you can shed some light on this that'd be very helpful, thanks!
Update:
Adding an rdd.persist() call after mapPartitions made the accumulator behavior consistent.
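Concretely, the change was along these lines (a sketch against the sample code above; variable names match the earlier snippet):

# filter out tuples where 4 is first element
rdd = rdd.mapPartitions(lambda r: filter_and_transform_parts(r, 4, accum))
rdd.persist()  # per the update above: persisting here made the accumulator value consistent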

Actually, it doesn't matter what limit_count's value is.
The reason the accumulator value is sometimes 0 is that you update the accumulator inside transformations (e.g. rdd.map, rdd.mapPartitions).
Spark only guarantees that accumulators behave correctly inside actions (e.g. rdd.foreach).
Let's make a small change to your code:
from pyspark.sql import *
from random import randint

def filter_and_transform_parts(rows, filter_int, accum):
    for r in rows:
        if r[0] == filter_int:
            accum.add(1)

def main():
    spark = SparkSession.builder.appName("Test").getOrCreate()
    sc = spark.sparkContext
    print(sc.applicationId)
    accum = sc.accumulator(0)

    inputs = [(4, x * 10, x * 100) if x % 5 == 0
              else (randint(6, 10), x * 10, x * 100)
              for x in xrange(100)]
    rdd = sc.parallelize(inputs)
    rdd.foreachPartition(lambda r: filter_and_transform_parts(r, 4, accum))

    limit = True
    limit_count = 10 or 'whatever'

    if limit:
        rdd = rdd.map(lambda r: Row(val1=r[0], val2=r[1], val3=r[2]))
        df = spark.createDataFrame(rdd)
        df = df.limit(limit_count)
        df.write.mode("overwrite").csv('file:///tmp/output')
    else:
        rdd.saveAsTextFile('file:///tmp/output')

    print "Accum value: {}".format(accum.value)

if __name__ == "__main__":
    main()
The accum value is equal to 20 all the time.
For more information:
http://spark.apache.org/docs/2.0.2/programming-guide.html#accumulators

Related

Different outcome from seemingly equivalent implementation of PySpark transformations

I have a set of Spark DataFrame transforms that gives an out-of-memory error and produces a messy SQL query plan, while a different implementation runs successfully.
%python
import pandas as pd

diction = {
    'key': [1,2,3,4,5,6],
    'f1' : [1,0,1,0,1,0],
    'f2' : [0,1,0,1,0,1],
    'f3' : [1,0,1,0,1,0],
    'f4' : [0,1,0,1,0,1]}
bil = pd.DataFrame(diction)

# successful logic
df = spark.createDataFrame(bil)
df = df.cache()
zdf = df
for i in [1,2,3]:
    tempdf = zdf.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.show()

# failed logic
df = spark.createDataFrame(bil)
df = df.cache()
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.show()
Thinking about it logically, there should not be such a computational difference (more than double the time and memory used).
Can anyone help me understand this?
DAG of successful logic:
DAG of failure logic:
I'm not sure what your use case is for this code, however the two pieces of code are not logically the same. In the second version you are joining the result of the previous iteration to itself three times. In the first version you are joining a 'copy' of the original df three times. If your key column is not unique, the second piece of code will 'explode' your dataframe more than the first.
To make this more clear we can make a simple example below where we have a non-unique key value. Taking your second example:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.count()
>>> 257
And your first piece of code:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
zdf = df
for i in [1,2,3]:
    tempdf = zdf.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.count()
>>> 17
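To see where 17 and 257 come from, here is a rough back-of-the-envelope check (plain Python, no Spark needed; key 1 starts with 2 rows and key 3 with 1 row):

# first version: tempdf always comes from the original 3-row frame,
# so every key-1 row matches exactly 2 rows on each of the 3 joins
dup = 2
for _ in range(3):
    dup *= 2
print(dup + 1)    # 17 (16 rows for key 1, plus 1 row for key 3)

# second version: tempdf is rebuilt from the growing df each iteration,
# so every key-1 row matches all key-1 rows accumulated so far
dup = 2
for _ in range(3):
    dup *= dup
print(dup + 1)    # 257 (256 rows for key 1, plus 1 row for key 3)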

Check ASCII pyspark dataframe

I need to check in a PySpark dataframe whether all the values are ASCII. I do that with the following:
def is_ascii(s):
    if s:
        return all(ord(c) < 128 for c in s)
    else:
        return None

is_ascii_udf = udf(lambda l: is_ascii(l), BooleanType())

df_result = df.select( *map(lambda col: is_ascii_udf(df[col]).alias(col), df.columns ) )
I am trying to use this with a new dataset that has 50MM rows and 9000 columns, and I get this error:
ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues.
It seems that memory fills up. I cannot get a bigger cluster, so I thought of doing the following:
import pyspark.sql.functions as F
import pandas as pd
from pyspark.sql.types import *
df = spark.read.parquet( path)
for i in df.columns:
    df = spark.read.parquet( path)
    df_result = df.select( *map(lambda col: is_ascii_udf(df[col]).alias(col), [i] ) )
    n = df_result.filter( ~F.col(i) ).count()
    if n>0:
        print(i,n)
But I get the same error. Why am I still getting it if each time I re-read the dataframe and apply the UDF to just one column?
The cluster has 50 GB of memory, 6 cores, max 8 workers
I think the error is with the function, or how I am using it
Regards
It could be possible that even running it on one column is too much for your cluster. Anyway, there are Spark SQL methods available for doing what you want, which should be more efficient in terms of both performance and memory.
The code below will give, for each column, either a Boolean (whether any non-ASCII character is present) or the count of values containing non-ASCII characters, and collect the result into a list.
df.createOrReplaceTempView('df')
is_not_ascii = [[col, spark.sql('select max(array_max(transform(split(%s, ""), x -> ascii(x))) >= 128) as is_ascii from df' % col).collect()[0][0]] for col in df.columns]
# e.g. [['key', False], ['val', False]]
count_not_ascii = [[col, spark.sql('select sum(cast(array_max(transform(split(%s, ""), x -> ascii(x))) >= 128 as int)) as is_ascii from df' % col).collect()[0][0]] for col in df.columns]
# e.g. [['key', 0], ['val', 0]]
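If you prefer to stay in the DataFrame API, roughly the same check can be written with the built-in higher-order functions. This is only a sketch: it assumes string columns and Spark 3.1+ for pyspark.sql.functions.transform with a Python lambda; with 9000 columns you may still want to loop column by column to keep the plan small.

from pyspark.sql import functions as F

# one Boolean per column: True if any value contains a character with code point >= 128
has_non_ascii = df.select([
    F.max(F.array_max(F.transform(F.split(F.col(c), ''), lambda ch: F.ascii(ch))) >= 128).alias(c)
    for c in df.columns
])
has_non_ascii.show()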

Alternate or better approach to aggregateByKey in pyspark RDD

I have a weather data csv file in which each entry has a station ID and the minimum or maximum value recorded for that day. The second element is a keyword indicating what the value represents. Sample input is below.
stationID feature value
ITE00100554 TMAX -75
ITE00100554 TMIN -148
GM000010962 PRCP 0
EZE00100082 TMAX -86
EZE00100082 TMIN -135
ITE00100554 TMAX -60
ITE00100554 TMIN -125
GM000010962 PRCP 0
EZE00100082 TMAX -44
EZE00100082 TMIN -130
ITE00100554 TMAX -23
I have filtered the entries to those with TMIN or TMAX. Each entry is recorded for a given date. I have stripped out the date while building my RDD as it's not of interest. My goal is to find the min and max value of each station amongst all of its records, i.e.,
ITE00100554, 'TMIN', <global_min_value recorded by that station>
ITE00100554, 'TMAX', <global_max_value>
EZE00100082, 'TMIN', <global_min_value>
EZE00100082, 'TMAX', <global_max_value>
I was able to accomplish this using aggregateByKey, but according to this link https://backtobazics.com/big-data/spark/apache-spark-aggregatebykey-example/ I don't have to use aggregateByKey since the input and output value formats are the same. So I would like to know if there is an alternate or better way to code this without defining so many functions.
import csv

stationtemps = entries.filter(lambda x: x[1] in ['TMIN', 'TMAX']).map(lambda x: (x[0], (x[1], x[2])))  # (stationID, (tempkey, value))
max_temp = stationtemps.values().values().max()
min_temp = stationtemps.values().values().min()

def max_seqOp(accumulator, element):
    return (accumulator if accumulator[1] > element[1] else element)

def max_combOp(accu1, accu2):
    return (accu1 if accu1[1] > accu2[1] else accu2)

def min_seqOp(accumulator, element):
    return (accumulator if accumulator[1] < element[1] else element)

def min_combOp(accu1, accu2):
    return (accu1 if accu1[1] < accu2[1] else accu2)

station_max_temps = stationtemps.aggregateByKey(('', min_temp), max_seqOp, max_combOp).sortByKey()
station_min_temps = stationtemps.aggregateByKey(('', max_temp), min_seqOp, min_combOp).sortByKey()
min_max_temps = station_max_temps.zip(station_min_temps).collect()

with open('1800_min_max.csv', 'w') as fd:
    writer = csv.writer(fd)
    writer.writerows(map(lambda x: list(list(x)), min_max_temps))
I am learning pyspark and haven't mastered all the different transformation functions.
Here is simulated input, and if the min and max are filled in correctly, then why the need for the TMIN, TMAX indicator? Indeed, no need for an accumulator.
rdd = sc.parallelize([ ('s1','tmin',-3), ('s1','tmax', 5), ('s2','tmin',0), ('s2','tmax', 7), ('s0','tmax',14), ('s0','tmin', 3) ])
rddcollect = rdd.collect()
#print(rddcollect)
rdd2 = rdd.map(lambda x: (x[0], x[2]))
#rdd2collect = rdd2.collect()
#print(rdd2collect)
rdd3 = rdd2.groupByKey().sortByKey()
rdd4 = rdd3.map(lambda k_v: ( k_v[0], (sorted(k_v[1]))) )
rdd4.collect()
returns:
Out[27]: [('s0', [3, 14]), ('s1', [-3, 5]), ('s2', [0, 7])]
ALTERNATE ANSWER
After clarification, and assuming that min and max values make sense, with my own data (there are other solutions, BTW). Here goes:
include = ['tmin','tmax']
rdd0 = sc.parallelize([ ('s1','tmin',-3), ('s1','tmax', 5), ('s2','tmin',0), ('s2','tmin',-12), ('s2','tmax', 7), ('s2','tmax', 17), ('s2','tother', 17), ('s0','tmax',14), ('s0','tmin', 3) ])
rdd1 = rdd0.filter(lambda x: any(e in x for e in include) )
rdd2 = rdd1.map(lambda x: ( (x[0],x[1]), x[2]))
rdd3 = rdd2.groupByKey().sortByKey()
rdd4Min = rdd3.filter(lambda k_v: k_v[0][1] == 'tmin').map(lambda k_v: ( k_v[0][0], min( k_v[1] ) ))
rdd4Max = rdd3.filter(lambda k_v: k_v[0][1] == 'tmax').map(lambda k_v: ( k_v[0][0], max( k_v[1] ) ))
rdd5=rdd4Min.union(rdd4Max)
rdd6 = rdd5.groupByKey().sortByKey()
res = rdd6.map(lambda k_v: ( k_v[0], (sorted(k_v[1]))))
rescollect = res.collect()
print(rescollect)
returns:
[('s0', [3, 14]), ('s1', [-3, 5]), ('s2', [-12, 17])]
Following the same logic as @thebluephantom, this was my final code while reading from csv:
def get_temp_points(item):
    if item[0][1] == 'TMIN':
        return (item[0], min(item[1]))
    else:
        return (item[0], max(item[1]))

data = lines.filter(lambda x: any(ele for ele in x if ele in ['TMIN', 'TMAX']))
temps = data.map(lambda x: ((x[0], x[2]), float(x[3])))
temp_list = temps.groupByKey().mapValues(list)
# ((stationID, 'TMIN'/'TMAX'), list_of_values)
min_max_temps = temp_list.map(get_temp_points).collect()
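For completeness, if the TMIN/TMAX indicator really can be ignored (as suggested above), the per-station min and max can also be computed in a single pass with reduceByKey and no helper functions. This is only a sketch, written against the rdd0 data from the alternate answer above:

include = ('tmin', 'tmax')
pairs = rdd0.filter(lambda x: x[1] in include) \
            .map(lambda x: (x[0], (x[2], x[2])))          # (station, (value, value))
min_max = pairs.reduceByKey(lambda a, b: (min(a[0], b[0]), max(a[1], b[1])))
print(sorted(min_max.collect()))
# [('s0', (3, 14)), ('s1', (-3, 5)), ('s2', (-12, 17))]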

Spark - How to join current and previous records in a DataFrame and assign an original field to all such occurences

I need to scan through a Hive table and add values from the first record in a sequence to all linked records.
The logic would be:-
1. Find the first record (where previous_id is blank).
2. Find the next record (current_id = previous_id).
3. Repeat until there are no more linked records.
4. Add columns from the original record to all linked records.
5. Output results to a Hive table.
Example Source Data:-
current_id   previous_id   start_date
----------   -----------   ----------
100                        01/01/2001
200          100           02/02/2002
300          200           03/03/2003
Example Output Data:-
current_id start_date
---------- ----------
100 01/01/2001
200 01/01/2001
300 01/01/2001
I can achieve this by creating two DataFrames from the source table and performing multiple joins. However, this approach does not seem ideal as data has to be cached to avoid re-querying the source data with each iteration.
Any suggestions on how to approach this problem?
I think you can accomplish this using GraphFrames connected components.
It will help you avoid writing the checkpointing and looping logic yourself. Essentially you create a graph from the current_id and previous_id pairs and use GraphFrames to compute the component for each vertex. That resulting DataFrame can then be joined to the original DataFrame to get the start_date.
from graphframes import *

sc.setCheckpointDir("/tmp/chk")

input = spark.createDataFrame([
    (100, None, "2001-01-01"),
    (200, 100, "2002-02-02"),
    (300, 200, "2003-03-03"),
    (400, None, "2004-04-04"),
    (500, 400, "2005-05-05"),
    (600, 500, "2006-06-06"),
    (700, 300, "2007-07-07")
], ["current_id", "previous_id", "start_date"])
input.show()

vertices = input.select(input.current_id.alias("id"))
edges = input.select(input.current_id.alias("src"), input.previous_id.alias("dst"))
graph = GraphFrame(vertices, edges)
result = graph.connectedComponents()

# join the component ids back to the first record of each chain (previous_id is null)
first = input.where(input.previous_id.isNull())
result.join(first, result.component == first.current_id)\
    .select(result.id.alias("current_id"), first.start_date)\
    .orderBy("current_id")\
    .show()
Results in the following output:
+----------+----------+
|current_id|start_date|
+----------+----------+
| 100|2001-01-01|
| 200|2001-01-01|
| 300|2001-01-01|
| 400|2004-04-04|
| 500|2004-04-04|
| 600|2004-04-04|
| 700|2001-01-01|
+----------+----------+
Here is an approach that I am not sure sits well with Spark.
There is a lack of a grouping id / key for the data.
Not sure how Catalyst would be able to optimize this - will look at it at a later point in time. Memory errors if too large?
I have made the data more complicated, and this does work. Here goes:
# No grouping key evident, more a linked list with ascending current_ids.
# Added more complexity to the example.
# Questions open on performance at scale. Interested to see how well Catalyst handles this.
# Really needs some grouping id/key in the data.
from pyspark.sql import functions as f
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

# Started from a dataframe.
# Some more realistic data? At least more complex.
columns = ['current_id', 'previous_id', 'start_date']
vals = [
    (100, None, '2001/01/01'),
    (200, 100, '2002/02/02'),
    (300, 200, '2003/03/03'),
    (400, None, '2005/01/01'),
    (500, 400, '2006/02/02'),
    (600, 300, '2007/02/02'),
    (700, 600, '2008/02/02'),
    (800, None, '2009/02/02'),
    (900, 800, '2010/02/02')
]
df = spark.createDataFrame(vals, columns)
df.createOrReplaceTempView("trans")

# Starting data: the null / None entries.
df2 = spark.sql("""
    select *
    from trans
    where previous_id is null
""")
df2.cache()
df2.createOrReplaceTempView("trans_0")

# Loop through the data, traversing the linked elements until exhaustion, writing to dynamically named TempViews.
# May need to checkpoint? Depends on depth of chain of linked items.
# Spark is not well suited to this type of processing.
dfX_cnt = 1
cnt = 1
while (dfX_cnt != 0):
    tabname_prev = 'trans_' + str(cnt - 1)
    tabname = 'trans_' + str(cnt)
    query = "select t2.current_id, t2.previous_id, t1.start_date from {} t1, trans t2 where t1.current_id = t2.previous_id".format(tabname_prev)
    dfX = spark.sql(query)
    dfX.cache()
    dfX_cnt = dfX.count()
    if (dfX_cnt != 0):
        # print('Looping for dynamic creation of TempViews')
        dfX.createOrReplaceTempView(tabname)
        cnt = cnt + 1

# Reduce the TempViews all to one DF. Can reduce an array of DFs as well, but could not find my notes here in this regard.
# Will memory errors occur?
from pyspark.sql.types import *
fields = [StructField('current_id', LongType(), False),
          StructField('previous_id', LongType(), True),
          StructField('start_date', StringType(), False)]
schema = StructType(fields)
dfZ = spark.createDataFrame(sc.emptyRDD(), schema)
for i in range(0, cnt, 1):
    tabname = 'trans_' + str(i)
    query = "select * from {}".format(tabname)
    df = spark.sql(query)
    dfZ = dfZ.union(df)

# Show final results.
dfZ.select('current_id', 'start_date').sort(col('current_id')).show()
returns:
+----------+----------+
|current_id|start_date|
+----------+----------+
| 100|2001/01/01|
| 200|2001/01/01|
| 300|2001/01/01|
| 400|2005/01/01|
| 500|2005/01/01|
| 600|2001/01/01|
| 700|2001/01/01|
| 800|2009/02/02|
| 900|2009/02/02|
+----------+----------+
Thanks for the suggestions posted here. After trying various approaches I have gone with the following solution which works for multiple iterations (e.g. 20 loops), and does not cause any memory issues.
The "Physical Plan" is still huge, but caching means most of the steps are skipped, keeping performance acceptable.
input = spark.createDataFrame([
    (100, None, '2001/01/01'),
    (200, 100, '2002/02/02'),
    (300, 200, '2003/03/03'),
    (400, None, '2005/01/01'),
    (500, 400, '2006/02/02'),
    (600, 300, '2007/02/02'),
    (700, 600, '2008/02/02'),
    (800, None, '2009/02/02'),
    (900, 800, '2010/02/02')
], ["current_id", "previous_id", "start_date"])

input.createOrReplaceTempView("input")

cur = spark.sql("select * from input where previous_id is null")
nxt = spark.sql("select * from input where previous_id is not null")

cur.cache()
nxt.cache()

cur.createOrReplaceTempView("cur0")
nxt.createOrReplaceTempView("nxt")

i = 1
while True:
    spark.sql("set table_name=cur" + str(i - 1))
    cur = spark.sql(
        """
        SELECT nxt.current_id  as current_id,
               nxt.previous_id as previous_id,
               cur.start_date  as start_date
        FROM   ${table_name} cur,
               nxt nxt
        WHERE  cur.current_id = nxt.previous_id
        """).cache()
    cur.createOrReplaceTempView("cur" + str(i))
    i = i + 1
    if cur.count() == 0:
        break

for x in range(0, i):
    spark.sql("set table_name=cur" + str(x))
    cur = spark.sql("select * from ${table_name}")
    if x == 0:
        out = cur
    else:
        out = out.union(cur)

PySpark UDF with multiple arguments returns null

I have a PySpark DataFrame with two columns (A and B, both of type double) whose values are either 0.0 or 1.0.
I am trying to add a new column, which is the sum of those two.
I followed examples in Pyspark: Pass multiple columns in UDF
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, StringType
sum_cols = F.udf(lambda x: x[0]+x[1], IntegerType())
df_with_sum = df.withColumn('SUM_COL',sum_cols(F.array('A','B')))
df_with_sum.select(['SUM_COL']).toPandas()
This shows a Series of NULLs instead of the results I expect.
I tried each of the following to see if there's an issue with data types:
sum_cols = F.udf(lambda x: x[0], IntegerType())
sum_cols = F.udf(lambda x: int(x[0]), IntegerType())
I am still getting NULLs.
I tried removing the array:
sum_cols = F.udf(lambda x: x, IntegerType())
df_with_sum = df.withColumn('SUM_COL',sum_cols(df.A))
This works fine and shows 0/1
I tried removing the UDF, but leaving the array:
df_with_sum = df.withColumn('SUM_COL', F.array('A','B'))
This works fine and shows a series of arrays of [0.0/1.0, 0.0/1.0]
So, array works fine, UDF works fine, it is just when I try to pass an array to UDF that things break down. What am I doing wrong?
The problem is that you are trying to return a double from a function that is declared to return an integer. The types do not match, and PySpark by default silently falls back to NULL when the cast fails:
df_with_doubles = spark.createDataFrame([(1.0,1.0), (2.0,2.0)], ['A', 'B'])
sum_cols = F.udf(lambda x: x[0]+x[1], IntegerType())
df_with_sum = df_with_doubles.withColumn('SUM_COL',sum_cols(F.array('A','B')))
df_with_sum.select(['SUM_COL']).toPandas()
You get:
SUM_COL
0 None
1 None
However, if you do:
df_with_integers = spark.createDataFrame([(1,1), (2,2)], ['A', 'B'])
sum_cols = F.udf(lambda x: x[0]+x[1], IntegerType())
df_with_sum = df_with_integers.withColumn('SUM_COL',sum_cols(F.array('A','B')))
df_with_sum.select(['SUM_COL']).toPandas()
You get:
SUM_COL
0 2
1 4
So, either cast your columns to IntegerType beforehand (or cast them in the UDF), or change the return type of the UDF to DoubleType.
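For example, either of these variants should return non-null values on the original double-typed columns (a sketch against the df from the question):

import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType, IntegerType

# declare a return type that matches what the lambda actually produces
sum_cols_double = F.udf(lambda x: x[0] + x[1], DoubleType())

# or keep IntegerType and make the lambda return an int
sum_cols_int = F.udf(lambda x: int(x[0] + x[1]), IntegerType())

df_with_sum = df.withColumn('SUM_COL', sum_cols_double(F.array('A', 'B')))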
