Check ASCII pyspark dataframe - python-3.x

I need to check whether all the values in a PySpark dataframe are ASCII. I do that with the following:
def is_ascii(s):
    if s:
        return all(ord(c) < 128 for c in s)
    else:
        return None
is_ascii_udf = udf(lambda l: is_ascii(l), BooleanType() )
df_result = df.select( *map(lambda col: is_ascii_udf(df[col]).alias(col), df.columns ) )
I am trying to use this on a new dataset that has 50MM rows and 9000 columns, and I get this error:
ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues.
It seems that the memory is full. I cannot get a bigger cluster, so I thought of doing the following:
import pyspark.sql.functions as F
import pandas as pd
from pyspark.sql.types import *
df = spark.read.parquet( path)
for i in df.columns:
    df = spark.read.parquet( path)
    df_result = df.select( *map(lambda col: is_ascii_udf(df[col]).alias(col), [i] ) )
    n = df_result.filter( ~F.col(i) ).count()
    if n>0:
        print(i,n)
But I get the same error. Why am I still getting it if I re-read the dataframe each time and apply the UDF to just one column?
The cluster has 50 GB of memory, 6 cores, max 8 workers
I think the error is with the function, or how I am using it
Regards

It could be possible that even running it on one column is too much for your cluster. Anyway, there are Spark SQL methods available for doing what you want, which should be more efficient in terms of both performance and memory.
The code below gives, for each column, either a Boolean indicating whether the column contains any non-ASCII characters or the count of values that do, and collects the results into a list.
df.createOrReplaceTempView('df')
is_not_ascii = [[col, spark.sql('select max(array_max(transform(split(%s, ""), x -> ascii(x))) >= 128) as is_ascii from df' % col).collect()[0][0]] for col in df.columns]
# e.g. [['key', False], ['val', False]]
count_not_ascii = [[col, spark.sql('select sum(cast(array_max(transform(split(%s, ""), x -> ascii(x))) >= 128 as int)) as is_ascii from df' % col).collect()[0][0]] for col in df.columns]
# e.g. [['key', 0], ['val', 0]]
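For intuition, the per-value test inside that SQL expression (does any character have a code point of 128 or above?) can be sketched in plain Python. This only illustrates the logic and is not Spark code; the helper names `has_non_ascii` and `count_non_ascii` are hypothetical.

```python
import re

# Matches any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

def has_non_ascii(value):
    """Return True if the string contains at least one non-ASCII character."""
    return bool(NON_ASCII.search(value))

def count_non_ascii(values):
    """Count the values of a column that contain non-ASCII characters, skipping nulls."""
    return sum(1 for v in values if v is not None and has_non_ascii(v))

# Hypothetical column values: two of them contain a non-ASCII character.
column = ["abc", "caf\u00e9", None, "", "na\u00efve"]
print(count_non_ascii(column))
```

The same character class should also work with the Column `rlike` method, which would avoid splitting each string into an array, though that variant is untested here.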

Related

Different outcome from seemingly equivalent implementation of PySpark transformations

I have a set of Spark dataframe transforms which gives an out-of-memory error and has a messed-up SQL query plan, while a different implementation runs successfully.
%python
import pandas as pd
diction = {
    'key': [1,2,3,4,5,6],
    'f1' : [1,0,1,0,1,0],
    'f2' : [0,1,0,1,0,1],
    'f3' : [1,0,1,0,1,0],
    'f4' : [0,1,0,1,0,1]}
bil = pd.DataFrame(diction)
# successful logic
df = spark.createDataFrame(bil)
df = df.cache()
zdf = df
for i in [1,2,3]:
    tempdf = zdf.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.show()
# failed logic
df = spark.createDataFrame(bil)
df = df.cache()
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.show()
Logically thinking there must not be such a computational difference (more than double the time and memory used).
Can anyone help me understand this ?
DAG of successful logic:
DAG of failure logic:
I'm not sure what your use case is for this code, however the two pieces of code are not logically the same. In the second version you are joining the result of the previous iteration to itself three times. In the first version you are joining a 'copy' of the original df three times. If your key column is not unique, the second piece of code will 'explode' your dataframe more than the first.
To make this more clear we can make a simple example below where we have a non-unique key value. Taking your second example:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.count()
>>> 257
And your first piece of code:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
zdf = df
for i in [1,2,3]:
    tempdf = zdf.select(['key'])
    df = df.join(tempdf,on=['key'],how='left')
df.count()
>>> 17
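The two counts can be checked without Spark by tracking how many rows each key has. This sketch assumes every key on the left finds a match on the right (true here, since the right side is derived from the same data), so the left join multiplies the per-key row counts. The function name `join_count` is hypothetical.

```python
from collections import Counter

def join_count(left, right):
    """Per-key row counts after joining on the key (every left key has a match)."""
    return Counter({k: n * right[k] for k, n in left.items()})

original = Counter([1, 1, 3])  # key column of the example dataframe

# Second version: join the running result with its own keys.
df = original
for _ in range(3):
    df = join_count(df, df)
self_join_rows = sum(df.values())

# First version: always join with the keys of the original dataframe.
df = original
for _ in range(3):
    df = join_count(df, original)
copy_join_rows = sum(df.values())

print(self_join_rows, copy_join_rows)  # 257 17
```

For key 1 the self-join squares the row count on every pass (2, 4, 16, 256), while joining against the original only doubles it (2, 4, 8, 16).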

Spark - How to join current and previous records in a DataFrame and assign an original field to all such occurences

I need to scan through a Hive table and add values from the first record in a sequence to all linked records.
The logic would be:-
Find the first record (where previous_id is blank).
Find the next record (current_id = previous_id).
Repeat until there are no more linked records.
Add columns from original record to all linked records.
Output results to a Hive table.
Example Source Data:-
current_id previous_id start_date
---------- ----------- ----------
100                    01/01/2001
200        100         02/02/2002
300        200         03/03/2003
Example Output Data:-
current_id start_date
---------- ----------
100        01/01/2001
200        01/01/2001
300        01/01/2001
I can achieve this by creating two DataFrames from the source table and performing multiple joins. However, this approach does not seem ideal as data has to be cached to avoid re-querying the source data with each iteration.
Any suggestions on how to approach this problem?
I think you can accomplish this using GraphFrames connected components.
It will help you avoid writing the checkpointing and looping logic yourself. Essentially you create a graph from the current_id and previous_id pairs and use GraphFrames to find the component for each vertex. The resulting DataFrame can then be joined to the original DataFrame to get the start_date.
from graphframes import *
sc.setCheckpointDir("/tmp/chk")
input = spark.createDataFrame([
    (100, None, "2001-01-01"),
    (200, 100, "2002-02-02"),
    (300, 200, "2003-03-03"),
    (400, None, "2004-04-04"),
    (500, 400, "2005-05-05"),
    (600, 500, "2006-06-06"),
    (700, 300, "2007-07-07")
], ["current_id", "previous_id", "start_date"])
input.show()
vertices = input.select(input.current_id.alias("id"))
edges = input.select(input.current_id.alias("src"), input.previous_id.alias("dst"))
graph = GraphFrame(vertices, edges)
result = graph.connectedComponents()
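# join() needs a DataFrame, so filter the first records (no previous_id) before joining
result.join(input.filter(input.previous_id.isNull()), result.component == input.current_id)\
    .select(result.id.alias("current_id"), input.start_date)\
    .orderBy("current_id")\
    .show()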
Results in the following output:
+----------+----------+
|current_id|start_date|
+----------+----------+
|       100|2001-01-01|
|       200|2001-01-01|
|       300|2001-01-01|
|       400|2004-04-04|
|       500|2004-04-04|
|       600|2004-04-04|
|       700|2001-01-01|
+----------+----------+
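For small inputs, the expected result can be cross-checked without GraphFrames by following each record's previous_id chain back to its first record. This is a plain-Python sketch of the intended logic, not a distributed solution; `root_start_dates` is a hypothetical helper, and it assumes that every previous_id refers to an existing record and that the chains contain no cycles.

```python
def root_start_dates(records):
    """Map each current_id to the start_date of the first record in its chain.

    records: iterable of (current_id, previous_id, start_date) tuples.
    """
    by_id = {cur: (prev, date) for cur, prev, date in records}
    result = {}
    for cur in by_id:
        node = cur
        # Walk backwards until we reach a record with no predecessor.
        while by_id[node][0] is not None:
            node = by_id[node][0]
        result[cur] = by_id[node][1]
    return result

records = [
    (100, None, "2001-01-01"),
    (200, 100, "2002-02-02"),
    (300, 200, "2003-03-03"),
    (400, None, "2004-04-04"),
    (500, 400, "2005-05-05"),
    (600, 500, "2006-06-06"),
    (700, 300, "2007-07-07"),
]
roots = root_start_dates(records)
for current_id in sorted(roots):
    print(current_id, roots[current_id])
```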
Here is an approach that I am not sure sits well with Spark.
There is a lack of a grouping id / key for the data.
Not sure how Catalyst would be able to optimize this; I will look at it at a later point in time. Memory errors if too large?
Have made the data more complicated, and this does work. Here goes:
# No grouping key evident, more a linked list with asc current_ids.
# Added more complexity to the example.
# Questions open on performance at scale. Interested to see how well Catalyst handles this.
# Need really some grouping id/key in the data.
from pyspark.sql import functions as f
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
# Started from dataframe.
# Some more realistic data? At least more complex.
columns = ['current_id', 'previous_id', 'start_date']
vals = [
    (100, None, '2001/01/01'),
    (200, 100, '2002/02/02'),
    (300, 200, '2003/03/03'),
    (400, None, '2005/01/01'),
    (500, 400, '2006/02/02'),
    (600, 300, '2007/02/02'),
    (700, 600, '2008/02/02'),
    (800, None, '2009/02/02'),
    (900, 800, '2010/02/02')
]
df = spark.createDataFrame(vals, columns)
df.createOrReplaceTempView("trans")
# Starting data. The null / None entries.
df2 = spark.sql("""
select *
from trans
where previous_id is null
""")
df2.cache()
df2.createOrReplaceTempView("trans_0")
# Loop through the stuff based on traversing the list elements until exhaustion of data, and, write to dynamically named TempViews.
# May need to checkpoint? Depends on depth of chain of linked items.
# Spark not well suited to this type of processing.
dfX_cnt = 1
cnt = 1
while (dfX_cnt != 0):
    tabname_prev = 'trans_' + str(cnt-1)
    tabname = 'trans_' + str(cnt)
    query = "select t2.current_id, t2.previous_id, t1.start_date from {} t1, trans t2 where t1.current_id = t2.previous_id".format(tabname_prev)
    dfX = spark.sql(query)
    dfX.cache()
    dfX_cnt = dfX.count()
    if (dfX_cnt!=0):
        #print('Looping for dynamic creation of TempViews')
        dfX.createOrReplaceTempView(tabname)
        cnt=cnt+1
# Reduce the TempViews all to one DF. Can reduce an array of DF's as well, but could not find my notes here in this regard.
# Will memory errors occur?
from pyspark.sql.types import *
fields = [StructField('current_id', LongType(), False),
          StructField('previous_id', LongType(), True),
          StructField('start_date', StringType(), False)]
schema = StructType(fields)
dfZ = spark.createDataFrame(sc.emptyRDD(), schema)
for i in range(0,cnt,1):
    tabname = 'trans_' + str(i)
    query = "select * from {}".format(tabname)
    df = spark.sql(query)
    dfZ = dfZ.union(df)
# Show final results.
dfZ.select('current_id', 'start_date').sort(col('current_id')).show()
returns:
+----------+----------+
|current_id|start_date|
+----------+----------+
|       100|2001/01/01|
|       200|2001/01/01|
|       300|2001/01/01|
|       400|2005/01/01|
|       500|2005/01/01|
|       600|2001/01/01|
|       700|2001/01/01|
|       800|2009/02/02|
|       900|2009/02/02|
+----------+----------+
Thanks for the suggestions posted here. After trying various approaches I have gone with the following solution which works for multiple iterations (e.g. 20 loops), and does not cause any memory issues.
The "Physical Plan" is still huge, but caching means most of the steps are skipped, keeping performance acceptable.
input = spark.createDataFrame([
    (100, None, '2001/01/01'),
    (200, 100, '2002/02/02'),
    (300, 200, '2003/03/03'),
    (400, None, '2005/01/01'),
    (500, 400, '2006/02/02'),
    (600, 300, '2007/02/02'),
    (700, 600, '2008/02/02'),
    (800, None, '2009/02/02'),
    (900, 800, '2010/02/02')
], ["current_id", "previous_id", "start_date"])
input.createOrReplaceTempView("input")
cur = spark.sql("select * from input where previous_id is null")
nxt = spark.sql("select * from input where previous_id is not null")
cur.cache()
nxt.cache()
cur.createOrReplaceTempView("cur0")
nxt.createOrReplaceTempView("nxt")
i = 1
while True:
    spark.sql("set table_name=cur" + str(i - 1))
    cur = spark.sql(
        """
        SELECT nxt.current_id as current_id,
               nxt.previous_id as previous_id,
               cur.start_date as start_date
        FROM ${table_name} cur,
             nxt nxt
        WHERE cur.current_id = nxt.previous_id
        """).cache()
    cur.createOrReplaceTempView("cur" + str(i))
    i = i + 1
    if cur.count() == 0:
        break
for x in range(0, i):
    spark.sql("set table_name=cur" + str(x))
    cur = spark.sql("select * from ${table_name}")
    if x == 0:
        out = cur
    else:
        out = out.union(cur)
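The loop above expands one level of the chain per iteration: it starts from the records without a previous_id, then repeatedly joins the current level to the records that point at it, stopping when a level comes back empty. The same idea in plain Python, as a sketch for checking the logic on small data (the name `propagate_by_level` is hypothetical, and the dictionaries stand in for the cached temp views):

```python
from collections import defaultdict

def propagate_by_level(records):
    """Give each record the start_date of its chain's first record, one level at a time."""
    children = defaultdict(list)  # previous_id -> [current_id, ...]
    frontier = {}                 # current level: current_id -> inherited start_date
    for cur, prev, date in records:
        if prev is None:
            frontier[cur] = date
        else:
            children[prev].append(cur)
    out = {}
    while frontier:               # stop once a level comes back empty
        out.update(frontier)
        frontier = {child: date
                    for parent, date in frontier.items()
                    for child in children[parent]}
    return out

records = [
    (100, None, '2001/01/01'),
    (200, 100, '2002/02/02'),
    (300, 200, '2003/03/03'),
    (400, None, '2005/01/01'),
    (500, 400, '2006/02/02'),
    (600, 300, '2007/02/02'),
    (700, 600, '2008/02/02'),
    (800, None, '2009/02/02'),
    (900, 800, '2010/02/02'),
]
levels = propagate_by_level(records)
for current_id in sorted(levels):
    print(current_id, levels[current_id])
```

The number of iterations equals the length of the longest chain, which is why the depth of the linked records matters more than the row count for this approach.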

PySpark UDF with multiple arguments returns null

I have a PySpark Dataframe with two columns (A, B, whose type is double) whose values are either 0.0 or 1.0.
I am trying to add a new column, which is the sum of those two.
I followed examples in Pyspark: Pass multiple columns in UDF
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, StringType
sum_cols = F.udf(lambda x: x[0]+x[1], IntegerType())
df_with_sum = df.withColumn('SUM_COL',sum_cols(F.array('A','B')))
df_with_sum.select(['SUM_COL']).toPandas()
This shows a Series of NULLs instead of the results I expect.
I tried each of the following to see whether there is an issue with data types:
sum_cols = F.udf(lambda x: x[0], IntegerType())
sum_cols = F.udf(lambda x: int(x[0]), IntegerType())
still getting Nulls.
I tried removing the array:
sum_cols = F.udf(lambda x: x, IntegerType())
df_with_sum = df.withColumn('SUM_COL',sum_cols(df.A))
This works fine and shows 0/1
I tried removing the UDF, but leaving the array:
df_with_sum = df.withColumn('SUM_COL', F.array('A','B'))
This works fine and shows a series of arrays of [0.0/1.0, 0.0/1.0]
So, array works fine, UDF works fine, it is just when I try to pass an array to UDF that things break down. What am I doing wrong?
The problem is that you are trying to return a double in a function that is supposed to output an integer, which does not fit, and pyspark by default silently resorts to NULL when a casting fails:
df_with_doubles = spark.createDataFrame([(1.0,1.0), (2.0,2.0)], ['A', 'B'])
sum_cols = F.udf(lambda x: x[0]+x[1], IntegerType())
df_with_sum = df_with_doubles.withColumn('SUM_COL',sum_cols(F.array('A','B')))
df_with_sum.select(['SUM_COL']).toPandas()
You get:
  SUM_COL
0    None
1    None
However, if you do:
df_with_integers = spark.createDataFrame([(1,1), (2,2)], ['A', 'B'])
sum_cols = F.udf(lambda x: x[0]+x[1], IntegerType())
df_with_sum = df_with_integers.withColumn('SUM_COL',sum_cols(F.array('A','B')))
df_with_sum.select(['SUM_COL']).toPandas()
You get:
   SUM_COL
0        2
1        4
So, either cast your columns to IntegerType beforehand (or cast them in the UDF), or change the return type of the UDF to DoubleType.
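The type mismatch is visible in plain Python: adding two doubles yields a float, which does not match a declared IntegerType. Converting inside the function is one of the options mentioned above. A minimal sketch of the function bodies only, without Spark; the names `sum_pair` and `sum_pair_as_int` are hypothetical.

```python
def sum_pair(values):
    """What the original lambda computes: a float when the inputs are doubles."""
    return values[0] + values[1]

def sum_pair_as_int(values):
    """Variant that converts to int, so it matches an integer return type."""
    return int(values[0] + values[1])

print(type(sum_pair([1.0, 0.0])).__name__)         # float
print(type(sum_pair_as_int([1.0, 0.0])).__name__)  # int
```

Wrapped as `F.udf(sum_pair_as_int, IntegerType())`, or left as `sum_pair` with `DoubleType()`, the declared and actual return types agree.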

Spark Accumulator confusion

I'm writing a Spark job that takes in data from multiple sources, filters bad input rows, and outputs a slightly modified version of the input. The job has two additional requirements:
I must keep track of the number of bad inputs rows per source to notify those upstream providers.
I must support an output limit per source.
The job seemed straightforward and I approached the problem using accumulators to keep track of the number of filtered rows per source. However, when I implemented the final .limit(N), my accumulator behavior changed. Here's some stripped-down sample code that triggers the behavior on a single source:
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import *
from random import randint
def filter_and_transform_parts(rows, filter_int, accum):
    for r in rows:
        if r[0] == filter_int:
            accum.add(1)
            continue
        yield r[0], r[1] + 1, r[2] + 1
def main():
    spark= SparkSession \
        .builder \
        .appName("Test") \
        .getOrCreate()
    sc = spark.sparkContext
    accum = sc.accumulator(0)
    # 20 inputs w/ tuple having 4 as first element
    inputs = [(4, randint(1, 10), randint(1, 10)) if x % 5 == 0 else (randint(6, 10), randint(6, 10), randint(6, 10)) for x in xrange(100)]
    rdd = sc.parallelize(inputs)
    # filter out tuples where 4 is first element
    rdd = rdd.mapPartitions(lambda r: filter_and_transform_parts(r, 4, accum))
    # if not limit, accumulator value is 20
    # if limit and limit_count <= 63, accumulator value is 0
    # if limit and limit_count >= 64, accumulator value is 20
    limit = True
    limit_count = 63
    if limit:
        rdd = rdd.map(lambda r: Row(r[0], r[1], r[2]))
        df_schema = StructType([StructField("val1", IntegerType(), False),
                                StructField("val2", IntegerType(), False),
                                StructField("val3", IntegerType(), False)])
        df = spark.createDataFrame(rdd, schema=df_schema)
        df = df.limit(limit_count)
        df.write.mode("overwrite").csv('foo/')
    else:
        rdd.saveAsTextFile('foo/')
    print "Accum value: {}".format(accum.value)
if __name__ == "__main__":
    main()
The problem is that my accumulator sometimes reports the number of filtered rows and sometimes doesn't, depending on the limit specified and number of inputs for a source. However, in all situations the filtered rows don't make it into the output meaning the filter occurred and the accumulator should have a value.
If you can shed some light on this that'd be very helpful, thanks!
Update:
Adding a rdd.persist() call after mapPartitions made the accumulator behavior consistent.
Actually, it doesn't matter what the value of limit_count is.
The reason the accumulator value is sometimes 0 is that you update the accumulator inside transformations (e.g. rdd.map, rdd.mapPartitions).
Spark only guarantees that accumulator updates are applied exactly once when they are performed inside actions (e.g. rdd.foreach).
Let's make a small change to your code:
from pyspark.sql import *
from random import randint
def filter_and_transform_parts(rows, filter_int, accum):
    for r in rows:
        if r[0] == filter_int:
            accum.add(1)
def main():
    spark = SparkSession.builder.appName("Test").getOrCreate()
    sc = spark.sparkContext
    print(sc.applicationId)
    accum = sc.accumulator(0)
    inputs = [(4, x * 10, x * 100) if x % 5 == 0 else (randint(6, 10), x * 10, x * 100) for x in xrange(100)]
    rdd = sc.parallelize(inputs)
    rdd.foreachPartition(lambda r: filter_and_transform_parts(r, 4, accum))
    limit = True
    limit_count = 10 or 'whatever'
    if limit:
        rdd = rdd.map(lambda r: Row(val1=r[0], val2=r[1], val3=r[2]))
        df = spark.createDataFrame(rdd)
        df = df.limit(limit_count)
        df.write.mode("overwrite").csv('file:///tmp/output')
    else:
        rdd.saveAsTextFile('file:///tmp/output')
    print "Accum value: {}".format(accum.value)
if __name__ == "__main__":
    main()
The accumulator value is equal to 20 every time.
For more information:
http://spark.apache.org/docs/2.0.2/programming-guide.html#accumulators
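As a loose analogy in plain Python (this is not Spark's execution model, only an illustration of lazy evaluation): a side effect placed inside a lazily evaluated step happens only for the rows that are actually pulled through it, so a limit downstream can leave the count short. The dictionary counter below stands in for the accumulator, and all names are hypothetical.

```python
from itertools import islice

def filter_and_transform(rows, filter_int, counter):
    """Generator that drops matching rows and counts them as a side effect."""
    for r in rows:
        if r[0] == filter_int:
            counter["filtered"] += 1
            continue
        yield r

# 20 of the 100 inputs have 4 as their first element.
inputs = [(4, x) if x % 5 == 0 else (7, x) for x in range(100)]

# Nothing runs until the generator is consumed, and a limit stops consumption early.
limited_counter = {"filtered": 0}
first_ten = list(islice(filter_and_transform(inputs, 4, limited_counter), 10))

# Consuming everything visits every row, so every filtered row is counted.
full_counter = {"filtered": 0}
all_rows = list(filter_and_transform(inputs, 4, full_counter))

print(limited_counter["filtered"], full_counter["filtered"])  # 3 20
```

Counting in a separate pass that always consumes the whole input, as the foreachPartition version above does, keeps the count independent of whatever limit is applied to the output.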

Filtering rows in Spark Dataframe based on multiple values in a list [duplicate]

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a.
The problem is I can't manually write the values in a as it's extracted from another job.
How would I filter in this case?
The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable you'll have to do it explicitly using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations but it shouldn't matter here.
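One caveat with formatting a Python tuple directly: a one-element tuple renders as `(1,)`, which is not a valid SQL list. A small helper that builds the list explicitly avoids this. This sketch handles integers only, and `in_clause` is a hypothetical name, not part of any Spark API.

```python
def in_clause(values):
    """Render integers as a SQL IN list, e.g. (1, 2, 3)."""
    values = list(values)
    if not values:
        raise ValueError("IN list must not be empty")
    # int() raises for non-numeric strings, which also limits what can end up in the query.
    return "(" + ", ".join(str(int(v)) for v in values) + ")"

a = (1, 2, 3)
query = "SELECT * FROM my_df WHERE field1 IN {0}".format(in_clause(a))
print(query)

# Formatting the tuple itself breaks down for a single element.
print("{0}".format((1,)))   # (1,)
print(in_clause((1,)))      # (1)
```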
In practice DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose and handles all details of HiveQL / Spark SQL for you.
Reiterating what @zero323 mentioned above: we can do the same thing using a list as well (not only a set), like below:
from pyspark.sql.functions import col
df.where(col("v").isin(["foo", "bar"])).count()
Just a little addition/update:
choice_list = ["foo", "bar", "jack", "joan"]
If you want to filter your dataframe "df", such that you want to keep rows based upon a column "v" taking only the values from choice_list, then
from pyspark.sql.functions import col
df_filtered = df.where( ( col("v").isin (choice_list) ) )
You can also do this for integer columns:
df_filtered = df.filter("field1 in (1,2,3)")
or this for string columns:
df_filtered = df.filter("field1 in ('a','b','c')")
A slightly different approach that worked for me is to filter with a custom filter function.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType
def filter_func(a):
    """wrapper function to pass a in udf"""
    def filter_func_(col):
        """filtering function"""
        if col in a.value:
            return True
        return False
    return udf(filter_func_, BooleanType())
# Broadcasting allows to pass large variables efficiently
a = sc.broadcast((1, 2, 3))
df = my_df.filter(filter_func(a)(col('field1')))
from pyspark.sql import SparkSession
import pandas as pd
spark=SparkSession.builder.appName('Practise').getOrCreate()
df_pyspark=spark.read.csv('datasets/myData.csv',header=True,inferSchema=True)
df_pyspark.createOrReplaceTempView("df") # we need to create a Temp table first
spark.sql("SELECT * FROM df where Departments in ('IOT','Big Data') order by Departments").show()
