Filtering rows in Spark Dataframe based on multiple values in a list [duplicate] - apache-spark

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a.
The problem is I can't manually write the values in a as it's extracted from another job.
How would I filter in this case?

The string you pass to SQLContext is evaluated in the scope of the SQL environment; it doesn't capture the closure. If you want to pass a variable you'll have to do it explicitly using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations but it shouldn't matter here.
In practice DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose and handles all details of HiveQL / Spark SQL for you.
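For example (a small sketch reusing the toy df created above), an isin condition composes with other column predicates via the usual boolean operators:
from pyspark.sql.functions import col

# membership test combined with another predicate; use & (and), | (or), ~ (not)
df.where(col("v").isin(["foo", "bar"]) & (col("k") > 1)).count()
## 1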

Reiterating what #zero323 mentioned above: we can do the same thing using a list as well (not just a set), like below:
from pyspark.sql.functions import col
df.where(col("v").isin(["foo", "bar"])).count()

Just a little addition/update:
choice_list = ["foo", "bar", "jack", "joan"]
If you want to filter your dataframe "df" so that it keeps only the rows where column "v" takes a value from choice_list, then:
from pyspark.sql.functions import col
df_filtered = df.where(col("v").isin(choice_list))
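If the goal is instead to drop those values, the same expression can be negated (a small sketch under the same assumptions):
df_excluded = df.where(~col("v").isin(choice_list))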

You can also do this for integer columns:
df_filtered = df.filter("field1 in (1,2,3)")
or this for string columns:
df_filtered = df.filter("field1 in ('a','b','c')")

A slightly different approach that worked for me is to filter with a custom filter function.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

def filter_func(a):
    """Wrapper that closes over the broadcast variable a and returns a udf."""
    def filter_func_(value):
        """Return True if the value is contained in the broadcast collection."""
        return value in a.value
    return udf(filter_func_, BooleanType())

# Broadcasting allows passing large variables to the executors efficiently
a = sc.broadcast((1, 2, 3))
df = my_df.filter(filter_func(a)(col('field1')))

from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName('Practise').getOrCreate()
df_pyspark = spark.read.csv('datasets/myData.csv', header=True, inferSchema=True)
df_pyspark.createOrReplaceTempView("df")  # we need to create a temp view first
spark.sql("SELECT * FROM df WHERE Departments IN ('IOT','Big Data') ORDER BY Departments").show()

Related

concat_ws and coalesce in pyspark

In Pyspark, I want to combine concat_ws and coalesce whilst using the list method. For example I know this works:
from pyspark.sql.functions import concat_ws, coalesce, col, lit
df = spark.createDataFrame([["A", "B"], ["C", None], [None, "D"]]).toDF("Type", "Segment")
#display(df)
df = df.withColumn("concat_ws2", concat_ws(':', coalesce('Type', lit("")), coalesce('Segment', lit(""))))
display(df)
But I want to be able to utilise the *[list] method so I don't have to list out all the columns within that bit of code, i.e. something like this instead:
from pyspark.sql.functions import concat_ws, col
df = spark.createDataFrame([["A", "B"], ["C", None], [None, "D"]]).toDF("Type", "Segment")
list = ["Type", "Segment"]
df = df.withColumn("almost_desired_output", concat_ws(':', *list))
display(df)
However, as you can see, I want to coalesce NULL with a blank; is that possible using the *[list] method, or do I really have to list out all the columns?
This would work:
Iterate over the list of column names:
df=df.withColumn("almost_desired_output", concat_ws(':', *[coalesce(name, lit('')).alias(name) for name in df.schema.names]))
Output:
Or use fill - it fills all the null values across all columns of the DataFrame (but this changes the actual columns, which may break some use cases):
df.na.fill("").withColumn("almost_desired_output", concat_ws(':', *list))
Or use selectExpr (again, this changes the actual columns, which may break some use cases):
list = ["Type", "Segment"] # or just use df.schema.names
list2 = ["coalesce(type,' ') as Type", "coalesce(Segment,' ') as Segment"]
df=df.selectExpr(list2).withColumn("almost_desired_output", concat_ws(':', *list))
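To avoid writing the expressions out by hand (which is what the question is trying to avoid), the same selectExpr list can also be generated from the column names. A small sketch, assuming every column in list is a string column and coalescing nulls to an empty string:
list = ["Type", "Segment"]  # or df.schema.names
list2 = ["coalesce({0}, '') as {0}".format(c) for c in list]
df = df.selectExpr(*list2).withColumn("almost_desired_output", concat_ws(':', *list))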

Pyspark Dataframe: Transform many columns

I have a pyspark dataframe with 10 columns as read from a parquet file
df = spark.read.parquet(path)
I want to apply several pre-processing steps to a subset of this dataframe's columns: col_list.
The following works fine, but apart from being a bit ugly, I also have the feeling it is not optimal.
import pyspark.sql.functions as F
for col in col_list:
    df = df.withColumn(col, F.regexp_replace(col, ".", " "))
    df = df.withColumn(col, F.regexp_replace(col, "_[A-Z]_", ""))
and the list goes on with other similar text processing steps.
So the question is whether the above is as optimal and elegant as it gets and also if/how I can use transform to achieve a sequential execution of the above steps.
Thanks a lot.
You can select all the required columns in one go:
import pyspark.sql.functions as F
df2 = df.select(
    *[c for c in df.columns if c not in col_list],
    *[F.regexp_replace(F.regexp_replace(c, ".", " "), "_[A-Z]_", "").alias(c) for c in df.columns if c in col_list]
)
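As for the transform part of the question: a hedged sketch, assuming Spark 3.0+ (where DataFrame.transform is available), the same col_list as above, and illustrative function names, the same steps can be chained as named functions:
import pyspark.sql.functions as F

def replace_dots(frame):
    # first regexp_replace applied to every column in col_list
    return frame.select(
        *[c for c in frame.columns if c not in col_list],
        *[F.regexp_replace(c, ".", " ").alias(c) for c in col_list]
    )

def strip_markers(frame):
    # second regexp_replace applied to every column in col_list
    return frame.select(
        *[c for c in frame.columns if c not in col_list],
        *[F.regexp_replace(c, "_[A-Z]_", "").alias(c) for c in col_list]
    )

df_processed = df.transform(replace_dots).transform(strip_markers)
This keeps each processing step as a named, composable function while still producing one select per step.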

Different outcome from seemingly equivalent implementation of PySpark transformations

I have a set of Spark dataframe transforms which gives an out-of-memory error and has a messed-up SQL query plan, while a different implementation runs successfully.
%python
import pandas as pd
diction = {
    'key': [1,2,3,4,5,6],
    'f1' : [1,0,1,0,1,0],
    'f2' : [0,1,0,1,0,1],
    'f3' : [1,0,1,0,1,0],
    'f4' : [0,1,0,1,0,1]}
bil = pd.DataFrame(diction)
# successful logic
df = spark.createDataFrame(bil)
df = df.cache()
zdf = df
for i in [1,2,3]:
    tempdf = zdf.select(['key'])
    df = df.join(tempdf, on=['key'], how='left')
df.show()
# failed logic
df = spark.createDataFrame(bil)
df = df.cache()
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf, on=['key'], how='left')
df.show()
Logically there should not be such a computational difference (more than double the time and memory used).
Can anyone help me understand this ?
DAG of successful logic:
DAG of failure logic:
I'm not sure what your use case is for this code; however, the two pieces of code are not logically the same. In the second version you are joining the result of the previous iteration to itself three times. In the first version you are joining a 'copy' of the original df three times. If your key column is not unique, the second piece of code will 'explode' your dataframe more than the first.
To make this more clear we can make a simple example below where we have a non-unique key value. Taking your second example:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
for i in [1,2,3]:
    tempdf = df.select(['key'])
    df = df.join(tempdf, on=['key'], how='left')
df.count()
>>> 257
And your first piece of code:
df = spark.createDataFrame([(1,'a'), (1,'b'), (3,'c')], ['key','val'])
zdf = df
for i in [1,2,3]:
    tempdf = zdf.select(['key'])
    df = df.join(tempdf, on=['key'], how='left')
df.count()
>>> 17
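The counts follow directly from key 1 having two matching rows. In the failing version tempdf is rebuilt from the already-exploded df, so the number of key-1 rows squares on every pass (2 → 4 → 16 → 256; plus the single key-3 row gives 257). In the working version tempdf always comes from the original zdf, so it only doubles each pass (2 → 4 → 8 → 16; plus one gives 17).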

Data type conversion in spark

I have a column id which had type int but was later changed to bigint.
It has both types of values.
from pyspark.sql.functions import *
from pyspark.sql.types import *
df = spark.read.parquet('hdfs path')
df = df.select("id", "code")
df = df.withColumn("id1", df["id"].cast(LongType()))
res1 = df.select("id1", "code")
res1.show(1, False)
It shows me the data frame, but when I try to perform some operations on it, for example:
res1.groupBy('code').agg(countDistinct("id1")).show(1, False)
I get Column: [id], Expected: int, Found: INT64
I tried mergeSchema; that did not work either.
from pyspark.sql.functions import *
from pyspark.sql.types import *
df1 = spark.read.parquet('hdfs path')
df2 = df1.select("id", "code")
df3 = df2.withColumn("id1", df2["id"].cast(LongType()))
res1 = df3.select("id1", "code")
res1.show(1, False)
res1.groupBy("code").agg(countDistinct("id1")).show(1, False)
This should work. In Spark, DataFrames are immutable, so you should not assign the result of a transformation operation back to the same df variable; use a different variable name. In Scala it would give you a compile-time error, but in Python it's allowed, so you don't notice it.
If you want, you could also chain all of your transformations, get a single df variable, and perform the groupBy operation on it as below:
df = spark.read.parquet('hdfs path').select("id", "code").withColumn("id1", col("id").cast(LongType())).select("id1", "code")
df.groupBy("code").agg(countDistinct("id1")).show(1, False)

PySpark - Add a new nested column or change the value of existing nested columns

Suppose I have a JSON file with lines in the following structure:
{
    "a": 1,
    "b": {
        "bb1": 1,
        "bb2": 2
    }
}
I want to change the value of key bb1 or add a new key, like: bb3.
Currently, I use spark.read.json to load the JSON file into Spark as a DataFrame and df.rdd.map to map each Row of the RDD to a dict. Then I change the nested key value or add a nested key and convert the dict back to a Row. Finally, I convert the RDD back to a DataFrame.
The workflow is as follows:
def map_func(row):
    dictionary = row.asDict(True)
    # add a new key or change a key's value here
    return as_row(dictionary)  # as_row converts the dict back to a Row recursively

df = spark.read.json("json_file")
df.rdd.map(map_func).toDF().write.json("new_json_file")
This could work for me, but I am concerned that converting DataFrame -> RDD (Row -> dict -> Row) -> DataFrame would kill the efficiency.
Is there any other method that could meet this need without sacrificing efficiency?
The final solution that I used is withColumn together with dynamically building the schema of b.
First, we can get b_schema from the df schema by:
b_schema = next(field['type'] for field in df.schema.jsonValue()['fields'] if field['name'] == 'b')
After that, b_schema is a dict and we can add a new field to it by:
b_schema['fields'].append({"metadata":{},"type":"string","name":"bb3","nullable":True})
And then, we could convert it to StructType by:
new_b = StructType.fromJson(b_schema)
In the map_func, we could convert Row to dict and populate the new field:
def map_func(row):
    data = row.asDict(True)
    data['bb3'] = data['bb1'] + data['bb2']
    return data
map_udf = udf(map_func, new_b)
df.withColumn('b', map_udf('b')).collect()
Thanks #Mariusz
You can use map_func as a UDF and therefore omit converting DF -> RDD -> DF, while still having the flexibility of Python to implement the business logic. All you need is to create a schema object:
>>> from pyspark.sql.types import *
>>> new_b = StructType([StructField('bb1', LongType()), StructField('bb2', LongType()), StructField('bb3', LongType())])
Then you define map_func and udf:
>>> from pyspark.sql.functions import *
>>> def map_func(data):
...     return {'bb1': 4, 'bb2': 5, 'bb3': 6}
...
>>> map_udf = udf(map_func, new_b)
Finally apply this UDF to dataframe:
>>> df = spark.read.json('sample.json')
>>> df.withColumn('b', map_udf('b')).first()
Row(a=1, b=Row(bb1=4, bb2=5, bb3=6))
EDIT:
According to the comment: you can add a field to an existing StructType in an easier way, for example:
>>> df = spark.read.json('sample.json')
>>> new_b = df.schema['b'].dataType.add(StructField('bb3', LongType()))
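Putting the two together (a minimal sketch under the same assumptions: sample.json has the structure from the question and the UDF receives the b struct as a Row), the derived schema can be used with the same withColumn/UDF pattern; for the sample record the final line should look roughly like the Row shown:
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import StructField, LongType
>>> df = spark.read.json('sample.json')
>>> new_b = df.schema['b'].dataType.add(StructField('bb3', LongType()))
>>> def map_func(row):
...     data = row.asDict(True)
...     data['bb3'] = data['bb1'] + data['bb2']  # populate the new field
...     return data
...
>>> map_udf = udf(map_func, new_b)
>>> df.withColumn('b', map_udf('b')).first()
Row(a=1, b=Row(bb1=1, bb2=2, bb3=3))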
