How to convert a Spark DataFrame column into a list? - apache-spark

I want to convert a Spark DataFrame into another DataFrame in a specific manner, as follows:
I have a Spark DataFrame:
col des
A a
A b
B b
B c
As a result of the operation I would like to have another Spark DataFrame:
col des
A a,b
B b,c
I tried to use:
result <- summarize(groupBy(df, df$col), des = n(df$des))
As a result I obtained the count. Is there any parameter of summarize (or agg) that converts a column into a list or something similar, assuming that all operations are done in Spark?
Thank you in advance

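Here is a solution in Scala; you will need to adapt it for SparkR.
import org.apache.spark.sql.functions.{collect_list, concat_ws}
import spark.implicits._ // provides toDF
val dataframe = spark.sparkContext.parallelize(Seq(
("A", "a"),
("A", "b"),
("B", "b"),
("B", "c")
)).toDF("col", "des")
// collect_list gathers each group's values into an array (element order is not guaranteed);
// concat_ws joins that array into a comma-separated string such as "a,b"
dataframe.groupBy("col").agg(concat_ws(",", collect_list("des")).as("des")).show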
Hope this helps!

SparkR code:
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
#create R data frame
df <- data.frame(col= c("A","A","B","B"),des= c("a","b","b","c"))
#converting to spark dataframe
sdf <- createDataFrame( sqlContext, df)
registerTempTable(sdf, "sdf")
head(sql(sqlContext, "SQL QUERY"))
Replace "SQL QUERY" with the corresponding SQL query and execute it.

Related

Pyspark Dataframe: Transform many columns

I have a pyspark dataframe with 10 columns as read from a parquet file
df = spark.read.parquet(path)
I want to apply several pre-processing steps to a subset of this dataframe's columns: col_list.
The following works fine, but apart from being a bit ugly, I also have the feeling it is not optimal.
import pyspark.sql.functions as F
for col in col_list:
    df = df.withColumn(col, F.regexp_replace(col, ".", " "))
    df = df.withColumn(col, F.regexp_replace(col, "_[A-Z]_", ""))
and the list goes on with other similar text processing steps.
So the question is whether the above is as optimal and elegant as it gets and also if/how I can use transform to achieve a sequential execution of the above steps.
Thanks a lot.
You can select all the required columns in one go:
import pyspark.sql.functions as F
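df2 = df.select(
    # columns that are left untouched
    *[c for c in df.columns if c not in col_list],
    # columns that get both replacements applied, keeping their original names
    *[F.regexp_replace(F.regexp_replace(c, ".", " "), "_[A-Z]_", "").alias(c) for c in df.columns if c in col_list]
)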
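One detail worth checking in the snippets above is the pattern ".". In a regular expression an unescaped dot matches any character, so if the intent is to replace literal dots, the pattern needs escaping. The sketch below uses Python's re module as a stand-in (Spark's regexp_replace uses Java regular expressions, which treat "." the same way). The steps list and the clean helper are hypothetical names introduced for illustration; they show how an ordered list of (pattern, replacement) pairs can be applied sequentially.

```python
import re
from functools import reduce

# Hypothetical ordered list of (pattern, replacement) steps.
steps = [
    (r"\.", " "),      # escaped dot: matches a literal "." only
    (r"_[A-Z]_", ""),  # removes tokens such as "_X_"
]

def clean(text, steps):
    """Apply each regex substitution in order and return the final string."""
    return reduce(lambda acc, step: re.sub(step[0], step[1], acc), steps, text)

# An unescaped "." matches every character, so the whole string is blanked out.
blanked = re.sub(".", " ", "a.b")
print(repr(blanked))

# With the escaped pattern only the literal dot is replaced.
cleaned = clean("a.b_X_c", steps)
print(repr(cleaned))
```

The same list of pairs could presumably drive a loop that nests F.regexp_replace calls per column, which keeps the processing steps in one place.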

How to subtract two DataFrames keeping duplicates in Spark 2.3.0

Spark 2.4.0 introduces a handy new function, exceptAll, which subtracts one DataFrame from another while keeping duplicates.
Example
val df1 = Seq(
("a", 1L),
("a", 1L),
("a", 1L),
("b", 2L)
).toDF("id", "value")
val df2 = Seq(
("a", 1L),
("b", 2L)
).toDF("id", "value")
df1.exceptAll(df2).collect()
// will return
Seq(("a", 1L),("a", 1L))
However I can only use Spark 2.3.0.
What is the best way to implement this using only functions from Spark 2.3.0?
One option is to use row_number to number the duplicates of each row and then use that number in a left join to find the missing rows.
A PySpark solution is shown here.
from pyspark.sql.functions import row_number
from pyspark.sql import Window
# Number the copies within each (id, value) group so that the n-th copy in df1
# can only pair with the n-th copy in df2
w = Window.partitionBy("id", "value").orderBy("value")
df1 = df1.withColumn("rnum", row_number().over(w))
df2 = df2.withColumn("rnum", row_number().over(w))
# After the left join, rows of df1 without a partner in df2 have a null df2.id
res_like_exceptAll = df1.join(df2, (df1.id == df2.id) & (df1.value == df2.value) & (df1.rnum == df2.rnum), 'left') \
    .filter(df2.id.isNull()) \
    .select(df1.id, df1.value)
res_like_exceptAll.show()
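To make the numbering idea concrete without a Spark session, here is a minimal pure-Python sketch of the same logic on the example data from the question. The helper names number_duplicates and except_all are hypothetical; the Counter subtraction at the end is only a cross-check of the expected multiset difference, not part of the Spark approach.

```python
from collections import Counter, defaultdict

def number_duplicates(rows):
    """Pair each row with a running index among its identical copies,
    similar to row_number() over a window partitioned by the whole row."""
    seen = defaultdict(int)
    numbered = []
    for row in rows:
        seen[row] += 1
        numbered.append((row, seen[row]))
    return numbered

def except_all(left, right):
    """Keep the rows of `left` whose (row, index) pair has no match in `right`."""
    right_numbered = set(number_duplicates(right))
    return [row for row, n in number_duplicates(left) if (row, n) not in right_numbered]

rows1 = [("a", 1), ("a", 1), ("a", 1), ("b", 2)]
rows2 = [("a", 1), ("b", 2)]

result = except_all(rows1, rows2)
print(result)

# Cross-check: multiset difference computed with Counter.
expected = sorted((Counter(rows1) - Counter(rows2)).elements())
print(expected)
```

The first copy of ("a", 1) and the single ("b", 2) find a partner in rows2, so only the second and third copies of ("a", 1) remain.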

convert rdd to dataframe without schema in pyspark

I'm trying to convert an RDD to a DataFrame without any schema.
I tried the code below. It works, but the DataFrame columns are getting shuffled.
from pyspark.sql import Row
def f(x):
    # build a dict keyed by the string form of each position: "0", "1", ...
    d = {}
    for i in range(len(x)):
        d[str(i)] = x[i]
    return d
rdd = sc.textFile("test")
df = rdd.map(lambda x: x.split(",")).map(lambda x: Row(**f(x))).toDF()
df.show()
If you don't want to specify a schema, do not use Row in the RDD. If you simply have a normal RDD (not an RDD[Row]) you can use toDF() directly.
df = rdd.map(lambda x: x.split(",")).toDF()
You can also name the columns by passing a list of names to toDF():
df = rdd.map(lambda x: x.split(",")).toDF(["col1_name", ..., "colN_name"])
If what you have is an RDD[Row], you need to know the type of each column. This can be done by specifying a schema or, in Scala, as follows:
val df = rdd.map({
case Row(val1: String, ..., valN: Long) => (val1, ..., valN)
}).toDF("col1_name", ..., "colN_name")
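A likely explanation for the shuffled columns in the question, assuming a Spark version before 3.0 (where a Row built from keyword arguments sorts its fields by name): the keys "0", "1", ..., "11" are strings, so they sort lexicographically rather than numerically. The effect only shows up with more than ten columns. The short pure-Python sketch below shows the ordering; zero-padding the keys is one hypothetical workaround, not something taken from the answer above.

```python
# String keys as produced by str(i) in the question's helper f(x).
keys = [str(i) for i in range(12)]

# Lexicographic sorting puts "10" and "11" before "2".
sorted_keys = sorted(keys)
print(sorted_keys)

# Zero-padded keys (an assumed workaround) keep their numeric order when sorted.
padded = [str(i).zfill(2) for i in range(12)]
print(sorted(padded) == padded)
```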

HiveQL to PySpark - issue with aggregated column in SELECT statement

I have the following HQL script, which needs to be ported to PySpark (Spark 1.6):
insert into table db.temp_avg
select
a,
avg(b) ,
c
from db.temp WHERE flag is not null GROUP BY a, c;
I created a few versions of the Spark code, but I'm struggling to get this averaged column into the select.
I also found out that grouped data cannot be written this way:
df3 = df2.groupBy...
df3.write.mode('overwrite').saveAsTable('db.temp_avg')
Part of the PySpark code:
temp_table = sqlContext.table("db.temp")
df = temp_table.select('a', 'avg(b)', 'c', 'flag').toDF('a', 'avg(b)', 'c', 'flag')
df = df.where(['flag'] != 'null'))
# this ofc does not work along with the avg(b)
df2 = df.groupBy('a', 'c')
df3.write.mode('overwrite').saveAsTable('db.temp_avg')
Thx for your help.
Correct solution:
import pyspark.sql.functions as F
# Read the source table db.temp (db.temp_avg is the insert target in the HQL script)
df = sqlContext.table("db.temp")
# Filter on flag before aggregating, then restore the column order of the HQL SELECT
df = df.filter(F.col("flag").isNotNull())\
    .groupby('a', 'c')\
    .agg(F.avg('b').alias("avg_b"))\
    .select('a', 'avg_b', 'c')
Then you can save the table with
df.write.saveAsTable("table_name")

Filtering rows in Spark Dataframe based on multiple values in a list [duplicate]

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in
sc = SparkContext()
sqlc = SQLContext(sc)
df = sqlc.sql('SELECT * from my_df WHERE field1 IN a')
where a is the tuple (1, 2, 3). I am getting this error:
java.lang.RuntimeException: [1.67] failure: ``('' expected but identifier a found
which is basically saying it was expecting something like '(1, 2, 3)' instead of a.
The problem is I can't manually write the values in a as it's extracted from another job.
How would I filter in this case?
The string you pass to SQLContext is evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable you'll have to do it explicitly using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations but it shouldn't matter here.
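One caveat with formatting a Python tuple directly: a one-element tuple renders with a trailing comma, for example "(1,)", which is not a valid IN list. The sketch below is a hypothetical helper (sql_in_list is not a Spark or PySpark function) that renders integers and strings explicitly and refuses strings containing quotes or backslashes instead of trying to escape them. It is a minimal illustration under those assumptions, not a general defence against SQL injection.

```python
def sql_in_list(values):
    """Render a non-empty sequence of ints or strs as a SQL IN list, e.g. "(1, 2, 3)"."""
    if not values:
        raise ValueError("IN list must not be empty")
    rendered = []
    for v in values:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(v, bool) or not isinstance(v, (int, str)):
            raise TypeError("unsupported value type: %r" % (v,))
        if isinstance(v, int):
            rendered.append(str(v))
        else:
            # Refuse anything that could break out of the string literal.
            if "'" in v or "\\" in v:
                raise ValueError("unsupported character in %r" % (v,))
            rendered.append("'" + v + "'")
    return "(" + ", ".join(rendered) + ")"

# Formatting a one-element tuple directly leaves a trailing comma.
naive = "{0}".format((1,))
print(naive)

query = "SELECT * FROM df WHERE v IN {0}".format(sql_in_list(("foo",)))
print(query)
```

The resulting string can then be passed to sqlContext.sql in the same way as in the example above.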
In practice DataFrame DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose and handles all details of HiveQL / Spark SQL for you.
Reiterating what @zero323 mentioned above: the same thing works with a list as well (not only a set), as below:
from pyspark.sql.functions import col
df.where(col("v").isin(["foo", "bar"])).count()
Just a little addition/update:
choice_list = ["foo", "bar", "jack", "joan"]
If you want to filter your dataframe "df", such that you want to keep rows based upon a column "v" taking only the values from choice_list, then
from pyspark.sql.functions import col
df_filtered = df.where(col("v").isin(choice_list))
You can also do this for integer columns:
df_filtered = df.filter("field1 in (1,2,3)")
or this for string columns:
df_filtered = df.filter("field1 in ('a','b','c')")
A slightly different approach that worked for me is to filter with a custom filter function.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType
def filter_func(a):
    """wrapper function to pass a in udf"""
    def filter_func_(value):
        """filtering function: True if the value is in the broadcast tuple"""
        return value in a.value
    return udf(filter_func_, BooleanType())
# Broadcasting allows large variables to be passed efficiently
a = sc.broadcast((1, 2, 3))
df = my_df.filter(filter_func(a)(col('field1')))
Another option is to register the DataFrame as a temporary view and use a SQL IN clause:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Practise').getOrCreate()
df_pyspark = spark.read.csv('datasets/myData.csv', header=True, inferSchema=True)
df_pyspark.createOrReplaceTempView("df") # we need to create a temp view first
spark.sql("SELECT * FROM df where Departments in ('IOT','Big Data') order by Departments").show()
