Better way to select all columns and join in PySpark data frames - apache-spark

I have two data frames in PySpark. Their schemas are below.
df1
DataFrame[customer_id: int, email: string, city: string, state: string, postal_code: string, serial_number: string]
df2
DataFrame[serial_number: string, model_name: string, mac_address: string]
Now I want to do a full outer join on these two data frames, using coalesce on the column that is common to both.
I did it as shown below and got the expected result.
full_df = df1.join(df2, df1.serial_number == df2.serial_number, 'full_outer').select(df1.customer_id, df1.email, df1.city, df1.state, df1.postal_code, f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number'), df2.model_name, df2.mac_address)
Now I want to do the above a little differently. Instead of writing out all the column names in the select after the join, I want to use something like * on the data frame. Basically I want something like below.
full_df = df1.join(df2, df1.serial_number == df2.serial_number, 'full_outer').select('df1.*', f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number1'), df2.model_name, df2.mac_address).drop('serial_number')
I am getting what I want. Is there a better way to do this kind of operation in PySpark?
edit
This is not a duplicate of https://stackoverflow.com/questions/36132322/join-two-data-frames-select-all-columns-from-one-and-some-columns-from-the-othe?rq=1 because I am using coalesce in the join statement. I want to know if there is a way to exclude the column on which I am using the coalesce function.

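You can do something like this:
import pyspark.sql.functions as f

(df1
    .join(df2, df1.serial_number == df2.serial_number, 'full_outer')
    .select(
        # every df1 column except the join key
        [df1[c] for c in df1.columns if c != 'serial_number'] +
        # the join key, taken from whichever side is not null
        [f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number')] +
        # every df2 column except the join key
        [df2[c] for c in df2.columns if c != 'serial_number']
    ))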
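To see why the coalesce on the key matters in a full outer join, here is a minimal plain-Python sketch (no Spark involved). The sample rows and the helper names coalesce and full_outer_join are made up for illustration, and the sketch assumes each serial_number appears at most once per side.

```python
# Plain-Python model of a full outer join on one key column.
# The rows below are invented sample data, not from the question.
df1_rows = [
    {"customer_id": 1, "email": "a@example.com", "serial_number": "S1"},
    {"customer_id": 2, "email": "b@example.com", "serial_number": "S2"},
]
df2_rows = [
    {"serial_number": "S2", "model_name": "M-200"},
    {"serial_number": "S3", "model_name": "M-300"},
]


def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    return next((v for v in values if v is not None), None)


def full_outer_join(left, right, key):
    """Join two lists of dict rows on `key`, keeping unmatched rows from both sides."""
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    left_cols = [c for c in left[0] if c != key]
    right_cols = [c for c in right[0] if c != key]
    joined = []
    for k in sorted(set(left_by_key) | set(right_by_key)):
        l_row = left_by_key.get(k, {})
        r_row = right_by_key.get(k, {})
        # All left columns except the key (None when the left side has no match).
        row = {c: l_row.get(c) for c in left_cols}
        # The key comes from whichever side actually has it.
        row[key] = coalesce(l_row.get(key), r_row.get(key))
        # All right columns except the key.
        row.update({c: r_row.get(c) for c in right_cols})
        joined.append(row)
    return joined


full_rows = full_outer_join(df1_rows, df2_rows, "serial_number")
for row in full_rows:
    print(row)
```

The row for S3 exists only on the right side, so taking the key from the left side alone would leave it empty; coalesce fills it from the right.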

Related

How to use groupby with array elements in Pyspark?

I'm running a groupBy operation on a dataframe in PySpark, and I need to group by a list that may contain one or two features. How can I do this?
record_fields = [['record_edu_desc'], ['record_construction_desc'],['record_cost_grp'],['record_bsmnt_typ_grp_desc'], ['record_shape_desc'],
['record_sqft_dec_grp', 'record_renter_grp_c_flag'],['record_home_age'],
['record_home_age_grp','record_home_age_missing']]
for field in record_fields:
    df_group = df.groupBy('year', 'area', 'state', 'code', field).sum('net_contributions')
    ### df write to csv operation
My first thought was to create a list of lists and pass it to the groupby operation, but I get the following error:
TypeError: Invalid argument, not a string or column:
['record_edu_desc'] of type . For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
How do I make this work? I'm open to other ways I could do this.
Try this (note the * [asterisk] before field, which unpacks the list into separate arguments):
for field in record_fields:
    df_group = df.groupBy('year', 'area', 'state', 'code', *field).sum('net_contributions')
Also take a look at this question to learn more about the asterisk in Python.
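As a rough illustration of what the asterisk does, the sketch below uses a hypothetical stand-in function, group_by, that accepts column names as separate arguments and rejects anything that is not a string. It is not the real DataFrame.groupBy, only a model of the mixed-argument call in the question.

```python
def group_by(*cols):
    """Hypothetical stand-in for a varargs API: every argument must be a column name."""
    for col in cols:
        if not isinstance(col, str):
            raise TypeError(f"Invalid argument, not a string or column: {col!r}")
    return list(cols)


record_fields = [
    ["record_edu_desc"],
    ["record_sqft_dec_grp", "record_renter_grp_c_flag"],
]

# With *field, each inner list is unpacked into individual string arguments.
grouped = [group_by("year", "area", "state", "code", *field) for field in record_fields]

# Without the asterisk, the inner list itself is passed as a single argument.
try:
    group_by("year", "area", "state", "code", record_fields[0])
    failed = False
except TypeError:
    failed = True

print(grouped)
print(failed)
```

The same loop therefore handles one-column and two-column groupings without any special casing.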

Multiple values in WHERE clause using sqldf in R

I am trying to query multiple values in the WHERE clause, using sqldf in R. I have the following query, however, it continues to throw an error. Any help would be appreciated.
sqldf("SELECT amount
from df
where category = 'description' and 'original description'")
ERROR: <0 rows> (or 0-length row.names)
You just need to use an IN condition:
sqldf("SELECT amount
from df
where category in ('description','original description')")
If you want to use the LIKE operator with several patterns, you need to combine them with OR instead of AND. I am not sure what other entries are in category, but if no other category has "description" in its name, the following might be enough:
sqldf("SELECT amount from df where category LIKE '%description%'")
You need to spell out each comparison in the WHERE clause explicitly:
SELECT amount FROM df WHERE category = 'description' OR category = 'original description'
Alternatively, you can pass in multiple values with the IN operator:
SELECT amount FROM df WHERE category IN ( 'description', 'original description' )
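The three forms can be compared with a small sketch using Python's built-in sqlite3 module (sqldf uses SQLite by default). The table contents and the amounts helper are invented for illustration. In SQLite, a bare string such as 'original description' used as a condition is converted to the number 0, which is false, so the original query matches no rows.

```python
import sqlite3

# In-memory table with made-up sample data standing in for the R data frame df.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [("description", 10.0), ("original description", 20.0), ("other", 30.0)],
)


def amounts(where):
    """Run the query with the given WHERE clause and return the amounts in order."""
    sql = f"SELECT amount FROM df WHERE {where} ORDER BY amount"
    return [row[0] for row in conn.execute(sql).fetchall()]


# The original condition: the second operand is a bare string, which is false in SQLite.
broken = amounts("category = 'description' AND 'original description'")
with_in = amounts("category IN ('description', 'original description')")
with_or = amounts("category = 'description' OR category = 'original description'")

print(broken, with_in, with_or)
```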

How to efficiently select distinct rows on an RDD based on a subset of its columns

Consider a case class:
case class Prod(productId: String, date: String, qty: Int, many other attributes ..)
And an
val rdd: RDD[Prod]
containing many instances of that class.
The unique key is intended to be the (productId,date) tuple. However we do have some duplicates.
Is there any efficient means to remove the duplicates?
The operation
rdd.distinct
would look for entire rows that are duplicated.
A fallback would involve joining the unique (productId,date) combinations back to the entire rows; I am working through exactly how to do this. Even so, it takes several operations. A simpler (and perhaps faster) approach would be useful if one exists.
I'd use dropDuplicates on Dataset:
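// needed for .toDS outside spark-shell, where `spark` is the SparkSession
import spark.implicits._
val rdd = sc.parallelize(Seq(
  Prod("foo", "2010-01-02", 1), Prod("foo", "2010-01-02", 2)
))
rdd.toDS.dropDuplicates("productId", "date")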
but reduceByKey should work as well:
rdd.keyBy(prod => (prod.productId, prod.date)).reduceByKey((x, _) => x).values
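The idea behind the reduceByKey version (key each row by the columns that should be unique, then keep one row per key) can be sketched in plain Python without Spark. The Prod tuple, the sample rows, and drop_duplicates_by_key are illustrative names only. Unlike this single-process sketch, Spark does not guarantee which of the duplicate rows survives.

```python
from typing import NamedTuple


class Prod(NamedTuple):
    product_id: str
    date: str
    qty: int


def drop_duplicates_by_key(rows, key):
    """Keep the first row seen for each key; later rows with the same key are dropped."""
    kept = {}
    for row in rows:
        kept.setdefault(key(row), row)
    return list(kept.values())


prods = [
    Prod("foo", "2010-01-02", 1),
    Prod("foo", "2010-01-02", 2),  # duplicate (product_id, date) key
    Prod("bar", "2010-01-03", 5),
]

unique = drop_duplicates_by_key(prods, key=lambda p: (p.product_id, p.date))
print(unique)
```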

How to Flatten spark dataframe Row to multiple Dataframe Rows

Hi, I have a Spark data frame which prints like this (single row):
[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),1487530800317]
So inside a row I have wrapped arrays. I want to flatten them and create a data frame that has a single value for each array. For example, the row above should be transformed into something like this:
[abc,11918,46734,1487530800317]
[abc,1233,1234,1487530800317]
So I get a data frame with 2 rows instead of 1: each corresponding element from the wrapped arrays should go into a new row.
Edit 1 after 1st answer:
What if I have 3 arrays in my input?
[abc,WrappedArray(11918,1233),WrappedArray(46734,1234),WrappedArray(1,2),1487530800317]
My output should be
[abc,11918,46734,1,1487530800317]
[abc,1233,1234,2,1487530800317]
Definitely not the best solution, but this would work:
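import org.apache.spark.sql.functions.{explode, udf}
import spark.implicits._  // for .toDS and the $"col" syntax

case class TestFormat(a: String, b: Seq[String], c: Seq[String], d: String)
val data = Seq(TestFormat("abc", Seq("11918","1233"),
  Seq("46734","1234"), "1487530800317")).toDS
// pair up the elements of b and c position by position
val zipThem: (Seq[String], Seq[String]) => Seq[(String, String)] = _.zip(_)
val udfZip = udf(zipThem)
// explode the zipped pairs into one row each, then split the pair back into columns
data.select($"a", explode(udfZip($"b", $"c")) as "tmp", $"d")
  .select($"a", $"tmp._1" as "b", $"tmp._2" as "c", $"d")
  .show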
The problem is that you cannot be sure that both sequences have the same length; zip silently truncates to the shorter one.
A better solution would probably be to restructure the whole data frame so that its schema models the data, e.g.
root
-- a
-- d
-- records
---- b
---- c
Thanks for answering, @swebbo; your answer helped me get this done.
I did this:
import org.apache.spark.sql.functions.{explode, udf}
import sqlContext.implicits._
val zipColumns = udf((x: Seq[Long], y: Seq[Long], z: Seq[Long]) => (x.zip(y).zip(z)) map {
case ((a,b),c) => (a,b,c)
})
val flattened = subDf.withColumn("columns", explode(zipColumns($"col3", $"col4", $"col5"))).select(
$"col1", $"col2",
$"columns._1".alias("col3"), $"columns._2".alias("col4"), $"columns._3".alias("col5"))
flattened.show
Hope that is understandable :)
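The zip-then-explode approach used in both answers can be sketched in plain Python, with an explicit check for the unequal-length case mentioned earlier. The function name explode_zipped and the dict-based row are assumptions made for illustration; the column names follow the second answer.

```python
def explode_zipped(row, array_cols):
    """Turn one row holding parallel arrays into one row per array position."""
    arrays = [row[c] for c in array_cols]
    # zip would silently truncate to the shortest array, so fail loudly instead.
    if len({len(a) for a in arrays}) > 1:
        raise ValueError(f"arrays in {array_cols} have different lengths")
    out = []
    for values in zip(*arrays):
        new_row = dict(row)                      # copy the scalar columns
        new_row.update(zip(array_cols, values))  # replace each array with its element
        out.append(new_row)
    return out


row = {
    "col1": "abc",
    "col2": 1487530800317,
    "col3": [11918, 1233],
    "col4": [46734, 1234],
    "col5": [1, 2],
}

flattened = explode_zipped(row, ["col3", "col4", "col5"])
for r in flattened:
    print(r)
```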

spark: row to element

New to Spark.
I'd like to do some transformation on the "wordList" column of a spark DataFrame, df, of the type org.apache.spark.sql.DataFrame = [id: string, wordList: array<string>].
I use Databricks. df looks like:
+--------------------+--------------------+
| id| wordList|
+--------------------+--------------------+
|08b0a9b6-3b9a-47a...| [a]|
|23c2ef79-8dce-4ad...|[ag, adfg, asdfgg...|
|26a7682f-2ce6-4eb...|[ghe, gener, ghee...|
|2ab530b5-04bc-463...|[bap, pemm, pava,...|
+--------------------+--------------------+
More specifically, I have defined a function shrinkList(ol: List[String]): List[String] that takes a list and returns a shorter list, and would like to apply it on the wordList column. The question is, how do I convert the row to a list?
df.select("wordList").map(t => shrinkList(t(1))) gives the error: type mismatch;
found : Any
required: List[String]
Also, I'm not sure about "t(1)" here. I'd rather use the column name instead of the index, in case the order of the columns changes in the future. But I can't seem to make t$"wordList" or t.wordList or t("wordList") work. So instead of using t(1), what selector can I use to select the "wordList" column?
Try this (the index is 0 because only one column is selected):
df.select("wordList").map(t => shrinkList(t.getSeq[String](0).toList))
or
df.select("wordList").map(t => shrinkList(t.getAs[Seq[String]]("wordList").toList))
