I want to select only the rows where a column equals a certain value. I am doing this in Scala and having a little trouble.
Here's my code:
df.select(df("state")==="TX").show()
This returns the state column with boolean values instead of just the TX rows.
I've also tried
df.select(df("state")=="TX").show()
but this doesn't work either.
I had the same issue, and the following syntax worked for me:
df.filter(df("state")==="TX").show()
I'm using Spark 1.6.
There is also a simple SQL-like option. The following should also work with Spark 1.6:
df.filter("state = 'TX'")
This is a newer way of specifying SQL-like filters. For a full list of supported operators, check out this class.
You should be using where. select is a projection that returns the output of the expression, which is why you get boolean values. where is a filter that keeps the structure of the DataFrame but only keeps the rows for which the condition holds.
Along the same lines, per the documentation, you can write this in 3 different ways:
// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
peopleDf($"age" > 15)
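The projection-versus-filter distinction described above can be seen on an ordinary Scala collection, where map plays the role of select and filter the role of where. This is only an analogy in plain Scala (no Spark involved), and the states value is made-up sample data:

```scala
// Plain Scala analogy, not Spark: map ~ select (projection), filter ~ where.
val states = Seq("TX", "CA", "TX", "NY") // made-up sample data

// Projecting the comparison yields one Boolean per element,
// much like select(df("state") === "TX") yields a boolean column.
val projected = states.map(_ == "TX")

// Filtering keeps only the elements for which the comparison is true,
// much like where/filter keeps only the matching rows.
val filtered = states.filter(_ == "TX")

println(projected) // List(true, false, true, false)
println(filtered)  // List(TX, TX)
```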
To get the negation, do this:
df.filter(not( ..expression.. ))
e.g.
df.filter(not($"state" === "TX"))
df.filter($"state" like "T%") for pattern matching
df.filter($"state" === "TX") or df.filter("state = 'TX'") for equality
Worked on Spark V2.*
import sqlContext.implicits._
df.filter($"state" === "TX")
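If the column needs to be compared against a variable (here a hypothetical stateValue; var itself is a reserved word in Scala and cannot be used as a name):
import sqlContext.implicits._
df.filter($"state" === stateValue)
Note: the $"..." column syntax requires import sqlContext.implicits._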
We can chain multiple filter/where conditions on a DataFrame.
For example:
table1_df
.filter($"Col_1_name" === "buddy") // check for equal to string
.filter($"Col_2_name" === "A")
.filter(not($"Col_2_name".contains(" .sql"))) // filter out strings that are not relevant
.filter("Col_2_name is not null") // drop null values
.take(5).foreach(println)
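Chaining filters like this keeps only the rows that pass every condition, so it behaves like a single filter with the conditions AND-ed together. A minimal sketch of that equivalence on a plain Scala collection (not Spark; Rec and its sample values are hypothetical):

```scala
// Plain Scala analogy, not Spark: chained filters behave like one AND-ed predicate.
case class Rec(col1: String, col2: Option[String]) // hypothetical row type

val recs = Seq(
  Rec("buddy", Some("A")),
  Rec("buddy", Some("B")),
  Rec("other", Some("A")),
  Rec("buddy", None)
)

// One condition per filter call.
val chained = recs
  .filter(_.col1 == "buddy")
  .filter(_.col2.contains("A"))

// The same conditions combined into a single predicate.
val combined = recs.filter(r => r.col1 == "buddy" && r.col2.contains("A"))

println(chained)             // List(Rec(buddy,Some(A)))
println(chained == combined) // true
```

In Spark the same conjunction can be written in one call by combining Column conditions with &&.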
Here is a complete example using Spark 2.2+, reading the data from JSON:
val myjson = "[{\"name\":\"Alabama\",\"abbreviation\":\"AL\"},{\"name\":\"Alaska\",\"abbreviation\":\"AK\"},{\"name\":\"American Samoa\",\"abbreviation\":\"AS\"},{\"name\":\"Arizona\",\"abbreviation\":\"AZ\"},{\"name\":\"Arkansas\",\"abbreviation\":\"AR\"},{\"name\":\"California\",\"abbreviation\":\"CA\"},{\"name\":\"Colorado\",\"abbreviation\":\"CO\"},{\"name\":\"Connecticut\",\"abbreviation\":\"CT\"},{\"name\":\"Delaware\",\"abbreviation\":\"DE\"},{\"name\":\"District Of Columbia\",\"abbreviation\":\"DC\"},{\"name\":\"Federated States Of Micronesia\",\"abbreviation\":\"FM\"},{\"name\":\"Florida\",\"abbreviation\":\"FL\"},{\"name\":\"Georgia\",\"abbreviation\":\"GA\"},{\"name\":\"Guam\",\"abbreviation\":\"GU\"},{\"name\":\"Hawaii\",\"abbreviation\":\"HI\"},{\"name\":\"Idaho\",\"abbreviation\":\"ID\"},{\"name\":\"Illinois\",\"abbreviation\":\"IL\"},{\"name\":\"Indiana\",\"abbreviation\":\"IN\"},{\"name\":\"Iowa\",\"abbreviation\":\"IA\"},{\"name\":\"Kansas\",\"abbreviation\":\"KS\"},{\"name\":\"Kentucky\",\"abbreviation\":\"KY\"},{\"name\":\"Louisiana\",\"abbreviation\":\"LA\"},{\"name\":\"Maine\",\"abbreviation\":\"ME\"},{\"name\":\"Marshall Islands\",\"abbreviation\":\"MH\"},{\"name\":\"Maryland\",\"abbreviation\":\"MD\"},{\"name\":\"Massachusetts\",\"abbreviation\":\"MA\"},{\"name\":\"Michigan\",\"abbreviation\":\"MI\"},{\"name\":\"Minnesota\",\"abbreviation\":\"MN\"},{\"name\":\"Mississippi\",\"abbreviation\":\"MS\"},{\"name\":\"Missouri\",\"abbreviation\":\"MO\"},{\"name\":\"Montana\",\"abbreviation\":\"MT\"},{\"name\":\"Nebraska\",\"abbreviation\":\"NE\"},{\"name\":\"Nevada\",\"abbreviation\":\"NV\"},{\"name\":\"New Hampshire\",\"abbreviation\":\"NH\"},{\"name\":\"New Jersey\",\"abbreviation\":\"NJ\"},{\"name\":\"New Mexico\",\"abbreviation\":\"NM\"},{\"name\":\"New York\",\"abbreviation\":\"NY\"},{\"name\":\"North Carolina\",\"abbreviation\":\"NC\"},{\"name\":\"North Dakota\",\"abbreviation\":\"ND\"},{\"name\":\"Northern Mariana Islands\",\"abbreviation\":\"MP\"},{\"name\":\"Ohio\",\"abbreviation\":\"OH\"},{\"name\":\"Oklahoma\",\"abbreviation\":\"OK\"},{\"name\":\"Oregon\",\"abbreviation\":\"OR\"},{\"name\":\"Palau\",\"abbreviation\":\"PW\"},{\"name\":\"Pennsylvania\",\"abbreviation\":\"PA\"},{\"name\":\"Puerto Rico\",\"abbreviation\":\"PR\"},{\"name\":\"Rhode Island\",\"abbreviation\":\"RI\"},{\"name\":\"South Carolina\",\"abbreviation\":\"SC\"},{\"name\":\"South Dakota\",\"abbreviation\":\"SD\"},{\"name\":\"Tennessee\",\"abbreviation\":\"TN\"},{\"name\":\"Texas\",\"abbreviation\":\"TX\"},{\"name\":\"Utah\",\"abbreviation\":\"UT\"},{\"name\":\"Vermont\",\"abbreviation\":\"VT\"},{\"name\":\"Virgin Islands\",\"abbreviation\":\"VI\"},{\"name\":\"Virginia\",\"abbreviation\":\"VA\"},{\"name\":\"Washington\",\"abbreviation\":\"WA\"},{\"name\":\"West Virginia\",\"abbreviation\":\"WV\"},{\"name\":\"Wisconsin\",\"abbreviation\":\"WI\"},{\"name\":\"Wyoming\",\"abbreviation\":\"WY\"}]"
import spark.implicits._
val df = spark.read.json(Seq(myjson).toDS)
df.show
scala> df.show
+------------+--------------------+
|abbreviation| name|
+------------+--------------------+
| AL| Alabama|
| AK| Alaska|
| AS| American Samoa|
| AZ| Arizona|
| AR| Arkansas|
| CA| California|
| CO| Colorado|
| CT| Connecticut|
| DE| Delaware|
| DC|District Of Columbia|
| FM|Federated States ...|
| FL| Florida|
| GA| Georgia|
| GU| Guam|
| HI| Hawaii|
| ID| Idaho|
| IL| Illinois|
| IN| Indiana|
| IA| Iowa|
| KS| Kansas|
+------------+--------------------+
// equals matching
scala> df.filter(df("abbreviation") === "TX").show
+------------+-----+
|abbreviation| name|
+------------+-----+
| TX|Texas|
+------------+-----+
// or using lit
scala> df.filter(df("abbreviation") === lit("TX")).show
+------------+-----+
|abbreviation| name|
+------------+-----+
| TX|Texas|
+------------+-----+
//not expression
scala> df.filter(not(df("abbreviation") === "TX")).show
+------------+--------------------+
|abbreviation| name|
+------------+--------------------+
| AL| Alabama|
| AK| Alaska|
| AS| American Samoa|
| AZ| Arizona|
| AR| Arkansas|
| CA| California|
| CO| Colorado|
| CT| Connecticut|
| DE| Delaware|
| DC|District Of Columbia|
| FM|Federated States ...|
| FL| Florida|
| GA| Georgia|
| GU| Guam|
| HI| Hawaii|
| ID| Idaho|
| IL| Illinois|
| IN| Indiana|
| IA| Iowa|
| KS| Kansas|
+------------+--------------------+
only showing top 20 rows
Let's create a sample dataset and do a deep dive into exactly why OP's code didn't work.
Here's our sample data:
val df = Seq(
("Rockets", 2, "TX"),
("Warriors", 6, "CA"),
("Spurs", 5, "TX"),
("Knicks", 2, "NY")
).toDF("team_name", "num_championships", "state")
We can pretty print our dataset with the show() method:
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Warriors| 6| CA|
| Spurs| 5| TX|
| Knicks| 2| NY|
+---------+-----------------+-----+
Let's examine the results of df.select(df("state")==="TX").show():
+------------+
|(state = TX)|
+------------+
| true|
| false|
| true|
| false|
+------------+
It's easier to understand this result by simply appending a column - df.withColumn("is_state_tx", df("state")==="TX").show():
+---------+-----------------+-----+-----------+
|team_name|num_championships|state|is_state_tx|
+---------+-----------------+-----+-----------+
| Rockets| 2| TX| true|
| Warriors| 6| CA| false|
| Spurs| 5| TX| true|
| Knicks| 2| NY| false|
+---------+-----------------+-----+-----------+
The other code OP tried (df.select(df("state")=="TX").show()) returns this error:
<console>:27: error: overloaded method value select with alternatives:
[U1](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1])org.apache.spark.sql.Dataset[U1] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (Boolean)
df.select(df("state")=="TX").show()
^
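The === operator is defined in the Column class and returns a new Column. The == operator, on the other hand, is Scala's ordinary object equality and always returns a plain Boolean rather than a Column. select has no overload that accepts a Boolean, and that's why this code errors out.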
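To make the difference concrete, here is a toy sketch in plain Scala. Expr is a hypothetical stand-in written for this illustration, not Spark's actual Column class; it only mimics the idea that === builds an expression while == compares objects right away:

```scala
// Toy stand-in for Spark's Column, only to show the mechanics (hypothetical class).
final case class Expr(sql: String) {
  // Like Column.===, this compares nothing now; it builds a new expression.
  def ===(other: Any): Expr = Expr(s"($sql = $other)")
}

val state = Expr("state")
val literal: Any = "TX"

val built: Expr = state === literal      // a description of the comparison
val compared: Boolean = state == literal // ordinary equality, evaluated immediately

println(built.sql) // (state = TX)
println(compared)  // false
```

A method that expects an expression can accept built, but it has nothing useful to do with compared, which is the situation the compiler error above reports.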
Here's the accepted answer that works:
df.filter(df("state")==="TX").show()
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
As other posters have mentioned, the === method takes an argument with an Any type, so this isn't the only solution that works. This works too for example:
df.filter(df("state") === lit("TX")).show
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
The Column equalTo method can also be used:
df.filter(df("state").equalTo("TX")).show()
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
It's worthwhile studying this example in detail. Scala's syntax seems magical at times, especially when methods are invoked without dot notation. It's hard for the untrained eye to see that === is a method defined in the Column class!
In Spark 2.4
To compare with one value:
df.filter(lower(trim($"col_name")) === "<value>").show()
To compare with a collection of values:
df.filter($"col_name".isInCollection(Seq("value1", "value2"))).show()
Suppose we have a table A and we left join it with a large table B (to fetch field colB).
The output is then left joined with a large table C (to fetch field colC), and finally we left join this with a table D (to fetch field colD).
So the above 3 left joins create a final dataset that is shared by multiple consumers.
As a consumer of this code, I select colA and colD from the final dataset (I don't need colB and colC).
Is there a feature that will skip the 2 joins with B and C (since colB and colC are not required downstream in my case)?
FYI:
I don't want to change the implementation (i.e. the 3 joins), since this method is used by multiple teams.
I don't want to create my own implementation (to avoid code duplication and to stay up to date with the logic that is used across the teams).
PS for clarity:
B,C,D are huge dim tables
A is a fact table (relatively smaller than B,C,D)
I do not think that this is possible without changing the original code. The reason is that even if the final result does not contain columns from tables B and C, the result might still depend on which tables were part of the join chain.
An example: let's assume we have this data and we want to join the four tables on the id column.
Table A        Table B        Table C        Table D
+---+----+     +---+----+     +---+----+     +---+----+
| id|colA|     | id|colB|     | id|colC|     | id|colD|
+---+----+     +---+----+     +---+----+     +---+----+
|  1|  A1|     |  1|  B1|     |  1|  C1|     |  1|  D1|
|  2|  A2|     |  2|  B2|     |  2|  C2|     |  2|  D2|
+---+----+     +---+----+     |  2| C2b|     +---+----+
                              +---+----+
The important point to note is that the table C contains a duplicate value in the join column.
If the four tables are joined with a left join and the columns A and D are selected, the result would be
+---+----+----+----+----+        +---+----+----+
| id|colA|colB|colC|colD|        | id|colA|colD|
+---+----+----+----+----+        +---+----+----+
|  1|  A1|  B1|  C1|  D1|  ==>   |  1|  A1|  D1|
|  2|  A2|  B2| C2b|  D2|        |  2|  A2|  D2|
|  2|  A2|  B2|  C2|  D2|        |  2|  A2|  D2|
+---+----+----+----+----+        +---+----+----+
On the other hand, if only the tables A and D are joined directly without tables B and C, the result would be
+---+----+----+
| id|colA|colD|
+---+----+----+
| 1| A1| D1|
| 2| A2| D2|
+---+----+----+
So even if the final result contains no columns from tables B and C, the result is different if you join A->D or A->B->C->D. So the Spark code cannot skip the joins of the tables B and C.
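The row-count difference can be reproduced without Spark. The following is a minimal sketch that models the four tables as plain Scala lists; leftJoin is a hypothetical helper written for this illustration and is not how Spark implements joins:

```scala
// Plain Scala model of the four tables, not Spark: each table is a list of (id, value) rows.
val a = Seq((1, "A1"), (2, "A2"))
val b = Seq((1, "B1"), (2, "B2"))
val c = Seq((1, "C1"), (2, "C2"), (2, "C2b")) // id 2 appears twice
val d = Seq((1, "D1"), (2, "D2"))

// Left join on id: one output row per matching right row,
// or a single row with None when there is no match.
def leftJoin[L, R](left: Seq[(Int, L)], right: Seq[(Int, R)]): Seq[(Int, (L, Option[R]))] =
  left.flatMap { case (id, l) =>
    val matches = right.filter(_._1 == id).map(_._2)
    if (matches.isEmpty) Seq((id, (l, Option.empty[R])))
    else matches.map(r => (id, (l, Option(r))))
  }

// A -> B -> C -> D, then keep only id, colA and colD.
val chained = leftJoin(leftJoin(leftJoin(a, b), c), d)
  .map { case (id, (((colA, _), _), colD)) => (id, colA, colD) }

// A -> D directly.
val direct = leftJoin(a, d).map { case (id, (colA, colD)) => (id, colA, colD) }

println(chained.size) // 3
println(direct.size)  // 2
```

The extra row in chained comes only from the duplicate id in table C, even though no column of C survives the projection.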
The good news: if you go the way A->B->C->D and exclude the columns from tables B and C, Spark will only process the join column(s) of tables B and C and skip all other columns (for example during a shuffle). So at least the amount of data that is processed will be lower when not selecting columns from tables B and C.
I'm working on Apache Spark 2.3.0 (cloudera4) and I have an issue processing a DataFrame.
I've got this input dataframe:
+---+---+----+
| id| d1| d2 |
+---+---+----+
| 1| | 2.0|
| 2| |-4.0|
| 3| | 6.0|
| 4|3.0| |
+---+---+----+
And I need this output:
+---+---+----+----+
| id| d1| d2 | r |
+---+---+----+----+
| 1| | 2.0| 7.0|
| 2| |-4.0| 5.0|
| 3| | 6.0| 9.0|
| 4|3.0| | 3.0|
+---+---+----+----+
That is, from an iterative perspective: take the row with the biggest id (4) and put its d1 value in the r column, then take the next row (3) and put r[4] + d2[3] in its r column, and so on.
Is it possible to do something like that in Spark? I need a value computed for one row to calculate the value for another row.
How about this? The important bit is sum($"r1").over(Window.orderBy($"id".desc)), which calculates a cumulative sum of a column. Other than that, I'm creating a couple of helper columns to get the max id and get the ordering right.
val result = df
.withColumn("max_id", max($"id").over(Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)))
.withColumn("r1", when($"id" === $"max_id", $"d1").otherwise($"d2"))
.withColumn("r", sum($"r1").over(Window.orderBy($"id".desc)))
.drop($"max_id").drop($"r1")
.orderBy($"id")
result.show
+---+----+----+---+
| id| d1| d2| r|
+---+----+----+---+
| 1|null| 2.0|7.0|
| 2|null|-4.0|5.0|
| 3|null| 6.0|9.0|
| 4| 3.0|null|3.0|
+---+----+----+---+
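The logic of that window expression can be checked by hand with a small plain-Scala sketch (not Spark). InRow is a hypothetical row type, and treating a missing value as 0.0 is an assumption of this sketch:

```scala
// Plain Scala model of the running sum, not Spark.
case class InRow(id: Int, d1: Option[Double], d2: Option[Double]) // hypothetical row type

val rows = Seq(
  InRow(1, None, Some(2.0)),
  InRow(2, None, Some(-4.0)),
  InRow(3, None, Some(6.0)),
  InRow(4, Some(3.0), None)
)

val maxId = rows.map(_.id).max

// Walk from the largest id down; the first row contributes d1, the others d2.
// Missing values count as 0.0 here (an assumption of this sketch).
val descending = rows.sortBy(r => -r.id)
val contributions = descending.map { r =>
  (if (r.id == maxId) r.d1 else r.d2).getOrElse(0.0)
}

// scanLeft yields every intermediate total; drop the initial 0.0 seed.
val runningTotals = contributions.scanLeft(0.0)(_ + _).tail

// Pair each id with its total and restore ascending id order.
val result = descending.map(_.id).zip(runningTotals).sortBy(_._1)
println(result) // List((1,7.0), (2,5.0), (3,9.0), (4,3.0))
```

The r1 helper column in the Spark version corresponds to contributions, and the ordered window sum corresponds to the scanLeft.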
I have a data frame:
+---------+---------------------+
| id| Name|
+---------+---------------------+
| 1| 'Gary'|
| 1| 'Danny'|
| 2| 'Christopher'|
| 2| 'Kevin'|
+---------+---------------------+
I need to combine all the Name values that share the same id into one list. Please tell me how to get this from it:
+---------+------------------------+
| id| Name|
+---------+------------------------+
| 1| ['Gary', 'Danny']|
| 2| ['Kevin','Christopher']|
+---------+------------------------+
You can use groupBy with one of the collect functions. Depending on your needs, you can collect into a list or a set.
df.groupBy(col("id")).agg(collect_list(col("Name")))
if you want to keep duplicate values, or
df.groupBy(col("id")).agg(collect_set(col("Name")))
if you want unique values.
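The difference between the two aggregations can be sketched on a plain Scala collection (not Spark). The second "Kevin" row is added here only to make the duplicate handling visible and is not part of the question's data:

```scala
// Plain Scala analogy, not Spark: list-style vs set-style grouping.
val rows = Seq(
  (1, "Gary"),
  (1, "Danny"),
  (2, "Christopher"),
  (2, "Kevin"),
  (2, "Kevin") // duplicate added for illustration
)

// Like collect_list: every value is kept, duplicates included.
val asList = rows.groupBy(_._1).map { case (id, rs) => id -> rs.map(_._2) }

// Like collect_set: each distinct value appears once.
val asSet = rows.groupBy(_._1).map { case (id, rs) => id -> rs.map(_._2).toSet }

println(asList(2)) // List(Christopher, Kevin, Kevin)
println(asSet(2))  // Set(Christopher, Kevin)
```

Unlike this in-memory sketch, Spark does not guarantee the order of elements in the collected array.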
Use groupBy and collect_list functions for this case.
from pyspark.sql.functions import *
df.groupBy(col("id")).agg(collect_list(col("Name")).alias("Name")).show(10,False)
#+---+------------------------+
#|id |Name |
#+---+------------------------+
#|1 |['Gary', 'Danny'] |
#|2 |['Kevin', 'Christopher']|
#+---+------------------------+
If df is a pandas DataFrame rather than a Spark one:
df.groupby('id')['Name'].apply(list)
GraphFrames has a nice example of stateful motifs.
How can I explicitly return the counts? As you see the output only contains vertices and friends but not the counts.
How can I modify it to not (only) have access to the edges but access to the labels of the vertices as well?
when(relationship === "friend", cnt + 1).otherwise(cnt)
I.e. how could I enhance the count to count
the friends of each vertex with age > 30
the percentage of friendsGreater30 / allFriends
val g = examples.Graphs.friends // get example graph
// Find chains of 4 vertices.
val chain4 = g.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")
// Query on sequence, with state (cnt)
// (a) Define method for updating state given the next element of the motif.
def sumFriends(cnt: Column, relationship: Column): Column = {
when(relationship === "friend", cnt + 1).otherwise(cnt)
}
// (b) Use sequence operation to apply method to sequence of elements in motif.
// In this case, the elements are the 3 edges.
val condition = Seq("ab", "bc", "cd").
foldLeft(lit(0))((cnt, e) => sumFriends(cnt, col(e)("relationship")))
// (c) Apply filter to DataFrame.
val chainWith2Friends2 = chain4.where(condition >= 2)
// Source: http://graphframes.github.io/user-guide.html
chainWith2Friends2.show()
Which will output
+-------------+------------+-------------+------------+-------------+------------+--------------+
| a| ab| b| bc| c| cd| d|
+-------------+------------+-------------+------------+-------------+------------+--------------+
|[e,Esther,32]|[e,d,friend]| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,e,friend]| [e,Esther,32]|
|[e,Esther,32]|[e,d,friend]| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,b,friend]| [b,Bob,36]|
| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,e,friend]|[e,Esther,32]|[e,d,friend]| [d,David,29]|
| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,e,friend]|[e,Esther,32]|[e,f,follow]| [f,Fanny,36]|
| [d,David,29]|[d,a,friend]| [a,Alice,34]|[a,b,friend]| [b,Bob,36]|[b,c,follow]|[c,Charlie,30]|
| [a,Alice,34]|[a,e,friend]|[e,Esther,32]|[e,d,friend]| [d,David,29]|[d,a,friend]| [a,Alice,34]|
+-------------+------------+-------------+------------+-------------+------------+--------------+
Note that sumFriends returns a Column, so condition is a column. This is why you can access it in a where statement without quotes. So all you have to do is add that column to your dataframe. After running the above code, I can run
chain4.withColumn("condition",condition).select("condition").show
+---------+
|condition|
+---------+
| 1|
| 0|
| 0|
| 0|
| 0|
| 3|
| 3|
| 3|
| 2|
| 2|
| 3|
| 1|
+---------+
you could also use chain4.select(condition)
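To see what the foldLeft is computing for a single row, here is a plain-Scala model (not GraphFrames) in which the counter is an Int instead of a Column. The relationships value is made-up data standing in for the ab, bc and cd edges of one chain:

```scala
// Plain Scala model, not GraphFrames: thread a counter through the edges of one chain.
val relationships = Seq("friend", "friend", "follow") // made-up ab, bc, cd labels

// Same shape as sumFriends above: return the updated state for the next element.
def sumFriends(cnt: Int, relationship: String): Int =
  if (relationship == "friend") cnt + 1 else cnt

// Start at 0 and apply sumFriends to each edge in turn.
val friendCount = relationships.foldLeft(0)(sumFriends)
println(friendCount) // 2
```

In the GraphFrames version the same fold runs once, over Column expressions, and the resulting Column is then evaluated for every row, which is why it can be added with withColumn or select.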
Hope this helps