I have data as below
+-----+---------+----------+
| TYPE|DTIN_MNTH|DTOUT_MNTH|
+-----+---------+----------+
| A| 2022-03| 2022-05|
| B| 2022-04| 2022-04|
| C| 2022-05| 2022-07|
+-----+---------+----------+
I want to expand the DTIN_MNTH and DTOUT_MNTH in rows as below,
+-----+---------+
| TYPE| MNTH|
+-----+---------+
| A| 2022-03|
| A| 2022-04|
| A| 2022-05|
| B| 2022-04|
| C| 2022-05|
| C| 2022-06|
| C| 2022-07|
+-----+---------+
From Spark 2.4, you can use sequence built-in function to generate array of months between two dates, then use explode to expand array of months. You can then reformat your expanded column to month format.
It gives you the following code, with input your input dataframe:
from pyspark.sql import functions as F
result = input.select(
F.col('TYPE'),
F.explode(
F.sequence(
F.to_timestamp(F.col('DTIN_MNTH'), 'yyyy-MM'),
F.to_timestamp(F.col('DTOUT_MNTH'), 'yyyy-MM'),
F.expr('INTERVAL 1 MONTH'))
).alias('MNTH')
).withColumn('MNTH', F.date_format(F.col('MNTH'), 'yyyy-MM'))
I want to select a column that equals to a certain value. I am doing this in scala and having a little trouble.
Heres my code
df.select(df("state")==="TX").show()
this returns the state column with boolean values instead of just TX
Ive also tried
df.select(df("state")=="TX").show()
but this doesn't work either.
I had the same issue, and the following syntax worked for me:
df.filter(df("state")==="TX").show()
I'm using Spark 1.6.
There is another simple sql like option. With Spark 1.6 below also should work.
df.filter("state = 'TX'")
This is a new way of specifying sql like filters. For a full list of supported operators, check out this class.
You should be using where, select is a projection that returns the output of the statement, thus why you get boolean values. where is a filter that keeps the structure of the dataframe, but only keeps data where the filter works.
Along the same line though, per the documentation, you can write this in 3 different ways
// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
peopleDf($"age" > 15)
To get the negation, do this ...
df.filter(not( ..expression.. ))
eg
df.filter(not($"state" === "TX"))
df.filter($"state" like "T%%") for pattern matching
df.filter($"state" === "TX") or df.filter("state = 'TX'") for equality
Worked on Spark V2.*
import sqlContext.implicits._
df.filter($"state" === "TX")
if needs to be compared against a variable (e.g., var):
import sqlContext.implicits._
df.filter($"state" === var)
Note : import sqlContext.implicits._
We can write multiple Filter/where conditions in Dataframe.
For example:
table1_df
.filter($"Col_1_name" === "buddy") // check for equal to string
.filter($"Col_2_name" === "A")
.filter(not($"Col_2_name".contains(" .sql"))) // filter a string which is not relevent
.filter("Col_2_name is not null") // no null filter
.take(5).foreach(println)
Here is the complete example using spark2.2+ taking data in json...
val myjson = "[{\"name\":\"Alabama\",\"abbreviation\":\"AL\"},{\"name\":\"Alaska\",\"abbreviation\":\"AK\"},{\"name\":\"American Samoa\",\"abbreviation\":\"AS\"},{\"name\":\"Arizona\",\"abbreviation\":\"AZ\"},{\"name\":\"Arkansas\",\"abbreviation\":\"AR\"},{\"name\":\"California\",\"abbreviation\":\"CA\"},{\"name\":\"Colorado\",\"abbreviation\":\"CO\"},{\"name\":\"Connecticut\",\"abbreviation\":\"CT\"},{\"name\":\"Delaware\",\"abbreviation\":\"DE\"},{\"name\":\"District Of Columbia\",\"abbreviation\":\"DC\"},{\"name\":\"Federated States Of Micronesia\",\"abbreviation\":\"FM\"},{\"name\":\"Florida\",\"abbreviation\":\"FL\"},{\"name\":\"Georgia\",\"abbreviation\":\"GA\"},{\"name\":\"Guam\",\"abbreviation\":\"GU\"},{\"name\":\"Hawaii\",\"abbreviation\":\"HI\"},{\"name\":\"Idaho\",\"abbreviation\":\"ID\"},{\"name\":\"Illinois\",\"abbreviation\":\"IL\"},{\"name\":\"Indiana\",\"abbreviation\":\"IN\"},{\"name\":\"Iowa\",\"abbreviation\":\"IA\"},{\"name\":\"Kansas\",\"abbreviation\":\"KS\"},{\"name\":\"Kentucky\",\"abbreviation\":\"KY\"},{\"name\":\"Louisiana\",\"abbreviation\":\"LA\"},{\"name\":\"Maine\",\"abbreviation\":\"ME\"},{\"name\":\"Marshall Islands\",\"abbreviation\":\"MH\"},{\"name\":\"Maryland\",\"abbreviation\":\"MD\"},{\"name\":\"Massachusetts\",\"abbreviation\":\"MA\"},{\"name\":\"Michigan\",\"abbreviation\":\"MI\"},{\"name\":\"Minnesota\",\"abbreviation\":\"MN\"},{\"name\":\"Mississippi\",\"abbreviation\":\"MS\"},{\"name\":\"Missouri\",\"abbreviation\":\"MO\"},{\"name\":\"Montana\",\"abbreviation\":\"MT\"},{\"name\":\"Nebraska\",\"abbreviation\":\"NE\"},{\"name\":\"Nevada\",\"abbreviation\":\"NV\"},{\"name\":\"New Hampshire\",\"abbreviation\":\"NH\"},{\"name\":\"New Jersey\",\"abbreviation\":\"NJ\"},{\"name\":\"New Mexico\",\"abbreviation\":\"NM\"},{\"name\":\"New York\",\"abbreviation\":\"NY\"},{\"name\":\"North Carolina\",\"abbreviation\":\"NC\"},{\"name\":\"North Dakota\",\"abbreviation\":\"ND\"},{\"name\":\"Northern Mariana Islands\",\"abbreviation\":\"MP\"},{\"name\":\"Ohio\",\"abbreviation\":\"OH\"},{\"name\":\"Oklahoma\",\"abbreviation\":\"OK\"},{\"name\":\"Oregon\",\"abbreviation\":\"OR\"},{\"name\":\"Palau\",\"abbreviation\":\"PW\"},{\"name\":\"Pennsylvania\",\"abbreviation\":\"PA\"},{\"name\":\"Puerto Rico\",\"abbreviation\":\"PR\"},{\"name\":\"Rhode Island\",\"abbreviation\":\"RI\"},{\"name\":\"South Carolina\",\"abbreviation\":\"SC\"},{\"name\":\"South Dakota\",\"abbreviation\":\"SD\"},{\"name\":\"Tennessee\",\"abbreviation\":\"TN\"},{\"name\":\"Texas\",\"abbreviation\":\"TX\"},{\"name\":\"Utah\",\"abbreviation\":\"UT\"},{\"name\":\"Vermont\",\"abbreviation\":\"VT\"},{\"name\":\"Virgin Islands\",\"abbreviation\":\"VI\"},{\"name\":\"Virginia\",\"abbreviation\":\"VA\"},{\"name\":\"Washington\",\"abbreviation\":\"WA\"},{\"name\":\"West Virginia\",\"abbreviation\":\"WV\"},{\"name\":\"Wisconsin\",\"abbreviation\":\"WI\"},{\"name\":\"Wyoming\",\"abbreviation\":\"WY\"}]"
import spark.implicits._
val df = spark.read.json(Seq(myjson).toDS)
df.show
import spark.implicits._
val df = spark.read.json(Seq(myjson).toDS)
df.show
scala> df.show
+------------+--------------------+
|abbreviation| name|
+------------+--------------------+
| AL| Alabama|
| AK| Alaska|
| AS| American Samoa|
| AZ| Arizona|
| AR| Arkansas|
| CA| California|
| CO| Colorado|
| CT| Connecticut|
| DE| Delaware|
| DC|District Of Columbia|
| FM|Federated States ...|
| FL| Florida|
| GA| Georgia|
| GU| Guam|
| HI| Hawaii|
| ID| Idaho|
| IL| Illinois|
| IN| Indiana|
| IA| Iowa|
| KS| Kansas|
+------------+--------------------+
// equals matching
scala> df.filter(df("abbreviation") === "TX").show
+------------+-----+
|abbreviation| name|
+------------+-----+
| TX|Texas|
+------------+-----+
// or using lit
scala> df.filter(df("abbreviation") === lit("TX")).show
+------------+-----+
|abbreviation| name|
+------------+-----+
| TX|Texas|
+------------+-----+
//not expression
scala> df.filter(not(df("abbreviation") === "TX")).show
+------------+--------------------+
|abbreviation| name|
+------------+--------------------+
| AL| Alabama|
| AK| Alaska|
| AS| American Samoa|
| AZ| Arizona|
| AR| Arkansas|
| CA| California|
| CO| Colorado|
| CT| Connecticut|
| DE| Delaware|
| DC|District Of Columbia|
| FM|Federated States ...|
| FL| Florida|
| GA| Georgia|
| GU| Guam|
| HI| Hawaii|
| ID| Idaho|
| IL| Illinois|
| IN| Indiana|
| IA| Iowa|
| KS| Kansas|
+------------+--------------------+
only showing top 20 rows
Let's create a sample dataset and do a deep dive into exactly why OP's code didn't work.
Here's our sample data:
val df = Seq(
("Rockets", 2, "TX"),
("Warriors", 6, "CA"),
("Spurs", 5, "TX"),
("Knicks", 2, "NY")
).toDF("team_name", "num_championships", "state")
We can pretty print our dataset with the show() method:
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Warriors| 6| CA|
| Spurs| 5| TX|
| Knicks| 2| NY|
+---------+-----------------+-----+
Let's examine the results of df.select(df("state")==="TX").show():
+------------+
|(state = TX)|
+------------+
| true|
| false|
| true|
| false|
+------------+
It's easier to understand this result by simply appending a column - df.withColumn("is_state_tx", df("state")==="TX").show():
+---------+-----------------+-----+-----------+
|team_name|num_championships|state|is_state_tx|
+---------+-----------------+-----+-----------+
| Rockets| 2| TX| true|
| Warriors| 6| CA| false|
| Spurs| 5| TX| true|
| Knicks| 2| NY| false|
+---------+-----------------+-----+-----------+
The other code OP tried (df.select(df("state")=="TX").show()) returns this error:
<console>:27: error: overloaded method value select with alternatives:
[U1](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1])org.apache.spark.sql.Dataset[U1] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (Boolean)
df.select(df("state")=="TX").show()
^
The === operator is defined in the Column class. The Column class doesn't define a == operator and that's why this code is erroring out.
Here's the accepted answer that works:
df.filter(df("state")==="TX").show()
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
As other posters have mentioned, the === method takes an argument with an Any type, so this isn't the only solution that works. This works too for example:
df.filter(df("state") === lit("TX")).show
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
The Column equalTo method can also be used:
df.filter(df("state").equalTo("TX")).show()
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
It worthwhile studying this example in detail. Scala's syntax seems magical at times, especially when method are invoked without dot notation. It's hard for the untrained eye to see that === is a method defined in the Column class!
In Spark 2.4
To compare with one value:
df.filter(lower(trim($"col_name")) === "<value>").show()
To compare with collection of value:
df.filter($"col_name".isInCollection(new HashSet<>(Arrays.asList("value1", "value2")))).show()
I have a data frame:
+---------+---------------------+
| id| Name|
+---------+---------------------+
| 1| 'Gary'|
| 1| 'Danny'|
| 2| 'Christopher'|
| 2| 'Kevin'|
+---------+---------------------+
I need to combine all the Name values in the id column. Please tell me how to get from it:
+---------+------------------------+
| id| Name|
+---------+------------------------+
| 1| ['Gary', 'Danny']|
| 2| ['Kevin','Christopher']|
+---------+------------------------+
You can use groupBy and collect functions. Based on your need you can use list or set etc.
df.groupBy(col("id")).agg(collect_list(col("Name"))
in case you want duplicate values
df.groupBy(col("id")).agg(collect_set(col("Name"))
if you want unique values
Use groupBy and collect_list functions for this case.
from pyspark.sql.functions import *
df.groupBy(col("id")).agg(collect_list(col("Name")).alias("Name")).show(10,False)
#+---+------------------------+
#|id |Name |
#+---+------------------------+
#|1 |['Gary', 'Danny'] |
#|2 |['Kevin', 'Christopher']|
#+---+------------------------+
df.groupby('id')['Name'].apply(list)
I have a column in my dataframe that is a string with the value like ["value_a", "value_b"].
What is the best way to convert this column to Array and explode it? For now, I'm doing something like:
explode(split(col("value"), ",")).alias("value")
But I'm getting strings like ["awesome" or "John" or "Maria]" and the expected output should be awesome, John, Maria (one item per line, that why i'm using explode).
Sample code to reproduce:
sample_input = [
{"id":1,"value":"[\"johnny\", \"maria\"]"},
{"id":2,"value":"[\"awesome\", \"George\"]"}
]
df = spark.createDataFrame(sample_input)
df.select(col("id"), explode(split(col("value"), ",")).alias("value")).show(n=10)
Output generated by code above:
+---+----------+
| id| value|
+---+----------+
| 1| ["johnny"|
| 1| "maria"]|
| 2|["awesome"|
| 2| "George"]|
+---+----------+
Expected should be:
+---+----------+
| id| value|
+---+----------+
| 1| johnny |
| 1| maria |
| 2| awesome|
| 2| George|
+---+----------+
Worked for me.
sample_input = [
{"id":1,"value":["johnny", "maria"]},
{"id":2,"value":["awesome", "George"]}
]
df = spark.createDataFrame(sample_input)
df.select(col("id"), explode(col("value")).alias("value")).show(n=10)
+---+-------+
| id| value|
+---+-------+
| 1| johnny|
| 1| maria|
| 2|awesome|
| 2| George|
+---+-------+