I want to select a column that equals to a certain value. I am doing this in scala and having a little trouble.
Heres my code
df.select(df("state")==="TX").show()
this returns the state column with boolean values instead of just TX
Ive also tried
df.select(df("state")=="TX").show()
but this doesn't work either.
I had the same issue, and the following syntax worked for me:
df.filter(df("state")==="TX").show()
I'm using Spark 1.6.
There is another simple sql like option. With Spark 1.6 below also should work.
df.filter("state = 'TX'")
This is a new way of specifying sql like filters. For a full list of supported operators, check out this class.
You should be using where, select is a projection that returns the output of the statement, thus why you get boolean values. where is a filter that keeps the structure of the dataframe, but only keeps data where the filter works.
Along the same line though, per the documentation, you can write this in 3 different ways
// The following are equivalent:
peopleDf.filter($"age" > 15)
peopleDf.where($"age" > 15)
peopleDf($"age" > 15)
To get the negation, do this ...
df.filter(not( ..expression.. ))
eg
df.filter(not($"state" === "TX"))
df.filter($"state" like "T%%") for pattern matching
df.filter($"state" === "TX") or df.filter("state = 'TX'") for equality
Worked on Spark V2.*
import sqlContext.implicits._
df.filter($"state" === "TX")
if needs to be compared against a variable (e.g., var):
import sqlContext.implicits._
df.filter($"state" === var)
Note : import sqlContext.implicits._
We can write multiple Filter/where conditions in Dataframe.
For example:
table1_df
.filter($"Col_1_name" === "buddy") // check for equal to string
.filter($"Col_2_name" === "A")
.filter(not($"Col_2_name".contains(" .sql"))) // filter a string which is not relevent
.filter("Col_2_name is not null") // no null filter
.take(5).foreach(println)
Here is the complete example using spark2.2+ taking data in json...
val myjson = "[{\"name\":\"Alabama\",\"abbreviation\":\"AL\"},{\"name\":\"Alaska\",\"abbreviation\":\"AK\"},{\"name\":\"American Samoa\",\"abbreviation\":\"AS\"},{\"name\":\"Arizona\",\"abbreviation\":\"AZ\"},{\"name\":\"Arkansas\",\"abbreviation\":\"AR\"},{\"name\":\"California\",\"abbreviation\":\"CA\"},{\"name\":\"Colorado\",\"abbreviation\":\"CO\"},{\"name\":\"Connecticut\",\"abbreviation\":\"CT\"},{\"name\":\"Delaware\",\"abbreviation\":\"DE\"},{\"name\":\"District Of Columbia\",\"abbreviation\":\"DC\"},{\"name\":\"Federated States Of Micronesia\",\"abbreviation\":\"FM\"},{\"name\":\"Florida\",\"abbreviation\":\"FL\"},{\"name\":\"Georgia\",\"abbreviation\":\"GA\"},{\"name\":\"Guam\",\"abbreviation\":\"GU\"},{\"name\":\"Hawaii\",\"abbreviation\":\"HI\"},{\"name\":\"Idaho\",\"abbreviation\":\"ID\"},{\"name\":\"Illinois\",\"abbreviation\":\"IL\"},{\"name\":\"Indiana\",\"abbreviation\":\"IN\"},{\"name\":\"Iowa\",\"abbreviation\":\"IA\"},{\"name\":\"Kansas\",\"abbreviation\":\"KS\"},{\"name\":\"Kentucky\",\"abbreviation\":\"KY\"},{\"name\":\"Louisiana\",\"abbreviation\":\"LA\"},{\"name\":\"Maine\",\"abbreviation\":\"ME\"},{\"name\":\"Marshall Islands\",\"abbreviation\":\"MH\"},{\"name\":\"Maryland\",\"abbreviation\":\"MD\"},{\"name\":\"Massachusetts\",\"abbreviation\":\"MA\"},{\"name\":\"Michigan\",\"abbreviation\":\"MI\"},{\"name\":\"Minnesota\",\"abbreviation\":\"MN\"},{\"name\":\"Mississippi\",\"abbreviation\":\"MS\"},{\"name\":\"Missouri\",\"abbreviation\":\"MO\"},{\"name\":\"Montana\",\"abbreviation\":\"MT\"},{\"name\":\"Nebraska\",\"abbreviation\":\"NE\"},{\"name\":\"Nevada\",\"abbreviation\":\"NV\"},{\"name\":\"New Hampshire\",\"abbreviation\":\"NH\"},{\"name\":\"New Jersey\",\"abbreviation\":\"NJ\"},{\"name\":\"New Mexico\",\"abbreviation\":\"NM\"},{\"name\":\"New York\",\"abbreviation\":\"NY\"},{\"name\":\"North Carolina\",\"abbreviation\":\"NC\"},{\"name\":\"North Dakota\",\"abbreviation\":\"ND\"},{\"name\":\"Northern Mariana Islands\",\"abbreviation\":\"MP\"},{\"name\":\"Ohio\",\"abbreviation\":\"OH\"},{\"name\":\"Oklahoma\",\"abbreviation\":\"OK\"},{\"name\":\"Oregon\",\"abbreviation\":\"OR\"},{\"name\":\"Palau\",\"abbreviation\":\"PW\"},{\"name\":\"Pennsylvania\",\"abbreviation\":\"PA\"},{\"name\":\"Puerto Rico\",\"abbreviation\":\"PR\"},{\"name\":\"Rhode Island\",\"abbreviation\":\"RI\"},{\"name\":\"South Carolina\",\"abbreviation\":\"SC\"},{\"name\":\"South Dakota\",\"abbreviation\":\"SD\"},{\"name\":\"Tennessee\",\"abbreviation\":\"TN\"},{\"name\":\"Texas\",\"abbreviation\":\"TX\"},{\"name\":\"Utah\",\"abbreviation\":\"UT\"},{\"name\":\"Vermont\",\"abbreviation\":\"VT\"},{\"name\":\"Virgin Islands\",\"abbreviation\":\"VI\"},{\"name\":\"Virginia\",\"abbreviation\":\"VA\"},{\"name\":\"Washington\",\"abbreviation\":\"WA\"},{\"name\":\"West Virginia\",\"abbreviation\":\"WV\"},{\"name\":\"Wisconsin\",\"abbreviation\":\"WI\"},{\"name\":\"Wyoming\",\"abbreviation\":\"WY\"}]"
import spark.implicits._
val df = spark.read.json(Seq(myjson).toDS)
df.show
import spark.implicits._
val df = spark.read.json(Seq(myjson).toDS)
df.show
scala> df.show
+------------+--------------------+
|abbreviation| name|
+------------+--------------------+
| AL| Alabama|
| AK| Alaska|
| AS| American Samoa|
| AZ| Arizona|
| AR| Arkansas|
| CA| California|
| CO| Colorado|
| CT| Connecticut|
| DE| Delaware|
| DC|District Of Columbia|
| FM|Federated States ...|
| FL| Florida|
| GA| Georgia|
| GU| Guam|
| HI| Hawaii|
| ID| Idaho|
| IL| Illinois|
| IN| Indiana|
| IA| Iowa|
| KS| Kansas|
+------------+--------------------+
// equals matching
scala> df.filter(df("abbreviation") === "TX").show
+------------+-----+
|abbreviation| name|
+------------+-----+
| TX|Texas|
+------------+-----+
// or using lit
scala> df.filter(df("abbreviation") === lit("TX")).show
+------------+-----+
|abbreviation| name|
+------------+-----+
| TX|Texas|
+------------+-----+
//not expression
scala> df.filter(not(df("abbreviation") === "TX")).show
+------------+--------------------+
|abbreviation| name|
+------------+--------------------+
| AL| Alabama|
| AK| Alaska|
| AS| American Samoa|
| AZ| Arizona|
| AR| Arkansas|
| CA| California|
| CO| Colorado|
| CT| Connecticut|
| DE| Delaware|
| DC|District Of Columbia|
| FM|Federated States ...|
| FL| Florida|
| GA| Georgia|
| GU| Guam|
| HI| Hawaii|
| ID| Idaho|
| IL| Illinois|
| IN| Indiana|
| IA| Iowa|
| KS| Kansas|
+------------+--------------------+
only showing top 20 rows
Let's create a sample dataset and do a deep dive into exactly why OP's code didn't work.
Here's our sample data:
val df = Seq(
("Rockets", 2, "TX"),
("Warriors", 6, "CA"),
("Spurs", 5, "TX"),
("Knicks", 2, "NY")
).toDF("team_name", "num_championships", "state")
We can pretty print our dataset with the show() method:
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Warriors| 6| CA|
| Spurs| 5| TX|
| Knicks| 2| NY|
+---------+-----------------+-----+
Let's examine the results of df.select(df("state")==="TX").show():
+------------+
|(state = TX)|
+------------+
| true|
| false|
| true|
| false|
+------------+
It's easier to understand this result by simply appending a column - df.withColumn("is_state_tx", df("state")==="TX").show():
+---------+-----------------+-----+-----------+
|team_name|num_championships|state|is_state_tx|
+---------+-----------------+-----+-----------+
| Rockets| 2| TX| true|
| Warriors| 6| CA| false|
| Spurs| 5| TX| true|
| Knicks| 2| NY| false|
+---------+-----------------+-----+-----------+
The other code OP tried (df.select(df("state")=="TX").show()) returns this error:
<console>:27: error: overloaded method value select with alternatives:
[U1](c1: org.apache.spark.sql.TypedColumn[org.apache.spark.sql.Row,U1])org.apache.spark.sql.Dataset[U1] <and>
(col: String,cols: String*)org.apache.spark.sql.DataFrame <and>
(cols: org.apache.spark.sql.Column*)org.apache.spark.sql.DataFrame
cannot be applied to (Boolean)
df.select(df("state")=="TX").show()
^
The === operator is defined in the Column class. The Column class doesn't define a == operator and that's why this code is erroring out.
Here's the accepted answer that works:
df.filter(df("state")==="TX").show()
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
As other posters have mentioned, the === method takes an argument with an Any type, so this isn't the only solution that works. This works too for example:
df.filter(df("state") === lit("TX")).show
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
The Column equalTo method can also be used:
df.filter(df("state").equalTo("TX")).show()
+---------+-----------------+-----+
|team_name|num_championships|state|
+---------+-----------------+-----+
| Rockets| 2| TX|
| Spurs| 5| TX|
+---------+-----------------+-----+
It worthwhile studying this example in detail. Scala's syntax seems magical at times, especially when method are invoked without dot notation. It's hard for the untrained eye to see that === is a method defined in the Column class!
In Spark 2.4
To compare with one value:
df.filter(lower(trim($"col_name")) === "<value>").show()
To compare with collection of value:
df.filter($"col_name".isInCollection(new HashSet<>(Arrays.asList("value1", "value2")))).show()
I have a column in my dataframe that is a string with the value like ["value_a", "value_b"].
What is the best way to convert this column to Array and explode it? For now, I'm doing something like:
explode(split(col("value"), ",")).alias("value")
But I'm getting strings like ["awesome" or "John" or "Maria]" and the expected output should be awesome, John, Maria (one item per line, that why i'm using explode).
Sample code to reproduce:
sample_input = [
{"id":1,"value":"[\"johnny\", \"maria\"]"},
{"id":2,"value":"[\"awesome\", \"George\"]"}
]
df = spark.createDataFrame(sample_input)
df.select(col("id"), explode(split(col("value"), ",")).alias("value")).show(n=10)
Output generated by code above:
+---+----------+
| id| value|
+---+----------+
| 1| ["johnny"|
| 1| "maria"]|
| 2|["awesome"|
| 2| "George"]|
+---+----------+
Expected should be:
+---+----------+
| id| value|
+---+----------+
| 1| johnny |
| 1| maria |
| 2| awesome|
| 2| George|
+---+----------+
Worked for me.
sample_input = [
{"id":1,"value":["johnny", "maria"]},
{"id":2,"value":["awesome", "George"]}
]
df = spark.createDataFrame(sample_input)
df.select(col("id"), explode(col("value")).alias("value")).show(n=10)
+---+-------+
| id| value|
+---+-------+
| 1| johnny|
| 1| maria|
| 2|awesome|
| 2| George|
+---+-------+
I have a big data set with multiple columns in it. Data Frame example is below, Here column 'first' holds names which I want to check whether is Proper case or not ? like aamir should be Aamir and Aamir malik should be Aamir Malik.
I want something like below.
I used Pyspark and below codes where I am getting the right answer but I want to detect it first and then make changes.
Here I have add a new column 'correct' and performed function.
name_check_1 = name_check.withColumn("correct", initcap(col("first")))
Then compare columns correct and first so it gives me not proper case name.
name_check_2 = name_check_1.filter('correct != first')
I need a way to get not proper case first and then correction.
My solution below :
Logic : Slice the string for first alphabet check it with correct string if equal it is valid else invalid. Make uppercase the first alphabet of firstname and lastname and rest to lower case and concatenate. Select only relevant columns.
from pyspark.sql.functions import *
from pyspark.sql.types import *
values = [
(1,"aamir"),
(2,"Aamir"),
(3,"atif"),
(4,"Atif"),
(5,"tahir"),
(6,"sameer"),
(7,"ifzaan"),
(8,"Ifzaan"),
(9,"Saquib"),
(10,"aamir malik"),
(11,"adcA")
]
rdd = sc.parallelize(values)
schema = StructType([
StructField("IDs", IntegerType(), True),
StructField("first", StringType(), True)
])
#create dataframe
data = spark.createDataFrame(rdd, schema)
#split first column into firstname and lastname
data = data.withColumn("firstname", split(data["first"]," ")[0]).withColumn("lastname", split(data["first"]," ")[1])
data = data \
.withColumn("flag", when((trim(substring(data["firstname"],0,1)) == upper(trim(substring(data["firstname"],0,1)))) |
(trim(substring(data["lastname"],0,1)) == upper(trim(substring(data["lastname"],0,1)))), lit("valid")).otherwise(lit("invalid"))) \
.withColumn("correct" , concat(concat(upper(trim(substring(data["firstname"],0,1))), trim(lower(substring(data["firstname"],2,1000)))),lit(" "),
when(data["lastname"].isNull(),lit("")) \
.otherwise(concat(upper(trim(substring(data["lastname"],0,1))),trim(lower(substring(data["lastname"],2,1000))))))) \
.select("IDs","first","flag","correct")
data.show()
#Result
+---+-----------+-------+-----------+
|IDs| first| flag| correct|
+---+-----------+-------+-----------+
| 1| aamir|invalid| Aamir |
| 2| Aamir| valid| Aamir |
| 3| atif|invalid| Atif |
| 4| Atif| valid| Atif |
| 5| tahir|invalid| Tahir |
| 6| sameer|invalid| Sameer |
| 7| ifzaan|invalid| Ifzaan |
| 8| Ifzaan| valid| Ifzaan |
| 9| Saquib| valid| Saquib |
| 10|aamir malik|invalid|Aamir Malik|
| 11| adcA|invalid| Adca |
+---+-----------+-------+-----------+
You know how to use initcap, so just create new column correct and compare it to the column first to check if it's already valid or not:
df.withColumn("correct", initcap(lower(col("first")))) \
.withColumn("flag", when(col("correct") != col("first"), lit("invalid")).otherwise("valid")) \
.show()
Gives:
+---+-----------+-----------+-------+
| id| first| correct| flag|
+---+-----------+-----------+-------+
| 1| aamir| Aamir|invalid|
| 2| Aamir| Aamir| valid|
| 3| atif| Atif|invalid|
| 4| Atif| Atif| valid|
| 5| tahir| Tahir|invalid|
| 6| sameer| Sameer|invalid|
| 7| ifzaan| Ifzaan|invalid|
| 8|Ifzaan Abcd|Ifzaan Abcd| valid|
| 9|Saquib abcd|Saquib Abcd|invalid|
+---+-----------+-----------+-------+
Given CSV file, I converted to Dataframe using something code like the following.
raw_df = spark.read.csv(input_data, header=True)
That creates dataframe looks something like this:
| Name |
========
| 23 |
| hi2 |
| me3 |
| do |
I want to convert this column to only contain numbers. The final result should be like where hi and me are removed:
| Name |
========
| 23 |
| 2 |
| 3 |
| do |
I want to sanitize the values and make sure it only contains number. But I'm not sure if it's possible in Spark.
Yes, It's possible. You can use regex_replace from function.
Please check this:
import pyspark.sql.functions as f
df = spark.sparkContext.parallelize([('12',), ('hi2',), ('me3',)]).toDF(["name"])
df.show()
+----+
|name|
+----+
| 12|
| hi2|
| me3|
+----+
final_df = df.withColumn('sanitize', f.regexp_replace('name', '[a-zA-Z]', ''))
final_df.show()
+----+--------+
|name|sanitize|
+----+--------+
| 12| 12|
| hi2| 2|
| me3| 3|
+----+--------+
final_df.withColumn('len', f.length('sanitize')).show()
+----+--------+---+
|name|sanitize|len|
+----+--------+---+
| 12| 12| 2|
| hi2| 2| 1|
| me3| 3| 1|
+----+--------+---+
You can adjust regex.
Otherway doing the same. It's just an another way but better use spark inbuilt functions if available. as shown above also.
from pyspark.sql.functions import udf
import re
user_func = udf (lambda x: re.findall("\d+", x)[0])
newdf = df.withColumn('new_column',user_func(df.Name))
>>> newdf.show()
+----+----------+
|Name|new_column|
+----+----------+
| 23| 23|
| hi2| 2|
| me3| 3|
+----+----------+