Multiple filters per column in Spark - apache-spark

This might sound like a stupid question but any help would be appreciated. I am trying to apply a filter on my RDD based on a date column.
val tran1 = sc
  .textFile("TranData.Dat")
  .map(_.split("\t"))
  .map(p => postran(
    p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8),
    p(9).toDouble, p(10).toDouble, p(11).toDouble))
I was able to apply a single date filter like below.
val tran = tran1.filter(x => x.PeriodDate == "2015-03-21 00:00:00.000")
How do I add more dates to this filter? Is there a way I can read the comma-separated date values into a variable and just pass that variable inside filter()?
Thanks

The following SQL:
select * from Table where age in (25, 35, 45) and country in ('Ireland', 'Italy')
can be written with the following Scala:
val allowedAges: Seq[Int] = Seq(25, 35, 45)
val allowedCountries: Seq[String] = Seq("Ireland", "Italy")
val result = table.filter(x => allowedAges.contains(x.age) && allowedCountries.contains(x.country))
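Applied back to the original date filter, a minimal sketch that reads the comma-separated dates from a variable (it assumes PeriodDate is the String field used above; the second date is only an example):

// comma-separated dates supplied in a variable
val dateList = "2015-03-21 00:00:00.000,2015-04-21 00:00:00.000"
val allowedDates: Seq[String] = dateList.split(",").map(_.trim).toSeq
val tran = tran1.filter(x => allowedDates.contains(x.PeriodDate))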

Related

Spark SQL - Get Column Names of a Hive Table in a String

I'm trying to get the column names of a Hive table in a comma separated String. This is what I'm doing
val colNameDF = spark.sql("show columns in hive_table")
val colNameStr = colNameDF.select("col_name").collect.mkString(", ")
And the output I get is
res0: String = [col_1], [col_2], [col_3]
But what I want is col_1, col_2, col_3. I can remove [ and ] from the String, but I'm curious as to whether we can get the column names without the brackets in the first place.
Edit: The column names in the Hive table don't contain [ ]
Instead of show columns, try the approach below; it is faster.
val colNameDF = spark.sql("select * from hive_table").limit(0)
Or
val colNameDF = spark.table("hive_table").limit(0)
val colNameStr = colNameDF.columns.mkString(", ")
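As a side note, .columns only consults the schema and nothing is executed until an action runs, so the limit(0) is not strictly required; the names can also be taken straight from the schema. A small sketch (the variable name here is mine):

// read the column names from the table's schema only
val colNamesFromSchema = spark.table("hive_table").schema.fieldNames.mkString(", ")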
collect returns an Array[Row], and a Row's String representation wraps its values in brackets, so you need to extract the underlying value explicitly:
val colNameDF = spark.sql("show columns in hive_table")
val colNameStr = colNameDF.select("col_name").collect.map(r => r.getString(0)).mkString(", ")
Building on @Srinivas's answer above, here is the equivalent Python code. It is very fast:
colNameStr = ",".join(spark.table(hive_table).limit(0).columns)

Fetching columns dynamically from a DataFrame when the column names come in a variable

I am not able to fetch values for the given dynamic columns. Any help?
var dynamicColumns = "col(\"one\"),col(\"two\"),col(\"three\")"
dataFrame.select(dynamicColumns)
Just use names alone:
val dynamicColumns = Seq("one", "two", "three")
dataFrame.select(dynamicColumns map col: _*)
and if you don't have control over the format, use a regexp to extract the names first:
val dynamicColumns = "col(\"one\"),col(\"two\"),col(\"three\")"
val p = """(?<=col\(").+?(?="\))""".r
dataFrame.select(p.findAllIn(dynamicColumns).map(col).toSeq: _*)
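For reference, a quick REPL-style check of what the regexp extracts from the example string, using the p defined above (the result is shown as a comment):

p.findAllIn("""col("one"),col("two"),col("three")""").toList
// List(one, two, three)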

Spark: subset a few columns and remove null rows

I am running Spark 2.1 on Windows 10. I have fetched data from MySQL into Spark using JDBC, and the table looks like this:
x      y      z
---------------------
1      a      d1
Null   v      ed
5      Null   Null
7      s      Null
Null   bd     Null
I want to create a new Spark Dataset with only the x and y columns from the above table, and I want to keep only those rows which do not have null in either of those 2 columns. My resultant table should look like this:
x   y
---------
1   a
7   s
The following is the code:
val load_DF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://100.150.200.250:3306")
  .option("dbtable", "schema.table_name")
  .option("user", "uname1")
  .option("password", "Pass1")
  .load()
val filter_DF = load_DF.select($"x".isNotNull,$"y".isNotNull).rdd
// lets print first 5 values of filter_DF
filter_DF.take(5)
res0: Array[org.apache.spark.sql.Row] = Array([true,true], [false,true], [true,false], [true,true], [false,true])
As shown, the above result doesn't give me actual values but it returns Boolean values (true when value is not Null and false when value is Null)
Try this:
val load_DF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://100.150.200.250:3306")
  .option("dbtable", "schema.table_name")
  .option("user", "uname1")
  .option("password", "Pass1")
  .load()
Now;
load_DF.select($"x", $"y").filter("x is not null").filter("y is not null")
Spark provides DataFrameNaFunctions for exactly this purpose: dropping null values, filling defaults, and so on.
In your example above you just need to call the following on the Dataset that you load:
val noNullValues = load_DF.na.drop("any", Seq("x", "y"))
This will drop records where a null occurs in either field x or y; a null in z alone does not cause a drop. You can read up on DataFrameNaFunctions for further options to fill in data or translate values if required.
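For the fill/translate options just mentioned, a minimal sketch (the default values here are hypothetical):

// replace nulls instead of dropping the rows: 0 for numeric x, "unknown" for string y
val filled = load_DF.na.fill(Map("x" -> 0, "y" -> "unknown"))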
Apply "any" in na.drop:
val result = df.select("x", "y")
  .na.drop("any", Seq("x", "y"))
You are simply applying a function (in this case isNotNull) to the values when you do a select; instead, you need to replace the select with a filter.
val filter_DF = load_DF.filter($"x".isNotNull && $"y".isNotNull)
or if you prefer:
val filter_DF = load_DF.filter($"x".isNotNull).filter($"y".isNotNull)
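Since the goal was also to keep only the x and y columns, the filter can be combined with a select; a small sketch reusing the same $ syntax as above:

val filter_DF = load_DF
  .filter($"x".isNotNull && $"y".isNotNull)
  .select($"x", $"y")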

Date add/subtract in cassandra/spark query

I have a scenario where I need to join multiple tables and identify if the date + another integer column is greater than another date column.
Select case when (manufacturedate + LeadTime < DueDate) then numericvalue((DueDate - manufacturedate) + 1) else PartSource.EffLeadTime end
Is there a way to handle it in spark sql?
Thanks,
Ash
I tried with sqlContext; there is a date_add('date', integer). date_add() is Hive functionality, and it works for the Cassandra context too.
cc.sql("select date_add(current_date(),1) from table").show
Thanks
Aravinth
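For the subtract half of the question, Hive/Spark SQL also provides date_sub, which can be called the same way; a small sketch against the same context:

cc.sql("select date_sub(current_date(), 1) from table").show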
Assuming you have a DataFrame with your data, that you are using Scala, and that the "another integer" represents a number of days, one way to do it is the following:
import org.apache.spark.sql.functions._
val numericvalue = 1
val column = when(
  datediff(col("DueDate"), col("manufacturedate")) > col("LeadTime"), lit(numericvalue)
).otherwise(col("PartSource.EffLeadTime"))
val result = df.withColumn("newVal", column)
The desired value will be in a new column called "newVal".
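An alternative sketch that materializes the computed due date first; it assumes LeadTime holds an integer number of days, uses a hypothetical column name computedDue, and calls date_add through expr because the SQL form of the function accepts a column for the day count:

import org.apache.spark.sql.functions.{col, expr, lit, when}

// manufacturedate + LeadTime as an explicit date column
val withComputedDue = df.withColumn("computedDue", expr("date_add(manufacturedate, LeadTime)"))

val result2 = withComputedDue.withColumn(
  "newVal",
  when(col("computedDue") < col("DueDate"), lit(numericvalue))
    .otherwise(col("PartSource.EffLeadTime")))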

How to dynamically create the list of the columns to include in select?

I tried to "generate" a spark query in this way
def stdizedOperationmode(sqLContext: SQLContext, withrul: DataFrame): DataFrame = {
  // see http://spark.apache.org/docs/latest/sql-programming-guide.html
  import sqLContext.implicits._
  val AZ: Column = lit(0.00000001)

  def opMode(id: Int): Column = {
    (column("s" + id) - coalesce(column("a" + id) / column("sd" + id), column("a" + id) / lit(AZ))).as("std" + id)
  }

  // add the 21 std<i> columns based on s<i> - (a<id>/sd<id>)
  val columns: IndexedSeq[Column] = 1 to 21 map (id => opMode(id))
  val withStd = withrul.select(columns: _*)
  withStd
}
Question: how do I add "all other columns" (*)? Idea: something like withrul.select('* :+ columns:_*)
You can try the following:
// add the 21 std<i> columns based on s<i> - (a<id>/sd<id>)
val columns: IndexedSeq[Column] = 1 to 21 map (id => opMode(id))
val selectAll: Array[Column] = (for {
  i <- withrul.columns
} yield withrul(i)) union columns.toSeq

val withStd = withrul.select(selectAll: _*)
The selectAll value yields all the columns of withrul with the new columns appended to them, as a single sequence of Column.
You are not obliged to create a value just to return it afterward; you can replace the last 2 lines with:
withrul.select(selectAll: _*)
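A shorter variant of the same idea is possible because col("*") expands to all existing columns inside a select; a sketch using the columns value defined above:

import org.apache.spark.sql.functions.col

// keep every existing column and append the 21 new std<i> columns
withrul.select(col("*") +: columns: _*)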
