Spark: subset a few columns and remove null rows - apache-spark

I am running Spark 2.1 on Windows 10. I have fetched data from MySQL into Spark using JDBC, and the table looks like this:
x     y     z
-------------------
1     a     d1
Null  v     ed
5     Null  Null
7     s     Null
Null  bd    Null
I want to create a new Spark Dataset with only the x and y columns from the above table, and I want to keep only those rows which do not have null in either of those 2 columns. My resultant table should look like this:
x y
--------
1 a
7 s
The following is the code:
val load_DF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://100.150.200.250:3306")
  .option("dbtable", "schema.table_name")
  .option("user", "uname1")
  .option("password", "Pass1")
  .load()
val filter_DF = load_DF.select($"x".isNotNull,$"y".isNotNull).rdd
// lets print first 5 values of filter_DF
filter_DF.take(5)
res0: Array[org.apache.spark.sql.Row] = Array([true,true], [false,true], [true,false], [true,true], [false,true])
As shown, the above result doesn't give me the actual values; it returns Boolean values instead (true when the value is not null and false when it is null).

Try this:
val load_DF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://100.150.200.250:3306")
  .option("dbtable", "schema.table_name")
  .option("user", "uname1")
  .option("password", "Pass1")
  .load()
Now:
load_DF.select($"x", $"y").filter("x is not null").filter("y is not null")
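The two filters can also be combined into a single SQL-style condition; a minimal sketch using the same load_DF:
val filter_DF = load_DF.select($"x", $"y").filter("x is not null and y is not null")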

Spark provides DataFrameNaFunctions for this purpose of dropping null values, etc.
In your example above you just need to call the following on the Dataset that you load:
val noNullValues = load_DF.na.drop("any", Seq("x", "y"))
This will drop records where a null occurs in either field x or y (a null only in z is ignored). You can read up on DataFrameNaFunctions for further options to fill in data or translate values if required.
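As a quick illustration (not from the original answer), here is a sketch of a couple of the other DataFrameNaFunctions, reusing load_DF; the column names and replacement values are only examples:
val filled   = load_DF.na.fill(Map("x" -> 0))              // replace nulls in x with 0
val replaced = load_DF.na.replace("y", Map("bd" -> "b_d")) // translate specific values in y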

Apply "any" in na.drop:
df = df.select("x", "y")
  .na.drop("any", Seq("x", "y"))
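For reference, the first argument controls when a row is dropped; a minimal sketch of the difference, using the load_DF from the question:
// Drop a row if ANY of the listed columns is null (matches the desired output).
val dropAny = load_DF.na.drop("any", Seq("x", "y"))
// Drop a row only if ALL of the listed columns are null.
val dropAll = load_DF.na.drop("all", Seq("x", "y"))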

You are simply applying a function (in this case isNotNull) to the values when you do a select. Instead, you need to replace the select with a filter.
val filter_DF = load_DF.filter($"x".isNotNull && $"y".isNotNull)
or if you prefer:
val filter_DF = load_DF.filter($"x".isNotNull).filter($"y".isNotNull)
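Either form keeps only the rows where both columns are non-null; selecting x and y afterwards should reproduce the expected table from the question:
filter_DF.select($"x", $"y").show()
// +---+---+
// |  x|  y|
// +---+---+
// |  1|  a|
// |  7|  s|
// +---+---+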

Related

Spark SQL - Get Column Names of a Hive Table in a String

I'm trying to get the column names of a Hive table as a comma-separated String. This is what I'm doing:
val colNameDF = spark.sql("show columns in hive_table")
val colNameStr = colNameDF.select("col_name").collect.mkString(", ")
And the output I get is
res0: String = [col_1], [col_2], [col_3]
But what I want is col_1, col_2, col_3. I can remove [ and ] from the String, but I'm curious as to whether we can get the column names without the brackets in the first place.
Edit: The column names in the Hive table don't contain [ ]
Instead of show columns, try the approach below, as it is faster:
val colNameDF = spark.sql("select * from hive_table").limit(0)
Or
val colNameDF = spark.table("hive_table").limit(0)
val colNameStr = colNameDF.columns.mkString(", ")
collect returns an Array of Row, and each Row is internally represented as an array of values, so you need to pull the string out explicitly:
val colNameDF = spark.sql("show columns in hive_table")
val colNameStr = colNameDF.select("col_name").collect.map(r=>r.getString(0)).mkString(", ")
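Alternatively, assuming spark.implicits._ is in scope, the single column can be converted to a Dataset[String] before collecting, which avoids the Row handling:
import spark.implicits._
val colNameStr = spark.sql("show columns in hive_table")
  .select("col_name")
  .as[String]
  .collect
  .mkString(", ")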
Building on @Srinivas' answer above, here is the equivalent Python code. It is very fast:
colNameStr = ",".join(spark.table(hive_table).limit(0).columns)

opposite of spark dataframe `withColumn` method?

I'd like to be able to chain a transformation on my DataFrame that drops a column, rather than assigning the DataFrame to a variable (i.e. df.drop()). If I wanted to add a column, I could simply call df.withColumn(). What is the way to drop a column in an in-line chain of transformations?
For the entire example use this as baseline:
val testVariable = 10
var finalDF = spark.sql("select 'test' as test_column")
val iDF = spark.sql("select 'John Smith' as Name, cast('10' as integer) as Age, 'Illinois' as State")
val iDF2 = spark.sql("select 'Jane Doe' as Name, cast('40' as integer) as Age, 'Iowa' as State")
val iDF3 = spark.sql("select 'Blobby' as Name, cast('150' as integer) as Age, 'Non-US' as State")
val nameDF = iDF.unionAll(iDF2).unionAll(iDF3)
1. Conditional drop
If you want to drop a column only for certain known outputs, you can build a conditional check to decide whether to drop it or not. In this case, if the test variable is 5 or more, the Name column is dropped; otherwise a new column is added.
import org.apache.spark.sql.functions.lit

finalDF = if (testVariable >= 5) {
  nameDF.drop("Name")
} else {
  nameDF.withColumn("Cooler_Name", lit("Cool_Name"))
}
finalDF.printSchema
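With testVariable = 10 as defined above, the condition testVariable >= 5 holds, so the Name column is dropped and printSchema should show only Age and State.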
2. Programmatically build the select statement. selectExpr takes independent strings and builds them into expressions that Spark can evaluate. In the case below we know we have a test for dropping, but we do not know in advance which columns might be dropped. If a column's test value does not equal 1, we do not include that column in our command array. When we run the command array through selectExpr on the table, those columns are dropped.
val columnNames = nameDF.columns
val arrayTestOutput = Array(1, 0, 1)
var iteratorArray = 1
var commandArray = Array.empty[String]
while (iteratorArray <= columnNames.length) {
  if (arrayTestOutput(iteratorArray - 1) == 1) {
    commandArray = commandArray :+ columnNames(iteratorArray - 1)
  }
  iteratorArray = iteratorArray + 1
}
finalDF = nameDF.selectExpr(commandArray: _*)
finalDF.printSchema
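For what it's worth, drop itself returns a new DataFrame, so it can be chained inline just like withColumn; a minimal sketch using the nameDF defined above:
val chained = nameDF
  .withColumn("Cooler_Name", lit("Cool_Name"))
  .drop("State")
chained.printSchema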

How to use isin function with values from text file?

I'd like to filter a dataframe using an external file.
This is how I use the filter now:
val Insert = Append_Ot.filter(
col("Name2").equalTo("brazil") ||
col("Name2").equalTo("france") ||
col("Name2").equalTo("algeria") ||
col("Name2").equalTo("tunisia") ||
col("Name2").equalTo("egypte"))
Instead of using hardcoded string literals, I'd like to create an external file with the values to filter by.
So I create this file:
val filter_numfile = sc.textFile("/user/zh/worskspace/filter_nmb.txt")
.map(_.split(" ")(1))
.collect
This gives me:
filter_numfile: Array[String] = Array(brazil, france, algeria, tunisia, egypte)
And then I use the isin function on the Name2 column.
val Insert = Append_Ot.where($"Name2".isin(filter_numfile: _*))
But this gives me an empty dataframe. Why?
I am just adding some information to philantrovert's answer.
His answer is correct, but the values may differ in letter case, so you will have to check for case mismatches as well.
tl;dr Make sure that the letters use consistent case, i.e. they are all upper case or all lower case. Simply use the upper or lower standard functions.
Let's say you have an input file such as:
1 Algeria
2 tunisia
3 brazil
4 Egypt
You read the text file and change all the countries to lowercase:
val countries = sc.textFile("path to input file")
  .map(_.split(" ")(1).trim)
  .collect.toSeq
val array = Array(countries.map(_.toLowerCase) : _*)
Then you have your dataframe
val Append_Ot = sc.parallelize(Seq(("brazil"),("tunisia"),("algeria"),("name"))).toDF("Name2")
where you apply the following condition:
import org.apache.spark.sql.functions._
val Insert = Append_Ot.where(lower($"Name2").isin(array : _* ))
You should have the following output:
+-------+
|Name2 |
+-------+
|brazil |
|tunisia|
|algeria|
+-------+
The empty dataframe might also be due to a spelling mismatch.
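If whitespace in the input file is also a concern, you can normalize both sides before comparing; a small sketch along the same lines, reusing array and Append_Ot from above (the trim on the read side then becomes redundant but harmless):
import org.apache.spark.sql.functions.{lower, trim}
val InsertTrimmed = Append_Ot.where(lower(trim($"Name2")).isin(array : _*))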

Filter Spark RDD based on result from filtering a different RDD

conf = SparkConf().setAppName("my_app")
with SparkContext(conf=conf) as sc:
    sqlContext = SQLContext(sc)
    df = sqlContext.read.parquet(*s3keys)

    # this gives me distinct values as list
    rdd = df.filter(
        (1442170800000 <= df.timestamp) &
        (df.timestamp <= 1442185200000) &
        (df.lat > 40.7480) & (df.lat < 40.7513) &
        (df.lon > -73.8492) & (df.lon < -73.8438)
    ).map(lambda p: p.userid).distinct()

    # how do I apply the above list to filter another rdd?
    df2 = sqlContext.read.parquet(*s3keys_part2)

    # example:
    rdd = df2.filter(df2.col1 in (rdd values from above))
As mentioned by Matthew Graves, what you need here is a join. It means more or less something like this:
pred = ((1442170800000 <= df.timestamp) &
        (df.timestamp <= 1442185200000) &
        (df.lat > 40.7480) &
        (df.lat < 40.7513) &
        (df.lon > -73.8492) &
        (df.lon < -73.8438))

users = df.filter(pred).select("userid").distinct()
users.join(df2, users.userid == df2.col1)
This is Scala code, instead of Python, but hopefully it can still serve as an example.
val x = 1 to 9
val df2 = sc.parallelize(x.map(a => (a,a*a))).toDF()
val df3 = sc.parallelize(x.map(a => (a,a*a*a))).toDF()
This gives us two dataframes, each with columns named _1 and _2, which are the first nine natural numbers and their squares/cubes.
val fil = df2.filter("_1 < 5") // Nine is too many, let's go to four.
val filJoin = fil.join(df3, fil("_1") === df3("_1"))
filJoin.collect
This gets us:
Array[org.apache.spark.sql.Row] = Array([1,1,1,1], [2,4,2,8], [3,9,3,27], [4,16,4,64])
To apply this to your problem, I would start with something like the following:
rdd2 = rdd.join(df2, rdd.userid == df2.userid, 'inner')
But notice that we need to tell it which columns to join on, which might be something other than userid for df2. I'd also recommend that, instead of map(lambda p: p.userid), you use .select('userid').distinct() so that the result is still a dataframe.
You can find out more about join in the Spark documentation.
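If you only need the rows of df2 whose col1 matches a user id, rather than the combined columns of both frames, a left semi join does exactly that. Here is a Scala sketch, assuming DataFrames named users and df2 shaped as above:
// Keep only the df2 rows whose col1 appears in users.userid; no columns from users are returned.
val filtered = df2.join(users, df2("col1") === users("userid"), "leftsemi")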

Multiple filters per column in Spark

This might sound like a stupid question, but any help would be appreciated. I am trying to apply a filter on my RDD based on a date column.
val tran1 = sc
.textFile("TranData.Dat")
.map(_.split("\t"))
.map(p => postran(
p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7), p(8),
p(9).toDouble, p(10).toDouble,p(11).toDouble))
I was able to apply a single date filter like below.
val tran = tran1.filter(x => x.PeriodDate == "2015-03-21 00:00:00.000")
How do I add more dates to this filter? Is there a way I can read the comma-separated date values into a variable and just pass that variable inside filter()?
Thanks
The following SQL:
select * from Table where age in (25, 35, 45) and country in ('Ireland', 'Italy')
can be written with the following Scala:
val allowedAges: Seq[Int] = Seq(25, 35, 45)
val allowedCountries: Seq[String] = Seq("Ireland", "Italy")
val result = table.filter(x => allowedAges.contains(x.age) && allowedCountries.contains(x.country))
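Applied to the original date question, a sketch along the same lines, assuming PeriodDate is a String field of postran and the dates arrive as one comma-separated string:
val dateCsv = "2015-03-21 00:00:00.000,2015-03-28 00:00:00.000" // illustrative values
val allowedDates: Seq[String] = dateCsv.split(",").map(_.trim).toSeq
val tran = tran1.filter(x => allowedDates.contains(x.PeriodDate))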
