How to access an array using foreach in spark? - apache-spark

I have data like below:
tab1,c1|c2|c3
tab2,d1|d2|d3|d4|d5
tab3,e1|e2|e3|e4
I need to convert it to the following in Spark:
select c1,c2,c3 from tab1;
select d1,d2,d3,d4,d5 from tab2;
select e1,e2,e3,e4 from tab3;
So far I am able to get this:
d.foreach(f=>{println("select"+" "+f+" from"+";")})
select tab3,e1,e2,e3,e4 from;
select tab1,c1,c2,c3 from;
select tab2,d1,d2,d3,d4,d5 from;
Can anyone suggest?

I'm not seeing where spark fits in your question. What does the variable 'd' represent?
Here is my guess at something that may be helpful.
from pyspark.sql.types import *
from pyspark.sql.functions import *

mySchema = StructType([
    StructField("table_name", StringType()),
    StructField("column_name", ArrayType(StringType()))
])

df = spark.createDataFrame([
    ("tab1", ["c1", "c2", "c3"]),
    ("tab2", ["d1", "d2", "d3", "d4", "d5"]),
    ("tab3", ["e1", "e2", "e3", "e4"])
], schema=mySchema)
df.selectExpr('concat("select ", concat_ws(",", column_name), " from ", table_name, ";") as select_string').show(3, False)
Output:
+--------------------------------+
|select_string |
+--------------------------------+
|select c1,c2,c3 from tab1; |
|select d1,d2,d3,d4,d5 from tab2;|
|select e1,e2,e3,e4 from tab3; |
+--------------------------------+

You can also use a map operation on an RDD.
Assuming you have an RDD of Strings like:
val rdd = spark.sparkContext.parallelize(Seq("tab1,c1|c2|c3", "tab2,d1|d2|d3|d4|d5", "tab3,e1|e2|e3|e4"))
with this operation:
val select = rdd.map(str => {
  val separated = str.split(",", -1)
  val table = separated(0)
  val cols = separated(1).split("\\|", -1).mkString(",")
  "select " + cols + " from " + table + ";"
})
you will get the expected result:
select.foreach(println(_))
select d1,d2,d3,d4,d5 from tab2;
select e1,e2,e3,e4 from tab3;
select c1,c2,c3 from tab1;
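
If the raw lines are already in Spark, the same result can be built with the DataFrame API instead of dropping to the RDD API. A minimal sketch of that idea, assuming Spark 2.x with spark.implicits._ in scope (the column and variable names here are my own, not from the question):

import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical input: the raw lines from the question as a Dataset[String]
val lines = Seq("tab1,c1|c2|c3", "tab2,d1|d2|d3|d4|d5", "tab3,e1|e2|e3|e4").toDS()

val selects = lines
  .withColumn("table_name", split($"value", ",")(0))          // the part before the comma
  .withColumn("cols", split(split($"value", ",")(1), "\\|"))   // pipe-separated column names as an array
  .select(concat(lit("select "), concat_ws(",", $"cols"), lit(" from "), $"table_name", lit(";")).as("select_string"))

selects.show(false)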

Related

Spark aggregate rows with custom function

To make it simple, let's assume we have a dataframe containing the following data:
+----------+---------+----------+----------+
|firstName |lastName |Phone |Address |
+----------+---------+----------+----------+
|firstName1|lastName1|info1 |info2 |
|firstName1|lastName1|myInfo1 |dummyInfo2|
|firstName1|lastName1|dummyInfo1|myInfo2 |
+----------+---------+----------+----------+
How can I merge all rows, grouping by (firstName, lastName), and keep in the columns Phone and Address only the data starting with "my", to get the following:
+----------+---------+----------+----------+
|firstName |lastName |Phone |Address |
+----------+---------+----------+----------+
|firstName1|lastName1|myInfo1 |myInfo2 |
+----------+---------+----------+----------+
Should I maybe use the agg function with a custom UDAF? But how can I implement it?
Note: I'm using Spark 2.2 along with Scala 2.11.
You can use groupBy with the collect_set aggregation function and a udf function to pick the first string that starts with "my":
import org.apache.spark.sql.functions._
def myudf = udf((array: Seq[String]) => array.filter(_.startsWith("my")).head)
df.groupBy("firstName ", "lastName")
.agg(myudf(collect_set("Phone")).as("Phone"), myudf(collect_set("Address")).as("Address"))
.show(false)
which should give you
+----------+---------+-------+-------+
|firstName |lastName |Phone |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
I hope the answer is helpful
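
One caveat (my own note, not part of the answer above): .head will throw if no value in the collected set starts with "my". A slightly safer variant of the same udf returns null for such groups instead:

import org.apache.spark.sql.functions._

// Pick the first value starting with "my", or null when there is none
def myudfSafe = udf((array: Seq[String]) => array.find(_.startsWith("my")).orNull)

df.groupBy("firstName", "lastName")
  .agg(myudfSafe(collect_set("Phone")).as("Phone"), myudfSafe(collect_set("Address")).as("Address"))
  .show(false)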
If only two columns are involved, filtering and a join can be used instead of a UDF:
val df = List(
  ("firstName1", "lastName1", "info1", "info2"),
  ("firstName1", "lastName1", "myInfo1", "dummyInfo2"),
  ("firstName1", "lastName1", "dummyInfo1", "myInfo2")
).toDF("firstName", "lastName", "Phone", "Address")
val myPhonesDF = df.filter($"Phone".startsWith("my"))
val myAddressDF = df.filter($"Address".startsWith("my"))
val result = myPhonesDF.alias("Phones").join(myAddressDF.alias("Addresses"), Seq("firstName", "lastName"))
.select("firstName", "lastName", "Phones.Phone", "Addresses.Address")
result.show(false)
Output:
+----------+---------+-------+-------+
|firstName |lastName |Phone |Address|
+----------+---------+-------+-------+
|firstName1|lastName1|myInfo1|myInfo2|
+----------+---------+-------+-------+
For many columns, when only one row is expected per group, a construction like this can be used:
val columnsForSearch = List("Phone", "Address")
val minExpressions = columnsForSearch.map(c => min(when(col(c).startsWith("my"), col(c)).otherwise(null)).alias(c))
df.groupBy("firstName", "lastName").agg(minExpressions.head, minExpressions.tail: _*)
Output is the same.
UDF with two parameters example:
val twoParamFunc = (firstName: String, Phone: String) => firstName + ": " + Phone
val twoParamUDF = udf(twoParamFunc)
df.select(twoParamUDF($"firstName", $"Phone")).show(false)

How to pass dataframe in ISIN operator in spark dataframe

I want to pass a dataframe which has a set of values to a new query, but it fails.
1) Here I am selecting a particular column so that I can pass it to ISIN in the next query:
scala> val managerIdDf=finalEmployeesDf.filter($"manager_id"!==0).select($"manager_id").distinct
managerIdDf: org.apache.spark.sql.DataFrame = [manager_id: bigint]
2) My sample data:
scala> managerIdDf.show
+----------+
|manager_id|
+----------+
| 67832|
| 65646|
| 5646|
| 67858|
| 69062|
| 68319|
| 66928|
+----------+
3) When I execute the final query, it fails:
scala> finalEmployeesDf.filter($"emp_id".isin(managerIdDf)).select("*").show
java.lang.RuntimeException: Unsupported literal type class org.apache.spark.sql.DataFrame [manager_id: bigint]
I also tried converting to a List and a Seq, but that only generates an error. For example, when I convert to a Seq and re-run the query, it throws the error below:
scala> val seqDf=managerIdDf.collect.toSeq
seqDf: Seq[org.apache.spark.sql.Row] = WrappedArray([67832], [65646], [5646], [67858], [69062], [68319], [66928])
scala> finalEmployeesDf.filter($"emp_id".isin(seqDf)).select("*").show
java.lang.RuntimeException: Unsupported literal type class scala.collection.mutable.WrappedArray$ofRef WrappedArray([67832], [65646], [5646], [67858], [69062], [68319], [66928])
I also referred to this post, but in vain. I am trying this type of query as a way of solving subqueries in Spark DataFrames. Can anyone help?
An alternative approach uses DataFrames, temp views, and the free-format SQL of Spark SQL. Don't worry about the logic; it's just a convention and an alternative to your initial approach that should equally suffice:
val df2 = Seq(
  ("Peter", "Doe", Seq(("New York", "A000000"), ("Warsaw", null))),
  ("Bob", "Smith", Seq(("Berlin", null))),
  ("John", "Jones", Seq(("Paris", null)))
).toDF("firstname", "lastname", "cities")
df2.createOrReplaceTempView("persons")
val res = spark.sql("""select *
from persons
where firstname
not in (select firstname
from persons
where lastname <> 'Doe')""")
res.show
or
val list = List("Bob", "Daisy", "Peter")
val res2 = spark.sql("select firstname, lastname from persons")
.filter($"firstname".isin(list:_*))
res2.show
or
val query = s"select * from persons where firstname in (${list.map ( x => "'" + x + "'").mkString(",") })"
val res3 = spark.sql(query)
res3.show
or
df2.filter($"firstname".isin(list: _*)).show
or
val list2 = df2.select($"firstname").rdd.map(r => r(0).asInstanceOf[String]).collect.toList
df2.filter($"firstname".isin(list2: _*)).show
In your case specifically:
val seqDf = managerIdDf.rdd.map(r => r(0).asInstanceOf[Long]).collect.toList
finalEmployeesDf.filter($"emp_id".isin(seqDf: _*)).select("*").show
Yes, you cannot pass a DataFrame to isin; isin requires a list of values that it will filter against.
If you want an example, you can check my answer here
As per question update, you can make the following change,
.isin(seqDf)
to
.isin(seqDf: _*)
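
If you would rather not collect the ids to the driver at all, a left semi join achieves the same filtering while keeping everything distributed. A minimal sketch, assuming the finalEmployeesDf and managerIdDf from the question:

// Keep only the employees whose emp_id appears in managerIdDf
val managers = finalEmployeesDf.join(
  managerIdDf,
  finalEmployeesDf("emp_id") === managerIdDf("manager_id"),
  "leftsemi"
)
managers.show()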

Replace multiple blanks with a single blank in Spark SQL

I have a DataFrame created with HiveContext where one of the columns holds records like:
text1    text2
We want the spaces between the two texts to be replaced with a single space, to get the final output as:
text1 text2
How can we achieve that in Spark SQL? Note we are using HiveContext, registering a temp table, and writing SQL queries over it.
Even better: I have now been enlightened by a real expert, and it's simpler in fact:
import org.apache.spark.sql.functions._
// val myUDf = udf((s:String) => Array(s.trim.replaceAll(" +", " ")))
val myUDf = udf((s:String) => s.trim.replaceAll("\\s+", " ")) // <-- no Array(...)
// Then there is no need to play with columns excessively:
val data = List("i like cheese", " the dog runs ", "text111111 text2222222")
val df = data.toDF("val")
df.show()
val new_df = df.withColumn("new_val", myUDf(col("val")))
new_df.show
The original version, which returns an Array from the UDF and then extracts the first element:
import org.apache.spark.sql.functions._
val myUDf = udf((s:String) => Array(s.trim.replaceAll(" +", " ")))
//error: object java.lang.String is not a value --> use Array
val data = List("i like cheese", " the dog runs ", "text111111 text2222222")
val df = data.toDF("val")
df.show()
val new_df = df
  .withColumn("udfResult", myUDf(col("val")))
  .withColumn("new_val", col("udfResult")(0))
  .drop("udfResult")
new_df.show
Output on Databricks
+--------------------+
| val|
+--------------------+
| i like cheese|
| the dog runs |
|text111111 text...|
+--------------------+
+--------------------+--------------------+
| val| new_val|
+--------------------+--------------------+
| i like cheese| i like cheese|
| the dog runs | the dog runs|
|text111111 text...|text111111 text22...|
+--------------------+--------------------+
Just do it in spark.sql:
regexp_replace(COLUMN, ' +', ' ')
https://spark.apache.org/docs/latest/api/sql/index.html#regexp_replace
check it:
spark.sql("""
select regexp_replace(col1, ' +', ' ') as col2
from (
select 'text1  text2   text3' as col1
)
""").show(20,False)
output
+-----------------+
|col2 |
+-----------------+
|text1 text2 text3|
+-----------------+
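
The same function is available through the DataFrame API, so no UDF is needed there either. A small sketch, assuming a DataFrame df with a string column named val as in the earlier examples:

import org.apache.spark.sql.functions._

// Collapse runs of spaces into a single space and trim the ends
val cleaned = df.withColumn("new_val", trim(regexp_replace(col("val"), " +", " ")))
cleaned.show(false)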

Filling null values with the mean of the column in HiveQL and Spark

I am using HiveQL in Spark and would like to fill null values with the mean of the column.
I am using the code below:
StringBuilder query = new StringBuilder("select `ts0` as ts ");
String[] cols = dataFrame.columns();
for (String col : cols) {
    String trimmedCol = col.trim();
    query.append(", `" + col + "` as " + trimmedCol);
}
I think I should use a "case" expression when there is a null value. Can anyone guide me on how to do the above?
You could try the following:
scala> val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferSchema","true").load("na_test.csv")
scala> df.show()
scala> df.na.fill(10.0,Seq("age"))
scala> df.na.fill(10.0,Seq("age")).show
scala> df.na.replace("age", Map(35 -> 61, 24 -> 12)).show()
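
The lines above fill nulls with a fixed value (10.0). To fill with the actual mean of the column, one option (a sketch, assuming a numeric column named "age" as above) is to compute the average first and pass it to na.fill:

import org.apache.spark.sql.functions._

// Compute the mean of the column on the driver, then use it as the fill value for nulls
val meanAge = df.agg(avg("age")).first().getDouble(0)
val filled = df.na.fill(meanAge, Seq("age"))
filled.show()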

SPARK DataFrame: Remove MAX value in a group

My data is like:
id | val
----------------
a1 | 10
a1 | 20
a2 | 5
a2 | 7
a2 | 2
I am trying to delete the row that has MAX(val) in each group when I group on "id".
Result should be like:
id | val
----------------
a1 | 10
a2 | 5
a2 | 2
I am using SPARK DataFrame and SQLContext. I need some way like:
DataFrame df = sqlContext.sql("SELECT * FROM jsontable WHERE (id, val) NOT IN (SELECT id, MAX(val) FROM jsontable GROUP BY id)");
How can I do that?
You can do that using dataframe operations and Window functions. Assuming you have your data in the dataframe df1:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val maxOnWindow = max(col("val")).over(Window.partitionBy(col("id")))
val df2 = df1
  .withColumn("max", maxOnWindow)
  .where(col("val") < col("max"))
  .select("id", "val")
In Java, the equivalent would be something like:
import org.apache.spark.sql.expressions.Window;
import static org.apache.spark.sql.functions.*;
Column maxOnWindow = max(col("val")).over(Window.partitionBy("id"));
DataFrame df2 = df1
    .withColumn("max", maxOnWindow)
    .where(col("val").lt(col("max")))
    .select("id", "val");
Here's a nice article about window functions: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
Below is the Java implementation of Mario's scala code:
DataFrame df = sqlContext.read().json(input);
DataFrame dfMaxRaw = df.groupBy("id").max("val");
DataFrame dfMax = dfMaxRaw.select(
dfMaxRaw.col("id").as("max_id"), dfMaxRaw.col("max(val)").as("max_val")
);
DataFrame combineMaxWithData = df.join(dfMax, df.col("id")
.equalTo(dfMax.col("max_id")));
DataFrame finalResult = combineMaxWithData.filter(
combineMaxWithData.col("id").equalTo(combineMaxWithData.col("max_id"))
.and(combineMaxWithData.col("val").notEqual(combineMaxWithData.col("max_val")))
);
Here is how to do this using RDD and a more Scala-flavored approach:
// Let's first get the data in key-value pair format
val data = sc.makeRDD( Seq( ("a",20), ("a", 1), ("a",8), ("b",3), ("b",10), ("b",9) ) )
// Next let's find the max value from each group
val maxGroups = data.reduceByKey( Math.max(_,_) )
// We join the max in the group with the original data
val combineMaxWithData = maxGroups.join(data)
// Finally we filter out the values that agree with the max
val finalResults = combineMaxWithData.filter{ case (gid, (max,curVal)) => max != curVal }.map{ case (gid, (max,curVal)) => (gid,curVal) }
println( finalResults.collect.toList )
>List((a,1), (a,8), (b,3), (b,9))
