How to concat two columns into map in hive - apache-spark

I have a hive table
row1 (Id_1, locale_1, value_1)
row2 (Id_1, locale_2, value_2)
row3 (Id_1, locale_3, value_3)
row4 (Id_2, locale_1, value_1)
row5 (Id_2, locale_3, value_3)
How to make it like a map with primary key id?
row (Id, Map<locale, value>)
like
row1 (Id_1, {locale_1 -> value_1, locale_2 -> value_2, locale_3 -> value_3})
row2 (Id_2, {locale_1 -> value_1, locale_3 -> value_3})
Thank you!

If you are not worried about the order of key value pairs in your map coulmns, here is a quick way using map function in Spark DataFrame API
val data = Seq(
("Id_1", "locale_1", "value_1"),
("Id_1", "locale_2", "value_2"),
("Id_1", "locale_3", "value_3"),
("Id_2", "locale_1", "value_1"),
("Id_2", "locale_3", "value_3")
)
import spark.implicits._
val df = data.toDF("id", "locale", "value")
import org.apache.spark.sql.functions._
val df2 = df.withColumn("myMap", map($"locale", $"value")).drop("locale").drop("value")
val df3 = df2
.withColumn("myMap", map_entries(col("myMap")))
.groupBy("id")
.agg(map_from_entries(flatten(collect_set("myMap"))).as("myMap"))
Output of df3.show(false) will be

Related

Why does Spark SQL require double percent symbols for like conditions?

When I am using a like condition in Spark SQL, it seems that it requires the use of 2 percent symbols %%.
However, I could not find any documentation on this in the Spark SQL docs. I am curious as to why my set-up might be causing this requirement.
https://spark.apache.org/docs/3.3.0/sql-ref-syntax-qry-select-like.html
Example data
product_table
id
product_type
region
location
measurement
43635665
ORANGE - Blood Orange
EU
FRA
30.5
78960788
APPLE GrannySmith
NA
USA
16.0
12312343
APPLE [Organic Washington]
NA
CAN
7.1
67867634
ORANGE, NavelOrange
NA
MEX
88.4
import pyspark
from pyspark.sql import functions as F
APP_NAME = "Product: Fruit Template"
SPARK_CONF = [
("spark.dynamicAllocation.maxExecutors", "5"),
("spark.executor.memory", "10g"),
("spark.executor.cores", "4"),
("spark.executor.memoryOverhead", "2000"),
]
spark_conf = pyspark.SparkConf()
spark_conf.setAppName(APP_NAME)
spark_conf.setAll(SPARK_CONF)
sc = pyspark.SparkContext(conf=spark_conf)
spark = pyspark.sql.SparkSession(sc)
def sql(query):
return spark.sql(query)
df = sql("""
SELECT *
FROM product_table
""")
this returns data
df.filter(F.col("product_type").like("ORANGE%%")).show()
whereas this returns an empty dataframe
df.filter(F.col("product_type").like("ORANGE%")).show()
Maybe worth noting, the same issue happens when the LIKE condition is used in the SQL string
this returns data
df_new = sql("""
SELECT *
FROM product_table
WHERE product_type like 'ORANGE%%'
""")
df_new.show()
whereas this returns an empty dataframe
df_new = sql("""
SELECT *
FROM product_table
WHERE product_type like 'ORANGE%'
""")
df_new.show()
i am using PySpark Version: 2.3.2.
conf = (SparkConf()
.set("spark.executor.instances", "24")
.set("spark.executor.cores", "5")
.set("spark.executor.memory", "33g")
.set("spark.driver.memory", "55g")
.set("spark.driver.maxResultSize", "10g")
.set("spark.sql.catalogImplementation", "hive")
.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
)
spark = (
SparkSession.builder.appName("default")
.enableHiveSupport()
.config(conf=conf)
.getOrCreate()
)
df = spark.createDataFrame(
[('43635665','ORANGE - Blood Orange'),
('78960788','APPLE GrannySmith'),
('12312343','APPLE [Organic Washington'),
('67867634','ORANGE, NavelOrange')],
['id', 'product_type'])
df.createOrReplaceTempView("product_table")
def sql(query):
print(query)
return spark.sql(query)
df2 = sql("""
SELECT *
FROM product_table
""")
df2.filter(F.col("product_type").like("ORANGE%")).show(truncate=False)

Passing an entire row as an argument to spark udf through spark dataframe - throws AnalysisException

I am trying to pass an entire row to the spark udf along with few other arguments, I am not using spark sql rather I am using dataframe withColumn api, but I am getting the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) col3#9 missing from col1#7,col2#8,col3#13 in operator !Project [col1#7, col2#8, col3#13, UDF(col3#9, col2, named_struct(col1, col1#7, col2, col2#8, col3, col3#9)) AS contcatenated#17]. Attribute(s) with the same name appear in the operation: col3. Please check if the right attribute(s) are used.;;
The above exception can be replicated using the below code:
addRowUDF() // call invokes
def addRowUDF() {
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().config(new SparkConf().set("master", "local[*]")).appName(this.getClass.getSimpleName).getOrCreate()
import spark.implicits._
val df = Seq(
("a", "b", "c"),
("a1", "b1", "c1")).toDF("col1", "col2", "col3")
execute(df)
}
def execute(df: org.apache.spark.sql.DataFrame) {
import org.apache.spark.sql.Row
def concatFunc(x: Any, y: String, row: Row) = x.toString + ":" + y + ":" + row.mkString(", ")
import org.apache.spark.sql.functions.{ udf, struct }
val combineUdf = udf((x: Any, y: String, row: Row) => concatFunc(x, y, row))
def udf_execute(udf: String, args: org.apache.spark.sql.Column*) = (combineUdf)(args: _*)
val columns = df.columns.map(df(_))
val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
val df3 = df2.withColumn("contcatenated", udf_execute("uudf", df2.col("col3"), lit("col2"), struct(columns: _*)))
df3.show(false)
}
output should be:
+----+----+-----------+----------------------------+
|col1|col2|col3 |contcatenated |
+----+----+-----------+----------------------------+
|a |b |xxxxxxxxxxx|xxxxxxxxxxx:col2:a, b, c |
|a1 |b1 |xxxxxxxxxxx|xxxxxxxxxxx:col2:a1, b1, c1 |
+----+----+-----------+----------------------------+
That happens because you refer to column that is no longer in the scope. When you call:
val df2 = df.withColumn("col3", lit("xxxxxxxxxxx"))
it shades the original col3 column, effectively making preceding columns with the same name accessible. Even if it wasn't the case, let's say after:
val df2 = df.select($"*", lit("xxxxxxxxxxx") as "col3")
the new col3 would be ambiguous, and indistinguishable by name from the one defined brought by *.
So to achieve the required output you'll have to use another name:
val df2 = df.withColumn("col3_", lit("xxxxxxxxxxx"))
and then adjust the rest of your code accordingly:
df2.withColumn(
"contcatenated",
udf_execute("uudf", df2.col("col3_") as "col3",
lit("col2"), struct(columns: _*))
).drop("_3")
If the logic is as simple as the one in the example, you can of course just inline things:
df.withColumn(
"contcatenated",
udf_execute("uudf", lit("xxxxxxxxxxx") as "col3",
lit("col2"), struct(columns: _*))
).drop("_3")

How to convert a Spark DataFrame column into a list?

I want to convert a Spark DataFrame into another DataFrame with a specific manner as follows:
I have Spark DataFrame:
col des
A a
A b
B b
B c
As a result of the operation I would like to a have also a Spark DataFrame as:
col des
A a,b
B b,c
I tried to use:
result <- summarize(groupBy(df, df$col), des = n(df$des))
As a result I obtained the count. Is there any parameter of (summarize or agg) that converts column into a list or something similar, but with assumption that all operations are done on Spark?
Thank you in advance
Here is the solution in scala, you need to figure out for the SparkR.
val dataframe = spark.sparkContext.parallelize(Seq(
("A", "a"),
("A", "b"),
("B", "b"),
("B", "c")
)).toDF("col", "desc")
dataframe.groupBy("col").agg(collect_list(struct("desc")).as("desc")).show
Hope this helps!
sparkR code:
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
#create R data frame
df <- data.frame(col= c("A","A","B","B"),des= c("a","b","b","c"))
#converting to spark dataframe
sdf <- createDataFrame( sqlContext, df)
registerTempTable(sdf, "sdf")
head(sql(sqlContext, "SQL QUERY"))
write the corresponding sql query in it and execute it.

Spark: Merge 2 dataframes by adding row index/number on both dataframes

Q: Is there is any way to merge two dataframes or copy a column of a dataframe to another in PySpark?
For example, I have two Dataframes:
DF1
C1 C2
23397414 20875.7353
5213970 20497.5582
41323308 20935.7956
123276113 18884.0477
76456078 18389.9269
the seconde dataframe
DF2
C3 C4
2008-02-04 262.00
2008-02-05 257.25
2008-02-06 262.75
2008-02-07 237.00
2008-02-08 231.00
Then i want to add C3 of DF2 to DF1 like this:
New DF
C1 C2 C3
23397414 20875.7353 2008-02-04
5213970 20497.5582 2008-02-05
41323308 20935.7956 2008-02-06
123276113 18884.0477 2008-02-07
76456078 18389.9269 2008-02-08
I hope this example was clear.
rownum + window function i.e solution 1 or zipWithIndex.map i.e solution 2 should help in this case.
Solution 1 : You can use window functions to get this kind of
Then I would suggest you to add rownumber as additional column name to Dataframe say df1.
DF1
C1 C2 columnindex
23397414 20875.7353 1
5213970 20497.5582 2
41323308 20935.7956 3
123276113 18884.0477 4
76456078 18389.9269 5
the second dataframe
DF2
C3 C4 columnindex
2008-02-04 262.00 1
2008-02-05 257.25 2
2008-02-06 262.75 3
2008-02-07 237.00 4
2008-02-08 231.00 5
Now .. do inner join of df1 and df2 that's all...
you will get below ouput
something like this
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber
w = Window().orderBy()
df1 = .... // as showed above df1
df2 = .... // as shown above df2
df11 = df1.withColumn("columnindex", rowNumber().over(w))
df22 = df2.withColumn("columnindex", rowNumber().over(w))
newDF = df11.join(df22, df11.columnindex == df22.columnindex, 'inner').drop(df22.columnindex)
newDF.show()
New DF
C1 C2 C3
23397414 20875.7353 2008-02-04
5213970 20497.5582 2008-02-05
41323308 20935.7956 2008-02-06
123276113 18884.0477 2008-02-07
76456078 18389.9269 2008-02-08
Solution 2 : Another good way(probably this is best :)) in scala, which you can translate to pyspark :
/**
* Add Column Index to dataframe
*/
def addColumnIndex(df: DataFrame) = sqlContext.createDataFrame(
// Add Column index
df.rdd.zipWithIndex.map{case (row, columnindex) => Row.fromSeq(row.toSeq :+ columnindex)},
// Create schema
StructType(df.schema.fields :+ StructField("columnindex", LongType, false))
)
// Add index now...
val df1WithIndex = addColumnIndex(df1)
val df2WithIndex = addColumnIndex(df2)
// Now time to join ...
val newone = df1WithIndex
.join(df2WithIndex , Seq("columnindex"))
.drop("columnindex")
I thought I would share the python (pyspark) translation for answer #2 above from #Ram Ghadiyaram:
from pyspark.sql.functions import col
def addColumnIndex(df):
# Create new column names
oldColumns = df.schema.names
newColumns = oldColumns + ["columnindex"]
# Add Column index
df_indexed = df.rdd.zipWithIndex().map(lambda (row, columnindex): \
row + (columnindex,)).toDF()
#Rename all the columns
new_df = reduce(lambda data, idx: data.withColumnRenamed(oldColumns[idx],
newColumns[idx]), xrange(len(oldColumns)), df_indexed)
return new_df
# Add index now...
df1WithIndex = addColumnIndex(df1)
df2WithIndex = addColumnIndex(df2)
#Now time to join ...
newone = df1WithIndex.join(df2WithIndex, col("columnindex"),
'inner').drop("columnindex")
for python3 version,
from pyspark.sql.types import StructType, StructField, LongType
def with_column_index(sdf):
new_schema = StructType(sdf.schema.fields + [StructField("ColumnIndex", LongType(), False),])
return sdf.rdd.zipWithIndex().map(lambda row: row[0] + (row[1],)).toDF(schema=new_schema)
df1_ci = with_column_index(df1)
df2_ci = with_column_index(df2)
join_on_index = df1_ci.join(df2_ci, df1_ci.ColumnIndex == df2_ci.ColumnIndex, 'inner').drop("ColumnIndex")
I referred to his(#Jed) answer
from pyspark.sql.functions import col
def addColumnIndex(df):
# Get old columns names and add a column "columnindex"
oldColumns = df.columns
newColumns = oldColumns + ["columnindex"]
# Add Column index
df_indexed = df.rdd.zipWithIndex().map(lambda (row, columnindex): \
row + (columnindex,)).toDF()
#Rename all the columns
oldColumns = df_indexed.columns
new_df = reduce(lambda data, idx:data.withColumnRenamed(oldColumns[idx],
newColumns[idx]), xrange(len(oldColumns)), df_indexed)
return new_df
# Add index now...
df1WithIndex = addColumnIndex(df1)
df2WithIndex = addColumnIndex(df2)
#Now time to join ...
newone = df1WithIndex.join(df2WithIndex, col("columnindex"),
'inner').drop("columnindex")
This answer solved it for me:
import pyspark.sql.functions as sparkf
# This will return a new DF with all the columns + id
res = df.withColumn('id', sparkf.monotonically_increasing_id())
Credit to Arkadi T
Here is an simple example that can help you even if you have already solve the issue.
//create First Dataframe
val df1 = spark.sparkContext.parallelize(Seq(1,2,1)).toDF("lavel1")
//create second Dataframe
val df2 = spark.sparkContext.parallelize(Seq((1.0, 12.1), (12.1, 1.3), (1.1, 0.3))). toDF("f1", "f2")
//Combine both dataframe
val combinedRow = df1.rdd.zip(df2.rdd). map({
//convert both dataframe to Seq and join them and return as a row
case (df1Data, df2Data) => Row.fromSeq(df1Data.toSeq ++ df2Data.toSeq)
})
// create new Schema from both the dataframe's schema
val combinedschema = StructType(df1.schema.fields ++ df2.schema.fields)
// Create a new dataframe from new row and new schema
val finalDF = spark.sqlContext.createDataFrame(combinedRow, combinedschema)
finalDF.show
Expanding on Jed's answer, in response to Ajinkya's comment:
To get the same old column names, you need to replace "old_cols" with a column list of the newly named indexed columns. See my modified version of the function below
def add_column_index(df):
new_cols = df.schema.names + ['ix']
ix_df = df.rdd.zipWithIndex().map(lambda (row, ix): row + (ix,)).toDF()
tmp_cols = ix_df.schema.names
return reduce(lambda data, idx: data.withColumnRenamed(tmp_cols[idx], new_cols[idx]), xrange(len(tmp_cols)), ix_df)
Not the better way performance wise.
df3=df1.crossJoin(df2).show(3)
To merge columns from two different dataframe you have first to create a column index and then join the two dataframes. Indeed, two dataframes are similar to two SQL tables. To make a connection you have to join them.
If you don't care about the final order of the rows you can generate the index column with monotonically_increasing_id().
Using the following code you can check that monotonically_increasing_id generates the same index column in both dataframes (at least up to a billion of rows), so you won't have any error in the merged dataframe.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
sample_size = 1E9
sdf1 = spark.range(1, sample_size).select(F.col("id").alias("id1"))
sdf2 = spark.range(1, sample_size).select(F.col("id").alias("id2"))
sdf1 = sdf1.withColumn("idx", sf.monotonically_increasing_id())
sdf2 = sdf2.withColumn("idx", sf.monotonically_increasing_id())
sdf3 = sdf1.join(sdf2, 'idx', 'inner')
sdf3 = sdf3.withColumn("diff", F.col("id1")-F.col("id2")).select("diff")
sdf3.filter(F.col("diff") != 0 ).show()
You can use a combination of monotonically_increasing_id (guaranteed to always be increasing) and row_number (guaranteed to always give the same sequence). You cannot use row_number alone because it needs to be ordered by something. So here we order by monotonically_increasing_id. I am using Spark 2.3.1 and Python 2.7.13.
from pandas import DataFrame
from pyspark.sql.functions import (
monotonically_increasing_id,
row_number)
from pyspark.sql import Window
DF1 = spark.createDataFrame(DataFrame({
'C1': [23397414, 5213970, 41323308, 123276113, 76456078],
'C2': [20875.7353, 20497.5582, 20935.7956, 18884.0477, 18389.9269]}))
DF2 = spark.createDataFrame(DataFrame({
'C3':['2008-02-04', '2008-02-05', '2008-02-06', '2008-02-07', '2008-02-08']}))
DF1_idx = (
DF1
.withColumn('id', monotonically_increasing_id())
.withColumn('columnindex', row_number().over(Window.orderBy('id')))
.select('columnindex', 'C1', 'C2'))
DF2_idx = (
DF2
.withColumn('id', monotonically_increasing_id())
.withColumn('columnindex', row_number().over(Window.orderBy('id')))
.select('columnindex', 'C3'))
DF_complete = (
DF1_idx
.join(
other=DF2_idx,
on=['columnindex'],
how='inner')
.select('C1', 'C2', 'C3'))
DF_complete.show()
+---------+----------+----------+
| C1| C2| C3|
+---------+----------+----------+
| 23397414|20875.7353|2008-02-04|
| 5213970|20497.5582|2008-02-05|
| 41323308|20935.7956|2008-02-06|
|123276113|18884.0477|2008-02-07|
| 76456078|18389.9269|2008-02-08|
+---------+----------+----------+

SPARK code for sql case statement and row_number equivalent

I have a data set like below
hduser#ubuntu:~$ hadoop fs -cat /user/hduser/test_sample/sample1.txt
Eid1,EName1,EDept1,100
Eid2,EName2,EDept1,102
Eid3,EName3,EDept1,101
Eid4,EName4,EDept2,110
Eid5,EName5,EDept2,121
Eid6,EName6,EDept3,99
I want to generate the output as below using spark code
Eid1,EName1,IT,102,1
Eid2,EName2,IT,101,2
Eid3,EName3,IT,100,3
Eid4,EName4,ComSc,121,1
Eid5,EName5,ComSc,110,2
Eid6,EName6,Mech,99,1
which is equivalent of the below SQL
Select emp_id, emp_name, case when emp_dept='EDept1' then 'IT' when emp_dept='EDept2' then 'ComSc' when emp_dept='EDept3' then 'Mech' end dept_name, emp_sal, row_number() over (partition by emp_dept order by emp_sal desc) as rn from emp
Can someone suggest how should I get that in spark.
You can use RDD.zipWithIndex, then convert it to a DataFrame, then use min() and join to get the results you want.
Like this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
// SORT BY added as per comment request
val test = sc.textFile("/user/hadoop/test.txt")
.sortBy(_.split(",")(2)).sortBy(_.split(",")(3).toInt)
// Table to hold the dept name lookups
val deptDF =
sc.parallelize(Array(("EDept1", "IT"),("EDept2", "ComSc"),("EDept3", "Mech")))
.toDF("deptCode", "dept")
val schema = StructType(Array(
StructField("col1", StringType, false),
StructField("col2", StringType, false),
StructField("col3", StringType, false),
StructField("col4", StringType, false),
StructField("col5", LongType, false))
)
// join to deptDF added as per comment
val testDF = sqlContext.createDataFrame(
test.zipWithIndex.map(tuple => Row.fromSeq(tuple._1.split(",") ++ Array(tuple._2))),
schema
)
.join(deptDF, $"col3" === $"deptCode")
.select($"col1", $"col2", $"dept" as "col3", $"col4", $"col5")
.orderBy($"col5")
testDF.show
col1 col2 col3 col4 col5
Eid1 EName1 IT 100 0
Eid3 EName3 IT 101 1
Eid2 EName2 IT 102 2
Eid4 EName4 ComSc 110 3
Eid5 EName5 ComSc 121 4
Eid6 EName6 Mech 99 5
val result = testDF.join(
testDF.groupBy($"col3").agg($"col3" as "g_col3", min($"col5") as "start"),
$"col3" === $"g_col3"
)
.select($"col1", $"col2", $"col3", $"col4", $"col5" - $"start" + 1 as "index")
result.show
col1 col2 col3 col4 index
Eid4 EName4 ComSc 110 1
Eid5 EName5 ComSc 121 2
Eid6 EName6 Mech 99 1
Eid1 EName1 IT 100 1
Eid3 EName3 IT 101 2
Eid2 EName2 IT 102 3

Resources