How to convert column of arrays of strings to strings? - apache-spark

I have a column of type array<string> in Spark tables, which I query using SQL. I want to convert the array<string> into a string.
When I use the syntax below:
select cast(rate_plan_code as string) as new_rate_plan from
customer_activity_searches group by rate_plan_code
The rate_plan_code column has the following values:
["AAA","RACK","SMOBIX","SMOBPX"]
["LPCT","RACK"]
["LFTIN","RACK","SMOBIX","SMOBPX"]
["LTGD","RACK"]
["RACK","LEARLI","NHDP","LADV","LADV2"]
The following are populated in the new_rate_plan column:
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#e4273d9f
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#c1ade2ff
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#4f378397
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#d1c81377
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#552f3317
Cast seems to work when I am converting decimal to int or int to double, but not in this case. I am curious why the cast is not working here.
Greatly appreciate your help.

In Spark 2.1+, to concatenate the values in a single array column you can use any of the following:
concat_ws standard function
map operator
a user-defined function (UDF)
concat_ws Standard Function
Use the concat_ws function.
concat_ws(sep: String, exprs: Column*): Column Concatenates multiple input string columns together into a single string column, using the given separator.
val solution = words.withColumn("codes", concat_ws(" ", $"words"))
scala> solution.show
+--------------+-----------+
| words| codes|
+--------------+-----------+
|[hello, world]|hello world|
+--------------+-----------+
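Applied to the question's rate_plan_code column, the same function could look like the sketch below (the data here is mocked up; only the column name and a comma separator come from the question):
import org.apache.spark.sql.functions.concat_ws
import spark.implicits._
// Mocked-up rows with an array<string> column like the question's rate_plan_code
val searches = Seq(
  (1L, Array("AAA", "RACK", "SMOBIX", "SMOBPX")),
  (2L, Array("LPCT", "RACK"))
).toDF("id", "rate_plan_code")
// Join the array elements into a single comma-separated string column
searches.withColumn("new_rate_plan", concat_ws(",", $"rate_plan_code")).show(false)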
map Operator
Use the map operator to have full control over what should be transformed and how.
map[U](func: (T) ⇒ U): Dataset[U] Returns a new Dataset that contains the result of applying func to each element.
scala> codes.show(false)
+---+---------------------------+
|id |rate_plan_code |
+---+---------------------------+
|0 |[AAA, RACK, SMOBIX, SMOBPX]|
+---+---------------------------+
val codesAsSingleString = codes.as[(Long, Array[String])]
.map { case (id, codes) => (id, codes.mkString(", ")) }
.toDF("id", "codes")
scala> codesAsSingleString.show(false)
+---+-------------------------+
|id |codes |
+---+-------------------------+
|0 |AAA, RACK, SMOBIX, SMOBPX|
+---+-------------------------+
scala> codesAsSingleString.printSchema
root
|-- id: long (nullable = false)
|-- codes: string (nullable = true)
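For reference, the codes Dataset used above can be created with something like this (a sketch; it is not part of the original example):
import spark.implicits._
// One row matching the schema shown above: id: long, rate_plan_code: array<string>
val codes = Seq((0L, Array("AAA", "RACK", "SMOBIX", "SMOBPX"))).toDF("id", "rate_plan_code")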

In Spark 2.1+, you can directly use concat_ws to convert (concatenate with a separator) string / array<string> columns into a string.
select concat_ws(',',rate_plan_code) as new_rate_plan from
customer_activity_searches group by rate_plan_code
This will give you a response like:
AAA,RACK,SMOBIX,SMOBPX
LPCT,RACK
LFTIN,RACK,SMOBIX,SMOBPX
LTGD,RACK
RACK,LEARLI,NHDP,LADV,LADV2
PS: concat_ws doesn't work with types such as array<long>; for those, a UDF or map would be the only option, as Jacek explained.
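For an array<long> column, a minimal UDF sketch (all names here are made up for illustration) could look like this:
import org.apache.spark.sql.functions.udf
import spark.implicits._
// Hypothetical example: join an array<long> column into a comma-separated string
val joinLongs = udf((xs: Seq[Long]) => if (xs == null) null else xs.mkString(","))
val nums = Seq((1L, Array(10L, 20L, 30L))).toDF("id", "codes")
nums.withColumn("codes_str", joinLongs($"codes")).show(false)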

You can cast the array to a string when creating this DataFrame, rather than at output time:
from pyspark.sql import functions as F
newdf = df.groupBy('aaa') \
    .agg(F.collect_list('bbb').cast("string").alias('ccc'))
outputdf = newdf.select(
    F.concat_ws(', ', newdf.aaa, F.format_string('xxxxx(%s)', newdf.ccc)))

The way to do what you want in SQL is to use the built-in SQL function string():
select string(rate_plan_code) as new_rate_plan from
customer_activity_searches group by rate_plan_code

Related

Convert string type to array type in spark sql

I have a table in Spark SQL in Databricks, and I have a column as string. I converted it to a new column with Array datatype, but it is still one string. The datatype is array type in the table schema.
Column as string (Data1):
[2461][2639][2639][7700][7700][3953]
Converted to array (Data_New):
["[2461][2639][2639][7700][7700][3953]"]
String to array conversion:
df_new = df.withColumn("Data_New", array(df["Data1"]))
Then I write it as parquet and use it as a Spark SQL table in Databricks.
When I search for a string using the array_contains function, I get the result as false:
select *
from table_name
where array_contains(Data_New,"[2461]")
When I search for the whole string, the query returns true.
Please suggest how I can separate this string into an array so that I can find any element using the array_contains function.
Just remove the leading and trailing brackets from the string, then split by ][ to get an array of strings:
df = df.withColumn("Data_New", split(expr("rtrim(']', ltrim('[', Data1))"), "\\]\\["))
df.show(truncate=False)
+------------------------------------+------------------------------------+
|Data1 |Data_New |
+------------------------------------+------------------------------------+
|[2461][2639][2639][7700][7700][3953]|[2461, 2639, 2639, 7700, 7700, 3953]|
+------------------------------------+------------------------------------+
Now use array_contains like this:
df.createOrReplaceTempView("table_name")
sql_query = "select * from table_name where array_contains(Data_New,'2461')"
spark.sql(sql_query).show(truncate=False)
Actually this is not an array; it is one full string, so you need a regex or something similar:
# Escape the brackets so the regex matches the literal substring [2461]
pattern = "\\[2461\\]"
df_new.filter(df_new["Data_New"].rlike(pattern))
import
from pyspark.sql import functions as sf, types as st
create table
a = [["[2461][2639][2639][7700][7700][3953]"], [None]]
sdf = sc.parallelize(a).toDF(["col1"])
sdf.show()
+--------------------+
| col1|
+--------------------+
|[2461][2639][2639...|
| null|
+--------------------+
convert type
def spliter(x):
    if x is not None:
        return x[1:-1].split("][")
    else:
        return None
udf = sf.udf(spliter, st.ArrayType(st.StringType()))
sdf.withColumn("array_col1", udf("col1")).withColumn("check", sf.array_contains("array_col1", "2461")).show()
+--------------------+--------------------+-----+
| col1| array_col1|check|
+--------------------+--------------------+-----+
|[2461][2639][2639...|[2461, 2639, 2639...| true|
| null| null| null|
+--------------------+--------------------+-----+

How to swap minus sign from last position in a string to first position in hive?

How do I swap a negative sign from the last position to the first position in a string or integer, in Hive and/or Spark?
example: 22-
required: -22
My code is:
val Validation1 = spark.sql("Select case when substr(YTTLSVAL-,-1,1)='-' then cast(concat('-',substr(YTTLSVAL-,1,length(YTTLSVAL-)-1)) as int) else cast(YTTLSVAL- as int) end as column_name")
scala> Seq("-abcd", "def", "23-", "we").toDF("value").createOrReplaceTempView("values")
scala> val f = (x: String) => if(x.endsWith("-")) s"-${x.dropRight(1)}" else x
scala> spark.udf.register("myudf", f)
scala> spark.sql("select *, myudf(*) as custval from values").show
+-----+-------+
|value|custval|
+-----+-------+
|-abcd| -abcd|
| def| def|
| 23-| -23|
| we| we|
+-----+-------+
EDIT
On second thought, since UDFs are discouraged unless you absolutely need them (they create a black box for Spark's optimisation engine), please use the approach below, which uses regexp_replace instead. I have tested this and it works:
scala> spark.sql("select REGEXP_REPLACE ( value, '^(\\d+)(-)$','-$1') as custval from values").show
You could try REGEXP_REPLACE. This pattern searches for a number followed by - at the end and puts the - before the number if found.
SELECT REGEXP_REPLACE ( val, '^(\\d+)-$','-$1')
The same replacement can also be expressed with the regexp_replace Column function.
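A minimal sketch, reusing the test values from above (not part of the original answers):
import org.apache.spark.sql.functions.regexp_replace
import spark.implicits._
// Move a trailing '-' on a numeric string to the front using the Column API
Seq("-abcd", "def", "23-", "we").toDF("value")
  .withColumn("custval", regexp_replace($"value", "^(\\d+)-$", "-$1"))
  .show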

How to write just the `row` value of a DataFrame to a file in spark?

I have a dataframe that has just one column, whose value is a JSON string. I'm trying to write just the values to a file with one record per line.
scala> selddf.printSchema
root
|-- raw_event: string (nullable = true)
The data looks like this:
scala> selddf.show(1)
+--------------------+
| raw_event|
+--------------------+
|{"event_header":{...|
+--------------------+
only showing top 1 row
I am running the following to save it to file:
selddf.select("raw_event").write.json("/data/test")
The output looks like:
{"raw_event":"{\"event_header\":{\"version\":\"1.0\"...}"}
I would like the output to just say:
{\"event_header\":{\"version\":\"1.0\"...}
What am I missing?
The reason this happens is that when you write JSON you are writing the DataFrame itself, in which the column is raw_event, so every value is wrapped in that field.
Your first option is to simply write it as text:
df.write.text(filename)
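With the question's DataFrame that could look like this sketch (the output path is just a placeholder):
// Write only the raw_event strings, one JSON string per line, without extra wrapping
selddf.select("raw_event").write.text("/data/test_text")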
Another option (if your JSON schema is the same for all elements) is to use the from_json function to convert the string into a proper struct column. Select its fields (the content of the column, which includes all members of the JSON) and only then save it:
val df = Seq("{\"a\": \"str\", \"b\": [1,2,3], \"c\": {\"d\": 1, \"e\": 2}}").toDF("raw_event")
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("a", StringType), StructField("b", ArrayType(IntegerType)), StructField("c", StructType(Seq(StructField("d", IntegerType), StructField("e", IntegerType))))))
df.withColumn("jsonData", from_json($"raw_event", schema)).select("jsonData.*").write.json("bla.json")
The advantage of the second option is that you can test for malformed rows (which would result in null), and therefore you can add a filter to remove them.
Note that in both cases you don't get escaping for the ". If you want that, you would need to use the first option and first apply a UDF which adds the escaping.
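If the escaping really is needed, a rough sketch of such a UDF (assumed behaviour, not from the original answer) could be:
import org.apache.spark.sql.functions.{col, udf}
// Escape double quotes in the raw strings before writing them as plain text
val escapeQuotes = udf((s: String) => if (s == null) null else s.replace("\"", "\\\""))
selddf.select(escapeQuotes(col("raw_event")).as("raw_event"))
  .write.text("/data/test_escaped")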

How to create a dataframe from a string key=value delimited by ";"

I have a Hive table with the structure:
I need to read the string field, break out the keys and turn them into Hive table columns; the final table should look like this:
Very important: the number of keys in the string is dynamic, and the names of the keys are also dynamic.
One attempt would be to read the string with Spark SQL, create a DataFrame with a schema based on all the strings, and use the saveAsTable() function to turn the DataFrame into the final Hive table, but I do not know how to do this.
Any suggestions?
A naive solution (assuming unique (code, date) combinations and no embedded = or ; in the string) can look like this:
import org.apache.spark.sql.functions.{explode, first, split, udf}
val df = Seq(
(1, 1, "key1=value11;key2=value12;key3=value13;key4=value14"),
(1, 2, "key1=value21;key2=value22;key3=value23;key4=value24"),
(2, 4, "key3=value33;key4=value34;key5=value35")
).toDF("code", "date", "string")
val bits = split($"string", ";")
val kv = split($"pair", "=")
df
.withColumn("bits", bits) // Split column by `;`
.withColumn("pair", explode($"bits")) // Explode into multiple rows
.withColumn("key", kv(0)) // Extract key
.withColumn("val", kv(1)) // Extract value
// Pivot to wide format
.groupBy("code", "date")
.pivot("key")
.agg(first("val"))
// +----+----+-------+-------+-------+-------+-------+
// |code|date| key1| key2| key3| key4| key5|
// +----+----+-------+-------+-------+-------+-------+
// | 1| 2|value21|value22|value23|value24| null|
// | 1| 1|value11|value12|value13|value14| null|
// | 2| 4| null| null|value33|value34|value35|
// +----+----+-------+-------+-------+-------+-------+
This can easily be adjusted to handle the case when (code, date) is not unique, and you can process more complex string patterns using a UDF.
Depending on the language you use and the number of columns, you may be better off with an RDD or Dataset. It is also worth considering dropping the full explode / pivot in favor of a UDF:
val parse = udf((text: String) => text.split(";").map(_.split("=")).collect {
case Array(k, v) => (k, v)
}.toMap)
val extractKeys = udf((pairs: Map[String, String]) => pairs.keys.toList)
// Parse strings to Map[String, String]
val withKVs = df.withColumn("kvs", parse($"string"))
val keys = withKVs
.select(explode(extractKeys($"kvs"))).distinct // Get unique keys
.as[String]
.collect.sorted.toList // Collect and sort
// Build a list of expressions for subsequent select
val exprs = keys.map(key => $"kvs".getItem(key).alias(key))
withKVs.select($"code" :: $"date" :: exprs: _*)
In Spark 1.5 you can try:
val keys = withKVs.select($"kvs").rdd
.flatMap(_.getAs[Map[String, String]]("kvs").keys)
.distinct
.collect.sorted.toList

How to pass whole Row to UDF - Spark DataFrame filter

I'm writing a filter function for a complex JSON dataset with lots of inner structures. Passing individual columns is too cumbersome.
So I declared the following UDF:
val records: DataFrame = sqlContext.jsonFile("...")
def myFilterFunction(r: Row): Boolean = ???
sqlContext.udf.register("myFilter", (r: Row) => myFilterFunction(r))
Intuitively I'm thinking it will work like this:
records.filter("myFilter(*)=true")
What is the actual syntax?
You have to use the struct() function to construct the row while making the call to the function; follow these steps.
Import Row,
import org.apache.spark.sql._
Define the UDF
def myFilterFunction(r:Row) = {r.get(0)==r.get(1)}
Register the UDF
sqlContext.udf.register("myFilterFunction", myFilterFunction _)
Create the dataFrame
val records = sqlContext.createDataFrame(Seq(("sachin", "sachin"), ("aggarwal", "aggarwal1"))).toDF("text", "text2")
Use the UDF
records.filter(callUdf("myFilterFunction",struct($"text",$"text2"))).show
When you want all columns to be passed to the UDF:
records.filter(callUdf("myFilterFunction",struct(records.columns.map(records(_)) : _*))).show
Result:
+------+------+
| text| text2|
+------+------+
|sachin|sachin|
+------+------+
scala> inputDF
res40: org.apache.spark.sql.DataFrame = [email: string, first_name: string ... 3 more fields]
scala> inputDF.printSchema
root
|-- email: string (nullable = true)
|-- first_name: string (nullable = true)
|-- gender: string (nullable = true)
|-- id: long (nullable = true)
|-- last_name: string (nullable = true)
Now, I would like to filter the rows based on the gender field. I can accomplish that by using .filter($"gender" === "Male"), but I would like to do it with .filter(function).
So, I defined my anonymous functions:
val isMaleRow = (r:Row) => {r.getAs("gender") == "Male"}
val isFemaleRow = (r:Row) => { r.getAs("gender") == "Female" }
inputDF.filter(isMaleRow).show()
inputDF.filter(isFemaleRow).show()
I feel the requirement can be done in a better way, i.e. without declaring a UDF and invoking it.
In addition to the first answer: when we want all columns to be passed to the UDF, we can use
struct("*")
If you want to take an action over the whole row and process it in a distributed way, take the row from the DataFrame, send it to a function as a struct, and then convert it to a dictionary to execute the specific action. It is very important to call the collect method on the final DataFrame, because Spark is lazy and doesn't operate on the full data unless you explicitly tell it to.
Import the libraries.
Declare the UDF; its lambda must receive the row structure.
Execute the specific function, in this case sending a dictionary (the row structure converted to a dict) to an index.
The origin DataFrame calls withColumn, which tells Spark to execute this for each row before the call to collect; this allows the function to run in a distributed way. Don't forget to assign the result to another DataFrame variable.
Call the collect method to run the process and distribute the function.
from pyspark.sql.functions import udf, struct
from pyspark.sql.types import IntegerType
myUdf = udf(lambda row: sendToES(row.asDict()), IntegerType())
dfWithControlCol = df.withColumn("control_col", myUdf(struct([df[x] for x in df.columns])))
dfWithControlCol.collect()
