Spark: create DataFrame with specified schema fields - apache-spark

In Spark, create a case class to specify the schema, then create an RDD from a file and convert it to a DataFrame, e.g.:
case class Example(name: String, age: Long)

val exampleDF = spark.sparkContext
  .textFile("example.txt")
  .map(_.split(","))
  .map(attributes => Example(attributes(0), attributes(1).toInt))
  .toDF()
The question is: if the content of the txt file is something like "ABCDE12345FGHIGK67890", without any delimiters or spaces, how can I extract a substring of a specified length for each schema field, e.g. 'BCD' for name and '23' for age? Is it possible to do this with map and split?
Thanks !!!

You can use substring to pull the data from specific positions, as below:
case class Example(name: String, age: Int)

val example = spark.sparkContext
  .textFile("test.txt")
  .map(line => Example(line.substring(1, 4), line.substring(6, 8).toInt))
  .toDF()
example.show()
Output:
+----+---+
|name|age|
+----+---+
| BCD| 23|
| BCD| 23|
| BCD| 23|
+----+---+
I hope this helps!

In the map function where you are splitting by commas, just put a function that converts the input string to a list of values in the required order.
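For example, a minimal sketch of that idea, assuming the fixed-width layout from the question (name at positions 1 to 3 and age at positions 6 to 7); the parseLine helper is just an illustrative name:
case class Example(name: String, age: Long)

// Convert one raw line into the schema fields, in the required order
def parseLine(line: String): Example =
  Example(line.substring(1, 4), line.substring(6, 8).toLong)

val exampleDF = spark.sparkContext
  .textFile("example.txt")
  .map(parseLine)
  .toDF()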

Related

Convert string type to array type in spark sql

I have a table in Spark SQL in Databricks with a column stored as a string. I converted it to a new column with the Array datatype, but the values are still a single string, even though the datatype is array in the table schema.
Column as String (Data1):
[2461][2639][2639][7700][7700][3953]
Converted to Array (Data_New):
["[2461][2639][2639][7700][7700][3953]"]
String to array conversion
df_new = df.withColumn("Data_New", array(df["Data1"]))
Then I write it as Parquet and use it as a Spark SQL table in Databricks.
When I search for a string using the array_contains function, I get false:
select *
from table_name
where array_contains(Data_New,"[2461]")
When I search for the whole string, the query returns true.
Please suggest how I can split this string into an array so that I can find any element using the array_contains function.
Just remove the leading and trailing brackets from the string, then split by ][ to get an array of strings:
from pyspark.sql.functions import expr, split

df = df.withColumn("Data_New", split(expr("rtrim(']', ltrim('[', Data1))"), "\\]\\["))
df.show(truncate=False)
+------------------------------------+------------------------------------+
|Data1 |Data_New |
+------------------------------------+------------------------------------+
|[2461][2639][2639][7700][7700][3953]|[2461, 2639, 2639, 7700, 7700, 3953]|
+------------------------------------+------------------------------------+
Now use array_contains like this:
df.createOrReplaceTempView("table_name")
sql_query = "select * from table_name where array_contains(Data_New,'2461')"
spark.sql(sql_query).show(truncate=False)
Actually, this is not an array; it is a single string, so you need a regex or something similar:
expr = "\\[2461\\]"  # escape the brackets so the pattern matches the literal "[2461]"
df_new.filter(df_new["Data_New"].rlike(expr))
import
from pyspark.sql import functions as sf, types as st
create a test DataFrame
a = [["[2461][2639][2639][7700][7700][3953]"], [None]]
sdf = sc.parallelize(a).toDF(["col1"])
sdf.show()
+--------------------+
| col1|
+--------------------+
|[2461][2639][2639...|
| null|
+--------------------+
convert type
def spliter(x):
    if x is not None:
        return x[1:-1].split("][")
    else:
        return None
udf = sf.udf(spliter, st.ArrayType(st.StringType()))
sdf.withColumn("array_col1", udf("col1")).withColumn("check", sf.array_contains("array_col1", "2461")).show()
+--------------------+--------------------+-----+
| col1| array_col1|check|
+--------------------+--------------------+-----+
|[2461][2639][2639...|[2461, 2639, 2639...| true|
| null| null| null|
+--------------------+--------------------+-----+

Not able to split the column into multiple columns in Spark Dataframe

I am not able to split the column into multiple columns in a Spark DataFrame or through an RDD.
I tried some other code, but it works only with a fixed number of columns.
Ex:
The datatypes are name: string, city: list(string).
I have a text file and the input data is like below:
Name, city
A, (hyd,che,pune)
B, (che,bang,del)
C, (hyd)
Required output is:
A,hyd
A,che
A,pune
B,che
B,bang
B,del
C,hyd
After reading the text file and converting it to a DataFrame, the DataFrame will look like below:
scala> data.show
+----------------+
|           value|
+----------------+
|      Name, city|
|A,(hyd,che,pune)|
|B,(che,bang,del)|
|         C,(hyd)|
|  D,(hyd,che,tn)|
+----------------+
You can use the explode function on your DataFrame:
val explodeDF = inputDF.withColumn("city", explode($"city"))
explodeDF.show()
http://sqlandhadoop.com/spark-dataframe-explode/
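Note that the DataFrame in the question has only a single value column holding the whole line, so the city array has to be derived before it can be exploded. A rough sketch using only built-in functions (assuming the header row "Name, city" has already been filtered out) could look like this:
import org.apache.spark.sql.functions.{explode, regexp_extract, split, trim}

val explodedDF = inputDF
  .withColumn("Name", trim(regexp_extract($"value", "^([^,]+),", 1)))  // text before the first comma
  .withColumn("cities", regexp_extract($"value", "\\((.*)\\)", 1))     // text inside the parentheses
  .withColumn("city", explode(split($"cities", ",")))                  // one row per city
  .select($"Name", trim($"city").as("city"))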
Now that I understand you're loading each full line as a string, here is how to achieve your output.
I have defined two user-defined functions:
val split_to_two_strings: String => Array[String] = _.split(",", 2) // split each input line into two parts: name and the city list
val custom_conv_to_Array: String => Array[String] = _.stripPrefix("(").stripSuffix(")").split(",") // strip ( and ), then split into a list of cities
import org.apache.spark.sql.functions.{explode, trim, udf}
val custom_conv_to_ArrayUDF = udf(custom_conv_to_Array)
val split_to_two_stringsUDF = udf(split_to_two_strings)
val outputDF = inputDF.withColumn("tmp", split_to_two_stringsUDF($"value"))
  .select($"tmp".getItem(0).as("Name"), trim($"tmp".getItem(1)).as("city_list"))
  .withColumn("city_array", custom_conv_to_ArrayUDF($"city_list"))
  .drop($"city_list")
  .withColumn("city", explode($"city_array"))
  .drop($"city_array")
outputDF.show()
Hope this helps

Select all columns at runtime spark sql, without predefined schema

I have a dataframe with values in the following format:
+----------+------------+--------+--------+
|resourceId|resourceType|seasonId|seriesId|
+----------+------------+--------+--------+
|1234 |cM-type |883838 |8838832 |
|1235 |cM-type |883838 |8838832 |
|1236 |cM-type |883838 |8838832 |
|1237 |CNN-type |883838 |8838832 |
|1238 |cM-type |883838 |8838832 |
+----------+------------+--------+--------+
I want to convert the dataframe into this format
+----------+----------------------------------------------------------------------------------------+
|resourceId|value |
+----------+----------------------------------------------------------------------------------------+
|1234 |{"resourceId":"1234","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1235 |{"resourceId":"1235","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1236 |{"resourceId":"1236","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1237 |{"resourceId":"1237","resourceType":"CNN-type","seasonId":"883838","seriesId":"8838832"}|
|1238 |{"resourceId":"1238","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
+----------+----------------------------------------------------------------------------------------+
I know I can get the desired output by giving the fields manually like this
val jsonformated=df.select($"resourceId",to_json(struct($"resourceId", $"resourceType", $"seasonId",$"seriesId")).alias("value"))
However, I am trying to pass the column values to struct programmatically, using:
val cols = df.columns.toSeq
val jsonformatted=df.select($"resourceId",to_json(struct("colval",cols)).alias("value"))
For some reason the struct function doesn't take the sequence; from the API it looks like there is a method signature that accepts a sequence:
struct(String colName, scala.collection.Seq<String> colNames)
Is there a better solution to this problem?
Update:
As the answer pointed out, the exact syntax to get the output is:
val colsList = df.columns.toList
val column: List[Column] = colsList.map(col(_))
val jsonformatted = df.select($"resourceId", to_json(struct(column: _*)).alias("value"))
struct takes a sequence; you're just looking at the wrong variant. Use
def struct(cols: Column*): Column
such as
import org.apache.spark.sql.functions._
val cols: Seq[String] = ???
struct(cols map col: _*)
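Applied to the DataFrame in the question, that would look roughly like this (a sketch reusing the question's column names):
import org.apache.spark.sql.functions.{col, struct, to_json}

val cols = df.columns.toSeq                    // all column names at runtime
val jsonformatted = df.select(
  col("resourceId"),
  to_json(struct(cols.map(col): _*)).alias("value"))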

How to create a dataframe from a string key=value delimited by ";"

I have a Hive table with the following structure:
I need to read the string field, break out the keys, and turn them into Hive table columns; the final table should look like this:
Very important: the number of keys in the string is dynamic, and the names of the keys are also dynamic.
One attempt would be to read the string with Spark SQL, create a dataframe with a schema based on all the strings, and use the saveAsTable() function to turn the dataframe into the final Hive table, but I do not know how to do this.
Any suggestions?
A naive solution (assuming unique (code, date) combinations and no embedded = or ; in the string) can look like this:
import org.apache.spark.sql.functions.{explode, first, split}

val df = Seq(
  (1, 1, "key1=value11;key2=value12;key3=value13;key4=value14"),
  (1, 2, "key1=value21;key2=value22;key3=value23;key4=value24"),
  (2, 4, "key3=value33;key4=value34;key5=value35")
).toDF("code", "date", "string")

val bits = split($"string", ";")
val kv = split($"pair", "=")

df
  .withColumn("bits", bits)             // Split column by `;`
  .withColumn("pair", explode($"bits")) // Explode into multiple rows
  .withColumn("key", kv(0))             // Extract key
  .withColumn("val", kv(1))             // Extract value
  // Pivot to wide format
  .groupBy("code", "date")
  .pivot("key")
  .agg(first("val"))
  .show()
// +----+----+-------+-------+-------+-------+-------+
// |code|date| key1| key2| key3| key4| key5|
// +----+----+-------+-------+-------+-------+-------+
// | 1| 2|value21|value22|value23|value24| null|
// | 1| 1|value11|value12|value13|value14| null|
// | 2| 4| null| null|value33|value34|value35|
// +----+----+-------+-------+-------+-------+-------+
This can easily be adjusted to handle the case when (code, date) pairs are not unique, and you can handle more complex string patterns with a UDF.
Depending on the language you use and the number of columns, you may be better off using an RDD or Dataset. It is also worth considering dropping the full explode / pivot in favor of a UDF:
import org.apache.spark.sql.functions.udf

val parse = udf((text: String) => text.split(";").map(_.split("=")).collect {
  case Array(k, v) => (k, v)
}.toMap)

val keysOf = udf((pairs: Map[String, String]) => pairs.keys.toList)

// Parse strings to Map[String, String]
val withKVs = df.withColumn("kvs", parse($"string"))

val keys = withKVs
  .select(explode(keysOf($"kvs"))).distinct // Get unique keys
  .as[String]
  .collect.sorted.toList                    // Collect and sort

// Build a list of expressions for the subsequent select
val exprs = keys.map(key => $"kvs".getItem(key).alias(key))

withKVs.select($"code" :: $"date" :: exprs: _*)
In Spark 1.5 you can try:
val keys = withKVs.select($"kvs").rdd
  .flatMap(_.getAs[Map[String, String]]("kvs").keys)
  .distinct
  .collect.sorted.toList

How to convert column of arrays of strings to strings?

I have a column of type array<string> in Spark tables. I am using SQL to query these Spark tables, and I wanted to convert the array<string> into a string.
When I used the below syntax:
select cast(rate_plan_code as string) as new_rate_plan from
customer_activity_searches group by rate_plan_code
The rate_plan_code column has the following values:
["AAA","RACK","SMOBIX","SMOBPX"]
["LPCT","RACK"]
["LFTIN","RACK","SMOBIX","SMOBPX"]
["LTGD","RACK"]
["RACK","LEARLI","NHDP","LADV","LADV2"]
the following values are populated in the new_rate_plan column:
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#e4273d9f
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#c1ade2ff
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#4f378397
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#d1c81377
org.apache.spark.sql.catalyst.expressions.UnsafeArrayData#552f3317
Cast seems to work when I am converting decimal to int or int to double, but not in this case. I am curious why the cast is not working here.
I greatly appreciate your help.
In Spark 2.1+, to concatenate the values in a single array column, you can use the following:
concat_ws standard function
map operator
a user-defined function (UDF)
concat_ws Standard Function
Use the concat_ws function:
concat_ws(sep: String, exprs: Column*): Column Concatenates multiple input string columns together into a single string column, using the given separator.
val solution = words.withColumn("codes", concat_ws(" ", $"rate_plan_code"))
scala> solution.show
+--------------+-----------+
| words| codes|
+--------------+-----------+
|[hello, world]|hello world|
+--------------+-----------+
map Operator
Use the map operator to have full control over what is transformed and how.
map[U](func: (T) ⇒ U): Dataset[U] Returns a new Dataset that contains the result of applying func to each element.
scala> codes.show(false)
+---+---------------------------+
|id |rate_plan_code |
+---+---------------------------+
|0 |[AAA, RACK, SMOBIX, SMOBPX]|
+---+---------------------------+
val codesAsSingleString = codes.as[(Long, Array[String])]
  .map { case (id, codes) => (id, codes.mkString(", ")) }
  .toDF("id", "codes")
scala> codesAsSingleString.show(false)
+---+-------------------------+
|id |codes |
+---+-------------------------+
|0 |AAA, RACK, SMOBIX, SMOBPX|
+---+-------------------------+
scala> codesAsSingleString.printSchema
root
|-- id: long (nullable = false)
|-- codes: string (nullable = true)
In Spark 2.1+, you can directly use concat_ws to convert (concatenate with a separator) an array<string> into a string:
select concat_ws(',',rate_plan_code) as new_rate_plan from
customer_activity_searches group by rate_plan_code
This will give you a response like:
AAA,RACK,SMOBIX,SMOBPX
LPCT,RACK
LFTIN,RACK,SMOBIX,SMOBPX
LTGD,RACK
RACK,LEARLI,NHDP,LADV,LADV2
PS: concat_ws doesn't work with types like array<long>, for which a UDF or map would be the only option, as Jacek noted.
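A minimal sketch of such a UDF, assuming a hypothetical array<long> column named long_codes:
import org.apache.spark.sql.functions.udf

// Join the numeric values with commas; pass null arrays through unchanged
val longsToString = udf((xs: Seq[Long]) => Option(xs).map(_.mkString(",")).orNull)

val withCodes = df.withColumn("codes", longsToString($"long_codes"))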
You can cast the array to a string when you create the DataFrame, rather than at output time:
from pyspark.sql import functions as F

newdf = df.groupBy('aaa') \
    .agg(F.collect_list('bbb').cast("string").alias('ccc'))

outputdf = newdf.select(
    F.concat_ws(', ', newdf.aaa, F.format_string('xxxxx(%s)', newdf.ccc)))
The way to do what you want in SQL is to use the built-in SQL function string():
select string(rate_plan_code) as new_rate_plan from
customer_activity_searches group by rate_plan_code
