Ingest "t" and "f" as boolean to Cassandra - apache-spark

I use pyspark to load a csv as a dataframe, then save it to Cassandra. One of the columns is defined as boolean in Cassandra's schema, but the actual values in the csv are the strings t or f. Is there any way to make Cassandra recognize t and f as boolean? Otherwise I have to add a data transformation step.

The Spark Cassandra Connector uses a String.toBoolean call to convert strings to boolean values, but that accepts only true and false and throws an exception for any other string. So you'll need a small data transformation, like this:
scala> val df = Seq((1, "t"), (2, "f"), (3, "t")).toDF("id", "b")
df: org.apache.spark.sql.DataFrame = [id: int, b: string]
scala> val df2 = df.withColumn("b", $"b" === "t")
df2: org.apache.spark.sql.DataFrame = [id: int, b: boolean]
scala> df2.show()
+---+-----+
| id| b|
+---+-----+
| 1| true|
| 2|false|
| 3| true|
+---+-----+
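After the cast, the write to Cassandra is the usual connector call; a minimal sketch, assuming the spark-cassandra-connector package is on the classpath (the keyspace and table names are placeholders, not from the question):
// "my_ks" / "my_table" are placeholder names
df2.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .mode("append")
  .save()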

Related

pyspark: read partitioned parquet "my_file.parquet/col1=NOW" string value replaced by <current_time> on read()

With pyspark 3.1.1 on WSL Debian 10.
When reading a parquet file partitioned on a column that contains the string NOW, the string is replaced by the current time at the moment read() is executed. I suppose the NOW string is interpreted as now().
# step to reproduce
df = spark.createDataFrame(data=[("NOW",1), ("TEST", 2)], schema = ["col1", "id"])
df.write.partitionBy("col1").parquet("test/test.parquet")
>>> /home/test/test.parquet/col1=NOW
df_loaded = spark.read.option(
    "basePath",
    "test/test.parquet",
).parquet("test/test.parquet/col1=*")
df_loaded.show(truncate=False)
>>>
+---+--------------------------+
|id |col1 |
+---+--------------------------+
|2 |TEST |
|1 |2021-04-18 14:36:46.532273|
+---+--------------------------+
Is that a bug or normal pyspark behaviour?
If the latter, is there a sparkContext option to avoid that behaviour?
I suspect that's an expected feature... but I'm not sure where it was documented. Anyway, if you want to keep the column as a string column, you can provide a schema while reading the parquet file:
df = spark.read.schema("id long, col1 string").parquet("test/test.parquet")
df.show()
+---+----+
| id|col1|
+---+----+
| 1| NOW|
| 2|TEST|
+---+----+
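If I remember correctly, another route is to disable partition-column type inference for the session, so partition values are always read back as strings; a hedged sketch (shown in Scala, but the same conf.set call works from pyspark):
// assumption: this session conf controls partition-column type inference
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
val df_loaded = spark.read.parquet("test/test.parquet")
df_loaded.show()   // col1 should now stay the literal string "NOW"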

How to read the data into a Spark DF when the column name and data type change

I have parquet data with the following schema:
Id:int,
Name:String
At a later stage the schema of the incoming data changed to:
Id:double/long,
NAME:String
Change in type
Change in field name
I have files with both schemas in the same folder. How can I read both schemas with spark.read.format("parquet").load("")?
Any expert advice will be helpful.
Typically in this scenario, I would create a v2 of this table and segregate my parquet files. Technically these are two different tables.
If you need to tie them back together, you can then create a second layer and stream both of these tables into a new table.
In a one-time scenario, I'd suggest casting the datatype and rewriting the target parquet files.
scala> val df = Seq((1, "as"), (2, "fd")).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: string]
scala> df.show
+---+---+
| a| b|
+---+---+
| 1| as|
| 2| fd|
+---+---+
scala> df.withColumn("a", $"a".cast("double")).show
+---+---+
| a| b|
+---+---+
|1.0| as|
|2.0| fd|
+---+---+
If not, then you need to update the source system with the same logic.
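For the one-time "cast and rewrite" route mentioned above, a rough sketch (the paths and the assumed target layout of Id: double / NAME: String are placeholders):
import org.apache.spark.sql.functions.col
// hypothetical paths; adjust to your folder layout
val old = spark.read.parquet("/data/old_schema")
val migrated = old
  .withColumn("Id", col("Id").cast("double"))   // align the type
  .withColumnRenamed("Name", "NAME")            // align the field name
migrated.write.mode("overwrite").parquet("/data/unified")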

how to add a value to the date field using data frame in spark

I have some date values (yyyy/mm/dd) in my dataframe. I need to find the next 7 days of data. How can I do it using a dataframe in Spark?
For example, I have data like below:
23/01/2018 , 23
24/01/2018 , 21
25/01/2018, 44
.
.
.
.
.
29/01/2018,17
I need to get the next 7 days of data including the start date (starting from the minimum date in the data). So in my example I need to get the dates from 2018/01/23 up to 7 days ahead. Is there any way to achieve this?
Note: I need to find the minimum date in the data and filter for that minimum date + 7 days of data.
scala> df.show
+----------+---+-------+
| data_date|vol|channel|
+----------+---+-------+
|05/01/2019| 10| ABC|
|05/01/2019| 20| CNN|
|06/01/2019| 10| BBC|
|07/01/2019| 10| ABC|
|02/01/2019| 20| CNN|
|17/01/2019| 10| BBC|
+----------+---+-------+
scala> val df2 = df.select("*").filter( to_date(replaceUDF('data_date)) > date_add(to_date(replaceUDF(lit(minDate))),7))
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [data_date: string, vol: int ... 1 more field]
scala> df2.show
+---------+---+-------+
|data_date|vol|channel|
+---------+---+-------+
+---------+---+-------+
I need data as below: the minimum date is 02/01/2019, so minimum date + 7 is 09/01/2019. I need the data between 02/01/2019 and 09/01/2019:
+----------+---+-------+
| data_date|vol|channel|
+----------+---+-------+
|05/01/2019| 10| ABC|
|05/01/2019| 20| CNN|
|06/01/2019| 10| BBC|
|07/01/2019| 10| ABC|
|02/01/2019| 20| CNN|
+----------+---+-------+
Can someone help, as I am a beginner with Spark?
Import the statement below:
import org.apache.spark.sql.functions._
Code snippet:
val minDate = df.agg(min($"date1")).collect()(0).get(0)
val df2 = df.select("*").filter(to_date(regexp_replace('date1, "/", "-")) > date_add(to_date(regexp_replace(lit(minDate), "/", "-")), 7))
df2.show()
For data
val data = Seq(("2018/01/23",23),("2018/01/24",24),("2018/02/20",25))
Output would be
+----------+---+
| date1|day|
+----------+---+
|2018/02/20| 25|
+----------+---+
If you are looking for different output, please update your question with the expected results.
Below is a complete program for your reference
package com.nelamalli.spark.dataframe

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameUDF {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .master("local[3]")
      .appName("SparkByExample")
      .getOrCreate()

    val data = Seq(("2018/01/23", 23), ("2018/01/24", 24), ("2018/02/20", 25))

    import spark.sqlContext.implicits._
    val df = data.toDF("date1", "day")

    // find the minimum date, then keep rows more than 7 days after it
    val minDate = df.agg(min($"date1")).collect()(0).get(0)
    val df2 = df.select("*").filter(to_date(regexp_replace('date1, "/", "-")) > date_add(to_date(regexp_replace(lit(minDate), "/", "-")), 7))
    df2.show()
  }
}
Thanks
Your question is still unclear. I'm borrowing the input from #Naveen and you can get the same results without UDFs. Check this out
scala> val df = Seq(("2018/01/23",23),("2018/01/24",24),("2018/02/20",25)).toDF("dt","day").withColumn("dt",to_date(regexp_replace('dt,"/","-")))
df: org.apache.spark.sql.DataFrame = [dt: date, day: int]
scala> df.show(false)
+----------+---+
|dt |day|
+----------+---+
|2018-01-23|23 |
|2018-01-24|24 |
|2018-02-20|25 |
+----------+---+
scala> val mindt = df.groupBy().agg(min('dt)).as[(java.sql.Date)].first
mindt: java.sql.Date = 2018-01-23
scala> df.filter('dt > date_add(lit(mindt),7)).show(false)
+----------+---+
|dt |day|
+----------+---+
|2018-02-20|25 |
+----------+---+
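If what you actually need is the rows within the first 7 days (which is what the expected output in the question shows), flip the comparison in the last filter; a minimal sketch reusing df and mindt from above, with org.apache.spark.sql.functions._ assumed to be in scope:
// keep rows from the minimum date up to minimum date + 7 days
df.filter('dt <= date_add(lit(mindt), 7)).show(false)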

How to get non-null sorted ascending data from Spark DataFrame?

I load the data into a data frame where one of the columns is zipCode (String type). How can I get the non-null values for that column in ascending order in Scala? Many thanks in advance.
scala> val df = Seq("2", "1", null).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: string]
scala> df.orderBy($"x".asc_nulls_last).show
+----+
| x|
+----+
| 1|
| 2|
|null|
+----+
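If the nulls should be dropped entirely rather than just sorted last, a small variant on the same toy df (use your zipCode column in practice):
// drop the nulls first, then sort ascending; shows 1 and 2 only
df.filter($"x".isNotNull).orderBy($"x".asc).show()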

pyspark AnalysisException: "Reference '<COLUMN>' is ambiguous" [duplicate]

I have two dataframes with the following columns:
df1.columns
// Array(ts, id, X1, X2)
and
df2.columns
// Array(ts, id, Y1, Y2)
After I do
val df_combined = df1.join(df2, Seq(ts,id))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I would expect the common columns to be dropped. Is there something additional that needs to be done?
The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.
Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.
Here is the left dataframe:
val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))
val left = llist.toDF("firstname","lastname","date","duration")
left.show()
/*
+---------+--------+----------+--------+
|firstname|lastname| date|duration|
+---------+--------+----------+--------+
| bob| b|2015-01-13| 4|
| alice| a|2015-04-23| 10|
+---------+--------+----------+--------+
*/
Here is the right dataframe:
val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")
right.show()
/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
| alice| a| 100|
| bob| b| 23|
+---------+--------+------+
*/
Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").
The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:
left.join(right, left("firstname")===right("firstname") &&
left("lastname")===right("lastname")).show
/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname| date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
| bob| b|2015-01-13| 4| bob| b| 23|
| alice| a|2015-04-23| 10| alice| a| 100|
+---------+--------+----------+--------+---------+--------+------+
*/
The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:
left.join(right, Seq("firstname", "lastname")).show
/*
+---------+--------+----------+--------+------+
|firstname|lastname| date|duration|upload|
+---------+--------+----------+--------+------+
| bob| b|2015-01-13| 4| 23|
| alice| a|2015-04-23| 10| 100|
+---------+--------+----------+--------+------+
*/
This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this:
SELECT * FROM a JOIN b ON joinExprs
If you want to ignore the duplicate columns just drop them or select the columns of interest afterwards. If you want to disambiguate you can access these using the parent DataFrames:
val a: DataFrame = ???
val b: DataFrame = ???
val joinExprs: Column = ???
a.join(b, joinExprs).select(a("id"), b("foo"))
// drop equivalent
a.alias("a").join(b.alias("b"), joinExprs).drop(b("id")).drop(a("foo"))
or use aliases:
// As for now aliases don't work with drop
a.alias("a").join(b.alias("b"), joinExprs).select($"a.id", $"b.foo")
For equi-joins there exists a special shortcut syntax which takes either a sequence of strings:
val usingColumns: Seq[String] = ???
a.join(b, usingColumns)
or a single string:
val usingColumn: String = ???
a.join(b, usingColumn)
which keeps only one copy of the columns used in the join condition.
I have been stuck with this for a while, and only recently I came up with a solution that is quite easy.
Say a is
scala> val a = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
a: org.apache.spark.sql.DataFrame = [key: string, vala: int]
scala> a.show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
and
scala> val b = Seq(("a", 1)).toDF("key", "valb")
b: org.apache.spark.sql.DataFrame = [key: string, valb: int]
scala> b.show
+---+----+
|key|valb|
+---+----+
| a| 1|
+---+----+
and I can do this to select only the columns from dataframe a:
scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
You can simply use this
df1.join(df2, Seq("ts","id"),"TYPE-OF-JOIN")
Here TYPE-OF-JOIN can be
left
right
inner
fullouter
For example, I have two dataframes like this:
// df1
word count1
w1 10
w2 15
w3 20
// df2
word count2
w1 100
w2 150
w5 200
If you do a fullouter join then the result looks like this:
df1.join(df2, Seq("word"),"fullouter").show()
word count1 count2
w1 10 100
w2 15 150
w3 20 null
w5 null 200
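A runnable sketch of that example, built with toDF and the same column names as above:
import spark.implicits._
val df1 = Seq(("w1", 10), ("w2", 15), ("w3", 20)).toDF("word", "count1")
val df2 = Seq(("w1", 100), ("w2", 150), ("w5", 200)).toDF("word", "count2")
// Seq("word") keeps a single "word" column; fullouter fills the gaps with null
df1.join(df2, Seq("word"), "fullouter").show()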
Try this:
val df_combined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id")).drop(df2("ts")).drop(df2("id"))
This is normal behavior coming from SQL; what I do for this:
Drop or rename the source columns
Do the join
Drop the renamed column, if any
Here I am replacing the "fullname" column:
Some code in Java:
this
    .sqlContext
    .read()
    .parquet(String.format("hdfs:///user/blablacar/data/year=%d/month=%d/day=%d", year, month, day))
    .drop("fullname")
    .registerTempTable("data_original");

this
    .sqlContext
    .read()
    .parquet(String.format("hdfs:///user/blablacar/data_v2/year=%d/month=%d/day=%d", year, month, day))
    .registerTempTable("data_v2");

this
    .sqlContext
    .sql(etlQuery)
    .repartition(1)
    .write()
    .mode(SaveMode.Overwrite)
    .parquet(outputPath);
Where the query is:
SELECT
d.*,
concat_ws('_', product_name, product_module, name) AS fullname
FROM
{table_source} d
LEFT OUTER JOIN
{table_updates} u ON u.id = d.id
This is something you can do only with Spark I believe (dropping a column from a list), very helpful!
Inner join is the default join in Spark. Below is the simple syntax for it:
leftDF.join(rightDF, "commonColName")
For other joins you can follow the syntax below:
leftDF.join(rightDF, Seq("commonCol1", "commonCol2"), "joinType")
If the column names are not common then:
leftDF.join(rightDF, leftDF.col("x") === rightDF.col("y"), "joinType")
Best practice is to make the column names different in both DFs before joining them, and drop them accordingly.
df1.columns = [id, age, income]
df2.columns = [id, age_group]
df1.join(df2, on=df1.id == df2.id, how='inner').write.saveAsTable('table_name')
will fail with an error for the duplicate columns. Try this instead:
df2_id_renamed = df2.withColumnRenamed('id', 'id_2')
df1.join(df2_id_renamed, on=df1.id == df2_id_renamed.id_2, how='inner').drop('id_2')
If anyone is using Spark SQL and wants to achieve the same thing, you can use the USING clause in the join query.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = List((1, 4, 3), (5, 2, 4), (7, 4, 5)).toDF("c1", "c2", "C3")
val df2 = List((1, 4, 3), (5, 2, 4), (7, 4, 10)).toDF("c1", "c2", "C4")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 using (c1, c2)").show(false)
/*
+---+---+---+---+
|c1 |c2 |C3 |C4 |
+---+---+---+---+
|1 |4 |3 |3 |
|5 |2 |4 |4 |
|7 |4 |5 |10 |
+---+---+---+---+
*/
After I've joined multiple tables together, I run them through a simple function to rename columns in the DF if it encounters duplicates. Alternatively, you could drop these duplicate columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = deDupeDfCols(NamesAndDates, '_')
NamesAndDates.write.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where deDupeDfCols is defined as:
def deDupeDfCols(df, separator=''):
    newcols = []
    for col in df.columns:
        if col not in newcols:
            newcols.append(col)
        else:
            for i in range(2, 1000):
                if (col + separator + str(i)) not in newcols:
                    newcols.append(col + separator + str(i))
                    break
    return df.toDF(*newcols)
The resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Id2', 'Date', 'Description2'].
Apologies this answer is in Python - I'm not familiar with Scala, but this was the question that came up when I Googled this problem and I'm sure Scala code isn't too different.
