I am trying to enable invalidation of old measurements while keeping them in my Cassandra setup. Given the following table structure:
ID|Test|result|valid|valid2
1 | 1 | 10 | False| unset
2 | 1 | 11 | True| False
3 | 1 | 12 | True| True
with primary key (ID,test)
Now if I insert the following SparkDataframe using the connector as normal with mode("append")
ID|Test|valid2
1 | 1 | False
Will this create a tombstone? The purpose is to be able to "invalidate" certain rows in my tables when necessary. I understand tombstones are created when cells are outdated. But since there is no value in the cell, will a tombstone be created?
Tombstones are created when you performing explicit DELETE, insert null value, or data is TTLed.
If you don't specify the value for specific column, then the data for this cell is simply not set, and if you had some previous data before, then they won't be overwritten until you explicitly set them to null. But in Spark, usually situation is different - by default it will insert nulls until you won't specify spark.cassandra.output.ignoreNulls as true - in this case it will treat nulls as unset, and won't owerwrite the previous data.
But when you specify incomplete row, then only provided pieces will be updated, keeping the previous data intact.
If we have following table and data:
create table test.v2(id int primary key, valid boolean, v int);
insert into test.v2(id, valid, v) values(2,True, 2);
insert into test.v2(id, valid, v) values(1,True, 1);
we can check that data is visible in Spark:
scala> val data = spark.read.cassandraFormat("v2", "test").load()
data: org.apache.spark.sql.DataFrame = [id: int, v: int ... 1 more field]
scala> data.show
+---+---+-----+
| id| v|valid|
+---+---+-----+
| 1| 1| true|
| 2| 2| true|
+---+---+-----+
Now update the data:
scala> import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SaveMode
scala> val newData = Seq((2, false)).toDF("id", "valid")
newData: org.apache.spark.sql.DataFrame = [id: int, valid: boolean]
scala> newData.write.cassandraFormat("v2", "test").mode(SaveMode.Append).save()
scala> data.show
+---+---+-----+
| id| v|valid|
+---+---+-----+
| 1| 1| true|
| 2| 2|false|
+---+---+-----+
Related
I have a specific requirement where I need to query a dataframe based on a range condition.
The values of the range come from the rows of another dataframe and so I will have as many queries as the rows in this different dataframe.
Using collect() in my scenario seems to be the bottleneck because it brings every row to the driver.
Sample example:
I need to execute a query on table 2 for every row in table 1
Table 1:
ID1
Num1
Num2
1
10
3
2
40
4
Table 2
ID2
Num3
1
9
2
39
3
22
4
12
For the first row in table 1, I create a range [10-3,10+3] =[7,13] => this becomes the range for the first query.
For the second row in table 2, I create a range [40-4,40+4] =[36,44] => this becomes the range for the second query.
I am currently doing collect() and iterating over the rows to get the values. I use these values as ranges in my queries for Table 2.
Output of Query 1:
ID2
Num3
1
9
4
12
Output of Query 2:
ID2
Num3
2
39
Since the number of rows in table 1 is very large, doing a collect() operation is costly.
And since the values are numeric, I assume a join won't work.
Any help in optimizing this task is appreciated.
Depending on what you want your output to look like, you could solve this with a join. Consider the following code:
case class FirstType(id1: Int, num1: Int, num2: Int)
case class Bounds(id1: Int, lowerBound: Int, upperBound: Int)
case class SecondType(id2: Int, num3: Int)
val df = Seq((1, 10, 3), (2, 40, 4)).toDF("id1", "num1", "num2").as[FirstType]
df.show
+---+----+----+
|id1|num1|num2|
+---+----+----+
| 1| 10| 3|
| 2| 40| 4|
+---+----+----+
val df2 = Seq((1, 9), (2, 39), (3, 22), (4, 12)).toDF("id2", "num3").as[SecondType]
df2.show
+---+----+
|id2|num3|
+---+----+
| 1| 9|
| 2| 39|
| 3| 22|
| 4| 12|
+---+----+
val bounds = df.map(x => Bounds(x.id1, x.num1 - x.num2, x.num1 + x.num2))
bounds.show
+---+----------+----------+
|id1|lowerBound|upperBound|
+---+----------+----------+
| 1| 7| 13|
| 2| 36| 44|
+---+----------+----------+
val test = bounds.join(df2, df2("num3") >= bounds("lowerBound") && df2("num3") <= bounds("upperBound"))
test.show
+---+----------+----------+---+----+
|id1|lowerBound|upperBound|id2|num3|
+---+----------+----------+---+----+
| 1| 7| 13| 1| 9|
| 2| 36| 44| 2| 39|
| 1| 7| 13| 4| 12|
+---+----------+----------+---+----+
In here, I do the following:
Create 3 case classes to be able to use typed datasets later on
Create the 2 dataframes
Create an auxilliary dataframe called bounds, which contains the lower/upper bounds
Join the second dataframe onto that auxilliary one
As you can see, the test dataframe contains the result. For each unique combination of the id1, lowerBound and upperBound columns you'll get the different dataframes that you wanted if you look at the id2 and num3 columns only.
You could, for example use a groupBy operation to group by these 3 columns and then do whatever you wanted with the output KeyValueGroupedDataset (something like test.groupBy("id1", "lowerBound", "upperBound")). From there on it depends on what you want: if you want to apply an operation to each dataset for each of the bounds you could use the mapValues method of KeyValueGroupedDataset.
Hope this helps!
Below is my Hive table definition:
CREATE EXTERNAL TABLE IF NOT EXISTS default.test2(
id integer,
count integer
)
PARTITIONED BY (
fac STRING,
fiscaldate_str DATE )
STORED AS PARQUET
LOCATION 's3://<bucket name>/backup/test2';
I have the data in hive table as below, (I just inserted sample data)
select * from default.test2
+---+-----+----+--------------+
| id|count| fac|fiscaldate_str|
+---+-----+----+--------------+
| 2| 3| NRM| 2019-01-01|
| 1| 2| NRM| 2019-01-01|
| 2| 3| NRM| 2019-01-02|
| 1| 2| NRM| 2019-01-02|
| 2| 3| NRM| 2019-01-03|
| 1| 2| NRM| 2019-01-03|
| 2| 3|STST| 2019-01-01|
| 1| 2|STST| 2019-01-01|
| 2| 3|STST| 2019-01-02|
| 1| 2|STST| 2019-01-02|
| 2| 3|STST| 2019-01-03|
| 1| 2|STST| 2019-01-03|
+---+-----+----+--------------+
This table is partitioned on two columns (fac, fiscaldate_str) and we are trying to dynamically execute insert overwrite at partition level by using spark dataframes - dataframe writer.
However, when trying this, we are either ending up with duplicate data or all other partitions got deleted.
Below are the codes snippets for this using spark dataframe.
First I am creating dataframe as
df = spark.createDataFrame([(99,99,'NRM','2019-01-01'),(999,999,'NRM','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.show(2,False)
+---+-----+---+--------------+
|id |count|fac|fiscaldate_str|
+---+-----+---+--------------+
|99 |99 |NRM|2019-01-01 |
|999|999 |NRM|2019-01-01 |
+---+-----+---+--------------+
Getting duplicate with below snippet,
df.coalesce(1).write.mode("overwrite").insertInto("default.test2")
All other data get deleted and only the new data is available.
df.coalesce(1).write.mode("overwrite").saveAsTable("default.test2")
OR
df.createOrReplaceTempView("tempview")
tbl_ald_kpiv_hist_insert = spark.sql("""
INSERT OVERWRITE TABLE default.test2
partition(fac,fiscaldate_str)
select * from tempview
""")
I am using AWS EMR with Spark 2.4.0 and Hive 2.3.4-amzn-1 along with S3.
Can anyone have any idea why I am not able to dynamically overwrite the data into partitions ?
Your question is less easy to follow, but I think you mean you want a partition overwritten. If so, then this is what you need, all you need - the second line:
df = spark.createDataFrame([(99,99,'AAA','2019-01-02'),(999,999,'BBB','2019-01-01')], ['id','count','fac','fiscaldate_str'])
df.coalesce(1).write.mode("overwrite").insertInto("test2",overwrite=True)
Note the overwrite=True. The comment made is neither here nor there, as the DF.writer is being used. I am not addressing the coalesce(1).
Comment to Asker
I ran this as I standardly do - when prototyping and answering here - on a Databricks Notebook and expressly set the following and it worked fine:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","static")
spark.conf.set("hive.exec.dynamic.partition.mode", "strict")
You ask to update the answer with:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic").
Can do as I have just done; may be in your environment this is needed, but I did certainly not need to do so.
UPDATE 19/3/20
This worked on prior Spark releases, now the following applie afaics:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// In Databricks did not matter the below settings
//spark.conf.set("hive.exec.dynamic.partition", "true")
//spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
Seq(("CompanyA1", "A"), ("CompanyA2", "A"),
("CompanyB1", "B"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.partitionBy("id")
.saveAsTable("KQCAMS9")
spark.sql(s"SELECT * FROM KQCAMS9").show(false)
val df = Seq(("CompanyA3", "A"))
.toDF("company", "id")
// disregard coalsece
df.coalesce(1).write.mode("overwrite").insertInto("KQCAMS9")
spark.sql(s"SELECT * FROM KQCAMS9").show(false)
spark.sql(s"show partitions KQCAMS9").show(false)
All OK this way now from 2.4.x. onwards.
This question already has answers here:
How to avoid duplicate columns after join?
(10 answers)
Closed 4 years ago.
I want to use join with 3 dataframe, but there are some columns we don't need or have some duplicate name with other dataframes, so I want to drop some columns like below:
result_df = (aa_df.join(bb_df, 'id', 'left')
.join(cc_df, 'id', 'left')
.withColumnRenamed(bb_df.status, 'user_status'))
Please note that status column is in two dataframes, i.e. aa_df and bb_df.
The above doesn't work. I also tried to use withColumn, but the new column is created, and the old column is still existed.
If you are trying to rename the status column of bb_df dataframe then you can do so while joining as
result_df = aa_df.join(bb_df.withColumnRenamed('status', 'user_status'),'id', 'left').join(cc_df, 'id', 'left')
I want to use join with 3 dataframe, but there are some columns we don't need or have some duplicate name with other dataframes
That's a fine use case for aliasing a Dataset using alias or as operators.
alias(alias: String): Dataset[T] or alias(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set. Same as as.
as(alias: String): Dataset[T] or as(alias: Symbol): Dataset[T]
Returns a new Dataset with an alias set.
(And honestly I did only now see the Symbol-based variants.)
NOTE There are two as operators, as for aliasing and as for type mapping. Consult the Dataset API.
After you've aliases a Dataset, you can reference columns using [alias].[columnName] format. This is particularly handy with joins and star column dereferencing using *.
val ds1 = spark.range(5)
scala> ds1.as('one).select($"one.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
val ds2 = spark.range(10)
// Using joins with aliased datasets
// where clause is in a longer form to demo how ot reference columns by alias
scala> ds1.as('one).join(ds2.as('two)).where($"one.id" === $"two.id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
so I want to drop some columns like below
My general recommendation is not to drop columns, but select what you want to include in the result. That makes life more predictable as you know what you get (not what you don't). I was told that our brains work by positives which could also make a point for select.
So, as you asked and I showed in the above example, the result has two columns of the same name id. The question is how to have only one.
There are at least two answers with using the variant of join operator with the join columns or condition included (as you did show in your question), but that would not answer your real question about "dropping unwanted columns", would it?
Given I prefer select (over drop), I'd do the following to have a single id column:
val q = ds1.as('one)
.join(ds2.as('two))
.where($"one.id" === $"two.id")
.select("one.*") // <-- select columns from "one" dataset
scala> q.show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Regardless of the reasons why you asked the question (which could also be answered with the points I raised above), let me answer the (burning) question how to use withColumnRenamed when there are two matching columns (after join).
Let's assume you ended up with the following query and so you've got two id columns (per join side).
val q = ds1.as('one)
.join(ds2.as('two))
.where($"one.id" === $"two.id")
scala> q.show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
withColumnRenamed won't work for this use case since it does not accept aliased column names.
scala> q.withColumnRenamed("one.id", "one_id").show
+---+---+
| id| id|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
You could select the columns you're interested in as follows:
scala> q.select("one.id").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
scala> q.select("two.*").show
+---+
| id|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
+---+
Please see the docs : withColumnRenamed()
You need to pass the name of the existing column and the new name to the function. Both of these should be strings.
result_df = aa_df.join(bb_df,'id', 'left').join(cc_df, 'id', 'left').withColumnRenamed('status', 'user_status')
If you have 'status' columns in 2 dataframes, you can use them in the join as aa_df.join(bb_df, ['id','status'], 'left') assuming aa_df and bb_df have the common column. This way you will not end up having 2 'status' columns.
Im trying to load a dataframe into hive table which is partitioned like below.
> create table emptab(id int, name String, salary int, dept String)
> partitioned by (location String)
> row format delimited
> fields terminated by ','
> stored as parquet;
I have a dataframe created in the below format:
val empfile = sc.textFile("emp")
val empdata = empfile.map(e => e.split(","))
case class employee(id:Int, name:String, salary:Int, dept:String)
val empRDD = empdata.map(e => employee(e(0).toInt, e(1), e(2).toint, e(3)))
val empDF = empRDD.toDF()
empDF.write.partitionBy("location").insertInto("/user/hive/warehouse/emptab/location=England")
But Im getting an error as below:
empDF.write.partitionBy("location").insertInto("/user/hive/warehouse/emptab/location=India")
java.lang.RuntimeException: [1.1] failure: identifier expected
/user/hive/warehouse/emptab/location=England
Data in "emp" file:
---+-------+------+-----+
| id| name|salary| dept|
+---+-------+------+-----+
| 1| Mark| 1000| HR|
| 2| Peter| 1200|SALES|
| 3| Henry| 1500| HR|
| 4| Adam| 2000| IT|
| 5| Steve| 2500| IT|
| 6| Brian| 2700| IT|
| 7|Michael| 3000| HR|
| 8| Steve| 10000|SALES|
| 9| Peter| 7000| HR|
| 10| Dan| 6000| BS|
+---+-------+------+-----+
Also this is the first time loading the empty Hive table which is partitioned. I am trying to create a partition while loading the data into Hive table.
Could anyone tell what is the mistake I am doing here and how can I correct it ?
This is a wrong approach.
When you say the partition path, that is not a "valid" Hadoop path.
What you have to do is:
val empDF = empRDD.toDF()
val empDFFiltered = empDF.filter(empDF.location == "India")
empDFFiltered.write.partitionBy("location").insertInto("/user/hive/warehouse/emptab")
The path will be handle by the partitionBy, if you want only add the information to partition India you should filter the India data from your dataframe.
I use Window.sum function to get the sum of a value in an RDD, but when I convert the DataFrame to an RDD, I found that the result's has only one partition. When does the repartitioning occur?
val rdd = sc.parallelize(List(1,3,2,4,5,6,7,8), 4)
val df = rdd.toDF("values").
withColumn("csum", sum(col("values")).over(Window.orderBy("values")))
df.show()
println(s"numPartitions ${df.rdd.getNumPartitions}")
// 1
//df is:
// +------+----+
// |values|csum|
// +------+----+
// | 1| 1|
// | 2| 3|
// | 3| 6|
// | 4| 10|
// | 5| 15|
// | 6| 21|
// | 7| 28|
// | 8| 36|
// +------+----+
I add partitionBy in Window ,but the result is error,what should i do?this is my change code:
val rdd=sc.parallelize(List(1,3,2,4,5,6,7,8),4)
val sqlContext = new SQLContext(m_sparkCtx)
import sqlContext.implicits._
val df = rdd.toDF("values").withColumn("csum", sum(col("values")).over(Window.partitionBy("values").orderBy("values")))
df.show()
println(s"numPartitions ${df.rdd.getNumPartitions}")
//1
//df is:
// +------+----+
// |values|csum|
// +------+----+
// | 1| 1|
// | 6| 6|
// | 3| 3|
// | 5| 5|
// | 4| 4|
// | 8| 8|
// | 7| 7|
// | 2| 2|
// +------+----+
Window function has partitionBy api for grouping the dataframe and orderBy to order the grouped rows in ascending or descending order.
In your first case you hadn't defined partitionBy, thus all the values were grouped in one dataframe for ordering purpose and thus shuffling the data into one partition.
But in your second case you had partitionBy defined on values itself. So since each value are distinct, each row is grouped into individual groups.
The partition in second case is 200 as that is the default partitioning defined in spark when you haven't defined partitions and shuffle occurs
To get the same result from your second case as you get with the first case, you need to group your dataframe as in your first case i.e. into one group. For that you will need to create another column with constant value and use that value for partitionBy.
When you create a column as
withColumn("csum", sum(col("values")).over(Window.orderBy("values")))
The Window.orderBy("values") is ordering the values of column "values" in single partition since you haven't defined partitionBy() method to define the partition.
This is changing the number of partition from initial 4 to 1.
The partition is 200 in your second case since the partitionBy()method uses 200 as default partition. if you need the number of partition as 4 you can use methods like repartition(4) or coalesce(4)
Hope you got the point!