How to delete rows in the YugabyteDB YCQL Spark connector?

[Question posted by a user on YugabyteDB Community Slack]
Is it possible to delete rows using the YugabyteDB YCQL Spark connector? If yes, how?

Yes, it is possible. You can apply this patch to the repo to see how you would run a delete inside a test case:
diff --git a/java/yb-cql-4x/src/test/java/org/yb/loadtest/TestSpark3Jsonb.java b/java/yb-cql-4x/src/test/java/org/yb/loadtest/TestSpark3Jsonb.java
index 50d075b529..3ccf3b42f4 100644
--- a/java/yb-cql-4x/src/test/java/org/yb/loadtest/TestSpark3Jsonb.java
+++ b/java/yb-cql-4x/src/test/java/org/yb/loadtest/TestSpark3Jsonb.java
@@ -124,6 +124,7 @@ public class TestSpark3Jsonb extends BaseMiniClusterTest {
}
// TODO to update a JSONB sub-object only using Spark SQL --
// for now requires using the Cassandra session directly.
+ session.execute("delete from " + tableWithKeysapce + " where id=3");
String update = "update " + tableWithKeysapce +
" set phone->'key'->1->'m'->2->'b'='320' where id=4";
session.execute(update);
In the test output, you would see:
+---+----------------------+-----+-----------------------------------------------------------------+
|id |address |name |phone |
+---+----------------------+-----+-----------------------------------------------------------------+
|2 |Acton London, UK |Nick |{"code":"+43","phone":1200} |
|4 |4 Act London, UK |Kumar|{"code":"+45","key":[0,{"m":[12,-1,{"b":320},500]}],"phone":1500}|
|1 |Hammersmith London, UK|John |{"code":"+42","phone":1000} |
+---+----------------------+-----+-----------------------------------------------------------------+
and the following row (id=3) no longer appears because it was deleted:
|3 |11 Acton London, UK |Smith|{"code":"+44","phone":1400}|
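Outside of the test suite, a minimal sketch of the same delete from a Spark application is to execute the CQL through the connector's CassandraConnector helper. This assumes the DataStax-compatible connector API that the YugabyteDB connector is based on, a SparkSession named spark, and a placeholder keyspace/table name test.employee:
import com.datastax.spark.connector.cql.CassandraConnector

// Reuse the connection settings already configured on the SparkContext.
val connector = CassandraConnector(spark.sparkContext.getConf)
connector.withSessionDo { session =>
  session.execute("DELETE FROM test.employee WHERE id = 3")
}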

Related

Renaming a table & keeping connections to existing partitions in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
After renaming a table, do the existing partitions attached to that table remain as they are?
Yes.
yugabyte=# \dt
List of relations
Schema | Name | Type | Owner
--------+-----------------------+-------+----------
public | order_changes | table | yugabyte
public | order_changes_2019_02 | table | yugabyte
public | order_changes_2019_03 | table | yugabyte
public | order_changes_2020_11 | table | yugabyte
public | order_changes_2020_12 | table | yugabyte
public | order_changes_2021_01 | table | yugabyte
public | people | table | yugabyte
public | people1 | table | yugabyte
public | user_audit | table | yugabyte
public | user_credentials | table | yugabyte
public | user_profile | table | yugabyte
public | user_svc_account | table | yugabyte
(12 rows)
yugabyte=# alter table order_changes RENAME TO oc;
ALTER TABLE
yugabyte=# \dS+ oc
Table "public.oc"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
-------------+------+-----------+----------+---------+----------+--------------+-------------
change_date | date | | | | plain | |
type | text | | | | extended | |
description | text | | | | extended | |
Partition key: RANGE (change_date)
Partitions: order_changes_2019_02 FOR VALUES FROM ('2019-02-01') TO ('2019-03-01'),
order_changes_2019_03 FOR VALUES FROM ('2019-03-01') TO ('2019-04-01'),
order_changes_2020_11 FOR VALUES FROM ('2020-11-01') TO ('2020-12-01'),
order_changes_2020_12 FOR VALUES FROM ('2020-12-01') TO ('2021-01-01'),
order_changes_2021_01 FOR VALUES FROM ('2021-01-01') TO ('2021-02-01')
Postgres, and therefore YugabyteDB, doesn't actually use the name of an object internally; it uses the object's OID (object ID).
That means you can rename an object without causing any harm, because the name is simply an entry in the catalog and the object is identified by its OID.
This has other side effects as well: if you create a table, run a query such as ‘select count(*) from table’, drop the table, then create a new table with the same name and run the exact same query, you will get two records in pg_stat_statements with identical SQL text. This seems weird from the perspective of databases where the SQL area is shared; in Postgres, only pg_stat_statements is shared, and there is no SQL cache.
pg_stat_statements does not store the SQL text; it stores the query tree (an internal representation of the SQL) and symbolizes the tree, which makes it appear like SQL again. The query tree uses OIDs, and therefore, for pg_stat_statements, the two identical SQL texts above are different query trees, because the OIDs of the tables are different.

Spark, return multiple rows on group?

So, I have a Kafka topic containing the following data, and I'm working on a proof of concept of whether we can achieve what we're trying to do. I was previously trying to solve it within Kafka, but it seems that Kafka wasn't the right tool, so I'm looking at Spark now :)
The data in its basic form looks like this:
+--+------------+-------+---------+
|id|serialNumber|source |company |
+--+------------+-------+---------+
|1 |123ABC |system1|Acme |
|2 |3285624 |system1|Ajax |
|3 |CDE567 |system1|Emca |
|4 |XX |system2|Ajax |
|5 |3285624 |system2|Ajax&Sons|
|6 |0147852 |system2|Ajax |
|7 |123ABC |system2|Acme |
|8 |CDE567 |system2|Xaja |
+--+------------+-------+---------+
The main grouping column is serialNumber. The result should be that ids 1 and 7 match, as there is a full match on the company. Ids 2 and 5 should match because the company in id 2 is a partial (prefix) match of the company in id 5. Ids 3 and 8 should not match, as the companies don't match.
I expect the end result to be something like this. Note that the sources are not fixed to just one or two; in the future there will be more sources.
+------+-----+------------+-----------------+---------------+
|uuid |id |serialNumber|source |company |
+------+-----+------------+-----------------+---------------+
|<uuid>|[1,7]|123ABC |[system1,system2]|[Acme] |
|<uuid>|[2,5]|3285624     |[system1,system2]|[Ajax,Ajax&Sons]|
|<uuid>|[3] |CDE567 |[system1] |[Emca] |
|<uuid>|[4] |XX |[system2] |[Ajax] |
|<uuid>|[6] |0147852 |[system2] |[Ajax] |
|<uuid>|[8] |CDE567 |[system2] |[Xaja] |
+------+-----+------------+-----------------+---------------+
I was looking at groupByKey().mapGroups() but having problems finding examples. Can mapGroups() return more than one row?
You can simply groupBy on the serialNumber column and collect_list the other columns.
Code:
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF on a local Seq

val ds = Seq((1, "123ABC", "system1", "Acme"),
             (7, "123ABC", "system2", "Acme"))
  .toDF("id", "serialNumber", "source", "company")

ds.groupBy("serialNumber")
  .agg(
    collect_list("id").alias("id"),
    collect_list("source").alias("source"),
    collect_list("company").alias("company")
  )
  .show(false)
Output:
+------------+------+------------------+------------+
|serialNumber|id |source |company |
+------------+------+------------------+------------+
|123ABC |[1, 7]|[system1, system2]|[Acme, Acme]|
+------------+------+------------------+------------+
If you don't want duplicate values, use collect_set:
ds.groupBy("serialNumber")
.agg(
collect_list("id").alias("id"),
collect_list("source").alias("source"),
collect_set("company").alias("company")
)
.show(false)
Output with collect_set on company column:
+------------+------+------------------+-------+
|serialNumber|id |source |company|
+------------+------+------------------+-------+
|123ABC |[1, 7]|[system1, system2]|[Acme] |
+------------+------+------------------+-------+
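The expected output also contains a uuid column, which the answer above doesn't produce. One way to add it (an assumption on my part, not part of the original answer) is Spark SQL's uuid() function via expr, available since Spark 2.3:
import org.apache.spark.sql.functions.{collect_list, collect_set, expr}

// ds is the same dataframe built above.
ds.groupBy("serialNumber")
  .agg(
    collect_list("id").alias("id"),
    collect_set("source").alias("source"),
    collect_set("company").alias("company")
  )
  .withColumn("uuid", expr("uuid()")) // uuid() is non-deterministic; one value per row
  .show(false)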

Check value from Spark DataFrame column and do transformations

I have a dataframe consisting of person, transaction_id & is_successful. The dataframe contains duplicate values of person with different transaction_ids, and is_successful will be True/False for each transaction.
I would like to derive a new dataframe which has one record for each person, containing that person's latest transaction_id, with is_successful populated as True only if any of their transactions were successful.
val input_df = sc.parallelize(Seq((1,1, "True"), (1,2, "False"), (2,1, "False"), (2,2, "False"), (2,3, "True"), (3,1, "False"), (3,2, "False"), (3,3, "False"))).toDF("person","transaction_id", "is_successful")
input_df: org.apache.spark.sql.DataFrame = [person: int, transaction_id: int ... 1 more field]
input_df.show(false)
+------+--------------+-------------+
|person|transaction_id|is_successful|
+------+--------------+-------------+
|1 |1 |True |
|1 |2 |False |
|2 |1 |False |
|2 |2 |False |
|2 |3 |True |
|3 |1 |False |
|3 |2 |False |
|3 |3 |False |
+------+--------------+-------------+
Expected Df:
+------+--------------+-------------+
|person|transaction_id|is_successful|
+------+--------------+-------------+
|1 |2 |True |
|2 |3 |True |
|3 |3 |False |
+------+--------------+-------------+
How can we derive the dataframe like above?
What you can do is the following in Spark SQL:
select person, max(transaction_id) as transaction_id, max(is_successful) as is_successful from <table_name> group by person
Leave the complex work to the max operator. In the max operation, 'True' sorts above 'False', so if a person has three False values and one True, the max of that column is True.
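A DataFrame equivalent of that simple approach could look like this (a sketch, assuming the same input_df and that the string ordering of "True"/"False" is acceptable):
import org.apache.spark.sql.functions.max

input_df.groupBy("person")
  .agg(
    max("transaction_id").as("transaction_id"),
    max("is_successful").as("is_successful") // relies on "True" > "False" in string ordering
  )
  .show(false)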
You may achieve this by grouping your dataframe on person and finding the max transaction_id and max is_successful.
I've included an example below of how this may be achieved using Spark SQL.
First, create a temporary view of your dataframe so it can be accessed from Spark SQL, then run the SQL statement shown below:
input_df.createOrReplaceTempView("input_df");
val result_df = sparkSession.sql("<insert sql below here>");
The SQL statement groups the data for each person, then uses max to determine the last transaction id and a combination of max (sum could be used with the same logic) and case expressions to derive the is_successful value. The case expressions are nested: the inner one converts True to 1 and False to 0 to allow a numeric comparison, and the outer one checks whether the max of those values is > 0 (i.e. any transaction was successful) before printing True/False.
SELECT
person,
MAX(transaction_id) as transaction_id,
CASE
WHEN MAX(
CASE
WHEN is_successful = 'True' THEN 1
ELSE 0
END
) > 0 THEN 'True'
ELSE 'False'
END as is_successful
FROM
input_df
GROUP BY
person
Here is @ggordon's SQL answer expressed as a DataFrame:
import org.apache.spark.sql.functions._

input_df.groupBy("person")
  .agg(
    max("transaction_id").as("transaction_id"),
    when(max(when(col("is_successful") === "True", 1).otherwise(0)) > 0, "True")
      .otherwise("False").as("is_successful")
  )

Spark - match states inside a row of dataframe

Below is my dataframe, which I was able to wrangle and extract from multi-struct JSON files:
-------------------------------------------
Col1 | Col2| Col3 | Col4
-------------------------------------------
A | 1 |2018-03-28T19:03:39| Active
-------------------------------------------
A | 1 |2018-03-28T19:03:40| Clear
-------------------------------------------
A | 1 |2018-03-28T19:11:21| Active
-------------------------------------------
A | 1 |2018-03-28T20:13:06| Active
-------------------------------------------
A | 1 |2018-03-28T20:13:07| Clear
-------------------------------------------
This is what I came up with by grouping by keys
A|1|[(2018-03-28T19:03:39,Active),(2018-03-28T19:03:40,Clear),(2018-03-28T19:11:21,Active),(2018-03-28T20:13:06,Active),(2018-03-28T20:13:07,Clear)]
and this is my desired output..
--------------------------------------------------------
Col1 | Col2| Active time | Clear Time
--------------------------------------------------------
A | 1 |2018-03-28T19:03:39| 2018-03-28T19:03:40
--------------------------------------------------------
A | 1 |2018-03-28T20:13:06| 2018-03-28T20:13:07
--------------------------------------------------------
I am kind of stuck at this step and not sure how to proceed further to get the desired output. Any direction is appreciated.
Spark version - 2.1.1
Scala version - 2.11.8
You can use a window function with partitioning and ordering to get the consecutive Active and Clear times. Since you want to filter out the rows that don't have a consecutive Clear or Active status, you also need a filter.
so if you have dataframe as
+----+----+-------------------+------+
|Col1|Col2|Col3 |Col4 |
+----+----+-------------------+------+
|A |1 |2018-03-28T19:03:39|Active|
|A |1 |2018-03-28T19:03:40|Clear |
|A |1 |2018-03-28T19:11:21|Active|
|A |1 |2018-03-28T20:13:06|Active|
|A |1 |2018-03-28T20:13:07|Clear |
+----+----+-------------------+------+
you can simply do as I explained above
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

def windowSpec = Window.partitionBy("Col1", "Col2").orderBy("Col3")

// "active" holds the previous row's (Col3, Col4) within each (Col1, Col2) group;
// keep only rows where the previous status was Active and the current one is Clear.
df.withColumn("active", lag(struct(col("Col3"), col("Col4")), 1).over(windowSpec))
  .filter(col("active.Col4") === "Active" && col("Col4") === "Clear")
  .select(col("Col1"), col("Col2"), col("active.Col3").as("Active Time"), col("Col3").as("Clear Time"))
  .show(false)
and you should get
+----+----+-------------------+-------------------+
|Col1|Col2|Active Time |Clear Time |
+----+----+-------------------+-------------------+
|A |1 |2018-03-28T19:03:39|2018-03-28T19:03:40|
|A |1 |2018-03-28T20:13:06|2018-03-28T20:13:07|
+----+----+-------------------+-------------------+
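For reference, a minimal construction of the input dataframe used above (a sketch; it assumes a spark-shell style session where spark.implicits are available):
import spark.implicits._

val df = Seq(
  ("A", 1, "2018-03-28T19:03:39", "Active"),
  ("A", 1, "2018-03-28T19:03:40", "Clear"),
  ("A", 1, "2018-03-28T19:11:21", "Active"),
  ("A", 1, "2018-03-28T20:13:06", "Active"),
  ("A", 1, "2018-03-28T20:13:07", "Clear")
).toDF("Col1", "Col2", "Col3", "Col4")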

How to add column and records on a dataset given a condition

I'm working on a program that brands data as OutOfRange based on the values present on certain columns.
I have three columns: Age, Height, and Weight. I want to create a fourth column called OutOfRange and assign it a value of 0 (false) or 1 (true) if the values in those three columns exceed a specific threshold.
If age is lower than 18 or higher than 60, that row will be assigned a value of 1 (0 otherwise). If height is lower than 5, that row will be assigned a value of 1 (0 otherwise), and so on.
Is it possible to create a column and then add/overwrite values to that column? It would be awesome if I can do that with Spark. I know SQL so if there is anything I can do with the dataset.SQL() function please let me know.
Given a dataframe as
+---+------+------+
|Age|Height|Weight|
+---+------+------+
|20 |3 |70 |
|17 |6 |80 |
|30 |5 |60 |
|61 |7 |90 |
+---+------+------+
You can use the when function to apply the logic explained in the question:
import org.apache.spark.sql.functions._
df.withColumn("OutOfRange", when(col("Age") <18 || col("Age") > 60 || col("Height") < 5, 1).otherwise(0))
which would result in the following dataframe:
+---+------+------+----------+
|Age|Height|Weight|OutOfRange|
+---+------+------+----------+
|20 |3 |70 |1 |
|17 |6 |80 |1 |
|30 |5 |60 |0 |
|61 |7 |90 |1 |
+---+------+------+----------+
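Since you mentioned knowing SQL, the same logic can also be expressed through a temporary view and spark.sql (a sketch, not part of the original answer; the view name people_metrics is arbitrary):
df.createOrReplaceTempView("people_metrics") // hypothetical view name

val result = spark.sql("""
  SELECT Age, Height, Weight,
         CASE WHEN Age < 18 OR Age > 60 OR Height < 5 THEN 1 ELSE 0 END AS OutOfRange
  FROM people_metrics
""")
result.show(false)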
Is it possible to create a column and then add/overwrite values to that column? It would be awesome if I can do that with Spark. I know SQL so if there is anything I can do with the dataset.SQL() function please let me know.
This is not possible without recreating the Dataset altogether, since Datasets are inherently immutable.
However you can save the Dataset as a Hive table, which will allow you to do what you want to do. Saving the Dataset as a Hive table will write the contents of your Dataset to disk under the default spark-warehouse directory.
df.write.mode("overwrite").saveAsTable("my_table")
// Add a row
spark.sql("insert into my_table (Age, Height, Weight, OutofRange) values (20, 30, 70, 1)
// Update a row
spark.sql("update my_table set OutOfRange = 1 where Age > 30")
....
Hive support must be enabled for Spark at the time the session is created in order to do this.
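A minimal sketch of creating such a session (assuming the standard SparkSession builder; the app name is illustrative):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("out-of-range-demo") // hypothetical app name
  .enableHiveSupport()          // required for saveAsTable / the Hive-backed SQL above
  .getOrCreate()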
