I am new to Cassandra, so I have a few quick questions. Suppose I do this:
CREATE TABLE my_keyspace.my_table (
id bigint,
year int,
datetime timestamp,
field1 int,
field2 int,
PRIMARY KEY ((id, year), datetime))
I imagine Cassandra as something like Map<PartitionKey, SortedMap<ColKey, ColVal>>.
My question is: when querying Cassandra with a WHERE clause, for example:
SELECT * FROM my_keyspace.my_table WHERE id = 1 AND year = 4
this could return 2 or more records. How does this fit with Cassandra's data model?
If it really is a big HashMap, how come multiple records for a single partition key are allowed?
Thanks!
Each row is stored as a batch of entries in the SortedMap<ColKey, ColVal>, exploiting its sorted nature.
To build on your mental model: while there is only one partition for id = 1 AND year = 4, that partition contains multiple cells:
(id, year) | ColKey | ColVal
------------------------------------------
1, 4 | datetime(1):field1 | 1 \ Row1
1, 4 | datetime(1):field2 | 2 /
1, 4 | datetime(5):field1 | 1 \
1, 4 | datetime(5):field2 | 2 / Row2
...
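To make that mental model concrete, here is a rough Python sketch (purely illustrative: the datetime values are made-up placeholders, and this is not how Cassandra's storage engine actually works):
# Partition key (id, year) -> "sorted map" of (clustering key, column) -> value.
table = {
    (1, 4): {
        ("2015-01-01 00:00:01", "field1"): 1,   # \ Row 1
        ("2015-01-01 00:00:01", "field2"): 2,   # /
        ("2015-01-01 00:00:05", "field1"): 1,   # \ Row 2
        ("2015-01-01 00:00:05", "field2"): 2,   # /
    },
}
# SELECT * FROM my_table WHERE id = 1 AND year = 4 reads exactly one partition;
# the cells are then regrouped into rows by their clustering key (datetime).
rows = {}
for (dt, field), value in sorted(table[(1, 4)].items()):
    rows.setdefault(dt, {})[field] = value
print(rows)
# {'2015-01-01 00:00:01': {'field1': 1, 'field2': 2},
#  '2015-01-01 00:00:05': {'field1': 1, 'field2': 2}}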
I need to replace values of several columns (many more than those in the example, so I would like to avoid doing multiple left joins) of a dataframe with values from another dataframe (mapping).
Example:
df1 EXAM
id | question1 | question2 | question3
---+-----------+-----------+----------
 1 |        12 |        12 |         5
 2 |        12 |        13 |         6
 3 |         3 |         7 |         5
df2 VOTE MAPPING:
id | description
---+-------------
 3 | bad
 5 | insufficient
 6 | sufficient
12 | very good
13 | excellent
Output
id | question1 | question2 | question3
---+-----------+-----------+-------------
 1 | very good | very good | insufficient
 2 | very good | excellent | sufficient
 3 | bad       | null      | insufficient
Edit 1: Corrected id for excellent in vote map
First of all, you can create a reference dataframe:
from pyspark.sql import functions as func
from pyspark.sql.types import MapType, StringType

df3 = df2.select(
    func.create_map(func.col('id'), func.col('description')).alias('ref')
).groupBy().agg(
    func.collect_list('ref').alias('ref')
).withColumn(
    'ref', func.udf(lambda lst: {k: v for element in lst for k, v in element.items()},
                    returnType=MapType(StringType(), StringType()))(func.col('ref'))
)
+--------------------------------------------------------------------------------+
|ref                                                                             |
+--------------------------------------------------------------------------------+
|{3 -> bad, 12 -> very good, 5 -> insufficient, 13 -> excellent, 6 -> sufficient}|
+--------------------------------------------------------------------------------+
Then you can replace the values in the question columns by looking them up in the reference map with a single crossJoin:
df4 = df1.crossJoin(df3)\
.select(
'id',
*[func.col('ref').getItem(func.col(col)).alias(col) for col in df1.columns[1:]]
)
df4.show(10, False)
+---+---------+---------+------------+
|id |question1|question2|question3   |
+---+---------+---------+------------+
|1  |very good|very good|insufficient|
|2  |very good|excellent|sufficient  |
|3  |bad      |null     |insufficient|
+---+---------+---------+------------+
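As a side note, here is an alternative sketch that avoids the UDF, assuming df2 is small enough to collect to the driver (column names follow the question; func is the pyspark.sql.functions alias used above):
from itertools import chain
# Build one literal map column, e.g. {3: 'bad', 5: 'insufficient', ...}
mapping = {row['id']: row['description'] for row in df2.collect()}
mapping_expr = func.create_map(*[func.lit(x) for x in chain(*mapping.items())])
df5 = df1.select(
    'id',
    *[mapping_expr.getItem(func.col(c)).alias(c) for c in df1.columns[1:]]
)
df5.show(10, False)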
[Question posted by a user on YugabyteDB Community Slack]
Is there a way to identify whether an upsert operation like the one shown below inserts or updates the row, e.g., with the Java or Golang driver?
UPDATE test set value = 'value1', checkpoint = 'cas1' WHERE key = 'key1' IF checkpoint = '' OR NOT EXISTS;
RETURNS STATUS AS ROW is a YCQL feature. In YSQL, you could use an AFTER INSERT OR UPDATE... EACH ROW trigger to detect the outcome. The challenge, then, would be to surface the result in the session that made the change. You could use a user-defined run-time parameter (set my_app.outcome = 'true') or a temp table.
—regards, bryn#yugabyte.com
You can use RETURNS STATUS AS ROW as documented here: https://docs.yugabyte.com/preview/api/ycql/dml_update/#returns-status-as-row
Example:
cqlsh:sample> CREATE TABLE test(h INT, r INT, v LIST<INT>, PRIMARY KEY(h,r)) WITH transactions={'enabled': true};
cqlsh:sample> INSERT INTO test(h,r,v) VALUES (1,1,[1,2]);
Unapplied update when IF condition is false:
cqlsh:sample> UPDATE test SET v[2] = 4 WHERE h = 1 AND r = 1 IF v[1] = 3 RETURNS STATUS AS ROW;
[applied] | [message] | h | r | v
-----------+-----------+---+---+--------
False | null | 1 | 1 | [1, 2]
Applied update when the IF condition is true:
cqlsh:sample> UPDATE test SET v[0] = 4 WHERE h = 1 AND r = 1 IF v[1] = 2 RETURNS STATUS AS ROW;
[applied] | [message] | h | r | v
-----------+-----------+------+------+------
True | null | null | null | null
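The question asks about the Java or Golang driver; the same idea expressed in Python (a rough sketch using the YCQL fork of the Cassandra driver, with hypothetical connection details) is to read the [applied] column of the status row that RETURNS STATUS AS ROW sends back:
from cassandra.cluster import Cluster
from cassandra.query import dict_factory

cluster = Cluster(['127.0.0.1'], port=9042)   # hypothetical contact point
session = cluster.connect('sample')
session.row_factory = dict_factory            # so "[applied]" can be read by name
row = session.execute(
    "UPDATE test SET v[0] = 4 WHERE h = 1 AND r = 1 "
    "IF v[1] = 2 RETURNS STATUS AS ROW"
).one()
print(row['[applied]'])   # True if the update was applied, False otherwise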
I have the following dataframe
date | value | ID
--------------------------------------
2021-12-06 15:00:00 25 1
2021-12-06 15:15:00 35 1
2021-11-30 00:00:00 20 2
2021-11-25 00:00:00 10 2
I want to join this DF with another one like this:
idUser | Name | Gender
-------------------
1 John M
2 Anne F
My expected output is:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
What I need is to get only the most recent value from the first dataframe and join only that value with my second dataframe. However, my Spark script is joining both values:
My code:
from pyspark.sql.functions import col, max
from pyspark.sql.types import StringType

df = df1.select(
    col("date"),
    col("value"),
    col("ID"),
).orderBy(
    col("ID").asc(),
    col("date").desc(),
).groupBy(
    col("ID"), col("date").cast(StringType()).substr(0, 10).alias("date")
).agg(
    max(col("value")).alias("value")
)
final_df = df2.join(
df,
(col("idUser") == col("ID")),
how="left"
)
When I perform this join (formatting the columns is abstracted in this post) I have the following output:
ID | Name | Gender | Value
---------------------------
1 John M 35
2 Anne F 20
2 Anne F 10
I use substr to remove hours and minutes so I can group only by date. But when I have the same ID on different days, my output df has the 2 values instead of only the most recent one. How can I fix this?
Note: I'm using only PySpark functions to do this (I do not want to use spark.sql(...)).
You can use a window and the row_number function in PySpark:
from pyspark.sql import functions as f
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

windowSpec = Window.partitionBy("ID").orderBy(f.col("date").desc())
df1_latest_val = df1.withColumn("row_number", row_number().over(windowSpec)).filter(
    f.col("row_number") == 1
)
The output of df1_latest_val will look something like this:
date | value | ID | row_number |
-----------------------------------------------------
2021-12-06 15:15:00 35 1 1
2021-11-30 00:00:00 20 2 1
Now you have a DataFrame with the latest value per ID, which you can join directly with the other table, as sketched below.
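For completeness, a minimal sketch of the final join (column names taken from the question):
latest = df1_latest_val.select("ID", "value")
final_df = df2.join(latest, df2["idUser"] == latest["ID"], how="left")
final_df.select("ID", "Name", "Gender", "value").show()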
I am dealing with a dataframe called "Data" containing two columns, one is the userid, the other one a datetime object:
userid | eventTime
1 | 2018-11-01 07:36:58
2 | 2018-11-07 08:04:12
.. | ....
My goal is to replace each entry in the eventTime column with just the day of the month as an integer value, e.g. the first day of November (= 1) and the seventh day of November (= 7).
So the result should be:
userid | eventTime
1 | 1
2 | 7
.. | ....
How can I get this done?
I already extracted indices from the data frame and tried to modify it in a loop, but I don't know how to make it work:
temp = Data.index.get_values() #get indices from data frame
for temp, row in Data.iterrows():
    print(row['eventTime'])
df['eventTime'] = df['eventTime'].dt.day
Should work, assuming df is your dataframe.
Performing a vectorized operation like this is the most efficient way to work on a dataframe.
As an example:
from datetime import datetime
import pandas as pd

df = pd.DataFrame(data={'user': ['a', 'b', 'a', 'b'],
                        'eventTime': [datetime(2000, 1, 1),
                                      datetime(2000, 2, 2),
                                      datetime(2000, 3, 3),
                                      datetime(2000, 4, 4)]})
print(df)
# eventTime user
# 0 2000-01-01 a
# 1 2000-02-02 b
# 2 2000-03-03 a
# 3 2000-04-04 b
You can operate on a particular column with the apply() method, and datetime objects provide a datetime.day property that gives you the day of the month as an integer:
df['day'] = df.eventTime.apply(lambda x: x.day)
print(df)
# eventTime user day
# 0 2000-01-01 a 1
# 1 2000-02-02 b 2
# 2 2000-03-03 a 3
# 3 2000-04-04 b 4
If you want to replace the eventTime column instead of creating a new column, just use:
df['eventTime'] = df.eventTime.apply(lambda x: x.day)
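Applied to the Data frame from the question, replacing the column in place (a sketch; if eventTime is stored as strings, convert it first with pd.to_datetime):
Data['eventTime'] = pd.to_datetime(Data['eventTime']).dt.day
print(Data)
#    userid  eventTime
# 0       1          1
# 1       2          7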
I have to benchmark Cassandra with the Facebook Linkbench. There are two phase during the Benchmark, the load and the request phase.
In the load phase, Linkbench fills the Cassandra tables nodes, links, and counts (for link counting) with default values (graph data).
The count table looks like this:
CREATE TABLE keyspace.counttable (
link_id bigint,
link_type bigint,
time bigint,
version bigint,
count counter,
PRIMARY KEY (link_id, link_type, time, version)
);
My question is: how do I insert the default counter values (before incrementing and decrementing the counter in the Linkbench request phase)?
If that isn't possible with Cassandra, how should I increment/decrement a bigint column (instead of a counter column)?
Any suggestions and comments? Thanks a lot.
The default value is zero. Given
create table counttable (
link_id bigint,
link_type bigint,
time bigint,
version bigint,
count counter,
PRIMARY KEY (link_id, link_type, time, version)
);
and
update counttable set count = count + 1 where link_id = 1 and link_type = 1 and time = 1 and version = 1;
We see that the value of count is now 1.
select * from counttable ;
link_id | link_type | time | version | count
---------+-----------+------+---------+-------
1 | 1 | 1 | 1 | 1
(1 rows)
So, if we want to set it to some other value we can:
update counttable set count = count + 500 where link_id = 1 and link_type = 1 and time = 1 and version = 2;
select * from counttable ;
link_id | link_type | time | version | count
---------+-----------+------+---------+-------
1 | 1 | 1 | 1 | 1
1 | 1 | 1 | 2 | 500
(2 rows)
There is no elegant way to initialize a counter column with a non-zero value. The only operations you can do on a counter column are increment and decrement. I recommend keeping the offset (i.e., your intended initial value) separately, and simply adding the two values in your client application.
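A rough sketch of that client-side addition with the Python driver (note that a table with a counter column cannot also hold regular columns, so the initial value here lives in a hypothetical companion table count_offsets; names and connection details are assumptions for illustration):
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])              # hypothetical contact point
session = cluster.connect('my_keyspace')      # hypothetical keyspace

offset_row = session.execute(
    "SELECT initial_value FROM count_offsets WHERE link_id = 1 AND link_type = 1"
).one()
counter_row = session.execute(
    "SELECT count AS cnt FROM counttable "
    "WHERE link_id = 1 AND link_type = 1 AND time = 1 AND version = 1"
).one()

# effective value = intended initial value + counter increments/decrements
effective_count = offset_row.initial_value + counter_row.cnt
print(effective_count)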
Thank you for the answers. I implemented the following solution to initialize the counter field.
As the initial and default value of the counter field is 0, I incremented it with my default value. It looks like Don Branson's solution, but with only one column:
create table counttable (
link_id bigint,
link_type bigint,
count counter,
PRIMARY KEY (link_id, link_type)
);
I set the value with this statement (during the load phase):
update counttable set count = count + myValue where link_id = 1 and link_type = 1;
select * from counttable ;
link_id | link_type | count
---------+-----------+--------
1 | 1 | myValue (added to 0)
(1 row)