Add a string to an Array(String) without creating a new row in a ClickHouse table - python-3.x

I just started studying ClickHouse! I'm using Python with the clickhouse_connect library, and I can't figure out how to add a new string to an existing Array(String) column; each insert creates a new row instead of appending to the array.
My code:
import clickhouse_connect

ch_client = clickhouse_connect.get_client(host=ch_host, user=ch_user, password=ch_pass, database=ch_database)
ch_client.command(f'CREATE TABLE IF NOT EXISTS {ch_table} (key String, strings Array(String)) ENGINE MergeTree ORDER BY key')

# first insert creates the row for key '123'
insert_data = [['123', ['string1']]]
ch_client.insert(ch_table, insert_data, column_names=['key', 'strings'])

# second insert creates another row with the same key instead of appending 'string2' to the array
insert_data = [['123', ['string2']]]
ch_client.insert(ch_table, insert_data, column_names=['key', 'strings'])
Is there an easy way to append a string to the array if a row with that key already exists, and create a new row if it doesn't?

You could just insert your rows, then write a query that gives you what you want:
SELECT
key,
groupArrayArray(strings)
FROM ch_table
GROUP BY key;
If that works, you could create a materialized view from this query:
CREATE MATERIALIZED VIEW ch_table_view
ENGINE = AggregatingMergeTree
ORDER BY key
POPULATE AS
SELECT
key,
groupArrayArrayState(strings) AS strings_merged
FROM ch_table
GROUP BY key;
Notice the -State aggregate combinator was used, which keeps a "running total" of the array of strings. To read this column, you need to use the corresponding -Merge combinator:
SELECT
key,
groupArrayArrayMerge(strings_merged)
FROM ch_table_view
GROUP BY key;
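For completeness, a minimal sketch of reading the merged arrays back from Python with clickhouse_connect, reusing ch_client from the question (the helper name and the WHERE filter are illustrative, not from the answer above):

# hypothetical helper: fetch the combined array of strings for one key,
# assuming the ch_table_view materialized view defined above
def get_strings(ch_client, key):
    result = ch_client.query(
        'SELECT key, groupArrayArrayMerge(strings_merged) AS strings '
        'FROM ch_table_view WHERE key = %(key)s GROUP BY key',
        parameters={'key': key},
    )
    rows = result.result_rows
    return rows[0][1] if rows else []

print(get_strings(ch_client, '123'))  # e.g. ['string1', 'string2']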

Related

Reading guarantees for full table scan while updating the table?

Given schema:
CREATE TABLE keyspace.table (
    key text,
    ckey text,
    value text,
    PRIMARY KEY (key, ckey)
)
...and Spark pseudocode:
val sc: SparkContext = ...
val connector: CassandraConnector = ...
sc.cassandraTable("keyspace", "table")
  .mapPartitions { partition =>
    connector.withSessionDo { session =>
      partition.foreach { row =>
        val key = row.getString("key")
        val ckey = Random.nextString(42)
        val value = row.getString("value")
        // note: the whole statement needs the s-interpolator, and text values must be quoted
        session.execute(s"INSERT INTO keyspace.table (key, ckey, value)" +
          s" VALUES ('$key', '$ckey', '$value')")
      }
    }
  }
Is it possible for code like this to read a value it inserted earlier within a single application (Spark job) run? A more general version of my question: can a token range scan CQL query read newly inserted values while it is iterating over rows?
Yes, in general it is possible, exactly as Alex wrote, but I don't think it will happen with the code above.
Per the data model, rows within a partition are ordered by ckey in ascending order.
The funny part, however, is the page size and how many pages are prefetched. Since the default is 1000 rows (spark.cassandra.input.fetch.sizeInRows), the only way a problem could occur is if you used something much longer than 42 characters for ckey and/or the executor had not paged yet.
Also, I think you are using unnecessary nesting, so the code to achieve what you want could be simplified (after all, cassandraTable gives you an RDD); see the sketch after this answer.
(I hope I understand correctly: you want to read each partition (note that a partition in your case is all rows under one partition key, "key") and, for every row in it (distinguished by ckey), generate a new row with a new random ckey that duplicates the value. The use case for such code is a mystery to me, but I hope it makes sense. :-))
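As a rough illustration of that simplification, here is a sketch in PySpark using the Spark Cassandra Connector's DataFrame API (the question's code is Scala; treat this as an untested translation, with the random-ckey UDF as an assumption):

import uuid
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# generate a random ckey for every row
random_ckey = udf(lambda: uuid.uuid4().hex, StringType())

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="keyspace", table="table")
      .load())

# duplicate every row under a fresh ckey and append it back to the same table
(df.withColumn("ckey", random_ckey())
   .write.format("org.apache.spark.sql.cassandra")
   .options(keyspace="keyspace", table="table")
   .mode("append")
   .save())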

DynamoDB adds item instead of updates item when using update_item

I have an HTML table that is filled from a DynamoDB table. Clicking a row pops up an edit form in a modal, and the data entered is sent to a Flask server to update the edited item in DynamoDB. According to the AWS documentation, the correct method for this is update_item. However, when I use it, the item is added again instead of being updated. I used the AWS documentation to script the code below. In my DynamoDB table, the primary partition key is KEY1 and the primary sort key is KEY2.
table = dynamodb.Table('table_name') #define DynamoDB table
key1 = account_id #string value of account id
key2 = request.form["KEY2"] #this is a read only field in the form, so the key does not get updated here
form_val1 = request.form["input1"]
form_val2 = request.form["input2"]
form_val3 = request.form["input3"]
form_val4 = request.form["input4"]
form_val5 = request.form["input5"]
form_val6 = request.form["input6"]
form_val7 = request.form["input7"]
form_val8 = request.form["input8"]
form_val9 = request.form["input9"]
#update item in dynamo
table.update_item(
    Key={
        'KEY1': key1,  # partition key
        'KEY2': key2   # sort key
    },
    UpdateExpression='SET dbField1 = :val1, dbField2 = :val2, dbField3 = :val3, dbField4 = :val4, dbField5 = :val5, dbField6 = :val6, dbField7 = :val7, dbField8 = :val8, dbField9 = :val9',
    ExpressionAttributeValues={
        ':val1': form_val1,
        ':val2': form_val2,
        ':val3': form_val3,
        ':val4': form_val4,
        ':val5': form_val5,
        ':val6': form_val6,
        ':val7': form_val7,
        ':val8': form_val8,
        ':val9': form_val9
    }
)
You can't, and I will explain why that is not possible.
When you create a DynamoDB table with a partition key and a sort key, you automatically create an index over those two keys. An index is immutable, which means you can't update the keys; that is why DynamoDB creates a new item when you "update" them.
It's a problem with the definition of your table, because you should never need to change the partition key or the sort key. Recreate your table with only the partition key and without the sort key (if your app can change the sort key, that sort key makes no sense).
Is this the full query? The update_item docs say that TableName is required, which I don't see in your snippet.
From the UpdateItem docs:
Edits an existing item's attributes, or adds a new item to the table
if it does not already exist.
Make sure the primary key (partition key and sort key) you pass matches an item that already exists in your table. If it does not, UpdateItem will create a new item in the database.
Are you absolutely certain that the primary key for the item already exists in the database?
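If you want the call to fail instead of silently inserting when the key does not match an existing item, one option (a sketch, not part of the original answer) is to add a condition expression:

from boto3.dynamodb.conditions import Attr

# sketch: with this condition, update_item raises ConditionalCheckFailedException
# instead of upserting when no item with this key exists yet
table.update_item(
    Key={
        'KEY1': key1,
        'KEY2': key2
    },
    UpdateExpression='SET dbField1 = :val1',
    ConditionExpression=Attr('KEY1').exists(),
    ExpressionAttributeValues={':val1': form_val1}
)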

How to insert value in already created Database table through pandas `df.to_sql()`

I'm creating a new table and then inserting values into it, because the TSV file doesn't have headers, so I need to create the table structure first and then insert the values. I'm using the df.to_sql function to insert the TSV values into the database table, but while it creates the table, it doesn't insert any values into it, and it doesn't give any error either.
I have tried creating a new table through SQLAlchemy and inserting values, and that worked, but it didn't work for the already created table.
import csv
import sys
import pandas as pd
from sqlalchemy import create_engine

conn, cur = create_conn()
# note: special characters in the password must be percent-encoded in the URL ('#' -> '%23')
engine = create_engine('postgresql://postgres:Shubham%23123@localhost:5432/walmart')
create_query = '''create table if not exists new_table(
    "item_id" TEXT, "product_id" TEXT, "abstract_product_id" TEXT,
    "product_name" TEXT, "product_type" TEXT, "ironbank_category" TEXT,
    "primary_shelf" TEXT, "apparel_category" TEXT, "brand" TEXT)'''
cur.execute(create_query)
conn.commit()

file_name = 'new_table'
new_file = "C:\\Users\\shubham.shinde\\Desktop\\wallll\\new_file.txt"
data = pd.read_csv(new_file, delimiter="\t", chunksize=500000, error_bad_lines=False, quoting=csv.QUOTE_NONE, dtype="unicode", iterator=True)
with open(file_name + '_bad_rows.txt', 'w') as f1:
    sys.stderr = f1
    for df in data:
        df.to_sql('new_table', engine, if_exists='append')
data.close()
I want df.to_sql() to insert the values into the database table.
Not 100% certain if this argument works with PostgreSQL, but I had a similar issue when doing this with MSSQL. .to_sql() already creates the table named by its first argument, new_table, so your create_query is redundant. if_exists='append' also doesn't check for duplicate values: if the data in new_file is overwritten, or run through your function again, it will simply be added to the table again. As to why you're seeing the table name but not the data in it, that might be due to the size of the df; try setting fast_executemany=True as the second argument of create_engine.
My suggestion: get rid of create_query and handle the data types after to_sql(). Once the SQL table is created, you can treat it as a staging table and join your actual SQL table against it to test for duplicates. The non-duplicates can then be written to the actual table, converting data types on UPDATE to match the table's data type structure.
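A rough sketch of that staging-table idea (staging_table, actual_table, and item_id are illustrative names, not from the question):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://user:password@localhost:5432/walmart')
df = pd.read_csv('new_file.txt', delimiter='\t')

# load everything into a staging table first
df.to_sql('staging_table', engine, if_exists='append', index=False)

# then copy over only the rows whose key is not already in the real table
with engine.begin() as connection:
    connection.execute(text('''
        INSERT INTO actual_table
        SELECT s.* FROM staging_table s
        LEFT JOIN actual_table a ON a.item_id = s.item_id
        WHERE a.item_id IS NULL
    '''))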

How to make a lookup-table in cassandra

I want to create a table in Cassandra that is used as a lookup table. I have a lot of URLs in my database and want to store ids instead of the URL strings. So my approach is to store the URLs in a table with two columns: id (int) and url (text).
My problem is that I need an index on the url field and also on the id field.
The first index is used while processing new URLs (finding the id for a URL in the database) and the second is used while displaying data (getting the URL for an id).
How can I implement that in Cassandra?
I would suggest creating 2 separate tables for this:
CREATE TABLE id_url (id int primary key, url text);
and
CREATE TABLE url_id (url text primary key, id int);
Inserts to these tables should be done with a batch:
BEGIN BATCH
  INSERT INTO id_url (id, url) VALUES (1, '<url1>');
  INSERT INTO url_id (url, id) VALUES ('<url1>', 1);
APPLY BATCH;
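For illustration, the same dual write from Python with the DataStax driver might look like this (a sketch; the contact point, keyspace name, and example values are assumptions):

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('mykeyspace')  # 'mykeyspace' is illustrative

insert_id_url = session.prepare('INSERT INTO id_url (id, url) VALUES (?, ?)')
insert_url_id = session.prepare('INSERT INTO url_id (url, id) VALUES (?, ?)')

# a single logged batch keeps both lookup tables in sync
batch = BatchStatement()
batch.add(insert_id_url, (1, 'http://example.com'))
batch.add(insert_url_id, ('http://example.com', 1))
session.execute(batch)

Keep in mind that a multi-partition logged batch trades some write performance for that atomicity.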
You could create your table like this:
CREATE TABLE urls_table(
    id int PRIMARY KEY,
    url text
);
and then create an index on the second column:
create index urls_table_url on urls_table (url);
Your first query is satisfied since you're querying by the partition key. The second one is satisfied since you created an index on the url column.
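For illustration, the two lookups against that single table from the Python driver (a sketch; session setup as in the batch example above):

# look up the url for a given id (partition key)
row = session.execute('SELECT url FROM urls_table WHERE id = %s', (42,)).one()
url = row.url if row else None

# look up the id for a given url (secondary index)
row = session.execute('SELECT id FROM urls_table WHERE url = %s', ('http://example.com',)).one()
url_id = row.id if row else None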

How to search a cassandra collection map using QueryBuilder

In my Cassandra table I have a map collection, and I have also indexed the map keys.
CREATE TABLE IF NOT EXISTS test.collection_test(
    name text,
    year text,
    attributeMap map<text,text>,
    PRIMARY KEY ((name, year))
);
CREATE INDEX ON collection_test (attributeMap);
The QueryBuilder syntax is as below:
select().all().from("test", "collection_test")
    .where(eq("name", name)).and(eq("year", year));
How should I put a WHERE condition on attributeMap?
First of all, you will need to create an index on the keys in your map. By default, an index created on a map indexes the values of the map, not the keys. There is special syntax to index the keys:
CREATE INDEX attributeKeyIndex ON collection_test (KEYS(attributeMap));
Next, to SELECT from a map with indexed keys, you'll need the CONTAINS KEY keyword. But currently, there is not a definition for this functionality in the query builder API. However, there is an open ticket to support it: JAVA-677
Currently, to accomplish this with the Java Driver, you'll need to build your own query or use a prepared statement:
PreparedStatement statement = _session.prepare("SELECT * " +
    "FROM test.collection_test " +
    "WHERE attributeMap CONTAINS KEY ?");
BoundStatement boundStatement = statement.bind(yourKeyValue);
ResultSet results = _session.execute(boundStatement);
Finally, you should read through the DataStax doc on When To Use An Index. Secondary indexes are known to not perform well. I can't imagine that a secondary index on a collection would be any different.
