DynamoDB table design for very large attribute values - pagination

I am facing a design challenge while designing a DynamoDB table.
The structure of the table is:
appId : partition key
artifactId : sort key
and
attribute1, attribute2, attribute3 carry JSON data
The relationship between the attributes is:
attribute3 --> depends on attribute1, attribute2 and
attribute2 --> depends on attribute1
We finally need to show attribute3 on the UI.
The challenge we are facing is this:
attribute1, attribute2 and attribute3 contain a lot of data.
We have a query which fetches the data with appId = 1.
There are 16 rows in total with appId = 1.
This produces a very large result, and the query returns only part of the 16 records, since DynamoDB limits a single query result to 1 MB.
Paginating by row does not help here, because attribute1, attribute2 and attribute3 of a single row can together amount to more than 1 MB.
So what would be the best design to avoid this kind of issue?
I have described an approach below, but it is not up to the mark, because the real issue is that we need to paginate over the INDIVIDUAL JSON items inside attribute1, attribute2 and attribute3, not over rows.
Please suggest any approach that lets us paginate over the individual JSON items inside attribute1, attribute2 and attribute3.
Approach:
Since the final output shown on the UI is attribute3, we will store attribute1 and attribute2 in DynamoDB, and when the final processing happens (calculating the value of attribute3) we will store the result in a cache (we are using OpenSearch). This final processing happens asynchronously, and only once attribute1 and attribute2 are in DynamoDB.
Since the value of attribute3 is stored in the cache, there will be no performance penalty when retrieving all 16 records from the cache.
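One possible way to paginate over the individual JSON items, sketched here under assumptions that are not in the question: split each attribute's list of JSON items into size-bounded chunks and store every chunk as its own row, so that a query can page through chunks. The composite sort key format, the payload field and the 300,000-byte budget are hypothetical; the budget is only chosen to stay below DynamoDB's 400 KB per-item limit.

```python
import json

MAX_CHUNK_BYTES = 300_000  # assumed budget, kept under DynamoDB's 400 KB item limit


def chunk_json_items(items, max_bytes=MAX_CHUNK_BYTES):
    """Group JSON-serializable items into lists whose encoded size stays within max_bytes.

    A single item larger than max_bytes still ends up alone in its own chunk.
    """
    chunks, current, size = [], [], 2  # 2 bytes for the enclosing "[]"
    for item in items:
        item_size = len(json.dumps(item).encode("utf-8")) + 2  # + ", " separator
        if current and size + item_size > max_bytes:
            chunks.append(current)
            current, size = [], 2
        current.append(item)
        size += item_size
    if current:
        chunks.append(current)
    return chunks


def to_rows(app_id, artifact_id, attribute, items, max_bytes=MAX_CHUNK_BYTES):
    """Build one row per chunk; the (hypothetical) sort key carries attribute name and chunk number."""
    return [
        {
            "appId": app_id,
            "sortKey": f"{artifact_id}#{attribute}#{index:05d}",
            "payload": json.dumps(chunk),
        }
        for index, chunk in enumerate(chunk_json_items(items, max_bytes))
    ]
```

With rows shaped like this, a query on appId with a sort key prefix such as artifactId#attribute3# would return the chunks in order, and the normal row-level pagination would then page through JSON items rather than whole artifacts.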

Related

Custom partitioning on JDBC in PySpark

I have a huge table in an Oracle database that I want to work on in PySpark. But I want to partition it using a custom query. For example, imagine there is a column in the table that contains the user's name, and I want to partition the data based on the first letter of the user's name. Or imagine that each record has a date, and I want to partition it based on the month. And because the table is huge, I absolutely need the data for each partition to be fetched directly by its executor and NOT by the master. So can I do that in PySpark?
P.S.: The reason I need to control the partitioning is that I need to perform some aggregations on each partition (partitions have meaning, they are not just there to distribute the data), so I want the related records to be on the same machine to avoid any shuffles. Is this possible, or am I wrong about something?
NOTE
I don't care about even or skewed partitioning! I want all the related records (like all the records of a user, or all the records from a city etc.) to be partitioned together, so that they reside on the same machine and I can aggregate them without any shuffling.
It turned out that Spark has a way of controlling the partitioning logic exactly: the predicates option of spark.read.jdbc.
What I came up with eventually is as follows:
(For the sake of the example, imagine that we have the purchase records of a store, and we need to partition them based on userId and productId so that all the records of an entity are kept together on the same machine, and we can perform aggregations on these entities without shuffling.)
First, produce the histogram of every column that you want to partition by (count of each value):
userId    count
123456    1640
789012    932
345678    1849
901234    11
...       ...

productId    count
123456789    5435
523485447    254
363478326    2343
326484642    905
...          ...
Then, use the multifit algorithm to divide the values of each column into n balanced bins (n being the number of partitions that you want).
userId    bin
123456    1
789012    1
345678    1
901234    2
...       ...

productId    bin
123456789    1
523485447    2
363478326    2
326484642    3
...          ...
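The binning step can be sketched in a few lines. Note that this is not the multifit algorithm itself but a simpler greedy stand-in (largest count first, always into the currently lightest bin), which is often balanced enough for this purpose. The function name assign_bins is hypothetical.

```python
import heapq


def assign_bins(counts, n_bins):
    """Greedy balancing: place each value, largest count first, into the lightest bin so far.

    counts maps a column value (e.g. a userId) to its record count.
    Returns a dict mapping each value to a bin number in 1..n_bins.
    """
    heap = [(0, b) for b in range(1, n_bins + 1)]  # (current load, bin number)
    heapq.heapify(heap)
    bins = {}
    for value, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        load, b = heapq.heappop(heap)
        bins[value] = b
        heapq.heappush(heap, (load + count, b))
    return bins


user_counts = {123456: 1640, 789012: 932, 345678: 1849, 901234: 11}
user_bins = assign_bins(user_counts, n_bins=2)
```

The resulting mapping is what gets written to the bin table for that column.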
Then store these bin tables in the database (here as USER_PARTITION and PRODUCT_PARTITION).
Then update your query and join on these tables to get the bin numbers for every record:
url = 'jdbc:oracle:thin:username/password@address:port:dbname'
# The query is wrapped as an aliased subquery so Spark can use it in place of a table name
query = '''
(SELECT
    MY_TABLE.*,
    USER_PARTITION.BIN AS USER_BIN,
    PRODUCT_PARTITION.BIN AS PRODUCT_BIN
FROM MY_TABLE
LEFT JOIN USER_PARTITION
    ON MY_TABLE.USER_ID = USER_PARTITION.USER_ID
LEFT JOIN PRODUCT_PARTITION
    ON MY_TABLE.PRODUCT_ID = PRODUCT_PARTITION.PRODUCT_ID) MY_QUERY'''
# predicates is the list of per-partition conditions generated in the final step
df = spark.read \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .jdbc(url=url, table=query, predicates=predicates)
And finally, generate the predicates. One for each partition, like these:
predicates = [
'USER_BIN = 1 OR PRODUCT_BIN = 1',
'USER_BIN = 2 OR PRODUCT_BIN = 2',
'USER_BIN = 3 OR PRODUCT_BIN = 3',
...
'USER_BIN = n OR PRODUCT_BIN = n',
]
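The list above does not have to be typed by hand. A minimal sketch that builds it for any number of partitions, assuming the USER_BIN and PRODUCT_BIN column names from the query (build_predicates is a hypothetical helper name):

```python
def build_predicates(n_partitions):
    """Return one WHERE condition per partition, for bins numbered 1..n_partitions."""
    return [
        f"USER_BIN = {i} OR PRODUCT_BIN = {i}"
        for i in range(1, n_partitions + 1)
    ]


predicates = build_predicates(3)
```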
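Each predicate is added to the query as a WHERE clause, and its result becomes one partition, which means that all the records of the users in bin 1 go to the same machine. All the records of the products in bin 1 go to that same machine as well. Be aware that the predicates overlap: a record whose USER_BIN and PRODUCT_BIN differ matches two predicates and is read into both partitions, so when aggregating per user, keep only the rows whose USER_BIN matches the partition's bin (and likewise PRODUCT_BIN for products).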
Note that there are no relations between the user and the product here. We don't care which products are in which partition or are sent to which machine.
But since we want to perform some aggregations on both the users and the products (separately), we need to keep all the records of an entity (user or product) together. And using this method, we can achieve that without any shuffles.
Also note that if the records of some users or products don't fit in the workers' memory, you need to sub-partition them. First add a new random numeric column to your data (between 0 and some chunk_size, such as 10000), then partition on the combination of that number and the original ID (such as userId). This splits each entity into up to chunk_size smaller chunks, so that each chunk fits in the workers' memory.
After the aggregations, you need to group your data by the original IDs to combine all the chunks and make each entity whole again.
The shuffle at the end is inevitable because of the memory restriction and the nature of the data, but this is the most efficient way I found to achieve the desired results.
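The two-stage aggregation described above can be illustrated in plain Python, without Spark. This is only a sketch of the idea for a sum: records get a random chunk number, partial sums are computed per (ID, chunk), and the partial sums are then combined per ID. The name salted_sum and the (user_id, amount) record shape are assumptions made for the example.

```python
import random
from collections import defaultdict


def salted_sum(records, n_chunks, seed=0):
    """Two-stage sum: partial sums per (user_id, chunk), then combined per user_id."""
    rng = random.Random(seed)
    partial = defaultdict(int)
    for user_id, amount in records:
        chunk = rng.randrange(n_chunks)  # the random sub-partition number
        partial[(user_id, chunk)] += amount
    # Final stage: group by the original ID to make each entity whole again
    totals = defaultdict(int)
    for (user_id, _chunk), subtotal in partial.items():
        totals[user_id] += subtotal
    return dict(totals)


records = [(1, 10), (1, 5), (2, 7), (1, 1)]
totals = salted_sum(records, n_chunks=4)
```

This only works directly for aggregations that can be combined from partial results (sum, count, min, max); an average, for example, needs the partial sum and count carried separately.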

Spark JDBC - Read -> update -> write huge table without primary key

I am trying to update a few fields in each row of a big MySQL table (close to 500 million rows). The table doesn't have a primary key (or has a string primary key such as a UUID). I don't have enough executor memory to read and hold the entire data at once. Can anyone please tell me what my options are for processing such tables?
Below is the schema:
CREATE TABLE Persons ( Personid varchar(255) NOT NULL, LastName varchar(255) NOT NULL, FirstName varchar(255) DEFAULT NULL, Email varchar(255) DEFAULT NULL, Age int(11) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The Spark code looks like this:
import org.apache.spark.sql.*;
SparkSession spark = SparkSession.builder().master("spark://localhost:7077").appName("KMASK").getOrCreate();
// options is a Map<String, String> holding the JDBC settings (url, dbtable, user, password)
Dataset<Row> rawDataFrame = spark.read().format("jdbc").options(options).load();
rawDataFrame.createOrReplaceTempView("data");
// encrypt is a UDF
String sql = "select Personid, LastName, FirstName, encrypt(Email) as Email, Age from data";
Dataset<Row> newData = spark.sql(sql);
newData.write().mode(SaveMode.Overwrite).format("jdbc").options(options).save();
This table has around 150 million records, and the size of the data is around 6 GB. My executor memory is just 2 GB. Can I process this table using Spark JDBC?
You can tune the Spark JDBC fetchsize option to control how many rows are fetched per round trip.
Partitioning the data can also help to reduce shuffles and additional overhead. Since you have Age as a numerical field, you may also process the data in partitions determined by Age. First determine the min and max age, then use the Spark JDBC options.
Notably:
partitionColumn : Age
lowerBound : min age you identified
upperBound : max age you identified
numPartitions : depends largely on the number of cores and worker nodes available
You may also use custom queries with the query option to select and update only as many records as fit in memory. NB: when using the query option you should not use the dbtable option.
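As a rough sketch, the options listed above can be collected in a dictionary and passed to the JDBC reader. The option names (partitionColumn, lowerBound, upperBound, numPartitions, fetchsize, url, dbtable) are real Spark JDBC options; the helper name, the URL, and the concrete numbers are placeholders made up for the example.

```python
def jdbc_partition_options(url, table, min_age, max_age, num_partitions, fetch_size=1000):
    """Build Spark JDBC reader options for a range-partitioned read on the Age column."""
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": "Age",
        # The bounds only set the partition stride; rows outside them are still read.
        "lowerBound": str(min_age),
        "upperBound": str(max_age),
        "numPartitions": str(num_partitions),
        "fetchsize": str(fetch_size),
    }


# Placeholder connection details, not taken from the question
opts = jdbc_partition_options("jdbc:mysql://localhost:3306/mydb", "Persons", 0, 100, 8)

# With a SparkSession available, the read would then look like:
# df = spark.read.format("jdbc").options(**opts).load()
```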

How to model for word search in cassandra

My model stores word searches selected from checkboxes. It must support updating the word search and its status, and soft deletes (marking a search as deleted). My old model uses a uuid (the id of the word search) as the primary key and has an index on status (enable, disable, deleted).
But I don't want an index on the status column (I think it is very bad to index a column that gets updated), and I don't want to change the database.
Is there a better way to model this?
Sorry for my English grammar.
You should not create an index on a very low cardinality column such as status.
Avoid very low cardinality index e.g. index where the number of distinct values is very low. A good example is an index on the gender of an user. On each node, the whole user population will be distributed on only 2 different partitions for the index: MALE & FEMALE. If the number of users per node is very dense (e.g. millions) we’ll have very wide partitions for MALE & FEMALE index, which is bad
Source : https://www.datastax.com/dev/blog/cassandra-native-secondary-index-deep-dive
The best ways to handle this type of case:
Create a separate table for each status
Or use status together with a known parameter (year, month, etc.) as the partition key
Example of the second option:
CREATE TABLE save_search (
year int,
status int,
uuid uuid,
category text,
word_search text,
PRIMARY KEY((year, status), uuid)
);
Here you can see that I have made a composite partition key from year and status, because of the low cardinality issue. If you expect a huge amount of data for a single status, you should also add month to the composite partition key.
If your dataset is small, you can just remove the year field:
CREATE TABLE save_search (
status int,
uuid uuid,
category text,
word_search text,
PRIMARY KEY(status, uuid)
);
Or
If you are using Cassandra version 3.x or above, you can use a materialized view:
CREATE MATERIALIZED VIEW search_by_status AS
SELECT *
FROM your_main_table
WHERE uuid IS NOT NULL AND status IS NOT NULL
PRIMARY KEY (status, uuid);
You can then query by status like this:
SELECT * FROM search_by_status WHERE status = 0;
Cassandra keeps the materialized view in sync with every delete, update and insert you make on your main table.

Query to select max value of a column using max using cassandra

We have a table called table1 with Tableindex as a column. We need the maximum value of Tableindex, which should be 3 here. Consider this table as an example (table1):
Tableindex  Creationdate              Documentname  Documentid
1           2017-01-18 11:59:06+0530  Document 1    Doc_1
2           2017-01-18 12:09:06+0530  Document 2    Doc_2
3           2017-01-18 01:09:06+0530  Document 3    Doc_3
In SQL, we can select the max value using the following query:
"Select max(Tableindex) from table1;" // results in 3.
How can we get the max value in a similar way with a Cassandra query? Is there a max function in Cassandra? If not, how can we get the max value?
Thanks in advance.
Since Cassandra 2.2, the standard aggregate functions min, max, avg, sum, and count are built-in functions.
SELECT MAX(column_name) FROM keyspace.table_name ;
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useQueryStdAggregate.html
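If the table is modelled so that all the rows share one partition and Tableindex is a clustering column in descending order, select * from table1 limit 1 returns the row with the MAX Tableindex. Without such a design, limit 1 just returns an arbitrary first row.
There are better ways to do this with Cassandra. You should know a few things when designing Cassandra tables:
1- Design your table for your queries.
2- Design to distribute data evenly across the Cassandra cluster.
Alternatively, you can upgrade to a newer Cassandra version (2.2+), which has a built-in max function.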

Cassandra-secondary index on part of the composite key?

I am using a composite primary key consisting of 2 strings Name1, Name2, and a timestamp (e.g. 'Joe:Smith:123456'). I want to query a range of timestamps given an equality condition for either Name1 or Name2.
For example, in SQL:
SELECT * FROM testcf WHERE (timestamp > 111111 AND timestamp < 222222 and Name2 = 'Brown');
and
SELECT * FROM testcf WHERE (timestamp > 111111 AND timestamp < 222222 and Name1 = 'Charlie');
From my understanding, the first part of the composite key is the partition key, so the second query is possible, but the first query would require some kind of index on Name2.
Is it possible to create a separate index on a component of the composite key? Or am I misunderstanding something here?
You will need to manually create and maintain an index of names if you want to use your schema and support the first query. Given this requirement, I question your choice in data model. Your model should be designed with your read pattern in mind. I presume you are also storing some column values as well that you want to query by timestamp. If so, perhaps the following model would serve you better:
"[current_day]:Joe:Smith" {
123456:Field1 : value
123456:Field2 : value
123450:Field1 : value
123450:Field2 : value
}
With this model you can use the current day (or some known day) as a sentinel value, then filter on first and last names. You can also get a range of columns by timestamp using the composite column names.
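To make the range lookup concrete, here is an in-memory analogy in Python rather than actual Cassandra code: one wide row is modelled as a list of columns kept sorted by the composite name (timestamp, field), and a timestamp range becomes a slice of that sorted list. The values and the function name slice_by_timestamp are made up for the illustration.

```python
from bisect import bisect_left

# One wide row: columns kept sorted by the composite name (timestamp, field)
row = [
    ((123450, "Field1"), "a"),
    ((123450, "Field2"), "b"),
    ((123456, "Field1"), "c"),
    ((123456, "Field2"), "d"),
]


def slice_by_timestamp(columns, start, end):
    """Return the columns with start < timestamp < end (integer timestamps, exclusive bounds)."""
    names = [name for name, _ in columns]
    # A 1-tuple sorts before every composite name that shares its timestamp
    lo = bisect_left(names, (start + 1,))
    hi = bisect_left(names, (end,))
    return columns[lo:hi]


recent = slice_by_timestamp(row, 123450, 123460)
```

Because the columns are stored sorted, the range is found by locating its two ends instead of scanning every column, which is what makes the composite column names useful for timestamp ranges.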

Resources