Get PartitionedList for partition with more than 2000 documents - node.js

I have a partitioned Cloudant database (on the free tier) with a partition that has more than 2000 documents. Unfortunately, running await db.partitionedList('partitionID') returns this object:
{
total_rows: 2082,
offset: 0,
rows: [...]
}
where rows is an array of only 2000 objects. Is there a way for me to get the 82 remaining rows, or to get a list of all 2082 rows together? Thanks.

Cloudant limits the _partition endpoints to returning a maximum of 2000 rows, so you can't get all 2082 rows at once.
The way to get the remaining rows is to store the doc ID of the last row and use it to build a startkey for a second request, appending \0 so that the list starts from the next doc ID in the index, e.g.
db.partitionedList('partitionID', {
startkey: `${firstResponse.rows[1999].id}\0`
})
Note that partitionedList is the equivalent of /{db}/_partition/{partitionID}/_all_docs, so key and id are the same in each row and you can safely assume they are unique (because it is a doc ID), allowing you to use the Unicode \0 trick. However, if you wanted to do the same with a _view you'd need to store both the key and id and fetch the 2000th row twice.
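The question uses the node.js nano client, but since partitionedList is just a wrapper over the _all_docs endpoint above, here is a hedged, language-agnostic sketch of the paging loop using Python's requests; the host, database name, partition name and credentials are placeholders, not values from the question.
import json
import requests

BASE_URL = 'https://ACCOUNT.cloudantnosqldb.appdomain.cloud'  # placeholder host
DB = 'mydb'                                                   # placeholder database
PARTITION = 'partitionID'

def all_partition_rows(session):
    rows, params = [], {'limit': 2000}
    while True:
        resp = session.get(f'{BASE_URL}/{DB}/_partition/{PARTITION}/_all_docs',
                           params=params)
        resp.raise_for_status()
        batch = resp.json()['rows']
        rows.extend(batch)
        if len(batch) < 2000:                 # a short page means we are done
            return rows
        # restart just after the last doc ID using the \0 trick
        params['startkey'] = json.dumps(batch[-1]['id'] + '\u0000')
You would call all_partition_rows(requests.Session()) with your Cloudant credentials set on the session; the loop keeps issuing 2000-row pages until a short page comes back.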

Related

Custom partitioning on JDBC in PySpark

I have a huge table in an Oracle database that I want to work on in pyspark. But I want to partition it using a custom query; for example, imagine there is a column in the table that contains the user's name, and I want to partition the data based on the first letter of the user's name. Or imagine that each record has a date, and I want to partition it based on the month. And because the table is huge, I absolutely need the data for each partition to be fetched directly by its executor and NOT by the master. So can I do that in pyspark?
P.S.: The reason that I need to control the partitioning is that I need to perform some aggregations on each partition (the partitions have meaning, they are not just there to distribute the data), and so I want them to be on the same machine to avoid any shuffles. Is this possible? Or am I wrong about something?
NOTE
I don't care about even or skewed partitioning! I want all the related records (like all the records of a user, or all the records from a city etc.) to be partitioned together, so that they reside on the same machine and I can aggregate them without any shuffling.
It turned out that Spark has a way of controlling the partitioning logic exactly, and that is the predicates option in spark.read.jdbc.
What I came up with eventually is as follows:
(For the sake of the example, imagine that we have the purchase records of a store, and we need to partition them based on userId and productId so that all the records of an entity are kept together on the same machine, and we can perform aggregations on these entities without shuffling)
First, produce the histogram of every column that you want to partition by (the count of each value); a sketch of this step follows the tables below:
userId   count
--------------
123456   1640
789012   932
345678   1849
901234   11
...      ...

productId   count
-----------------
123456789   5435
523485447   254
363478326   2343
326484642   905
...         ...
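Producing these histograms can be pushed down to Oracle itself so that the huge table never has to be pulled over JDBC just to count values. A hedged sketch follows; url and the driver option match the example read later in this answer, and MY_TABLE, USER_ID and PRODUCT_ID are the illustrative names used there.
# Hedged sketch: push the counting down to Oracle via a subquery so only the
# (value, count) pairs travel over JDBC. `url` is the JDBC URL defined later.
props = {'driver': 'oracle.jdbc.driver.OracleDriver'}

user_hist = spark.read.jdbc(
    url=url,
    table='(SELECT USER_ID, COUNT(*) AS CNT FROM MY_TABLE GROUP BY USER_ID) UH',
    properties=props)

product_hist = spark.read.jdbc(
    url=url,
    table='(SELECT PRODUCT_ID, COUNT(*) AS CNT FROM MY_TABLE GROUP BY PRODUCT_ID) PH',
    properties=props)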
Then, use the multifit algorithm to divide the values of each column into n balanced bins (n being the number of partitions that you want); a greedy stand-in for this step is sketched after the tables below.
userId   bin
------------
123456   1
789012   1
345678   1
901234   2
...      ...

productId   bin
---------------
123456789   1
523485447   2
363478326   2
326484642   3
...         ...
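The exact multifit implementation is beyond the scope of the answer; as a rough, hedged stand-in, a greedy longest-count-first assignment already produces reasonably balanced bins. Here counts is a hypothetical dict of value -> row count taken from the histogram step, and n is the number of partitions.
def assign_bins(counts, n):
    """Greedy (LPT-style) stand-in for multifit: biggest counts first,
    each value goes to the currently least-loaded bin (bins are 1..n)."""
    loads = [0] * n            # total rows currently assigned to each bin
    bins = {}
    for value, cnt in sorted(counts.items(), key=lambda kv: -kv[1]):
        target = min(range(n), key=loads.__getitem__)   # least-loaded bin
        bins[value] = target + 1
        loads[target] += cnt
    return bins

# e.g. assign_bins({123456: 1640, 789012: 932, 345678: 1849, 901234: 11}, n=2)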
Then, store these bin tables in the database.
Then update your query with a join on these tables to get the bin number for every record:
url = 'jdbc:oracle:thin:username/password@address:port:dbname'
query = """
(SELECT
    MY_TABLE.*,
    USER_PARTITION.BIN AS USER_BIN,
    PRODUCT_PARTITION.BIN AS PRODUCT_BIN
FROM MY_TABLE
LEFT JOIN USER_PARTITION
    ON MY_TABLE.USER_ID = USER_PARTITION.USER_ID
LEFT JOIN PRODUCT_PARTITION
    ON MY_TABLE.PRODUCT_ID = PRODUCT_PARTITION.PRODUCT_ID) MY_QUERY"""
# `predicates` is defined below, one entry per bin
df = spark.read\
    .option('driver', 'oracle.jdbc.driver.OracleDriver')\
    .jdbc(url=url, table=query, predicates=predicates)
And finally, generate the predicates, one for each partition, like these:
predicates = [
'USER_BIN = 1 OR PRODUCT_BIN = 1',
'USER_BIN = 2 OR PRODUCT_BIN = 2',
'USER_BIN = 3 OR PRODUCT_BIN = 3',
...
'USER_BIN = n OR PRODUCT_BIN = n',
]
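Rather than writing the predicate strings by hand, they can be generated for whatever bin count n you chose; a trivial, hedged sketch (n is a placeholder):
n = 4  # placeholder: the number of bins/partitions chosen earlier
predicates = [f'USER_BIN = {i} OR PRODUCT_BIN = {i}' for i in range(1, n + 1)]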
The predicates are added to the query as WHERE clauses, which means that all the records of the users in partition 1 go to the same machine. Also, all the records of the products in partition 1 go to that same machine as well.
Note that there are no relations between the user and the product here. We don't care which products are in which partition or are sent to which machine.
But since we want to perform some aggregations on both the users and the products (separately), we need to keep all the records of an entity (user or product) together. And using this method, we can achieve that without any shuffles.
Also, note that if there are some users or products whose records don't fit in a worker's memory, then you need to do sub-partitioning: first add a new random numeric column to your data (between 0 and some chunk_size, say 10000), then partition on the combination of that number and the original IDs (like userId). This splits each entity into roughly fixed-size chunks so that each chunk fits in a worker's memory.
After the aggregations, you need to group your data on the original IDs to aggregate all the chunks together and make each entity whole again (a sketch of this re-aggregation follows below).
The shuffle at the end is inevitable because of the memory restriction and the nature of the data, but this is the most efficient way to achieve the desired result.
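To make the sub-partitioning idea concrete, here is a hedged sketch of the re-aggregation step only. It assumes the random CHUNK column was already added upstream (e.g. in the database query), that the predicates were built on the (bin, chunk) combination, and that sum('AMOUNT') stands in for whatever per-entity aggregation you actually need.
from pyspark.sql import functions as F

# first aggregate within each (user, chunk) pair; this can run per partition
partial = df.groupBy('USER_ID', 'CHUNK').agg(F.sum('AMOUNT').alias('partial_sum'))

# then merge the chunks of each user back together; this is the one
# unavoidable shuffle mentioned above
totals = partial.groupBy('USER_ID').agg(F.sum('partial_sum').alias('total'))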

Cassandra paging: How to page through entire column family (table) and have part of compound key in resultset

I have a table as follows:
CREATE TABLE someTable (
    user_id uuid,
    id uuid,
    someField text,
    anotherField text,
    PRIMARY KEY (user_id, id)
);
I know that there's a way to do paging in cassandra (https://docs.datastax.com/en/developer/java-driver/2.1/manual/paging/)
However, what I need to do is:
page through the entire table (it's large, so paging is required)
get all rows of a user_id
do something with these rows.
In short I need to fetch all the results of 1 user and do this for every record there is. (No, I don't have a unique list of user_ids here)
Also, I know I could do this programmatically: page through all the pages, assume I get them ordered by user_id, and carry the last user_id (where the rows are cut off) over to the next page of results so that user's data ends up in the same set.
However, I was hoping there would be a more elegant solution for this?
However, what I need to do is:
page through the entire table (it's large, so paging is required).
Assuming you don't know the user_id values and you want to fetch all users' data, use the token function to make a range query over the partition key. Displaying rows from an unordered partitioner with the TOKEN function looks something like: select * from someTable where token(user_id) > token(other_id);
get all rows of a user_id
Now you know the user_id and want to fetch all the rows for that user_id. Use a range query on id, starting from the minimum UUID, like:
select * from someTable where user_id = 123 and id > MIN_UUID limit 100
After that query, take the 100th uuid and use it to fetch the next rows:
select * from someTable where user_id = 123 and id > [previous_quries_100th_id(uuid)] limit 100
Keep querying until you have fetched all the rows.
do something with these rows.
That depends on what you want to do with all of those rows. Use your driver's language-specific ResultSet and iterate over the rows to process them.
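To make the whole flow concrete, here is a hedged sketch with the DataStax Python driver (the question is driver-agnostic); the contact point, keyspace and handle_user function are placeholders. The driver pages transparently as you iterate with fetch_size, and a full scan returns each partition's rows contiguously, so a change in user_id marks the start of the next user.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(['127.0.0.1']).connect('my_keyspace')   # placeholders
stmt = SimpleStatement('SELECT * FROM someTable', fetch_size=1000)

current_user, batch = None, []
for row in session.execute(stmt):          # the driver fetches pages transparently
    if batch and row.user_id != current_user:
        handle_user(current_user, batch)   # placeholder for "do something with these rows"
        batch = []
    current_user = row.user_id
    batch.append(row)
if batch:
    handle_user(current_user, batch)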

KDB: How to Delete rows from Partitioned Table

I have the below query used to delete rows from a partitioned table, but it doesn't work. What is the approach used for deleting rows in a partitioned table?
delete from SecurityLoan where lender=`SCOTIA, date in inDays, portfolio in portfoliolist
Note that inDays and portfoliolist are lists
Here's a slightly different method that re-indexes a column in a partition to a new list of indices that you want to keep in that column.
It still follows the same semantics of reading a column in, amending it and then writing it back to disk; it just uses a slightly different approach. By doing it this way you can grab the indices you want to remove simply by using a qsql query. It then grabs the full list of indices in each partition and runs 'except' against the initial list, leaving the ones you actually want to keep.
It becomes powerful when all you want to do is delete the contents of a sql query from a database/table (as is the case in yours).
// I've commented this function as much as possible to break it down and explain the approach
// db is where the database lives (hsym)
// qry is the qsql query (string)
q)delFromDisk:{[db;qry]
// grab the tree from the query
q:parse qry;
// cache partition counts
.Q.cn `. t:q 1;
// grab i by partition for your qry using the where clause
d:?[t;raze q 2;{x!x}1#f:.Q.pf;enlist[`delis]!1#`i];
// grab the full index list for each partition
a:1!flip (f,`allis)!(`. f;til each .Q.pn t);
// run except on the full index list and your query's index list
r:update newis:allis except'delis from a,'d;
// grab columns except partition domain
c:cols[t] except .Q.pf;
// grab partitions that actually need modifications and make them dir handles
p:update dirs:.Q.par[db;;t] each p[.Q.pf] from p:0!select from r where not allis~'newis;
// apply on disk to directory handle (x), on column (y), to new indices (z)
m:{#[x;y;#;z]};
// grab params from p
pa:`dirs`c`newis#p cross ([]c);
// modify each column in a partition, one partition at a time
m .' value each pa
};
// test data/table
q)portfolio:`one`two`three`four`five;
q)lender:`user1`user2`user3`user4;
q)n:5;
// set to disk in date partitioned format
q)`:./2017.01.01/secLoan/ set .Q.en[`:./] ([]lender:n?lender;portfolio:n?portfolio);
q)`:./2017.01.02/secLoan/ set .Q.en[`:./] ([]lender:n?lender;portfolio:n?portfolio);
// load db
q)\l .
// lets say we want to delete from secLoan where lender in `user3 and portfolio in `one`two`three
// please note, this query does not have a date constraint, so it may be an inefficient query if your where-clause produces large results. Once happy with the util as a whole, it can be re-jigged to select+delete per partition
q)select from secLoan where lender in `user3,portfolio in `one`two`three
date lender portfolio
---------------------------
2017.01.01 user3 one
2017.01.01 user3 two
2017.01.02 user3 one
// 3 rows need to be deleted, 2 from the first partition, 1 from the second partition
// 10 rows exist
q)count secLoan
10
// run delete function
q)delFromDisk[`:.;"select from secLoan where lender in `user3,portfolio in `one`two`three"];
// reload to see diffs
q)\l .
q)count secLoan
7
// rows deleted
q)secLoan
date lender portfolio
---------------------------
2017.01.01 user2 five
2017.01.01 user4 three
2017.01.01 user2 three
2017.01.02 user2 five
2017.01.02 user2 two
2017.01.02 user4 three
2017.01.02 user1 five
// PS - can accept a delete qsql query as all the function does is look at the where clause
// delFromDisk[`:.;"delete from secLoan where lender in `user3,portfolio in `one`two`three"]
Unfortunately you can't use delete directly on a partitioned database.
For you to completely remove a row you'd have to read, modify and write all the data down again.
For an example on how to achieve this see the wiki:
http://code.kx.com/wiki/JB:KdbplusForMortals/partitioned_tables#1.3.5_Modifying_Partitioned_Tables
Thanks,
Seán

Can Cassandra hold an empty list?

I want to create a Cassandra table with a list<int> field and insert an empty list:
CREATE TABLE test (
    name text PRIMARY KEY,
    scores list<int>
);
INSERT INTO test (name, scores) VALUES ('John', []);
However, this returns null
SELECT * FROM test;
 name | scores
------+--------
 John |   null
Does Cassandra not differentiate between null and empty list?
As always, the recommendation with Cassandra is: don't insert NULL or try to insert EMPTY values. It just saves you from tombstones, storage and I/O bandwidth.
The reason Cassandra doesn't differentiate NULL vs empty is the way deletes are handled. There is no read before deleting any record in Cassandra, so it just marks a tombstone and moves ahead.
So you actually get penalized for initializing the list as empty (it essentially creates a tombstone).
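A hedged sketch with the DataStax Python driver shows the behaviour against the table from the question (the contact point and keyspace are placeholders): the empty list round-trips as None, exactly as if the column had never been set.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')   # placeholders

session.execute("INSERT INTO test (name, scores) VALUES (%s, %s)", ('John', []))
row = session.execute("SELECT scores FROM test WHERE name = %s", ('John',)).one()
print(row.scores)   # prints None, not []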

CouchDB views: total_rows vs offset vs rows?

I am making a POST request to a CouchDB with a list of keys in the body.
This is a follow-up to a previous question asked on Stack Overflow: CouchDB Query View with Multiple Keys Formatting.
I see that the result has 711 rows returned in this case, with an offset of 209. To me an offset means valid results that have been truncated - and you would need to go to the next page to see them.
I'm getting confused because the offset, rows, and what I actually get does not seem to add up. These are the results that I'm getting:
{
total_rows: 711,
offset: 209,
rows: [{
id: 'b45d1be2-9173-4008-9240-41b01b66b5de',
key: 2213,
value: [Object]
}, {
id: 'a73d0b13-5d36-431f-8a7a-2f2b45cb480d',
key: 2214,
value: [Object]
},
etc BUT THERE ARE ONLY 303 OBJECTS IN THIS ARRAY????
]
}
You have not supplied the query parameters you are using so I'll have to be a little general.
The total_rows value is the total number of rows in the view itself. The offset is the index in the view of the first matching row for the given query. The rows matching the query parameters are returned in the rows array, and their count is trivial to obtain.
If there are no entries in the view for a direct key query, the offset value is the index into the view where the entry would be if it had the desired key.
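A hedged sketch of checking those three numbers for a keys POST like the one in the question; the URL, credentials and view name are placeholders, and the comments follow the description above.
import requests

resp = requests.post(
    'http://localhost:5984/mydb/_design/app/_view/by_key',   # placeholder db/view
    json={'keys': [2213, 2214]},
    auth=('admin', 'password'))                               # placeholder credentials
body = resp.json()

print(body['total_rows'])   # rows in the whole view
print(body['offset'])       # index in the view of the first returned row
print(len(body['rows']))    # how many rows actually matched the keys you sent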
It would seem that the offset refers to the number of documents in the view BEFORE the first document that matches the key criteria, and the rows are all the documents that match the criteria.
In other words, rows returns all the documents that match the key criteria, and offset tells you the 'index' within all the docs returned by the view at which the first matching document was found.
Please let me know if this is not correct :)
