I have the below query used to delete rows from a partitioned table, but it doesn't work. What is the approach used for deleting rows in a partitioned table?
delete from SecurityLoan where lender=`SCOTIA, date in inDays, portfolio in portfoliolist
Note that inDays and portfoliolist are lists
Here's a slightly different method that re-indexes a column in a partition to a new list of indices that you want to keep in that column.
It still follows the same read, amend and write-back cycle for each column, but takes a slightly different route: you grab the indices you want to remove simply by using a qsql query. It then grabs the full list of indices in each partition and runs 'except' against that removal list, leaving the indices you actually want to keep.
It becomes powerful when all you want to do is delete the results of a qsql query from a database/table (as is the case here).
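The keep-list idea can be shown outside kdb+ with a small Python sketch. This is not the q function that follows, only the same "all indices except the deleted ones" logic applied to an in-memory partition; keep_indices, reindex_columns and the sample data are made up for the illustration.

```python
def keep_indices(row_count, delete_indices):
    """Return the row indices that survive: all indices except the deleted ones."""
    doomed = set(delete_indices)
    return [i for i in range(row_count) if i not in doomed]


def reindex_columns(columns, keep):
    """Rebuild every column from the kept indices only."""
    return {name: [values[i] for i in keep] for name, values in columns.items()}


# Hypothetical single partition with five rows
partition = {
    "lender": ["user3", "user2", "user3", "user4", "user2"],
    "portfolio": ["one", "five", "two", "three", "three"],
}
# Row indices matched by the where clause (lender user3, portfolio one/two/three)
to_delete = [0, 2]
keep = keep_indices(5, to_delete)
trimmed = reindex_columns(partition, keep)
```

On disk the same step has to be repeated for every column file in every affected partition, which is why the partition domain column is excluded and the work is done one partition at a time.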
// I've commented this function as much as possible to break it down and explain the approach
// db is where the database lives (hsym)
// qry is the qsql query (string)
q)delFromDisk:{[db;qry]
// grab the tree from the query
q:parse qry;
// cache partition counts
.Q.cn `. t:q 1;
// grab i by partition for your qry using the where clause
d:?[t;raze q 2;{x!x}1#f:.Q.pf;enlist[`delis]!1#`i];
// grab full index list for each partition
a:1!flip (f,`allis)!(`. f;til each .Q.pn t);
// run except on the full index list and your query's index list
r:update newis:allis except'delis from a,'d;
// grab columns except partition domain
c:cols[t] except .Q.pf;
// grab partitions that actually need modifications and make them dir handles
p:update dirs:.Q.par[db;;t] each p[.Q.pf] from p:0!select from r where not allis~'newis;
// apply on disk to directory handle (x), on column (y), to new indices (z)
m:{#[x;y;#;z]};
// grab params from p
pa:`dirs`c`newis#p cross ([]c);
// modify each column in a partition, one partition at a time
m .' value each pa
};
// test data/table
q)portfolio:`one`two`three`four`five;
q)lender:`user1`user2`user3`user4;
q)n:5;
// set to disk in date partitioned format
q)`:./2017.01.01/secLoan/ set .Q.en[`:./] ([]lender:n?lender;portfolio:n?portfolio);
q)`:./2017.01.02/secLoan/ set .Q.en[`:./] ([]lender:n?lender;portfolio:n?portfolio);
// load db
q)\l .
// let's say we want to delete from secLoan where lender in `user3 and portfolio in `one`two`three
// please note, this query does not have a date constraint, so it may be inefficient if your where-clause produces large results. Once happy with the util as a whole, it can be re-jigged to select+delete per partition
q)select from secLoan where lender in `user3,portfolio in `one`two`three
date       lender portfolio
---------------------------
2017.01.01 user3  one
2017.01.01 user3  two
2017.01.02 user3  one
// 3 rows need deleting, 2 from the first partition, 1 from the second partition
// 10 rows exist
q)count secLoan
10
// run delete function
q)delFromDisk[`:.;"select from secLoan where lender in `user3,portfolio in `one`two`three"];
// reload to see diffs
q)\l .
q)count secLoan
7
// rows deleted
q)secLoan
date       lender portfolio
---------------------------
2017.01.01 user2  five
2017.01.01 user4  three
2017.01.01 user2  three
2017.01.02 user2  five
2017.01.02 user2  two
2017.01.02 user4  three
2017.01.02 user1  five
// PS - can accept a delete qsql query as all the function does is look at the where clause
// delFromDisk[`:.;"delete from secLoan where lender in `user3,portfolio in `one`two`three"]
Unfortunately you can't use delete directly on a partitioned database.
For you to completely remove a row you'd have to read, modify and write all the data down again.
For an example on how to achieve this see the wiki:
http://code.kx.com/wiki/JB:KdbplusForMortals/partitioned_tables#1.3.5_Modifying_Partitioned_Tables
Thanks,
Seán
I have a huge table in an oracle database that I want to work on in pyspark. But I want to partition it using a custom query, for example imagine there is a column in the table that contains the user's name, and I want to partition the data based on the first letter of the user's name. Or imagine that each record has a date, and I want to partition it based on the month. And because the table is huge, I absolutely need the data for each partition to be fetched directly by its executor and NOT by the master. So can I do that in pyspark?
P.S.: The reason that I need to control the partitioning, is that I need to perform some aggregations on each partition (partitions have meaning, not just to distribute the data) and so I want them to be on the same machine to avoid any shuffles. Is this possible? or am I wrong about something?
NOTE
I don't care about even or skewed partitioning! I want all the related records (like all the records of a user, or all the records from a city etc.) to be partitioned together, so that they reside on the same machine and I can aggregate them without any shuffling.
It turns out that Spark has a way of controlling the partitioning logic exactly: the predicates option of spark.read.jdbc.
What I came up with eventually is as follows:
(For the sake of the example, imagine that we have the purchase records of a store, and we need to partition them based on userId and productId so that all the records of an entity are kept together on the same machine, and we can perform aggregations on these entities without shuffling.)
First, produce the histogram of every column that you want to partition by (count of each value):
userId | count
-------+------
123456 | 1640
789012 | 932
345678 | 1849
901234 | 11
...    | ...

productId | count
----------+------
123456789 | 5435
523485447 | 254
363478326 | 2343
326484642 | 905
...       | ...
Then, use the multifit algorithm to divide the values of each column into n balanced bins (n being the number of partitions that you want).
userId | bin
-------+----
123456 | 1
789012 | 1
345678 | 1
901234 | 2
...    | ...

productId | bin
----------+----
123456789 | 1
523485447 | 2
363478326 | 2
326484642 | 3
...       | ...
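Then, store these bin assignments in the database (as the USER_PARTITION and PRODUCT_PARTITION tables used in the query that follows).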
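As a rough illustration of the binning step, here is a small Python sketch. It does not implement multifit; it swaps in the simpler greedy "heaviest first into the lightest bin" heuristic, which is usually good enough to see the idea. greedy_bins is a made-up name, and the bins it produces will not necessarily match the illustrative bin numbers shown earlier.

```python
import heapq


def greedy_bins(histogram, n_bins):
    """Assign each key to a bin numbered 1..n_bins.

    Keys are taken heaviest first and always placed in the bin whose
    running total is currently the smallest (greedy, not multifit).
    """
    heap = [(0, b) for b in range(1, n_bins + 1)]  # (running total, bin number)
    heapq.heapify(heap)
    assignment = {}
    for key, count in sorted(histogram.items(), key=lambda kv: (-kv[1], kv[0])):
        total, bin_number = heapq.heappop(heap)
        assignment[key] = bin_number
        heapq.heappush(heap, (total + count, bin_number))
    return assignment


# Histogram taken from the userId example
user_histogram = {123456: 1640, 789012: 932, 345678: 1849, 901234: 11}
user_bins = greedy_bins(user_histogram, 2)
```

The resulting key-to-bin mapping is what would be written to the partition lookup table.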
Then update your query and join on these tables to get the bin numbers for every record:
url = 'jdbc:oracle:thin:username/password@address:port:dbname'
query = '''
(SELECT
    MY_TABLE.*,
    USER_PARTITION.BIN AS USER_BIN,
    PRODUCT_PARTITION.BIN AS PRODUCT_BIN
FROM MY_TABLE
LEFT JOIN USER_PARTITION
    ON MY_TABLE.USER_ID = USER_PARTITION.USER_ID
LEFT JOIN PRODUCT_PARTITION
    ON MY_TABLE.PRODUCT_ID = PRODUCT_PARTITION.PRODUCT_ID) MY_QUERY'''
# predicates is the list of per-partition conditions built in the next step
df = spark.read\
    .option('driver', 'oracle.jdbc.driver.OracleDriver')\
    .jdbc(url=url, table=query, predicates=predicates)
And finally, generate the predicates, one for each partition, like these:
predicates = [
'USER_BIN = 1 OR PRODUCT_BIN = 1',
'USER_BIN = 2 OR PRODUCT_BIN = 2',
'USER_BIN = 3 OR PRODUCT_BIN = 3',
...
'USER_BIN = n OR PRODUCT_BIN = n',
]
The predicates are added to the query as WHERE clauses, which means that all the records of the users in partition 1 go to the same machine. Also, all the records of the products in partition 1 go to that same machine as well.
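The predicate list does not have to be typed by hand. A minimal sketch, assuming the bins are numbered 1 to n and the bin columns are called USER_BIN and PRODUCT_BIN as in the query; build_predicates is a hypothetical helper name:

```python
def build_predicates(n_partitions, columns=("USER_BIN", "PRODUCT_BIN")):
    """Return one WHERE-clause string per partition, for bins 1..n_partitions."""
    return [
        " OR ".join(f"{column} = {i}" for column in columns)
        for i in range(1, n_partitions + 1)
    ]


predicates = build_predicates(3)
```

Spark creates one partition per entry in the list. Note that with OR, a row whose USER_BIN and PRODUCT_BIN differ satisfies two predicates and is therefore read into two partitions.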
Note that there are no relations between the user and the product here. We don't care which products are in which partition or are sent to which machine.
But since we want to perform some aggregations on both the users and the products (separately), we need to keep all the records of an entity (user or product) together. And using this method, we can achieve that without any shuffles.
Also, note that if there are some users or products whose records don't fit in the workers' memory, then you need to do a sub-partitioning. This means you first add a new random integer column to your data (between 0 and some number of chunks, like 10000), then partition on the combination of that number and the original ID (like userId). Each entity is then split into that many chunks of roughly equal size, each small enough to fit in the workers' memory.
And after the aggregations, you need to group your data on the original IDs to aggregate all the chunks together and make each entity whole again.
The shuffle at the end is inevitable because of our memory restriction and the nature of our data, but this is the most efficient way you can achieve the desired results.
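The two-stage aggregation described above can be sketched in plain Python, without Spark, just to show the logic. It assumes the aggregation is a sum (any aggregation that can be merged from partial results works the same way); salted_sum and the sample records are made up for the illustration.

```python
import random
from collections import defaultdict


def salted_sum(records, n_chunks, seed=0):
    """Sum amounts per user in two stages: per (user, chunk), then per user."""
    rng = random.Random(seed)
    partial = defaultdict(int)
    for user_id, amount in records:
        chunk = rng.randrange(n_chunks)      # the random sub-partitioning column
        partial[(user_id, chunk)] += amount  # stage 1: aggregate each chunk
    totals = defaultdict(int)
    for (user_id, _chunk), subtotal in partial.items():
        totals[user_id] += subtotal          # stage 2: merge the chunks per user
    return dict(totals)


records = [(123456, 10), (123456, 5), (789012, 7), (123456, 1)]
totals = salted_sum(records, n_chunks=4)
```

Stage 2 is the part that corresponds to the final group-by on the original IDs, and it only has to move one partial result per chunk rather than every record.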
I have a partitioned Cloudant database (on the free tier) with a partition that has more than 2000 documents. Unfortunately, running await db.partitionedList('partitionID') returns this object:
{
  total_rows: 2082,
  offset: 0,
  rows: [...]
}
where rows is an array of only 2000 objects. Is there a way for me to get those 82 remaining rows, or to get a list of all 2082 rows together? Thanks.
Cloudant limits the _partition endpoints to returning a maximum of 2000 rows so you can't get all 2082 rows at once.
The way to get the remaining rows is to store the doc ID of the last row and use it to make a startkey for a second request, appending \0 to ask the list to start from the next doc ID in the index, e.g.
db.partitionedList('partitionID', {
  startkey: `${firstResponse.rows[1999].id}\0`
})
Note that partitionedList is the equivalent of /{db}/_partition/{partitionID}/_all_docs, so key and id are the same in each row and you can safely assume they are unique (because each is a doc ID), which is what allows the Unicode \0 trick. However, if you wanted to do the same with a _view, you'd need to store both the key and id and fetch the 2000th row twice.
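Generalised to any number of pages, the loop looks roughly like the Python sketch below. It makes no Cloudant calls: fetch_page is a hypothetical callable standing in for the partitioned _all_docs request, and fake_fetch is an in-memory stand-in so the loop can be run as written.

```python
def fetch_all_rows(fetch_page, page_limit=2000):
    """Collect every row by repeatedly asking for the page after the last doc ID.

    fetch_page(startkey) is a hypothetical callable returning a list of rows
    (dicts with an "id" key), sorted by id and at most page_limit long.
    """
    rows = []
    startkey = None
    while True:
        page = fetch_page(startkey)
        rows.extend(page)
        if len(page) < page_limit:
            return rows
        # "\0" sorts immediately after the last ID, so the next page starts
        # at the following document without repeating the last one.
        startkey = page[-1]["id"] + "\0"


# In-memory stand-in for a partition holding 2082 documents
docs = [{"id": f"partitionID:{n:05d}"} for n in range(2082)]


def fake_fetch(startkey, limit=2000):
    matching = [d for d in docs if startkey is None or d["id"] >= startkey]
    return matching[:limit]


all_rows = fetch_all_rows(fake_fetch)
```

The loop stops as soon as a page comes back shorter than the limit, so a partition with exactly 2000 documents costs one extra, empty request.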
We use UOM conversions at this client. We stock in Eaches and sell in Cases. The problem we are having with the Pick ticket is that both the quantity to be picked and the UOM being picked are printed in the stocking unit, not the selling unit.
e.g. The customer orders 73 cases (12 ea per case). The pick ticket prints 876 each. This requires the warehouse person to look up each item, determine whether there is a Selling UOM and ratio, and then manually convert 876 eaches to 73 cases.
Obviously, the pick ticket should print 73 cases. But I cannot find a way to do this. The items are lotted, and an order of 73 cases might have 50 cases of Lot A and 23 cases of Lot B. This is represented in the SOShipLineSplit table. The quantities and UOM in this table are based on Stocking units.
Ideally, I could join the INUnit table to both the SOShipLine and SOShipLineSplit tables. See below.
Select case when isnull(U.UnitRate,0) = 0 then S.Qty else S.Qty/U.Unitrate end as ShipQty
,case when isnull(U.UnitRate,0) = 0 then s.uom else U.FromUnit end as UOM
from SOShipLineSplit S
inner join SOShipLine SL
ON S.CompanyID = SL.CompanyID and s.ShipmentNbr = SL.ShipmentNbr and S.LineNbr = SL.LineNbr and S.InventoryID = SL.InventoryID
Left Outer Join INUnit U
On S.CompanyID = U.CompanyID and S.InventoryID = U.InventoryID and s.UOm = U.ToUnit and SL.UOM = U.FromUnit
where S.ShipmentNbr = '000161' and S.CompanyId = 4
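For reference, the two CASE expressions in that query boil down to the arithmetic below, sketched in Python with the 876 each / 12 per case example. to_selling_units and the "EACH"/"CASE" labels are made up for the illustration; they are not Acumatica names.

```python
def to_selling_units(qty, stocking_uom, selling_uom=None, unit_rate=None):
    """Mirror the CASE expressions: convert only when a non-zero rate exists."""
    if not unit_rate:  # no INUnit row matched, or the rate is 0
        return qty, stocking_uom
    return qty / unit_rate, selling_uom


qty, uom = to_selling_units(876, "EACH", "CASE", 12)
unchanged = to_selling_units(876, "EACH")
```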
The problem is that the Acumatica report writer does not support a join against multiple tables, which this part of the query needs:
Left Outer Join INUnit U
On S.CompanyID = U.CompanyID and S.InventoryID = U.InventoryID and s.UOm = U.ToUnit and SL.UOM = U.FromUnit
I believe I must be missing something. This cannot be the only client using Acumatica who utilizes Selling Units of Measure. Is there another table I could use that would contain the quantities and UOM already converted for this order to Selling Units?
Or another solution?
Thanks in advance.
pat
EDIT:
If the goal is to display accurate quantities before/after conversion, then the INUnit DAC can't be used. It doesn't store historical data: you can change INUnit values after an order has been finalized, so re-using it to compute quantities will not yield accurate results.
For that scenario you would need to use the historical data fields with Base prefixes, like ShippedQuantity/BaseShippedQuantity. If you need to store more historical data, you need to add a custom field to hold these values and update it when the shipment is created or modified.
The main issue appears to be a logical error in the requirement:
The problem is that the INUnit table has to be joined to BOTH the
SOShipLine and the SOShipLineSplit tables.
The INUnit DAC has a single parent, not two, so you need to change your requirement to reflect that constraint.
If SOShipLine and SOShipLineSplit values differ then you'll never get any record.
If they are identical then there's no need to join on both since they have the same value.
I suggest adding two joins, one for SOShipLine and another for SOShipLineSplit. In the report you can choose which one to display (1st, 2nd or both).
You can also add visibility conditions or an IIF formula in the report if you want to handle null values for display purposes.
Use the Child Alias property in the schema builder to join the same table twice without name conflicts. In the report formulas (to display a field or in formula conditions) use the Child Alias table name too.
I want to create a Cassandra table with a list<int> column and insert an empty list:
CREATE TABLE test (
name text PRIMARY KEY,
scores list<int>,
);
INSERT INTO test (name, scores) VALUES ('John', []);
However, this returns null
SELECT * FROM test;
name |scores
------+------------
John | null
Does Cassandra not differentiate between null and empty list?
As always with Cassandra, the recommendation is: don't insert NULL and don't try to insert EMPTY values. It saves you tombstones, storage and I/O bandwidth.
The reason Cassandra doesn't differentiate between NULL and empty is the way deletes are handled. There is no read before deleting any record in Cassandra, so it just writes a tombstone and moves on.
So you actually get penalized for initializing the list as empty (essentially creating a tombstone).
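In practice this means leaving the column out of the INSERT and treating null as "empty" on the client side. A minimal Python sketch of that normalisation, assuming the client hands you None for a null column (scores_or_empty is a made-up helper, and no driver calls are shown):

```python
def scores_or_empty(value):
    """Treat a null (None) list column as an empty list on the client side."""
    return list(value) if value is not None else []


from_null_column = scores_or_empty(None)
from_real_column = scores_or_empty([90, 85])
```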
I've been learning Cassandra for the past few days and tried to create a data model for the following use case:
"Each Zipcode in US has a list of stores sorted based on a defined rank"
"Each store/warehouse has millions of SKUs and the inventory is tracked"
"If I search using a zipcode and SKU, it should return the best possible 100 stores
with inventory, based on the rank"
Assume store count is 1000+ and sku count is in millions
Design tried
One table with
ZipCode
Rank
StoreID
primary key (ZipCode, Rank)
Another table with
Sku
Store
Inventory
Primary Key (Sku, Store)
Now, if I want to search for the top 100 stores for each ZipCode, SKU
combination,
I have to search in table 1 for the top 100 stores and
then pull the inventory of each store from the second table.
Since the SKU count is in millions and the store count is 1000+, I'm not
sure if we can store all this in one table, with zipcode_sku as the row
key and stores and inventory stored as a wide row sorted by rank.
Am I thinking right? What could be other possible data models for this use case?
UPDATE: Data Loader Code (as mentioned in below comments)
println "Loading data started.."
(1..1000000).each { // SKUs
    sku = it.toString()
    (1..42000).each { // Zip Codes
        zipcode = it.toString().padLeft(5,"0")
        (1..1500).each { // Stores
            store = it.toString()
            int inventory = Math.abs(new Random().nextInt() % 10000) + 1
            session.execute("INSERT INTO ritz.rankedStoreByZipcodeAndSku(sku, zipcode, store, store_rank, inventory) " +
                "VALUES('$sku','$zipcode','$store',$it,$inventory);")
        }
    }
}
println "Data Loaded"
Cassandra is a wide-column store, so you can have wide rows, and you usually want one table for each kind of query you need to make. In this case:
CREATE TABLE storeByZipcodeAndSku (
sku text,
zipcode int,
store text,
store_rank int,
inventory int,
PRIMARY KEY ((sku, zipcode), store)
);
This way the row key is sku + zipcode, so it's a very fast lookup, and you can store up to 2 billion stores in it. When you update your inventory, also update this table. To get the top 100 you just pull down all of them and sort (1000s is not many), but if this operation is super common and you need it faster, you can instead use
CREATE TABLE rankedStoreByZipcodeAndSku (
...
PRIMARY KEY ((sku, zipcode), store_rank)
) WITH CLUSTERING ORDER BY (store_rank ASC);
to have it sorted for you automatically, so you just grab the top 100. Then, when you update it, you will want to use lightweight transactions to move things around atomically.
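For the first table, where you pull down every store for a (sku, zipcode) pair and sort on the client, the sort step could look like this Python sketch. It assumes a lower store_rank is better and that only stores with inventory count; top_stores and the sample rows are made up for the illustration, and no driver calls are shown.

```python
import heapq


def top_stores(rows, limit=100):
    """Return the `limit` best-ranked rows that still have inventory.

    rows: dicts with "store", "store_rank" and "inventory" keys.
    Assumes a lower store_rank means a better store.
    """
    in_stock = (row for row in rows if row["inventory"] > 0)
    return heapq.nsmallest(limit, in_stock, key=lambda row: row["store_rank"])


rows = [
    {"store": "s1", "store_rank": 3, "inventory": 5},
    {"store": "s2", "store_rank": 1, "inventory": 0},
    {"store": "s3", "store_rank": 2, "inventory": 9},
    {"store": "s4", "store_rank": 4, "inventory": 1},
]
best = top_stores(rows, limit=2)
```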
It sounds like you want to get a list of StoreIDs from the first table based on ZipCode, and a list of StoreIDs from the second table based on Sku, and then do a join. Since Cassandra is a simple key value store, it doesn't do joins. So you would have to either write code in your client to do the two queries and manually do the join, or connect Cassandra to Spark, which has a join function.
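A client-side join of the two result sets could look roughly like the Python sketch below. It assumes the first query returns (rank, store) pairs for the zipcode and the second returns the inventory per store for the SKU; join_ranked_inventory and the sample data are made up, and the queries themselves are not shown.

```python
def join_ranked_inventory(ranked_stores, inventory_by_store, limit=100):
    """Join ranked stores with per-store inventory, best rank first.

    ranked_stores: list of (rank, store_id) from the zipcode table.
    inventory_by_store: dict of store_id -> inventory from the SKU table.
    """
    result = []
    for rank, store_id in sorted(ranked_stores):
        stock = inventory_by_store.get(store_id, 0)
        if stock > 0:
            result.append((rank, store_id, stock))
            if len(result) == limit:
                break
    return result


ranked = [(2, "s7"), (1, "s4"), (3, "s9")]
inventory = {"s4": 0, "s7": 12, "s9": 3}
joined = join_ranked_inventory(ranked, inventory, limit=2)
```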
As you say, trying to denormalize the two tables into one table so that you could do this as one query might result in a very large and difficult to maintain table. If this is the only query pattern you will have, then that might be worth it, but if this is a general inventory system with a lot of different query patterns, then it might be too inflexible.
The other option would be to use an RDBMS instead of Cassandra, and then joins are super easy.