Range queries in Cassandra

The following works as expected. But how do I execute range queries like "where age > 40 and age < 50"?
create keyspace Keyspace1;
use Keyspace1;
create column family Users with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
set Users[jsmith][first] = 'John';
set Users[jsmith][last] = 'Smith';
set Users[jsmith][age] = long(42);
get Users[jsmith];
=> (column=age, value=42, timestamp=1341375850335000)
=> (column=first, value=John, timestamp=1341375827657000)
=> (column=last, value=Smith, timestamp=1341375838375000)

The best way to do this in Cassandra varies depending on your requirements, but the approaches are fairly similar for supporting these types of range queries.
Basically, you will take advantage of the fact that columns within a row are sorted by their names. So, if you use an age as the column name (or part of the column name), the row will be sorted by ages.
You will find a lot of similarities between this and storing time-series data. I suggest you take a look at Basic Time Series with Cassandra for the fundamentals, and the second half of an intro to the latest CQL features that gives an example of a somewhat more powerful approach.
The built-in secondary indexes are basically designed like a hash table, and don't work for range queries unless that range expression accompanies an equality expression on an indexed column. So, you could ask for select * from users where name = 'Joe' and age > 54, but not simply select * from users where age > 54, because that would require a full table scan. See Secondary Indexes doc for more details.
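To make the sorted-column idea concrete, here is a minimal pure-Python model of a single wide row whose column names start with the age. It is not Cassandra client code: the `(age, user_key)` column names, the sample users other than jsmith, and the `slice_by_age` helper are assumptions for illustration. In Cassandra the equivalent operation would be a column slice over one row.

```python
import bisect

# One wide "index" row: column names are (age, user_key) tuples kept in sorted
# order, mimicking how Cassandra sorts the columns of a row by name.
columns = sorted([(42, "jsmith"), (37, "adoe"), (48, "bkim"), (55, "cliu"), (40, "dray")])

def slice_by_age(cols, low, high):
    """Return columns with low < age < high (integer ages, both bounds exclusive)."""
    # (low + 1,) sorts before every (low + 1, key) tuple, so this is the
    # first column whose age is strictly greater than `low`.
    start = bisect.bisect_left(cols, (low + 1,))
    # First column whose age is >= `high`.
    end = bisect.bisect_left(cols, (high,))
    return cols[start:end]

matches = slice_by_age(columns, 40, 50)
print(matches)
```

Because the columns are already sorted, the range lookup is two binary searches plus a contiguous read, which is why this layout avoids a full scan.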

You have to create a secondary index on the column age:
update column family Users with column_metadata=[{column_name: age, validation_class: LongType, index_type: KEYS}];
Then use:
get Users where age > 40 and age < 50;
Note: I think exclusive operators are not supported since 1.2.
DataStax has good documentation about this: http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes Alternatively, you can create and maintain your own secondary index. This is a good link about that:
http://www.anuff.com/2010/07/secondary-indexes-in-cassandra.html

Related

Custom partitioning on JDBC in PySpark

I have a huge table in an Oracle database that I want to work on in PySpark. But I want to partition it using a custom query. For example, imagine there is a column in the table that contains the user's name, and I want to partition the data based on the first letter of the user's name. Or imagine that each record has a date, and I want to partition it based on the month. And because the table is huge, I absolutely need the data for each partition to be fetched directly by its executor and NOT by the master. So can I do that in PySpark?
P.S.: The reason I need to control the partitioning is that I need to perform some aggregations on each partition (partitions have meaning, not just to distribute the data), so I want them to be on the same machine to avoid any shuffles. Is this possible? Or am I wrong about something?
NOTE
I don't care about even or skewed partitioning! I want all the related records (like all the records of a user, or all the records from a city etc.) to be partitioned together, so that they reside on the same machine and I can aggregate them without any shuffling.
It turns out that Spark has a way of controlling the partitioning logic exactly: the predicates option in spark.read.jdbc.
What I came up with eventually is as follows:
(For the sake of the example, imagine that we have the purchase records of a store, and we need to partition them based on userId and productId so that all the records of an entity are kept together on the same machine, and we can perform aggregations on these entities without shuffling.)
First, produce the histogram of every column that you want to partition by (count of each value):
userId    count
123456    1640
789012    932
345678    1849
901234    11
...       ...

productId    count
123456789    5435
523485447    254
363478326    2343
326484642    905
...          ...
Then, use the multifit algorithm to divide the values of each column into n balanced bins (n being the number of partitions that you want).
userId    bin
123456    1
789012    1
345678    1
901234    2
...       ...

productId    bin
123456789    1
523485447    2
363478326    2
326484642    3
...          ...
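As a rough illustration of the binning step, the sketch below assigns values to bins with a simple greedy heuristic: largest count first, always into the currently lightest bin. This is deliberately not the multifit algorithm mentioned above, only a simpler stand-in that shows the shape of the input and output. The `assign_bins` name is hypothetical, and the resulting bin numbers will not necessarily match the illustrative tables.

```python
import heapq

def assign_bins(histogram, n_bins):
    """Greedy balancing: place the largest remaining value into the lightest bin.

    histogram: dict mapping a column value (e.g. userId) to its record count.
    Returns a dict mapping each value to a 1-based bin number.
    """
    # Min-heap of (current_load, bin_number), so the lightest bin pops first.
    heap = [(0, b) for b in range(1, n_bins + 1)]
    heapq.heapify(heap)
    bins = {}
    for value, count in sorted(histogram.items(), key=lambda kv: kv[1], reverse=True):
        load, b = heapq.heappop(heap)
        bins[value] = b
        heapq.heappush(heap, (load + count, b))
    return bins

# Counts taken from the userId histogram above.
user_hist = {123456: 1640, 789012: 932, 345678: 1849, 901234: 11}
user_bins = assign_bins(user_hist, n_bins=2)
print(user_bins)
```

The returned mapping is what would be written to a table such as USER_PARTITION, one row per value with its bin number.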
Then store these bin assignments in the database, for example as USER_PARTITION and PRODUCT_PARTITION tables.
Then update your query to join on these tables and get the bin numbers for every record:
url = 'jdbc:oracle:thin:username/password@address:port:dbname'
# Subquery passed as the JDBC "table"; it attaches the bin numbers to each record.
query = """
(SELECT
    MY_TABLE.*,
    USER_PARTITION.BIN AS USER_BIN,
    PRODUCT_PARTITION.BIN AS PRODUCT_BIN
FROM MY_TABLE
LEFT JOIN USER_PARTITION
    ON MY_TABLE.USER_ID = USER_PARTITION.USER_ID
LEFT JOIN PRODUCT_PARTITION
    ON MY_TABLE.PRODUCT_ID = PRODUCT_PARTITION.PRODUCT_ID) MY_QUERY"""
# `predicates` is the list of per-partition WHERE conditions defined below.
df = spark.read \
    .option('driver', 'oracle.jdbc.driver.OracleDriver') \
    .jdbc(url=url, table=query, predicates=predicates)
Finally, generate the predicates, one for each partition, like these:
predicates = [
'USER_BIN = 1 OR PRODUCT_BIN = 1',
'USER_BIN = 2 OR PRODUCT_BIN = 2',
'USER_BIN = 3 OR PRODUCT_BIN = 3',
...
'USER_BIN = n OR PRODUCT_BIN = n',
]
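Rather than writing the list by hand, the predicates can be generated. A minimal sketch, assuming the bins are numbered 1 to n as in the tables above (the `build_predicates` helper name is hypothetical):

```python
def build_predicates(n_partitions):
    """Build one WHERE condition per partition, for bins numbered 1..n_partitions."""
    return [
        f"USER_BIN = {i} OR PRODUCT_BIN = {i}"
        for i in range(1, n_partitions + 1)
    ]

predicates = build_predicates(3)
for p in predicates:
    print(p)
```

Spark creates one DataFrame partition per entry in this list, so the length of the list is the number of partitions that will be read.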
The predicates are added to the query as WHERE clauses, which means that all the records of the users in partition 1 go to the same machine. Also, all the records of the products in partition 1 go to that same machine as well.
Note that there are no relations between the user and the product here. We don't care which products are in which partition or are sent to which machine.
But since we want to perform some aggregations on both the users and the products (separately), we need to keep all the records of an entity (user or product) together. And using this method, we can achieve that without any shuffles.
Also, note that if there are some users or products whose records don't fit in the workers' memory, then you need to do a sub-partitioning. Meaning that you should first add a new random numeric column to your data (between 0 and some chunk_size like 10000 or something), then do the partitioning based on the combination of that number and the original IDs (like userId). This causes each entity to be split into fixed-sized chunks (i.e., 10000) to ensure it fits in the workers' memory.
And after the aggregations, you need to group your data on the original IDs to aggregate all the chunks together and make each entity whole again.
The shuffle at the end is inevitable because of our memory restriction and the nature of our data, but this is the most efficient way you can achieve the desired results.
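The two-stage idea (aggregate per chunk, then merge the chunks per entity) can be sketched in plain Python without Spark. Everything here is an assumption for illustration: the `(user_id, amount)` record shape, the sum aggregation, and the helper names. In Spark the two stages would be two grouped aggregations.

```python
import random
from collections import defaultdict

def partial_sums(records, num_chunks, seed=0):
    """Stage 1: sum amounts per (user_id, chunk), so one large user is split up."""
    rng = random.Random(seed)
    sums = defaultdict(float)
    for user_id, amount in records:
        chunk = rng.randrange(num_chunks)  # the random sub-partitioning column
        sums[(user_id, chunk)] += amount
    return sums

def merge_chunks(sums):
    """Stage 2: group on the original ID to make each entity whole again."""
    totals = defaultdict(float)
    for (user_id, _chunk), partial in sums.items():
        totals[user_id] += partial
    return dict(totals)

records = [(1, 10.0), (1, 5.0), (2, 7.0), (1, 2.5), (2, 3.0)]
totals = merge_chunks(partial_sums(records, num_chunks=4))
print(totals)
```

This only works directly for aggregations that can be merged from partial results, such as sums, counts, minima and maxima.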

UOM and UOM ratio incorrect on Pick Ticket

We use UOM conversions at this client. We stock in Eaches and sell in Cases. The problem we are having with the Pick ticket is that both the quantity to be picked and the UOM being picked are the stocking unit and not the selling unit.
For example, the customer orders 73 cases (12 ea per case). The pick ticket prints 876 each. This requires the warehouse person to look up each item, determine whether there is a selling UOM and ratio, and then manually convert 876 eaches to 73 cases.
Obviously, the pick ticket should print 73 cases. But I cannot find a way to do this. The items are lotted, and an order of 73 cases might have 50 cases of Lot A and 23 cases of Lot B. This is represented in the SOShipLineSplit table. The quantities and UOM in this table are based on stocking units.
Ideally, I could join the INUnit table to both the SOShipLine and SOShipLineSplit tables. See below.
Select case when isnull(U.UnitRate,0) = 0 then S.Qty else S.Qty/U.Unitrate end as ShipQty
,case when isnull(U.UnitRate,0) = 0 then s.uom else U.FromUnit end as UOM
from SOShipLineSplit S
inner join SOShipLine SL
ON S.CompanyID = SL.CompanyID and s.ShipmentNbr = SL.ShipmentNbr and S.LineNbr = SL.LineNbr and S.InventoryID = SL.InventoryID
Left Outer Join INUnit U
On S.CompanyID = U.CompanyID and S.InventoryID = U.InventoryID and s.UOm = U.ToUnit and SL.UOM = U.FromUnit
where S.ShipmentNbr = '000161' and S.CompanyId = 4
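The conversion that the CASE expressions perform can be written out as a small Python sketch using the numbers from the example. The function name and the lot split in stocking units (600 and 276 eaches, that is 50 and 23 cases at 12 per case) are assumptions for illustration, not Acumatica code.

```python
def to_selling_units(base_qty, base_uom, unit_rate, selling_uom):
    """Mirror the CASE logic: fall back to stocking units when no rate exists."""
    if not unit_rate:  # covers both None (no INUnit row) and 0
        return base_qty, base_uom
    return base_qty / unit_rate, selling_uom

# Whole line: 876 eaches at 12 per case.
line_qty, line_uom = to_selling_units(876, "EA", 12, "CASE")

# Per-lot splits, stored in stocking units as in SOShipLineSplit.
splits = [("Lot A", 600), ("Lot B", 276)]
converted = [(lot, *to_selling_units(qty, "EA", 12, "CASE")) for lot, qty in splits]

print(line_qty, line_uom)
print(converted)
```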
The problem is the Acumatica Report writer does not support a join with multiple tables.
Left Outer Join INUnit U
On S.CompanyID = U.CompanyID and S.InventoryID = U.InventoryID and s.UOm = U.ToUnit and SL.UOM = U.FromUnit
I believe I must be missing something. This cannot be the only client using Acumatica who utilizes Selling Units of Measure. Is there another table I could use that would contain the quantities and UOM already converted for this order to Selling Units?
Or another solution?
Thanks in advance.
pat
EDIT:
If the goal is to display accurate quantities before/after conversion then INUnit DAC can't be used. It doesn't store historical data, you can change INUnit values after an order has been finalized so re-using it to compute quantities will not yield accurate results.
For that scenario you would need to use the historical data fields with Base prefixes, like ShippedQuantity/BaseShippedQuantity. If you need to store more historical data, add a custom field to hold these values and update it when the shipment is created or modified.
The main issue appears to be a logical error in the requirement:
"The problem is that the INUnit table has to be joined to BOTH the SOShipLine and the SOShipLineSplit tables."
INUnit DAC has a single parent, not 2 so you need to change your requirement to reflect that constraint.
If SOShipLine and SOShipLineSplit values differ then you'll never get any record.
If they are identical then there's no need to join on both since they have the same value.
I suggest adding two joins, one for SOShipLine and another for SOShipLineSplit. In the report you can choose which one to display (the first, the second, or both).
You can also add visibility conditions or an IIF formula condition in the report if you want to check for null values before displaying them.
Use the Child Alias property in schema builder to join the same table 2 times without name conflicts. In the report formulas (to display field or in formula conditions) use the Child Alias table name too.

Cassandra Range Query

I have a set of products (product_Id, price).
The price of every product keeps changing and hence needs to be updated very frequently.
I want to perform a range query on prices:
select * from products where price > 10 and price < 100;
Please note: I only want to get the products in the range. The exact query does not matter.
What is the best possible way to model this scenario? I'm using cassandra 2.1.9.
If price is a clustering column, you can only run range queries on it together with the partition key. E.g.
Your table:
products (productId text, price float, PRIMARY KEY(productId, price))
Your range query:
SELECT * FROM products
WHERE productId = 'ysdf834234' AND price < 1000 AND price > 30;
But I think this query is not very useful. If you need price ranges without the partition key, you need a new table. I also think a Cassandra table with only two columns is poor database design; for your use case a pure key-value store (like Redis) would be a better option. However, if you also add columns such as productType, productVariation, productColor, productBrand and so on, then Cassandra is a good option for you, and you can create tables like:
productsByType_price PRIMARY KEY(productType, productPrice, productId)
productsByType_color PRIMARY KEY(productType, productColor, productId)
productsByType_brand PRIMARY KEY(productType, productBrand, productId)
etc.
One tip: read a bit more about how Cassandra manages its data. This really helps with data modelling.
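The following pure-Python model sketches how a table like productsByType_price behaves, including frequent price changes. It is an illustration, not driver code: the helper names and sample products are hypothetical. One point it makes explicit is that, because the price is part of the primary key, a price change has to be modelled as deleting the old row and inserting a new one.

```python
import bisect
from collections import defaultdict

# partition key (productType) -> sorted list of (productPrice, productId)
products_by_type_price = defaultdict(list)

def insert(product_type, price, product_id):
    # insort keeps the clustering order (price, then productId) intact.
    bisect.insort(products_by_type_price[product_type], (price, product_id))

def change_price(product_type, product_id, old_price, new_price):
    # Price is part of the key, so an update is a delete plus an insert.
    products_by_type_price[product_type].remove((old_price, product_id))
    insert(product_type, new_price, product_id)

def price_range(product_type, low, high):
    """Products of one type with low < price < high, in price order."""
    rows = products_by_type_price[product_type]
    return [row for row in rows if low < row[0] < high]

insert("book", 15.0, "p1")
insert("book", 250.0, "p2")
insert("book", 8.0, "p3")
change_price("book", "p3", 8.0, 42.0)
in_range = price_range("book", 10, 100)
print(in_range)
```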

Cassandra design approach for my sample use case

I have been learning Cassandra for the past few days and tried to create a data model for the following use case:
"Each zip code in the US has a list of stores sorted by a defined rank."
"Each store/warehouse has millions of SKUs, and the inventory is tracked."
"If I search using a zip code and SKU, it should return the best possible 100 stores with inventory, based on the rank."
Assume the store count is 1000+ and the SKU count is in the millions.
Design tried
One table with
ZipCode
Rank
StoreID
primary key (ZipCode, Rank)
Another table with
Sku
Store
Inventory
Primary Key (Sku, Store)
Now, if I want to search for the top 100 stores for each ZipCode, SKU combination, I have to search table 1 for the top 100 stores and then pull the inventory of each store from the second table.
Since the SKU count is in the millions and the store count is 1000+, I'm not sure whether we can store all this in one table, with zipcode_sku as the row key and the stores and inventory stored as a wide row sorted by rank.
Am I thinking right? What could be other possible data models for this use case?
UPDATE: Data Loader Code (as mentioned in below comments)
println "Loading data started.."
(1..1000000).each { // SKUs
    sku = it.toString()
    (1..42000).each { // Zip Codes
        zipcode = it.toString().padLeft(5,"0")
        (1..1500).each { // Stores
            store = it.toString()
            int inventory = Math.abs(new Random().nextInt() % 10000) + 1
            session.execute("INSERT INTO ritz.rankedStoreByZipcodeAndSku(sku, zipcode, store, store_rank, inventory) " +
                "VALUES('$sku','$zipcode','$store',$it,$inventory);")
        }
    }
}
println "Data Loaded"
Cassandra is a wide-column store, so you typically design a table with wide rows for each kind of query you want to make. In this case:
CREATE TABLE storeByZipcodeAndSku (
sku text,
zipcode int,
store text,
store_rank int,
inventory int,
PRIMARY KEY ((sku, zipcode), store)
);
This way the row key is sku + zipcode, so it's a very fast lookup, and you can store up to 2 billion stores in it. When you update your inventory, also update this table. To get the top 100 you just pull down all of them and sort (thousands is not many), but if this operation is very common and you need it to be faster, you can instead use
CREATE TABLE rankedStoreByZipcodeAndSku (
...
PRIMARY KEY ((sku, zipcode), store_rank)
) WITH CLUSTERING ORDER BY (store_rank ASC);
to have it sorted for you automatically and you just grab the top 100. Then when you update it you will want to use the lightweight transactions to move things around atomically.
It sounds like you want to get a list of StoreIDs from the first table based on ZipCode, and a list of StoreIDs from the second table based on Sku, and then do a join. Since Cassandra is a simple key-value store, it doesn't do joins. So you would have to either write code in your client to do the two queries and manually do the join, or connect Cassandra to Spark, which has a join function.
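A client-side join of the two query results could look roughly like this. The sketch assumes the first query has already returned `(rank, store_id)` pairs for one zip code and the second has returned a store-to-inventory mapping for one SKU. The function name and sample data are hypothetical, and no Cassandra driver calls are shown.

```python
def top_stores_with_inventory(ranked_stores, inventory_by_store, limit=100):
    """Join ranked stores (query 1) with per-store inventory (query 2).

    Returns up to `limit` (rank, store_id, inventory) tuples, best rank first,
    skipping stores that have no stock for the SKU.
    """
    result = []
    for rank, store_id in sorted(ranked_stores):
        inventory = inventory_by_store.get(store_id, 0)
        if inventory > 0:
            result.append((rank, store_id, inventory))
            if len(result) == limit:
                break
    return result

ranked = [(2, "s2"), (1, "s1"), (3, "s3"), (4, "s4")]
inventory = {"s1": 0, "s2": 12, "s4": 3}
top = top_stores_with_inventory(ranked, inventory, limit=2)
print(top)
```

With 1000+ stores per zip code this loop is cheap. The cost is in the second query, which has to fetch inventory for every candidate store.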
As you say, trying to denormalize the two tables into one table so that you could do this as one query might result in a very large and difficult to maintain table. If this is the only query pattern you will have, then that might be worth it, but if this is a general inventory system with a lot of different query patterns, then it might be too inflexible.
The other option would be to use an RDBMS instead of Cassandra, and then joins are super easy.

Cassandra BETWEEN & ORDER BY operations

I want to perform SQL operations such as BETWEEN and ORDER BY with ASC/DESC order on Cassandra 0.7.8.
As far as I know, Cassandra 0.7.8 does not directly support these operations. Is there a way to accomplish them by making use of secondary indexes?
Below is my Data model design.
Emp(KS){
User(CF):{
bsanderson(RowKey): { eno, name, dept, dob, email }
prothfuss(RowKey): { eno, name, dept, dob, email }
}
}
Queries:
- Select * from emp where dept='IT' ORDER BY dob ASC.
- Select * from emp where eno BETWEEN ? AND ? ORDER BY dob ASC.
Thanks in advance.
Regards,
Thamizhananl
Select * from emp where dept='IT' ORDER BY dob ASC.
You can select rows where the 'dept' column has a certain value, by using the built-in secondary indexes. However, the rows will be returned in the order determined by the partitioner (RandomPartitioner or OrderPreservingPartitioner). To order by arbitrary values such as DOB, you would need to sort at the client.
Or, you could support this query directly by having a row for each dept, and a column for each employee, keyed (and therefore sorted) by DOB. But be careful of shared birthdays! And you'd still need subsequent queries to retrieve other data (the results of your SELECT *) for the employees selected, unless you denormalise so that the desired data is stored in the index too.
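The shared-birthday caveat is easy to see in a small Python model of the per-dept row. Column names within a row must be unique, so keying columns by DOB alone lets one employee overwrite another, while a composite (dob, eno) name keeps both and still sorts by DOB. The dates and the third employee are made up for illustration.

```python
from datetime import date

employees = [
    {"eno": 7, "name": "bsanderson", "dept": "IT", "dob": date(1980, 1, 1)},
    {"eno": 3, "name": "prothfuss", "dept": "IT", "dob": date(1979, 5, 5)},
    {"eno": 9, "name": "jdoe", "dept": "IT", "dob": date(1980, 1, 1)},  # same DOB
]

# Keyed by DOB alone: the two employees born on the same day collide.
by_dob_only = {e["dob"]: e["name"] for e in employees}

# Keyed by (dob, eno): every employee keeps a distinct column, sorted by DOB.
dept_row = {(e["dob"], e["eno"]): e["name"] for e in employees}
ordered = [dept_row[key] for key in sorted(dept_row)]

print(len(by_dob_only), ordered)
```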
Select * from emp where eno BETWEEN ? AND ? ORDER BY dob ASC.
Secondary index querying in Cassandra requires at least one equality term, so I think you can do dept='IT' AND eno >= x AND eno <= y, but not just a BETWEEN-style query.
You could do this by creating your own index row, with a column for each employee, keyed on the employee number, with an appropriate comparator so all the columns are automatically sorted in employee-number order. You could then do a range query on that row to get a list of matching employees - but you would need further queries to retrieve other data for each employee (dob etc), unless you denormalise so that the desired data is stored in the index too. You would still need to do the dob ordering at the client.
As far as I know, columns are sorted by the comparator you specify when you create the column family, and you can use a clustering key to sort them as you wish. Rows in a column family are sorted by the partitioner.
I suggest you read Chapter 6 of Cassandra: The Definitive Guide.
