What's the right pattern to reduce the number of rows within one table in Cassandra? - cassandra

By reducing I mean some form of caching where 100 rows can be replaced with 1 row (accumulated counters etc.).
I want to be able to answer the query "how many people are from %EACH_COUNTRY", so basically return an array of pairs / a map of (Country, COUNT). I've got a huge number of people (think 50 * 10^8), so I can't allocate 1 row for each person; I'd like to cache the results somehow to keep PeopleTable under 10^6 entries at least (and then merge the results with a fast read from CacheTable). By caching I mean: count the number of people with country=%SPECIFIC_COUNTRY, write (%SPECIFIC_COUNTRY, COUNT(*)) to CacheTable (to be precise, increment the count for %SPECIFIC_COUNTRY), and then remove these rows from PeopleTable:
personId, country
1132312312, Russia
2344333333, the USA
1344111112, France
1133555555, Russia
1132666666, Russia
3334124124, Russia
....
and then
CacheTable
country, count
Russia, 4
France, 1
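The roll-up described above can be sketched in plain Python. This is only an in-memory illustration of the idea, not Cassandra code: the function names `compact` and `count_by_country` and the extra row added after compaction are made up for the example, and the sketch says nothing about how to make the increment-then-delete step safe against failures or retries in a real cluster.

```python
from collections import Counter

def compact(people_rows, cache_counts):
    """Fold raw (person_id, country) rows into per-country counters.

    Returns the updated counters and an empty list standing in for the
    raw table after its rows have been removed.
    """
    updated = Counter(cache_counts)
    updated.update(country for _person_id, country in people_rows)
    return updated, []

def count_by_country(people_rows, cache_counts):
    """Answer the query: cached counts plus rows not yet compacted."""
    total = Counter(cache_counts)
    total.update(country for _person_id, country in people_rows)
    return dict(total)

people = [
    (1132312312, "Russia"),
    (2344333333, "the USA"),
    (1344111112, "France"),
    (1133555555, "Russia"),
    (1132666666, "Russia"),
    (3334124124, "Russia"),
]

cache, people = compact(people, {})
people.append((5550000001, "France"))  # hypothetical row arriving after compaction
result = count_by_country(people, cache)
```

The query always merges both sources, so a row is counted once whether it has already been compacted or not.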

Related

DAX - compare selected group averages with category averages

I have data on objects which belong to different categories. I want to be able to compare the average across a selection of objects to the averages across the categories to which the selected objects belong. I have written out measures, but they do not produce the expected results.
My data looks like this. I am using Power Pivot to set up the data model for MS Excel pivot charts.
Table 1 has unique stores (Store names are guaranteed to be unique)
Store | Branch | Region
Store A | North 1 | Plain
Store B | North 1 | Plain
Store K | West 3 | Plain
Store F | West 3 | Plain
Store T | East 1 | Coast
Store P | East 1 | Coast
Table 2
Store | Area, sq ft
Store A | 3000
Store B | 4000
Store K | 2000
Store F | 5000
Store T | 5000
Store P | 4000
Table 3
Store | Year | Month | Expenses
Store A | 2022 | September | 10000
Store A | 2022 | October | 15000
Store B | 2022 | September | 20000
Store B | 2022 | October | 22000
There is more than one year included in the dataset.
Table 2 and 3 are connected to table 1.
First, I write measures that I expect to compute costs per sq ft for selected objects.
Costs:=sum('Table 3'[Expenses])
Area:= sum('Table 2'[Area, sq ft])
Costs_per_sq_ft:= [Costs] / [Area]
Then, I write identical measures for averages:
Costs_avg:=average('Table 3'[Expenses])
Area_avg:= average('Table 2'[Area, sq ft])
Costs_per_sq_ft_avg:= [Costs_avg] / [Area_avg]
Finally, I define measures to average across a selected group (assuming all selected elements belong to the same category):
Costs_avg_branch:=var StoreBranch = max('Table 1'[Branch]) return calculate([Costs_avg], filter(all('Table 1');'Table 1'[Branch] = StoreBranch))
Area_avg_branch:=var StoreBranch = max('Table 1'[Branch]) return calculate([Area_avg], filter(all('Table 1');'Table 1'[Branch] = StoreBranch))
Costs_per_sq_ft_avg_branch:=var StoreBranch = max('Table 1'[Branch]) return calculate([Costs_per_sq_ft_avg], filter(all('Table 1');'Table 1'[Branch] = StoreBranch))
and identical measures with Region as the variable.
On selection of Store A and September 2022, I expected to have
Selection | Costs_avg | Costs_avg_branch
Store A | 3.33 | 4.28
i.e. the average for the selected store and the average for the branch where it belongs.
On selection of Store A and September-October I expected to have:
Selection | Costs_avg | Costs_avg_branch
Store A | 4.17 | 4.78
( average over chosen period for a selected store and the same average for the branch).
On selection of the entire branch I intended the average across selection to match that of the category. E.g., for stores A, B in September-October 2022:
Selection | Costs_avg | Costs_avg_branch
North 1 | 4.78 | 4.78
Unfortunately, the averages for individual selected objects seem to be consistently near zero. When I select entire branches, the object average and the branch average do not match.
Is there any way to obtain the correct averages? Is it possible to get the desired behavior when objects from different categories are selected, as I originally wanted?
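For reference, the expected figures can be reproduced outside Power Pivot as average expenses divided by average area, computed once over the selected store and once over every store in its branch. The Python below is only a check of that arithmetic on the sample rows, not a DAX solution; the helper names `cost_per_sq_ft_avg` and `branch_stores` are made up for the example.

```python
from statistics import mean

# Sample rows for branch North 1, taken from Tables 1-3 above.
branch_of = {"Store A": "North 1", "Store B": "North 1"}
area = {"Store A": 3000, "Store B": 4000}
expenses = [
    ("Store A", 2022, "September", 10000),
    ("Store A", 2022, "October", 15000),
    ("Store B", 2022, "September", 20000),
    ("Store B", 2022, "October", 22000),
]

def cost_per_sq_ft_avg(stores, months):
    """Average expense over the selection divided by average store area."""
    costs = [e for s, _year, m, e in expenses if s in stores and m in months]
    return mean(costs) / mean(area[s] for s in stores)

def branch_stores(store):
    """All stores in the same branch as `store`."""
    return {s for s, b in branch_of.items() if b == branch_of[store]}

both_months = {"September", "October"}
store_sep = cost_per_sq_ft_avg({"Store A"}, {"September"})
branch_sep = cost_per_sq_ft_avg(branch_stores("Store A"), {"September"})
store_both = cost_per_sq_ft_avg({"Store A"}, both_months)
branch_both = cost_per_sq_ft_avg(branch_stores("Store A"), both_months)
```

The four values come out at about 3.33, 4.29, 4.17 and 4.79; the tables above list the two branch figures as 4.28 and 4.78.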

Excel Vlookup Multiple Values

I am looking for a vlookup formula that returns multiple matches using two lookup values. I am currently trying to use the concatenate method, but I haven't quite figured it out. The table needs to return all of the multiple matches, not just one. Currently, it's only returning the last match.
For example, let's say I have a list of multiple cities and states. The cities differ, but the states obviously repeat. I want to return the number of people in each city.
City State #OfPeople
Albany NY 10
Orlando FL 5
Tampa FL 3
Seattle WA 1
Queens NY 8
So I concatenated the city and state column.
Join City State #OfPeople
Albany-NY Albany NY 10
Orlando-FL Orlando FL 5
Tampa-FL Tampa FL 3
Seattle-WA Seattle WA 1
Queens-NY Queens NY 8
The purpose of this is to create an updated log of people in each city as time progresses. I want to have a grand total amount of people in each column. (I know this requires another formula. I'm just focused on returning multiple matches for now.) However, I don't want to overwrite the existing data. Hopefully, I explained this well. This is just an example of a larger project I'm working on. I need to be able to build on this list. That's why it's important that I be able to return matches multiple times.
Join City State #OfPeople Total
Albany-NY Albany NY 10 10
Orlando-FL Orlando FL 5 15
Tampa-FL Tampa FL 3 18
Seattle-WA Seattle WA 1 19
Queens-NY Queens NY 8 27
Any help would be greatly appreciated!
Considering you're trying to get some grand totals based on multiple criteria, I would suggest using the SUMIFS() / COUNTIFS() functions rather than focusing on finding the matching rows themselves.
However, if you do need a multiple-criteria lookup for some reason, I believe an INDEX() + MATCH() combination can do the job.
"The table needs to return all of the multiple matches not just one. Currently, its only returning the last match"
You'll need to use SUMIFS() if there are multiple records for the same city/state combo in your people lookup.
=SUMIFS (sum_range, range1, criteria1, [range2], [criteria2], ...)
Let's assume that you have a cities tab and a people tab. Let's assume you have ten cities that you want to return the total amount of people from.
Cities Tab definition
City range: 'Cities'!A$1:A$10
State range: 'Cities'!B$1:B$10
People Tab definition
City range: 'People'!A$1:A$100
State range: 'People'!B$1:B$100
#OfPeople range: 'People'!C$1:C$100
Drop this formula in the first row of your cities tab, drag down the entire range of cities.
=SUMIFS('People'!C$1:C$100, 'People'!A$1:A$100, 'Cities'!A1, 'People'!B$1:B$100, 'Cities'!B1)
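The grouping that SUMIFS performs with two criteria can be sketched in Python as a running sum keyed by (city, state). This is only an illustration of the logic; the function name `sum_ifs` and the second Albany row are made up to show what happens when the same city/state pair is logged more than once.

```python
from collections import defaultdict

# (city, state, number_of_people) rows, as on the People tab.
people = [
    ("Albany", "NY", 10),
    ("Orlando", "FL", 5),
    ("Tampa", "FL", 3),
    ("Seattle", "WA", 1),
    ("Queens", "NY", 8),
    ("Albany", "NY", 4),  # hypothetical later log entry for the same city
]

def sum_ifs(rows):
    """Total people per (city, state), summing every matching row."""
    totals = defaultdict(int)
    for city, state, count in rows:
        totals[(city, state)] += count
    return dict(totals)

totals = sum_ifs(people)
```

Every matching row contributes to the total, which is the behaviour a single-match lookup cannot give.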

Excel lookup value for multiple criteria and multiple columns

I am helping a friend with some data analysis in Excel.
Here's how our data looks like:
Car producer | Classification | Prices from 9 different vendors in 9 columns
AUDI | C | 100 200 300 400 500 600 700 800 900
AUDI | C | 100 900 800 200 700 300 600 400 500
AUDI | B | .. ..
Now, for each classification and each producer, we produced a list that shows which of the 9 vendors offers the lowest price most often (in terms of count; for example, there are 2 cars from AUDI in the C class, and vendor A offers the lowest price for both).
What we need: A way to calculate the average price for this vendor. So, if we see that the vendor A has the lowest price for AUDI cars in the C class, then we want to know the average price for vendor A for these cars.
I'm quite stumped since I can't use the "standard" index-match-small approach since the prices are stored in 9 different columns.
I've suggested to use a long if-chain like this: =if(vendor=A,averageif(enter the criteria and select the column of vendor A for average values),if(vendor=B,average(enter the criteria and select the column of vendor B for average values),... etc.).
But this method is obviously limited and does not scale well to higher dimensions.
We also would like to avoid using any addons.
You're going to need to create a separate table that has all unique classifications in the rows and all dealers in the columns (same as yours, but with duplicate rows removed). Then, in each cell, take the average price for that classification*vendor combination. This can be done by using a combination of sumif/countif. For example, if your second table had a column for classifications in cells M2:M[end], calculating the average price for the Audi C class offered by vendor 1 could be:
=sumif($B$2:$B$[end],"="&$M2,C$2:C$[end])/countif($B$2:$B$[end],"="&$M2)
This would look something like this:
Then you could simply find the cheapest vendor by matching the min price. For example, the cheapest vendor for the audi C class in my example image would be:
=index($N$1:$V$1,match(min($N2:$V2),$N2:$V2,0))
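The same two steps (average each vendor column per group, then pick the column with the smallest average) can be sketched in Python. It uses only the two AUDI C rows from the question; the function name `average_by_group` is made up, and vendors are identified by their 0-based column position because the question does not name them.

```python
from collections import defaultdict
from statistics import mean

# (producer, classification, prices from the 9 vendors) rows.
rows = [
    ("AUDI", "C", [100, 200, 300, 400, 500, 600, 700, 800, 900]),
    ("AUDI", "C", [100, 900, 800, 200, 700, 300, 600, 400, 500]),
]

def average_by_group(rows):
    """Average price per vendor column for each (producer, classification)."""
    grouped = defaultdict(list)
    for producer, classification, prices in rows:
        grouped[(producer, classification)].append(prices)
    # zip(*price_rows) turns the rows of a group into per-vendor columns.
    return {
        key: [mean(column) for column in zip(*price_rows)]
        for key, price_rows in grouped.items()
    }

averages = average_by_group(rows)
audi_c = averages[("AUDI", "C")]
cheapest_vendor = audi_c.index(min(audi_c))  # 0-based vendor position
```

Because the vendor columns are handled by position, the sketch does not change if more vendors are added, which is the scaling problem the nested IF chain runs into.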
A lot of this could be done using PivotTables. If it is a one-off thing, I would go that route; if it needs to be automated, then try using a multi-conditional VLOOKUP (it needs to be entered as an array formula with CTRL+SHIFT+ENTER). This is simply an example, not based on your data:
{=VLOOKUP(A11&B11,CHOOSE({1\2},A2:A7&B2:B7,C2:C7),2,0)}
A better explanation is given on Chandoo's site: http://chandoo.org/wp/2014/10/28/multi-condition-vlookup/

Using Excel's Rank() function to calculate allocations based on ranking and constraints

I have the following table set up
Limit Allocation Yield Ranking
$600 [to calc] 0.07% 7
$600 0.09% 6
$600 0.20% 1
$400 0.20% 1
$400 0.13% 4
$200 0.19% 3
$200 0.12% 5
Additionally, I have a constraint: I can only allocate a total of $2000 across the 7 rows here, in order of their yield ranking (so a higher yield gets everything allocated up to its Limit column, as long as there is anything left of the $2000 total).
I was wondering how I could set up the equations so that the allocation is performed automatically. Thanks!
I'm going to assume this table starts in A1...
In E1, put the amount you have to allocate
In B2 (and then copied to B3...B8) use the following formula
=MIN(A2,$E$1-SUMIF($D$2:$D$8,">"&D2,$B$2:$B$8))
This will work out how much has been taken by the higher-ranked rows and take the rest, up to whichever is the lesser of the row's limit and what is left in the pot.
There is one fault with this equation that you will need to figure out how to handle:
If there are equal ranks at the end of the distribution, then both will get the final amount. (e.g. try this with $2,001, and you will see that the 2 rows that have rank 1 will both claim the final dollar)
To solve the problem caused by tied ranks: in the rank column D, add a small tiebreaker to the rank, =rank(c2,$c$2:$c$8,0) + (.0000001 * row(a2)), adjusting for whatever row you are in. Then format the rank column to only show integers. The very small decimal added to the rank acts as the tiebreaker, so the first row with a given integer rank takes the allocation. Since you are only adding it to the rank, it doesn't affect any totals. With the column format set to display integers, the viewer will not be aware of the tiebreaker.
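As a cross-check on the spreadsheet approach, the intended allocation can be sketched in Python: process rows from highest yield to lowest, give each row the lesser of its limit and what remains, and let the earlier row win a tie (the same effect as the row-based tiebreaker). The function name `allocate` is made up for the example; the limits and yields are the ones from the question.

```python
def allocate(limits, yields, total):
    """Greedy allocation: highest yield first, earlier row wins ties."""
    order = sorted(range(len(limits)), key=lambda i: (-yields[i], i))
    allocation = [0] * len(limits)
    remaining = total
    for i in order:
        allocation[i] = min(limits[i], remaining)
        remaining -= allocation[i]
    return allocation

limits = [600, 600, 600, 400, 400, 200, 200]
yields = [0.07, 0.09, 0.20, 0.20, 0.13, 0.19, 0.12]  # percent, in table order

allocation = allocate(limits, yields, 2000)
small_pot = allocate(limits, yields, 700)
```

With $2000 the result in table order is 0, 200, 600, 400, 400, 200, 200. With a pot of $700 the two tied 0.20% rows receive 600 and 100, so the tie never hands out more than is available.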

Read a text file and store it in the hashmap

I am reading a file called Expenses.txt...I want to store it in a hashmap with repeated entries of items
The text file contains data on several lines, where each line (a record) consists of two fields: category name (a string), and its value (a number). For example, the file below shows expenses by category.
Input
Expenses.txt
cosmetics 100.00
medicines 120.00
cosmetics 50.00
books 250.00
medicines 80.00
medicines 100.00
The program should generate a summary report showing the sums and averages by category, sorted by category. The summary should be displayed on the console. The program should prompt the user and read in the name of the input file.
For example, for the above data, the summary will be:
output
Category Total Average
books $250 $250.00
medicines $300.00 $100.00
cosmetics $150.00 $75.00
a) The first field is a string and the second field is a floating point number.
b) The number of records for each category may vary. For example, in the above example, there are 2 records for cosmetics, 3 for medicines and 1 for books.
c) The total number of records (lines) may vary. Do not limit them to any fixed number.
d) The records are not in any sorted order.
It really depends on the language you are using, but I would recommend using some kind of structure or tuple as the value stored in the hashmap. You can read each line, split it in two (the label and the value), and check whether the label is already in the hashmap. If it is, increment the number of records by one and add the value to the running total; if it is not, insert a new entry.
At the end, just traverse the hashmap and print all the values needed.
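Since the question does not name a language, here is a minimal sketch of that approach in Python, with a dict as the hashmap and a [total, count] pair as the stored value. The function names `summarize` and `print_report` are made up, and the sample lines are passed in as a list; for a real file, prompting with `input()` and passing the open file object to `summarize` would be one way to do it.

```python
from collections import defaultdict

def summarize(lines):
    """Return {category: (total, average)} from 'name value' lines."""
    stats = defaultdict(lambda: [0.0, 0])  # category -> [total, count]
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        category, value = parts[0], float(parts[1])
        stats[category][0] += value
        stats[category][1] += 1
    return {c: (total, total / n) for c, (total, n) in stats.items()}

def print_report(summary):
    """Print totals and averages, sorted by category name."""
    print(f"{'Category':<12}{'Total':>10}{'Average':>10}")
    for category in sorted(summary):
        total, average = summary[category]
        total_text = f"${total:.2f}"
        average_text = f"${average:.2f}"
        print(f"{category:<12}{total_text:>10}{average_text:>10}")

sample = [
    "cosmetics 100.00",
    "medicines 120.00",
    "cosmetics 50.00",
    "books 250.00",
    "medicines 80.00",
    "medicines 100.00",
]

summary = summarize(sample)
print_report(summary)
```

Nothing here limits the number of records or categories, and the input order does not matter because sorting happens only when the report is printed.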
