GAMS: Avoid scanning obviously wrong solutions in CPLEX - modeling

I have the following problem in GAMS.
I implemented a location routing problem. While checking the .log file I noticed something that could speed up the calculation time immensely if I fixed it.
Let me state an example first:
Let's assume that we have a set of all nodes i1*i140, where nodes i1*i10 represent potential warehouses and i11*i140 represent customers to be served. So we have
Sets
  i      "all nodes"                      /i1*i140/
  WH(i)  "potential warehouse locations"  /i1*i10/
  K(i)   "customer sites"                 /i11*i140/ ;
alias(i,j);

Binary Variables
  z(WH)    "1 if warehouse location WH is opened"
  y(K,WH)  "1 if customer site K is assigned to warehouse WH"
  x(i,j)   "1 if node j is visited immediately after node i" ;

Parameters
  WHKAPA     "capacity of a warehouse"
  d(K)       "demand of customer K"
  Cfix       "fixed opening cost of a warehouse"
  dist(i,j)  "distance from node i to node j" ;
The objective function minimizes the fixed opening costs and the routing costs.
When I set the capacity of a warehouse large enough to serve all customers and set high opening costs for each warehouse, my assumption was that the optimal solution would consist of a single opened warehouse serving all customers.
My assumption was right; however, I noticed that CPLEX first spends a very long time checking the part of the solution space in which way too many warehouses are opened.
The optimality gap then "jumps" towards a near-optimal solution once solutions with fewer warehouses are explored (see attached screenshot). So basically a lot of time is spent scanning obviously "bad" solutions. I deliberately used examples in which the obviously best solution has to consist of one warehouse only.
My question to you:
How can I "direc"t CPLEX to checkout solutions consisting of one Warehouse opened first without giving a maximal number of possible opened warehouses within the model (i. e. sum(WH, z(WH)) =l= 1 ; )
I tried Branching prioritys using the .prior suffix and the mipordind = 1 option. Cplex still checked solutions consisting of 10 Warehouses opened so I assume it did not help.
I also tried to set the Warehouse opening costs ridiculously high. However solutions that included opening the maximum number of possible warehouses were still checked and time lost.
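To make the structure concrete, here is a stripped-down sketch of the facility-location core of my model, written against CPLEX's docplex Python API rather than GAMS, together with the kind of warm start (a single opened warehouse) I would like CPLEX to explore first. All data values are made up, the routing variables are omitted, and this is only an illustration of the idea, not my actual model.

from docplex.mp.model import Model   # requires docplex and a CPLEX installation

# hypothetical tiny instance: 3 candidate warehouses, 6 customers
warehouses = [f"w{k}" for k in range(1, 4)]
customers = [f"c{k}" for k in range(1, 7)]
demand = {c: 10 for c in customers}
open_cost = {w: 1000 for w in warehouses}
assign_cost = {(c, w): 5 for c in customers for w in warehouses}
capacity = 100                       # large enough for one warehouse to serve everyone

mdl = Model(name="facility_location_sketch")
z = mdl.binary_var_dict(warehouses, name="z")                                        # warehouse opened
y = mdl.binary_var_dict([(c, w) for c in customers for w in warehouses], name="y")   # assignment

for c in customers:
    mdl.add_constraint(mdl.sum(y[c, w] for w in warehouses) == 1)                    # assign each customer once
    for w in warehouses:
        mdl.add_constraint(y[c, w] <= z[w])                                          # only to opened warehouses
for w in warehouses:
    mdl.add_constraint(mdl.sum(demand[c] * y[c, w] for c in customers) <= capacity * z[w])

mdl.minimize(mdl.sum(open_cost[w] * z[w] for w in warehouses)
             + mdl.sum(assign_cost[c, w] * y[c, w] for c in customers for w in warehouses))

# warm start: open only the first warehouse and assign every customer to it,
# so the search begins from a one-warehouse incumbent instead of incumbents
# that open many warehouses
ws = mdl.new_solution()
ws.add_var_value(z[warehouses[0]], 1)
for c in customers:
    ws.add_var_value(y[c, warehouses[0]], 1)
mdl.add_mip_start(ws)

mdl.solve(log_output=True)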
Sorry for the long post
I hope I have put all necessary information in :)
Looking forward to your advice
Kind Regards
Adam

AAS: How can I optimize the memory usage of an Azure Analysis Services instance?

Context
Note: To be precise, I have multiple data models on the same AAS instance; however, judging from the size of those models and the usage graphs, they don't seem to impact memory usage by any significant amount. Therefore the discussion below focuses on the single data model that seems most correlated with the observed spikes.
I have a data model held in an Azure Analysis Services instance (the data model itself is a database inside the AAS instance), deployed to the instance using Visual Studio. The data model is essentially built from data pulled straight from a SQL Server database (queries and stored procedures are used to create the tables under the hood).
Note: Within this data model there are 16 tables in total. The largest 2 (as defined by % of model occupied and other metrics, which can be viewed via the DAX Studio Vertipaq Analyzer) are partitioned daywise, with 60 daily partitions each (2022-04-11, 2022-04-12, ...), and are handled via the partitioning automation procedure outlined in the Resources section below. The remaining 14 tables haven't been partitioned in that sense and are "fully processed" each time the refresh function triggers (effectively each of those 14 tables consists of a single large partition, i.e. the whole table).
E.g.: Every hour, when our refresh function triggers, the latest 3 partitions of our 2 large tables are re-processed, and each of the remaining 14 tables is re-processed fully (since each of them consists of only 1 big partition).
The refreshes of the data model are performed by a function app whose functions refresh the latest 3 daywise partitions of the 2 largest tables in the data model, while the other tables are processed in full every time.
Currently the function controlling the refresh execution is triggered every hour, at which point it performs a refresh of the data as described above.
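For illustration, here is a rough Python sketch of the kind of call the refresh function makes, written against the documented AAS asynchronous refresh REST API; our real implementation is based on the AsPartitionProcessing library linked in the Resources section, and the region, server, model, table and partition names below are placeholders.

import requests

REGION = "westeurope"                                    # placeholder rollout region
SERVER = "myaasserver"                                   # placeholder server name
MODEL = "MyDataModel"                                    # placeholder model (database) name
TOKEN = "<AAD bearer token for https://*.asazure.windows.net>"   # acquired separately

REFRESH_URL = (f"https://{REGION}.asazure.windows.net/"
               f"servers/{SERVER}/models/{MODEL}/refreshes")

def refresh_partitions(table_name, partition_names):
    """Queue an asynchronous refresh of only the given partitions of one table."""
    body = {
        "Type": "Full",
        "CommitMode": "transactional",
        "MaxParallelism": 2,
        "Objects": [{"table": table_name, "partition": p} for p in partition_names],
    }
    resp = requests.post(REFRESH_URL, json=body,
                         headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()
    return resp.headers.get("Location", "")              # URL of the queued refresh operation

# e.g. refresh only the three most recent daywise partitions of one large table
refresh_partitions("BigTable1", ["2022-04-10", "2022-04-11", "2022-04-12"])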
The issue we have been facing is that on our memory usage dashboard (see the screenshot below) we get massive spikes in memory usage which seem to occur during this refresh phase.
In light of this observation we began testing what might be causing these periodic spikes and observed the following interesting points:
Spikes align almost perfectly with the scheduled hourly refreshes of our data model.
Leading us to believe the spikes are related to the refresh process in some way.
In between these refreshes the memory usage drops significantly
Further making us believe the spikes are caused by some part of the refresh activity and not by general usage.
Increasing the number of partitions (the time window of data for the 2 main tables) from 30 days to 60 days, and vice versa, causes clearly visible changes in the spikes.
If we go up from 30 to 60 days the spike amplitudes increase, and going back down causes them to decrease.
Performing the "defragmentation process" outlined in the white paper (linked in the Resources section) temporarily reduces the usage a little.
By its nature, this is something that would need to be performed on a regular basis to ensure continued benefit.
The tables that are fully processed each time (all tables in the main data model barring the 2 which only refresh their last 3 daily partitions) don't seem to have a high impact on the memory usage spikes.
We manually processed some of the largest tables one after another in between the refreshes and didn't notice a huge jump in the graphs.
Reducing the refresh of the 2 main tables from 3 daywise partitions to 3 hourly partitions didn't seem to cause a big change either.
We noticed a small drop in memory usage (about 1-2 GB during the hourly refresh), but it didn't have as large an impact as we expected (i.e. proportional to the data reduction). This makes us think that the actual amount of data might not be the primary issue.
Screenshots
Here are some more details on the metrics used. Definitions:
Turquoise Line: Hard memory limit max (same as the max cache size of our AAS tier).
Dark Blue Line: High memory limit max (approx 80% of our Hard Memory limit).
Orange Line: Memory Usage max. More details can be found in the following links: AAS Metrics, Memory Usage Forum Post
Questions
Based on our scenario (described above), what could be the cause of the memory usage spikes during refresh, and how can we reduce and/or manage them in a nice way (ideally removing them entirely or as much as possible)? Basically, keeping usage well below the turquoise and dark blue lines.
We feel that if we can figure this out it may allow us to stay within our current pricing tier and potentially also allow us to bring in more data (90-120 days of partitions) without the worry of hitting "Out of Memory" health alerts for our instance (which we have been receiving up until now with 60 days).
Note: Barring the current hourly refreshes, we are well within the tier limits in terms of memory usage (the orange line is much lower than the turquoise and blue thresholds). Thus solving this could free us to make better use of our AAS resources.
Current Thoughts
We do have calculated columns in our data model. Could this be causing the issue?
What would be the best way to test this?
Resources
I will place any useful documentation links in this section; hopefully they can aid in understanding the context.
GitHub link for the repo containing the Tabular model refresh logic we based our process on.
https://github.com/microsoft/Analysis-Services/tree/master/AsPartitionProcessing
In the README.md make sure to click the link to the white paper which provides more detail.
Please try setting MaxParallelism to some low value like 2 or 3 in the ModelConfiguration table. This will reduce the number of parallel tables and partitions it will process at once. This alone probably won’t solve the memory spike issue but it should lower the spike a little at the expense of longer refresh times. If you can deal with this tradeoff and it spikes memory less this may be a workaround.
Please set IsAvailableInMDX to false on any hidden columns or hidden measure columns which are not put on an axis or referenced directly in an MDX query. This should reduce your memory footprint during processing because it will not build attribute hierarchies for those columns. On high cardinality columns the savings could be significant.
The next thing to try would be to split the tables/partitions into separate ModelConfiguration rows in the database. Then configure it to process one ModelConfiguration then the other sequentially. The goal here would be to process some tables in one transaction and other tables in a separate transaction. That should cause the memory usage required for each transaction to be less. Of course this may impact users in that half of the data will be stale after the first transaction so you will have to judge whether this is feasible.
A more complex optimization would be to scale out AAS and have a dedicated processing node. You could then do a process-clear on the model before you fully process it. That should reduce the memory requirements the most. Once processing is done you run the Synchronize command. You could even scale back in, removing the processing node, to save cost for the rest of the hour.
Another option to consider would be to deploy the models to Power BI Premium Gen2. The very interesting nuance with Gen2 is that a P1 capacity allows each dataset to be up to 25GB unlike Gen1 and unlike AAS S1 where the total of all datasets must be less than 25GB. If your organization already owns Power BI Premium capacities this should be a good option. If not then the cost probably won’t make sense at the moment. Or you could license each user with a Power BI Premium Per User license and deploy the model to that Premium Per User capacity. If you have under 70 users this may be a more cost effective option for you to try.

Acumatica Physical Inventory Process and Transferring between Locations

We are starting to think about how we will utilize locations within warehouses to keep tighter tracking and control over where our items are physically located. In that context, we are trying to figure out what the actual workflow would be when it relates to performing physical inventory counts and review. I have read the documentation, but I'm wondering how to best think through the below scenario.
Let's say to start, that we have 10 serial items across 5 locations (so let's assume 2 in each location). And assume that all these locations are in the same warehouse.
2 weeks go by, and there is movement between these locations, normally recorded via the inventory transfer document process. But for this example, let's say that users didn't record an inventory transfer every time they physically moved items between locations.
So at this point, where Acumatica thinks the serial items are doesn't reflect the reality of where they actually are.
So now we do Physical inventory for this warehouse (all 5 locations together).
By the time we complete the inventory count and review, we will see the 10 items in the same warehouse. BUT:
will we be able to see the variances/problems against the locations? Meaning, will it highlight/catch where the items actually are located vs. where Acumatica thought they were located?
and assuming yes, is there anything in the physical inventory process that will automatically transfer each item to its correct location within the warehouse? Or does this then need to be done manually through an inventory transfer?
Any help would be much appreciated.
Thanks.

Options for running data extraction on a daily basis

I currently have an Excel-based data extraction method using Power Query and VBA (for documents with passwords). Ideally this would be programmed to run once or twice a day.
My current solution involves setting up a spare laptop on the network that will run the extraction twice a day on its own. This works, but I am keen to understand the other options. The task itself seems to be quite a struggle for our standard hardware: it covers 6 network locations across 2 servers with around 30,000 rows and increasing.
Any suggestions would be greatly appreciated
Thanks
If you are going to work with growing data, and you are going to dedicate an exclusive laptop to the process, I would think about installing a database on that laptop (MySQL, for example). You could use Access too... but Access file corruption is a risk.
Download into this database all the data you need for your report, using incremental downloads (only new, modified and deleted information).
Then run the Excel report, extracting from this database on the same computer.
This should improve the performance of your solution.
Your biggest problem is probably that you query ALL the data on each report generation.
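To make the incremental-download idea above concrete, here is a minimal Python sketch of a watermark-based load into a local database. It uses sqlite3 purely as a stand-in for the MySQL database suggested above, fetch_source_rows is a placeholder for however you read your network sources, and all table and column names are hypothetical.

import sqlite3
from datetime import datetime

DB_PATH = "extraction.db"            # local database on the dedicated laptop (placeholder)

def fetch_source_rows(modified_since):
    """Placeholder: return (id, value, modified_at) rows changed since the watermark.
    In practice this would read the Excel/network sources, ideally filtered so that
    only new or modified rows come back; deletions would need separate handling."""
    return []

def incremental_load():
    con = sqlite3.connect(DB_PATH)
    con.execute("""CREATE TABLE IF NOT EXISTS report_data (
                       id INTEGER PRIMARY KEY,
                       value TEXT,
                       modified_at TEXT)""")
    con.execute("""CREATE TABLE IF NOT EXISTS load_state (
                       name TEXT PRIMARY KEY,
                       watermark TEXT)""")

    # last successful watermark; start from the epoch on the first run
    row = con.execute("SELECT watermark FROM load_state WHERE name = 'report_data'").fetchone()
    watermark = row[0] if row else "1970-01-01T00:00:00"

    # pull only new/modified rows instead of re-reading everything
    con.executemany(
        "INSERT INTO report_data (id, value, modified_at) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET value = excluded.value, modified_at = excluded.modified_at",
        fetch_source_rows(watermark),
    )

    # record the new watermark for the next run
    con.execute(
        "INSERT INTO load_state (name, watermark) VALUES ('report_data', ?) "
        "ON CONFLICT(name) DO UPDATE SET watermark = excluded.watermark",
        (datetime.now().isoformat(timespec="seconds"),),
    )
    con.commit()
    con.close()

if __name__ == "__main__":
    incremental_load()               # schedule this to run once or twice a day

The Excel report then only has to query the already-consolidated local database instead of hitting all the network locations on every refresh.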

Cassandra count use case

I'm trying to figure out an appropriate use case for Cassandra's counter functionality. I thought of a situation and I was wondering if this would be feasible. I'm not quite sure because I'm still experimenting with Cassandra, so any advice would be appreciated.
Let's say you had a small video service and you record a log of views in Cassandra, recording which video was played, which user played it, country, referrer, etc. You obviously want to show a count of how many times each video was played. Would incrementing a counter every time you insert a play event be a good solution to this, or would there be a better alternative? Counting all the events on every read would take a pretty big performance hit, and even if you cached the results the cache would be invalidated pretty quickly on a busy site.
Any advice would be appreciated!
Counters can be used for whatever you need to count within an application -- both "frontend" and "backend" data. I personally use them to store user behaviour information (for backend analysis) and frontend ratings (each operation a user performs in my platform gives the user some points). There is no real limitation on the use case -- the limitations are technical, the biggest ones that come to mind being:
a counter column family can contain only counter columns (apart from the primary key, obviously)
counters can't be reset: to set a counter to 0 you need to read it and calculate the delta before writing (with no guarantee that someone else hasn't updated it before you)
no TTL and no indexing/deletion
As far as your video service goes, it all depends on how you choose to model the data -- if you find a valid model that hits only a few partitions on each write/read and you have a good key distribution, I don't see any real problem with the implementation.
By the way: you tagged Cassandra 2.0, but if you have to use counters you should think about 2.1 for the reasons described here.
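To make the counter-plus-event-log pattern concrete, here is a minimal sketch using the DataStax Python driver; the keyspace, table and column names are made up for illustration, and the replication settings are only suitable for a local test.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])     # placeholder contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS video
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# detailed event log: one row per play, partitioned by video
session.execute("""
    CREATE TABLE IF NOT EXISTS video.play_events (
        video_id uuid,
        played_at timeuuid,
        user_id uuid,
        country text,
        referer text,
        PRIMARY KEY (video_id, played_at)
    )
""")

# counter table: apart from the primary key, only counter columns are allowed
session.execute("""
    CREATE TABLE IF NOT EXISTS video.play_counts (
        video_id uuid PRIMARY KEY,
        plays counter
    )
""")

def record_play(video_id, user_id, country, referer):
    """Insert the detailed play event and bump the per-video counter."""
    session.execute(
        "INSERT INTO video.play_events (video_id, played_at, user_id, country, referer) "
        "VALUES (%s, now(), %s, %s, %s)",
        (video_id, user_id, country, referer),
    )
    session.execute(
        "UPDATE video.play_counts SET plays = plays + 1 WHERE video_id = %s",
        (video_id,),
    )

Reading the total is then a single-row lookup (SELECT plays FROM video.play_counts WHERE video_id = ...) instead of counting every event, while the event table keeps the per-play detail for analysis.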

Access times for Windows Azure storage tables

My company is interested in using Azure storage tables. They have asked me to look into access times, but so far I have not found any information on this. I have a few questions that perhaps someone here could help answer.
Any information/links on the read/write access times of Azure table storage?
If I use a partition key and row key for direct access, does read time increase with the number of fields?
Is anyone aware of future plans for Azure storage, such as a decrease in price, an increase in access speed, the ability to index, or an increase in the storage size per row?
Storage is, I understand, 1 MB per row. Does this include space for the field names? I assume it does.
Is there any way to determine how much space is used by a row in Azure storage? Is there an API for this?
Hope someone can help answer even one or two of these questions.
PLEASE note this question only applies to TABLE STORAGE.
Thanks
Microsoft has a blog post about scalability targets.
For actual storage per row, here's an excerpt from that post:
Entity (Row) – Entities (an entity is analogous to a "row") are the basic data items stored in a table. An entity contains a set of properties. Each table has two properties, "PartitionKey and RowKey", which form the unique key for the entity. An entity can hold up to 255 properties. The combined size of all of the properties in an entity cannot exceed 1MB. This size includes the size of the property names as well as the size of the property values or their types.
You should see performance around 500 transactions per second, on a given partition.
I know of no plans to reduce storage cost. It's currently at $0.15 / GB / month.
You can optimize table storage write speed by combining writes within a single partition - this is an entity group transaction. See here for more detail.
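To illustrate the entity group transaction mentioned above, here is a minimal sketch using the azure-data-tables Python SDK; the connection string, table name and entity values are placeholders. All entities in one batch must share the same PartitionKey, and the batch commits atomically.

from azure.core.exceptions import ResourceExistsError
from azure.data.tables import TableClient

CONNECTION_STRING = "<azure-storage-connection-string>"      # placeholder
table = TableClient.from_connection_string(CONNECTION_STRING, table_name="Orders")

try:
    table.create_table()
except ResourceExistsError:
    pass                                                     # table already exists

# entities for a single partition; an entity group transaction cannot span partitions
entities = [
    {"PartitionKey": "customer-001", "RowKey": f"order-{i:03d}", "Amount": 10 * i}
    for i in range(1, 51)
]

# submit up to 100 operations as one atomic batch within the partition
operations = [("upsert", e) for e in entities]
table.submit_transaction(operations)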
To add to David's answer: the Microsoft Extreme Computing Group has a pretty comprehensive series of performance benchmarks on all things Azure, including Azure tables.
From the above benchmarks (under read latency):
Entity size does not significantly affect the latencies
So I wouldn't be overly concerned about adding more properties.
Secondary indexes on Azure Tables have come up as a requested feature since the service was first released, and at one point they were even talked about as if they were going to be in an upcoming release. MS has since fallen very quiet about it. I understand that MS is working on it (or at the very least thinking very hard about it), but there is no time frame for when/if it will be released.
