Assigning priority based on entity total time in an Arena simulation

I am working on a manufacturing simulation with 5 entities (C1, C2, ..., C5) whose creation times follow exponential distributions (with means of 50, 45, ..., 80), and there are 5 different stations. At each process, I want to assign the highest priority to the entity that has been in the system the longest.
I have tried First Come, First Served, but it doesn't work: because the entities have different, exponentially distributed creation times, FCFS does not prioritize the entity that has been in the system the longest.

My suggestion would be to add a "system_entry_time_a" attribute to each entity. Just after the entity is created, assign TNOW to that attribute.
Then, in your queue element, you can use Rule Expression as system_entry_time_a, and choose LowValueFirst as the ranking criterion.
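To see why LowValueFirst on an entry-time attribute does what you want, here is a minimal Python sketch (the entity names and times are made up, and a heap stands in for Arena's ranked queue): the entity with the smallest entry time, i.e. the one that has been in the system longest, is served first.

```python
import heapq

# Sketch of Arena's LowValueFirst ranking on a system_entry_time_a
# attribute: rank the queue by entry time, so the entity that entered
# the system earliest (longest resident) is always served first.
queue = []
for entity, entry_time in [("C3", 120.0), ("C1", 15.0), ("C2", 80.0)]:
    heapq.heappush(queue, (entry_time, entity))  # rank by entry time

service_order = [heapq.heappop(queue)[1] for _ in range(len(queue))]
print(service_order)  # ['C1', 'C2', 'C3'] -- oldest entity first
```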


Should we exclude target variable from DFS in featuretools?

When passing dataframes as entities in an entityset and running DFS on it, are we supposed to exclude the target variable? I have a model that had a 0.76 roc_auc score after traditional feature selection methods tried manually, and I used Featuretools to see if it improves the score. So I ran DFS on an entityset that included the target variable as well. Surprisingly, the roc_auc score went up to 0.996 and accuracy to 0.9997, so I am doubtful of the scores: since I passed the target variable into Deep Feature Synthesis, information related to the target might have been leaked into training? Am I assuming correctly?
Deep Feature Synthesis and Featuretools do allow you to keep your target in your entity set (in order to create new features using historical values of it), but you need to set up the “time index” and use “cutoff times” to do this without label leakage.
You use the time index to specify the column that holds the value for when data in each row became known. This column is specified using the time_index keyword argument when creating the entity using entity_from_dataframe.
Then, you use cutoff times when running ft.dfs() or ft.calculate_feature_matrix() to specify the last point in time from which data should be used when calculating each row of your feature matrix. Feature calculation will only use data up to and including the cutoff time. So, if this cutoff time is before the time index value of your target, you won’t have label leakage.
You can read about those concepts in detail in the documentation on Handling Time.
If you don’t want to deal with the target at all, there are two options:
You can use pandas to drop it out of your dataframe entirely before making it an entity. If it’s not in the entityset, it can’t be used to create features.
You can set the drop_contains keyword argument in ft.dfs to ['target']. This stops any feature from being created whose name contains the string 'target'.
No matter which of the above options you choose, it is still possible to pass a target column directly through DFS. If you add the target to your cutoff times dataframe, it is passed through to the resulting feature matrix. That can be useful because it ensures the target column remains aligned with the other features. You can see an example of passing the label through in the documentation.
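As a minimal sketch of the first option (dropping the target with pandas before building the entity), where the dataframe and column names are invented for illustration:

```python
import pandas as pd

# Drop the label column before the dataframe ever becomes an entity,
# so DFS cannot use it to create features. Column names are made up.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "spend": [10.0, 20.0, 15.0],
    "target": [0, 1, 0],            # label that must stay out of DFS
})
features_df = df.drop(columns=["target"])   # entity input, label-free
labels = df[["customer_id", "target"]]      # kept aside for training
print(list(features_df.columns))            # ['customer_id', 'spend']
```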
Advanced Solution using Secondary Time Index
Sometimes a single time index isn’t enough to represent datasets where the information in a row became known at two different times. This commonly happens with the target column. To handle this situation, we need to use a “secondary time index”.
Here is an example from a Kaggle kernel on predicting when a patient will miss an appointment with a doctor where a secondary time index is used. The dataset has a scheduled_time, when the appointment is scheduled, and an appointment_day, which is when the appointment actually happens. We want to tell Featuretools that some information like the patient’s age is known when they schedule the appointment, but other information like whether or not a patient actually showed up isn't known until the day of the appointment.
To do this, we create an appointments entity with a secondary time index as follows:
es = ft.EntitySet('Appointments')
es = es.entity_from_dataframe(entity_id="appointments",
                              dataframe=data,
                              index='appointment_id',
                              time_index='scheduled_time',
                              secondary_time_index={'appointment_day': ['no_show', 'sms_received']})
This says that most columns can be used at the time index scheduled_time, but that the variables no_show and sms_received can’t be used until the time given by the secondary time index, appointment_day.
We then make predictions at the scheduled_time by setting our cutoff times to be
cutoff_times = es['appointments'].df[['appointment_id', 'scheduled_time', 'no_show']]
By passing that dataframe into DFS, the no_show column will be passed through untouched, while historical values of no_show can still be used to create features. An example would be ages.PERCENT_TRUE(appointments.no_show), or “the percentage of people of each age who have not shown up in the past”.
If you are using the target variable in your DFS, then you are leaking information about it into your training data. So you have to remove your target variable while doing any kind of feature engineering (manual or via DFS).

Adding weight to variables in a line graph in Tableau

I have a dataset consisting of calls going to agents (actually 10 of them) per day. These agents can either answer calls or transfer them to a call center. What we are interested in is whether each of these agents answers more calls than he transfers. In order to answer this, I have created a variable for each of these agents:
Answered/Transferred
I am using a line graph to depict these variables per agent over time.
Now, if this variable is less than 1, then the agent transferred more calls than he answered. The problem is that this is not a safe way to measure the overall impact of transferred calls, because the traffic pertaining to agents 1, 2, 3 is far greater than the traffic pertaining to agents 5, 6, 7, and so on. Therefore, I am trying to come up with a way to "weight" the variables I created before. That is, I want to somehow include the total number of calls reaching each agent (irrespective of whether they are transferred or answered) in my calculations. That means that if one agent is getting 5 calls per day while another is getting 5,000 per day, I should find a way to depict this in my graphs.
Do you guys have any ideas?
The easiest approach would be to drag the weight measure to Colors and choose something like a temperature-diverging palette. Depending on your viz, you can also drag the weight measure to Size and, for example, make bars or lines thicker to show there are more records there.

Temporal logic for modelling events at discrete points in time causing states/changes over a period of time

I am looking for an appropriate formalism (i.e. a temporal logic) to model the following kind of situation
There can be events happening at discrete points in time (subject to conditions detailed below).
There is state. This state cannot be expressed by a fixed number of variables. However, it is possible to express it with a linear list/array, where each entry consists of a finite number of variables.
Before any events have happened, the state is fixed.
At any point in time, events are possible. They have a fixed structure (with a few variables). The possible events are constrained by the current state.
Events will cause an immediate change of the state.
Events can also cause continuous state changes. For example, a variable (of one of the entries of the array mentioned above) changes its value from 0 to 1 over some time (either immediately or after a specified delay).
It should also be possible to specify discrete points in time in the form "the earliest point in time after event E where some condition C holds", and to start a continuous state change at such a point.
Is there an existing temporal logic to model something like this?
It should also be possible to express desired conditions, like the following:
Referring to a certain point in time: the sum of a specific variable across all the entries of the array may not exceed a certain threshold.
Referring to change over time: For all possible time intervals, the value of a certain variable (again, from each entry of said array) [realistically, rather of some arithmetic expression computed for each entry] must not change faster than a given threshold.
There should exist a model checker that can check whether all the conditions are met in all possible scenarios. If this is not the case, it should print one possible scenario and tell me which condition is not met. In other words, it should distinguish between conditions describing the possible scenarios and conditions that have to be fulfilled in those scenarios, and not just tell me "not possible".
You need a model checker with a more flexible language. Technically speaking, model checking of systems with infinite state spaces is an open research problem and, in the general case, algorithmically undecidable. Temporal logic more typically relates to the properties in question.
Considering the limited info you shared about your project, why not try Spin/Promela? It is loosely inspired by C and has 'buffers' (channels) which can be treated as arrays. At the least, you might be able to simulate your system.

Find UK PostCodes closest to other UK Post Codes by matching the Post Code String

Here is a question that has kept me awake for a number of days now. The only conclusion I have come to so far is that Red Bull does not usually help coders.
I have a scenario in my application where I have a number of jobs (1 to 50). Each job has an address, and I have the following properties of an address: Postcode, Latitude, and Longitude.
I have a table of workers also and they too have addresses. While the jobs or workers are created through screens, I use Google Map queries to make sure the provided Postcode is valid and is in UK so all the addresses are verified.
I am using a scheduler control to display some workers on y-axis and a timeline on x-axis. Every job has a date and can only move vertically on the scheduler on the job’s date. The user selects a number of jobs and they are displayed in a basket close to the scheduler. The user can then drag and drop job against workers. All this is manual so it works.
My task is to automate this process so that the user does not do much beyond verifying and allotting the jobs.
Every worker has a property called WillingMaximumDistanceTravel which is an integer representing miles, the worker is willing to travel for a job.
Now here is the headache: I have over 1500 workers. I have a utility function that uses Newtonsoft’s JsonConvert to deserialize a response stream from Google Maps. I need to feed it Postcodes A and B.
I also plan to introduce a new table in the DB to store the distances found as Postcode A, Postcode B, and Distance. Then, if I find myself comparing the same postcodes again, I can just retrieve the result from the DB instead, and slowly but eventually I would no longer need to query Google at all, as this table would become comprehensive.
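That cache-table idea can be sketched as follows, here with an in-memory SQLite table and a stub in place of the real Google Maps call (get_distance_from_google is a placeholder, not a real API, and the sketch assumes the A-to-B distance is roughly symmetric so each unordered pair is stored once):

```python
import sqlite3

# Distance cache: look the postcode pair up locally first, and only
# fall back to the (stubbed) remote call on a miss.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE distances (a TEXT, b TEXT, miles REAL, PRIMARY KEY (a, b))"
)

def get_distance_from_google(a, b):
    # Placeholder for the real Maps request + JSON deserialization.
    return 42.0

def cached_distance(a, b):
    a, b = sorted((a, b))  # assume symmetry: store each pair once
    row = conn.execute(
        "SELECT miles FROM distances WHERE a = ? AND b = ?", (a, b)
    ).fetchone()
    if row:
        return row[0]                      # cache hit: no remote call
    miles = get_distance_from_google(a, b)  # cache miss: one remote call
    conn.execute("INSERT INTO distances VALUES (?, ?, ?)", (a, b, miles))
    return miles

print(cached_distance("SW1W 9TQ", "EC1A 1BB"))  # miss: calls the stub
print(cached_distance("EC1A 1BB", "SW1W 9TQ"))  # hit: served from the table
```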
I cannot use the simple Haversine formula, as an as-the-crow-flies path is not my requirement here. The pain is that this takes a lot of time to calculate. Some workers can travel up to 10 miles, while others vary from 15 to 80. I have to take the first job from the list and run it against every applicable worker in the system! I was wondering: UK postcodes have a pattern to them. If we sort a list of UK postcodes, can we roughly estimate, from the alphanumeric pattern, where we will hit a 100-mile mark, a 200-mile mark, and so on?
If anyone is interested in the code, please drop a line and I will paste it.
(I work for Google, but I'm not speaking on behalf of Google. I have nothing to do with the maps API.)
I suspect this isn't a great situation for using the Google Maps API, simply because you're pushing so much data through. You really don't want to make that many requests, even if you could do so under the directions limits.
When I tackled something similar in a previous job, we bought into a locally-hosted maps API - but even that wasn't fast enough for this sort of work. We ended up precomputing the time to travel from the centroid of each postcode "area" (probably the wrong name for it, but the first part of the postcode followed by the first digit of the remainder, e.g. "SW1W 9" for "SW1W 9TQ") to every other area, storing the result in a giant table. I think we only did it for postcodes which were within 100 miles or something similar, to cut down on the amount of preprocessing.
Even then, a simple DB wasn't quite as fast as we wanted - so we stored the results in a giant file, with a single byte per source/destination pair. (We had a fixed sequence of source postcodes and target postcodes, so we didn't need to specify those.) At that point, computing a travel time consisted of:
Work out postcode areas (substring work)
Find the index of each postcode area within the sequence
Check if we'd loaded that part of the file (we lazy loaded for startup speed)
Load the row if necessary, and just access it otherwise
The bytes were on a sliding scale of accuracy, so for the first 60 minutes it was on a per-minute basis, then each extra value meant an extra 2 minutes, then 5 etc. (Those aren't the exact values, but it was something like that.)
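The "substring work" in the first step above might look something like this sketch (real UK postcodes have edge cases, such as missing spaces, that this deliberately ignores):

```python
# Reduce a full UK postcode to the coarse "area" key used for the
# precomputed table: the outward code plus the first character of the
# inward code, e.g. "SW1W 9" for "SW1W 9TQ". Assumes the postcode has
# a space between outward and inward parts.
def postcode_area(postcode):
    outward, inward = postcode.strip().upper().split()
    return f"{outward} {inward[0]}"

print(postcode_area("SW1W 9TQ"))   # -> "SW1W 9"
print(postcode_area("ec1a 1bb"))   # -> "EC1A 1"
```

Each area key is then mapped to its index in the fixed sequence of source/target areas, which gives the row and column offsets into the precomputed travel-time file.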
When you've worked out "good candidates" you can ask an on-site API or the Google Maps API for more accurate directions for your exact postcodes, of course.
You want to look into a spatial index or a space-filling curve. A spatial index reduces the 2D problem to a 1D problem by recursively subdividing the surface into smaller tiles; it is basically a reordering of the tiles. You can subdivide the surface either with a numeric index or with a string using 4 characters. The latter can be useful to you because it lets you query the string with all the string operations hidden in the database engine. You want to look for Nick's spatial index quadtree Hilbert curve blog.

Feeding a UITableViewController calculated values

Very quick question. I need to feed a UITableViewController daily averages calculated from multiple objects with an NSNumber attribute (each object is timestamped, with 8-10 objects per day usually). Is there any straightforward way to calculate on the fly using my own version of lazy loading (I would average data in chronological order till I had a screenful of daily averages) or should I just take the easy way out and maintain an Averages entity pre-populated with averages for all possible days, which I present conventionally in my UITableViewController?
Thanks.
It depends how long the calculation takes. Why not just calculate it in the UITableViewDataSource method that asks for cellForRowAtIndexPath? Unless it takes long enough to cause a delay in populating the values onscreen, I'd say there's no harm in calculating on the fly.
There's plenty of other places to do the calculation. You could do it on app startup, or on loading the view. You could even start an NSOperation that does it in the background with the controller as a delegate that tells it when to reloadData.
You should do a fetch by value (see the Core Data Programming Guide) and then use the @avg collection operator on the attribute name.
