Count cases using segmentation variable in SPSS - statistics

I have an SPSS dataset with more than 5,000 cases that looks like this:
ID, relation to head of household
1, head of household
1, son
1, partner
2, head of household
2, son
3, head of household
3, son
3, cousin
I need to count the number of households that have:
Head of household + child(ren)
Head of household + partner + child(ren)
Head of household + relative(s)
Head of household + partner + relative(s)
I know this should be done using ID as the segmentation variable, but I can't figure out how.

One way to do it would be to make a set of dummy variables for each category and then use AGGREGATE to get the household level statistics.
DATA LIST LIST (",") /ID (F1.0) Relation (A20).
BEGIN DATA
1,head of household
1,son
1,partner
2,head of household
2,son
3,head of household
3,son
3,cousin
END DATA.
DATASET NAME Houses.
*Making dummy variables.
COMPUTE HeadHouse = (Relation = "head of household").
COMPUTE Partner = (Relation = "partner").
COMPUTE Child = (Relation = "son").
COMPUTE Relative = (Relation = "cousin").
DATASET DECLARE AggHouse.
AGGREGATE OUTFILE='AggHouse'
  /BREAK ID
  /HeadHouse = SUM(HeadHouse)
  /Partner = SUM(Partner)
  /Child = SUM(Child)
  /Relative = SUM(Relative).
Then with the aggregated dataset you can subsequently use IF statements to calculate the conditions you want. E.g.
DATASET ACTIVATE AggHouse.
IF (HeadHouse > 0) AND (Child > 0) First = 1.
IF (HeadHouse > 0) AND (Partner > 0) AND (Child > 0) Second = 1.
For your real dataset you will need to add more conditions, both to the original set of dummy variables and to the IF statements, but I leave that as an exercise to you.
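For readers who want to check the logic outside SPSS, the same dummy-then-aggregate idea can be sketched in plain Python. This is only an illustration under stated assumptions: the ROLE mapping (including the "daughter" entry) and the count_with helper are hypothetical names, not part of the original answer, and the conditions are "at least" conditions like the IF statements above, so a household can fall into more than one category.

```python
from collections import defaultdict

# Hypothetical mapping from relation labels to roles; extend it for real data.
ROLE = {
    "head of household": "head",
    "partner": "partner",
    "son": "child",
    "daughter": "child",   # assumed label, not in the sample data
    "cousin": "relative",
}

# The sample rows from the question: (household ID, relation).
rows = [
    (1, "head of household"), (1, "son"), (1, "partner"),
    (2, "head of household"), (2, "son"),
    (3, "head of household"), (3, "son"), (3, "cousin"),
]

# "Aggregate" step: collect the set of roles present in each household.
roles_by_house = defaultdict(set)
for house_id, relation in rows:
    roles_by_house[house_id].add(ROLE[relation])


def count_with(required):
    """Count households that contain at least the required roles."""
    return sum(1 for roles in roles_by_house.values() if required <= roles)


counts = {
    "head + child": count_with({"head", "child"}),
    "head + partner + child": count_with({"head", "partner", "child"}),
    "head + relative": count_with({"head", "relative"}),
    "head + partner + relative": count_with({"head", "partner", "relative"}),
}
print(counts)
```

With the sample rows this counts three households with a head and a child, one with head, partner and child, one with head and relative, and none with head, partner and relative.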


Creating random, non recurrent groups of undetermined size from a list

I want to create a program where a user can input what the maximum size of the groups is that they want. These groups are formed from a list of names from a submission form. The idea is that there are multiple rounds in which the names are paired in the requested maximum group size and each round does not create previously formed groups. Also, no one should be left out, so no groups of 1 person.
I have two problems. First, if I have a list of 10 names and I input that I want groups of at most 3 persons, I get 3 groups of 3 persons and 1 group of 1, but it should be 3, 3, 2, 2. I used two different functions I found on here, but both have the same problem.
Second, I have no idea how to make sure that a new round won't contain any groups from a previous round.
I am pretty new to programming, so any tips are welcome.
This is the first function I have:
import random
members = group_size()
def teams(amount, size):
    for i in range(0, len(amount), size):
        yield amount[i:i + size]
participants = Row_list_names
random.shuffle(participants)
print("These are your groups:")
print(list(teams(participants, members)))
And this is the second:
import random
members = group_size()
participants = Row_list_names
random.shuffle(participants)
for i in range(len(participants) // members + 1):
    print('Group {} consists of:'.format(i+1))
    group = participants[i*members:i*members + members]
    for participant in group:
        print(participant)
group_size() returns an integer number for how many people should be in the group.
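For the second problem, reshuffling before every round as you do makes repeated groups unlikely, although it does not strictly rule them out; to be certain, remember the groups from earlier rounds and reshuffle whenever one of them reappears.
For the first problem, the functions are doing what you tell them to: you step through the list and slice it into chunks that contain exactly members participants. You do not notice that the last slice runs out of bounds because Python is lenient about that: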
>>> l = [0,1,2,3,4]
>>> l[:40]
[0, 1, 2, 3, 4]
The point is that not all groups should be of the same size:
from math import ceil
from math import floor
def split_groups(group_size, part):
    # first compute the number of groups given the requested size
    group_num = ceil(len(part) / group_size)
    print(f"group_num {group_num}")
    # compute a fractional length of the groups
    group_size_frac = len(part) / group_num
    print(f"group_size_frac {group_size_frac}")
    total_assigned = 0
    for i in range(group_num):
        # get the start and end indexes using the fractional length
        start = floor(i * group_size_frac)
        end = floor((i + 1) * group_size_frac)
        group = part[start:end]
        print(f"start {start} end {end} -> {group}")
        print(f"number of participants in this group {len(group)}")
        total_assigned += len(group)
    # check that we assigned all of the participants
    assert total_assigned == len(part)
I have not tested any edge case, but a quick check by running
for group_size in range(1, 5):
    for num_participants in range(10, 50):
        part = list(range(num_participants))
        split_groups(group_size, part)
shows that every participant was assigned to a group.
Plug in the shuffling you did before and you have random groups.
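As a rough sketch of how the pieces could fit together, the variant below returns the groups instead of printing them, computes the slice boundaries with integer arithmetic, and remembers earlier groups so that a new round is reshuffled until none of them recurs. The names split_names, new_round, seen and max_tries are illustrative assumptions, not part of the code above.

```python
import random


def split_names(names, max_size):
    """Split names into the fewest groups whose sizes differ by at most one."""
    n = len(names)
    group_num = -(-n // max_size)  # ceiling division without floats
    # Integer boundaries i * n // group_num avoid floating-point rounding.
    return [names[i * n // group_num:(i + 1) * n // group_num]
            for i in range(group_num)]


def new_round(names, max_size, seen, max_tries=1000):
    """Shuffle until no group repeats one stored in seen (a set of frozensets)."""
    for _ in range(max_tries):
        shuffled = random.sample(names, len(names))  # shuffled copy
        groups = split_names(shuffled, max_size)
        keys = [frozenset(group) for group in groups]
        if not any(key in seen for key in keys):
            seen.update(keys)
            return groups
    raise RuntimeError("no fresh grouping found; the options may be used up")


names = list("ABCDEFGHIJ")
seen = set()
round_1 = new_round(names, 3, seen)
round_2 = new_round(names, 3, seen)
print(round_1)
print(round_2)
```

For 10 names and a maximum size of 3 this yields two groups of 3 and two groups of 2. Note that a group of one person cannot always be avoided, for example with 3 names and a maximum size of 2.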
Cheers!

Generate numpy matrix with unique range for each element

I'm trying to generate random matrices in which each element has a different range, so that each element is a random number drawn from its own range. So far I've been able to generate matrices with unique column ranges:
c1 = np.random.uniform(low=2, high=1000, size=(15,1))
c2 = np.random.uniform(low=0.001, high=100, size=(15,1))
c3 = np.random.uniform(low=30, high=10000, size=(15,1))
c4 = np.random.uniform(low=1, high=25, size=(15,1))
mtx = np.concatenate((c1,c2,c3,c4), axis=1)
Now the low and high for the rows in mtx are also quite different. How can I generate a random matrix in which each row also has its own range, not just each column?
Something like this would probably work:
low = np.array([ 2, 0.001, 30, 1])
high = np.array([1000, 100, 10000, 25])
l = 15
# np.random.rand takes the dimensions as separate arguments, not as a tuple
mtx = np.random.rand(l, *low.shape) * (high - low)[None, :] + low[None, :]
I think what you need to do to achieve what you want is the following:
1. Specify the low and high for each column and each row.
2. Check for each element what range it can be sampled from (that is, the highest low and the lowest high of the two ranges imposed by its row and its column).
3. Sample each element separately (from a uniform distribution) with the element's specified high and low.
Now each element in each row will certainly be within the row's limits, and the same goes for the elements in a column.
You should be careful, though, not to select mutually exclusive ranges for rows and columns.
That said, here is some code that does this (with comments):
import numpy as np
from numpy.random import randint
n_rows = 15
n_cols = 4
# here I make random highs and lows for each row and column
# these are lists of tuples like this: [(39, 620), (83, 123), (67, 243), (77, 901)]
# where each tuple contains the low and high for the column (or row).
ranges_rows = [ (randint(0,100), randint(101, 1001)) for _ in range(n_rows) ]
ranges_cols = [ (randint(0,100), randint(101, 1001)) for _ in range(n_cols) ]
# make an empty matrix
mtx = np.empty((n_rows, n_cols))
# fill in the matrix
for x in range(n_rows):
    for y in range(n_cols):
        # get the specified low and high for both the column and row of the element
        row_low, row_high = ranges_rows[x]
        col_low, col_high = ranges_cols[y]
        # the low and high for each element should be within range of both the
        # row and column restrictions
        elem_low = max([row_low, col_low])
        elem_high = min([row_high, col_high])
        # get the element within the range
        rand_elem = np.random.uniform(low=elem_low, high=elem_high)
        # put it in its right place in the matrix
        mtx[x,y] = rand_elem
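The same per-element logic can probably be written without the double loop, since NumPy's uniform sampler accepts arrays for low and high and broadcasts them. The sketch below assumes the row and column limits are held in four separate arrays (row_low, row_high, col_low, col_high are illustrative names) and uses the newer Generator interface with a fixed seed for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 15, 4

# Random limits for every row and column, in the same spirit as the loop version.
row_low = rng.integers(0, 100, size=n_rows)
row_high = rng.integers(101, 1001, size=n_rows)
col_low = rng.integers(0, 100, size=n_cols)
col_high = rng.integers(101, 1001, size=n_cols)

# Broadcast to (n_rows, n_cols): highest low and lowest high per element.
elem_low = np.maximum(row_low[:, None], col_low[None, :])
elem_high = np.minimum(row_high[:, None], col_high[None, :])

# Guard against mutually exclusive row and column ranges.
if not (elem_low < elem_high).all():
    raise ValueError("some row and column ranges do not overlap")

# One call samples every element from its own range.
mtx = rng.uniform(elem_low, elem_high)
print(mtx.shape)
```

If the limits are instead known per element, elem_low and elem_high can be supplied directly as two matrices of the target shape.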

OR-Tools VRP: Constrain locations to be served by same vehicle

I would like to constrain locations to be served by the same vehicle.
I used capacity-constraints for achieving this. Say we have l = [[1,2], [3,4]] which means that location 1, 2 must be served by the same vehicle and 3, 4 as well. So 1, 2 ends up on route_1 and 3, 4 on route_2
My code for achieving this is:
for idx, route_constraint in enumerate(l):
    vehicle_capacities = [0] * NUM_VEHICLES
    vehicle_capacities[idx] = len(route_constraint)
    route_dimension_name = 'Same_Route_' + str(idx)
    def callback(from_index):
        from_node = manager.IndexToNode(from_index)
        return 1 if from_node in route_constraint else 0
    same_routes_callback_index = routing.RegisterUnaryTransitCallback(callback)
    routing.AddDimensionWithVehicleCapacity(
        same_routes_callback_index,
        0,  # null capacity slack
        vehicle_capacities,  # vehicle maximum capacities
        True,  # start cumul to zero
        route_dimension_name)
The idea is that 1 and 2 each have a capacity demand of 1 unit (all others have zero). As only vehicle 1 has a capacity of 2, it is the only one able to serve 1 and 2.
This seems to work fine if len(l) == 1. If it is greater, the solver is not able to find a solution, even though I put into l pairs of locations which were on the same route without the above code (hence without the above capacity constraints).
Is there a more elegant way to model my requirement?
Why does the solver fail to find a solution?
I have also considered allowing visits to be dropped (at a high cost), so that the solver can start from a solution which drops visits and find its way from that point to a solution without any drops. I had no luck.
Thanks in advance.
Each stop has a vehicle var whose values determine what vehicle is allowed to visit the stop. If you want to have stops 1 and 2 serviced by vehicle 0 use a member constraint on the vehicle var of each stop and set it to [0]. Since you might have other constraints that make stops optional add the value -1 to the list. It is a special value that indicates that the stop is not serviced by a vehicle.
In Python:
n2x = index_manager.NodeToIndex
cpsolver = routing_model.solver()
for stop in [1,2]:
    vehicle_var = routing_model.VehicleVar(n2x(stop))
    values = [-1, 0]
    cpsolver.Add(cpsolver.MemberCt(vehicle_var, values))
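On the question of why the capacity approach breaks for more than one pair, one possible contributor (an assumption, not a confirmed diagnosis) is Python's late binding of closures: a function defined inside a loop looks up the loop variable when it is called, not when it is defined. Whether that matters here depends on when OR-Tools evaluates the registered callback, which is not verified in this thread. The plain-Python sketch below, with hypothetical helper names, only demonstrates the language behaviour.

```python
def make_callbacks_late(groups):
    """Callbacks that read the loop variable when they are called."""
    callbacks = []
    for group in groups:
        def callback(node):
            # `group` is looked up at call time, so it is the last loop value.
            return 1 if node in group else 0
        callbacks.append(callback)
    return callbacks


def make_callbacks_bound(groups):
    """Callbacks that capture the current group at definition time."""
    callbacks = []
    for group in groups:
        def callback(node, group=frozenset(group)):
            # The default argument freezes this iteration's group.
            return 1 if node in group else 0
        callbacks.append(callback)
    return callbacks


groups = [[1, 2], [3, 4]]
late = make_callbacks_late(groups)
bound = make_callbacks_bound(groups)

print([cb(1) for cb in late])   # [0, 0]: both callbacks see [3, 4]
print([cb(1) for cb in bound])  # [1, 0]: each callback keeps its own group
```

If the callbacks in the question are evaluated lazily, binding route_constraint as a default argument in the same way would be a cheap thing to try before switching to the vehicle-var approach.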

How to set LpVariable and Objective Function in pulp for LPP as per the formula?

I want to calculate the maximised value for a particular user based on their Interest, their Popularity, or both Interest and Popularity, using the following Linear Programming Problem (LPP) equation
with the pulp package in Python 3.7.
I have 4 lists
INTEREST = [5,10,15,20,25]
POPULARITY = [4,8,12,16,20]
USER = [1,2,3,4,5]
cost = [2,4,6,8,10]
and 2 variable values as
e=0.5 ; e may take (0 or 1 or 0.5)
budget=20
and
i = 0 to n, where n is the length of the lists,
meaning that the summation runs over all list values.
If e == 0, Interest is zeroed out; if e == 1, Popularity is zeroed out; if e == 0.5, both Interest and Popularity count towards the maximum value.
Also, xi takes the value 0 or 1: if xi == 1 the user is considered, and if xi == 0 the user is not.
My pulp code is below:
from pulp import *
INTEREST = [5,10,15,20,25]
POPULARITY = [4,8,12,16,20]
USER = [1,2,3,4,5]
cost = [2,4,6,8,10]
e=0.5
budget=10
#PROBLEM VARIABLE
prob = LpProblem("MaxValue", LpMaximize)
# DECISION VARIABLE
int_vars = LpVariable.dicts("Interest", INTEREST,0,4,LpContinuous)
pop_vars = LpVariable.dicts("Popularity",
POPULARITY,0,4,LpContinuous)
user_vars = LpVariable.dicts("User",
USER,0,4,LpBinary)
#OBJECTIVE fUNCTION
prob += lpSum(USER(i)((INTEREST[i]*e for i in INTEREST) +
(POPULARITY[i]*(1-e) for i in POPULARITY)))
# CONSTRAINTS
prob += USER(i)cost(i) <= budget
#SOLVE
prob.solve()
print("Status : ",LpStatus[prob.status])
# PRINT OPTIMAL SOLUTION
print("The Max Value = ",value(prob.objective))
Now I am getting two errors:
1) line 714, in addInPlace for e in other:
2) line 23, in <module>
prob += lpSum(INTEREST[i]*e for i in INTEREST) +
lpSum(POPULARITY[i]*(1-e) for i in POPULARITY)
IndexError: list index out of range
What did I do wrong in my code? Please guide me in resolving this problem. Thanks in advance.
I think I finally understand what you are trying to achieve. I think the problem with your description is to do with terminology. In a linear program we reserve the term variable for those variables which we want to be selected or chosen as part of the optimisation.
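If I understand your needs correctly, your Python variables e and budget would be considered parameters or constants of the linear program. The IndexError itself arises because "for i in INTEREST" iterates over the list's values (5, 10, ...) rather than its positions, so INTEREST[i] looks past the end of the list.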
I believe this does what you want:
from pulp import *
import numpy as np
INTEREST = [5,10,15,20,25]
POPULARITY = [4,8,12,16,20]
COST = [2,4,6,8,10]
N = len(COST)
set_user = range(N)
e=0.5
budget=10
#PROBLEM VARIABLE
prob = LpProblem("MaxValue", LpMaximize)
# DECISION VARIABLE
x = LpVariable.dicts("user_selected", set_user, 0, 1, LpBinary)
# OBJECTIVE fUNCTION
prob += lpSum([x[i]*(INTEREST[i]*e + POPULARITY[i]*(1-e)) for i in set_user])
# CONSTRAINTS
prob += lpSum([x[i]*COST[i] for i in set_user]) <= budget
#SOLVE
prob.solve()
print("Status : ",LpStatus[prob.status])
# PRINT OPTIMAL SOLUTION
print("The Max Value = ",value(prob.objective))
# Show which users selected
x_soln = np.array([x[i].varValue for i in set_user])
print("user_vars: ")
print(x_soln)
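This should return something like the following. With these particular parameters the optimum is 22.5, reached here by selecting only the last user; because every user has the same value-to-cost ratio, other selections that spend the full budget (users 1 and 4, or users 2 and 3) tie with it, so the solver may report a different selection. The decision also changes with the parameters: for example, if you increase the budget to 100, all users will be selected.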
Status : Optimal
The Max Value = 22.5
user_vars:
[0. 0. 0. 0. 1.]
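With only five users the result can be cross-checked by brute force, without any solver. The sketch below enumerates every subset of users, keeps those within the budget, and records all subsets that reach the best value; subset_value, best and winners are illustrative names, and the indices printed are zero-based positions in the lists.

```python
from itertools import combinations

INTEREST = [5, 10, 15, 20, 25]
POPULARITY = [4, 8, 12, 16, 20]
COST = [2, 4, 6, 8, 10]
e = 0.5
budget = 10


def subset_value(subset):
    """Objective value of selecting the users at the given positions."""
    return sum(INTEREST[i] * e + POPULARITY[i] * (1 - e) for i in subset)


best = 0.0
winners = []
for size in range(len(COST) + 1):
    for subset in combinations(range(len(COST)), size):
        if sum(COST[i] for i in subset) > budget:
            continue  # over budget, not feasible
        v = subset_value(subset)
        if v > best:
            best, winners = v, [subset]
        elif v == best:
            winners.append(subset)

print(best)     # 22.5
print(winners)  # [(4,), (0, 3), (1, 2)]
```

The maximum matches the solver output, and the list of winners shows which other selections reach the same objective value.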

Order constraints in optimisation

I have a set of many (10000+) items, from which I have to choose exactly 20 items. I can only choose each item once. My items have profits and costs, as well as several boolean properties (such as colour). I need to output the results in a specific order: in particular I need the first and third items to be blue, and the second and fourth items to be red.
Each item is represented as a tuple:
item = ('item name', cost, profit, is_blue, is_red)
as an example
vase = ['Ming Vase', 1000, 10000, 0, 1]
plate = ['China Plate', 10, 5, 1, 0]
and the total set of items is a list of lists:
items = [item1, item2, ..., itemN].
My profits and costs are also lists:
profits = [x[2] for x in items]
costs = [x[1] for x in items]
For each item chosen, it needs to have a minimum value, and a minimum of 5 items must have the property (is_blue) flag set to 1.
I want to choose the 20 cheapest items with the highest value, such that 5 of them have the is_blue flag set to 1, and the first and third items are blue (etc).
I'm having trouble formulating this using google OR tools.
from ortools.linear_solver import pywraplp
solver = pywraplp.Solver('SolveAssignmentProblemMIP',
                         pywraplp.Solver.CBC_MIXED_INTEGER_PROGRAMMING)
x = {}
for i in range(MAX_ITEMS):
    x[i] = solver.BoolVar('x[%s]' % (i))
# Define the constraints
total_chosen = 20
solver.Add(solver.Sum([x[i] for i in range(MAX_ITEMS)]) == total_chosen)
blues = [item[3] for item in items]
solver.Add(solver.Sum([blues[i] * x[i] for i in range(MAX_ITEMS)]) >= 5)
max_cost = 5.0
for i in range(MAX_ITEMS):
    solver.Add(x[i] * costs[i] <= max_cost)
# Sum over every candidate item, not just the first total_chosen of them
solver.Maximize(solver.Sum([profits[i] * x[i] for i in range(MAX_ITEMS)]))
sol = solver.Solve()
I can get the set of items I've chosen by:
for i in range(MAX_ITEMS):
    if x[i].solution_value() > 0:
        print(items[i][0])  # the item name is the first field
This works fine - it chooses the set of 20 items which maximise the profits subject to the cost constraint, but I'm stuck on how to extend this to choosing items in way that guarantees that the first is blue etc.
Any help in formulating the constraints and objective would be really helpful. Thanks!
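Instead of expressing chosen items with BoolVar, consider making a list of 20 IntVar with a domain of 0..MAX_ITEMS-1, each holding the index of the item chosen for that position. This relies on the element constraint of the CP solver (pywrapcp) rather than the pywraplp solver used in the question. From there it should be fairly easy to do something like this, where blues is the list of is_blue flags from the question:
solver.Add(chosens[0].IndexOf(blues) == 1)
solver.Add(chosens[2].IndexOf(blues) == 1)
chosens[i].IndexOf(blues) simply means blues[IndexOfChosen], i.e. the is_blue flag of whichever item is chosen for the i-th place. If you go with this approach, do not forget to make the chosens all different (MakeAllDifferent)!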
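A different route, named here plainly as an alternative rather than the approach above, is to keep the BoolVar model, require enough blue and red items in the chosen set (the question already asks for at least 5 blue; a similar sum over the is_red flags would ask for at least 2 red), and then order the chosen items after solving. The sketch below covers only that ordering step in plain Python; the arrange function and its slot parameters are hypothetical, and it assumes no item needs to count as both blue and red.

```python
def arrange(chosen, blue_slots=(0, 2), red_slots=(1, 3)):
    """Order chosen items so blue_slots hold blue items and red_slots red ones.

    Each item is (name, cost, profit, is_blue, is_red), as in the question.
    """
    blue_idx = [i for i, item in enumerate(chosen) if item[3] == 1]
    red_idx = [i for i, item in enumerate(chosen)
               if item[4] == 1 and item[3] != 1]
    if len(blue_idx) < len(blue_slots) or len(red_idx) < len(red_slots):
        raise ValueError("not enough blue or red items for the fixed slots")

    # Pin one item to each fixed position.
    slot_to_item = dict(zip(blue_slots, blue_idx))
    slot_to_item.update(zip(red_slots, red_idx))

    # Fill the remaining positions with the leftover items, in input order.
    used = set(slot_to_item.values())
    rest = (i for i in range(len(chosen)) if i not in used)
    order = [slot_to_item[pos] if pos in slot_to_item else next(rest)
             for pos in range(len(chosen))]
    return [chosen[i] for i in order]


chosen = [
    ("r1", 1, 5, 0, 1),
    ("b1", 2, 6, 1, 0),
    ("r2", 3, 7, 0, 1),
    ("b2", 4, 8, 1, 0),
    ("plain", 5, 9, 0, 0),
]
for item in arrange(chosen):
    print(item[0])
```

This keeps the optimisation model a pure selection problem, which may matter with 10000+ candidate items, at the price of handling the order outside the solver.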
