I am running an analysis in which I want to identify whether there are significant differences in the features that make up two groups: people who commit crimes against children and people who commit crimes against adults. The group sizes are not equal.
My data are all frequency counts, which have been converted into percentages.
For example:
80% of offenders who commit crimes against children are male; 20% are female.
75% of offenders who commit crimes against adults are male; 25% are female.
50% of offenders who commit crimes against children plead guilty, 20% plead not guilty, and 30% are acquitted.
20% of offenders who commit crimes against adults plead guilty, 60% plead not guilty, and 20% are acquitted.
I want to know:
Is there a significant difference in the number of females that commit crimes against adults vs. females that commit crimes against children?
Is there a significant difference in the number of males that commit crimes against adults vs. males that commit crimes against children?
Is there a significant difference in the number of guilty pleas / not-guilty pleas / acquittals between offenders who commit crimes against children and those who commit crimes against adults?
I am thinking it should be a chi-squared test – but this seems to be for answering questions such as ‘is there a significant difference between the number of men vs. the number of women that commit crimes against children?’ rather than answering ‘is there a significant difference in the number of women that commit crimes against children vs. the number of women that commit crimes against adults?’
Also, in a chi-squared test, would my 'expected value' just be what I observed in one group, and my 'observed value' what I observed in the other group? I.e., I expect the groups to be the same.
I also thought about a t-test where male and female could be coded as 0 and 1, but that would give no standard deviations, so it would not be feasible.
I would greatly appreciate any help or advice with this or on which test would be appropriate, thank you!
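For reference, a chi-squared test of independence on this design needs the raw frequency counts, not the percentages. A minimal sketch in Python, with invented group sizes (200 child-victim offenders, 400 adult-victim offenders) purely for illustration:

```python
from scipy.stats import chi2_contingency

# Hypothetical raw counts -- the percentages in the question must be
# converted back to frequencies before any chi-squared test.
# Assumed group sizes: 200 child-victim and 400 adult-victim offenders.
#               male  female
child_group = [160, 40]    # 80% / 20% of 200
adult_group = [300, 100]   # 75% / 25% of 400

stat, p, dof, expected = chi2_contingency([child_group, adult_group])
print(f"chi2 = {stat:.3f}, p = {p:.3f}, dof = {dof}")
```

This single 2x2 test answers the male and female questions together: it asks whether the sex breakdown depends on victim group, which is exactly the "women against children vs. women against adults" comparison. The plea question works the same way with a 2x3 table.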
I am working in the mass spectrometry field and doing data processing (currently using Python scripts). Recently I came up with a question that I have no idea how to solve. Maybe someone has dealt with similar issues...
To be less specific, I will describe it with an analogy of apples. They can be either red or green.
We know that in nature the distribution of apples in a fruit forest is 55% red to 45% green (obtained from a "very large" number of apples, giving a ratio of 1.22 red/green).
However, we have a supplier who brings us trucks with an unknown number of apples; the count varies from truck to truck, but is more or less the same. We only receive the red/green ratio value. That ratio varies from 2.5 to 0.5 (red/green) and has a known shape (not Gaussian, but Gaussian-like). Is it possible, based on the distribution of ratios, to get an approximation of the number of apples in the trucks? (An answer like "around 10,000 apples ± 100%" is still good.)
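One way to see why this can work: if a truck holds n apples, each red with probability p, then by the delta method the red/green ratio has variance roughly p / (n(1−p)³), so the spread of the observed ratios carries information about n. A small simulation sketch (the truck size and counts below are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.55          # known population fraction of red apples
n_true = 500      # hypothetical (unknown) apples per truck
trucks = 2000     # number of observed ratio values

# Simulate: each truck is a binomial draw; we only observe the red/green ratio.
red = rng.binomial(n_true, p, size=trucks)
ratios = red / (n_true - red)

# Delta method: Var(R) ~ p / (n * (1-p)^3), so invert for n.
n_est = p / ((1 - p) ** 3 * ratios.var())
print(f"estimated apples per truck ~ {n_est:.0f} (true: {n_true})")
```

This assumes apples are drawn independently with a constant p and a constant truck size; correlated loading or varying truck sizes would inflate the ratio variance and bias the estimate of n downward.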
The data set given has the columns below:
age sex bmi children smoker region charges
19 female 27.9 0 no southwest 19393.03
I need just the graph name, along with the parameters to be used, to achieve the result for each of the questions below:
1. I need the 5-point summary of the numerical attributes.
2. Distribution of the bmi column.
3. Measure of skewness of the bmi column.
4. Distribution of the categorical columns.
5. Do charges of people who smoke differ significantly from those of people who don't?
6. Does bmi of males differ significantly from that of females?
7. Is the proportion of smokers significantly different across genders?
8. Is the distribution of bmi the same across women with no children, one child, and two children?
I recommend https://bookdown.org/ndphillips/YaRrr/ as a good introduction to R that includes a big section on data visualisation.
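If you end up working in Python instead of R, the hypothesis-test questions at the end of the list can be sketched with scipy.stats. The data below are synthetic stand-ins (sizes and parameters invented) just to show which test pairs with which question; with the real dataset you would use the actual columns:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins for the real columns (sizes and parameters invented).
charges_smokers = rng.normal(32000, 11000, 80)
charges_nonsmokers = rng.normal(8400, 6000, 320)
bmi = rng.normal(30, 6, 400)

# Skewness of bmi (a histogram would show the shape).
print("bmi skewness:", stats.skew(bmi))

# Do charges of smokers differ from non-smokers? Two-sample Welch t-test.
t, p_charges = stats.ttest_ind(charges_smokers, charges_nonsmokers,
                               equal_var=False)
print("smoker vs non-smoker charges: p =", p_charges)

# Is the proportion of smokers different across genders? Chi-squared test
# of independence on a 2x2 table of counts (invented counts below).
table = [[35, 125],   # female: smokers, non-smokers
         [45, 195]]   # male:   smokers, non-smokers
chi2, p_smoke, dof, _ = stats.chi2_contingency(table)
print("smokers by gender: p =", p_smoke)

# Is bmi the same across women with 0, 1, 2 children? One-way ANOVA
# (or stats.kruskal if normality is doubtful).
bmi0, bmi1, bmi2 = bmi[:150], bmi[150:270], bmi[270:]
f, p_bmi = stats.f_oneway(bmi0, bmi1, bmi2)
print("bmi by number of children: p =", p_bmi)
```

The same logic applies to the bmi-by-sex question (another two-sample t-test); for the 5-point summary, `np.percentile` or a boxplot gives min, Q1, median, Q3, max directly.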
Background
I wish to compare menu sales mix ratios for two periods.
A menu is defined as a collection of products (e.g., a hamburger, a club sandwich, etc.).
A sales mix ratio is defined as a product's sales volume in units (e.g., 20 hamburgers) relative to the total number of menu units sold (e.g., 100 menu items were sold). In the hamburger example, the sales mix ratio for hamburgers is 20% (20 burgers / 100 menu items). This represents the share of total menu unit sales.
A period is defined as a time range used for comparative purposes (e.g., lunch versus dinner, Mondays versus Fridays, etc.).
I am not interested in overall changes in the volume (I don't care whether I sold 20 hamburgers in one period and 25 in another). I am only interested in changes in the distribution of the ratios (20% of my units sold were hamburgers in one period and 25% were hamburgers in another period).
Because the sales mix represents a share of the whole, the mean average for each period will be the same; the mean difference between the periods will always be 0%; and, the sum total for each set of data will always be 100%.
Objective:
Test whether the sales distribution (sales mix percentage of each menu item relative to other menu items) changed significantly from one period to another.
Null Hypothesis: the purchase patterns and preferences of customers in period A are the same as those for customers in period B.
Example of potential data input:
[Menu Item] [Period A] [Period B]
Hamburger 25% 28%
Cheeseburger 25% 20%
Salad 20% 25%
Club Sandwich 30% 27%
Question:
Do common methods exist to test whether the distribution of share-of-total is significantly different between two sets of data?
A paired t-test would have worked if I were measuring a change in the number of actual units sold, but not (I believe) for a change in share of total units.
I've been searching online and a few text books for a while with no luck. I may be looking for the wrong terminology.
Any direction, be it search terms or (preferably) the names of appropriate tests, is appreciated.
Thanks,
Andrew
EDIT: I am considering a Pearson correlation test as a possible solution. Forgetting that each row of data is an independent menu item, the math shouldn't care: a perfect match (identical sales mix) would receive a coefficient of 1, and the greater the change, the lower the coefficient. One potential issue is that, unlike in a regular correlation test, the changes may be amplified, because any change to one number automatically impacts the others. Is this a viable solution? If so, is there a way to temper the amplification issue?
Consider using a Chi Squared Goodness-of-Fit test as a simple solution to this problem:
H0: the proportions of menu items for month B are the same as for month A
Ha: at least one of the proportions of menu items for month B is different from month A
There is a nice tutorial here.
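As a concrete sketch (scipy): note that the test needs counts, not percentages. Purely for illustration, assume 100 menu items were sold in each period, so the table's percentages double as counts; with real data you would use the actual unit counts.

```python
from scipy.stats import chisquare

# Illustrative counts: assuming 100 items sold per period, the percentages
# in the question's table map directly to counts.
#           Hamburger, Cheeseburger, Salad, Club Sandwich
period_a = [25, 25, 20, 30]   # expected proportions (baseline period)
period_b = [28, 20, 25, 27]   # observed counts in the comparison period

# Goodness-of-fit: does period B match the proportions observed in period A?
stat, p = chisquare(f_obs=period_b, f_exp=period_a)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```

One caveat: `chisquare` requires the observed and expected totals to match, which is another reason to work with real unit counts rather than percentages whenever the period totals differ.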
You need 100 lbs of bird feed. John's bag can carry 15 lbs and Mark's bag can carry 25 lbs. Both guys have to contribute exactly the same total amount each. What's the lowest number of trips each will have to take?
I have calculated this using systems of equations.
15x + 25y = 100
15x - 25y = 0
Solving gives:
John would need 3.33 trips and Mark would need 2 trips. Only one problem: you can't take 1/3 of a trip.
The correct answer is:
John would take 5 trips (75 lbs) and Mark would take 3 trips (75 lbs).
How do you calculate this? Is there an excel formula which can do both layers of this?
Assuming you put the total bird feed required in A1 and John's and Mark's bag limits in B1 and B2 respectively, then this formula in C1:
=MATCH(TRUE,INDEX(2*ROW(INDIRECT("1:100"))*LCM($B$1:$B$2)>=$A$1,,),0)*LCM($B$1:$B$2)/B1
will give the lowest number of trips required of John. Copying this formula down to C2 will give the equivalent result for Mark.
Note that the 100 in the part:
ROW(INDIRECT("1:100"))
was arbitrarily chosen and will give correct results providing neither John nor Mark is required to make more than twice that number of trips, i.e. 200. Obviously you can amend this value if you feel it necessary (up to a theoretical limit of 2^20).
Regards
Since John and Mark need to carry the same total amount of bird feed, what each carries has to be a multiple of the least common multiple of their bag sizes.
Since they both carry that amount, the combined total will always be an even multiple of the LCM.
So find the least even multiple of the LCM that is larger than 100. And calculate the number of trips John and Mark will have to take from that.
For John:
CEILING(100/(2*LCM(15; 25));1)*LCM(15;25)/15
For Mark:
CEILING(100/(2*LCM(15; 25));1)*LCM(15;25)/25
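The same arithmetic can be checked outside the spreadsheet; here is a Python sketch with the values taken straight from the question:

```python
from math import lcm, ceil

total = 100                    # lbs of feed needed
bags = {"John": 15, "Mark": 25}

# Each person carries the same total, which must be a common multiple of
# both bag sizes; the combined haul is therefore an even multiple of
# lcm(15, 25) = 75. Find the smallest such total covering 100 lbs.
common = lcm(*bags.values())
per_person = ceil(total / (2 * common)) * common

for name, bag in bags.items():
    print(f"{name}: {per_person // bag} trips ({per_person} lbs)")
```

This reproduces the answer above: each carries 75 lbs, so John takes 5 trips and Mark takes 3.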
I am trying to figure out the optimal number of products to make per day, by displaying the values in a chart and then using the chart to find the optimum.
Cost of production: $4
Sold for: $12
Leftovers sold for $1
So the ideal profit for a product is $8, but it could be -$3 if it's left over at the end of the day.
The daily demand of sales has a mean of 150 and a standard deviation of 30.
I have been able to generate a list of random daily demands using NORMINV(RAND(), mean, std_dev),
but I don't know where to go from here to figure out the amount sold from the number of products made that day.
The number sold on a given day is min(# produced, daily demand).
ADDENDUM
The decision variable is a choice you make: "I will produce 150 each day", or "I will produce 145 each day". You told us in the problem statement that daily demand is a random outcome with a mean of 150 and a SD of 30. Let's say you go with producing 150, the mean of demand. Since it's the mean of a symmetric distribution, half the time you will sell everything you made and have no losses, but in most of those cases you actually could have sold more and made more money. You can't sell products you didn't make, so your profit is capped at selling 150 on those days. The other half of the time, you won't sell all 150 and will take a loss on the unsold items, reducing your profit a bit. The actual profit on any given day is a random variable, because it is determined by random demand.
Since profit is random, you can calculate your average earnings across many days based on the assumption that you produce 150. You can also average earnings based on the assumption that you produce 140 per day, or 160 per day, or any other number. It sounds like you've been asked to plot those average earnings versus how many you decided to produce, and choose a production level that results in the highest long-term average earnings.
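To turn that into the chart, a Monte Carlo sketch like the following works (Python rather than Excel; the prices and demand parameters come from the question, while the grid of production levels and the number of simulated days are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)

COST, PRICE, SALVAGE = 4, 12, 1    # from the question
MEAN, SD = 150, 30                 # daily demand distribution
DAYS = 100_000                     # simulated days per production level

def avg_profit(produced):
    # Demand is normal(150, 30); negative draws are clipped to zero.
    demand = np.clip(rng.normal(MEAN, SD, DAYS), 0, None)
    sold = np.minimum(produced, demand)   # can't sell more than you made
    leftover = produced - sold
    return np.mean(PRICE * sold + SALVAGE * leftover - COST * produced)

levels = np.arange(100, 201, 5)
profits = [avg_profit(q) for q in levels]
best = levels[int(np.argmax(profits))]
print("best production level ~", best)
```

Plotting `profits` against `levels` gives exactly the chart described: average earnings versus the production decision, with the peak marking the best long-run choice. Because the profit curve is quite flat near the top, neighboring levels differ by only a dollar or two, so simulation noise can shift the exact argmax slightly between runs.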