Statistical test for an effect across groups when the data are nested and not normally distributed

What is the best statistical test for an effect across groups when the data are nested and may not be normally distributed? I get a highly significant effect using the Kruskal-Wallis test, but it does not account for the fact that the data points come from several locations, each of which contributed data for several years, and within each year the data were pooled into age groups.

I think you can categorize the data by year and restructure it so that it is no longer nested, which makes it easier to analyse. I agree that the Kruskal-Wallis test is a good choice for testing the cross-group effect.
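As a concrete illustration of that restructuring, here is a minimal sketch, assuming the observations sit in a pandas DataFrame with hypothetical columns location, year, age_group and value, and using SciPy's Kruskal-Wallis test separately within each year:

```python
import pandas as pd
from scipy.stats import kruskal

# Hypothetical file and column names - adjust to your own dataset.
# One row per observation: location, year, age_group, value.
df = pd.read_csv("observations.csv")

# Run the Kruskal-Wallis test separately within each year, so the
# year level of the nesting is no longer pooled into one comparison.
for year, year_df in df.groupby("year"):
    groups = [g["value"].values for _, g in year_df.groupby("age_group")]
    stat, p = kruskal(*groups)
    print(year, stat, p)
```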


How to deal with end-of-life scenarios in Brightway?

I am currently working on a project about life cycle models for vehicles in Brightway. The models I am using are inspired by models in the SimaPro software. All of the life cycle processes are created fine except for the end-of-life scenarios. In SimaPro, the end-of-life scenarios are described with percentages of recycled mass for each type of product (plastics, aluminium, glass, etc.), but I can't find how to translate this into Brightway. Do you have ideas on how to deal with these end-of-life scenarios in Brightway? Thank you for your answer.
[Image: example of the definition of an end-of-life scenario in SimaPro]
There are many different ways to model end-of-life, depending on what kind of abstraction you choose to map to the matrix math at the heart of Brightway. There is always some impedance between our intuitive understanding of physical systems and the computational models we work with. Brightway doesn't have any built-in functionality to calculate fractions of material inputs, but you can do this manually by adding an appropriate EoL activity for each input to your vehicle. This can be in the vehicle activity itself, or in a separate activity. You could also write functions that add these activities automatically, though my guess is that manual addition makes more sense, as you can more easily check the reasonableness of the linked EoL activities.
One thing to be aware of is that, depending on the background database you are using, the sign of the EoL activity might not be what you expect. Again, our intuitive picture is not necessarily what fits the model. For example, aluminium going to a recycling center is a physical output of an activity, and all outputs have positive signs (inputs have negative signs in the matrix, but Brightway sets this sign based on the type of the exchange). However, ecoinvent models EoL treatment activities as negative inputs (which is equivalent to positive outputs, since the negatives cancel). I would build a simple system to make sure you are getting the results you expect before working on more complex systems.
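As a rough sketch of the manual approach described above, using the brightway2 package - the project, database and activity codes ("vehicle_lca", "vehicle_foreground", "vehicle_assembly", "aluminium_recycling") and the 60 % recycled fraction are all hypothetical placeholders you would replace with your own:

```python
import brightway2 as bw

bw.projects.set_current("vehicle_lca")          # hypothetical project
db = bw.Database("vehicle_foreground")          # hypothetical foreground database

vehicle = db.get("vehicle_assembly")            # the vehicle activity
eol_aluminium = db.get("aluminium_recycling")   # an EoL treatment activity

# SimaPro-style EoL scenario: 60 % of the aluminium mass goes to recycling.
recycled_fraction = 0.6
aluminium_mass = sum(
    exc["amount"]
    for exc in vehicle.technosphere()
    if "aluminium" in exc.input["name"].lower()
)

# Depending on the background database, the sign convention discussed above
# may require a negative amount here - check a simple system first.
exc = vehicle.new_exchange(
    input=eol_aluminium,
    amount=recycled_fraction * aluminium_mass,
    type="technosphere",
)
exc.save()
```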

ANOVA test on time series data

In the Analytics Vidhya post linked below, an ANOVA test is performed on COVID data to check whether the difference in positive cases between denser regions is statistically significant.
I believe an ANOVA test can't be performed on this COVID time-series data, at least not in the way it has been done in that post.
Sample data were drawn randomly from the different groups (denser1, denser2, ..., denser4). Because the data form a time series, the positive-case counts in the random samples from different groups are likely to come from different points in time.
It might be, for example, that denser1 contributes random data from the early COVID period while another region contributes data from a different period. If that is the case, the F-statistic will almost certainly be high.
Can anyone weigh in if you have a different opinion?
https://www.analyticsvidhya.com/blog/2020/06/introduction-anova-statistics-data-science-covid-python/
ANOVA should not be applied to time-series data, as the independence assumption is violated. The issue with independence is that days tend to correlate very highly. For example, if you know that today you have 1400 positive cases, you would expect tomorrow to have a similar number of positive cases, regardless of any underlying trends.
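To make the independence point concrete, here is a tiny sketch with made-up daily counts showing how strongly consecutive days correlate:

```python
import pandas as pd

# Made-up daily positive-case counts, purely for illustration.
cases = pd.Series([1400, 1450, 1390, 1510, 1480, 1530, 1600, 1580, 1650, 1700])

# Lag-1 autocorrelation: today's count is a very good predictor of tomorrow's,
# which is exactly the independence assumption ANOVA requires and
# time-series data violate.
print(cases.autocorr(lag=1))
```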
It sounds like you're trying to determine the causal effect of different treatments (e.g., mask mandates or other restrictions) on positive cases. The usual way to infer causality is A/B testing, but in this case it would obviously not be reasonable to give different populations different treatments. One method that is well suited to retroactively inferring causality is called "synthetic control".
https://economics.mit.edu/files/17847
Linked above is a foundational paper on the methodology. The hard part of this analysis will be constructing the synthetic counterfactuals, or "controls", against which to test your actual population.
If this is not what you're looking for, please reply with a clarifying question, but I think this should be an appropriate method that is well-suited to studying time-series data.

How can I determine the best data structure/implementation for my dataset?

Preface: I'm a self-taught coder, so a lot of my knowledge is limited to my own research. I'm hoping to get other opinions, as I want to build things right the first time. I need help determining an appropriate solution and figuring out how to implement it.
I'm looking to build a least-cost alternative model (essentially a shortest path) for delivering between locations (nodes), based on different modes of transportation (vehicles) and the different roads taken (paths). Another consideration in determining the least-cost path is the product price (value).
Here are my important data items:
nodes: cities where the product will travel to and from.
paths: roads have different costs, depending on the road.
vehicles: varying vehicles have differing rental costs when transporting (motorbike, car, truck). Note that the cost of a vehicle is not constant, it is highly dependent on the to/from nodes. For example, using a car to go from city A to city B will have a different cost than using a car to go from city B to A or city A to city C.
value: Product value. Again, a product's value is highly dependent on its destination node. The same product can have a different value at City A, B or C.
Problem Statement
How do I set up a data structure to best determine the least-cost path for getting a product from one location to every other location?
Possible Solutions
From my research, I believe a weighted graph data structure, combined with Dijkstra's algorithm, would be most suitable for my situation. I think it would be essential to break the problem down into simpler pieces: first create a simple weighted graph of only nodes and paths, then add the vehicle cost and the product value considerations afterwards. Perhaps the two values could simply be added as a cost to "visit" a node (i.e., incorporated into the path cost)?
Thoughts on my current solution? Other considerations I overlooked? Perhaps a better solution?
Implementation
I'd love to be able to build this in Excel VBA (as that is how I learned to code), and Excel is what I use for my tools. Would VBA be too limited for this task? How else could I integrate my analysis in Excel with another language?
Try the book Practical Management Science by Winston & Albright and check out the chapter on Operations Management - lots of models are explained in there, from the simple onwards. It is available online as a PDF: http://ingenieria-industrial.net/downloads/practicalmanagementscience.pdf
VBA is more of a scripting language than a full-fledged one, though one may contend that the underlying framework is .NET. Why don't you give C++ or Java a shot? If you intuitively understand the data structure and the algorithm, coding it in these will be a breeze. Chapter 4 of Algorithms by Sedgewick and Wayne has a beautiful explanation of shortest paths. You may also consider studying the Bellman-Ford algorithm if you foresee any negative edge weights or negative-weight cycles.
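To make the weighted-graph idea concrete, here is a minimal Python sketch of Dijkstra's algorithm on a hand-made graph. The city names and costs are purely illustrative, and each edge weight is assumed to already combine the road cost and the vehicle rental cost for that leg; product value can then be handled afterwards by comparing the value at each destination against the accumulated path cost.

```python
import heapq

# Directed, weighted graph: costs may differ by direction (A->B vs B->A).
graph = {
    "A": [("B", 7.0), ("C", 4.0)],
    "B": [("A", 6.5), ("C", 2.0)],
    "C": [("A", 4.0), ("B", 2.5)],
}

def least_cost_from(source, graph):
    """Dijkstra: cheapest known cost from `source` to every reachable node."""
    costs = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > costs.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < costs.get(neighbour, float("inf")):
                costs[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return costs

print(least_cost_from("A", graph))   # e.g. {'A': 0.0, 'C': 4.0, 'B': 6.5}
```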

Determine coefficients for some function

I have a task that is probably related to data analysis or even neural networks.
We have a data source from one of our partners, a job portal. The source values are arrays of different attributes describing a particular employee:
Gender,
Age,
Years of experience,
Portfolio (number of the projects done),
Profession and specialization (web design, web programming, management etc.),
Many others (around 20-30 in total).
Every employee has their own (hourly) salary rate. So, mathematically, we have some function
F(attr1, attr2, attr3, ...) = A*attr1 + B*attr2 + C*attr3 + ...
with unknown coefficients. But we know the result of the function for specific arguments (say, we know that a male programmer with 20 years of experience and 10 works in his portfolio has a rate of $40 per hour).
So we have to find somehow these coefficients (A, B, C...), so we can predict the salary of any employee. This is the most important goal.
Another goal is to find which arguments are most important - in other words, which of them cause significant changes to the result of the function. So in the end we have to have something like this: "The most important attributes are years of experience; then portfolio; then age etc.".
There may be a situation when different professions vary too much from each other - for example, we simply may not be able to compare web designers with managers. In this case, we have to split them by groups and calculate these ratings for every group separately. But in the end we need to find 'shared' arguments that will be common for every group.
I'm thinking about neural networks because this seems like something they might handle. But I'm completely new to them and have no idea at all what to do.
I'd very much appreciate any help - which tools to use, which algorithms, or even pseudo-code samples, etc.
Thank you very much.
That is the most basic example of (linear) regression. You are using a linear function to model your data, and need to estimate the parameters.
Note that this is actually part of classic mathematical statistics, not data mining; it is much, much older.
There are various methods. Given that there will likely be outliers, I would suggest using RANSAC.
As for the importance, doesn't this boil down to "which is largest: A, B, or C?" (assuming the attributes are on comparable scales)?
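As a minimal sketch of the RANSAC suggestion, assuming scikit-learn and a tiny, entirely made-up feature matrix (age, years of experience, portfolio size) with hourly rates as the target:

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

# Made-up data: one row per employee (age, years_experience, portfolio_size).
X = np.array([
    [35, 20, 10],
    [28,  5,  3],
    [42, 15,  8],
    [31,  8,  5],
    [25,  2,  1],
    [50, 25, 20],
    [38, 12,  6],
    [29,  6,  4],
], dtype=float)
y = np.array([40.0, 18.0, 33.0, 24.0, 14.0, 55.0, 30.0, 20.0])  # hourly rates

# RANSAC fits an ordinary linear regression while ignoring gross outliers.
model = RANSACRegressor(random_state=0)
model.fit(X, y)

# The fitted coefficients are the A, B, C, ... from the question.
print(model.estimator_.coef_, model.estimator_.intercept_)

# Note: ranking attributes by coefficient magnitude only makes sense once the
# attributes are on comparable scales (e.g. after standardizing them).
```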

Data Aggregation :: How important is it really?

I'm curious to know how much people value data aggregation. If you don't mind, let me know how important it really is to you personally in your work environment, and whether you have to work directly with data aggregation in your line of work.
Really interested to hear your feedback.
If you persist data (e.g. store it in a database), chances are that the data will be used by managers, statisticians, stakeholders, etc. to analyze the workings of their software-supported undertaking and make executive decisions. This analysis can only take place by means of aggregation. There's no one in the world who can look at a million rows of raw data and glean insight. The data have to be summed, averaged, standard-deviated, etc. to make any sense to a human being.
A few examples of areas where data aggregation is important:
Public Health (CDC, WHO)
Marketing
Advertising
Politics
Organizational Management
Space Exploration
lol. Take your pick!
Very important, what else is there to say?
I work at a large hospital, and not only do we have numerous departments using the Analysis Services cubes we developed, but they also rely heavily on the daily totals and the different aggregations they can derive from these cubes by simply browsing them. Without the very basic capability of aggregating over some portion of your data, you might as well write it on paper (IMO).
Say you have data on every individual sale.
Looking at these individual purchases can be interesting at some level (e.g. when a customer comes in and wants a refund).
However, I cannot send those 20 million records to my boss at the end of each month and say, "Here's how much we sold this month."
This data needs to be aggregated and summarized at various levels. The business would not operate if the marketing people couldn't get an aggregate for each product, the regional boss couldn't get a total aggregate over a time period, and so on.
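A trivial sketch of that roll-up, assuming pandas and a made-up sales table:

```python
import pandas as pd

# Made-up sales table: one row per individual sale.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "month":   ["2023-01", "2023-01", "2023-01", "2023-02", "2023-02"],
    "amount":  [120.0, 80.0, 200.0, 150.0, 90.0],
})

# Roll the individual sales up to the levels management actually reads.
by_product = sales.groupby("product")["amount"].sum()
by_region_month = sales.groupby(["region", "month"])["amount"].agg(["sum", "mean"])

print(by_product)
print(by_region_month)
```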
Our databases have millions of rows, so of course we rely on aggregation for management information. Not using it would put too heavy a load on the database when running large reports, which would impact heavily on the users of the database. I can't think of many cases where a database contains business-critical information that managers use to make decisions and aggregation would not be needed for the management reports.
I view data aggregation like having data in a grid and being able to group, order, and sort columns. With a large grid of data, this is very important. It's really the difference between looking at a pile of numbers and looking at meaningful data.
