I am stuck on a task and need some guidance. I have two datasets in hand. One is the output of an SVM; it has 4 dimensions and represents DoE data that will later be used for training an ML model. The other is calibration data based on the same 4 dimensions. The task at hand is to verify whether the calibration data falls within the DoE data, preferably near its center, and how each individual operating point in the calibration data is distributed along the 4 dimensions. I will later need to do this for all the calibration points to verify the accuracy of the DoE data.
Can anyone suggest how I can proceed with this?
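One way to make this concrete, as a sketch rather than a definitive method: test whether each calibration point lies inside the convex hull of the DoE points, and measure its centrality via the Mahalanobis distance from the DoE mean. The `doe` and `calib` arrays below are hypothetical placeholders.

```python
import numpy as np
from scipy.spatial import Delaunay
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
doe = rng.normal(size=(500, 4))    # placeholder for the 4-D DoE/SVM output
calib = rng.normal(size=(20, 4))   # placeholder for the calibration points

# 1) Inside/outside test: a point is inside the convex hull of the DoE data
#    iff the Delaunay triangulation finds a simplex containing it.
hull = Delaunay(doe)
inside = hull.find_simplex(calib) >= 0

# 2) Centrality: Mahalanobis distance from the DoE center, per point
#    (small distance = close to the center of the DoE cloud).
center = doe.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(doe, rowvar=False))
dists = np.array([mahalanobis(p, center, cov_inv) for p in calib])

for i, (ok, d) in enumerate(zip(inside, dists)):
    print(f"calib point {i}: inside hull={ok}, Mahalanobis distance={d:.2f}")
```

Looking at the per-dimension spread of the calibration points against the DoE ranges (e.g. with box plots per dimension) would then cover the "distribution along the 4 dimensions" part.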
I have two data sets (each with datapoints + standard deviation) and want to check whether they are statistically different. What kind of test would be appropriate?
Thank you!
It depends. If the blue and red samples are randomly obtained and are the same group of items measured at different times, a paired two-sample t-test applies. If they belong to different groups, the unpaired two-sample t-test is suitable. This choice rests on the assumption that both samples are normally distributed, or can be transformed to normality by a logarithmic transformation; otherwise, use the Mann-Whitney test. The values to compare are the output percentages at the same input value, and the data should be continuous, as in your case.
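A minimal sketch of these three tests with scipy.stats, where blue and red are hypothetical arrays of output percentages at the same inputs:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
blue = rng.normal(50, 5, size=30)  # placeholder measurements
red = rng.normal(52, 5, size=30)

# Same items measured at two times -> paired two-sample t-test
t_paired, p_paired = stats.ttest_rel(blue, red)

# Independent groups -> unpaired two-sample t-test
t_unpaired, p_unpaired = stats.ttest_ind(blue, red)

# Normality untenable (even after a log transform) -> Mann-Whitney U test
u_stat, p_mw = stats.mannwhitneyu(blue, red)

print(p_paired, p_unpaired, p_mw)
```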
I want to know the best approach for a regression/classification analysis where all the features are text. I have the following dataset.
My feature columns are: Strength, area of development, leadership, satisfactory.
The values of these columns come from a predefined set of texts, e.g. "Continuous Improvement,Self-Development,Coaching and Mentoring,Creativity,Adaptability".
Based on the values in these columns, I want to predict the label (overall Performance): Outstanding, Exceeding Expectation, or Meeting Expectation.
What would be the best approach to deal with this dataset?
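One common starting point, sketched under the assumption that each cell holds comma-separated tags from the predefined set: split the tags, multi-hot encode them per column, and fit an ordinary classifier. The file name below is hypothetical; the column names follow the question.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("reviews.csv")  # hypothetical file with the columns below
feature_cols = ["Strength", "area of development", "leadership", "satisfactory"]

# Turn "Creativity,Adaptability" into binary indicator columns per feature.
parts = []
for col in feature_cols:
    tags = df[col].fillna("").str.split(",")
    mlb = MultiLabelBinarizer()
    encoded = pd.DataFrame(mlb.fit_transform(tags),
                           columns=[f"{col}:{t}" for t in mlb.classes_],
                           index=df.index)
    parts.append(encoded)

X = pd.concat(parts, axis=1)
y = df["overall Performance"]  # Outstanding / Exceeding / Meeting Expectation

clf = LogisticRegression(max_iter=1000).fit(X, y)
```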
I am currently working on a financial data problem. I want to detect trades for which anomalous theta values are being generated by the models (due to several factors).
My data mainly consists of trades with their profile variables (dealId, portfolio, etc.), along with theta values and the theta components for different dates going back 3 years.
The data I am currently using looks like this:
TradeId | Date1 | Date2 | ...
id1     | 1234  | 1238  | ...
id2     | 1289  | 1234  | ...
Currently, I am tracking daily theta movement for all trades and flagging trades whose theta has moved by more than 20k in absolute value.
I want to build an ML model that tracks theta movement and detects which deal ID(s) have an anomalous theta for the current date.
So far, I have tried clustering trades by their theta-movement correlation using DBSCAN with a distance matrix. I have also tried an Isolation Forest, but it does not generalize well on this dataset.
All the examples I have seen for anomaly detection are more like finding a rotten apple in a bunch of apples. Is there an algorithm best suited to my case, or one that can be modified to fit my problem?
Your problem seems too simple for the machine learning world.
You can manually define a threshold beyond which the data is anomalous and identify those points.
To do that, you can easily analyze your data with pandas to find the mean, max, min, etc., and then define the threshold.
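A minimal sketch of that idea, assuming `df` is a wide frame of daily theta values indexed by trade ID, as in the question's table:

```python
import pandas as pd

# Toy stand-in for the question's table: one row per trade, one column per date.
df = pd.DataFrame({"Date1": [1234, 1289], "Date2": [1238, 1234]},
                  index=["id1", "id2"])

moves = df.diff(axis=1)           # day-over-day theta movement per trade
summary = moves.stack().describe()  # mean, std, min, max of all movements
print(summary)

# Either keep the fixed business rule...
flagged_fixed = moves.abs() > 20_000

# ...or derive a data-driven threshold, e.g. mean plus 3 standard deviations.
threshold = summary["mean"] + 3 * summary["std"]
flagged_stat = moves.abs() > threshold
print(flagged_stat)
```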
Naive Question:
In the attached snapshot, I am trying to understand how correlation behaves when applied to the actual values versus to a new stream of data created by a calculation on those actual values.
In the example,
columns A, B, C, D, E have very different correlations, but when I take a rolling sum of the same columns to get G, H, I, J, K, the correlations are very much the same (strongly positive or negative).
Are these two different types of correlation, or am I missing something?
Thanks in advance!!
Yes, these are different correlations. It's similar to if you were to measure acceleration over time of 5 automobiles (your first piece of data) and correlate those accelerations. Each car accelerates at different rates over time leaving your correlation all over the place.
Your second set of data would be the velocity of each car at each point in time. Because each car is accelerating at a pretty constant rate (and doing so in two different directions from the starting point) you either get a big positive or big negative correlation.
You won't necessarily get that big positive or negative correlation in the second set, but because the data in each list is consistently positive or negative and grows at a fairly constant rate, each list correlates strongly with the other similarly behaved lists.
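A quick way to see this effect with illustrative random data: independent columns are nearly uncorrelated, but their rolling sums are smooth, trending series whose pairwise correlations are typically much larger in magnitude.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Five independent noise columns, standing in for A..E.
raw = pd.DataFrame(rng.normal(size=(500, 5)), columns=list("ABCDE"))
print(raw.corr().round(2))        # scattered, near-zero correlations

# Rolling sums, standing in for G..K: overlapping windows make each series
# smooth and trending, so pairwise correlations tend toward large |values|.
summed = raw.rolling(window=50).sum()
print(summed.corr().round(2))
```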
Suppose I have a list of values that I can histogram and compute descriptive statistics on, such as mean, median, max, standard deviation, etc. Perhaps the histogram is bimodal or right-skewed. Let's call this group of data "DataSet1".
Suppose I have just the mean or median of another set of data; let's call it "DataSet2". I do not have the raw data for DataSet2, only its median or mean. There is a strong belief that DataSet1 and DataSet2 would show the same variability in values.
Knowing just that single mean or median, can I apply the descriptive statistics from DataSet1 to create a new histogram that mirrors the bimodal or right-skewed behavior of DataSet1?
Thanks
Dan
To put the intent another way:
I have 3 years of historical data, and the data definitely has a day-of-week trend to it. I am using a Python library to apply seasonal ARIMA and forecast the next 7 days from the 3 years of history. The predicted value is great, but it is only a single point. I would like to use that predicted value as the "mean" and create a histogram from the variability the data shows historically by day of week.
So, today is Thursday. Let's say I predict tomorrow to have a value of 78.6.
I want to sample potential values for tomorrow based on a mean of 78.6, but with variability similar to that shown to exist on all historical Fridays.
If I look at historical Fridays, perhaps they show left-skewed behavior.
So when I sample with a mean of 78.6, if I sampled 100 times, the sampled values, plotted in a histogram, would also skew to the left.
Hope that helps..
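A minimal sketch of that idea, assuming `history` is a hypothetical pandas Series of daily values: take the deviations of historical Fridays around their own mean, re-center them on the forecast, and resample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Placeholder for 3 years of daily history (gamma noise just to give it skew).
idx = pd.date_range("2020-01-01", periods=3 * 365, freq="D")
history = pd.Series(rng.gamma(shape=2.0, scale=10.0, size=len(idx)), index=idx)

fridays = history[history.index.dayofweek == 4]  # all historical Fridays
deviations = fridays - fridays.mean()            # shape around their own mean

forecast = 78.6  # the SARIMA point forecast from the question
samples = forecast + rng.choice(deviations.values, size=100, replace=True)

# "samples" now has mean ~78.6 with whatever skew the historical Fridays
# show; histogram it, e.g. pd.Series(samples).hist(), to see the shape.
```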