I have a dataset for indoor localization. The dataset contains about 520 columns, one per wireless access point (AP), each holding an RSSI value. The problem is that each row holds the values from a single scan of the signals a device could capture, and a device can capture at most about 20 APs per scan. (The signal ranges from 0 dBm, when the device is right next to the AP, down to about -100 dBm when the device is far from the AP but can still capture its signal.) The remaining APs, which were out of the device's range during the scan, have been filled with a default value of positive 100. These values (+100) occupy about 500 columns in every row, and which columns they occupy changes whenever the location changes. The question is: how should I deal with them?
One option for dealing with this issue is to impute (replace) the out-of-range values with something more reasonable. There are several approaches you could take:
Replacing the out-of-range values with the mean or median of the in-range values (sketched below)
Using linear interpolation to estimate the missing values based on the surrounding values
The choice will depend on the goal of your machine learning model and what you want to achieve.
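As a minimal sketch of the first (median) approach in pandas, assuming a DataFrame df with one column per access point and +100 as the out-of-range sentinel:

```python
import numpy as np

def impute_out_of_range(df, sentinel=100):
    """df is a pandas DataFrame with one RSSI column per access point."""
    df = df.replace(sentinel, np.nan)   # treat the sentinel as missing
    medians = df.median()               # per-AP median of the in-range RSSI
    return df.fillna(medians)           # impute each AP with its own median
```

Columns that are never in range stay NaN here, so you may still need a fallback (for example a floor value such as -100 dBm) for APs with no in-range readings at all.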
I am working on a little data mining project (I am still a Data Science student, not a professional). Maybe you can help me to choose a proper model for my task.
So, let's say we have a table with three columns and around 4000 rows:
YEAR    COLOR     NAME
1900    Green     David
1901    Yellow    Sarah
1902    Green     ???
1902    Red       Sarah
…       …         …
2020    Purple    John
Any value in any field can be repeated in the dataset (including Year values).
In the first two columns we don't have missing values, but in the third column we only have Name values for around 20% of the rows. The Name value depends somewhat on the first two columns (not a causal relation).
My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each name value (for example, in a boxplot).
I have imagined a process like the following, although I am not sure it makes sense statistically (any objections and suggestions are appreciated):
For every unknown NAME value, the algorithm randomly chooses one of the already-known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' tends to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen when the input values for Year and Color are "1900, Purple".
When the above process ends, the number of occurrences for each name is counted.
The above process is repeated 30 times and the results for each name are displayed in a boxplot.
However, I don't know which model would be best for implementing an idea like this. I have drawn the process in a simple Paint drawing:
Possible output for the task
What do you think would be a good approach to this task? I appreciate any help.
I think you have the process down; it's converting the data which may be the first hurdle.
I would look at using from sklearn.preprocessing import OrdinalEncoder to encode the data, converting it from categorical to numeric.
You could then use a random number generator to produce a number within the range defined by the encoding, which would randomly select a name.
Loop through this 30 times with a for loop to achieve the result.
It also looks like you will need to provide the ranking values for year and colour prior to building out your code. From there you would just provide bands, for example if year > 1985, etc., within your for loop to specify the names. A rough sketch of the whole idea is shown below.
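Here is one way that sketch could look (the DataFrame name df, the inverse-distance weighting, and the column names are my assumptions, not a definitive implementation):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Assumed setup: df has YEAR, COLOR, NAME columns, with NAME
# missing (NaN) for ~80% of the rows.
df["COLOR_enc"] = OrdinalEncoder().fit_transform(df[["COLOR"]]).ravel()

known = df[df["NAME"].notna()]
unknown = df[df["NAME"].isna()]
rng = np.random.default_rng()

counts_per_run = []
for run in range(30):                    # repeat the random fill 30 times
    sampled = []
    for _, row in unknown.iterrows():
        # Weight each known row by its closeness in YEAR/COLOR space
        # (simple inverse-distance weighting; one choice among many).
        dist = (np.abs(known["YEAR"] - row["YEAR"])
                + np.abs(known["COLOR_enc"] - row["COLOR_enc"]))
        weights = (1.0 / (1.0 + dist)).to_numpy()
        sampled.append(rng.choice(known["NAME"].to_numpy(),
                                  p=weights / weights.sum()))
    counts_per_run.append(pd.Series(sampled).value_counts())

# pd.DataFrame(counts_per_run) has one row per run and one column per
# name; its .boxplot() gives the range of occurrences per name.
```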
I'm having trouble understanding whether Spotfire allows conditional computations between arbitrary rows containing numerical data repeated over data groups. I could not find anything to cue me toward the right solution.
Context (simplified): I have data from a sensor reporting the state of a process; this data is grouped into bursts/groups, each representing a measurement taking several minutes.
Within each burst the sensor measures a signal, and if a predefined feature (signal shape) is detected, the sensor outputs a calculated value V quantifying this feature and also reports the RunTime (RT) at which this happened.
So in essence I have three columns: the Burst number, a set of RTs within each burst, and the Values associated with those RTs.
I need to add a calculated column that takes the ratio of the Values in the rows where RT equals specific numbers, let's say 1.89 and 2.76.
The high level logic would be:
If a Value exists at 1.89 Run Time and a Value exists at 2.76 Run Time then compute the ratio of these values. Repeat for every Burst.
I understand I can repeat the computation over groups using the OVER operator, but I'm struggling with the logic within each group...
Any tips would be appreciated.
Many thanks!
The first thing you need to do here is apply an order to your dataset. I assume the sample data is complete and encompasses the cases in your real data; thus, we create a calculated column:
RowID() as [ROWID]
Once this is done, we can create a calculated column which will compute your ratio over its respective groups. Just a note: your B4 example is incorrect compared to the other groups; you have your numerator and denominator reversed.
If(([RT]=1.89) or ([RT]=2.76),[Value] / Max([Value]) OVER (Intersect([Burst],Previous([ROWID]))))
Breaking this down...
If(([RT]=1.89) or ([RT]=2.76), limits the rows to those where the RT = 1.89 or 2.76.
Next comes what is evaluated when the above condition is TRUE:
[Value] / Max([Value]) OVER (Intersect([Burst],Previous([ROWID]))) takes the value for the row and divides it by Max([Value]) over the grouping of [Burst] and Previous([ROWID]), as expressed by the Intersect() function. So the denominator will always be the previous value within the grouping. Note that Max() is just a simple aggregate; any should do in this case, since we are only expecting a single value. All Over() functions require an aggregate to limit the result set to a single row.
RESULTS
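If it helps to prototype the logic outside Spotfire, here is a rough pandas equivalent (the column names Burst, RT, and Value come from the question; the ratio direction, Value at 2.76 over Value at 1.89, is an assumption matching the previous-value logic above):

```python
import pandas as pd

def burst_ratios(df):
    # Keep only the two RTs of interest and pivot to one row per Burst.
    pivot = df[df["RT"].isin([1.89, 2.76])].pivot_table(
        index="Burst", columns="RT", values="Value")
    # Bursts missing either RT yield NaN instead of a ratio.
    return pivot[2.76] / pivot[1.89]
```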
I made a little test machine that accidentally created a 'big' data set:
6 columns with roughly 550,000 rows.
The end result I am looking for is a graph with 6 lines: measurements 1-550,000 on the horizontal axis and the row values on the vertical axis (capped at 200 or so). The data is a resistance measurement that should be between 0 and 30, or very big (broken); the software writes 'inf' in those cases.
My skills are limited to Excel, so here is what I have done until now:
Imported the data into Excel. The measurements are valuable between 0 and 30, and inf is not good for a graph, so I did: =IF(cell>200, 200, cell).
Making a graph is now a lengthy exercise, Excel does not like it, and the result is not good.
So I would like to take the average of every 60 measurements to reduce the rows to below 10,000, i.e. =AVERAGE(H1:H60).
But I cannot get this to work.
Questions:
How do I reduce this data set and get a good graph?
Should I switch to other software that is more applicable?
FYI: I already changed the software of the testing device to take the average value of a bunch of measurements the next time... But I cannot repeat this test.
Download link of data set (comma-separated file, 17 MB)
I think you are on the right track; my guess is that you want to compute an average for every 60 rows and are unsure how to do this.
Using MOD(Number, Divisor) inside an if statement will let you specify that the average should be calculated only once in every x number of cells.
Assuming you'll have one row above your data table for headers, you are looking for something along the lines of:
=IF(MOD(ROW(A61),60) = 1,AVERAGE(H2:H61),"")
Once you have this you can filter your average column to non-blank values and use this to create your graph.
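If Excel remains sluggish, the same reduction takes only a few lines outside Excel; here is a sketch in Python with pandas (the file name and the assumption of a header row are mine):

```python
import pandas as pd
import matplotlib.pyplot as plt

# 'inf' is parsed as floating-point infinity, then capped at 200,
# mirroring the IF(cell>200, 200, cell) step.
df = pd.read_csv("measurements.csv").clip(upper=200)

# Average every 60 consecutive rows: ~550,000 rows become ~9,200.
binned = df.groupby(df.index // 60).mean()

binned.plot()   # one line per column
plt.show()
```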
I have data points of a sound sample from Audacity, which I exported to a .txt file and imported into Excel. Is it possible to plot an upper envelope function in Excel?
(In the end I have to determine the reverberation time, i.e. the time in which the loudness decreases by 60 dB.)
For a decaying oscillation such as this, or a damped pendulum, the decay envelope can be found by looking at the difference between successive sampled readings. You then need to see where there is a change in gradient from +ve to -ve (or zero on one side). To do this, some logical operators are used.
Method:
Consider data starting at column A row 5 for time and column B row 5 for the first data value.
Create a column which has each value minus the previous value. [C6 = B6-B5, etc.]
Next make a column for the gradient change, giving a "flag" of 1 for a positive to negative (or zero) inflection. [D7 = IF(AND(C7>0,C8<=0),1,0)]
This should produce a column of flags that correspond to the peaks.
In the next columns, use the flag to get the original co-ordinates to display:
[E7 = IF(D7=1,A7,"")] for time and [F7 = IF(D7=1,B7,"")] for the value.
Copy this data to yet another column using Paste Special → Values.
Filter this to exclude the blank rows.
Beware of aliasing in the data set.
Not the most elegant of solutions (some steps could be combined; they are shown separately for clarity), but it works.
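If Python is an option, the same peak-picking fits in a few lines (a sketch assuming t and y are 1-D NumPy arrays of the exported time and level samples):

```python
import numpy as np

def upper_envelope(t, y):
    d = np.diff(y)                             # column C: successive differences
    # Column D: flag peaks where the gradient goes from +ve to -ve/zero.
    peaks = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    return t[peaks], y[peaks]                  # columns E/F: peak co-ordinates
```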
I've created a spreadsheet for choosing resistor combinations for an RC operational-amplifier circuit. I've used a list of available (standard) capacitors and resistors as my limiting values to produce values of one of the resistors, based on the resistance and capacitance values of the available components. The values in my tables look like 7.23436793078690. I wish to apply a filter that finds the values closest to a whole number (for example 1592.00188622182000). Then I wish to apply another filter that compares those values to a list of available resistors and highlights the resistors closest to the desired value. Many of the returned values of R2 are negative, so I also wish to filter out values where R2 < 0.
For this spreadsheet I've used the equation R2 = (Req)(R1)/(R1-Req), which is the parallel-resistor equation for Req solved for R2. In Column A, the rows are populated with the values of available (standard) resistors. All other columns are populated with the equation for R2. The value for Req is obtained from another table in the workbook that uses available (standard) capacitor values. Therefore, Columns B and beyond are labeled, for example, R2(C=.47 uF); essentially, Columns B and beyond reference the available (standard) capacitor values.
I wish to highlight the values I discussed in the first paragraph so I can quickly scan the workbook for the best possible value of R2. Then I can quickly determine the values of R1 and C to complete my task and minimize the tolerance for the given op-amp application.
I have some C++ programming knowledge and enough experience with Excel that I should be able to understand where and how to do what I wish to do, but I would like some advice and direction from a more experienced Excel user.
***UPDATE***
Since my first post, I've done some research. It seems like the easiest approach would be to apply a "closest to" filter. I would have attached a screenshot of a small portion of my workbook, which contains the equation for the "closest to" filter, a partial range of available resistor values, and the results of my filter, but I'm unable to post images yet. I have multiple tabs in my workbook.
This is my equation: =INDEX(A3:BZ26,MATCH(MIN(ABS(A3:BZ26-CB3)),ABS(A3:BZ26-CB3),0))
The equation format is: =INDEX(rng,MATCH(MIN(ABS(rng-value)),ABS(rng-value),0))
My formula seems to be correct but it returns "#VALUE!".
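For reference, here is the "closest to" logic that the formula pattern expresses, sketched in Python (rng and value mirror the placeholders in the format above):

```python
import numpy as np

# The intent of INDEX(rng, MATCH(MIN(ABS(rng-value)), ABS(rng-value), 0)):
# return the element of rng nearest to value.
def closest(rng, value):
    rng = np.asarray(rng, dtype=float)
    return rng[np.argmin(np.abs(rng - value))]

# Example with illustrative values: closest([1500, 1592.5, 1600], 1592.0) -> 1592.5
```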