This is a table containing a statistical summary of machined parts that were manufactured and measured.
The PIDs are different classes of parts, so a single PID such as 123456 can have hundreds of parts under it. Each machined part has four attributes: pitch diameter, POW diameter, major diameter, and minor diameter. Unfortunately, the report is generated with the data in rows rather than in adjacent columns, which would have been easier to visualize later.
How can I parse/group these row values so that I can store each part's info in an object holding the PID, the measurements, and the date of manufacture? I want this sorted information available for visualizations later, and I would like to differentiate the parts by the time/date at which they were manufactured.
For PID 123456, I have two parts, and each part has four properties. So, for example, how could I draw a chart of the upper and lower values of the major or minor diameter across different parts (under the same PID)? Thank you.
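A minimal sketch of one way to do this grouping, assuming the report loads into pandas and using hypothetical column names (PID, PartID, ManufactureDate, Attribute, Value) that would need to be adjusted to the real report:

```python
import pandas as pd

# Hypothetical layout: one measurement per row. Adjust names to the real report.
df = pd.read_csv("report.csv")  # columns: PID, PartID, ManufactureDate, Attribute, Value

# Pivot the row-oriented measurements into one column per attribute,
# so each output row is one part with its four diameters.
parts = df.pivot_table(
    index=["PID", "PartID", "ManufactureDate"],
    columns="Attribute",
    values="Value",
).reset_index()

# Each row of `parts` now acts as the desired part object, e.g.:
# parts[parts["PID"] == 123456][["ManufactureDate", "Major Diameter"]]
print(parts.head())
```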
I am working on a little data mining project (I am still a data science student, not a professional). Maybe you can help me choose a proper model for my task.
So, let's say we have a table with three columns and around 4000 rows:
YEAR    COLOR    NAME
1900    Green    David
1901    Yellow   Sarah
1902    Green    ???
1902    Red      Sarah
…       …        …
2020    Purple   John
Any value in any field can be repeated in the dataset (including Year values).
The first two columns have no missing values, but only around 20% of the rows have a Name value in the third column. The Name value depends somewhat on the first two columns (not a causal relation).
My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each name (for example, in a boxplot).
I have imagined a process like this, although I am not sure whether it makes sense statistically (any objections and suggestions are appreciated):
For every unknown NAME value, the algorithm randomly chooses one of the already-known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' tends to be correlated with low Year values and with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen when the input values for Year and Color are 1900 and 'Purple'.
When the above process ends, the number of occurrences of each name is counted.
The above process is applied 30 times, and the results for each name are displayed in a boxplot.
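For concreteness, here is a minimal sketch of the sampling step I have in mind (assuming Python with pandas and NumPy; the file name and the exact weighting scheme are placeholders, not a settled design):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng()
df = pd.read_csv("data.csv")  # hypothetical file with YEAR, COLOR, NAME columns

def sample_names(df):
    """Fill each missing NAME by sampling a known NAME, weighted so that
    rows with the same COLOR and a nearby YEAR are more likely donors."""
    filled = df.copy()
    known = df.dropna(subset=["NAME"])
    for i in filled.index[filled["NAME"].isna()]:
        year, color = filled.loc[i, "YEAR"], filled.loc[i, "COLOR"]
        weights = ((known["COLOR"] == color) + 0.1) / (1 + (known["YEAR"] - year).abs())
        filled.loc[i, "NAME"] = rng.choice(known["NAME"], p=weights / weights.sum())
    return filled

# Repeat the imputation 30 times and show the spread of counts per name.
counts = pd.DataFrame([sample_names(df)["NAME"].value_counts() for _ in range(30)])
counts.boxplot()
plt.show()
```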
However, I don't know which model would be best for implementing an idea like this. I have drawn the process in a simple Paint drawing:
Possible output for the task
What do you think would be a good approach to this task? I appreciate any help.
I think you have the process down; it's converting the data that may be the first hurdle.
I would look at using OrdinalEncoder from sklearn.preprocessing to convert the data from categorical to numeric.
You could then use a random number generator to produce a number within the range defined by the encoding, which would randomly select a name.
Loop through this 30 times with a for loop to achieve the result.
It also looks like you will need to provide the ranking values for year and colour before building out your code. From there you would just provide bands, for example if year > 1985, etc., within your for loop to select the names.
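As a rough sketch of the encoding and random-selection steps (toy data rather than the real table; the uniform random pick shown here would still need the year/colour banding described above):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the COLOR and NAME columns.
X = np.array([["Green", "David"], ["Yellow", "Sarah"], ["Red", "Sarah"]])

enc = OrdinalEncoder()
X_encoded = enc.fit_transform(X)  # categories become 0.0, 1.0, 2.0, ...

# A random integer within the range defined by the encoding selects a name;
# enc.categories_[1] holds the NAME categories for decoding.
rng = np.random.default_rng()
names = enc.categories_[1]
print(names[rng.integers(len(names))])
```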
Sorry about the title; it was hard to figure out how to word this. My main problem is that a Total Count column represents the overall quantity, but the other columns are each part of the Total Count column. This wouldn't be so bad, except the other columns may overlap with each other too. For example, for 11/26/18 the Total Count column covers all of the Item Count, but some of those items may be in the New PR Count too, though not necessarily all of them. The same goes for the Dropped Outside LT and Dropped Late columns: they are all part of the Total Count, but not necessarily disjoint from one another.
I feel nothing short of a bunch of complex formulae or macros will fix this, so what is my best option to show at least the individual counts? I was thinking of having the Total Count column as a Clustered Column chart type, the other columns as Stacked Columns, and the two lines as they are. Or all columns could be Clustered Columns with the lines as they are. What do you all think would best show this data?
So sadly, due to proprietary restrictions, I am told at work not to upload an image of the chart. FUN. So here is what I have:
7 series: 5 series are Stacked Columns on the primary axis, and 2 series are Lines with Markers on the secondary axis. Each column is a single week's entry of data (so based around a Date entry).
The two Lines with Markers are percentages (Secondary Vertical (Value) Axis) on the right side of the chart. On the left side is the Total Count (Vertical (Value) Axis), basically showing a count connected to the Total Count column.
I have a small amount of CSV data counting connections between different countries, with only three columns, e.g.:
I can use this (about 100 rows) to create a nice network visualization in Gephi, where node sizes can be scaled by degree.
However, ideally I'd like the edges to be weighted in size/thickness based on how many connections there are; e.g., in the image above, the UK and USA are connected twice, so their edge thickness would be twice that of Greece and Nepal's connecting edge.
Is there any way to generate these weighting values automatically, either in Gephi or in Excel?
The one problem is that the countries are not in a standard order between source and target (e.g., the USA and UK in the example above appear in different orders, with the UK in the source column for one connection and the USA in the source column for the other).
Basically, I just need a way to automatically count the node connections to produce a value for each edge's popularity/occurrence. Thanks.
Well, I did this with two helper columns using MATCH().
So, edited based on the comments, using COUNTIF() to count multiple occurrences:
=COUNTIF(F$3:F$12,"="&CONCATENATE(B3,C3))+COUNTIF(F$3:F$12,"="&CONCATENATE(C3,B3))
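If you'd rather compute the weights outside Excel, a minimal pandas sketch under the assumption that the CSV has source and target columns (hypothetical names) looks like this; sorting each pair makes the count order-insensitive:

```python
import pandas as pd

df = pd.read_csv("connections.csv")  # hypothetical file with source/target columns

# Sort each pair alphabetically so (UK, USA) and (USA, UK) count as the same edge.
pairs = pd.DataFrame(
    df[["source", "target"]].apply(sorted, axis=1).tolist(),
    columns=["source", "target"],
)

# One row per unique pair, with the number of occurrences as the weight.
edges = pairs.value_counts().reset_index(name="Weight")
edges.to_csv("edges_weighted.csv", index=False)
```

Gephi's spreadsheet import can read a Weight column on an edge list, so edge thickness can then be mapped to it.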
I made a little test machine that accidentally created a 'big' data set:
6 columns with roughly 550,000 rows.
The end result I am looking for is a graph with 6 lines, with measurements 1 to 550,000 on the horizontal axis and the row values on the vertical axis (capped at 200 or so). The data is a resistance measurement that should be between 0 and 30, or very big (broken); the software writes 'inf' in those cases.
My skills are limited to Excel, so here is what I have done so far:
I imported the data into Excel. The measurements are valid between 0 and 30, and inf is not good for a graph, so I applied: if(cell > 200) {200} else {keep cell value}.
Making a graph is now a slow exercise, and Excel does not like it; the result is not good.
So I would like to take the average of every 60 measurements to reduce the data to fewer than 10,000 rows, e.g. =AVERAGE(H1:H60).
But I cannot get this to work.
Questions:
How do I reduce this data set and get a good graph?
Should I switch to other software that is more applicable?
FYI: I have already changed the software of the testing device to take the average of a batch of measurements next time... but I cannot repeat this test.
Download link of the data set (comma-separated file, 17 MB)
I think you are on the right track; my guess is that you want to take an average every 60 rows and are unsure how to do this.
Using MOD(number, divisor) inside an IF statement lets you specify that the average should be calculated only once every x cells.
Assuming you'll have one row above your data table for headers, you are looking for something along the lines of:
=IF(MOD(ROW(A61),60) = 1,AVERAGE(H2:H61),"")
Once you have this you can filter your average column to non-blank values and use this to create your graph.
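If Excel keeps struggling, a minimal Python/pandas sketch of the same reduction (the file name and column handling are assumptions) would be:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("measurements.csv")  # 6 columns, ~550,000 rows; 'inf' parses as infinity

# Mirror the Excel IF: cap everything (including inf) at 200.
df = df.clip(upper=200)

# Average each consecutive block of 60 rows: ~550,000 rows -> ~9,167 rows.
reduced = df.groupby(df.index // 60).mean()

# One line per column, at a size that charts quickly.
reduced.plot()
plt.show()
```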
I want to use the RAND() function in Excel to generate a random number between 0 and 1.
However, I would like 80% of the values to fall between 0 and 0.2, 90% of the values to fall between 0 and 0.3, 95% of the values to fall between 0 and 0.5, etc.
This reminds me that I took an applied statistics course once upon a time, but not of what was actually in the course...
What is the best way to go about achieving this result using an Excel formula? Alternatively, what is this kind of statistical calculation called, and are there any other pointers I can Google?
=================
Use case:
I have a single column of meter readings, which I would like to duplicate 7 times (one column per new month). Each column has 55,000 rows. While the meter readings need to vary from month to month, when taken as a time series each meter should have 7 realistic readings.
The aim is to produce realistic data to turn into heat maps (i.e., to flag outlying meter readings).
I don't think there is a single formula that fits your requirements exactly. I would use a very straightforward solution:
Generate 80% of data using =RANDBETWEEN(0,20)/100
Generate 10% of data using =RANDBETWEEN(20,30)/100
Generate 5% of data using =RANDBETWEEN(30,50)/100
and so on
You can easily change the precision of the generated data by modifying the parameters; for example, =RANDBETWEEN(0,2000)/10000 will generate data with up to 4 digits after the decimal point.
UPDATE
Use a normal distribution for the use case, for example:
=NORMINV(RAND(), 20, 5)
where 20 is the mean and 5 is the standard deviation.
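For comparison, what the question describes is essentially sampling from a custom (piecewise) distribution. A minimal NumPy sketch of both suggestions, where the last band above 0.5 is an assumption since the original spec stops at 0.5:

```python
import numpy as np

rng = np.random.default_rng()

# Piecewise-uniform: 80% in [0, 0.2), 10% in [0.2, 0.3), 5% in [0.3, 0.5),
# and the remaining 5% in [0.5, 1.0) (assumed band).
bands = [(0.0, 0.2), (0.2, 0.3), (0.3, 0.5), (0.5, 1.0)]
probs = [0.80, 0.10, 0.05, 0.05]
picks = rng.choice(len(bands), size=55_000, p=probs)
values = np.array([rng.uniform(lo, hi) for lo, hi in (bands[i] for i in picks)])

# Normal-distribution version of =NORMINV(RAND(), 20, 5):
readings = rng.normal(loc=20, scale=5, size=55_000)
```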