I have a dataset like the one shown below; in the real scenario it will have between 10,000 and 1,000,000 rows.
There will be more columns, but the core problem revolves around these two fields.
Known Labels
I have the known categories 'Apple', 'Blueberry', 'Orange', 'Lettuce'.
Dataset
import pandas as pd

# Sample data: the first four rows carry known categories, the rest do not.
df = pd.DataFrame({
    'ROWID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Category': ['Apple', 'Blueberry', 'Orange', 'Lettuce', 'Fruit', 'Salad',
                 'xyz', 'Fruit', 'Leaf', 'Avocado'],
    'Details': ['Eat one a day ,doctors keep away', 'Like it in a muffin',
                'Tastes yummy', 'Like it with salmon', 'Glass of a juice',
                'Ceser dressing on lettuce', 'Nothing in my basket',
                'Like it in a muffin', 'I like it it with salami',
                'Comes from Mexico']})
Problem:
I have to create one or many metrics using groupby on Category.
When the Category column holds an unknown value, I need to read the text in 'Details' and predict the best-suited label for the category.
For example:
Salad -> Lettuce
Fruit (Row#5) -> Orange
Fruit (Row#8) -> Blueberry
Leaf (Row#9) -> Lettuce
It is understood that some of the rows cannot be categorized.
Help Needed:
I am new to data science algorithms and am looking for guidance on choosing the right model for this problem.
Use Naive Bayes on the Details column. Before that, do a simple filter on the Category column and set aside the rows that already have a known category, so that only the remaining rows need a prediction.
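A minimal sketch of that suggestion, assuming scikit-learn is available and that the rows with a known category serve as the training set. The bag-of-words features (CountVectorizer), the pipeline, and the 'Predicted' column name are my assumptions, not part of the answer. With only four labelled rows, as in the sample, the predictions are mostly illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

KNOWN = ['Apple', 'Blueberry', 'Orange', 'Lettuce']

df = pd.DataFrame({
    'ROWID': list(range(1, 11)),
    'Category': ['Apple', 'Blueberry', 'Orange', 'Lettuce', 'Fruit', 'Salad',
                 'xyz', 'Fruit', 'Leaf', 'Avocado'],
    'Details': ['Eat one a day ,doctors keep away', 'Like it in a muffin',
                'Tastes yummy', 'Like it with salmon', 'Glass of a juice',
                'Ceser dressing on lettuce', 'Nothing in my basket',
                'Like it in a muffin', 'I like it it with salami',
                'Comes from Mexico'],
})

# Split rows into those with a known label (training) and the rest (to predict).
known_mask = df['Category'].isin(KNOWN)

# Word counts from Details feed a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(df.loc[known_mask, 'Details'], df.loc[known_mask, 'Category'])

# Keep known labels as they are and fill in a prediction for the others.
df['Predicted'] = df['Category']
df.loc[~known_mask, 'Predicted'] = model.predict(df.loc[~known_mask, 'Details'])

print(df[['ROWID', 'Category', 'Predicted']])
```

Note that this classifier always returns one of the known labels, even for rows such as 'xyz' that may not be categorizable. One option is to inspect model.predict_proba and leave a row unlabelled when the highest probability falls below a threshold you choose.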
Related
I am working on a little data mining project (I am still a Data Science student, not a professional). Maybe you can help me to choose a proper model for my task.
So, let's say we have a table with three columns and around 4000 rows:
YEAR  COLOR   NAME
1900  Green   David
1901  Yellow  Sarah
1902  Green   ???
1902  Red     Sarah
…     …       …
2020  Purple  John
Any value for any field can be repeated in the dataset (including Year values).
In the first two columns we don't have missing values, but we only have around 20% of the Name values in the third column. The Name value depends somewhat on the first two columns (not a causal relation).
My goal is to extrapolate the available Name values to the whole table and get a range of occurrences for each Name value (for example, in a boxplot).
I have imagined a process like the following, although I am not sure whether it makes sense statistically (any objections and suggestions are appreciated):
For every unknown NAME value, the algorithm randomly chooses one of the already known NAME values. The odds of a particular NAME value being chosen depend on the variables YEAR and COLOR. For instance, if 'David' values tend to be correlated with low Year values AND with 'Green' or 'Purple' values for Color, the algorithm gives 'David' a higher probability of being chosen when the input values for Year and Color are "1900, Purple".
When the above process ends, the number of occurrences for each name is counted.
The above process is applied 30 times and the results for each name are displayed in a boxplot.
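The three steps above could be sketched roughly as follows. This is only one possible reading: the logistic regression used to estimate P(NAME | YEAR, COLOR) is my assumption (any classifier that exposes predict_proba would do), and the table is made-up random data standing in for the real one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)

# Made-up toy table (illustrative values only); about 20% of names are known.
n = 400
df = pd.DataFrame({
    "YEAR": rng.integers(1900, 2021, size=n),
    "COLOR": rng.choice(["Green", "Yellow", "Red", "Purple"], size=n),
})
names = pd.Series(rng.choice(["David", "Sarah", "John"], size=n))
df["NAME"] = names.where(rng.random(n) < 0.2)  # NaN marks an unknown name

known = df[df["NAME"].notna()]
unknown = df[df["NAME"].isna()]

# Estimate P(NAME | YEAR, COLOR) from the rows whose name is known.
features = make_column_transformer(
    (StandardScaler(), ["YEAR"]),
    (OneHotEncoder(handle_unknown="ignore"), ["COLOR"]),
)
model = make_pipeline(features, LogisticRegression(max_iter=1000))
model.fit(known[["YEAR", "COLOR"]], known["NAME"])
proba = model.predict_proba(unknown[["YEAR", "COLOR"]])
classes = model.classes_

# Repeat 30 times: draw a name for every unknown row, then count occurrences.
known_counts = known["NAME"].value_counts()
runs = []
for _ in range(30):
    drawn = [rng.choice(classes, p=p) for p in proba]
    counts = pd.Series(drawn).value_counts().add(known_counts, fill_value=0)
    runs.append(counts)
results = pd.DataFrame(runs).fillna(0).reset_index(drop=True)

# results has one row per run and one column per name;
# results.plot.box() would draw one box per name (requires matplotlib).
print(results.describe())
```

Drawing from the predicted probabilities, rather than always taking the most likely name, is what produces the spread between the 30 runs.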
However, I don't know which model is best for implementing an idea like this. I have drawn the process in a simple Paint drawing:
Possible output for the task
Which approach do you think would be good for this task? I appreciate any help.
I think you have the process down; converting the data may be the first hurdle.
I would look at using OrdinalEncoder (from sklearn.preprocessing import OrdinalEncoder) to convert the data from categorical to numeric.
You could then use a random number generator to produce a number within the range defined by the encoding, which would randomly select a name.
Loop through this 30 times with a for loop to achieve the result.
It also looks like you will need to provide the ranking values for year and colour before building out your code. From there you would just provide bands within your for loop (for example, if year > 1985) to specify the names.
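As a rough illustration of the encoding step, the sketch below fits OrdinalEncoder on the four fully known rows from the question's table and then picks one encoded name at random. The uniform draw is my simplification; it ignores YEAR and COLOR, so the bands mentioned above would still have to be turned into weights.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The four rows from the question's table that have a known NAME.
known = pd.DataFrame({
    "YEAR": [1900, 1901, 1902, 2020],
    "COLOR": ["Green", "Yellow", "Red", "Purple"],
    "NAME": ["David", "Sarah", "Sarah", "John"],
})

# Each categorical column is mapped to integer codes (categories are sorted).
encoder = OrdinalEncoder()
codes = encoder.fit_transform(known[["COLOR", "NAME"]])
name_categories = encoder.categories_[1]

# Draw a random code within the encoded range and map it back to a name.
rng = np.random.default_rng(0)
picked_code = rng.integers(0, len(name_categories))
picked_name = name_categories[picked_code]
print(codes, picked_name)
```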
I'm trying to correlate population data with human rights scores, but I've looked at many articles and did not find the answer I was after.
This is the dataset I'm working with, and I will need a correlation value for each row.
Everybody, I did a bit of research and added a few more features to my dataset; now everything works as intended.
Data and description of variables
Picture 1 and Sample unbalanced paneldata
Picture 1 shows a balanced panel dataset that I created from the unbalanced one provided as a sample in the same image, where I had multiple products (ID) over different numbers of years (YEAR). For each product, a different number of shops offered the given product (ID). As stated, this balanced set was created by keeping only the same years, same products (ID), and same shops (marked by the orange area in the sample unbalanced panel data). This is an important assumption that might affect how the issue stated below is perceived. The following is a description of the table shown in Picture 1:
Years indicates the number of periods a given product (ID) lasts.
Shop 1, Shop 2, Shop 3 indicate the different prices for a given product (ID) charged by different firms.
The minimum and second minimum values show which shops have the lowest and second-lowest price for a given year and product (ID). These are needed to calculate the Price difference, which is (Second minimum value - Minimum value) / (Minimum value).
An example of this is given for row 5 (Year 01.01.1995 - ID 101), where the Price difference would be (3999 - 3790) / 3790 = 5.51% (in Picture 1).
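For reference, the Price difference formula can be checked with a few lines of Python. The two lowest prices come from the row 5 example; the third shop's price is a hypothetical placeholder, since its value is not stated in the text.

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({
    "Shop 1": [3790.0],
    "Shop 2": [3999.0],
    "Shop 3": [4200.0],  # hypothetical value, not taken from the question
})

# Sort each row so the two lowest prices sit in the first two positions;
# np.sort places missing prices (NaN) last, which suits an unbalanced panel.
sorted_prices = np.sort(prices.to_numpy(), axis=1)
minimum = sorted_prices[:, 0]
second_minimum = sorted_prices[:, 1]

price_difference = (second_minimum - minimum) / minimum
print(round(price_difference[0] * 100, 2))  # percentage for the example row
```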
Issue
In my balanced panel data (Picture 1), I want to run a fixed-effects regression in Stata using the xtreg command, where the dependent variable is the Price difference and the numbers of shops selling a product are the independent variables. This is so I can say how the Price difference, as the dependent variable, is affected when one shop, two shops, or three shops are selling.
Another question is whether my assumption in creating a balanced panel is valid at all. Is it correct to create a balanced panel from the unbalanced panel data, or must I use the unbalanced panel to create such a variable?
So my main issue is how to create independent variables that measure the number of shops offering a product. To clarify what I mean, I have included an example of a fixed-effects regression that may explain the structure I am after, in Picture 2 below:
NOTE: In Picture 2, the expected cell mean on the right is the same as the Price difference in Picture 1 and is used as the dependent variable. It is regressed on the number of firms/shops as independent variables, and these are what I have trouble creating.
Picture 2
What I have tried
I have tried using dummy variables for the shops, but they ended up getting dropped. The dataset in Picture 1 is balanced, as mentioned, which (I assume) is needed to run a fixed-effects regression on panel data.
End remark
I asked this question earlier in a much less precise manner, and I apologise for any inconvenience. The problem, I think, might be that I have set the data up wrongly in Excel, so that the dummies are dropped, or something of that nature. It might also be that I have to use the unbalanced set to create this independent variable, so the problem could be that I am trying to use a balanced set instead of the unbalanced one.
In your unbalanced sample (as we discussed in the comments, the balanced sample will not make sense), we first need to create a variable for the number of shops offering each ID. Let us say we have the same data as in the top portion of your Picture 1:
egen number_of_firms = rownonmiss(Shop*)
xtset ID year // to use xtreg, we must tell Stata the data are panel
xtreg Price_difference i.number_of_firms, fe // fe requests fixed effects; xtreg defaults to random effects
The xtreg is the regression shown in your Picture 2.
If you want the number of firms variable to be formatted a bit more like Picture 2, you can do something like this:
qui levelsof number_of_firms, local(num)
foreach n in `num' {
local lab_def `lab_def' `n' "`n' Firms"
}
label def num_firms `lab_def'
label values number_of_firms num_firms
label var number_of_firms "Number of Firms"
Then run the regression, and the output will be formatted with the number-of-firms labels.
Right now I have three columns of data that I need to put on a graph. It's about the scores of different countries on unrelated evaluations. There is a column for the year of the evaluation, one for the name of the country, and one for the score it got.
Since there are hundreds of them, it would take a lot of time to add the data series individually, so I was wondering whether there is a way to just select the columns and have Excel identify each series automatically.
Illustrating:
Supposing I had this table:
And wanted to create a graph like this:
Is there a way to do this easily?
Plot a PivotChart of the Line type: Year for ROWS, Country for COLUMNS, and Sum of Score for VALUES.
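For readers who want to see the reshaping that the PivotChart performs, here is a rough pandas equivalent. It is a sketch outside Excel, and the countries and scores are invented placeholders.

```python
import pandas as pd

# Long-format table: one row per (year, country) evaluation. Made-up values.
scores = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001],
    "Country": ["A", "B", "A", "B"],
    "Score": [1.0, 2.0, 3.0, 4.0],
})

# Years become rows, countries become columns, scores are summed per cell,
# which mirrors ROWS / COLUMNS / VALUES in the PivotChart field list.
wide = scores.pivot_table(index="Year", columns="Country",
                          values="Score", aggfunc="sum")
print(wide)

# wide.plot() would then draw one line per country (requires matplotlib).
```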
I have this dashboard. In sheet two, the bar chart is just counting the number of marks in each range specified on the x-axis. What I actually want is a bar chart over the same ranges, but it should count the average mark of each student. In sheet 3, the bar chart looks similar to what I expect, but if you take a look, it is just stacking each student's average one above another.
So, how can I make a bar chart with the frequency of the students' average marks? The ranges should be: [0, 5), [5, 10), [10, 15), [15, 20].
One solution is to create a custom SQL data connection to first calculate the average NOTA for each student, as below:
select NOMBRES, avg(NOTA) as avg_nota from YOUR_TABLE group by NOMBRES
Then you can create a histogram for avg_nota, either with Show Me or manually.
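The same two steps (average per student, then count the averages per range) can be sketched in pandas and NumPy to check the logic outside Tableau. The student names and marks are invented. np.histogram is used because its last bin is closed on the right, which matches the requested [15, 20] range.

```python
import numpy as np
import pandas as pd

# Made-up marks: several NOTA rows per student (NOMBRES).
marks = pd.DataFrame({
    "NOMBRES": ["Ana", "Ana", "Luis", "Luis", "Eva", "Eva"],
    "NOTA": [12, 16, 4, 5, 20, 20],
})

# Step 1: average mark per student (what the SQL query returns).
avg_nota = marks.groupby("NOMBRES")["NOTA"].mean()

# Step 2: count students per range [0,5), [5,10), [10,15), [15,20].
counts, edges = np.histogram(avg_nota, bins=[0, 5, 10, 15, 20])
print(avg_nota)
print(counts)
```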
Here is a link to an example based on your original
The SQL above weighs each score equally, which is fine if each course has exactly the same number of grades. But if the number of records varies between courses, you should adjust the approach to make sure each course is weighted the same (e.g. so that a course with 10 small tests does not get weighted twice as much as a course with 5 larger tests). The solution in that case might involve repeating the above step in a nested subquery or view, grouping by both NOMBRES and CURSO. Still, this simple approach should give you the basic idea.
The solution above works, but I think there ought to be a way to get the same effect using table calculations without resorting to custom SQL.
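To make the weighting point concrete, here is a small pandas sketch with invented marks for one student. It compares the flat average with the two-level average, which groups by NOMBRES and CURSO first and then by NOMBRES.

```python
import pandas as pd

# One student with four Math marks and a single Art mark (made-up values).
marks = pd.DataFrame({
    "NOMBRES": ["Ana"] * 5,
    "CURSO": ["Math"] * 4 + ["Art"],
    "NOTA": [10, 10, 10, 10, 20],
})

# Flat average: every record counts equally, so Math dominates.
flat_avg = marks.groupby("NOMBRES")["NOTA"].mean()

# Two-level average: average within each course, then across courses.
per_course = marks.groupby(["NOMBRES", "CURSO"])["NOTA"].mean()
weighted_avg = per_course.groupby(level="NOMBRES").mean()

print(flat_avg["Ana"], weighted_avg["Ana"])
```

Here the flat average is pulled toward Math because it has more records, while the two-level average treats both courses equally.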