Decision Tree status column & related numerical value column - scikit-learn

I have a dataset with two columns: one is categorical and shows the status of the feature, and the other is numerical and shows the related value. Just like below:
I want to run a decision tree algorithm via scikit-learn on this data. I am not sure how to deal with these two columns, because conceptually I cannot figure out how to relate these two highly correlated features. Normally we are not supposed to leave null data; however, the numerical column is supposed to be null by nature here. If we make it "0", it takes on another meaning.
So, how should I pre-process this data to have the decision tree algorithm work properly?

My professor provided a reasonable answer, as below.
First, fill the null cells with "0".
If you plug the data into decision tree algorithms with these two features, we have two cases:
If "Status" comes first:
The tree will split 0's and 1's into two branches. Under 0, all Amount values will be already 0, hence this feature will not be chosen. Under 1, there will not be any 0 Status.
If "Amount" comes first: All Status 0 rows will go under a single branch, together with the rows that have very small amounts.
So, if the Amount data is noisy, it might be helpful to keep the Status column. Otherwise, I would remove the Status column.
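The reasoning above can be checked on a toy example. This is only a sketch in plain Python, not scikit-learn itself: the rows, the helper names `fill_nulls` and `split_by_threshold`, and the thresholds are all made up for illustration. It mimics a single tree split and shows that, after filling nulls with 0, a split on Amount can separate exactly the same rows as a split on Status, unless some real amounts are very small.

```python
# Hypothetical toy rows: (status, amount); amount is None when status is 0.
rows = [(1, 250.0), (0, None), (1, 3.5), (0, None), (1, 1200.0)]


def fill_nulls(rows):
    """Replace missing amounts with 0, as the answer suggests."""
    return [(status, 0.0 if amount is None else amount) for status, amount in rows]


def split_by_threshold(rows, column, threshold):
    """Mimic one decision-tree split: rows with value <= threshold go left."""
    left = {i for i, row in enumerate(rows) if row[column] <= threshold}
    right = set(range(len(rows))) - left
    return left, right


filled = fill_nulls(rows)

# Splitting on Status (column 0) and on Amount (column 1) gives the same partition.
status_split = split_by_threshold(filled, column=0, threshold=0.5)
amount_split = split_by_threshold(filled, column=1, threshold=1.0)
print(status_split == amount_split)

# With a very small real amount, the Amount split mixes it with the zeros.
noisy = filled + [(1, 0.4)]
noisy_status_split = split_by_threshold(noisy, column=0, threshold=0.5)
noisy_amount_split = split_by_threshold(noisy, column=1, threshold=1.0)
print(noisy_status_split == noisy_amount_split)
```

In the first case the Status column adds nothing the tree cannot already get from Amount; in the second case only Status separates the Status 1 row with a tiny amount from the true zeros.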

Related

How to split a Pandas dataframe into multiple csvs according to when the value of a column changes

So, I have a dataframe with 3D point cloud data (X,Y,Z,Color):
dataframe sample
Basically, I need to group the data according to the color column (which takes values of 0,0.5 and 1). However, I don't need an overall grouping (this is easy). I need it to create new dataframes every time the value changes. That is, I'd like a new dataframe for every set of rows that are followed by and preceded by 5 zeros (because single zeros are sometimes erroneously present in chunks of data that I'm interested in).
Basically, the zero values (black) are meaningless for me; I'm only interested in the 0.5 (red) and 1 values (green). What I want to accomplish is to segment the original point cloud into smaller clusters that I can then visualize. I hope this is clear. I can't seem to find answers to my question anywhere.
First of all, you should understand the for loop well. Python makes it easy to use code from any library inside functions and loops. Let's say you have a dataset and you want to walk through it and check column a. Note that iterating over a pandas DataFrame directly yields its column labels, so start the loop over rows with "for _, row in dataset.iterrows():". On the next line, specify the criterion you want, for example "if row[a] > 0.5:". Whenever the value is greater than 0.5, you can then write the code that adds all the data of the current row to a new dataset. I have deliberately not written ready-made code, so that you can practice writing it yourself.
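As a rough illustration of the loop idea, here is a sketch that works on plain tuples rather than a DataFrame. It assumes each row is (X, Y, Z, Color) and that a run of at least 5 consecutive zero-colored rows separates two clusters, while shorter zero runs stay inside the cluster; the function name `split_on_zero_runs` and the sample rows are hypothetical.

```python
from itertools import groupby


def split_on_zero_runs(rows, color_index=3, min_zeros=5):
    """Split rows into segments separated by long runs of zero-colored rows.

    Runs of fewer than `min_zeros` zeros are treated as noise and kept
    inside the current segment.
    """
    segments, current = [], []
    for is_zero, run in groupby(rows, key=lambda row: row[color_index] == 0):
        run = list(run)
        if is_zero and len(run) >= min_zeros:
            # A long run of zeros closes the current segment.
            if current:
                segments.append(current)
            current = []
        else:
            current.extend(run)
    if current:
        segments.append(current)
    return segments


# Hypothetical sample: 3 green rows, 1 stray zero, 2 green rows,
# then 5 zeros (a separator), then 2 red rows.
sample = (
    [(0.0, 0.0, 0.0, 1)] * 3
    + [(0.0, 0.0, 0.0, 0)]
    + [(0.0, 0.0, 0.0, 1)] * 2
    + [(0.0, 0.0, 0.0, 0)] * 5
    + [(0.0, 0.0, 0.0, 0.5)] * 2
)
segments = split_on_zero_runs(sample)
print([len(segment) for segment in segments])  # [6, 2]
```

Each segment could then be written to its own file, for example with the standard `csv` module or, if the rows are turned back into a DataFrame, with `DataFrame.to_csv`.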

Deal with Ties when Using Index/Match

I'm currently pulling the top (5) number of numerical values from one sheet and inputting them into a different sheet. Each number is within its own column and there is a name matching that column, EX:
Ties are common in the data I'm working with, so they nearly break my formulas.
For getting the name:
=INDEX('Total Cases by Categories'!$B$18:$B$50, MATCH(LARGE('Total Cases by Categories'!$H$18:$H$50, A39),'Total Cases by Categories'!$H$18:$H$50, 0))
For getting the numerical value associated with the name:
=LARGE('Total Cases by Categories'!$H$18:$H, A39)
And so, when there are 2 people with the same numerical value associated within a category, then that person appears twice, I assume because of their position within the sheet.
So something like this happens:
So in the event of a tie, I would want to list both names that have the same amount of points instead of the first name that shows up with the duplicated value.
Any help would be appreciated!
Actually, LARGE will return both of the tied values. It's MATCH that can't look beyond the first match. To the best of my knowledge there is no way around that, short of avoiding MATCH altogether, which is difficult. Therefore the solution is to have no ties.
This is achieved with helper columns that contain no identical numbers. This can be achieved by adding an insignificant decimal. Since you are dealing with integers, adding 0.1 would be insignificant for your purposes but 13.1 is different from 13.2. If you need to extract the "real" number from this use INT(13.2).
Using the row number to generate an insignificant decimal is popular for this purpose. In row 1 ROW()/10 will return 0.1. But in row 10 ROW()/10 will return 1.0 which isn't an insignificant number anymore. Therefore you have to work with ROW()/100 or an even larger divisor, depending upon how many rows you have. Try ROW()/10^6 - any decimal will do the tie-breaking job.
You may not like that using ROW() will list tied participants in the order in which they appear in the worksheet. The differentiating decimals can be created by any other means that doesn't create ties in itself.
Normally, the helper columns with the decimals added will be hidden. They contain a formula like =D23 + (ROW()/10000), which manages itself. You can then list all participants in order by pointing both LARGE and MATCH at the helper column. MATCH with an exact match will not find a value taken from the original column in the helper column, so make sure both functions refer to the helper column, and use INT on the result if you need to display the original number.
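The helper-column idea can be sketched outside the spreadsheet as well. The following Python snippet only mirrors the logic of the formulas; the names, points, and starting row number are invented for illustration, and `10**6` plays the role of the divisor in ROW()/10^6.

```python
# Hypothetical data: one name and one integer score per worksheet row.
names = ["Ann", "Bob", "Cid", "Dee"]
points = [13, 20, 13, 7]
first_row = 18  # assumed worksheet row of the first entry

# Helper column: score plus an insignificant, row-based decimal (no ties left).
helper = [p + (first_row + i) / 10**6 for i, p in enumerate(points)]


def top_n(n):
    """Return the n highest (name, score) pairs, resolving ties via the helper."""
    result = []
    for k in range(1, n + 1):
        kth = sorted(helper, reverse=True)[k - 1]  # like LARGE(helper, k)
        position = helper.index(kth)               # like MATCH(kth, helper, 0)
        result.append((names[position], int(kth)))  # like INDEX(...) and INT(...)
    return result


top_three = top_n(3)
print(top_three)  # [('Bob', 20), ('Cid', 13), ('Ann', 13)]
```

Both tied names now appear. Because the decimal grows with the row number, the entry lower in the sheet wins the tie; subtracting the decimal instead would reverse that order.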

Spotfire DenseRank by category, do I use OVER?

I'm trying to rank some data in spotfire, and I'm having a bit of trouble writing a formula to calculate it. Here's a breakdown of what I am working with.
Group: the test group
SNP: what SNP I am looking at
Count: how many counts I get for the specific SNP
What I'd like to do is rank the average # of counts that are present for each SNP, within the group. Thus, I could then see, within a group, which SNP ranks #1, #2, etc.
Thanks!
TL;DR Disclaimer: You can do this, though if you are changing your cross table frequently, it may become a giant hassle. Make sure to double-check that logic is what you'd expect after any modification. Proceed with caution.
The basis of the Custom Expression you seem to be looking for is as follows:
Max(DenseRank(Count() OVER (Intersect([Group],[SNP])),"desc",[Group]))
This gives the total count of rows instead of the average; I was uncertain if "Count" was supposed to be a column or not. If you really do want to turn it into an average, make sure to adjust accordingly.
If all you have is the Group and the SNP nested on the left, you're done and good to go.
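For comparison, here is what a dense rank of the average count per SNP within each group looks like when computed by hand. This is a plain Python sketch, not Spotfire syntax; the rows and the function name `dense_rank_by_group` are hypothetical, and it assumes "Count" is a numeric column.

```python
from collections import defaultdict

# Hypothetical rows: (group, snp, count)
rows = [
    ("G1", "rs1", 10), ("G1", "rs1", 14),
    ("G1", "rs2", 12),
    ("G1", "rs3", 5),
    ("G2", "rs1", 3),
    ("G2", "rs2", 9),
]


def dense_rank_by_group(rows):
    """Rank SNPs within each group by average count, highest first, without gaps."""
    totals = defaultdict(lambda: [0, 0])  # (group, snp) -> [sum, number of rows]
    for group, snp, count in rows:
        totals[(group, snp)][0] += count
        totals[(group, snp)][1] += 1
    averages = {key: total / n for key, (total, n) in totals.items()}

    ranks = {}
    for group in {g for g, _ in averages}:
        # Distinct averages in this group, largest first; ties share one rank.
        distinct = sorted(
            {avg for (g, _), avg in averages.items() if g == group}, reverse=True
        )
        position = {avg: i + 1 for i, avg in enumerate(distinct)}
        for (g, snp), avg in averages.items():
            if g == group:
                ranks[(g, snp)] = position[avg]
    return ranks


ranks = dense_rank_by_group(rows)
print(ranks[("G1", "rs1")], ranks[("G1", "rs2")], ranks[("G1", "rs3")])  # 1 1 2
```

In G1, rs1 and rs2 both average 12 and share rank 1, and rs3 gets rank 2 rather than 3, which is what distinguishes a dense rank from an ordinary rank.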
First Issue: When you want to filter it down, it gives you the dense rank of only those in the filtered set. In some cases this is good, and what you're looking for; in others, it isn't. If you want it to hold fast to its value, regardless of filtering, you can use the same logic, but put it in a Calculated Column instead of in the custom expression. Then, in your CrossTable Aggregation, get the max of the Calculated Column value.
Calculated Column:
DenseRank(Count() OVER (Intersect([Group],[SNP])),"desc",[Group])
Second Issue: You want to pivot by something other than Group and SNP. Perhaps, for example, by date? If you throw the Date across the top, it's going to show the same numbers for every month -- the overall numbers. This is not particularly helpful.
To a certain extent, Spotfire's Custom Expressions can handle this modification. If you only ever switch between single columns across the top, you could use the following:
Max(DenseRank(Count() OVER (Intersect([${Axis.Columns.ShortDisplayName}],[Group],[SNP])),"desc",[Group],[${Axis.Columns.ShortDisplayName}]))
That would automatically pull in the column from the top, and show you the ranking for each individual process date.
However, if you start nesting, using hierarchies, renaming your columns, or having multiple aggregations and throwing (Column Names) across the top, you're going to have to pay a great deal of attention to your custom expression. You'll need to do some form of string replacement around the Axis.Columns property, or use the expression instead of Short Names, and get rid of nests, etc.
Any layer of complexity will require this sort of analysis, so if your end-users have access to modify the pivot table... honestly, I probably wouldn't give them this column.
Third Issue: I don't know if this is an issue, exactly, but you said "Average Counts" -- Average per day? Per Month? When averaging, you will need to decide if, for example, a month is the total number of days in month or the number of days that particular payor had data. However you decide to aggregate it, make sure you're doing it on the right level.
For the record, I liked the premise of this question; it's something I'd thought would be useful before, but never took the time to try to implement, since sorting a column or limiting a table to only show the top 10 values is much simpler.

Generating a unique identifier list with multiple tables & criteria

I'm not a coder, just someone who uses Excel for basic estimating functions at work. However, I've found myself in need of a complex list or index system.
Background/Intent: (Skip below if doesn't matter.) In an apartment building construction they build buildings like opened books - mirror images of 2 bed 2 bath apartments, for example. There is a standard "typical" unit and then the mirror image across the hall, the "reverse" unit. The door swings are all opposite from one to the other. My job is to figure out how to give each door a unique identifier code based on: Bldg No., Unit Type, Door No., Door Swing (left or right.) The raw data tables are provided below.
I've attempted to clean this up as much as possible, but there are two steps (I think) to this process.
Step 1:
The raw data table is on the left. My output field is on the right. I want to be able to use a drop-down box, like a data validation list, to select the building. Then a formula (which one?) spits out a list of every unit type per building. For example, Bldg 5 has 2 each of "A1 Typ." How do I get the formula to recognize that if there are 2 of them, it should produce 2 separate lines for "A1 Typ."? And so on and so forth until all 41 occurrences/units have been accounted for and labeled appropriately. Some occur once, some multiple times, and some zero.
Step 1
From there, Step 2.
I want to use this output field again to automate another sequence, this time pulling from a different table, see picture. Now, depending on the unit type under the "type" column, I want it to expand each unit type, showing each individual door number (1, 2, 3, etc. through 12) and whether it's an L (LH) or R (RH), and if there is more than one, to list out each occurrence. (What formula?)
Then the descriptor text that will pop up under the "DOOR LABEL" column would just be a joining of several fields to give a unique identifier. (Suggestions?)
Step 2.
Easy, right? Is this too much for Excel, or can this be done?
Thanks so much for considering helping me out!

Matrix functions to compute a correlation matrix in conjunction with an IF statement

I am trying to calculate several correlation matrices using matrix functions in Excel. I have no difficulty with a straightforward problem but when I want to compute three matrices based on three unique values of a variable I am not able to get the IF statement to work properly.
Specifically, I have three scenarios ("risk loving", "normal", "risk averse") coded in, say, B2:B253. My return data is in C2:L253. My goal is to create three correlation matrices depending on the values in column B. My code is:
=MMULT(IF(B2:B253="RISK LOVING",TRANSPOSE($C$2:$L$253-$O$3:$X$3),$C$2:$L$253-$O$3:$X$3)/$P$1/MMULT(TRANSPOSE($O$4:$X$4),$O$4:$X$4),0). Any suggestions?
The left-hand side of the condition is a range (B2:B253). The right-hand side is a value. I believe that the first comparison results in TRUE if and only if it is true for B2. Is this really what you want to do? Or would you like to have a column of IF statements?
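To make the intended result concrete, here is a sketch of the per-scenario computation in plain Python rather than Excel: select the rows whose scenario label matches first, then compute the correlation matrix from only those rows. The data, the function names `pearson` and `correlation_matrix`, and the two-column layout are assumptions for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (non-constant columns)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


def correlation_matrix(returns, scenarios, wanted):
    """Correlation matrix of the return columns, using only rows labeled `wanted`."""
    selected = [row for row, label in zip(returns, scenarios) if label == wanted]
    columns = list(zip(*selected))  # transpose rows into columns
    return [[pearson(a, b) for b in columns] for a in columns]


# Hypothetical data: one scenario label and two return columns per row.
scenarios = ["normal", "risk loving", "normal", "risk loving", "risk loving"]
returns = [(0.5, 0.1), (1.0, 2.0), (0.2, 0.9), (2.0, 4.0), (3.0, 6.0)]

matrix = correlation_matrix(returns, scenarios, "risk loving")
print(matrix)
```

Running the same function once per scenario label gives the three matrices. The key point is that the condition is evaluated row by row, which is what an element-wise IF over B2:B253 would have to do in the spreadsheet.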
