Correlations/Data Mining in Microsoft Excel 2003 - excel

I have an Excel spreadsheet where each column is a certain variable. At the end of my columns I have a special last column called "Type" which can be A, B, C, or D.
Each row is a data point with different variables that ends up in a certain "Type" bucket (A/B/C/D) recorded in the last column.
I need a way to examine all entries of a certain type (say, "C" or "C"|"D") and find out which of the variable(s) is a good predictor of this last column, and which are better predictors than others.
Some variables are numbers, others are fixed strings (from a set of strings), so it's not just a number/number correlation.
Is Excel 2003 a good tool for that, or are there better statistical programs that make this easier? Do I create a Pivot/Histogram for each category, or is there a better way to run these queries? Thanks

You can do some filtering in Microsoft Excel, especially to clean the data (that is, to convert the values to a single type, string or numeric). Excel can also do some data mining. However, for the kind of problem you have, the tool I recommend is WEKA. With it you can run associative classification (i.e., class association rule mining) over all data instances (rows) and so determine which items belong to A/B/C/D. Your special "Type" attribute will be the class attribute.
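To make the "which variable predicts Type best" idea concrete, here is a minimal Python sketch that ranks categorical variables by information gain with respect to the class column. This is not WEKA's API or its rule-mining algorithm, just one common attribute score written with the standard library. The sample rows and the names `colour` and `size` are made up for illustration, and numeric variables would need to be binned into ranges first.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, variable, target="Type"):
    """How much knowing `variable` reduces uncertainty about `target`."""
    labels = [row[target] for row in rows]
    groups = defaultdict(list)
    for row in rows:
        groups[row[variable]].append(row[target])
    # Weighted entropy that remains after splitting on the variable.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_predictors(rows, variables, target="Type"):
    """Variables sorted from most to least informative about the target."""
    return sorted(variables, key=lambda v: information_gain(rows, v, target), reverse=True)


# Hypothetical data: "colour" determines Type, "size" tells us nothing.
rows = [
    {"colour": "red", "size": "S", "Type": "A"},
    {"colour": "red", "size": "L", "Type": "A"},
    {"colour": "blue", "size": "S", "Type": "C"},
    {"colour": "blue", "size": "L", "Type": "C"},
]
ranking = rank_predictors(rows, ["size", "colour"])
```

A higher score means the variable separates the A/B/C/D buckets better. To study one bucket such as "C" or "C"|"D", the target can first be recoded to a yes/no column.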

Related

how to add columns with 'filled data' after filling missing values in pandas or python using different techniques?

How to add columns with 'filled data' after filling missing values in pandas or python, using different or several techniques like various statistical techniques or machine learning techniques.
What I want to do is that, after filling the data let's say with mean, median or standard deviation values or with other machine learning algos, like KNN or XGBoost or some other technique, then I want to add or append those or that particular column(s) at the end of the csv or excel file but not below the actual data, I mean towards the right-end side of the file.
For instance, once I've filled the missing data of a particular column using statistical and other ML techniques, I want to add those 'filled values', together with the original values, in a new column named after the original column, followed by an underscore and the technique used to fill it, and place it at the right-hand end of the data. For example, if the column or feature is 'phone', then after filling the missing values the right-hand end must show the original values plus the values calculated by statistical or ML means, with a column name like "phone_Mean", "phone_interpolation", 'phone_KNN' or 'phone_XGBoost'.
What I've done so far?
I've tried the approaches from the pandas documentation and from Stack Overflow, the ones that rank highly and appear among the top 7 to 10 links on Google or DuckDuckGo, but all in vain.
I've been stuck on this issue for the last few days, and it is hurting my ability to convince my client. So it would be a great help if you could support your answer with a code example using pandas or core Python.
Here's the snippet of the dataset. Let's say I'm applying techniques on a feature/column named 'phone':
One way is to use pandas, like this:
df_01["phone_mean"] = df_01["phone"].fillna().mean()
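Note that `fillna` needs the fill value passed in as an argument, and calling `.mean()` on the result would collapse the column to a single number. A minimal sketch of the intended pattern, assuming 'phone' is numeric and using made-up sample values, could look like this. Each assignment appends a new column on the right and leaves the original column untouched.

```python
import pandas as pd

# Hypothetical sample data; "phone" stands in for any numeric column with gaps.
df_01 = pd.DataFrame({"phone": [10.0, None, 30.0, None, 50.0]})

col = df_01["phone"]

# Original values are kept; only the missing entries receive the fill value.
df_01["phone_mean"] = col.fillna(col.mean())
df_01["phone_median"] = col.fillna(col.median())

# Linear interpolation between the neighbouring known values.
df_01["phone_interpolation"] = col.interpolate()
```

The same pattern should extend to model-based fills: compute a filled Series by whatever method you choose, then assign it to a new name such as `df_01["phone_KNN"]`. Writing the frame out with `df_01.to_csv(...)` then puts the new columns at the right-hand end of the file.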

Column to rows and highlight difference between values in the same group

I have a huge table with data structured like this:
And I would like to display them in Spotfire Analyst 7.11 as follows:
Basically I need to display the columns that contain "ANTE" below the others in order to make a comparison. Values that have variations for the same ID must be highlighted.
I also have the fields "START_DATE_ANTE" and "END_DATE_ANTE" which have been omitted in the example image.
Amusingly, if you were limited to just what the title asks, this would be a very simple answer.
If you wanted this in a table where the rows are displayed as usual and the cells are highlighted, you can do this by going to properties, adding a new Grouping where you select VAL_1 and VAL_1_ANTE, and adding a Rule with Rule type "Boolean expression", where the value is:
[VAL_1] - [VAL_1_ANTE] <> 0
This will highlight the affected cells, which you can place next to each other. You can even throw in a calculated column showing the difference between the two columns, and slap it on right next to it. This gives you the further option to filter down to only showing rows with discrepancies, or sorting by these values.
However, if you actually need it to display the POSTs on different lines from the ANTEs, as formatted above, things get a little tricky.
My personal preference would be to pivot (split/union/etc) the data before pulling it in to Spotfire, with an indicator flag on "is this different", yes/no. However, I know a lot of Spotfire users either aren't using a database or don't have leeway to perform the SQL themselves.
In fact, if you try to do it in Spotfire using custom expressions alone, it becomes so tricky, I'm not sure how to answer it right off. I'm inclined to think you should be able to do it in a cross table, using Subsets, but I haven't figured out a way to identify which subset you're in while inside the custom expressions.
Other options include generating a table using IronPython, if you're up to that.
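As a rough illustration of the "pivot before pulling it in" option mentioned above, here is a small plain-Python sketch (run outside Spotfire, not IronPython inside it) that turns each row into a POST line and an ANTE line and attaches an "is this different" flag. The column names `VERSION` and `IS_DIFFERENT`, the function name, and the sample rows are assumptions made for the example.

```python
def split_post_ante(rows, value_columns):
    """Emit two output rows (POST, ANTE) per input row, flagged if any value changed."""
    out = []
    for row in rows:
        # A row counts as different if any value differs from its *_ANTE counterpart.
        changed = any(row[c] != row[c + "_ANTE"] for c in value_columns)
        post = {"ID": row["ID"], "VERSION": "POST", "IS_DIFFERENT": changed}
        ante = {"ID": row["ID"], "VERSION": "ANTE", "IS_DIFFERENT": changed}
        for c in value_columns:
            post[c] = row[c]
            ante[c] = row[c + "_ANTE"]
        out.extend([post, ante])
    return out


# Hypothetical input in the wide layout: one row per ID.
wide_rows = [
    {"ID": 1, "VAL_1": 5, "VAL_1_ANTE": 5},
    {"ID": 2, "VAL_1": 7, "VAL_1_ANTE": 9},
]
long_rows = split_post_ante(wide_rows, ["VAL_1"])
```

With the data in this shape, the ANTE line sits directly below its POST line for each ID, and a colouring rule can key off the flag column.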

Replacing numeric values in Excel sheet with text values from other sheet

I am using Surveymonkey for a questionnaire. Most of my data has a regular scale from 0-6, and additionally an "Other" option that people can use in case they choose not to answer the item. However, when I download the data, Surveymonkey automatically assigns a value of 0 to that no-answer category, and it appears this can't be changed.
This leaves me not knowing whether a zero in my numeric dataset actually means zero or just a participant choosing not to answer the question. I can only figure that out by looking at another file that includes the labels of participants' answers (all answers are provided by the corresponding labels, so this datafile misses all non-labeled answers...).
This leads me to my problem: I have two excel files of same size. I would need to find a way to find certain values in one dataset (text value, scattered randomly over dataset), and replace the corresponding numeric values in the other dataset (at the same position in the dataset) with those values.
I thought it would just be possible to find all values and copy paste in the same pattern, but I cannot seem to find a way to do that. I feel like I am missing an obvious solution, but after searching for quite a while I really could not find an answer to my specific question.
I have never worked with macros or more advanced excel programming before, but have a bit of knowledge about programming in itself. I hope I explained this well, I would be very thankful for any suggestions or scripts that could help me out here!
Thank you!
Alex
I don't know how your Excel file is organised, but if it's like the legacy Condensed format, all you should need to do is to select the column corresponding to a given question (if that's what you have), and search and replace all 0 (match entire cell) with the text you want.
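If the two files really do have the same shape, the position-by-position replacement described in the question can be sketched as below. This is only an illustration in Python on in-memory lists. Reading and writing the actual Excel files is left out, and the label text "Other" and the sample values are assumptions.

```python
def overlay_labels(numeric_rows, label_rows, target_label):
    """Copy the numeric grid, substituting target_label wherever the label grid has it."""
    merged = []
    for num_row, lab_row in zip(numeric_rows, label_rows):
        merged.append([
            lab if lab == target_label else num
            for num, lab in zip(num_row, lab_row)
        ])
    return merged


# Hypothetical excerpts from the two exports; same rows and columns in both.
numeric = [[0, 3, 6], [0, 0, 2]]
labels = [["Never", "Sometimes", "Always"], ["Other", "Never", "Rarely"]]
result = overlay_labels(numeric, labels, "Other")
```

Only the cell whose label is "Other" changes, so a genuine 0 stays numeric and the non-answer becomes visible as text.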

Stata tab over entire dataset

In Stata is there any way to tabulate over the entire data set as opposed to just over one variable/column? This would give you the tabulation over all the columns.
Related - is there a way to find particular values in Stata if one does not know which column they occur in? The output would be which column and row they are located in or at least which column.
Stata does not use row and column terminology except with reference to matrices and vectors. It uses the terminology of observations and variables.
You could stack or reshape the entire dataset into one variable if and only if all variables are numeric or all are string. If that assumption is incorrect, then you would need to convert numeric variables to string, at least temporarily, before you could do that. I guess wildly that you are only interested in blocks of variables that are all either numeric or string.
When you say "tabulate" you may mean the tabulate command. That has limits on the number of rows and/or columns it can show that might bite, but with a small amount of work list could be used for a simple table with many more values.
tabm from tab_chi on SSC may be what you seek.
For searching across several variables, you could automate a loop.
I'd say that if this is a felt need, it is quite probable that you have the wrong data structure for at least some of what you want to do and should reshape. But further details might explode that.

VB: Filtering data on excel table

In python, using libs to work with excel files, I could do what I want.
But now, because I'm trying to learn VBA, I need to ask this question.
I'm working on a worksheet that has around 12 columns, and 50000 rows.
This data represents Requests sent to the company.
The 5# column represents its code, and the 10# column the time it took to finish it.
But, for example, rows 5, 10 and 12 could belong to the same Request, and were just divided for organizational purposes.
I need to treat these data, so that I can:
1. Column 6# represents the person who answered the request. So, I need to put each request on the "person's worksheet". Also, create this worksheet for him before starting to add requests to it.
2. For each person (worksheet), count the request types (Column 2#) attended by him. I.e., create another table on his worksheet showing:
   Type_Of_Request | Number_of_ocurrences
3. Create a final Report Worksheet, showing the same table as above, but counting all requests (without the person filter).
Obs: I know that most questions on stackoverflow are to solve a specific question, but I'm asking for start routes here.
Or even solutions, if possible.
For explanation purposes, I think that explaining the algorithm used in python will help persons who know a little of python and VBA to help me here.
So, for each issue:
1. Create a dict that manages the 6# column data.
   This dict will have the person's unique name as the key, and each request that he answered will be added to a list pointed to by his name (the dict key).
   Something like:
   {person1: [request1, request2, request3, ...], ... }
2. Another dict that manages the 2# column data (the request type).
   Now, I will have a dict where each entry has a list of the requests that are of that type.
   After positioning all requests, I did a simple sum on the list, and filled a table with (key, sum(dict[key]))
   where dict[key] is the list of requests of the same type, and a sum on it returns the total of requests of that type.
   Something like:
   {request_type1: [request1, request2, request3, ...], ... }
3. Well, the same as 2, but applying the algorithm on the initial complete table.
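For readers who want to see the Python side spelled out, the dict-based algorithm described above can be sketched roughly as follows. This is not the asker's actual script. The 0-based column indices are derived from the column numbers in the question, and the sample rows are invented and shortened to six columns.

```python
from collections import Counter, defaultdict

TYPE_COL = 1    # column 2 in the sheet: request type (0-based index)
PERSON_COL = 5  # column 6 in the sheet: person who answered


def group_by_person(rows):
    """Step 1: map each person to the list of request rows they answered."""
    by_person = defaultdict(list)
    for row in rows:
        by_person[row[PERSON_COL]].append(row)
    return by_person


def count_types(rows):
    """Steps 2 and 3: number of occurrences of each request type in rows."""
    return Counter(row[TYPE_COL] for row in rows)


# Hypothetical rows: (unused, type, unused, unused, code, person)
rows = [
    ["", "install", "", "", "R1", "ana"],
    ["", "repair", "", "", "R2", "ana"],
    ["", "install", "", "", "R3", "bob"],
    ["", "install", "", "", "R4", "ana"],
]
by_person = group_by_person(rows)
per_person_counts = {person: count_types(reqs) for person, reqs in by_person.items()}
report = count_types(rows)  # the final report, without the person filter
```

In VBA the closest counterpart is probably a `Scripting.Dictionary` keyed by person or request type, with a running count as the item value.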
I don't know if VB has a dict type like python has (and helps a lot!), even because I'm new on VB.
Thanks, a lot, for any help.
VBA does indeed have a dictionary type, but its usage may not mirror Python's implementation. (see: http://msdn.microsoft.com/en-us/library/aa164502%28v=office.10%29.aspx )
You can also create a user-defined type ( see: http://msdn.microsoft.com/en-us/library/aa189637%28v=office.10%29.aspx )
If you have a working solution, that is your best jumping-off point. Many of the Python string functions, etc., are probably named the same or closely enough for you to find them easily in the language reference.
You may find this easier with ADO, which works quite well with Excel using the Jet/ACE connection. It will also allow you to use the Range object's CopyFromRecordset method to write suitable recordsets to worksheets.
