How can I replace the NAs in every column of a big dataset with the median of the non-missing entries of that column?

I have a big dataset (1121 rows x 532 columns).
Every column represents a single item of a self-report questionnaire.
I have several NAs (n = 3015).
I would like to replace every NA with the median of its column.
How can I do that?
I tried to clean the dataset of missing values with the na.omit function, but R drops every row in which an NA is found.
This is a problem for me because after this operation I have a dataset with only 641 rows. Every column name combines the name of the scale and the item number (e.g. IUI23: IUI is the name of the scale and 23 is the number of the item).
I need to find the median of each column, without picking up the column title, and then replace every NA in that column with that median.

I solved my problem and I would like to share my answer with you all.
I used the package "randomForest".
I used the function na.roughfix(object, ...) from that package, which imputes all missing values by median/mode. It returns a completed data matrix or data frame. For numeric variables, NAs are replaced with column medians. For factor variables, NAs are replaced with the most frequent levels (breaking ties at random). If object contains no NAs, it is returned unaltered.
My data frame was named IUI_data. I simply typed:
IUI_data.roughfix <- na.roughfix(IUI_data)
It worked perfectly!
For further information on the package "randomForest" you can have a look here: cran.r-project.org/web/packages/randomForest/randomForest.pdf.
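For readers who work in pandas rather than R, the same column-wise median imputation can be sketched as follows. This is only an illustration of the logic, not the na.roughfix implementation; the data frame `items` and its column names are hypothetical toy data, and only numeric columns are handled (the mode rule for factors is not covered).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the questionnaire items.
items = pd.DataFrame({
    "IUI1": [1.0, 2.0, np.nan, 4.0],
    "IUI2": [5.0, np.nan, 5.0, 1.0],
})

# DataFrame.median skips NaN by default, so each median is computed
# from the non-missing entries of its column only.
col_medians = items.median(numeric_only=True)

# fillna with a Series fills each column with the value whose label
# matches the column name; no rows are dropped.
imputed = items.fillna(col_medians)
print(imputed)
```

Because the medians are looked up by column name, the column titles never enter the calculation, and the number of rows stays the same.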

Related

How to split a Pandas dataframe into multiple csvs according to when the value of a column changes

So, I have a dataframe with 3D point cloud data (X,Y,Z,Color):
dataframe sample
Basically, I need to group the data according to the color column (which takes values of 0,0.5 and 1). However, I don't need an overall grouping (this is easy). I need it to create new dataframes every time the value changes. That is, I'd like a new dataframe for every set of rows that are followed by and preceded by 5 zeros (because single zeros are sometimes erroneously present in chunks of data that I'm interested in).
Basically, the zero values (black) are meaningless for me; I'm only interested in the 0.5 (red) and 1 values (green). What I want to accomplish is to segment the original point cloud into smaller clusters that I can then visualize. I hope this is clear. I can't seem to find answers to my question anywhere.
First of all, you should understand the for loop well. Python makes it easy to use code from any library inside functions and loops. Let's say you have a dataset and you want to walk through it and check column a. First, start a loop over the rows, for example with "for _, row in dataset.iterrows():" (a plain "for i in dataset:" on a pandas DataFrame iterates over the column names, not the rows). On the next line, specify the criterion you want, for example "if row['a'] > 0.5:". When the condition holds, you can write the code needed to add the whole current row to a new dataset. I have deliberately not written ready-made code, so that you can practice.
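As a complement to the loop-based advice above, here is a minimal vectorized sketch of the segmentation the question describes. It assumes a column named "Color", a unique index, and that a run of at least 5 consecutive zeros is what separates two segments; the function name `split_on_zero_runs` and the toy data are hypothetical.

```python
import pandas as pd

def split_on_zero_runs(df, col="Color", min_zeros=5):
    """Split df into segments separated by runs of >= min_zeros zeros."""
    is_zero = df[col].eq(0)
    # Give every run of consecutive equal is_zero values its own id.
    run_id = is_zero.ne(is_zero.shift()).cumsum()
    run_len = run_id.map(run_id.value_counts())
    # Separator rows: zeros that belong to a long enough run.
    is_sep = is_zero & (run_len >= min_zeros)
    # A segment starts at a kept row whose predecessor is a separator
    # (the position before the first row counts as a separator).
    starts = ~is_sep & is_sep.shift(fill_value=True)
    seg_id = starts.cumsum()
    kept = df[~is_sep]
    return [segment for _, segment in kept.groupby(seg_id[~is_sep])]

# Toy point cloud: only the Color column matters for the split.
colors = [0] * 5 + [0.5, 0.5, 0, 0.5] + [0] * 6 + [1, 1] + [0] * 5
cloud = pd.DataFrame({"X": range(len(colors)), "Color": colors})
segments = split_on_zero_runs(cloud)
print(len(segments))
```

Short zero runs (such as the single stray zero in the toy data) stay inside their segment. Each returned data frame could then be written out with its own to_csv call.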

Excel multilevel array formula with partial string matches to sum resultant cells

I've been trying to sort this for over a day now without much luck. I have successfully used SUMIFS, INDEX, MATCH, COUNTIF, "--" etc array functions previously and am not a novice, but also not an expert on these. I can't seem to weave these together correctly, and likely on an altogether incorrect path.
Basically, I am trying to aggregate data from multiple spreadsheets, requiring a mapping of various items (rows) into a canonical form for summing.
The image here shows a representative, but simplified version of my quest. Each "region" on this example spreadsheet (Final..., Mapping, DataSet1, DataSet2) is actually in different spreadsheets, and there are several sheets with 50-150 rows in each xlsx.
Note that the names in Column B are quite arbitrary (meaning not all P1's have an 'x' pattern, like shown here as x1, x2, etc.). Do not rely on any pattern in the names, except that the x, y, z in the Mapping table are substrings (case insensitive, trailing match) of the names in Column B in the DataSets.
And in the image, the Final Result Table (summed manually) is what I want to compute via an (array) formula. A single formula would be ideal (I have many spreadsheets from which the monthly data is being pulled, which I can't readily modify, but I can create an interim spreadsheet if required, so I am open to helper columns or helper rows).
Here's the process: for each name (B3-B5) in the Final Result Table, I want to sum the name from its components as follows:
Lookup all the matches in the Mapping Table (so for P1, the formula =IF($C$10:$C$15=$B3, $B$10:$B$15,"") gives {"x1";"";"";"x2";"";"x3"}).
I then want to search each of x1, x2, and x3 in B19:B26 to get rows 21, 22, 24, 25, 26 in DataSet1 and B31:B35 to get row 32 in DataSet2, to then add up the Jan totals into C3. (Effectively,
C3=C21+C22+C24+C25+C26+C32). Same for P2 and P3, and thru Feb, Mar, ...
I am stuck on how to remove blank or 0 or Div0 or such "error rows" from the interim result in 2, and also need to use 2 arrays of different sizes (3 valid rows in example 2 above, ignoring blanks) to search many rows in DataSets. I tried SEARCH("*"&IF($C$10:$C$15=$B3, $B$10:$B$15,""), $B$19:$B$26) but get unexpected results. I have tried to replace text in the interim result {"x1";"";"";"x2";"";"x3"} with TRUE/FALSE, and 1/0, etc. to help with INDEX or MATCH, but am stymied by errors in downstream ("surrounding") formulas.
Thanks in advance.
Here is a solution without resorting to nasty (imo) CSE formulas.
= SUMPRODUCT($C$19:$F$26*(COUNTIFS($B$10:$B$15, RIGHT($B$19:$B$26,2),$C$10:$C$15,$B3)>0)*($C$18:$F$18=C$2))
+
SUMPRODUCT($C$31:$F$35*(COUNTIFS($B$10:$B$15, RIGHT($B$31:$B$35,2),$C$10:$C$15,$B3)>0)*($C$30:$F$30=C$2))
There is one SUMPRODUCT for each data set. If possible, it would be better to put all your data sets into a single table with a column identifying which data set each row is a part of.
The way it works is that it takes each value in your data set and multiplies it by whether the 2 rightmost characters of its name appear in your mapping table for that P code, multiplied by whether the value is in the correct month. So a value contributes 0 if either of those conditions is false. The formula then returns the sum.
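The logic of the SUMPRODUCT/COUNTIFS formula can be traced outside Excel with a small Python sketch. All names and numbers below (the `mapping` dictionary, `dataset1`, `dataset2`, `mapped_total`) are hypothetical stand-ins for the Mapping table and the data sets, not the asker's real data.

```python
# Hypothetical mapping table: component code -> canonical product.
mapping = {"x1": "P1", "y1": "P2", "z1": "P3",
           "x2": "P1", "y2": "P2", "x3": "P1"}

# Hypothetical data sets: (row name, {month: value}).
dataset1 = [
    ("ABCDEFGHx1", {"Jan": 10, "Feb": 1}),
    ("IJKLMNOPy1", {"Jan": 20, "Feb": 2}),
    ("QRSTUVWXx2", {"Jan": 30, "Feb": 3}),
]
dataset2 = [
    ("ZZZZZZZZx3", {"Jan": 5, "Feb": 4}),
    ("YYYYYYYYz1", {"Jan": 7, "Feb": 6}),
]

def mapped_total(product, month, datasets, suffix_len=2):
    """Sum the month's values of all rows whose code maps to product."""
    total = 0
    for rows in datasets:  # one pass per data set, like one SUMPRODUCT each
        for name, monthly in rows:
            # RIGHT(name, 2); lower-cased because COUNTIFS ignores case.
            code = name[-suffix_len:].lower()
            if mapping.get(code) == product:
                total += monthly.get(month, 0)
    return total

jan_p1 = mapped_total("P1", "Jan", [dataset1, dataset2])
print(jan_p1)
```

A row contributes only when both conditions hold (its suffix maps to the requested product, and the requested month exists for it), which mirrors the multiplication of the two TRUE/FALSE arrays in the formula.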
UPDATE IN RESPONSE TO OP COMMENTS
If the X, Y, Z codes are not always 2 characters long but the first part is ALWAYS 8 characters long, you can easily amend the:
RIGHT($B$19:$B$26,2)
to be:
RIGHT($B$19:$B$26,LEN($B$19:$B$26)-8)
Making the formula for the first data set:
=SUMPRODUCT($C$19:$F$26*(COUNTIFS($B$10:$B$15, RIGHT($B$19:$B$26,LEN($B$19:$B$26)-8),$C$10:$C$15,$B3)>0)*($C$18:$F$18=C$2))
And you can amend for other data sets and simply add them together.
Nice challenge! Are you willing to drop all your tables (DataSet1, DataSet2...) into one spreadsheet, so that we can refer to a single range for each month?
Here's one solution (hopefully a good starting point) - array formula (Ctrl+Shift+Enter):
=SUMPRODUCT(IFERROR(IF(TRANSPOSE(IF($B3=$C$10:$C$15,$B$10:$B$15,""))=RIGHT($B$18:$B$36,2),C$18:C$36,0),0))

How to find the index of remaining columns if the data is repetitive

I have data entries like this:
Data entries
Now, I need to find the smallest 10 values and also get the corresponding person, area and date along with them.
I used the SMALL function to find the 10 smallest values. Then I used the INDEX and MATCH functions to get their corresponding row entries. The problem is that some values are repeated, so these functions return the row of the first 2 for all the remaining 2s. How can I solve this?
In F2 use Rank like this, so you have unique numbers:
=RANK(C2,$C$2:$C$21,1)+ROW()/1000
In G2 use Small to pull the smallest of the ranked numbers, and copy down 10 rows.
=SMALL($F$2:$F$21,ROW(A1))
Now you can pull person, date, real hours and area with an index match in H2, copied across and down.
=INDEX(A$2:A$21,MATCH($G2,$F$2:$F$21,0))
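The idea behind the RANK + ROW()/1000 helper column is to break ties by row position so that every row gets a unique sort key. A minimal Python sketch of the same idea, with hypothetical rows of (person, date, hours, area):

```python
# Hypothetical rows: (person, date, hours, area).
rows = [
    ("Ann", "2016-06-01", 2, "North"),
    ("Bob", "2016-06-01", 5, "South"),
    ("Cid", "2016-06-02", 2, "East"),
    ("Dee", "2016-06-02", 1, "West"),
    ("Eve", "2016-06-03", 2, "North"),
]

def smallest_n(rows, n, value_index=2):
    """Return the n rows with the smallest value, ties broken by position."""
    # Pair each row with (value, position): the position plays the role
    # of the ROW()/1000 term and makes every key unique.
    keyed = [((row[value_index], pos), row) for pos, row in enumerate(rows)]
    keyed.sort(key=lambda item: item[0])
    return [row for _, row in keyed[:n]]

smallest = smallest_n(rows, 3)
print([row[0] for row in smallest])
```

Because each key is unique, the lookup returns a different row for every repeated value instead of the first match each time.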

Attempting to sort a list that is generated from a unique point extraction of an array

The issue is sorting an array that is generated automatically from a data source using a formula that extracts unique data points. (Data points are date/time.)
The data is being extracted with this formula.
=INDEX(Table_ExternalData_1[SampleDateTime],MATCH(0,INDEX(COUNTIF($G$2:G2,Table_ExternalData_1[SampleDateTime]),0,0),0))
Once extracted, the data is not sorted right away. The current data is extracted from a database via an SQL string that pulls in data corresponding to the date and time that the data point was created.
Because of this, the extracted points are not in the correct order. I am attempting to sort the extracted data points from earliest to latest to continue with the data sorting, but need the date/times to be sorted in a separate row.
I have attempted to use a pivot table, but it isn't exactly what I need and ends up being a messier end product than I need.
All assistance is appreciated.
Example is below.
1
2
3
5
1
2
3
4
6
5
3
I need this.
1
2
3
4
5
6
I did end up finding a solution that I will be able to modify. Using a single row of a pivot table, I took just the date/time column and had the PivotTable function sort the data to be utilized as necessary.
Thank you.
The fact that the range in the example you give:
1) Consists of entries of a numeric datatype only
2) Does not contain any blanks
means that the solution is relatively simple.
Assuming that data is in A1:A11, first use a single cell somewhere within the worksheet to count the number of expected returns. For example, using B1 for this purpose, enter this formula in that cell:
=SUM(IF(FREQUENCY(A1:A11,A1:A11),1))
Your main formula is then:
=IF(ROWS($1:1)>B$1,"",SMALL(IF(FREQUENCY(A$1:A$11,A$1:A$11),A$1:A$11),ROWS($1:1)))
the latter being copied down until you start to get blanks for the results.
Regards
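For comparison, the result that the FREQUENCY/SMALL formulas above produce (each distinct value once, in ascending order) can be checked with a short Python sketch using the example data from the question:

```python
# Example values from the question, in their unsorted extracted order.
values = [1, 2, 3, 5, 1, 2, 3, 4, 6, 5, 3]

# set() drops the duplicates, sorted() orders what is left ascending.
unique_sorted = sorted(set(values))
print(unique_sorted)  # [1, 2, 3, 4, 5, 6]
```

The same two steps apply to date/time values, provided they are stored as comparable objects rather than as text.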

Sum multiple values in Index/Match function

I am working on a distribution problem, analysing the volumes delivered to a set of stores (75 stores).
I have an Excel file as follows:
As you can see, each day does not contain the same stores, given that each store does not receive a delivery every day.
I want to get a new table that has the code of the store in the columns, and the information about volume and miles in the rows. Furthermore, I want to sum the values of the volumes given that they belong to the same store. In this example this would look like this:
As you can imagine, my spreadsheet is way bigger, having a total of 6500 rows and 800 columns. I was thinking about using the function combination of INDEX/MATCH, but I cannot see how to make it sum the multiple values for a given store in a given date.
While you need to extend this formula, you could use:
=SUMIF(INDEX($C:$F,MATCH($J2,$A:$A,0),),L$1,INDEX($C:$F,MATCH($J2,$A:$A,0)+MOD(ROW(),2)+1,))
if the table is built up like this:
From L2 you can simply drag down and to the left as needed ;)
EDIT
To also get the stores:
L1: {=MIN(IF(MOD(ROW($C$1:$F$6),3)=1,$C$1:$F$6))}
This is an array formula: it must be entered without the {} and confirmed with Ctrl+Shift+Enter!
M1: =SUMPRODUCT(SMALL(IF(MOD(ROW($C$1:$F$6),3)=1,$C$1:$F$6),SUM((IF(MOD(ROW($C$1:$F$6),3)=1,$C$1:$F$6)<=L$1)*1)+1))
from M1 you can simply copy to the right.
And to get the dates (if non continuous or something like that)
J2: =MIN(A:A)
J3: =J2
J4: =SMALL(A:A,COUNTIF(A:A,"<="&J3)+1)
J5: =J4
then copy J4:J5 simply down :)
Don't put the stores in the columns. Use VBA or similar to read the input files and normalize the data so that the output is a table looking like this:
Store  Date        Volume  Miles
101    10/06/2016  520     120
102    11/06/2016  500     100
Then you can always lookup a store and date or pivot the data later.
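Once the data is in that normalized Store / Date / Volume / Miles shape, summing the volumes per store and date is a single pivot. A minimal pandas sketch, where the first two rows come from the table above and the third row is a hypothetical second delivery added to show the summing:

```python
import pandas as pd

# Normalized (long) table: one row per delivery.
deliveries = pd.DataFrame({
    "Store": [101, 102, 101],
    "Date": ["10/06/2016", "11/06/2016", "10/06/2016"],
    "Volume": [520, 500, 80],   # third row is hypothetical
    "Miles": [120, 100, 30],
})

# One row per date, one column per store; several deliveries to the
# same store on the same date are summed, missing pairs become 0.
volume_by_store = deliveries.pivot_table(
    index="Date", columns="Store", values="Volume",
    aggfunc="sum", fill_value=0,
)
print(volume_by_store)
```

The same call with values="Miles" would give the corresponding miles table.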

Resources