Can PCA be used for row reduction - python-3.x

I have a question about Principal Component Analysis (PCA). Suppose the variables run along the rows:
| | delhi | kolkata | up | mp | bihar | assam |
| population | 1.2 | 2.2 | 1.3 | 1.4 | 2 | 1.1 |
| crop | a | b | c | a | b | c |
| avg temp | 1 | 2 | 3 | 4 | 5 | 6 |
| soil ph | 1 | 2 | 1 | 3 | 2 | 1 |
Suppose one wants to use PCA to obtain the most important uncorrelated variables. Can that be done? The idea is to reduce the rows, not the columns.
If anyone could explain this concept it would be very helpful. My understanding is that variables exist only along columns, and there are many Python code examples for reducing the column dimension with PCA, but I am not sure whether row reduction is the same thing.
Thanks in advance.
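One way to look at it, as a minimal sketch rather than a definitive answer: PCA does not care whether the variables are stored in rows or columns, as long as the matrix is transposed so that each variable becomes a column before the decomposition. The example below uses only the numeric rows from the table above (leaving out the categorical crop row is an assumption of this sketch), and the function name pca_rows_as_variables is made up for illustration.

```python
import numpy as np

# Numeric variables from the question, one variable per row, one city per column.
# The categorical "crop" row is left out (an assumption of this sketch).
data = np.array([
    [1.2, 2.2, 1.3, 1.4, 2.0, 1.1],  # population
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # avg temp
    [1.0, 2.0, 1.0, 3.0, 2.0, 1.0],  # soil ph
])

def pca_rows_as_variables(data, n_components):
    """PCA for a matrix whose ROWS are variables (hypothetical helper)."""
    # Transpose so that observations (cities) are rows and variables are columns.
    X = data.T
    # Centre each variable at zero mean.
    X = X - X.mean(axis=0)
    # The right singular vectors of the centred matrix are the principal axes.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:n_components]
    # Project every observation onto the kept axes.
    scores = X @ components.T
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return scores, components, explained_ratio[:n_components]

scores, components, explained = pca_rows_as_variables(data, 2)
print(scores.shape)      # one row per city, one column per kept component
print(components.shape)  # one row per component, one weight per original variable
```

The three original variables are replaced by two uncorrelated components, so the "rows" of the original layout are what gets reduced. If the variables are on very different scales, standardising each one before the decomposition is usually worth considering.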

Related

Spark replicating rows with values of a column from different dataset

I am trying to replicate rows inside a dataset multiple times, with different values for a column, in Apache Spark. Let's say I have a dataset as follows:
Dataset A
| num | group |
| 1 | 2 |
| 3 | 5 |
Another dataset has different columns:
Dataset B
| id |
| 1 |
| 4 |
I would like to replicate the rows from Dataset A for each column value of Dataset B. In other words, a join without any join condition. The resulting dataset should look like this:
| id | num | group |
| 1 | 1 | 2 |
| 1 | 3 | 5 |
| 4 | 1 | 2 |
| 4 | 3 | 5 |
Can anyone suggest how the above can be achieved? As per my understanding, a join requires a condition and matching columns between the two datasets.
What you want is called a Cartesian product, and df1.crossJoin(df2) will achieve it. Be careful with it, though, because it is a very heavy operation.
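To make the semantics concrete, here is a small plain-Python sketch of what a cross join produces. It does not use Spark; the cross_join helper is hypothetical and only illustrates why the result has (rows in A) x (rows in B) rows, which is what makes the operation expensive on large datasets.

```python
from itertools import product

# The two datasets from the question, as lists of records.
dataset_a = [{"num": 1, "group": 2}, {"num": 3, "group": 5}]
dataset_b = [{"id": 1}, {"id": 4}]

def cross_join(left, right):
    """Pair every record of `left` with every record of `right` (no join condition)."""
    return [{**l, **r} for l, r in product(left, right)]

result = cross_join(dataset_b, dataset_a)
for row in result:
    print(row)

# The output size is the product of the input sizes.
print(len(result) == len(dataset_a) * len(dataset_b))
```

With two rows on each side the result has four rows, matching the table in the question; with a million rows on each side it would have a trillion.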

Apply function to an array matching a criterion

Perhaps I'm just unable to formulate the question, but I could not find any matches for this: is there a way to return an array of all the cells matching a criterion?
Take the following example:
1 2
|---------------------|------------------|
1| A | B |
|---------------------|------------------|
2| 1 | 2 |
|---------------------|------------------|
3| 1 | 3 |
|---------------------|------------------|
4| 1 | 12 |
|---------------------|------------------|
5| 2 | 8 |
|---------------------|------------------|
Now in C2, I need to find the MAX value, out of the entire B column, for all the cells that have the value 1 in column A.
This would be a relatively simple array filter in VBA, but I'm trying to achieve it using only Excel formulas.
AFAIK, all the methods, like =INDEX() or =VLOOKUP(), can only find a single closest (or exact) match. Is there, however, a way to return an array of all the matching results?
I'd presume it would go something like
=INDEX($A$2:$B$5; MATCH($A$2; $A$2:$A$5; 0); 1)
However, the issue once again is that this would stop at the first occurrence rather than go through the entire array.
The only thing I can think of is to exhaustively go over each and every number, return every occurrence as a separate value (in a matrix) and then add the number, but that seems like way too much of a hassle.
Expected result:
1 2
|---------------------|------------------|------------------|
1| A | B | C |
|---------------------|------------------|------------------|
2| 1 | 2 | 12 |
|---------------------|------------------|------------------|
3| 1 | 3 | 12 |
|---------------------|------------------|------------------|
4| 1 | 12 | 12 |
|---------------------|------------------|------------------|
5| 2 | 8 | 8 |
|---------------------|------------------|------------------|
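SUMPRODUCT + MAX works for older Excel versions too:
=SUMPRODUCT(MAX(($A$1:$A$4=A1)*$B$1:$B$4))
Tested this:
=MAXIFS(B:B,A:A,A1)
Returns your desired result.
Both formulas are written for data starting in row 1. With the header row shown in the question, use $A$2:$A$5, $B$2:$B$5 and A2 instead. Note that the SUMPRODUCT version multiplies non-matching rows by zero, so it assumes the values in column B are not negative.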
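For readers who want to check the expected result outside Excel, the same "maximum of B for every row sharing the value in A" logic can be sketched in Python. This is only an illustration of what the formulas compute; the function name max_by_key is made up.

```python
# (A, B) pairs from the question's table, rows 2 to 5.
rows = [(1, 2), (1, 3), (1, 12), (2, 8)]

def max_by_key(rows):
    """For each row, return the largest B among all rows with the same A."""
    best = {}
    for key, value in rows:
        # Keep the running maximum per distinct value of A.
        best[key] = max(value, best.get(key, value))
    return [best[key] for key, _ in rows]

column_c = max_by_key(rows)
print(column_c)  # [12, 12, 12, 8]
```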

Dynamic filtering of multiple columns and different conditions with List.Generate()

I need to filter a table. The challenge for me is that the filter information (column names, number of columns, as well as filter values) can change.
After doing some research I think List.Generate() could help me here. The idea is to create a loop that in each loop pass applies one filter condition that is dynamically passed to the loop.
Unfortunately I don't understand List.Generate() well enough to build this myself. Hence any help would be greatly appreciated!
Here is my setup:
I have one table with data (DATASTART)
+---+---+---+
| A | B | C |
+---+---+---+
| 1 | 1 | 2 |
| 1 | 2 | 2 |
| 1 | 3 | 2 |
| 2 | 4 | 3 |
| 2 | 5 | 3 |
| 2 | 6 | 3 |
+---+---+---+
and one table (FILTER) with information which columns of DATASTART should be filtered and the corresponding filter values.
+--------+--------+
| Column | Filter |
+--------+--------+
| A | 1 |
| B | 2 |
+--------+--------+
With static Power Query code
= Table.SelectRows(DATASTART, each ([A] = 1) and ([B] = 2))
the result would be this table (DATARESULT).
+---+---+---+
| A | B | C |
+---+---+---+
| 1 | 2 | 2 |
+---+---+---+
How about this?
let
    condition = (record as record) as logical =>
        List.AllTrue(
            List.Transform(
                Table.ToRecords(FILTER),
                each Record.Field(record, [Column]) = [Filter]
            )
        )
in
    Table.SelectRows(DATASTART, condition)
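The idea in the answer above (build one predicate that checks every row of FILTER against the current record, so no explicit loop such as List.Generate() is needed) can be sketched in Python as follows. This is an illustration of the logic only, not Power Query code; select_rows is a hypothetical helper.

```python
# DATASTART and FILTER from the question, as lists of records.
datastart = [
    {"A": 1, "B": 1, "C": 2},
    {"A": 1, "B": 2, "C": 2},
    {"A": 1, "B": 3, "C": 2},
    {"A": 2, "B": 4, "C": 3},
    {"A": 2, "B": 5, "C": 3},
    {"A": 2, "B": 6, "C": 3},
]
filters = [
    {"Column": "A", "Filter": 1},
    {"Column": "B", "Filter": 2},
]

def select_rows(rows, filters):
    """Keep the rows that satisfy every (Column, Filter) pair."""
    return [
        row for row in rows
        # all(...) plays the role of List.AllTrue over the filter records.
        if all(row[f["Column"]] == f["Filter"] for f in filters)
    ]

dataresult = select_rows(datastart, filters)
print(dataresult)  # [{'A': 1, 'B': 2, 'C': 2}]
```

Because the filter table is read at evaluation time, adding or removing rows in it changes the conditions without touching the code.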

Sum named range consists of several columns and rows

I have a list with countries arranged vertically and years horizontally, like below.
I need to sum all the numbers for 2020 for each country. Each country has several lines, and each year is divided into months.
2020 2021
J | F | M | A | M |...| J | F | M | A | M |...
-------------------------------------------------------
Denmark | | | 15| | 12| | | | | | |
Norway | | | | | | | | | 10| | |
Germany | | | | 11| | | | | | | |
Each year has been defined as a named range, e.g. Year2020.
I have tried using =SUMPRODUCT(SUMIFS(Year2020;CountryRNG;Country)), MATCH/INDEX and also =SUM(INDEX(Year2020;0;MATCH(1E+99;INDEX(Year2020;1;0)))).
How can I do this with one formula?
You can use SUMPRODUCT:
=SUMPRODUCT((Country=CountryRNG)*Year2020)
With a few notes:
CountryRNG and Year2020 must have the same number of rows.
Year2020 covers only the data, with no text or errors in the data field.
Both ranges are limited to the data and do not include full column references. This limits the number of iterations, which would otherwise slow down the calculation. It will work with extra rows, but the unneeded iterations cause extra work.
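The reason the SUMPRODUCT formula works is that the comparison yields a column of TRUE/FALSE values that is multiplied across every month column of the matching rows. A small NumPy sketch of the same mechanics, with made-up numbers (the second Denmark row and all values are assumptions for illustration, and only five month columns are shown):

```python
import numpy as np

# One entry per data row; a country may appear on several rows.
country_rng = np.array(["Denmark", "Norway", "Germany", "Denmark"])

# Year2020 as a block of monthly values (rows match country_rng).
year2020 = np.array([
    [0, 0, 15, 0, 12],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 11, 0],
    [0, 4, 0, 0, 0],
])

def sum_for_country(country):
    """Equivalent of =SUMPRODUCT((Country=CountryRNG)*Year2020)."""
    mask = country_rng == country          # one boolean per row
    # Broadcasting the column mask across all months zeroes out other rows.
    return int((mask[:, None] * year2020).sum())

print(sum_for_country("Denmark"))  # 31
print(sum_for_country("Norway"))   # 0
```

This also shows why the two ranges need the same number of rows: the row mask has to line up one-to-one with the rows of the data block.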

How to divide two cells based on match?

In table 1, I have:
+---+---+----+
| | A | B |
+---+---+----+
| 1 | A | 30 |
| 2 | B | 20 |
| 3 | C | 15 |
+---+---+----+
In table 2, I have:
+---+---+---+----+
| | A | B | C |
+---+---+---+----+
| 1 | A | 2 | 15 |
| 2 | A | 5 | 6 |
| 3 | B | 4 | 5 |
+---+---+---+----+
I want to divide the number from table 1 by the number in the second column of table 2, based on a match in the first column, and put the result in the third column.
The numbers shown in the third column of table 2 are the results needed. What formula must I apply in the third column of table 2?
Please help me with this.
Thanks in advance.
You can use a VLOOKUP() formula to fetch the dividend (assuming table 1 is on Sheet1 and table 2 is on Sheet2, where we are entering this formula):
=VLOOKUP(A1,Sheet1!A:B, 2, FALSE)/Sheet2!B1
Since you mention tables, here is the same formula with structured references, though it seems you are not using those here:
=VLOOKUP([@Column1],Table1[#All],2,0)/[@Column2]
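As a quick sanity check of the expected numbers, the lookup-then-divide step can be sketched in Python with a dictionary standing in for table 1. The variable names are made up for illustration.

```python
# Table 1: key -> value to be divided.
table1 = {"A": 30, "B": 20, "C": 15}

# Table 2: (key, divisor) pairs from columns A and B.
table2 = [("A", 2), ("A", 5), ("B", 4)]

# For each row, look up the dividend by key (like VLOOKUP with exact match)
# and divide it by the number in the second column.
results = [table1[key] / divisor for key, divisor in table2]
print(results)  # [15.0, 6.0, 5.0]
```

The values match column C of table 2 in the question, which confirms that the table 1 number is the dividend and the table 2 number is the divisor.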
