Transform data in Excel (rows and columns) - excel

Current input:
+-------------------------+
| col A |
+-------------------------+
| Test(read) |
| - ABC(dsakl) |
| - ABC(dslø) Write(T) |
| - ABC(dsjkl) |
| Test(true) |
| - ABC(DSdsa) |
| - ABC(dsalk**) write(F) |
+-------------------------+
Wanted output:
+-------------------------+-------------------------+
| col A | col B |
+-------------------------+-------------------------+
| Test(read) | - ABC(dsakl) |
| Test(read) | - ABC(dslø) Write(T) |
| Test(read) | - ABC(dsjkl) |
| Test(true) | - ABC(DSdsa) |
| Test(true) | - ABC(dsalk**) write(F) |
+-------------------------+-------------------------+
Hence, I want to transform my data so that every time "Test" occurs, it is repeated in column "A" once for each row up to the next "Test". The cells containing "ABC" should be written to column "B" in the output file.
Really appreciate input on this one. Thanks!

There may be other approaches to this, but if you want to do it purely with regular formulas you could try this:
For column B, use AGGREGATE to find the first row starting with "-", then the second, and so on:
=IFERROR(INDEX(A$1:A$10,AGGREGATE(15,6,ROW(A$1:A$10)/(LEFT(A$1:A$10)="-"),ROW())),"")
Col A is more difficult - this finds the nth row starting with "-" as before, then looks before this to find the last row starting with T:
=IFERROR(INDEX(A$1:A$10,AGGREGATE(14,6,ROW($A$1:INDEX(A$1:A$10,AGGREGATE(15,6,ROW(A$1:A$10)/(LEFT(A$1:A$10)="-"),ROW())))
/(LEFT($A$1:INDEX(A$1:A$10,AGGREGATE(15,6,ROW(A$1:A$10)/(LEFT(A$1:A$10)="-"),ROW())))="T"),1)),"")
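For comparison outside Excel, the same reshaping can be sketched in Python. This is only an illustration of the logic, not part of the formula answer: the function name `pair_headers` is hypothetical, and the rule that a detail row starts with "-" while every other non-empty row is a header is an assumption taken from the sample data.

```python
def pair_headers(cells):
    """Pair each detail row (starting with "-") with the most recent header row."""
    pairs = []
    current_header = None
    for cell in cells:
        text = cell.strip()
        if not text:
            continue  # skip empty cells
        if text.startswith("-"):
            # Detail row: emit it next to the last header seen so far
            pairs.append((current_header, text))
        else:
            # Any other non-empty row is treated as a new header (assumption)
            current_header = text
    return pairs


col_a = [
    "Test(read)",
    "- ABC(dsakl)",
    "- ABC(dslø) Write(T)",
    "- ABC(dsjkl)",
    "Test(true)",
    "- ABC(DSdsa)",
    "- ABC(dsalk**) write(F)",
]

for header, detail in pair_headers(col_a):
    print(header, "|", detail)
```

Each printed line corresponds to one row of the wanted output, with the header in column A and the detail in column B.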

Related

Count matches of entire record in two separate tables - excel

I've been trying to find a formula that counts the matches between two tables (like an inner join) in Excel.
I have table1 with columns (ID, UserName, Function) and table2 with columns (UserName, Function, etc.). I need to count the exact matches between table1 (UserName & Function) and table2 (UserName & Function).
I tried SUMPRODUCT(--(table1[UserName:Function]=table2[UserName:Function])), but it seems to compare the tables column by column and returns an incorrect value. I also tried concatenating those columns within SUMPRODUCT, but it still doesn't work.
Is it possible to do this in one formula, or should I build a UDF with an SQL query?
Would it be possible to return the matching records and list them as an array using the FILTERXML formula?
sample data:
table1:
| ID | UserName | Function |
| -- | -------- | ----------|
| 1 | oopz | FCA4001 |
| 2 | oopz | FCA4002 |
| 3 | arronT | FCA4001 |
table2:
| UserName | Function |
| -------- | ----------|
| randalO | FCA4001 |
| oopz | FCA4001 |
| arronT | FCA4005 |
Thanks in advance!:)
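To make the intended result concrete, here is a small Python sketch of the count on the sample data. It is not an Excel solution, and the function name `count_pair_matches` is hypothetical. It assumes a "match" means a table1 row whose (UserName, Function) pair also appears somewhere in table2.

```python
def count_pair_matches(table1, table2):
    """Count table1 rows whose (UserName, Function) pair also occurs in table2."""
    # Build the set of pairs once so each lookup is cheap
    pairs_in_table2 = {(row["UserName"], row["Function"]) for row in table2}
    return sum(
        (row["UserName"], row["Function"]) in pairs_in_table2 for row in table1
    )


table1 = [
    {"ID": 1, "UserName": "oopz", "Function": "FCA4001"},
    {"ID": 2, "UserName": "oopz", "Function": "FCA4002"},
    {"ID": 3, "UserName": "arronT", "Function": "FCA4001"},
]
table2 = [
    {"UserName": "randalO", "Function": "FCA4001"},
    {"UserName": "oopz", "Function": "FCA4001"},
    {"UserName": "arronT", "Function": "FCA4005"},
]

print(count_pair_matches(table1, table2))
```

On the sample data only the pair (oopz, FCA4001) occurs in both tables. Comparing the two columns as one pair, rather than column by column, is what the SUMPRODUCT attempt in the question was missing.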

Filter filter criteria and then apply in countif statement in Excel

I have a table of filter criteria like this:
+----------+----------+------+------+------+
| Category | SpecName | Spec | Pass | Fail |
+----------+----------+------+------+------+
| A | S1 | 3 | | |
| A | S2 | 4 | | |
| B | S1 | 5 | | |
| C | S1 | 2 | | |
+----------+----------+------+------+------+
I have a table I want to apply the filter criteria to like this:
+----------+----+----+
| Category | S1 | S2 |
+----------+----+----+
| A | 5 | 3 |
| B | 4 | |
| A | 5 | 5 |
| C | 2 | |
| A | 2 | 6 |
+----------+----+----+
I want to fill the Pass and Fail columns in the filter criteria table with a count of items in second table with values >= the corresponding spec, like so.
+----------+----------+------+------+------+
| Category | SpecName | Spec | Pass | Fail |
+----------+----------+------+------+------+
| A | S1 | 3 | 1 | 2 |
| A | S2 | 4 | 1 | 2 |
| B | S1 | 5 | 0 | 1 |
| C | S1 | 2 | 1 | 0 |
+----------+----------+------+------+------+
Here are steps for how I might do it in a scripting language:
Filter first table to get all spec filter criteria for the Category on that row, as follows for the first row.
+----------+----------+------+
| Category | SpecName | Spec |
+----------+----------+------+
| A | S1 | 3 |
| A | S2 | 4 |
+----------+----------+------+
Copy table 2 to a variable iTable
+----------+----+----+
| Category | S1 | S2 |
+----------+----+----+
| A | 5 | 3 |
| B | 4 | |
| A | 5 | 5 |
| C | 2 | |
| A | 2 | 6 |
+----------+----+----+
For each spec name:
Find column in iTable with spec name
Filter spec name column in iTable by spec
After all filters applied, we would have:
+----------+----+----+
| Category | S1 | S2 |
+----------+----+----+
| A | 5 | 5 |
+----------+----+----+
Then just count the rows in iTable and assign to the cell in Pass column of the criteria table
Is this possible with Excel formulas?
If not, does anyone know how to do it with VBA?
Consider an alternative layout for your spec criteria. Expand your columns to suit your needs.
With each spec criterion in its own column, life gets really easy. You just need to adjust your formula to match the number of criteria you have.
Based on the table at the end for layout, place the following formula in D3 and copy down as required.
=SUMPRODUCT(($G$2:$G$6=A3)*($H$2:$H$6>=B3)*($I$2:$I$6>=C3))
That will give you a count of rows passing all criteria. SUMPRODUCT is a function that performs array-like calculations. It could be repeated in the next column, but to reduce the dependency on array calculation and potentially speed things up, depending on the amount of data to check, place the following at the top of the Fail column and copy down as required:
=COUNTIF($G$2:$G$6,A3)-D3
Basically it subtracts the passes from the total count. This assumes you can only have PASS and FAIL as options.
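The logic of the two formulas can be sketched in Python as follows. The function name `count_pass_fail` and the dictionary layout are hypothetical, and treating a missing value as 0 is an assumption of this sketch.

```python
def count_pass_fail(category, specs, rows):
    """Return (passed, failed) for one category.

    specs maps spec name -> minimum value; rows is a list of
    (category, {spec name: measured value}) tuples.
    """
    in_category = [values for cat, values in rows if cat == category]
    # A row passes only if every spec meets its minimum (the SUMPRODUCT part);
    # a missing measurement counts as 0 (assumption)
    passed = sum(
        all(values.get(name, 0) >= minimum for name, minimum in specs.items())
        for values in in_category
    )
    # Fail is the category total minus the passes (the COUNTIF part)
    return passed, len(in_category) - passed


rows = [
    ("A", {"S1": 5, "S2": 3}),
    ("B", {"S1": 4}),
    ("A", {"S1": 5, "S2": 5}),
    ("C", {"S1": 2}),
    ("A", {"S1": 2, "S2": 6}),
]
criteria = {"A": {"S1": 3, "S2": 4}, "B": {"S1": 5}, "C": {"S1": 2}}

for category, specs in criteria.items():
    print(category, count_pass_fail(category, specs, rows))
```

As with the formulas, there is one result per category, because all spec columns are checked together for each data row.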

Return unique column headers matching criteria

Consider the following data below:
| 1st | 2nd | A | B | C | D | E | F | G | H |
|-----|-----|---|---|---|---|---|---|---|---|
| y | x | | | 1 | | | | | |
| y | x | | | 1 | | | | | |
| y | x | | | | 1 | | | | |
| | x | 1 | | | | | | | |
| y | | 1 | 1 | 1 | | | | | |
| y | x | | | | | | 1 | | |
| y | | | | | | | | 1 | |
| | x | | | | | 1 | | | |
| | x | | | | | | | | 1 |
| y | x | | | | | | | | 1 |
What I wish to do is return all column headers (from A to H) that meet the following condition: the column should have a value of 1 in a row that has both a y and an x in the first two columns.
I already have a working array formula to do this, which is as follows:
{=INDEX($C$1:$J$1,SMALL(IF(($A$2:$A$11="y")*($B$2:$B$11="x")*($C$2:$J$11=1),COLUMN($C$1:$J$1)-COLUMN($B$1)),ROW(1:1)))}
However, while I drag this down, it returns two C values and one for D, F and H.
This is because there are two 1's under header C that meet the condition. What I want is to return unique values, so C should only be returned once. I tried to use MATCH and an additional COUNTIF instead of the SMALL function, but it returns an error, and the 'Evaluate formula' feature of Excel isn't helping. Below is the erroneous formula I experimented with:
{=INDEX($C$1:$J$1,MATCH(0,IF(($A$2:$A$11="y")*($B$2:$B$11="x")*($C$2:$J$11=1),COUNTIF($N$1:N1,COLUMN($C$1:$J$1)-COLUMN($B$1))),0))}
A workaround I am currently doing is to make my first formula a "helper column" and then create another formula based from the first formula's result to return only the unique values. However, the double array formula is heavily affecting the efficiency of Excel's calculation due to the huge volume of data I'm dealing with.
Any help/suggestions will do please (no VBA please, since I believe it's not needed here). Thanks!
Insert a helper row. I put it just under the header row, before the data. In this row you check whether there is a 1 that lines up with an x and a y. I assumed these to be any non-blank value, but if they are specific values, change the formula from <>"" to ="y" or =134 as the case may be. Place the following formula under the first column header you are interested in and copy right.
=--(0<SUMPRODUCT(($B$3:$B$12<>"")*($C$3:$C$12<>"")*(D3:D12=1)))
Then, where you want to generate your list in a column, without gaps and sorted in the order the headers appear from left to right, use the following formula and copy down as required:
=IFERROR(INDEX($1:$1,AGGREGATE(15,6,COLUMN($D$2:$K$2)/$D$2:$K$2,ROW(A1))),"")
The above formula returns a blank value when no column heading applies or when you have copied the formula down beyond the number of applicable columns.
The above formulas are based on the proof of concept image below. Adjust ranges to suit your needs.
Have you tried without the use of an array formula? I don't know how large the data actually is. But, this might be what you are looking for:
=IF(COUNTIFS($A:$A,"y",$B:$B,"x",C:C,1)>0,C1,"")
This assumes column A is "1st" and "H" is your last column, at column J. Try pasting the formula into K1 and dragging it right as far as S1.
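The per-column check that both answers rely on can be sketched in Python. The function name `matching_headers` and the row encoding (first flag, second flag, set of headers holding a 1) are hypothetical choices for this illustration.

```python
def matching_headers(headers, rows):
    """Return each header once, left to right, if any row with y and x has a 1 under it."""
    found = []
    for header in headers:
        # One test per column, so duplicates cannot occur
        if any(
            first == "y" and second == "x" and header in ones
            for first, second, ones in rows
        ):
            found.append(header)
    return found


headers = list("ABCDEFGH")
# Each row: value of "1st", value of "2nd", headers that contain a 1
rows = [
    ("y", "x", {"C"}),
    ("y", "x", {"C"}),
    ("y", "x", {"D"}),
    ("", "x", {"A"}),
    ("y", "", {"A", "B", "C"}),
    ("y", "x", {"F"}),
    ("y", "", {"G"}),
    ("", "x", {"E"}),
    ("", "x", {"H"}),
    ("y", "x", {"H"}),
]

print(matching_headers(headers, rows))
```

Because the test runs once per column rather than once per matching cell, C is reported a single time even though two rows qualify.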

Index/Match with Varied Offset

Running into some trouble performing an Index/Match where the offset rows could be spaced 1 row apart, 2 rows apart, or 3 rows apart. Below is an example of the setup:
Sheet1:
| A | B | C | D | E | F |
-------------------------------------------------
| | | | | Apple | |
-------------------------------------------------
| Ser1 | | | | | |
-------------------------------------------------
| | | | | Orange| |
-------------------------------------------------
| Ser2 | | Ser3 | | Ser4 | |
-------------------------------------------------
| Ser5 | | | | | |
Sheet2:
| A |
---------
| Ser1 |
---------
| Ser2 |
---------
| Ser3 |
---------
| Ser4 |
---------
| Ser5 |
I have a list of the serial numbers (ser1, ser2, etc) in another sheet, and I need to match values in Column E where the correct value is above the serial number by 1, 2, or 3 rows. As you can see, serial numbers could be in column A, C, or E.
Ex: Ser1 should match on Apple. Ser2, Ser3, ser4, and Ser5 should match on Orange.
I can't seem to figure out an Index/Match that works completely, since the offset at the end of the formula is always + or - a static number of rows.
Rough solution:
In your sheet Make a structure like this:
| A | Row | Lookup
---------
| Ser1 |
---------
| Ser2 |
---------
| Ser3 |
---------
| Ser4 |
---------
| Ser5 |
For column Row put:
=SUMPRODUCT((Sheet1!$A$1:$E$5=A2)*ROW(Sheet1!$A$1:$E$5))
This calculates the row in which the serial number occurs in your data range. Then, in the Lookup column, put:
=LOOKUP(2,(1/(INDEX(Sheet1!$E$1:$E$5,1,1):INDEX(Sheet1!$E$1:$E$5,B2-1,1)<>"")),Sheet1!$E$1:$E$5)
This formula looks up the last non-empty cell in column E within the range above the selected serial number.
This is a partial solution, as for "Ser5" you will receive "Ser4". To overcome that issue, you may perform an additional VLOOKUP on the results.
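The two steps (find the serial's row, then take the last non-empty value above it in column E) can be sketched in Python. The function name `lookup_above` and its `skip` parameter are hypothetical. The `skip` argument is a swapped-in alternative to the extra VLOOKUP mentioned above: it simply ignores cells that are themselves serial numbers.

```python
def lookup_above(grid, serial, value_col=4, skip=()):
    """Return the last non-empty value in value_col above the row holding serial."""
    # Step 1: locate the row containing the serial, in any column
    # (raises StopIteration if the serial is absent)
    row_idx = next(i for i, row in enumerate(grid) if serial in row)
    # Step 2: walk upwards from the row just above it
    for i in range(row_idx - 1, -1, -1):
        value = grid[i][value_col]
        if value and value not in skip:
            return value
    return None


# Sheet1 from the question, columns A to F (column E has index 4)
grid = [
    ["", "", "", "", "Apple", ""],
    ["Ser1", "", "", "", "", ""],
    ["", "", "", "", "Orange", ""],
    ["Ser2", "", "Ser3", "", "Ser4", ""],
    ["Ser5", "", "", "", "", ""],
]
serials = ["Ser1", "Ser2", "Ser3", "Ser4", "Ser5"]

for s in serials:
    print(s, lookup_above(grid, s), lookup_above(grid, s, skip=serials))
```

Without `skip`, the sketch reproduces the "Ser5" gives "Ser4" caveat. With `skip=serials`, the search continues upward past the serial cell.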

PySpark getting distinct values over a wide range of columns

I have data with a large number of custom columns whose content I poorly understand. The columns are named evar1 to evar250. What I'd like to get is a single table with all distinct values, a count of how often each occurs, and the name of the column.
------------------------------------------------
| columnname | value | count |
|------------|-----------------------|---------|
| evar1 | en-GB | 7654321 |
| evar1 | en-US | 1234567 |
| evar2 | www.myclient.com | 123 |
| evar2 | app.myclient.com | 456 |
| ...
The best way I can think of doing this feels terrible, as I believe I have to read this data once per column (there are actually about 400 such columns).
from pyspark.sql import functions as fn

i = 1
df_evars = None
while i <= 30:
    colname = "evar" + str(i)
    # One groupBy pass per column, tagged with the source column name
    df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows"))\
        .withColumn("colName", fn.lit(colname))
    if df_evars is not None:
        df_evars = df_evars.union(df_temp)
    else:
        df_evars = df_temp
    i += 1  # advance to the next column
display(df_evars)
Am I missing a better solution?
Update
This has been marked as a duplicate but the two responses IMO only solve part of my question.
I am looking at potentially very wide tables with potentially a large number of values. I need a simple result (i.e. 3 columns that show the source column, the value, and the count of that value in the source column).
The first of the responses only gives me an approximation of the number of distinct values. Which is pretty useless to me.
The second response seems less relevant than the first. To clarify, source data like this:
-----------------------
| evar1 | evar2 | ... |
|---------------|-----|
| A | A | ... |
| B | A | ... |
| B | B | ... |
| B | B | ... |
| ...
Should result in the output
--------------------------------
| columnname | value | count |
|------------|-------|---------|
| evar1 | A | 1 |
| evar1 | B | 3 |
| evar2 | A | 2 |
| evar2 | B | 2 |
| ...
Using melt borrowed from here:
from pyspark.sql.functions import col
melt(
    df.select([col(c).cast("string") for c in df.columns]),
    id_vars=[], value_vars=df.columns
).groupBy("variable", "value").count()
Adapted from the answer by user6910411.
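To show what melting and then grouping computes, here is a plain-Python sketch on the small example from the question. It does not use Spark, so it only illustrates the result, not the distributed execution. The function name `count_values_by_column` is hypothetical.

```python
from collections import Counter


def count_values_by_column(records):
    """Count every (column name, value) pair, like melt followed by groupBy().count()."""
    counts = Counter()
    for record in records:
        for column, value in record.items():
            # Cast to string so columns of different types share one value column
            counts[(column, str(value))] += 1
    return counts


records = [
    {"evar1": "A", "evar2": "A"},
    {"evar1": "B", "evar2": "A"},
    {"evar1": "B", "evar2": "B"},
    {"evar1": "B", "evar2": "B"},
]

for (column, value), n in sorted(count_values_by_column(records).items()):
    print(column, value, n)
```

Each record is visited once, which mirrors the point of the melt approach: the data is read in a single pass rather than once per column.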
