I'd like to transpose data that is in two columns (A:B). For example, in the input below (input 1), the values A, B, C and D each appear three times in column A.
With this input I use the following formula and get the transposition correctly, as in the image below:
=INDEX($B$1:$B$12,4*ROWS(D$2:D2)+COLUMNS($D2:D2)-4)
INPUT1
+---+----+
| A | 2 |
+---+----+
| B | 3 |
+---+----+
| C | 4 |
+---+----+
| D | 1 |
+---+----+
| A | 6 |
+---+----+
| B | 12 |
+---+----+
| C | 4 |
+---+----+
| D | 76 |
+---+----+
| A | 1 |
+---+----+
| B | 2 |
+---+----+
| C | 37 |
+---+----+
| D | 9 |
+---+----+
But if the input changes (input 2) in such a way that A and B appear fewer times than C and D, my output is incorrect.
INPUT2
+---+----+
| A | 2 |
+---+----+
| B | 3 |
+---+----+
| C | 4 |
+---+----+
| D | 1 |
+---+----+
| C | 4 |
+---+----+
| D | 76 |
+---+----+
| C | 37 |
+---+----+
| D | 9 |
+---+----+
| A | 47 |
+---+----+
| B | 2 |
+---+----+
| C | 37 |
+---+----+
| D | 9 |
+---+----+
The image below shows the incorrect output and the expected output.
Thanks in advance for any help.
Here is a fairly simple approach based on counting how many cells have been filled in so far:
=IF(INDEX($A:$A,COUNT($D$1:$G1)+COUNT($C2:C2)+1)=D$1,
INDEX($B:$B,COUNT($D$1:$G1)+COUNT($C2:C2)+1),"")
copied down and across starting from D2. Assumes that a blank column is available in column C.
If you want to make it more dynamic but also want it to work for earlier versions of Excel than Microsoft 365, it gets a bit ugly unfortunately. You can use a conventional way of listing out the unique values in column A in alphabetical order to get the headers:
=IFERROR(INDEX($A:$A, MATCH(SMALL(IF((COUNTIF(C$1:$C1, $A$1:INDEX($A:$A,COUNTA($A:$A)))=0), COUNTIF($A$1:INDEX($A:$A,COUNTA($A:$A)), "<"&$A$1:INDEX($A:$A,COUNTA($A:$A))), ""), 1), COUNTIF($A$1:INDEX($A:$A,COUNTA($A:$A)), "<"&$A$1:INDEX($A:$A,COUNTA($A:$A))), 0)),"")
adapted from this, pulled across as required (say, to column Z).
Then a slight modification to the main formula to avoid zeroes appearing under the blank headers:
=IF(AND(INDEX($A:$A,COUNT($D$1:$Z1)+COUNT($C2:C2)+1)=D$1,D$1<>""),
INDEX($B:$B,COUNT($D$1:$Z1)+COUNT($C2:C2)+1),"")
Copied down and across as far as column Z.
I am ingesting a dataframe and I want to append a monotonically increasing column that increases whenever another column matches a certain value. For example I have the following table
+------+-------+
| Col1 | Col2 |
+------+-------+
| B | 543 |
| A | 1231 |
| B | 14234 |
| B | 34234 |
| B | 3434 |
| A | 43242 |
| B | 43242 |
| B | 56453 |
+------+-------+
I would like to append a column that increases in value whenever "A" is present in Col1. So the result would look like this:
+------+-------+------+
| Col1 | Col2 | Col3 |
+------+-------+------+
| B | 543 | 0 |
| A | 1231 | 1 |
| B | 14234 | 1 |
| B | 34234 | 1 |
| B | 3434 | 1 |
| A | 43242 | 2 |
| B | 43242 | 2 |
| B | 56453 | 2 |
+------+-------+------+
Keeping the initial order is important.
I tried zippering but that doesn't seem to produce the right result. Splitting it up into individual seqs manually and doing it that way is not going to be performant enough (think 100+ GB tables).
I looked into trying this with a map function that would keep a counter somewhere but couldn't get that to work.
Any advice or pointer in the right direction would be greatly appreciated.
Spark does not provide any default function to achieve this kind of functionality directly.
I would most probably do it this way:
// inputDF contains Col1 | Col2 (assumes import spark.implicits._ for toDF)
val df = inputDF.select("Col1").distinct()
  .rdd.zipWithIndex()
  .map { case (row, idx) => (row.getString(0), idx) }
  .toDF("Col1", "Col3")
val finalDF = inputDF.join(df, Seq("Col1"), "left")
But the problem I can see here is the join, which will result in a shuffle.
You can also check Spark's other auto-increment APIs.
Use a window and sum the value 1 over the window whenever Col1 = A.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy().rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('Col3', f.sum(f.when(f.col('Col1') == f.lit('A'), 1).otherwise(0)).over(w)).show()
+----+-----+----+
|Col1| Col2|Col3|
+----+-----+----+
| B| 543| 0|
| A| 1231| 1|
| B|14234| 1|
| B|34234| 1|
| B| 3434| 1|
| A|43242| 2|
| B|43242| 2|
| B|56453| 2|
+----+-----+----+
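Note that a running sum like this is only deterministic if Spark knows the row order; if the data carries an explicit ordering column (called idx here purely as a hypothetical example), the window can be ordered by it:
import pyspark.sql.functions as f
from pyspark.sql import Window

# 'idx' is a hypothetical column assumed to preserve the original row order
w = Window.orderBy('idx').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('Col3', f.sum(f.when(f.col('Col1') == 'A', 1).otherwise(0)).over(w)).show()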
I need to filter a table. The challenge for me is that the filter information (column names, number of columns, as well as filter values) can change.
After doing some research I think List.Generate() could help me here. The idea is to create a loop that, in each pass, applies one filter condition that is dynamically passed to the loop.
Unfortunately I don't understand List.Generate() well enough to build this myself. Hence any help would be greatly appreciated!
Here is my setup:
I have one table with data (DATASTART)
+---+---+---+
| A | B | C |
+---+---+---+
| 1 | 1 | 2 |
| 1 | 2 | 2 |
| 1 | 3 | 2 |
| 2 | 4 | 3 |
| 2 | 5 | 3 |
| 2 | 6 | 3 |
+---+---+---+
and one table (FILTER) with information about which columns of DATASTART should be filtered and the corresponding filter values.
+--------+--------+
| Column | Filter |
+--------+--------+
| A | 1 |
| B | 2 |
+--------+--------+
With static Power Query code
= Table.SelectRows(DATASTART, each ([A] = 1) and ([B] = 2))
the result would be this table (DATARESULT).
+---+---+---+
| A | B | C |
+---+---+---+
| 1 | 2 | 2 |
+---+---+---+
How about this?
let
    // build one predicate per FILTER row: the DATASTART record must match every Column/Filter pair
    condition = (record as record) as logical =>
        List.AllTrue(
            List.Transform(
                Table.ToRecords(FILTER),
                each Record.Field(record, [Column]) = [Filter]
            )
        )
in
    Table.SelectRows(DATASTART, condition)
I have two Spark dataframes that share the same ID column:
df1:
+------+---------+---------+
|ID | Name1 | Name2 |
+------+---------+---------+
| 1 | A | B |
| 2 | C | D |
| 3 | E | F |
+------+---------+---------+
df2:
+------+-------+
|ID | key |
+------+-------+
| 1 | w |
| 1 | x |
| 2 | y |
| 3 | z |
+------+-------+
Now, I want to create a new column in df1 that contains all key values denoted in df2. So, I aim for the result:
+------+---------+---------+---------+
|ID | Name1 | Name2 | keys |
+------+---------+---------+---------+
| 1 | A | B | w,x |
| 2 | C | D | y |
| 3 | E | F | z |
+------+---------+---------+---------+
Ultimately, I want to find a solution for an arbitrary number of keys.
My attempt in PySpark:
def get_keys(id):
    x = df2.where(df2.ID == id).select('key')
    return x
df_keys = df1.withColumn("keys", get_keys(col('ID')))
In the above code, x is a dataframe. Since the second argument of the .withColumn function needs to be a Column-type variable, I am not sure how to convert x correctly.
You are looking for the collect_list function.
from pyspark.sql.functions import collect_list
df3 = df1.join(df2, df1.ID == df2.ID).drop(df2.ID)
df3.groupBy('ID','Name1','Name2').agg(collect_list('key').alias('keys')).show()
#+---+-----+-----+------+
#| ID|Name1|Name2|  keys|
#+---+-----+-----+------+
#|  1|    A|    B|[w, x]|
#|  3|    E|    F|   [z]|
#|  2|    C|    D|   [y]|
#+---+-----+-----+------+
If you want only unique keys, you can use collect_set instead.
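For example, a minimal sketch of the same aggregation using collect_set (it drops duplicate keys per ID):
from pyspark.sql.functions import collect_set

# same grouping as above, but each key is kept only once per ID
df3.groupBy('ID','Name1','Name2').agg(collect_set('key').alias('keys')).show()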
I am working on an assignment where we have two tables with the same/different columns. If a record of table A has some column values as null, then they have to be updated with the corresponding values from table B, and vice versa.
table A
id | code | type
1 | null | A
2 | null | null
3 | 123 | C
table B
id | code | type
1 | 456 | A
2 | 789 | A1
3 | null | C
What I have worked out so far:
Dataset<Row> df1 = spark.read().format("csv").option("header", "true").load("C:\\Users\\System2\\Videos\\1199_data\\d1_1.csv");
Dataset<Row> df2 = spark.read().format("csv").option("header", "true").load("C:\\Users\\System2\\Videos\\1199_data\\d2_1.csv");
df1.as("a").join(df2.as("b"))
    .where("a.id == b.id")
    .withColumn("a.code",
        functions.when(
            df1.col("code").isNull(),
            df2.col("code")))
    .show();
Required Output
table C
id | code | type
1 | 456 | A
2 | 789 | A1
3 | 123 | C
You can use the coalesce function?
df1.join(df2, "id")
.select(df1("id"),
coalesce(df1("code"),
df2("code")).as("code"),
coalesce(df1("type"),
df2("type")).as("type"))
And output:
+---+----+----+
| id|code|type|
+---+----+----+
| 1| 456| A|
| 2| 789| A1|
| 3| 123| C|
+---+----+----+
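For reference, a rough PySpark sketch of the same coalesce-based join (assuming the same column names):
from pyspark.sql.functions import coalesce

# prefer table A's value, fall back to table B's value when A is null
result = (df1.join(df2, "id")
    .select(df1["id"],
            coalesce(df1["code"], df2["code"]).alias("code"),
            coalesce(df1["type"], df2["type"]).alias("type")))
result.show()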
Following this question, I would like to ask.
I have 2 tables:
The first table - MajorRange
row | From | To | Group ....
-----|--------|---------|---------
1 | 1200 | 1500 | A
2 | 2200 | 2700 | B
3 | 1700 | 1900 | C
4 | 2100 | 2150 | D
...
The second table - SubRange
row | From | To | Group ....
-----|--------|---------|---------
1 | 1208 | 1300 | E
2 | 1400 | 1600 | F
3 | 1700 | 2100 | G
4 | 2100 | 2500 | H
...
The output table should contain all the SubRange groups that overlap the MajorRange groups. For the example above, the result table is:
row | Major | Sub |
-----|--------|------|-
1 | A | E |
2 | A | F |
3 | B | H |
4 | C | G |
5 | D | H |
If there is no overlap between the ranges, the Major group will not appear.
Both tables are big data tables. How can I do this using Hive/Spark in the most efficient way?
With Spark, maybe a non-equi join like this?
val join_expr = major_range("From") < sub_range("To") && major_range("To") > sub_range("From")
(major_range.join(sub_range, join_expr)
  .select(
    monotonically_increasing_id().as("row"),
    major_range("Group").as("Major"),
    sub_range("Group").as("Sub")
  )
).show
+---+-----+---+
|row|Major|Sub|
+---+-----+---+
| 0| A| E|
| 1| A| F|
| 2| B| H|
| 3| C| G|
| 4| D| H|
+---+-----+---+
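If PySpark is preferred, a rough equivalent of the same non-equi join might look like this (a sketch assuming the two tables are already loaded as major_range and sub_range dataframes):
from pyspark.sql import functions as F

# a sub-range overlaps a major range when the intervals intersect
join_expr = (major_range["From"] < sub_range["To"]) & (major_range["To"] > sub_range["From"])
(major_range.join(sub_range, join_expr)
    .select(
        F.monotonically_increasing_id().alias("row"),
        major_range["Group"].alias("Major"),
        sub_range["Group"].alias("Sub"))
    .show())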