I have a dataframe with a large number of columns that I would like to consolidate into more rows and fewer columns. It has a structure similar to the example below:
| 1_a | 1_b | 1_c | 2_a | 2_b | 2_c | d |
|-----|-----|-----|-----|-----|-----|-----|
| 1 | 2 | 3 | 1 | 2 | 6 | z |
| 2 | 2 | 2 | 3 | 2 | 5 | z |
| 3 | 2 | 1 | 4 | 1 | 4 | z |
I want to reshape it so the column groups are combined into rows, like below:
| 1 | 2 | letter | d |
|---|---|--------|---|
| 1 | 1 | a | z |
| 2 | 3 | a | z |
| 3 | 4 | a | z |
| 2 | 2 | b | z |
| 2 | 2 | b | z |
| 2 | 1 | b | z |
| 3 | 6 | c | z |
| 2 | 5 | c | z |
| 1 | 4 | c | z |
I have created a new dataframe with the new headings, but am unsure how to map my original headings to the new headings when appending.
Thanks
Try
import pandas as pd
df = df.set_index('d')
# split column names like '1_a' into a MultiIndex, then stack the letter level into rows
df.columns = pd.MultiIndex.from_tuples([tuple(c.split('_')) for c in df.columns])
df = df.stack().reset_index().rename(columns={'level_1': 'letter'})
d letter 1 2
0 z a 1 1
1 z b 2 2
2 z c 3 6
3 z a 2 3
4 z b 2 2
5 z c 2 5
6 z a 3 4
7 z b 2 1
8 z c 1 4
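If you want the exact column order from the question, one optional extra step on the result above does it (note the first-level names are strings after the split):
df = df[['1', '2', 'letter', 'd']]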
For the most part, if you need to dynamically select column names, you probably just need to write a Python loop. Run through each letter manually, then concat the pieces together:
dfs = []
for letter in ('a', 'b', 'c'):
    # copy() avoids SettingWithCopyWarning when adding columns to the slice
    group = df[['d']].copy()
    group['1'] = df['1_' + letter]
    group['2'] = df['2_' + letter]
    group['letter'] = letter
    dfs.append(group)
result = pd.concat(dfs)
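One optional detail: pd.concat keeps each piece's original row index, so the result has repeated index labels. If you want a clean 0..n-1 index:
result = result.reset_index(drop=True)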
I have a pandas dataframe like below:
| ID | Value |
+----------+--------+
|1C16 | 34 |
|1C1 | 45 |
|7P.75 | 23 |
|7T1 | 34 |
|1C10DG | 34 |
+----------+--------+
I want to split the ID column (it's a string column) in a way that looks like below:
| ID | Value | Code | Core |size |
+----------+--------+-------+------+-----+
|1C16 | 34 | C | 1 | 16 |
|1C1 | 45 | C | 1 | 1 |
|7P.75 | 23 | P | 7 | .75 |
|7T1 | 34 | T | 7 | 1 |
|1C10DG | 34 | C | 1 | 10 |
+----------+--------+-------+------+-----+
So how can this be achieved? Thanks
You can try .str.extract with the regex (?P<Core>\d+)(?P<Code>[A-Z])(?P<size>[.0-9]+) to capture the patterns (group names chosen to match your expected output, where Core is the number and Code is the letter):
df.ID.str.extract(r'(?P<Core>\d+)(?P<Code>[A-Z])(?P<size>[.0-9]+)')
#  Core Code size
#0 1 C 16
#1 1 C 1
#2 7 P .75
#3 7 T 1
#4 1 C 10
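If you also want the captures attached to the original frame with numeric types, a small optional follow-up (using the same pattern; pd.to_numeric handles values like .75):
out = df.join(df.ID.str.extract(r'(?P<Core>\d+)(?P<Code>[A-Z])(?P<size>[.0-9]+)'))
out[['Core', 'size']] = out[['Core', 'size']].apply(pd.to_numeric)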
Use .str.extract() with multiple capturing groups and join:
df.join(
    df['ID'].str.extract(r'(\d)(\w)(\.\d+|\d+)').rename(
        columns={0: 'Core', 1: 'Code', 2: 'Size'}))
ID Value Core Code Size
1 1C16 34.0 1 C 16
2 1C1 45.0 1 C 1
3 7P.75 23.0 7 P .75
4 7T1 34.0 7 T 1
5 1C10DG 34.0 1 C 10
This is a question about Window Functions in Spark.
Assume I have this DF
DATE_S | ID | STR | VALUE
-------------------------
1 | 1 | A | 0.5
1 | 1 | A | 1.23
1 | 1 | A | -0.4
2 | 1 | A | 2.0
3 | 1 | A | -1.2
3 | 1 | A | 0.523
1 | 2 | A | 1.0
2 | 2 | A | 2.5
3 | 2 | A | 1.32
3 | 2 | A | -3.34
1 | 1 | B | 1.5
1 | 1 | B | 0.23
1 | 1 | B | -0.3
2 | 1 | B | -2.0
3 | 1 | B | 1.32
3 | 1 | B | 523.0
1 | 2 | B | 1.3
2 | 2 | B | -0.5
3 | 2 | B | 4.3243
3 | 2 | B | 3.332
This is just an example! Assume that there are many more DATE_S for each (ID, STR), many more IDs and STRs, and many more entries per (DATE_S, ID, STR). Obviously there are multiple values per combination (DATE_S, ID, STR).
Now I do this:
val w = Window.partitionBy("ID", "STR").orderBy("DATE_S").rangeBetween(-N, -1)
df.withColumn("RESULT", function("VALUE").over(w))
where N might lead to the inclusion of a large range of rows, from 100 to 100000 and more, depending on ("ID", "STR")
The result will be something like this
DATE_S | ID | STR | VALUE | RESULT
----------------------------------
1 | 1 | A | 0.5 | R1
1 | 1 | A | 1.23 | R1
1 | 1 | A | -0.4 | R1
2 | 1 | A | 2.0 | R2
3 | 1 | A | -1.2 | R3
3 | 1 | A | 0.523 | R3
1 | 2 | A | 1.0 | R4
2 | 2 | A | 2.5 | R5
3 | 2 | A | 1.32 | R6
3 | 2 | A | -3.34 | R7
1 | 1 | B | 1.5 | R8
1 | 1 | B | 0.23 | R8
1 | 1 | B | -0.3 | R9
2 | 1 | B | -2.0 | R10
3 | 1 | B | 1.32 | R11
3 | 1 | B | 523.0 | R11
1 | 2 | B | 1.3 | R12
2 | 2 | B | -0.5 | R13
3 | 2 | B | 4.3243| R14
3 | 2 | B | 3.332 | R14
There are identical "RESULT"s because for every row with identical (DATE_S, ID, STR), the values that go into the calculation of "function" are the same.
My question is this:
Does Spark call "function" for each ROW (recalculating the same value multiple times), or does it calculate it once per range (frame?) of values and just paste that result onto all rows that fall in the range?
Thanks for reading :)
From your data, the result may not be the same if run twice, from what I can see, because there is no ordering that breaks ties within a DATE_S. But let's leave that aside.
While there is codegen optimization, nowhere is it documented that Spark checks, in the way you describe, whether the next row's frame contains the same set of data as the previous one's. I have never read of that type of optimization. There is fusing due to the lazy evaluation approach, but that is another matter. So, per row, it calculates again.
From a great source: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-windows.html
... At its core, a window function calculates a return value for every
input row of a table based on a group of rows, called the frame. Every
input row can have a unique frame associated with it. ...
... In other words, when executed, a window function computes a value
for each and every row in a window (per window specification). ...
The biggest issue is having a suitable number of partitions for parallel processing, which is expensive, but this is big data. partitionBy("ID", "STR") is the clue here, and that is a good thing.
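For reference, a rough PySpark sketch of the same window spec (avg is only a placeholder for your "function", and the data is a small subset of the example above) that lets you inspect the physical plan yourself with explain():
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
N = 2  # placeholder frame size

# a small subset of the example data from the question
df = spark.createDataFrame(
    [(1, 1, "A", 0.5), (1, 1, "A", 1.23), (2, 1, "A", 2.0), (3, 1, "A", -1.2)],
    ["DATE_S", "ID", "STR", "VALUE"])

# same window spec as the Scala snippet above
w = Window.partitionBy("ID", "STR").orderBy("DATE_S").rangeBetween(-N, -1)

# avg stands in for "function"; explain() shows the Window operator in the plan
df.withColumn("RESULT", F.avg("VALUE").over(w)).explain()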
I have a data frame with a multilevel index. I would like to sort this data frame based on a specific column and extract the first n rows for each group of the first index, but n is different for each group.
For example:
| Index1| Index2| Sort_In_descending_order | How_manyRows_toChoose |
-----------------------------------------------------------------------
| 1 | 20 | 3 | 2 |
| | 40 | 2 | 2 |
| | 10 | 1 | 2 |
| 2 | 20 | 2 | 1 |
| | 50 | 1 | 1 |
the result should look like this:
| Index1| Index2| Sort_In_descending_order | How_manyRows_toChoose |
-----------------------------------------------------------------------
| 1 | 20 | 3 | 2 |
| | 40 | 2 | 2 |
| 2 | 20 | 2 | 1 |
I got this far:
df.groupby(level=[0,1]).sum().sort_values(['Index1','Sort_In_descending_order'], ascending=False).groupby('Index1').head(2)
However, .head(2) picks 2 elements from each group, independent of the number in the column "How_manyRows_toChoose".
Some piece of code would be great!
Thank you!
Use a lambda function in GroupBy.apply with head, and add the parameter group_keys=False to avoid duplicated index values:
#original code
df = (df.groupby(level=[0,1])
        .sum()
        .sort_values(['Index1','Sort_In_descending_order'], ascending=False))
df = (df.groupby('Index1', group_keys=False)
        .apply(lambda x: x.head(x['How_manyRows_toChoose'].iat[0])))
print (df)
Sort_In_descending_order How_manyRows_toChoose
Index1 Index2
1 20 3 2
40 2 2
2 20 2 1
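An equivalent boolean-mask sketch, in case you want to avoid apply. Here df_sorted is a hypothetical name for the frame produced by the first block above (after the sort, before the apply step); cumcount numbers rows within each Index1 group in the current order:
pos = df_sorted.groupby('Index1').cumcount()            # 0-based position within each Index1 group
df_out = df_sorted[pos < df_sorted['How_manyRows_toChoose']]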
I have a pandas dataframe which has a column "user" containing categorical values (a, b, c, d). I only care about the ordering of two categories, in ascending order (a, d). So (a, b, c, d) and (a, c, b, d) are both fine for me.
How to create this ordering is the first part of the question.
Secondly, I have another column which contains timestamps. I want to order my rows first by "timestamps" and then, for rows with the same timestamp, sort by the above ordering of categorical values.
Let's say my data frame looks like this.
+-----------+------+
| Timestamp | User |
+-----------+------+
| 1 | b |
| 2 | d |
| 1 | a |
| 1 | c |
| 1 | d |
| 2 | a |
| 2 | b |
+-----------+------+
First I want this kind of sorting to happen:
+-----------+------+
| Timestamp | User |
+-----------+------+
| 1 | b |
| 1 | a |
| 1 | c |
| 1 | d |
| 2 | d |
| 2 | a |
| 2 | b |
+-----------+------+
Followed by the categorical ordering of "user"
+-----------+------+
| Timestamp | User |
+-----------+------+
| 1 | a |
| 1 | b |
| 1 | c |
| 1 | d |
| 2 | a |
| 2 | b |
| 2 | d |
+-----------+------+
OR
+-----------+------+
| Timestamp | User |
+-----------+------+
| 1 | a |
| 1 | c |
| 1 | b |
| 1 | d |
| 2 | a |
| 2 | b |
| 2 | d |
+-----------+------+
As you can see the "c" and "b"'s order do not matter.
You can specify the order in an ordered Categorical via categories and then call DataFrame.sort_values:
df['User'] = pd.Categorical(df['User'], ordered=True, categories=['a','b','c','d'])
df = df.sort_values(['Timestamp','User'])
print (df)
Timestamp User
2 1 a
0 1 b
3 1 c
4 1 d
5 2 a
6 2 b
1 2 d
If there are many values of User, it is possible to create the categories dynamically:
import numpy as np
vals = ['a', 'd']
cats = vals + np.setdiff1d(df['User'], vals).tolist()
print (cats)
['a', 'd', 'b', 'c']
df['User'] = pd.Categorical(df['User'], ordered=True, categories=cats)
df = df.sort_values(['Timestamp','User'])
print (df)
Timestamp User
2 1 a
4 1 d
0 1 b
3 1 c
5 2 a
1 2 d
6 2 b
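The same ordering can also be packaged as a reusable dtype, if you prefer (a small equivalent sketch using the cats list built above; values not listed in categories become NaN):
from pandas.api.types import CategoricalDtype

user_dtype = CategoricalDtype(categories=cats, ordered=True)   # same categories as above
df['User'] = df['User'].astype(user_dtype)
df = df.sort_values(['Timestamp', 'User'])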
I have a sheet something like this
A   B   C   D
1   2   2
2   3   3
4   5   5
5   7   9
        10
        11
        12
I would like column D to show the values of col A if the col B value exists in col C
Example:
A B C D
1 2 2 1
5 7 9 -
D would have a value of 1 since the col B value is in col C, and in row 4 col D would have no value at all
Yes, A, B, C, D are column labels, as per the comments
You don't need VLOOKUP here. I think MATCH is a better choice.
Try this:
D1:D4 =IF(ISERROR(MATCH(B1,$C$1:$C$7,0)),"",A1)
(This assumes that your numerical values start in row 1.)
The output looks like this:
+---+---+---+----+---+
| | A | B | C | D |
+---+---+---+----+---+
| 1 | 1 | 2 | 2 | 1 |
| 2 | 2 | 3 | 3 | 2 |
| 3 | 4 | 5 | 5 | 4 |
| 4 | 5 | 7 | 9 | |
| 5 | | | 10 | |
| 6 | | | 11 | |
| 7 | | | 12 | |
+---+---+---+----+---+
You can do this with a combination of VLOOKUP, OFFSET and IFERROR like so:
=IFERROR(IF(VLOOKUP(B2,C:C,1,0)=B2,OFFSET(B2,0,-1)),"-")
OFFSET used with the -1 parameter will return the cell one column to the left, so you do not need to rearrange the columns in your actual worksheet. IFERROR will check whether the lookup failed and return the specified default value. Finally, you can also specify the exact range to be looked up, in this case as
VLOOKUP(B2,$C$2:$C$8,1,0)