Splitting ID column in pandas dataframe to multiple columns - python-3.x

I have a pandas dataframe like the one below:
| ID     | Value |
+--------+-------+
| 1C16   | 34    |
| 1C1    | 45    |
| 7P.75  | 23    |
| 7T1    | 34    |
| 1C10DG | 34    |
+--------+-------+
I want to split the ID column (it's a string column) so that the result looks like the table below:
| ID     | Value | Code | Core | size |
+--------+-------+------+------+------+
| 1C16   | 34    | C    | 1    | 16   |
| 1C1    | 45    | C    | 1    | 1    |
| 7P.75  | 23    | P    | 7    | .75  |
| 7T1    | 34    | T    | 7    | 1    |
| 1C10DG | 34    | C    | 1    | 10   |
+--------+-------+------+------+------+
How can this be achieved? Thanks.

You can try .str.extract with the regex (?P<Core>\d+)(?P<Code>[A-Z])(?P<size>[.0-9]+) to capture the three parts:
df.ID.str.extract(r'(?P<Core>\d+)(?P<Code>[A-Z])(?P<size>[.0-9]+)')
#   Core Code size
# 0    1    C   16
# 1    1    C    1
# 2    7    P  .75
# 3    7    T    1
# 4    1    C   10
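For completeness, a minimal end-to-end sketch: it rebuilds the sample frame from the question, and attaching the extracted columns back onto the original frame is an assumption about the desired end result.

import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame({'ID': ['1C16', '1C1', '7P.75', '7T1', '1C10DG'],
                   'Value': [34, 45, 23, 34, 34]})

# Named groups become the column names of the extracted frame
parts = df['ID'].str.extract(r'(?P<Core>\d+)(?P<Code>[A-Z])(?P<size>[.0-9]+)')

# Attach the new columns to the original frame
print(df.join(parts))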

Use .str.extract() with multiple capturing groups and join the result:
df.join(
    df['ID'].str.extract(r'(\d)(\w)(\d+|\.\d+)').rename(
        columns={0: 'Core', 1: 'Code', 2: 'Size'}))

       ID  Value Core Code Size
0    1C16     34    1    C   16
1     1C1     45    1    C    1
2   7P.75     23    7    P  .75
3     7T1     34    7    T    1
4  1C10DG     34    1    C   10

Related

filter and get rows between the conditions in a dataframe

My DataFrame looks something like this:
+---------+------+
| Col1    | Col2 |
+---------+------+
| Start A | 1    |
| value 1 | 2    |
| value 2 | 3    |
| value 3 | 4    |
| value 5 | 5    |
| End A   | 6    |
| value 6 | 3    |
| value 7 | 4    |
| value 8 | 5    |
| Start B | 1    |
| value 1 | 2    |
| value 2 | 3    |
| value 3 | 4    |
| value 5 | 5    |
| End B   | 6    |
| value 6 | 3    |
| value 7 | 4    |
| value 8 | 5    |
| Start C | 1    |
| value 1 | 2    |
| value 2 | 3    |
| value 3 | 4    |
| value 5 | 5    |
| End C   | 6    |
+---------+------+
What I am trying to achieve: if the substrings Start and End are present, I want the rows between (and including) them.
Expected result:
+---------+------+
| Col1    | Col2 |
+---------+------+
| Start A | 1    |
| value 1 | 2    |
| value 2 | 3    |
| value 3 | 4    |
| value 5 | 5    |
| End A   | 6    |
| Start B | 1    |
| value 1 | 2    |
| value 2 | 3    |
| value 3 | 4    |
| value 5 | 5    |
| End B   | 6    |
| Start C | 1    |
| value 1 | 2    |
| value 2 | 3    |
| value 3 | 4    |
| value 5 | 5    |
| End C   | 6    |
+---------+------+
I tried the code from this question: How to filter dataframe columns between two rows that contain specific string in column?
m = df['Col1'].isin(['Start A', 'End A']).cumsum().eq(1)
df[m|m.shift()]
But this only returns the first Start/End pair, and it expects the exact strings.
Output:
+---------+------+
| Col1    | Col2 |
+---------+------+
| Start A | 1    |
| value 1 | 2    |
| value 2 | 3    |
| value 3 | 4    |
| value 5 | 5    |
| End A   | 6    |
+---------+------+
The answer you linked to was designed to work with a single pair of Start/End.
A more generic variant is to check the parity of the cumulative count of Start/End markers (assuming strictly alternating Start/End rows):
m1 = df['Col1'].str.match(r'Start|End').cumsum().mod(2).eq(1)
# boolean indexing
out = df[m1|m1.shift()]
Alternatively, use each Start as a flag to keep the following rows and each End as a flag to drop them. This, however, doesn't check the A/B/C label after Start/End the way @Quang's nice answer does:
# extract Start/End
s = df['Col1'].str.extract(r'^(Start|End)', expand=False)
# set flags and ffill
m1 = s.replace({'Start': True, 'End': False}).ffill()
# boolean slicing
out = df[m1|m1.shift()]
Output:
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
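A self-contained sketch of this parity approach on a reduced version of the sample data; shift(fill_value=False) is used here instead of a plain shift() to avoid the NaN in the first position:

import pandas as pd

# Reduced sample: two Start/End blocks with one stray row in between
df = pd.DataFrame({
    'Col1': ['Start A', 'value 1', 'End A', 'value 6', 'Start B', 'value 1', 'End B'],
    'Col2': [1, 2, 6, 3, 1, 2, 6],
})

# An odd cumulative count of Start/End markers means "inside a block"
m1 = df['Col1'].str.match(r'Start|End').cumsum().mod(2).eq(1)

# OR with the shifted mask so the closing End row is kept as well
print(df[m1 | m1.shift(fill_value=False)])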
Let's try:
# extract the label after `Start/End`
groups = df['Col1'].str.extract(r'(?:Start|End) (.*)', expand=False)
# keep rows with equal forward fill and backward fill
df[groups.bfill() == groups.ffill()]
Output:
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
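For reference, a minimal runnable version of the same idea on a reduced sample:

import pandas as pd

# Reduced sample: two Start/End blocks with one stray row in between
df = pd.DataFrame({
    'Col1': ['Start A', 'value 1', 'End A', 'value 6', 'Start B', 'value 1', 'End B'],
    'Col2': [1, 2, 6, 3, 1, 2, 6],
})

# Label after Start/End, NaN on all other rows
groups = df['Col1'].str.extract(r'(?:Start|End) (.*)', expand=False)

# A row lies inside a block when the nearest label above and below agree
print(df[groups.bfill() == groups.ffill()])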
One option is with an interval index:
Get the positions of the starts and ends:
starts = df.Col1.str.startswith("Start").to_numpy().nonzero()[0]
ends = df.Col1.str.startswith("End").to_numpy().nonzero()[0]
Build an interval index, and get matches where the index lies between Start and End:
intervals = pd.IntervalIndex.from_arrays(starts, ends, closed='both')
intervals = intervals.get_indexer(df.index)
Filter the original dataframe with the intervals, where intervals are not less than 0:
df.loc[intervals >= 0]
Col1 Col2
0 Start A 1
1 value 1 2
2 value 2 3
3 value 3 4
4 value 5 5
5 End A 6
9 Start B 1
10 value 1 2
11 value 2 3
12 value 3 4
13 value 5 5
14 End B 6
18 Start C 1
19 value 1 2
20 value 2 3
21 value 3 4
22 value 5 5
23 End C 6
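And a minimal runnable version of the interval-index approach; it assumes the default RangeIndex, since the row positions from nonzero() are compared against df.index:

import pandas as pd

# Reduced sample: two Start/End blocks with one stray row in between
df = pd.DataFrame({
    'Col1': ['Start A', 'value 1', 'End A', 'value 6', 'Start B', 'value 1', 'End B'],
    'Col2': [1, 2, 6, 3, 1, 2, 6],
})

# Positions of the Start and End rows
starts = df.Col1.str.startswith('Start').to_numpy().nonzero()[0]
ends = df.Col1.str.startswith('End').to_numpy().nonzero()[0]

# One closed interval per Start/End pair; rows outside every interval map to -1
intervals = pd.IntervalIndex.from_arrays(starts, ends, closed='both')
pos = intervals.get_indexer(df.index)

print(df.loc[pos >= 0])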

Spark Window Functions: calculated once per frame/range?

This is a question about Window Functions in Spark.
Assume I have this DF
DATE_S | ID | STR | VALUE
-------------------------
1 | 1 | A | 0.5
1 | 1 | A | 1.23
1 | 1 | A | -0.4
2 | 1 | A | 2.0
3 | 1 | A | -1.2
3 | 1 | A | 0.523
1 | 2 | A | 1.0
2 | 2 | A | 2.5
3 | 2 | A | 1.32
3 | 2 | A | -3.34
1 | 1 | B | 1.5
1 | 1 | B | 0.23
1 | 1 | B | -0.3
2 | 1 | B | -2.0
3 | 1 | B | 1.32
3 | 1 | B | 523.0
1 | 2 | B | 1.3
2 | 2 | B | -0.5
3 | 2 | B | 4.3243
3 | 2 | B | 3.332
This is just an example! Assume that there are many more DATE_S for each (ID, STR), many more IDs and STRs, and many more entries per (DATE_S, ID, STR). Obviously there are multiple values per combination of (DATE_S, ID, STR).
Now I do this:
val w = Window.partitionBy("ID", "STR").orderBy("DATE_S").rangeBetween(-N, -1)
df.withColumn("RESULT", function("VALUE").over(w))
where N might lead to the inclusion of a large range of rows, from 100 to 100000 and more, depending on ("ID", "STR")
The result will be something like this
DATE_S | ID | STR | VALUE | RESULT
----------------------------------
1 | 1 | A | 0.5 | R1
1 | 1 | A | 1.23 | R1
1 | 1 | A | -0.4 | R1
2 | 1 | A | 2.0 | R2
3 | 1 | A | -1.2 | R3
3 | 1 | A | 0.523 | R3
1 | 2 | A | 1.0 | R4
2 | 2 | A | 2.5 | R5
3 | 2 | A | 1.32 | R6
3 | 2 | A | -3.34 | R7
1 | 1 | B | 1.5 | R8
1 | 1 | B | 0.23 | R8
1 | 1 | B | -0.3 | R9
2 | 1 | B | -2.0 | R10
3 | 1 | B | 1.32 | R11
3 | 1 | B | 523.0 | R11
1 | 2 | B | 1.3 | R12
2 | 2 | B | -0.5 | R13
3 | 2 | B | 4.3243| R14
3 | 2 | B | 3.332 | R14
There are identical "RESULT"s because for every row with identical (DATE_S, ID, STR), the values that go into the calculation of "function" are the same.
My question is this:
Does Spark call "function" for each row (recalculating the same value multiple times), or does it calculate it once per range (frame) of values and paste the result onto all rows that fall in that range?
Thanks for reading :)
From your data, the result may not be the same if run twice, as far as I can see, since there is no way to order the rows deterministically. But let's leave that aside.
While there is codegen optimization, there is nothing to suggest that Spark checks whether the next row's frame covers the same set of data and reuses the previous result; I have never read of that type of optimization. There is fusing due to the lazy evaluation approach, but that is another matter. So, per row, it calculates again.
From a great source: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions-windows.html
... At its core, a window function calculates a return value for every
input row of a table based on a group of rows, called the frame. Every
input row can have a unique frame associated with it. ...
... In other words, when executed, a window function computes a value
for each and every row in a window (per window specification). ...
The biggest issue is having a suitable number of partitions for parallel processing, which is expensive, but this is big data. partitionBy("ID", "STR") is the key here, and that is a good thing.
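For illustration, a minimal PySpark sketch of the same window specification (the question uses Scala, and sum is assumed here as a stand-in for "function"):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny frame mirroring the question's schema
df = spark.createDataFrame(
    [(1, 1, "A", 0.5), (1, 1, "A", 1.23), (2, 1, "A", 2.0), (3, 1, "A", -1.2)],
    ["DATE_S", "ID", "STR", "VALUE"],
)

# Range frame over the previous N values of DATE_S within each (ID, STR) partition
N = 2
w = Window.partitionBy("ID", "STR").orderBy("DATE_S").rangeBetween(-N, -1)

# The aggregate is evaluated once per input row, over that row's frame
df.withColumn("RESULT", F.sum("VALUE").over(w)).show()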

How can I create a time variable with Stata or Excel?

I have a dataset that I am editing so it can be used for a time-series regression, since the time variable is not currently in a usable format. The existing data looks like this:
+----+------+-----------+-----------+-----------+-----------+-----
| id | size | 2017price | 2016price | 2015price | 2014price | ...
+----+------+-----------+-----------+-----------+-----------+-----
| 1  | 3    | 50        | 80        | 21        | 56        | ...
| 2  | 5    | 78        | 85        | 54        | 67        | ...
| 3  | 2    | 18        | 22        | 34        | 54        | ...
+----+------+-----------+-----------+-----------+-----------+-----
...
I would like to add a time variable that accounts for each year and gives the corresponding value as a price variable:
+----+------+------+-------+
| id | size | t    | price |
+----+------+------+-------+
| 1  | 3    | 2017 | 50    |
| 1  | 3    | 2016 | 80    |
| 1  | 3    | 2015 | 21    |
| 1  | 3    | 2014 | 56    |
| 2  | 5    | 2017 | 78    |
| 2  | 5    | 2016 | 85    |
| 2  | 5    | 2015 | 54    |
| 2  | 5    | 2014 | 67    |
| 3  | 2    | 2017 | 18    |
| 3  | 2    | 2016 | 22    |
| 3  | 2    | 2015 | 34    |
| 3  | 2    | 2014 | 54    |
+----+------+------+-------+
...
Is there a function in Stata or Excel that can do this automatically? I have data for 20 years with over 35,000 entries, so manual editing won't work.
Your data example as given is not quite suitable as Stata data, because variable names cannot begin with numeric characters.
That fixed, this is an exercise for the reshape command (not function).
clear
input id size price2017 price2016 price2015 price2014
1 3 50 80 21 56
2 5 78 85 54 67
3 2 18 22 34 54
end
reshape long price, i(id size) j(year)
sort id size year
list , sepby(id)
+--------------------------+
| id size year price |
|--------------------------|
1. | 1 3 2014 56 |
2. | 1 3 2015 21 |
3. | 1 3 2016 80 |
4. | 1 3 2017 50 |
|--------------------------|
5. | 2 5 2014 67 |
6. | 2 5 2015 54 |
7. | 2 5 2016 85 |
8. | 2 5 2017 78 |
|--------------------------|
9. | 3 2 2014 54 |
10. | 3 2 2015 34 |
11. | 3 2 2016 22 |
12. | 3 2 2017 18 |
+--------------------------+
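Not what was asked, but for comparison, the same reshape can be sketched in pandas with wide_to_long (the column names are assumed to carry the year as a numeric suffix, as in the Stata example above):

import pandas as pd

# Same toy data as the Stata example, with the year as a suffix on the price columns
df = pd.DataFrame({
    'id': [1, 2, 3],
    'size': [3, 5, 2],
    'price2017': [50, 78, 18],
    'price2016': [80, 85, 22],
    'price2015': [21, 54, 34],
    'price2014': [56, 67, 54],
})

# wide_to_long is the pandas analogue of Stata's reshape long
long_df = (pd.wide_to_long(df, stubnames='price', i=['id', 'size'], j='year')
             .reset_index()
             .sort_values(['id', 'year']))
print(long_df)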

Multiple Lookup Criteria

I have the data below in Excel. I want to return the number of inactive months and the inactive months themselves.
ACTIVITY MONTH
+---------+-------+-------+-------+-------+-----------+---------------------+-----------------+
| User ID | Jan17 | Feb17 | Mar17 | Apr17 | Reg Month | No. Inactive months | Months Inactive |
+---------+-------+-------+-------+-------+-----------+---------------------+-----------------+
| 1       | 5     | 38    | 0     | 60    | Jan17     |                     |                 |
| 2       | 0     | 242   | 203   | 20    | Feb17     |                     |                 |
| 3       | 30    | 0     | 0     | 30    | Jan17     |                     |                 |
| 4       | 0     | 0     | 0     | 40    | Apr17     |                     |                 |
| 5       | 0     | 0     | 16    | 0     | Mar17     |                     |                 |
+---------+-------+-------+-------+-------+-----------+---------------------+-----------------+
To count the inactive months, you can use the following:
+---+------+--------+--------+--------+--------+--+-----------------+
| | A | B | C | D | E | F| G |
+---+------+--------+--------+--------+--------+--+-----------------+
| 1 | User | Jan 17 | feb-17 | mar-17 | apr-17 | | Inactive months |
| 2 | 1 | 5 | 38 | 0 | 60 | | 1 |
| 3 | 2 | 0 | 242 | 203 | 20 | | 1 |
| 4 | 3 | 30 | 0 | 0 | 30 | | 2 |
| 5 | 4 | 0 | 0 | 0 | 40 | | 3 |
| 6 | 5 | 0 | 0 | 16 | 0 | | 3 |
+---+------+--------+--------+--------+--------+--+-----------------+
where cell G2 contains the formula =COUNTIF(B2:E2,0).
Showing the list of inactive months themselves is a little harder.
The point is that you have to decide how you want to see these results.
The easiest way is to use conditional formatting and color the cells containing zero (but this is not very useful). Another way would be to transpose the table and filter the columns containing zero. Yet another would be to use a VBA macro.

Insert cell data into one column (from A1 to An)

I have been trying to solve this using tutorials from Google (everywhere), but I can't find any answer; I hope I can find it here.
Here is my data:
      A   B   C
  +-----+---+----+
1 | 123 | 4 | 5  |
2 | 678 | 9 | 10 |
  +-----+---+----+
The result I need:
      A   B  C
  +-----+--+---+
1 | 123 |  |   |
2 | 678 |  |   |
3 | 4   |  |   |
4 | 5   |  |   |
5 | 9   |  |   |
6 | 10  |  |   |
  +-----+--+---+
or in another order, like:
123
4
5
678
...
Does anyone know how to solve this?
