Dataframe to pivot using pandas - python-3.x

I am converting my DataFrame to a pivot table.
Here's my DataFrame:
+----+---+-----+------+------+
|    | A | B   | C    | D    |
+----+---+-----+------+------+
| 0  | a | OK  | one  | col1 |
| 1  | b | OK  | two  | col1 |
| 2  | c | OK  | two  | col2 |
| 3  | d | OK  | four | NaN  |
| 4  | e | OK  | five | NaN  |
| 5  | f | OK  | six  | NaN  |
| 6  | g | NaN | NaN  | col3 |
| 7  | h | NaN | NaN  | col4 |
| 8  | i | NaN | NaN  | col5 |
+----+---+-----+------+------+
I'm doing:
pivot_data = df.pivot(index='C', columns = 'D', values = 'B')
This is my output.
+------+-----+------+------+------+------+------+
|      | NaN | col1 | col2 | col3 | col4 | col5 |
+------+-----+------+------+------+------+------+
| NaN  | NaN | NaN  | NaN  | NaN  | NaN  | NaN  |
| four | OK  | NaN  | NaN  | NaN  | NaN  | NaN  |
| six  | OK  | NaN  | NaN  | NaN  | NaN  | NaN  |
| one  | NaN | OK   | NaN  | NaN  | NaN  | NaN  |
| two  | NaN | OK   | OK   | NaN  | NaN  | NaN  |
| five | OK  | NaN  | NaN  | NaN  | NaN  | NaN  |
+------+-----+------+------+------+------+------+
My desired output is below. When I use pivot_table instead of pivot, the rows and columns whose values are all NaN are dropped, but it is important to keep all of those rows/columns. How can I achieve the desired output?
+------+------+------+------+------+------+
|      | col1 | col2 | col3 | col4 | col5 |
+------+------+------+------+------+------+
| four | NaN  | NaN  | NaN  | NaN  | NaN  |
| six  | NaN  | NaN  | NaN  | NaN  | NaN  |
| one  | OK   | NaN  | NaN  | NaN  | NaN  |
| two  | OK   | OK   | NaN  | NaN  | NaN  |
| five | NaN  | NaN  | NaN  | NaN  | NaN  |
+------+------+------+------+------+------+
Thank you.
Update:
Here is an updated data set, which gives ValueError: Index contains duplicate entries, cannot reshape.
+----+------+-----+-------+-----------+
|    | A    | B   | C     | D         |
+----+------+-----+-------+-----------+
| 0  | 3957 | OK  | One   | TM-009.4  |
| 1  | 3957 | OK  | two   | TM-009.4  |
| 2  | 4147 | OK  | three | CERT008   |
| 3  | 3816 | OK  | four  | FITEYE-04 |
| 4  | 3955 | OK  | five  | TM-009.2  |
| 5  | 4147 | OK  | six   | CERT008   |
| 6  | 4147 | OK  | seven | CERT008   |
| 7  | 3807 | OK  | seven | EMT-038.4 |
| 8  | NaN  | OK  | eight | NaN       |
| 9  | NaN  | OK  | nine  | NaN       |
| 10 | NaN  | OK  | ten   | NaN       |
| 11 | NaN  | OK  | 11    | NaN       |
| 12 | NaN  | OK  | 12    | NaN       |
| 13 | NaN  | OK  | 13    | NaN       |
| 14 | NaN  | OK  | 14    | NaN       |
| 15 | NaN  | OK  | 14    | NaN       |
| 16 | 3814 | NaN | NaN   | FITEYE-02 |
| 17 | 3819 | NaN | NaN   | FITEYE-08 |
| 18 | 3884 | NaN | NaN   | TG-000.8  |
| 19 | 4087 | NaN | NaN   | TM-042.1  |
+----+------+-----+-------+-----------+

You were almost there; after pivot, we just need to rename the axes using rename_axis and use drop to remove the column and index entries that are not required.
Code
df[['C','D']] = df[['C','D']].fillna('NA')  # to keep things simple while dropping the col and index below
(df.pivot(index='C', columns='D', values='B')
   .rename_axis(index=None, columns=None)
   .drop(columns='NA', index='NA'))
Output
col1 col2 col3 col4 col5
five NaN NaN NaN NaN NaN
four NaN NaN NaN NaN NaN
one OK NaN NaN NaN NaN
six NaN NaN NaN NaN NaN
two OK OK NaN NaN NaN
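An alternative sketch that avoids the 'NA' sentinel altogether: starting from the original df (before the fillna above), pivot only the rows where both keys are present, then reindex against every category observed in C and D. Note the row/column order then follows first appearance rather than pivot's alphabetical sort.
pivot_data = (df.dropna(subset=['C', 'D'])
                .pivot(index='C', columns='D', values='B')
                .reindex(index=df['C'].dropna().unique(),
                         columns=df['D'].dropna().unique()))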
UPDATE
The issue is caused by the duplicate NaNs in the C column. Since we drop the NaN entries from the index anyway, we can either drop the duplicates or drop those rows entirely up front. I have dropped duplicates in the solution below; you can drop them completely instead, as your requirements dictate.
Code
df[['C','D']] = df[['C','D']].fillna('NA')
df = df.drop_duplicates(['C'])
(df.pivot(index='C', columns='D', values='B')
   .rename_axis(index=None, columns=None)
   .drop(columns='NA', index='NA'))
Output
CERT008 FITEYE-02 FITEYE-04 TM-009.2 TM-009.4
11 NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN NaN
One NaN NaN NaN NaN OK
eight NaN NaN NaN NaN NaN
five NaN NaN NaN OK NaN
four NaN NaN OK NaN NaN
nine NaN NaN NaN NaN NaN
seven OK NaN NaN NaN NaN
six OK NaN NaN NaN NaN
ten NaN NaN NaN NaN NaN
three OK NaN NaN NaN NaN
two NaN NaN NaN NaN OK
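Note that drop_duplicates(['C']) keeps only the first mapping when a C value appears under several D values (seven maps to both CERT008 and EMT-038.4 above, and only CERT008 survives). If that loss matters, a sketch that instead lets pivot_table aggregate the duplicate (C, D) pairs rather than raise, with dropna=False to keep the all-NaN rows and columns (starting again from the updated data set):
df[['C','D']] = df[['C','D']].fillna('NA')
(df.pivot_table(index='C', columns='D', values='B',
                aggfunc='first', dropna=False)
   .rename_axis(index=None, columns=None)
   .drop(columns='NA', index='NA'))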

Related

Calculating the difference in value between columns

I have a dataframe with YYYYMM columns that contain monthly totals on the row level:
| yearM | feature  | 201902 | 201903 | 201904 | 201905 | ... | 202009 |
|-------|----------|--------|--------|--------|--------|-----|--------|
| 0     | feature1 | NaN    | NaN    | 9.0    | 32.0   | ... |        |
| 1     | feature2 | 1.0    | 1.0    | 1.0    | 4.0    | ... |        |
| 2     | feature3 | NaN    | 1.0    | 4.0    | 8.0    | ... |        |
| 3     | feature4 | 9.0    | 15.0   | 19.0   | 24.0   | ... |        |
| 4     | feature5 | 33.0   | 67.0   | 99.0   | 121.0  | ... |        |
| 5     | feature6 | 12.0   | 15.0   | 17.0   | 19.0   | ... |        |
| 6     | feature7 | 1.0    | 8.0    | 15.0   | 20.0   | ... |        |
| 7     | feature8 | NaN    | NaN    | 1.0    | 9.0    | ... |        |
I would like to convert the totals to the monthly change. The feature column should be excluded, as I need to keep the feature names. The yearM label is a result of pivoting a dataframe to get the YYYYMM values on the column level.
This is what the output should look like:
| yearM | feature  | 201902 | 201903 | 201904 | 201905 | ... | 202009 |
|-------|----------|--------|--------|--------|--------|-----|--------|
| 0     | feature1 | NaN    | 0.0    | 9.0    | 23.0   | ... |        |
| 1     | feature2 | 1.0    | 0.0    | 0.0    | 3.0    | ... |        |
| 2     | feature3 | NaN    | 1.0    | 3.0    | 4.0    | ... |        |
| 3     | feature4 | 9.0    | 6.0    | 4.0    | 5.0    | ... |        |
| 4     | feature5 | 33.0   | 34.0   | 32.0   | 22.0   | ... |        |
| 5     | feature6 | 12.0   | 3.0    | 2.0    | 2.0    | ... |        |
| 6     | feature7 | 1.0    | 7.0    | 7.0    | 5.0    | ... |        |
| 7     | feature8 | NaN    | 0.0    | 1.0    | 8.0    | ... |        |
The row level values now represent the change compared to the previous month instead of having the total for the month.
I know that I should start by filling the NaN rows in the starting column 201902 with 0:
df['201902'] = df['201902'].fillna(0)
I could also calculate them one by one with something similar to this:
df['201902'] = df['201902'].fillna(0) - df['201901'].fillna(0)
df['201903'] = df['201903'].fillna(0) - df['201902'].fillna(0)
df['201904'] = df['201904'].fillna(0) - df['201903'].fillna(0)
...
...
Hopefully there's a smarter solution, though.
Use iloc or drop to exclude the feature column, then diff with axis=1 for row-wise differences:
monthly_change = df.iloc[:, 1:].fillna(0).diff(axis=1)
# or
# monthly_change = df.drop(['feature'], axis=1).fillna(0).diff(axis=1)
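diff leaves the first column (201902) all NaN, while the desired output keeps the starting month's totals. One way to finish, sketched under the assumption that the first data column is 201902:
monthly_change.iloc[:, 0] = df.iloc[:, 1]       # restore the original 201902 values
result = df[['feature']].join(monthly_change)   # reattach the feature names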

How to create new columns in pandas dataframe using column values?

I'm working in Python with a pandas DataFrame similar to:
REQUESET_ID | DESCR | TEST | TEST_DESC | RESULT |
1 | 1 | T1 | TEST_1 | 2.0 |
1 | 2 | T2 | TEST_2 | 92.0 |
2 | 1 | T1 | TEST_1 | 8.0 |
3 | 3 | T3 | TEST_3 | 12.0 |
3 | 4 | T4 | TEST_4 | 45.0 |
What I want is a final dataframe like this:
REQUESET_ID | DESCR_1 | TEST_1 | TEST_DESC_1 | RESULT_1 | DESCR_2 | TEST_2 | TEST_DESC_2 | RESULT_2 |
1 | 1 | T1 | TEST_1 | 2.0 | 2 | T2 | TEST_2 | 92.0 |
2 | 1 | T1 | TEST_1 | 8.0 | NaN | NaN | NaN | NaN |
3 | 3 | T3 | TEST_3 | 12.0 | 4 | T4 | TEST_4 | 45.0 |
How should I implement this as a method working with DataFrames? I understand that if I try to do it with a merge, instead of having 4x2 columns added (because value_counts on REQUESET_ID returns 2), it will add the 4 columns for each entry in the request column.
Assign a new column with cumcount, then do set_index + unstack:
s = df.assign(col=(df.groupby('REQUESET_ID').cumcount() + 1).astype(str)) \
      .set_index(['REQUESET_ID', 'col']).unstack().sort_index(level=1, axis=1)
s.columns = s.columns.map('_'.join)
s
DESCR_1 RESULT_1 TEST_1 ... RESULT_2 TEST_2 TEST_DESC_2
REQUESET_ID ...
1 1.0 2.0 T1 ... 92.0 T2 TEST_2
2 1.0 8.0 T1 ... NaN NaN NaN
3 3.0 12.0 T3 ... 45.0 T4 TEST_4
[3 rows x 8 columns]
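If you need REQUESET_ID back as a regular column, as in the desired output, one more step should do it:
s = s.reset_index()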

Create DataFrames or Dictionaries from Unique Values in Separate Columns

I'm a Python newbie trying to model stock trades from a DataFrame containing timestamped trade executions. For example, indices 0-2 below represent three executions of a single trade. I'm trying to isolate each group of executions that represents a trade. Here's a sample of the existing aggregate:
event side symbol shares price time trade_date
0 Execute Shrt TSLA 25 311.9500 10:29:51 2019-01-19
1 Execute Buy TSLA 20 310.7300 10:30:42 2019-01-19
2 Execute Buy TSLA 5 311.1173 10:31:15 2019-01-19
3 Execute Buy BYND 25 83.3027 11:06:15 2019-01-19
4 Execute Shrt BYND 13 84.0500 11:07:11 2019-01-19
5 Execute Sell BYND 12 83.2500 11:07:42 2019-01-19
6 Execute Buy NVDA 25 297.3400 12:07:42 2019-01-20
7 Execute Shrt AMZN 10 500.0100 12:09:12 2019-01-20
8 Execute Sell NVDA 25 296.7500 12:10:30 2019-01-20
9 Execute Buy AMZN 10 495.7500 12:11:15 2019-01-20
The approach in this post creates slices from unique values in a single column, but I'm unsure of how to make the second slice. With this approach applied, I have:
date_list = list(set(execs_df['trade_date']))  # list of unique dates from the original DataFrame
by_date_dict = {date: execs_df.loc[execs_df['trade_date'] == date] for date in date_list}
for date in date_list:
    print(by_date_dict[date])
This produces the following date-specific DataFrames:
side symbol shares price time trade_date p & l trades value
0 Shrt TSLA 25.0 311.9500 10:29:51 2019-11-01 NaN NaN 7798.7500
1 Buy TSLA 8.0 311.2000 10:30:31 2019-11-01 NaN NaN 2489.6000
2 Buy TSLA 8.0 310.7300 10:30:42 2019-11-01 NaN NaN 2485.8400
3 Buy TSLA 4.0 311.1173 10:31:15 2019-11-01 NaN NaN 1244.4692
4 Buy TSLA 5.0 311.5500 10:35:39 2019-11-01 NaN NaN 1557.7500
5 Shrt BYND 25.0 83.3027 11:06:15 2019-11-01 NaN NaN 2082.5675
6 Buy BYND 12.0 83.0500 11:06:43 2019-11-01 NaN NaN 996.6000
7 Buy BYND 13.0 83.2400 11:07:49 2019-11-01 NaN NaN 1082.1200
In terms of final output, I need the following:
side symbol shares price time trade_date p & l trades value
0 Shrt TSLA 25.0 311.9500 10:29:51 2019-11-01 NaN NaN 7798.7500
1 Buy TSLA 8.0 311.2000 10:30:31 2019-11-01 NaN NaN 2489.6000
2 Buy TSLA 8.0 310.7300 10:30:42 2019-11-01 NaN NaN 2485.8400
3 Buy TSLA 4.0 311.1173 10:31:15 2019-11-01 NaN NaN 1244.4692
4 Buy TSLA 5.0 311.5500 10:35:39 2019-11-01 NaN NaN 1557.7500
side symbol shares price time trade_date p & l trades value
0 Shrt BYND 25.0 83.3027 11:06:15 2019-11-01 NaN NaN 2082.5675
1 Buy BYND 12.0 83.0500 11:06:43 2019-11-01 NaN NaN 996.6000
2 Buy BYND 13.0 83.2400 11:07:49 2019-11-01 NaN NaN 1082.1200
etc...
Any pointers would be greatly appreciated.
Given your current dictionary of dataframes, by_date_dict, the following code will build a dict of dicts of dataframes:
The top key is still the date.
Under each date key is a key for the symbol (e.g. updated_df['2019-11-01']['BYND']).
updated_df = {k: {sym: v[v.symbol == sym] for sym in v.symbol.unique()} for k, v in by_date_dict.items()}

# structure
# {date: {symbol: df,
#         symbol: df,
#         symbol: df},
#  date: {symbol: df,
#         symbol: df,
#         symbol: df}}
for k, v in updated_df.items():
    print(k)
    for x, y in v.items():
        print(x)
        print(y.to_markdown())
2019-11-01
TSLA
| | side | symbol | shares | price | time | trade_date | pl | trades | value |
|---:|:-------|:---------|---------:|--------:|:---------|:-------------|-----:|---------:|--------:|
| 0 | Shrt | TSLA | 25 | 311.95 | 10:29:51 | 2019-11-01 | nan | nan | 7798.75 |
| 1 | Buy | TSLA | 8 | 311.2 | 10:30:31 | 2019-11-01 | nan | nan | 2489.6 |
| 2 | Buy | TSLA | 8 | 310.73 | 10:30:42 | 2019-11-01 | nan | nan | 2485.84 |
| 3 | Buy | TSLA | 4 | 311.117 | 10:31:15 | 2019-11-01 | nan | nan | 1244.47 |
| 4 | Buy | TSLA | 5 | 311.55 | 10:35:39 | 2019-11-01 | nan | nan | 1557.75 |
BYND
| | side | symbol | shares | price | time | trade_date | pl | trades | value |
|---:|:-------|:---------|---------:|--------:|:---------|:-------------|-----:|---------:|--------:|
| 5 | Shrt | BYND | 25 | 83.3027 | 11:06:15 | 2019-11-01 | nan | nan | 2082.57 |
| 6 | Buy | BYND | 12 | 83.05 | 11:06:43 | 2019-11-01 | nan | nan | 996.6 |
| 7 | Buy | BYND | 13 | 83.24 | 11:07:49 | 2019-11-01 | nan | nan | 1082.12 |
Access a specific key:
updated_df['2019-11-01']['BYND']
| | side | symbol | shares | price | time | trade_date | pl | trades | value |
|---:|:-------|:---------|---------:|--------:|:---------|:-------------|-----:|---------:|--------:|
| 5 | Shrt | BYND | 25 | 83.3027 | 11:06:15 | 2019-11-01 | nan | nan | 2082.57 |
| 6 | Buy | BYND | 12 | 83.05 | 11:06:43 | 2019-11-01 | nan | nan | 996.6 |
| 7 | Buy | BYND | 13 | 83.24 | 11:07:49 | 2019-11-01 | nan | nan | 1082.12 |
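For what it's worth, the same nesting can be reached in one pass straight from the original frame with groupby, skipping the intermediate by_date_dict; a sketch, reusing execs_df from the question (the keys become (trade_date, symbol) tuples):
trade_groups = {key: grp for key, grp in execs_df.groupby(['trade_date', 'symbol'])}
trade_groups[('2019-11-01', 'BYND')]  # one trade's executions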

Pandas - Copying to Index and Then Sorting

I am working on a large dataset of stock data. I've been able to create a multi-indexed dataframe, but now I can't configure it the way I want it.
Basically, I am trying to make an index called 'DATE' and then sort each smaller set against the index.
Right now it looks like this:
+------------+----------+-------+-------------+-------+
| DATE | AAPL | | GE | |
+------------+----------+-------+-------------+-------+
| DATE | date | close | date | close |
| 05-31-2019 | 05/31/19 | 203 | 04-31-2019 | 9.3 |
| 05-30-2019 | 05/30/19 | 202 | 04-30-2019 | 9.3 |
| 05-29-2019 | 05/29/19 | 4 | 04-29-2019 | 9.6 |
| | | | | |
| ... | | | | |
| | | | | |
| NaN | NaN | NaN | 01/30/1970 | 0.77 |
| NaN | NaN | NaN | 01/29/1970 | 0.78 |
| NaN | NaN | NaN | 01/28/1970 | 0.76 |
+------------+----------+-------+-------------+-------+
Where DATE is the index.
And I want it to look like this:
+------------+------------+-------+------------+-------+
| DATE       | AAPL       |       | GE         |       |
+------------+------------+-------+------------+-------+
| DATE       | date       | close | date       | close |
| 05-31-2019 | 05/31/19   | 203   | NaN        | NaN   |
| 05-30-2019 | 05/30/19   | 202   | NaN        | NaN   |
| 05-29-2019 | 05/29/19   | 4     | NaN        | NaN   |
| ...        |            |       |            |       |
| 01/30/1970 | NaN        | NaN   | 01/30/1970 | 0.77  |
| 01/29/1970 | NaN        | NaN   | 01/29/1970 | 0.78  |
| 01/28/1970 | NaN        | NaN   | 01/28/1970 | 0.76  |
+------------+------------+-------+------------+-------+
Where the index (DATE) has taken all of the unique values, and then all of the rows within stock symbols have moved to match the index where 'date' = 'DATE'.
I've tried so many attempts at this, but no matter what, I can't figure out either part: how to make the index a list of all of the unique 'date' values, and how to reformat the symbol data to match the new index.
A lot of my troubles (I suspect) have to do with the fact that I am using a multi-index for this, which makes everything more difficult as Pandas needs to know what level to be using.
I made the initial Index using this code:
df['DATE','DATE'] = df.xs(('AAPL', 'date'), level=('symbol', 'numbers'), axis=1)
df.set_index('DATE', inplace=True)
I tried to make one that kept adding unique values to the column, like this:
for f in filename_wo_ext:
    data = df.xs([f, 'date'], level=['symbol', 'numbers'], axis=1)
    df.append(data, ignore_index=True)
    df['DATE','DATE'] = data
pd.concat([pd.DataFrame([df], columns=['DATE']) for f in filename_wo_ext], ignore_index=True)
But that didn't cycle and append in the for loop the way I wanted it to; it just made a column based on the last symbol.
Then in terms of sorting the symbol frame to match the index, I still haven't been able to figure that out.
Thank you so much!
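One possible approach, sketched on the assumption that the per-symbol frames are available separately before being glued together (the per_symbol dict below is hypothetical): parse each symbol's date column to datetime, make it that frame's index, and let concat align everything on the union of dates.
import pandas as pd

# hypothetical input: per_symbol = {'AAPL': df_aapl, 'GE': df_ge, ...},
# each frame carrying its own 'date' and 'close' columns
aligned = pd.concat(
    {sym: frame.assign(DATE=pd.to_datetime(frame['date'])).set_index('DATE')
     for sym, frame in per_symbol.items()},
    axis=1,  # outer join on the union of all dates; NaN where a symbol has no row
).sort_index(ascending=False)
concat then builds the (symbol, field) column MultiIndex for you, so each symbol keeps its own date and close columns while sharing the single DATE index.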

Detect consecutive timestamps with all rows with NaN values in pandas

I would like to detect in a dataframe the start and end (Datetime) of consecutive sets of rows with all the values being NaN.
What is the best way to store the results in an array of tuples with the start and end of each set of datetimes with NaN values?
For example, using the dataframe below, the tuples should look like this:
missing_datetimes = [('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
                     ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
                     ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
Example of dataframe:
+------------+---------------------+------------+------------+
| geo_id | Datetime | Variable1 | Variable2 |
+------------+---------------------+------------+------------+
| 1 | 2018-10-10 18:00:00 | 20 | 10 |
| 2 | 2018-10-10 18:00:00 | 22 | 10 |
| 1 | 2018-10-10 19:00:00 | 20 | nan |
| 2 | 2018-10-10 19:00:00 | 21 | nan |
| 1 | 2018-10-10 20:00:00 | 30 | nan |
| 2 | 2018-10-10 20:00:00 | 30 | nan |
| 1 | 2018-10-10 21:00:00 | nan | 5 |
| 2 | 2018-10-10 21:00:00 | nan | 5 |
| 1 | 2018-10-10 22:00:00 | nan | nan |
| 1 | 2018-10-10 23:00:00 | nan | nan |
| 1 | 2018-10-11 00:00:00 | nan | nan |
| 1 | 2018-10-11 01:00:00 | 5 | 2 |
| 1 | 2018-10-11 02:00:00 | nan | nan |
| 1 | 2018-10-11 03:00:00 | 2 | 1 |
| 1 | 2018-10-11 04:00:00 | nan | nan |
+------------+---------------------+------------+------------+
Update: And what if some datetimes are duplicated?
You may need to use groupby with a condition. Note the all-NaN check should only look at the variable columns, and the mask's index has to line up with df for .loc to work:
s = df.drop(columns=['geo_id', 'Datetime']).isnull().all(axis=1)
# find the all-NaN rows; consecutive ones are pulled into one group
df.loc[s, 'Datetime'].groupby((~s).cumsum()[s]).agg(['first', 'last']).apply(tuple, axis=1).tolist()
Out[89]:
[('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
 ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
 ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
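As for the update: once a timestamp can appear on several rows (one per geo_id), it arguably only counts as missing when every row carrying it is all NaN. A sketch that collapses the mask per timestamp first and then groups consecutive missing timestamps the same way as above:
s = (df.drop(columns='geo_id').set_index('Datetime')
       .isnull().all(axis=1)
       .groupby(level='Datetime').all())  # one flag per unique timestamp
(s.index[s].to_series()
  .groupby((~s).cumsum()[s]).agg(['first', 'last'])
  .apply(tuple, axis=1).tolist())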
