I am working on a large dataset of stock data. I've been able to create a multi-indexed dataframe, but now I can't configure it the way I want it.
Basically, I am trying to make an index called 'DATE' and then sort each smaller set against the index.
Right now it looks like this:
+------------+----------+-------+-------------+-------+
| DATE       | AAPL     |       | GE          |       |
+------------+----------+-------+-------------+-------+
| DATE       | date     | close | date        | close |
| 05-31-2019 | 05/31/19 | 203   | 04-31-2019  | 9.3   |
| 05-30-2019 | 05/30/19 | 202   | 04-30-2019  | 9.3   |
| 05-29-2019 | 05/29/19 | 4     | 04-29-2019  | 9.6   |
|            |          |       |             |       |
| ...        |          |       |             |       |
|            |          |       |             |       |
| NaN        | NaN      | NaN   | 01/30/1970  | 0.77  |
| NaN        | NaN      | NaN   | 01/29/1970  | 0.78  |
| NaN        | NaN      | NaN   | 01/28/1970  | 0.76  |
+------------+----------+-------+-------------+-------+
Where DATE is the index.
And I want it to look like this:
+------------+----------+-------+------------+-------+
| DATE       | AAPL     |       | GE         |       |
+------------+----------+-------+------------+-------+
| DATE       | date     | close | date       | close |
| 05-31-2019 | 05/31/19 | 203   | NaN        | NaN   |
| 05-30-2019 | 05/30/19 | 202   | NaN        | NaN   |
| 05-29-2019 | 05/29/19 | 4     | NaN        | NaN   |
|            |          |       |            |       |
| ...        |          |       |            |       |
|            |          |       |            |       |
| 01/30/1970 | NaN      | NaN   | 01/30/1970 | 0.77  |
| 01/29/1970 | NaN      | NaN   | 01/29/1970 | 0.78  |
| 01/28/1970 | NaN      | NaN   | 01/28/1970 | 0.76  |
+------------+----------+-------+------------+-------+
Where the index (DATE) has taken all of the unique values, and all of the rows within each stock symbol have moved to match the index where 'date' = 'DATE'.
I've made many attempts at this, but I can't figure out either step: how to make the index a list of all of the unique 'date' values, or how to realign each symbol's data to match that new index.
A lot of my trouble (I suspect) comes from the fact that I am using a multi-index, which makes everything more difficult because Pandas needs to know which level to use.
I made the initial Index using this code:
df['DATE','DATE'] = df.xs(('AAPL', 'date'), level=('symbol', 'numbers'), axis=1)
df.set_index('DATE', inplace=True)
I tried to make one that kept adding unique values to the column, like this:
for f in filename_wo_ext:
    data = df.xs([f, 'date'], level=['symbol', 'numbers'], axis=1)
    df.append(data, ignore_index=True)
    df['DATE', 'DATE'] = data
pd.concat([pd.DataFrame([df], columns=['DATE']) for f in filename_wo_ext], ignore_index=True)
But that didn't cycle and append within the for loop the way I wanted; it just made a column based on the last symbol.
Then in terms of sorting the symbol frame to match the index, I still haven't been able to figure that out.
Thank you so much!
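A minimal sketch of one way to get the desired layout, assuming df is the multi-indexed frame shown above with column levels ('symbol', 'numbers'); the symbols list and the other names here are illustrative, not taken from the original code:
import pandas as pd

symbols = ['AAPL', 'GE']
aligned = []
for sym in symbols:
    # Pull out one symbol's slice, drop its NaN padding rows, and index it by its own dates.
    sub = df[sym].dropna(subset=['date']).copy()
    sub.index = pd.to_datetime(sub['date'])
    sub.columns = pd.MultiIndex.from_product([[sym], sub.columns])
    aligned.append(sub)

# Concatenating on axis=1 takes the union of the per-symbol date indexes
# (the list of all unique dates) and fills NaN where a symbol has no row.
result = pd.concat(aligned, axis=1).sort_index(ascending=False)
result.index.name = 'DATE'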
I have the following dataframe. I would like to get the rows where run_end_dt is the maximum for each pipeline_name.
Here is the dataframe:
+----+-----------------+--------------------------------------+----------------------------------+
| | pipeline_name | runid | run_end_dt |
|----+-----------------+--------------------------------------+----------------------------------|
| 0 | test_pipeline | test_pipeline_run_101 | 2021-03-10 20:01:26.704265+00:00 |
| 1 | test_pipeline | test_pipeline_run_102 | 2021-03-13 20:08:31.929038+00:00 |
| 2 | test_pipeline2 | test_pipeline2_run_101 | 2021-03-10 20:13:53.083525+00:00 |
| 3 | test_pipeline2 | test_pipeline2_run_102 | 2021-03-12 20:14:51.757058+00:00 |
| 4 | test_pipeline2 | test_pipeline2_run_103 | 2021-03-13 20:17:00.285573+00:00 |
Here is the result I want to achieve:
+----+-----------------+--------------------------------------+----------------------------------+
| | pipeline_name | runid | run_end_dt |
|----+-----------------+--------------------------------------+----------------------------------|
| 0 | test_pipeline | test_pipeline_run_102 | 2021-03-13 20:08:31.929038+00:00 |
| 1 | test_pipeline2 | test_pipeline2_run_103 | 2021-03-13 20:17:00.285573+00:00 |
In the expected result, we have only the runid against each pipeline_name with the max run_end_dt
Thanks
Suppose your dataframe is stored in a variable named df.
Just use the groupby() method:
df.groupby('pipeline_name', as_index=False)[['runid', 'run_end_dt']].max()
Use groupby followed by a transform to build a boolean mask marking the rows which hold the max value in each group.
idx = (df.groupby(['pipeline_name'], sort=False)['run_end_dt'].transform('max') == df['run_end_dt'])
df = df.loc[idx]
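For reference, a quick self-contained run of this approach on the sample rows from the question (the DataFrame construction below is just for illustration):
import pandas as pd

df = pd.DataFrame({
    'pipeline_name': ['test_pipeline', 'test_pipeline',
                      'test_pipeline2', 'test_pipeline2', 'test_pipeline2'],
    'runid': ['test_pipeline_run_101', 'test_pipeline_run_102',
              'test_pipeline2_run_101', 'test_pipeline2_run_102',
              'test_pipeline2_run_103'],
    'run_end_dt': pd.to_datetime([
        '2021-03-10 20:01:26.704265+00:00', '2021-03-13 20:08:31.929038+00:00',
        '2021-03-10 20:13:53.083525+00:00', '2021-03-12 20:14:51.757058+00:00',
        '2021-03-13 20:17:00.285573+00:00']),
})

# True only for the row holding each group's maximum run_end_dt
idx = df.groupby(['pipeline_name'], sort=False)['run_end_dt'].transform('max') == df['run_end_dt']
print(df.loc[idx].reset_index(drop=True))   # test_pipeline_run_102 and test_pipeline2_run_103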
The data frame I am working on has a column named "Phone" and I want to split it on / or , in such a way that I get the data frame as shown below in separate columns. For example, the first row is 0674-2537100/101 and I want to split it on "/" into two columns having the values 0674-2537100 and 0674-2537101.
Input:
+-------------------------------+
| Phone |
+-------------------------------+
| 0674-2537100/101 |
| 0674-2725627 |
| 0671 – 2647509 |
| 2392229 |
| 2586198/2583361 |
| 0663-2542855/2405168 |
| 0674 – 2563832/0674-2590796 |
| 0671-6520579/3200479 |
+-------------------------------+
Output:
+-----------------------------------+
| Phone | Phone1 |
+-----------------------------------+
| 0674-2537100 | 0674-2537101 |
| 0674-2725627 | |
| 0671 – 2647509 | |
| 2392229 | |
| 2586198 | 2583361 |
| 0663-2542855 | 0663-2405168 |
| 0674 – 2563832 | 0674-2590796 |
| 0671-6520579 | 0671-3200479 |
+-----------------------------------+
Here is the idea I came up with: take the lengths of the strings on both sides of the separator (/), take their difference, and copy the substring of the first column up to character position [:difference-1] into the second column.
So far my progress is,
df['Phone'] = df['Phone'].str.replace(' ', '')
df['Phone'] = df['Phone'].str.replace('–', '-')
df[['Phone','Phone1']] = df['Phone'].str.split("/",expand=True)
df["Phone1"].fillna(value=np.nan, inplace=True)
m2 = (df["Phone1"].str.len() < 12) & (df["Phone"].str.len() > 7)
m3 = df["Phone"].str.len() - df["Phonenew"].str.len()
df.loc[m2, "Phone1"] = df["Phone"].str[:m3-1] + df["Phonenew"]
It gives an error and the column has only NaN values after I run this. Please help me out here.
Considering you're only going to have at most two values separated by '/' in the 'Phone' column, here's what you can do:
import pandas as pd

def split_phone_number(row):
    '''
    This function takes in a row of the dataframe as input and returns the row with appropriate values.
    '''
    split_str = row['Phone'].split('/')
    # Considering that you're only going to have 2 or fewer values, update
    # the passed row's columns with appropriate values.
    if len(split_str) > 1:
        row['Phone'] = split_str[0]
        row['Phone1'] = split_str[1]
    else:
        row['Phone'] = split_str[0]
        row['Phone1'] = ''
    # Return the updated row.
    return row
# Making a dummy dataframe.
d = {'Phone': ['0674-2537100/101', '0674-257349', '0671-257349', '257349', '257349/100', '101/100', '5688343/438934']}
dataFrame = pd.DataFrame(data=d)
# Considering you're only going to have one extra column, adding that column to the dataframe.
dataFrame = dataFrame.assign(Phone1=['' for i in range(dataFrame.shape[0])])
# Applying the split_phone_number function to the dataframe.
dataFrame = dataFrame.apply(split_phone_number, axis=1)
# Printing the dataframe.
print(dataFrame)
Input:
+---------------------+
| Phone |
+---------------------+
| 0 0674-2537100/101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349/100 |
| 5 101/100 |
| 6 5688343/438934 |
+---------------------+
Output:
+----------------------------+
| Phone Phone1 |
+----------------------------+
| 0 0674-2537100 101 |
| 1 0674-257349 |
| 2 0671-257349 |
| 3 257349 |
| 4 257349 100 |
| 5 101 100 |
| 6 5688343 438934 |
+----------------------------+
For further reading:
dataframe.apply()
Hope this helps. Cheers!
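Note that the expected output in the original question also restores the leading digits that the second number is missing (e.g. 0663-2405168 from 0663-2542855/2405168), which the approach above leaves as the bare suffix. A minimal sketch of that extra step, assuming the length-difference rule described in the question (whatever the second number is missing is copied from the front of the first); the data and names below are only for illustration:
import pandas as pd

df = pd.DataFrame({'Phone': [
    '0674-2537100/101', '0674-2725627', '0671 – 2647509', '2392229',
    '2586198/2583361', '0663-2542855/2405168',
    '0674 – 2563832/0674-2590796', '0671-6520579/3200479']})

# Normalise: drop spaces and turn the en dash into a plain hyphen.
clean = df['Phone'].str.replace(' ', '', regex=False).str.replace('–', '-', regex=False)

# Split on '/' into the two columns.
df[['Phone', 'Phone1']] = clean.str.split('/', n=1, expand=True)

def complete(row):
    # Prepend to Phone1 whatever leading characters it is missing relative to Phone.
    if pd.isna(row['Phone1']):
        return row['Phone1']
    d = len(row['Phone']) - len(row['Phone1'])
    return row['Phone'][:d] + row['Phone1'] if d > 0 else row['Phone1']

df['Phone1'] = df.apply(complete, axis=1)
print(df)   # 101 becomes 0674-2537101, 2405168 becomes 0663-2405168, etc.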
Trying to figure out how to randomly replace values in a specific column with nulls in PySpark. So changing a dataframe such as this:
| A | B |
|----|----|
| 1 | 2 |
| 3 | 4 |
| 5 | 6 |
| 7 | 8 |
| 9 | 10 |
| 11 | 12 |
and randomly change 25% of the values in column 'B' to null values:
| A | B |
|----|------|
| 1 | 2 |
| 3 | NULL |
| 5 | 6 |
| 7 | NULL |
| 9 | NULL |
| 11 | 12 |
Thanks to @pault I was able to answer my own question using the question he posted, which you can find here.
Essentially I ran something like this:
import pyspark.sql.functions as f
df1 = df.withColumn('Val', f.when(f.rand() > 0.25, df['Val']).otherwise(f.lit(None)))
Which will randomly select values within the column 'Val' and replace them with None.
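For completeness, a self-contained sketch against the sample columns from the question (column 'B'; the DataFrame construction is only for illustration):
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)], ['A', 'B'])

# f.rand() draws an independent uniform value per row, so on average about 25%
# of the rows fail the condition and get NULL; the exact count varies per run.
df1 = df.withColumn('B', f.when(f.rand() > 0.25, df['B']).otherwise(f.lit(None)))
df1.show()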
This started its life as a list of activities. I first built a matrix similar to the one below to represent all activities, then inverted it to show all inactivity, before building the following matrix, where zero indicates an activity and anything greater than zero indicates the number of days before the next activity.
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Item | 01/08/2020 | 02/08/2020 | 03/08/2020 | 04/08/2020 | 05/08/2020 | 06/08/2020 | 07/08/2020 | 08/08/2020 | 09/08/2020 |
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| B | 3 | 2 | 1 | 0 | 0 | 3 | 2 | 1 | 0 |
| C | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| D | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | 0 |
| E | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 |
+------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
Now I need to find suitable intervals for each Item. For instance, in this case I want to find all intervals with a minimum duration of 3 days.
+------+------------+------------+------------+------------+
| Item | 1_START | 1_END | 2_START | 2_END |
+------+------------+------------+------------+------------+
| A | NaN | NaN | NaN | NaN |
| B | 01/08/2020 | 03/08/2020 | 06/08/2020 | 08/08/2020 |
| C | NaN | NaN | NaN | NaN |
| D | 01/08/2020 | 07/08/2020 | NaN | NaN |
| E | 01/08/2020 | NaN | NaN | NaN |
+------+------------+------------+------------+------------+
In reality the data is 700+ columns wide and 1,000+ rows. How can I do this efficiently?
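One possible approach, sketched in pandas under the assumption that the matrix above lives in a DataFrame (called df here purely for illustration) indexed by Item with one column per date. Because each gap counts down to 0, a gap starts wherever the value jumps above the previous column (or in the first column), and the value at that start is the gap's length in days:
import numpy as np
import pandas as pd

MIN_DAYS = 3

def find_intervals(row):
    # Return alternating start/end dates for every gap lasting at least MIN_DAYS.
    dates = row.index
    out = []
    prev = 0
    for i, v in enumerate(row.to_numpy()):
        # A gap starts where the countdown jumps above the previous column.
        if v >= MIN_DAYS and v > prev:
            end_idx = i + int(v) - 1
            out.append(dates[i])                                            # n_START
            out.append(dates[end_idx] if end_idx < len(dates) else np.nan)  # n_END
        prev = v
    return pd.Series(out, dtype=object)

# df is assumed to be the matrix shown above: indexed by Item, one column per date.
intervals = df.apply(find_intervals, axis=1)
intervals.columns = [f'{i // 2 + 1}_{"START" if i % 2 == 0 else "END"}'
                     for i in range(intervals.shape[1])]
print(intervals)

A single row-wise pass like this stays manageable at 700+ columns and 1,000+ rows; if it proves too slow, the same jump detection could be vectorised with NumPy.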
I have a large table in Excel, which is the output of a data-gathering tool, that looks more or less like this:
DateA | ValueA | DateB | ValueB | ... | DateZ | ValueZ
---------------------------------------------------------------------------
2019-01-01 | 3 | 2019-01-01 | 6 | ... | 2019-01-04 | 7
2019-01-02 | 1 | 2019-01-04 | 2 | ... | 2019-01-05 | 3
And I'd like to process it so it would look like this:
Date | Value | Type
-----------------------------
2019-01-01 | 3 | A
2019-01-02 | 1 | A
2019-01-01 | 6 | B
2019-01-04 | 2 | B
...
2019-01-04 | 7 | Z
2019-01-05 | 3 | Z
Because this is the format that is used in our SQL database.
How can I do this in the least tedious way, preferably using Power Query? I'd like to avoid brute-force copying and pasting with a VBA loop.
The number of columns is fixed, but it would be nice to have the option to add another one later on. The number of rows, however, would vary around some value (like 20, 21, 20, 22, 19, 20) from day to day.
Columns are harder to work with, so I'd first transform each column into a new row as a list.
ColumnsToRows =
    Table.FromColumns(
        {
            Table.ToColumns(Source),
            Table.ColumnNames(Source)
        },
        {"ColumnValues", "ColumnName"}
    )
This should give you a table as follows, where each list consists of the values in the corresponding column. For example, the top list is {1/1/2019, 1/2/2019}. (The Table.FromColumns part is there to add the ColumnName column.)
| ColumnValues | ColumnName |
|--------------|------------|
| [List] | DateA |
| [List] | ValueA |
| [List] | DateB |
| [List] | ValueB |
| [List] | DateZ |
| [List] | ValueZ |
We can then filter this based on the data type in each list. To get the date rows you can write:
DateRows =
    Table.SelectRows(
        ColumnsToRows,
        each Value.Type(List.First([ColumnValues])) = type date
    )
Which gets you the following filtered table:
| ColumnValues | ColumnName |
|--------------|------------|
| [List] | DateA |
| [List] | DateB |
| [List] | DateZ |
If you expand the first column with Table.ExpandListColumn(DateRows, "ColumnValues"), then you get
| ColumnValues | ColumnName |
|--------------|------------|
| 1/1/2019 | DateA |
| 1/2/2019 | DateA |
| 1/1/2019 | DateB |
| 1/4/2019 | DateB |
| 1/4/2019 | DateZ |
| 1/5/2019 | DateZ |
The logic to filter and expand the value rows is analogous.
ValueRows =
    Table.ExpandListColumn(
        Table.SelectRows(
            ColumnsToRows,
            each Value.Type(List.First([ColumnValues])) = type number
        ),
        "ColumnValues"
    )
Which gets you a similar looking table:
| ColumnValues | ColumnName |
|--------------|------------|
| 3 | ValueA |
| 1 | ValueA |
| 6 | ValueB |
| 2 | ValueB |
| 7 | ValueZ |
| 3 | ValueZ |
Now we just need to combine the columns we want into a single table:
CombineColumns =
    Table.FromColumns(
        {
            DateRows[ColumnValues],
            ValueRows[ColumnValues],
            ValueRows[ColumnName]
        },
        {"Date", "Value", "Type"}
    )
and then extract the text following Value in the column names.
ExtractType =
    Table.TransformColumns(
        CombineColumns,
        {{"Type", each Text.AfterDelimiter(_, "Value"), type text}}
    )
The final table should be just as specified:
| Date | Value | Type |
|----------|-------|------|
| 1/1/2019 | 3 | A |
| 1/2/2019 | 1 | A |
| 1/1/2019 | 6 | B |
| 1/4/2019 | 2 | B |
| 1/4/2019 | 7 | Z |
| 1/5/2019 | 3 | Z |
All in a single query, the M code looks like this:
let
    Source = <Source Goes Here>,
    ColumnsToRows = Table.FromColumns({Table.ToColumns(Source), Table.ColumnNames(Source)}, {"ColumnValues", "ColumnName"}),
    DateRows = Table.ExpandListColumn(Table.SelectRows(ColumnsToRows, each Value.Type(List.First([ColumnValues])) = type date), "ColumnValues"),
    ValueRows = Table.ExpandListColumn(Table.SelectRows(ColumnsToRows, each Value.Type(List.First([ColumnValues])) = type number), "ColumnValues"),
    CombineColumns = Table.FromColumns({DateRows[ColumnValues], ValueRows[ColumnValues], ValueRows[ColumnName]}, {"Date", "Value", "Type"}),
    ExtractType = Table.TransformColumns(CombineColumns, {{"Type", each Text.AfterDelimiter(_, "Value"), type text}})
in
    ExtractType