Subtract a subset of columns from a key column in Pandas Pivot - python-3.x

I have a pivot table with multiple columns of data in a time series:
           A  B  C  D
11/1/2018  1  5  5  7
11/2/2018  2  6  6  8
11/3/2018  3  7  7  9
The values in the data columns are not important for this example. I would like to subtract the value in the "key" column (column A in this case) from a subset of columns: B & C in this case. I would then like to drop any columns not in the subset or the key column. Result would be:
           A  B  C
11/1/2018  1  4  4
11/2/2018  2  4  4
11/3/2018  3  4  4
I have subtracted columns in the past via code like this:
df['dif'] = df['B'] - df['A']
But this adds a new "dif" column; I would like to replace column B with the B-A values instead. Also, rather than passing the instructions one at a time (B-A, C-A), I would like to pass a list of columns, something like "if the column is in the list, subtract the key column from it, else drop the column."
Thanks

pandas.DataFrame.sub with axis=0
When subtracting a Series from a DataFrame, pandas by default aligns the columns of the DataFrame with the index of the Series. This is what happens when you use the - operator. With the pandas.DataFrame.sub method, however, you can pass axis=0 to override that default and align the DataFrame's index with the index of the Series instead.
def f(d, key, subset):
    return d[[key]].join(d[subset].sub(d[key], axis=0))
f(df, 'A', ['B', 'C'])
           A  B  C
11/1/2018  1  4  4
11/2/2018  2  4  4
11/3/2018  3  4  4
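If you prefer to modify df itself rather than build a new frame, here is a minimal sketch of the in-place variant the question describes (subtract the key from every column in the list, then drop everything else); the frame below is a reconstruction of the example data:
import pandas as pd

# reconstruct the example frame from the question
df = pd.DataFrame({'A': [1, 2, 3], 'B': [5, 6, 7], 'C': [5, 6, 7], 'D': [7, 8, 9]},
                  index=['11/1/2018', '11/2/2018', '11/3/2018'])

key, subset = 'A', ['B', 'C']
df[subset] = df[subset].sub(df[key], axis=0)  # B - A, C - A
df = df[[key] + subset]                       # keep only A, B, C
print(df)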

You can use apply to subtract A from the subset of columns that you choose and finally join the result back onto A.
df['A'].to_frame().join(df[['B','C']].apply(lambda x: x - df['A']))
           A  B  C
11/1/2018  1  4  4
11/2/2018  2  4  4
11/3/2018  3  4  4

Related

Sum of a column where another column matches a value

I have a dataframe; I want a new column with the sum of all count values where foo equals the current row's foo.
It would be possible to create a new dataframe, do a grouped sum there, and merge it back, but I guess there is a much simpler solution.
Any hints are highly appreciated.
Input:
foo  count
a    3
a    7
b    1
b    2
Output:
foo  count  sum_of_count
a    3      10
a    7      10
b    1      3
b    2      3
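No answer was captured for this question above; a minimal sketch of the usual approach uses groupby with transform('sum'), which avoids the merge (column names taken from the example):
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'a', 'b', 'b'], 'count': [3, 7, 1, 2]})

# transform('sum') computes the per-group total and broadcasts it
# back onto every row of that group, so no merge is needed
df['sum_of_count'] = df.groupby('foo')['count'].transform('sum')
print(df)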

Identify and count alternating parts of a column in a (timeseries) dataframe

I am analyzing trades done in a futures contract, based on a csv file with a list of trades (columns are Side, Qty, Price, Date).
I have imported the file and sorted the trades chronologically by time. The column "Side" (BUY/SELL) is now:
B
S
S
B
B
S
S
B
B
B
B
I want to give each run of B's and each run of S's a unique number, so that I can group the individual runs of B's and S's for further analysis. For example, I want to find out what the average price of each run of B's and each run of S's is.
In the example above there are 5 runs in total, 3 of B's and 2 of S's. The first run of B's should be numbered 1, the second run of B's 3, and the last run of B's 5. Basically I want to add a column with this output:
1
2
2
3
3
4
4
5
5
5
5
Now I should be able to find the average price of the four B's in run number 5 using groupby with the new column as argument and mean().
But how can I make the counter needed for this new column? I am able to identify each change using something like np.where(), diff(), abs() + cumsum() and 1 and -1, but I don't see how I can add +1 at each alternation.
Compare the column with its shifted version using Series.shift and not-equal (Series.ne), then take the cumulative sum with Series.cumsum:
df['new'] = df['Side'].ne(df['Side'].shift()).cumsum()
How it works:
df = df.assign(shifted=df['Side'].shift(),
               mask=df['Side'].ne(df['Side'].shift()),
               new=df['Side'].ne(df['Side'].shift()).cumsum())
print(df)
Side shifted mask new
0 B NaN True 1
1 S B True 2
2 S S False 2
3 B S True 3
4 B B False 3
5 S B True 4
6 S S False 4
7 B S True 5
8 B B False 5
9 B B False 5
10 B B False 5
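With the run counter in place, the average the question asks for is a single groupby; a minimal sketch, assuming the trades frame also contains the Price column mentioned in the question:
# average price per consecutive run of B's or S's;
# 'new' is the run counter built above with shift/ne/cumsum
avg_price_per_run = df.groupby(['new', 'Side'])['Price'].mean()
print(avg_price_per_run)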

Percentage of values when one column has values and the other column is null

Maybe this is a duplicate of another question, but I am not able to solve the problem.
I have transaction data with 100 features and 2.3 million rows. I want to find the percentage of rows where one column has a value and another column is null, for every combination of columns.
Example:
A B C D
1 NA 2 3
2 4 5 6
NA 5 6 7
8 2 NA NA
9 8 7 6
So the output should be:
When A has a value, B is null 1/4 = 0.25 of the time
When A has a value, C is null 1/4 = 0.25 of the time
Similarly for every other combination of columns, and create a dataframe from the results.
I tried itertools.combinations over the columns, but it's not giving the desired result:
itertools.combinations(daf.columns, n)
You can write two for loops to iterate over the individual columns and then compare them.
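A minimal sketch of that idea, looping over column pairs with itertools.permutations (the example frame mirrors the one in the question; the result column names are illustrative, not from the original answer):
import itertools
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 8, 9],
                   'B': [np.nan, 4, 5, 2, 8],
                   'C': [2, 5, 6, np.nan, 7],
                   'D': [3, 6, 7, np.nan, 6]})

rows = []
for has_col, null_col in itertools.permutations(df.columns, 2):
    has_value = df[has_col].notna()
    # fraction of rows where has_col has a value but null_col is null
    frac = (has_value & df[null_col].isna()).sum() / has_value.sum()
    rows.append({'has_value': has_col, 'is_null': null_col, 'fraction': frac})

result = pd.DataFrame(rows)
print(result)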

How to remove duplicate rows that have the same values in a different order in a pandas dataframe

How do I remove the duplicates in the df? df only has one column. In this case "60,25" and "25,60" are a pair of duplicated rows. The output should be the new df. For each pair of duplicated rows, the kept row should be the one in the format "A,B" where A < B; the removed row should be the one where A > B. In this case, "25,60" and "80,123" should be kept. Unique rows should stay as they are.
IIUC, using get_dummies with duplicated
df[~df.A.str.get_dummies(sep=',').duplicated()]
Out[956]:
A
0 A,C
1 A,B
4 X,Y,Z
Data input
df
Out[957]:
A
0 A,C
1 A,B
2 C,A
3 B,A
4 X,Y,Z
5 Z,Y,X
Update: the OP changed the question to a totally different one
newdf = df.A.str.get_dummies(sep=',')
newdf[~newdf.duplicated()].dot(newdf.columns + ',').str[:-1]
Out[976]:
0 25,60
1 123,37
dtype: object
I'd do a combination of things.
Use pandas.Series.str.split to split by commas
Use apply(frozenset) to get a hashable set such that I can use duplicated
Use pandas.Series.duplicated with keep='last'
df[~df.A.str.split(',').apply(frozenset).duplicated(keep='last')]
A
1 123,17
3 80,123
4 25,60
5 25,42
Addressing comments
df.A.apply(
    lambda x: tuple(sorted(map(int, x.split(','))))
).drop_duplicates().apply(
    lambda x: ','.join(map(str, x))
)
0 25,60
1 17,123
2 80,123
5 25,42
Name: A, dtype: object
Setup
df = pd.DataFrame(dict(
    A='60,25 123,17 123,80 80,123 25,60 25,42'.split()
))

Filter columns based on a value (Pandas): TypeError: Could not compare ['a'] with block values

I'm trying to filter DataFrame columns based on a value.
In[41]: df = pd.DataFrame({'A':['a',2,3,4,5], 'B':[6,7,8,9,10]})
In[42]: df
Out[42]:
A B
0 a 6
1 2 7
2 3 8
3 4 9
4 5 10
Filtering columns:
In[43]: df.loc[:, (df != 6).iloc[0]]
Out[43]:
A
0 a
1 2
2 3
3 4
4 5
It works! But when I use strings,
In[44]: df.loc[:, (df != 'a').iloc[0]]
I'm getting this error: TypeError: Could not compare ['a'] with block values
You are trying to compare the string 'a' with the numeric values in column B.
If you want your code to work, first promote the dtype of column B to object; then it will work.
df.B = df.B.astype(object)
Always check the data types of the columns before performing operations, using
df.info()
You could do this with masks instead, for example:
df[df.A != 'a'].A
and to keep only the rows where no column contains the value:
df[df.apply(lambda x: sum([x_ == 'a' for x_ in x]) == 0, axis=1)]
The problem is due to the fact that there are numeric and string objects in the dataframe.
You can loop through the columns and check each one, as a Series, for the specific value using
(Series == 'a').any()
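A minimal sketch of that column-by-column check, reusing the example frame from the question (interpreting "filter" as keeping the columns in which the value never appears, which is an assumption):
import pandas as pd

df = pd.DataFrame({'A': ['a', 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# comparing one column at a time avoids the mixed-dtype comparison error;
# keep only the columns in which the value 'a' never appears
keep = [col for col in df.columns if not (df[col] == 'a').any()]
print(df[keep])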
