xarray equivalent to pandas subtract/add - python-3.x

I'm looking for a concise way to do arithmetic on a single dimension of a DataArray and have the result returned as a new DataArray (containing both the changed and unchanged parts). In pandas I would do this with df.subtract(), but I haven't found a way to do this with xarray.
Here's how I would subtract the value 2 from the x dimension in pandas:
import numpy as np
import pandas as pd

data = np.arange(0, 6).reshape(2, 3)
xc = np.arange(0, data.shape[0])
yc = np.arange(0, data.shape[1])
df1 = pd.DataFrame(data, index=xc, columns=yc)
df2 = df1.subtract(2, axis='columns')
For xarray, though, I don't know the equivalent:
import xarray as xr

da1 = xr.DataArray(data, coords={'x': xc, 'y': yc}, dims=['x', 'y'])
da2 = ?

In xarray, you can subtract from the rows or columns of an array by using broadcasting by dimension name.
For example:
>>> foo = xarray.DataArray([[1, 2, 3], [4, 5, 6]], dims=['x', 'y'])
>>> bar = xarray.DataArray([1, 4], dims='x')
# subtract along 'x'
>>> foo - bar
<xarray.DataArray (x: 2, y: 3)>
array([[0, 1, 2],
       [0, 1, 2]])
Dimensions without coordinates: x, y
>>> baz = xarray.DataArray([1, 2, 3], dims='y')
# subtract along 'y'
>>> foo - baz
<xarray.DataArray (x: 2, y: 3)>
array([[0, 0, 0],
       [3, 3, 3]])
Dimensions without coordinates: x, y
This works similarly to the axis='columns' vs. axis='index' options that pandas provides, except that the desired dimension is referenced by name, as in the sketch below.
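As a concrete illustration for the question's setup, here is a minimal sketch (assuming the data, xc, and yc arrays defined in the question; the offsets name is illustrative, not part of any API):
import numpy as np
import xarray

data = np.arange(0, 6).reshape(2, 3)
xc = np.arange(0, data.shape[0])
yc = np.arange(0, data.shape[1])
da1 = xarray.DataArray(data, coords={'x': xc, 'y': yc}, dims=['x', 'y'])

# a 1-D DataArray with dims='x' broadcasts against da1's 'x' dimension by
# name, much like df.subtract(..., axis='index') in pandas
offsets = xarray.DataArray([0, 2], coords={'x': xc}, dims='x')
da2 = da1 - offsets  # subtracts 0 from the first row and 2 from the second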

When you do:
df1 = pd.DataFrame(data, index=xc, columns=yc)
df2 = df1.subtract(2, axis='columns')
You really are just subtracting 2 from the entire DataFrame.
Here is your output from above:
In [15]: df1
Out[15]:
   0  1  2
0  0  1  2
1  3  4  5
In [16]: df2
Out[16]:
   0  1  2
0 -2 -1  0
1  1  2  3
Which is equivalent to:
df3 = df1.subtract(2)
In [20]: df3
Out[20]:
   0  1  2
0 -2 -1  0
1  1  2  3
And equivalent to:
df4 = df1 - 2
In [22]: df4
Out[22]:
   0  1  2
0 -2 -1  0
1  1  2  3
Therefore, for an xarray data array:
da1 = xr.DataArray(data, coords={'x': xc, 'y': yc}, dims=['x', 'y'])
da2 = da1 - 2
In [24]: da1
Out[24]:
<xarray.DataArray (x: 2, y: 3)>
array([[0, 1, 2],
       [3, 4, 5]])
Coordinates:
  * y        (y) int64 0 1 2
  * x        (x) int64 0 1
In [25]: da2
Out[25]:
<xarray.DataArray (x: 2, y: 3)>
array([[-2, -1,  0],
       [ 1,  2,  3]])
Coordinates:
  * y        (y) int64 0 1 2
  * x        (x) int64 0 1
Now, if you would like to subtract from a specific column only, that's a different problem, which I believe would require assignment indexing; see the sketch below.
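A minimal sketch of that approach, assuming the goal is to subtract 2 only from the column where y == 1 (the chosen coordinate value is illustrative), using .loc with a dimension dictionary:
import numpy as np
import xarray as xr

data = np.arange(0, 6).reshape(2, 3)
xc = np.arange(0, data.shape[0])
yc = np.arange(0, data.shape[1])
da1 = xr.DataArray(data, coords={'x': xc, 'y': yc}, dims=['x', 'y'])

# assignment indexing: modify only the slice where y == 1,
# leaving the rest of the array unchanged
da2 = da1.copy()
da2.loc[{'y': 1}] = da2.loc[{'y': 1}] - 2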

Related

How to convert a list-like value in each row into a plain value in a Python dataframe?

I have a dataframe with values that look as below:
colA       desired
[0]        0
[3, 1]     3,1
[3, 1, 2]  3,1,2
[3, 1]     3,1
The type for colA is object.
Is there a way to do it?
Thanks
Without the lambda (which will be slower for larger data sets), you can simply cast the list to a string type, then strip the unwanted characters:
import pandas as pd

df = pd.DataFrame(data={'colA': [[0], [3, 1], [3, 1, 2], [3, 1]]})
df['desired'] = df.colA.astype(str).str.replace(r"[\[\]']", '', regex=True)
df
Output:
        colA  desired
0        [0]        0
1     [3, 1]     3, 1
2  [3, 1, 2]  3, 1, 2
3     [3, 1]     3, 1
Try:
df = pd.DataFrame(data={'colA':[[0], [3,1],[3,1,2],[3,1], [4]]})
df['desired'] = df['colA'].apply(lambda x: ','.join(map(str, x)))
OUTPUT:
        colA  desired
0        [0]        0
1     [3, 1]      3,1
2  [3, 1, 2]    3,1,2
3     [3, 1]      3,1
4        [4]        4
If colA holds strings rather than lists (object dtype):
from ast import literal_eval
df = pd.DataFrame(data={'colA':["[0]", "[3,1]","[3,1,2]","[3,1]", "[4]"]})
df['desired'] = df['colA'].apply(lambda x: ','.join(map(str, literal_eval(x))))
You can use str.replace (this assumes colA holds strings):
df['desired'] = df['colA'].str.replace(r'[][]', '', regex=True)
Prints:
        colA  desired
0        [0]        0
1     [3, 1]     3, 1
2  [3, 1, 2]  3, 1, 2
3     [3, 1]     3, 1

In Pandas dataframe remove duplicate strings in a list column and remove the corresponding ids in the other list column

I have a Pandas dataframe like this:
df =
A                 B
[1, 5, 8, 10]     [str1, str_to_be_removed*, str_to_be_removed*, str2]
[4, 7, 10]        [str2, str5, str10]
[5, 2, 7, 9, 15]  [str6, str2, str_to_be_removed*, str_to_be_removed*, str_to_be_removed*]
...               ...
Given str_to_be_removed, I would like to keep only the first instance of a string containing str_to_be_removed in column B and remove the other ones. In addition, I would also like to remove the corresponding ids from A. In each row, the lists contained in A and B have the same length.
How to do this?
Desired Output:
df =
A           B
[1, 5, 10]  [str1, str_to_be_removed*, str2]
[4, 7, 10]  [str2, str5, str10]
[5, 2, 7]   [str6, str2, str_to_be_removed*]
EDIT:
So, this is the sample df:
df = pd.DataFrame({'A': [[1, 2], [2, 3], [3, 4, 5], [10, 11, 12]],
                   'B': [['str_to_be_removed_bla', 'str_to_be_removed_bla'],
                         ['str_to_be_removed_bla', 'b'],
                         ['b', 'c', 'd'],
                         ['str_to_be_removed_bla_bla', 'str_to_be_removed_bla', 'f']]})
Desired Output:
df =
A          B
[1]        [str_to_be_removed_bla]
[2, 3]     [str_to_be_removed_bla, b]
[3, 4, 5]  [b, c, d]
[10, 12]   [str_to_be_removed_bla_bla, f]
Steps:
use zip to combine related elements together
unnest the zipped list with explode
drop duplicates, keep first
groupby the index, agg as list
df = pd.DataFrame({'A':[
[1,2],[2,3],[3,4,5],[10, 11, 12]
],
'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
['str_to_be_removed_bla','b'],
['b', 'c', 'd'],
['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']
]})
# zip and explode step
# df_obj = df.apply(lambda x: list(zip(x.A, x.B)),
# axis = 1
# ).to_frame()
# df_obj = df_obj.explode(0).reset_index()
# df_obj['A'] = df_obj[0].str[0]
# df_obj['B'] = df_obj[0].str[1]
Update: with @Joe Ferndz's solution, the steps simplify to:
# explode the columns
dfTemp = df.apply(lambda x: x.explode())
df_obj = dfTemp.reset_index()
# more conditions can be added here to filter before drop_duplicates
cond = df_obj['B'].astype(str).str.contains("str_to_be_removed")
df1 = df_obj[cond].drop_duplicates(['index'], keep='first')
df2 = df_obj[~cond]
df3 = pd.concat([df1, df2]).sort_index()
result = df3.groupby('index').agg({'A':list, 'B':list})
result
       A          B
index
0      [1]        [str_to_be_removed_bla]
1      [2, 3]     [str_to_be_removed_bla, b]
2      [3, 4, 5]  [b, c, d]
3      [10, 12]   [str_to_be_removed_bla_bla, f]
To extend this solution to multiple patterns (str_to_be_removed1, str_to_be_removed2) in one step:
df = pd.DataFrame({'A':[
[1,2],[2,3],[3,4,5, 6],[10, 11, 12]
],
'B':[['str_to_be_removed1_bla','str_to_be_removed1_bla'],
['str_to_be_removed1_bla','b'],
['b', 'b', 'c', 'd'],
['str_to_be_removed2_bla_bla','str_to_be_removed2_bla','f']
]})
print(df)
# A B
# 0 [1, 2] [str_to_be_removed1_bla, str_to_be_removed1_bla]
# 1 [2, 3] [str_to_be_removed1_bla, b]
# 2 [3, 4, 5, 6] [b, b, c, d]
# 3 [10, 11, 12] [str_to_be_removed2_bla_bla, str_to_be_removed2_bla, f]
df_obj = df.apply(lambda x: x.explode()).reset_index()
df_obj['tag'] = df_obj['B'].str.extract(r'(str_to_be_removed1|str_to_be_removed2)')
print(df_obj)
# index A B tag
# 0 0 1 str_to_be_removed1_bla str_to_be_removed1
# 1 0 2 str_to_be_removed1_bla str_to_be_removed1
# 2 1 2 str_to_be_removed1_bla str_to_be_removed1
# 3 1 3 b NaN
# 4 2 3 b NaN
# 5 2 4 b NaN
# 6 2 5 c NaN
# 7 2 6 d NaN
# 8 3 10 str_to_be_removed2_bla_bla str_to_be_removed2
# 9 3 11 str_to_be_removed2_bla str_to_be_removed2
# 10 3 12 f NaN
cond = df_obj['tag'].notnull()
df1 = df_obj[cond].drop_duplicates(['index', 'tag'], keep='first')
df2 = df_obj[~cond]
df3 = pd.concat([df1, df2]).sort_index()
print(df3.groupby('index').agg({'A':list, 'B':list}))
# A B
# index
# 0 [1] [str_to_be_removed1_bla]
# 1 [2, 3] [str_to_be_removed1_bla, b]
# 2 [3, 4, 5, 6] [b, b, c, d]
# 3 [10, 12] [str_to_be_removed2_bla_bla, f]
Here's another approach. You give it the search string, and it removes the duplicate matches from each list.
import pandas as pd
pd.set_option('display.max_colwidth', 500)
df = pd.DataFrame({'A':[[1,2],[2,3],[3,4,5],[10,11,12]],
'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
['str_to_be_removed_bla','b'],
['b','c','d'],
['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']]})
print (df)
search_str = 'str_to_be_removed'
#explode each column into its own rows
dfTemp = df.apply(lambda x: x.explode())
#flag columns that contains search string
dfTemp['Found'] = dfTemp.B.str.contains(search_str)
#cumcount by Index and by search strings to get duplicates
dfTemp['Bx'] = dfTemp.groupby([dfTemp.index,'Found']).cumcount()
#Exclude all records where search string was found and count is more than 1
#(note the parentheses: & binds tighter than >)
dfTemp = dfTemp[~(dfTemp['Found'] & (dfTemp['Bx'] > 0))]
#Extract back all the records into the dataframe and store as list
df_final = dfTemp.groupby(dfTemp.index).agg({'A':list, 'B':list})
del dfTemp
print (df_final)
The output of this is:
Original dataframe:
   A             B
0  [1, 2]        [str_to_be_removed_bla, str_to_be_removed_bla]
1  [2, 3]        [str_to_be_removed_bla, b]
2  [3, 4, 5]     [b, c, d]
3  [10, 11, 12]  [str_to_be_removed_bla_bla, str_to_be_removed_bla, f]
search string: 'str_to_be_removed'
Final dataframe:
   A          B
0  [1]        [str_to_be_removed_bla]
1  [2, 3]     [str_to_be_removed_bla, b]
2  [3, 4, 5]  [b, c, d]
3  [10, 12]   [str_to_be_removed_bla_bla, f]
Let us try dict (duplicate keys collapse, so only one entry per string survives):
out = pd.DataFrame(
    [[list(dict(zip(y[::-1], x[::-1])).values())[::-1],
      list(dict(zip(y[::-1], x[::-1])).keys())[::-1]]
     for x, y in zip(df.A, df.B)])
out
        0       1
0     [1]     [a]
1  [2, 2]  [a, b]
Sample dataframe
df = pd.DataFrame({'A':[[1,2],[2,2]],'B':[['a','a'],['a','b']]})
I don't think a native pandas solution is possible for this case, and even if there is one, you are not likely to get a performance gain from it. A traditional for loop might work better for your case:
def remove_dupes(a, b, dupe_pattern):
    seen_dupe = False
    _a, _b = [], []
    for x, y in zip(a, b):
        if dupe_pattern in y:
            if not seen_dupe:
                seen_dupe = True
                _a.append(x)
                _b.append(y)
        else:
            _a.append(x)
            _b.append(y)
    return _a, _b
_A, _B = [], []
for a, b in zip(df.A, df.B):
    _a, _b = remove_dupes(a, b, 'str_to_be_removed')
    _A.append(_a)
    _B.append(_b)
df['A'], df['B'] = _A, _B
print(df)
#    A          B
# 0  [1]        [str_to_be_removed_bla]
# 1  [2, 3]     [str_to_be_removed_bla, b]
# 2  [3, 4, 5]  [b, c, d]
# 3  [10, 12]   [str_to_be_removed_bla_bla, f]

Compare pandas data frame indices and update the rows

I have two Excel files which I read with pandas. I am comparing the index in file 1 with the index in file 2 (they are not the same length, e.g. 10 vs. 100); where the indices match, the corresponding row in the second file should become zeros, and otherwise stay unchanged. I am using for and if loops for this, but the more data I want to process (1e3, 5e3 rows), the longer the run time becomes. Is there a better way to perform such a comparison? Here's an example of what I am using:
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=[4, 5, 6], columns=['A', 'B', 'C'])
df1 = pd.DataFrame([['w'], ['y'], ['z']],
                   index=[4, 5, 1])
for j in df1.index:
    for i in df.index:
        if i == j:
            df.loc[i, :] = 0
        else:
            df.loc[i, :] = df.loc[i, :]
print(df)
Loops are not necessary here; you can set rows to 0 with DataFrame.mask and Series.isin (the index must be converted to a Series to avoid ValueError: Array conditional must be same shape as self):
df = df.mask(df.index.to_series().isin(df1.index), 0)
Or with Index.isin and numpy.where if want improve performance:
arr = np.where(df.index.isin(df1.index)[:, None], 0, df)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df)
    A   B   C
4   0   0   0
5   0   0   0
6  10  20  30

Get the most repeated non-zero value from list columns in a pandas data frame

I have the following data frame:
df1 =
mac           gw_mac        ibeaconMajor  ibeaconMinor
ac233f264920  ac233fc015f6  [1, 0, 1]     [1, 0]
ac233f26492b  ac233fc015f6  [0, 0, 0]     [0, 0]
ac233f264933  ac233fc015f6  [0, 1, 1]     [0, 2]
If all the values in a list (from the columns "ibeaconMajor" and "ibeaconMinor") are 0, it should return 0; otherwise it should return the most frequently occurring non-zero value from the list, as below:
df1=
mac           gw_mac        ibeaconMajor  ibeaconMinor
ac233f264920  ac233fc015f6  1             1
ac233f26492b  ac233fc015f6  0             0
ac233f264933  ac233fc015f6  1             2
The idea is to use DataFrame.applymap to apply a lambda function elementwise: first remove the 0 values with a list comprehension, get the top value with Counter.most_common, and wrap it in next with iter so that 0 is returned when all values are 0 (tuples are used so that the first value of the tuple can be selected):
from collections import Counter
cols = ['ibeaconMajor','ibeaconMinor']
f = lambda x: next(iter(Counter([y for y in x if y != 0]).most_common(1)), (0,))[0]
#alternative
#f = lambda x: next(iter(Counter(filter(lambda y: y != 0, x)).most_common(1)), (0,))[0]
df[cols] = df[cols].applymap(f)
print (df)
            mac        gw_mac  ibeaconMajor  ibeaconMinor
0  ac233f264920  ac233fc015f6             1             1
1  ac233f26492b  ac233fc015f6             0             0
2  ac233f264933  ac233fc015f6             1             2

python 3 functools.reduce gives 0 if not literal list given

I am trying reduce in Python and getting 0 when I expect other values:
In [1]: from functools import reduce
In [2]: reduce( (lambda x, y: x * y), [1, 2, 3, 4] )
Out[2]: 24
In [3]: def multiply(x, y): return x * y
In [4]: reduce( multiply, [1, 2, 3, 4] )
Out[4]: 24
In [5]: reduce( multiply, range(5) )
Out[5]: 0
In [6]: reduce( multiply, list(range(5)) )
Out[6]: 0
[...]
In [11]: L = list(range(5))
In [12]: L
Out[12]: [0, 1, 2, 3, 4]
In [13]: reduce(multiply, L)
Out[13]: 0
Why am I getting 0 when not entering a literal list? How can I reduce arbitrary lists? What am I missing?
Python's range starts at 0, not 1, and 0 times anything is 0, which is why you always get 0. Start the range at 1 instead, as in the sketch below.
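A minimal sketch of the fix, using operator.mul (equivalent to the multiply function above):
from functools import reduce
from operator import mul

# range(5) yields 0, 1, 2, 3, 4, so the product collapses to 0
print(reduce(mul, range(5)))     # 0
# start the range at 1 to get the product of 1..4
print(reduce(mul, range(1, 5)))  # 24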
