I have a dataframe with a column whose values look as below:
colA desired
[0] 0
[3, 1] 3,1
[3, 1, 2] 3,1,2
[3, 1] 3,1
The type for colA is object.
Is there a way to convert it to the desired comma-separated string?
Thanks
Without a lambda (which will be slower for larger data sets), you can simply cast the list to a string and strip the unwanted characters:
import pandas as pd
df = pd.DataFrame(data={'colA':[[0], [3,1],[3,1,2],[3,1]]})
df['desired'] = df.colA.astype(str).str.replace(r"\[|\]|'", '', regex=True)
df
Output:
colA desired
0 [0] 0
1 [3, 1] 3, 1
2 [3, 1, 2] 3, 1, 2
3 [3, 1] 3, 1
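Whether the string cast actually beats apply depends on your data and pandas version; here is a minimal timeit sketch to measure both on your own machine (the 25_000 repeat factor is an arbitrary choice to get a larger frame):
import timeit
import pandas as pd

big = pd.DataFrame({'colA': [[0], [3, 1], [3, 1, 2], [3, 1]] * 25_000})

# vectorized string cast + regex strip (this answer)
t_replace = timeit.timeit(
    lambda: big.colA.astype(str).str.replace(r"\[|\]|'", '', regex=True),
    number=5)
# per-row join via apply (the answer below)
t_apply = timeit.timeit(
    lambda: big.colA.apply(lambda x: ','.join(map(str, x))),
    number=5)
print(f'astype+replace: {t_replace:.3f}s  apply+join: {t_apply:.3f}s')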
Try:
df = pd.DataFrame(data={'colA':[[0], [3,1],[3,1,2],[3,1], [4]]})
df['desired'] = df['colA'].apply(lambda x: ','.join(map(str, x)))
OUTPUT:
colA desired
0 [0] 0
1 [3, 1] 3,1
2 [3, 1, 2] 3,1,2
3 [3, 1] 3,1
4 [4] 4
If colA is an object column holding string representations of lists:
from ast import literal_eval
df = pd.DataFrame(data={'colA':["[0]", "[3,1]","[3,1,2]","[3,1]", "[4]"]})
df['desired'] = df['colA'].apply(lambda x: ','.join(map(str, literal_eval(x))))
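If the strings are as regular as these, you can also skip parsing entirely and just strip the surrounding brackets; a sketch, only safe when every cell is a single flat bracketed list:
# remove the leading '[' and trailing ']' from each string
df['desired'] = df['colA'].str.strip('[]')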
You can use str.replace:
df['desired'] = df['colA'].str.replace(r'[][]', '', regex=True)
Prints:
colA desired
0 [0] 0
1 [3, 1] 3, 1
2 [3, 1, 2] 3, 1, 2
3 [3, 1] 3, 1
In the character class [][], the first ] (placed immediately after the opening [) is treated as a literal, so the pattern matches both square brackets.
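Note this assumes colA holds strings; if the cells are actual Python lists, the .str methods return NaN, and a plain list comprehension produces exactly the desired output (a minimal sketch):
# join each row's list items with commas (no spaces), matching 'desired'
df['desired'] = [','.join(map(str, cells)) for cells in df['colA']]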
I have a Pandas dataframe like this:
df =
A B
[1, 5, 8, 10] [str1, str_to_be_removed*, str_to_be_removed*, str2]
[4, 7, 10] [str2, str5, str10]
[5, 2, 7, 9, 15] [str6, str2, str_to_be_removed*, str_to_be_removed*, str_to_be_removed*]
... ...
Given str_to_be_removed, I would like to keep only the first instance of the strings which contain str_to_be_removed in column B and remove the other ones. In addition, I would also like to remove the corresponding ids from A. The lists in A and B have the same length in each row.
How to do this?
Desired Output:
df =
A B
[1, 5, 10] [str1, str_to_be_removed*, str2]
[4, 7, 10] [str2, str5, str10]
[5, 2, 7] [str6, str2, str_to_be_removed*]
EDIT:
So, this is the sample df:
df = pd.DataFrame({'A': [[1, 2], [2, 3], [3, 4, 5], [10, 11, 12]],
                   'B': [['str_to_be_removed_bla', 'str_to_be_removed_bla'],
                         ['str_to_be_removed_bla', 'b'],
                         ['b', 'c', 'd'],
                         ['str_to_be_removed_bla_bla', 'str_to_be_removed_bla', 'f']]})
Desired Output:
df =
A B
[1] [str_to_be_removed_bla]
[2,3] [str_to_be_removed_bla, b]
[3,4,5] [b, c, d]
[10, 12] [str_to_be_removed_bla_bla,f]
Steps:
use zip to combine related elements together.
unnest the zipped list with explode.
drop duplicates, keep the first.
groupby index, agg as list.
df = pd.DataFrame({'A':[
[1,2],[2,3],[3,4,5],[10, 11, 12]
],
'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
['str_to_be_removed_bla','b'],
['b', 'c', 'd'],
['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']
]})
# zip and explode step
# df_obj = df.apply(lambda x: list(zip(x.A, x.B)),
# axis = 1
# ).to_frame()
# df_obj = df_obj.explode(0).reset_index()
# df_obj['A'] = df_obj[0].str[0]
# df_obj['B'] = df_obj[0].str[1]
Update: with @Joe Ferndz's solution, the steps simplify.
# explode the columns
dfTemp = df.apply(lambda x: x.explode())
df_obj = dfTemp.reset_index()
# more conditions can be added to the filter before drop_duplicates
cond = df_obj['B'].astype(str).str.contains("str_to_be_removed")
df1 = df_obj[cond].drop_duplicates(['index'], keep='first')
df2 = df_obj[~cond]
df3 = pd.concat([df1, df2]).sort_index()
result = df3.groupby('index').agg({'A':list, 'B':list})
result
A B
index
0 [1] [str_to_be_removed_bla]
1 [2, 3] [str_to_be_removed_bla, b]
2 [3, 4, 5] [b, c, d]
3 [10, 12] [str_to_be_removed_bla_bla, f]
To extend this solution to multiple patterns (str_to_be_removed1, str_to_be_removed2) in one step:
df = pd.DataFrame({'A':[
[1,2],[2,3],[3,4,5, 6],[10, 11, 12]
],
'B':[['str_to_be_removed1_bla','str_to_be_removed1_bla'],
['str_to_be_removed1_bla','b'],
['b', 'b', 'c', 'd'],
['str_to_be_removed2_bla_bla','str_to_be_removed2_bla','f']
]})
print(df)
# A B
# 0 [1, 2] [str_to_be_removed1_bla, str_to_be_removed1_bla]
# 1 [2, 3] [str_to_be_removed1_bla, b]
# 2 [3, 4, 5, 6] [b, b, c, d]
# 3 [10, 11, 12] [str_to_be_removed2_bla_bla, str_to_be_removed2_bla, f]
df_obj = df.apply(lambda x: x.explode()).reset_index()
df_obj['tag'] = df_obj['B'].str.extract(r'(str_to_be_removed1|str_to_be_removed2)')
print(df_obj)
# index A B tag
# 0 0 1 str_to_be_removed1_bla str_to_be_removed1
# 1 0 2 str_to_be_removed1_bla str_to_be_removed1
# 2 1 2 str_to_be_removed1_bla str_to_be_removed1
# 3 1 3 b NaN
# 4 2 3 b NaN
# 5 2 4 b NaN
# 6 2 5 c NaN
# 7 2 6 d NaN
# 8 3 10 str_to_be_removed2_bla_bla str_to_be_removed2
# 9 3 11 str_to_be_removed2_bla str_to_be_removed2
# 10 3 12 f NaN
cond = df_obj['tag'].notnull()
df1 = df_obj[cond].drop_duplicates(['index', 'tag'], keep='first')
df2 = df_obj[~cond]
df3 = pd.concat([df1, df2]).sort_index()
print(df3.groupby('index').agg({'A':list, 'B':list}))
# A B
# index
# 0 [1] [str_to_be_removed1_bla]
# 1 [2, 3] [str_to_be_removed1_bla, b]
# 2 [3, 4, 5, 6] [b, b, c, d]
# 3 [10, 12] [str_to_be_removed2_bla_bla, f]
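If the pattern list keeps growing, the extraction regex can be built programmatically instead of written by hand (a sketch; patterns is an illustrative name):
import re

patterns = ['str_to_be_removed1', 'str_to_be_removed2']
# re.escape guards against regex metacharacters inside the patterns;
# str.extract needs the capture group to return the matched pattern
regex = '(' + '|'.join(map(re.escape, patterns)) + ')'
df_obj['tag'] = df_obj['B'].str.extract(regex)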
Here's another approach. You give it the search string and it removes the duplicate matches from the lists.
import pandas as pd
pd.set_option('display.max_colwidth', 500)
df = pd.DataFrame({'A':[[1,2],[2,3],[3,4,5],[10,11,12]],
'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
['str_to_be_removed_bla','b'],
['b','c','d'],
['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']]})
print (df)
search_str = 'str_to_be_removed'
#explode each column into its own rows
dfTemp = df.apply(lambda x: x.explode())
#flag columns that contains search string
dfTemp['Found'] = dfTemp.B.str.contains(search_str)
#cumcount by Index and by search strings to get duplicates
dfTemp['Bx'] = dfTemp.groupby([dfTemp.index,'Found']).cumcount()
#Exclude all records where search string was found and count is more than 1
dfTemp = dfTemp[~(dfTemp['Found'] & (dfTemp['Bx'] > 0))]
#Extract back all the records into the dataframe and store as list
df_final = dfTemp.groupby(dfTemp.index).agg({'A':list, 'B':list})
del dfTemp
print (df_final)
The output of this is:
Original dataframe:
A B
0 [1, 2] [str_to_be_removed_bla, str_to_be_removed_bla]
1 [2, 3] [str_to_be_removed_bla, b]
2 [3, 4, 5] [b, c, d]
3 [10, 11, 12] [str_to_be_removed_bla_bla, str_to_be_removed_bla, f]
search string: 'str_to_be_removed'
Final dataframe:
A B
0 [1] [str_to_be_removed_bla]
1 [2, 3] [str_to_be_removed_bla, b]
2 [3, 4, 5] [b, c, d]
3 [10, 12] [str_to_be_removed_bla_bla, f]
Let us try dict
out = pd.DataFrame(
    [[list(dict(zip(y[::-1], x[::-1])).values())[::-1],
      list(dict(zip(y[::-1], x[::-1])).keys())[::-1]]
     for x, y in zip(df.A, df.B)])
out
0 1
0 [1] [a]
1 [2, 2] [a, b]
Sample dataframe
df = pd.DataFrame({'A':[[1,2],[2,2]],'B':[['a','a'],['a','b']]})
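To see why the double reversal keeps the first occurrence rather than the last, here is a standalone sketch of the dict trick (note it dedupes on exact equality of the B values, not on a substring pattern):
x, y = [1, 2], ['a', 'a']
# zip the reversed lists: [('a', 2), ('a', 1)]
# later dict assignments overwrite earlier ones, so the pair from the
# *first* original position, ('a', 1), is the one that survives
d = dict(zip(y[::-1], x[::-1]))  # {'a': 1}
# dicts preserve insertion order, so reversing restores the original order
print(list(d.values())[::-1], list(d.keys())[::-1])  # [1] ['a']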
I don't think a native pandas solution is possible for this case, and even if there is one, you are unlikely to see a performance gain from it, since the lists are Python objects that pandas cannot vectorize over. A traditional for loop might work better for your case:
def remove_dupes(a, b, dupe_pattern):
    seen_dupe = False
    _a, _b = [], []
    for x, y in zip(a, b):
        if dupe_pattern in y:
            if not seen_dupe:
                seen_dupe = True
                _a.append(x)
                _b.append(y)
        else:
            _a.append(x)
            _b.append(y)
    return _a, _b
_A, _B = [], []
for a, b in zip(df.A, df.B):
    _a, _b = remove_dupes(a, b, 'str_to_be_removed')
    _A.append(_a)
    _B.append(_b)
df['A'], df['B'] = _A, _B
print(df)
# A B
#0 [1] [str_to_be_removed_bla]
#1 [2, 3] [str_to_be_removed_bla, b]
#2 [3, 4, 5] [b, c, d]
#3 [10, 12] [str_to_be_removed_bla_bla, f]
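If you prefer to keep the call inside pandas syntactically, the same loop can be wrapped in DataFrame.apply (a sketch; it is still a Python-level loop, so don't expect a speedup):
# remove_dupes returns an (A, B) tuple per row; result_type='expand'
# spreads the tuple across two columns
out = df.apply(lambda row: remove_dupes(row.A, row.B, 'str_to_be_removed'),
               axis=1, result_type='expand')
df['A'], df['B'] = out[0], out[1]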
I have two Excel files which I read with pandas. I am comparing the index in file 1 with the index in file 2 (they are not the same length, e.g. 10 vs. 100); if they match, row[index] in the second file is set to zeros, and otherwise it is left unchanged. I am using for and if loops for this, but the more data I want to process (1e3, 5e3 rows), the longer the run time becomes. Is there a better way to perform such a comparison? Here's an example of what I am using.
df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=[4, 5, 6], columns=['A', 'B', 'C'])
df1 = pd.DataFrame([['w'], ['y'], ['z']],
                   index=[4, 5, 1])
for j in df1.index:
    for i in df.index:
        if i == j:
            df.loc[i, :] = 0
        else:
            df.loc[i, :] = df.loc[i, :]
print(df)
Loops are not necessary here; you can set whole rows to 0 with DataFrame.mask and Series.isin (converting the index to a Series is necessary to avoid ValueError: Array conditional must be same shape as self):
df = df.mask(df.index.to_series().isin(df1.index), 0)
Or with Index.isin and numpy.where if you want to improve performance:
import numpy as np
# [:, None] broadcasts the boolean row mask across all columns
arr = np.where(df.index.isin(df1.index)[:, None], 0, df)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df)
A B C
4 0 0 0
5 0 0 0
6 10 20 30
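For completeness, a minimal in-place variant (assuming mutating df is acceptable):
# boolean mask selects the matching rows; assignment zeroes them in place
df.loc[df.index.isin(df1.index)] = 0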