Pandas: mask dataframe by a rolling window - python-3.x

I have a dataframe df_snow_or_ice which indicates whether there is snow on a given day, as follows:
df_snow_or_ice
Out[63]:
SWE
datetime_doy
2007-01-01 0.000000
2007-01-02 0.000000
2007-01-03 0.000000
2007-01-04 0.000000
2007-01-05 0.000000
...
2019-12-27 0.000000
2019-12-28 0.000000
2019-12-29 0.000000
2019-12-30 0.000000
2019-12-31 0.000064
[4748 rows x 1 columns]
And I also have a dataframe gpi_data_tmp that I want to mask based on whether there is snow or not (i.e. whether df_snow_or_ice['SWE'] > 0) within a rolling window of 42 days. That is, if df_snow_or_ice['SWE'] > 0 anywhere in the interval [d-21, d+21] around day d, then gpi_data_tmp.iloc[d] should be masked as np.nan. Written as a for-loop, it looks like this:
half_width = 21
for i in range(half_width, len(df_snow_or_ice) - half_width + 1, 1):
    if df_snow_or_ice['SWE'].iloc[i] > 0:
        gpi_data_tmp.iloc[(i - half_width):(i + half_width)] = np.nan

for i in range(len(df_snow_or_ice)):
    if df_snow_or_ice['SWE'].iloc[i] > 0:
        gpi_data_tmp.iloc[i] = np.nan
So how can I write this efficiently, using pandas functions? Thanks!
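One possible vectorized sketch (not from the original post): take a centered rolling maximum over the boolean snow flag and use it as a row mask. It assumes gpi_data_tmp has the same length and row order as df_snow_or_ice, and the exact window edges may need a small adjustment to match the loop above.
import numpy as np

half_width = 21
# 1.0 where there is snow, 0.0 otherwise
snow_flag = df_snow_or_ice['SWE'].gt(0).astype(float)
# True for any day whose centered 42-day window contains at least one snow day
snow_in_window = (snow_flag.rolling(window=2 * half_width, center=True, min_periods=1)
                           .max()
                           .astype(bool))
# mask the corresponding rows of gpi_data_tmp
gpi_data_tmp.loc[snow_in_window.values] = np.nan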

Related

Assign different colors to polydata in paraview

Trying to avoid defining multiple individual polygons/quads, I use polydata.
I need to define multiple polydata in a Matlab-generated vtk file, but each one should be assigned a different color (defined in a lookup table).
The following file gives an error and accepts only the first color, which it assigns to all polydata.
# vtk DataFile Version 5.1
vtk output
ASCII
DATASET POLYDATA
POINTS 12 float
0.500000 1.000000 0.000000
0.353553 1.000000 -0.353553
0.000000 1.000000 -0.500000
-0.353553 1.000000 -0.353553
-0.500000 1.000000 0.000000
-0.353553 1.000000 0.353553
0.000000 1.000000 0.500000
0.353553 1.000000 0.353553
0. 0. 0.
1. 1. 1.
2. 2. 2.
1. 2. 1.
POLYGONS 3 12
OFFSETS vtktypeint64
0 8 12
CONNECTIVITY vtktypeint64
0 1 2 3 4 5 6 7
9 10 11 12
CELL_DATA 2
SCALARS SMEARED float 1
LOOKUP_TABLE victor
0 1
LOOKUP_TABLE victor 1
1.000000 0.000000 0.000000 1.000000
0.000000 1.000000 0.000000 1.000000
The problem is the line LOOKUP_TABLE victor 1: it should be LOOKUP_TABLE victor 2, since you define 2 RGBA entries in your table.
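With that count corrected, the cell-data section at the end of the file reads:
CELL_DATA 2
SCALARS SMEARED float 1
LOOKUP_TABLE victor
0 1
LOOKUP_TABLE victor 2
1.000000 0.000000 0.000000 1.000000
0.000000 1.000000 0.000000 1.000000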

How to add path to texture in OBJ or MTL file?

I have the following problem:
My project consists of an .obj file, an .mtl file and a texture (.jpg).
I need to divide the texture into multiple files. But when I do that, the UV coordinates (after mapping and reverse mapping) end up the same across several files, which causes an error when viewing the obj in MeshLab.
How can I solve my problem?
MeshLab does support obj files with several texture files, simply by using a separate material for each texture. It is not clear whether you are generating your obj files with MeshLab or another program, so I'm not sure if this is a MeshLab-related question.
Here is a sample of a minimal multitexture .obj file (8 vertices, 4 triangles, 2 textures):
mtllib ./TextureDouble.obj.mtl
# 8 vertices, 8 vertices normals
vn 0.000000 0.000000 1.570796
v 0.000000 0.000000 0.000000
vn 0.000000 0.000000 1.570796
v 1.000000 0.000000 0.000000
vn 0.000000 0.000000 1.570796
v 1.000000 1.000000 0.000000
vn 0.000000 0.000000 1.570796
v 0.000000 1.000000 0.000000
vn 0.000000 0.000000 1.570796
v 2.000000 0.000000 0.000000
vn 0.000000 0.000000 1.570796
v 3.000000 0.000000 0.000000
vn 0.000000 0.000000 1.570796
v 3.000000 1.000000 0.000000
vn 0.000000 0.000000 1.570796
v 2.000000 1.000000 0.000000
# 4 coords texture
vt 0.000000 0.000000
vt 1.000000 0.000000
vt 1.000000 1.000000
vt 0.000000 1.000000
# 2 faces using material_0
usemtl material_0
f 1/1/1 2/2/2 3/3/3
f 1/1/1 3/3/3 4/4/4
# 4 coords texture
vt 0.000000 0.000000
vt 1.000000 0.000000
vt 1.000000 1.000000
vt 0.000000 1.000000
# 2 faces using material_1
usemtl material_1
f 5/5/5 6/6/6 7/7/7
f 5/5/5 7/7/7 8/8/8
And here is the TextureDouble.obj.mtl file. To test the files, you must provide 2 image files named TextureDouble_A.png and TextureDouble_B.png.
newmtl material_0
Ka 0.200000 0.200000 0.200000
Kd 1.000000 1.000000 1.000000
Ks 1.000000 1.000000 1.000000
Tr 1.000000
illum 2
Ns 0.000000
map_Kd TextureDouble_A.png
newmtl material_1
Ka 0.200000 0.200000 0.200000
Kd 1.000000 1.000000 1.000000
Ks 1.000000 1.000000 1.000000
Tr 1.000000
illum 2
Ns 0.000000
map_Kd TextureDouble_B.png

I have a problem understanding sklearn's TfidfVectorizer results

Given a corpus of 3 documents, for example:
sentences = ["This car is fast",
             "This car is pretty",
             "Very fast truck"]
I am calculating the tf-idf values by hand.
For document 1, and the word "car", I can find that:
TF = 1/4
IDF = log(3/2)
TF-IDF = 1/4 * log(3/2)
Same result should apply to document 2, since it has 4 words, and one of them is "car".
I have tried to apply this in sklearn, with the code below:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
data = {'text': sentences}
df = pd.DataFrame(data)
tv = TfidfVectorizer()
tfvector = tv.fit_transform(df.text)
print(pd.DataFrame(tfvector.toarray(), columns=tv.get_feature_names()))
And the result I get is:
car fast is pretty this truck very
0 0.500000 0.50000 0.500000 0.000000 0.500000 0.000000 0.000000
1 0.459854 0.00000 0.459854 0.604652 0.459854 0.000000 0.000000
2 0.000000 0.47363 0.000000 0.000000 0.000000 0.622766 0.622766
I can understand that sklearn uses L2 normalization, but still, shouldn't the tf-idf score of "car" be the same in the first two documents? Can anyone help me understand the results?
It is because of the normalization. If you pass the parameter norm=None, i.e. TfidfVectorizer(norm=None), you will get the following result, which has the same value for car:
car fast is pretty this truck very
0 1.287682 1.287682 1.287682 0.000000 1.287682 0.000000 0.000000
1 1.287682 0.000000 1.287682 1.693147 1.287682 0.000000 0.000000
2 0.000000 1.287682 0.000000 0.000000 0.000000 1.693147 1.693147
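For reference, the unnormalized value itself follows from sklearn's defaults (raw counts for tf, smooth_idf=True): idf(t) = ln((1 + n) / (1 + df(t))) + 1. A small sketch reproducing the 1.287682 for "car":
import numpy as np

n_docs = 3    # documents in the corpus
df_car = 2    # documents containing "car"
tf_car = 1    # raw count of "car" in document 1
idf_car = np.log((1 + n_docs) / (1 + df_car)) + 1   # smoothed idf: ln(4/3) + 1
print(tf_car * idf_car)                             # ~1.287682, as in the table above
With norm='l2' (the default) each row is then divided by its own Euclidean norm; document 2 contains the rarer word "pretty" and therefore has a larger norm, which is why the shared words end up with smaller values there than in document 1.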

pandas group by row wise conditions

I have a dataframe like this
import pandas as pd
raw_data = {'ID': ['101','101','101','101','101','102','102','103'],
            'Week': ['W01','W02','W03','W07','W08','W01','W02','W01'],
            'Orders': [15,15,10,15,15,5,10,10]}
df2 = pd.DataFrame(raw_data, columns=['ID','Week','Orders'])
I want row-by-row percentage changes within each group. How can I achieve this?
Using pct_change:
df2.groupby('ID').Orders.pct_change().add(1).fillna(0)
I find it weird that in my pandas version pct_change cannot be applied to a groupby object directly, so we have to compute it per group and flatten the results:
l = [list(x.pct_change()) for _, x in df2.groupby('ID').Orders]  # per-group pct_change as a list of lists
df2['New'] = sum(l, [])
df2.New = (df2.New + 1).fillna(0)
df2
Out[606]:
ID Week Orders New
0 101 W01 15 0.000000
1 101 W02 15 1.000000
2 101 W03 10 0.666667
3 101 W07 15 1.500000
4 101 W08 15 1.000000
5 102 W01 5 0.000000
6 102 W02 10 2.000000
7 103 W01 10 0.000000
Carry out a window operation, shifting the values by 1 position within each group:
df2['prev']=df2.groupby(by='ID').Orders.shift(1).fillna(0)
Then calculate the % change row by row using apply():
df2['pct'] = df2.apply(lambda x : ((x['Orders'] - x['prev']) / x['prev']) if x['prev'] != 0 else 0,axis=1)
I am not sure if there is any default pd.pct_change() within a window.
ID Week Orders prev pct
0 101 W01 15 0.0 0.000000
1 101 W02 15 15.0 0.000000
2 101 W03 10 15.0 -0.333333
3 101 W07 15 10.0 0.500000
4 101 W08 15 15.0 0.000000
5 102 W01 5 0.0 0.000000
6 102 W02 10 5.0 1.000000
7 103 W01 10 0.0 0.000000
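As a possible alternative sketch (not from the answer above), the same 'pct' column can be built without the row-wise apply by working with the shifted series directly, reusing the df2 defined in the question:
prev = df2.groupby('ID').Orders.shift(1)
# compute the change only where a valid, non-zero previous value exists, else 0
df2['pct'] = ((df2.Orders - prev) / prev).where(prev.notna() & prev.ne(0), 0)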

DataFrame remove useless columns

I use the following code to build and prepare my pandas dataframe:
# imports assumed from the snippet below (matplotlib aliased as plp, as used further down)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plp
from sklearn import preprocessing
from sklearn.cluster import KMeans

data = pd.read_csv('statistic.csv',
                   parse_dates=True, index_col=['DATE'], low_memory=False)
data[['QUANTITY']] = data[['QUANTITY']].apply(pd.to_numeric, errors='coerce')
data_extracted = data.groupby(['DATE', 'ARTICLENO'])['QUANTITY'].sum().unstack()
# replace string nan with numpy data type
data_extracted = data_extracted.fillna(value=np.nan)
# remove footer of csv file
data_extracted.index = pd.to_datetime(data_extracted.index.str[:-2],
                                      errors="coerce")
# resample to a one-week rhythm
data_resampled = data_extracted.resample('W-MON', label='left',
                                         loffset=pd.DateOffset(days=1)).sum()
# reduce to one year
data_extracted = data_extracted.loc['2015-01-01':'2015-12-31']
# fill possible NaNs with 1 (not 0, because of division by zero when doing pct_change)
data_extracted = data_extracted.replace([np.inf, -np.inf], np.nan).fillna(1)
data_pct_change = data_extracted.astype(float).pct_change(axis=0).replace(
    [np.inf, -np.inf], np.nan).fillna(0)
# actual dropping logic if a column has no values at all
data_pct_change.drop([col for col, val in data_pct_change.sum().iteritems()
                      if val == 0], axis=1, inplace=True)
normalized_modeling_data = preprocessing.normalize(data_pct_change,
                                                   norm='l2', axis=0)
normalized_data_headers = pd.DataFrame(normalized_modeling_data,
                                       columns=data_pct_change.columns)
normalized_modeling_data = normalized_modeling_data.transpose()
kmeans = KMeans(n_clusters=3, random_state=0).fit(normalized_modeling_data)
print(kmeans.labels_)
np.savetxt('log_2016.txt', kmeans.labels_, newline="\n")
for i, cluster_center in enumerate(kmeans.cluster_centers_):
    plp.plot(cluster_center, label='Center {0}'.format(i))
plp.legend(loc='best')
plp.show()
Unfortunately there are a lot of 0's in my dataframe (the articles don't all start on the same date, so if A starts in 2015 and B starts in 2016, B will be 0 throughout 2015).
Here is the grouped dataframe:
ARTICLENO 205123430604 205321436644 405659844106 305336746308
DATE
2015-01-05 9.0 6.0 560.0 2736.0
2015-01-19 2.0 1.0 560.0 3312.0
2015-01-26 NaN 5.0 600.0 2196.0
2015-02-02 NaN NaN 40.0 3312.0
2015-02-16 7.0 6.0 520.0 5004.0
2015-02-23 12.0 4.0 480.0 4212.0
2015-04-13 11.0 6.0 920.0 4230.0
And here is the corresponding percentage change:
ARTICLENO 205123430604 205321436644 405659844106 305336746308
DATE
2015-01-05 0.000000 0.000000 0.000000 0.000000
2015-01-19 -0.777778 -0.833333 0.000000 0.210526
2015-01-26 -0.500000 4.000000 0.071429 -0.336957
2015-02-02 0.000000 -0.800000 -0.933333 0.508197
2015-02-16 6.000000 5.000000 12.000000 0.510870
2015-02-23 0.714286 -0.333333 -0.076923 -0.158273
The factor 12 at 405659844106 is 'correct'
Here is another example from my dataframe:
ARTICLENO 305123446353 205423146377 305669846421 905135949255
DATE
2015-01-05 2175.0 200.0 NaN NaN
2015-01-19 2550.0 NaN NaN NaN
2015-01-26 925.0 NaN NaN NaN
2015-02-02 675.0 NaN NaN NaN
2015-02-16 1400.0 200.0 120.0 NaN
2015-02-23 6125.0 320.0 NaN NaN
And the corresponding percentage change:
ARTICLENO 305123446353 205423146377 305669846421 905135949255
DATE
2015-01-05 0.000000 0.000000 0.000000 0.000000
2015-01-19 0.172414 -0.995000 0.000000 -0.058824
2015-01-26 -0.637255 0.000000 0.000000 0.047794
2015-02-02 -0.270270 0.000000 0.000000 -0.996491
2015-02-16 1.074074 199.000000 119.000000 279.000000
2015-02-23 3.375000 0.600000 -0.991667 0.310714
As you can see, there are changes by a factor of 200-300 that come from the jump between the replaced NaN (filled with 1) and the first real value.
This data is used for k-means clustering, and such 'nonsense' data ruins my k-means centers.
Does anyone have an idea how to remove such columns?
I used the following statement to drop the nonsense columns:
max_nan_value_count = 5
data_extracted = data_extracted.drop(
    data_extracted.columns[data_extracted.apply(
        lambda col: col.isnull().sum() > max_nan_value_count)],
    axis=1)
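A possibly simpler equivalent (a sketch, assuming the same max_nan_value_count) is pandas' dropna with a thresh, which keeps only the columns that have enough non-NaN values:
# keep a column only if it has at least len(data_extracted) - max_nan_value_count non-NaN values
data_extracted = data_extracted.dropna(axis=1, thresh=len(data_extracted) - max_nan_value_count)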
