I have a sparse matrix of size (n x m):
sparse_dtm = dok_matrix((num_documents, vocabulary_size), dtype=np.float32)
for doc_index, document in enumerate(data):
    document_counter = Counter(document)
    for word in set(document):
        sparse_dtm[doc_index, word_to_index[word]] = document_counter[word]
Where:
num_documents = n
vocabulary_size = m
data = list of tokenized lists
Also, I have a list with length n:
sums = sparse_dtm.sum(1).tolist()
Now, I want to do an element-wise division in which each cell of row_i in sparse_dtm is divided by sums[i].
A naive approach, using the traditional Python element-wise division:
sparse_dtm / sums
Leads to the following error:
TypeError: unsupported operand type(s) for /: 'csr_matrix' and 'list'
How can I perform this element-wise division?
If I understand correctly, you need to divide each row by its row sum, is that correct?
In that case, you'd need to reshape the sums into a column:
sparse_dtm / sparse_dtm.sum(1).reshape(-1, 1)
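If you start from the plain Python list sums instead, a minimal sketch (assuming numpy is imported as np and keeping the variable names from the question) is to turn it into an (n, 1) array first so the division broadcasts row-wise; note that scipy returns a dense result here:
import numpy as np

row_sums = np.asarray(sums, dtype=np.float32).reshape(-1, 1)   # shape (n, 1)
row_normalized = sparse_dtm / row_sums                         # dense result, one row per document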
You can also do it with a pandas DataFrame, for example:
row_num = 10
col_num = 5
sparse_dtm = np.ndarray((row_num, col_num), dtype=np.float32)
for row in range(row_num):
    for col in range(col_num):
        value = (row+1) * (col+2)
        sparse_dtm[row, col] = value
df = pd.DataFrame(sparse_dtm)
print(df)
gives
0 1 2 3 4
0 2.0 3.0 4.0 5.0 6.0
1 4.0 6.0 8.0 10.0 12.0
2 6.0 9.0 12.0 15.0 18.0
3 8.0 12.0 16.0 20.0 24.0
4 10.0 15.0 20.0 25.0 30.0
5 12.0 18.0 24.0 30.0 36.0
6 14.0 21.0 28.0 35.0 42.0
7 16.0 24.0 32.0 40.0 48.0
8 18.0 27.0 36.0 45.0 54.0
9 20.0 30.0 40.0 50.0 60.0
and then divide each row by its row sum:
df / df.sum(axis=1).values.reshape(-1, 1)
that gives
0 1 2 3 4
0 0.1 0.15 0.2 0.25 0.3
1 0.1 0.15 0.2 0.25 0.3
2 0.1 0.15 0.2 0.25 0.3
3 0.1 0.15 0.2 0.25 0.3
4 0.1 0.15 0.2 0.25 0.3
5 0.1 0.15 0.2 0.25 0.3
6 0.1 0.15 0.2 0.25 0.3
7 0.1 0.15 0.2 0.25 0.3
8 0.1 0.15 0.2 0.25 0.3
9 0.1 0.15 0.2 0.25 0.3
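As a side note, pandas also has a div method that aligns along an axis, which avoids the manual reshape (an equivalent spelling of the operation above):
df.div(df.sum(axis=1), axis=0)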
In [189]: M = sparse.dok_matrix([[0,1,3,0],[0,0,2,0],[1,0,0,0]])
In [190]: M
Out[190]:
<3x4 sparse matrix of type '<class 'numpy.int64'>'
with 4 stored elements in Dictionary Of Keys format>
In [191]: M.A
Out[191]:
array([[0, 1, 3, 0],
[0, 0, 2, 0],
[1, 0, 0, 0]])
sum(1) produces a (3,1) np.matrix, which can be used directly in the division:
In [192]: M.sum(1)
Out[192]:
matrix([[4],
[2],
[1]])
In [193]: M/M.sum(1)
Out[193]:
matrix([[0. , 0.25, 0.75, 0. ],
[0. , 0. , 1. , 0. ],
[1. , 0. , 0. , 0. ]])
Note that the result is a dense np.matrix, not sparse.
This could give problems if a row sum were 0, but with your construction that might not be possible.
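As an aside, if an all-zero row could occur, one possible guard (my own addition, not part of the construction above) is to replace any zero sum with 1 before dividing, so those rows stay all zeros instead of becoming NaN:
import numpy as np

row_sums = np.asarray(M.sum(1), dtype=float)   # shape (3, 1)
row_sums[row_sums == 0] = 1.0                  # avoid 0/0 for empty rows
M / row_sums                                   # dense result, as above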
We can retain a sparse result by first converting the sums to sparse. I'm using the inverse because there isn't a sparse element-wise division (it would have to deal with all those implicit 0s):
In [205]: D=sparse.csr_matrix(1/M.sum(1))
In [206]: D
Out[206]:
<3x1 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in Compressed Sparse Row format>
In [207]: D.A
Out[207]:
array([[0.25],
[0.5 ],
[1. ]])
In [208]: D.multiply(M)
Out[208]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [209]: _.A
Out[209]:
array([[0. , 0.25, 0.75, 0. ],
[0. , 0. , 1. , 0. ],
[1. , 0. , 0. , 0. ]])
sklearn also has some useful sparse matrix utilities; normalize does this row scaling directly:
In [210]: from sklearn import preprocessing
In [211]: preprocessing.normalize(M, norm='l1', axis=1)
Out[211]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
In [212]: _.A
Out[212]:
array([[0. , 0.25, 0.75, 0. ],
[0. , 0. , 1. , 0. ],
[1. , 0. , 0. , 0. ]])
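Applied to the document-term matrix from the question, that call would look roughly like this (a sketch; converting to CSR first keeps things explicit):
from sklearn import preprocessing

# counts are non-negative, so the L1 norm of a row equals its row sum
dtm_row_norm = preprocessing.normalize(sparse_dtm.tocsr(), norm='l1', axis=1)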
Related
We can simply calculate the mean along an axis:
import pandas as pd
df=pd.DataFrame({'A':[1,1,0,1,0,1,1,0,1,1,1],
'b':[1,1,0,1,0,1,1,0,1,1,1],
'c':[1,1,0,1,0,1,1,0,1,1,1]})
# max_of_three columns
mean= np.max(df.mean(axis=1))
How do I do the same thing with a rolling mean?
I tried 1:
# max_of_three columns
mean=df.rolling(2).mean(axis=1)
got this error:
UnsupportedFunctionCall: numpy operations are not valid with window objects. Use .rolling(...).mean() instead
I tried 2:
def tt(x):
    x = pd.DataFrame(x)
    b1 = np.max(x.mean(axis=1))
    return b1
# max_of_three columns
mean=df.rolling(2).apply(tt,raw=True)
But this gives me three columns in the result, when really there should be one value for each moving window.
Where am I making a mistake? Or is there any other efficient way of doing this?
Use the axis argument of rolling:
df.rolling(2, axis=0).mean()
>>> A b c
0 NaN NaN NaN
1 1.0 1.0 1.0
2 0.5 0.5 0.5
3 0.5 0.5 0.5
4 0.5 0.5 0.5
5 0.5 0.5 0.5
6 1.0 1.0 1.0
7 0.5 0.5 0.5
8 0.5 0.5 0.5
9 1.0 1.0 1.0
10 1.0 1.0 1.0
r = df.rolling(2, axis=1).mean()
r
>>> A b c
0 NaN 1.0 1.0
1 NaN 1.0 1.0
2 NaN 0.0 0.0
3 NaN 1.0 1.0
4 NaN 0.0 0.0
5 NaN 1.0 1.0
6 NaN 1.0 1.0
7 NaN 0.0 0.0
8 NaN 1.0 1.0
9 NaN 1.0 1.0
10 NaN 1.0 1.0
r.max()
>>> A NaN
b 1.0
c 1.0
dtype: float64
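If instead the goal is literally one value per moving window (the rolling mean of the per-row means), a different reading of the question — my interpretation, not shown above — is to collapse the columns first and then roll:
row_means = df.mean(axis=1)              # one value per row
windowed = row_means.rolling(2).mean()   # one value per 2-row window
windowed.max()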
I am using Python 3.6 and I have the output of the minisom package in the format below:
defaultdict(list,{(9,1):[array([0.1,0.3,0.5,0.9]),array([0.2,0.6,0.8,0.9])],(3,2):[array([1,3,5,9]),array([2,6,8,9])] })
and I would like to have my output (a pandas DataFrame) as shown below:
X Y V1 V2 V3 V4
9 1 0.1 0.3 0.5 0.9
9 1 0.2 0.6 0.8 0.9
3 2 1 3 5 9
3 2 2 6 8 9
I appreciate your help on this.
I would try something like this:
>>> x
defaultdict(<class 'list'>, {(9, 1): [array([0.1, 0.3, 0.5, 0.9]), array([0.2, 0.6, 0.8, 0.9])], (3, 2): [array([1, 3, 5, 9]), array([2, 6, 8, 9])]})
>>> df=pd.DataFrame()
>>> df[["X", "Y", "V1", "V2", "V3", "V4"]]=pd.DataFrame(pd.DataFrame.from_dict(x, orient="index").stack().reset_index().drop("level_1", axis=1).rename(columns={0: "val"}, inplace=False).apply(lambda x: [el_inner for el in x.values for el_inner in el], axis=1).to_list())
>>> df
X Y V1 V2 V3 V4
0 9 1 0.1 0.3 0.5 0.9
1 9 1 0.2 0.6 0.8 0.9
2 3 2 1.0 3.0 5.0 9.0
3 3 2 2.0 6.0 8.0 9.0
>>> df.dtypes
X int64
Y int64
V1 float64
V2 float64
V3 float64
V4 float64
dtype: object
Alternatively:
>>> df=pd.DataFrame.from_dict(x, orient="index").stack().reset_index().drop("level_1", axis=1).rename(columns={0: "val"}, inplace=False).apply(lambda x: pd.Series({"x": x.level_0[0], "y": x.level_0[1], "v1": x.val[0], "v2": x.val[1], "v3": x.val[2], "v4": x.val[3]}), axis=1)
>>> df
x y v1 v2 v3 v4
0 9.0 1.0 0.1 0.3 0.5 0.9
1 9.0 1.0 0.2 0.6 0.8 0.9
2 3.0 2.0 1.0 3.0 5.0 9.0
3 3.0 2.0 2.0 6.0 8.0 9.0
>>> df.dtypes
x float64
y float64
v1 float64
v2 float64
v3 float64
v4 float64
dtype: object
If you want to convert x and y to int:
>>> df[["x", "y"]]=df[["x", "y"]].astype(int)
>>> df
x y v1 v2 v3 v4
0 9 1 0.1 0.3 0.5 0.9
1 9 1 0.2 0.6 0.8 0.9
2 3 2 1.0 3.0 5.0 9.0
3 3 2 2.0 6.0 8.0 9.0
>>> df.dtypes
x int32
y int32
v1 float64
v2 float64
v3 float64
v4 float64
dtype: object
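A more explicit alternative (my own sketch; it should give the same frame, up to dtypes) is to build the rows with a plain loop before constructing the DataFrame:
import pandas as pd

rows = []
for (x_coord, y_coord), arrays in x.items():   # x is the defaultdict shown above
    for arr in arrays:
        rows.append([x_coord, y_coord, *arr])

df = pd.DataFrame(rows, columns=["X", "Y", "V1", "V2", "V3", "V4"])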
I have a pandas DataFrame structured as follows:
In[1]: df = pd.DataFrame({"A":[10, 15, 13, 18, 0.6],
"B":[20, 12, 16, 24, 0.5],
"C":[23, 22, 26, 24, 0.4],
"D":[9, 12, 17, 24, 0.8 ]})
Out[1]: df
A B C D
0 10.0 20.0 23.0 9.0
1 15.0 12.0 22.0 12.0
2 13.0 16.0 26.0 17.0
3 18.0 24.0 24.0 24.0
4 0.6 0.5 0.4 0.8
From here my goal is to filter multiple columns based on the values in the last row (index 4). More specifically, I need to keep those columns that have a value < 0.6 in the last row. The output should be a df structured as follows:
B C
0 20.0 23.0
1 12.0 22.0
2 16.0 26.0
3 24.0 24.0
4 0.5 0.4
I'm trying this:
In[2]: df[(df[["A", "B", "C", "D"]] < 0.6)]
but I get the following:
Out[2]:
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN 0.5 0.4 NaN
I even tried:
df[(df[["A", "B", "C", "D"]] < 0.6).all(axis=0)]
but it gives me an error; it doesn't work.
Is there anybody who can help me?
Use DataFrame.loc with : to return all rows, with a condition that compares the last row selected by DataFrame.iloc:
df1 = df.loc[:, df.iloc[-1] < 0.6]
print (df1)
B C
0 20.0 23.0
1 12.0 22.0
2 16.0 26.0
3 24.0 24.0
4 0.5 0.4
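The original attempt fails because df[...] < 0.6 builds an element-wise mask over the whole frame (hence all the NaN), while selecting columns needs one boolean per column. An equivalent spelling of the answer above, making that mask explicit:
mask = df.iloc[-1] < 0.6        # boolean Series indexed by column name
df1 = df[df.columns[mask]]      # same result as df.loc[:, mask]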
Python Version:3.6
Pandas Version:0.21.1
How do I get from
print(df_raw)
device_id temp_a temp_b temp_c
0 0 0.2 0.8 0.6
1 0 0.1 0.9 0.4
2 1 0.3 0.7 0.2
3 2 0.5 0.5 0.1
4 2 0.1 0.9 0.4
5 2 0.7 0.3 0.9
to
print(df_except2)
device_id temp_a temp_b temp_c temp_a_1 temp_b_1 temp_c_1 temp_a_2 \
0 0 0.2 0.8 0.6 0.1 0.9 0.4 NaN
1 1 0.3 0.7 0.2 NaN NaN NaN NaN
2 2 0.5 0.5 0.1 0.1 0.9 0.4 0.7
temp_b_2 temp_c_2
0 NaN NaN
1 NaN NaN
2 0.3 0.9
Code of data:
df_raw = pd.DataFrame({'device_id' : ['0','0','1','2','2','2'],
'temp_a' : [0.2,0.1,0.3,0.5,0.1,0.7],
'temp_b' : [0.8,0.9,0.7,0.5,0.9,0.3],
'temp_c' : [0.6,0.4,0.2,0.1,0.4,0.9],
})
print(df_raw)
df_except = pd.DataFrame({'device_id' : ['0','1','2'],
'temp_a':[0.2,0.3,0.5],
'temp_b':[0.8,0.7,0.5],
'temp_c':[0.6,0.2,0.1],
'temp_a_1':[0.1,None,0.1],
'temp_b_1':[0.9,None,0.9],
'temp_c_1':[0.4,None,0.4],
'temp_a_2':[None,None,0.7],
'temp_b_2':[None,None,0.3],
'temp_c_2':[None,None,0.9],
})
df_except2 = df_except[['device_id','temp_a','temp_b','temp_c','temp_a_1','temp_b_1','temp_c_1','temp_a_2','temp_b_2','temp_c_2']]
print(df_except2)
Note:
1. The number of rows per device_id is unknown.
2. I referred to the following answer:
Pandas Dataframe - How to combine multiple rows to one
But that answer can only deal with one column.
Use:
g = df_raw.groupby('device_id').cumcount()
df = df_raw.set_index(['device_id', g]).unstack().sort_index(axis=1, level=1)
df.columns = ['{}_{}'.format(i,j) if j != 0 else '{}'.format(i) for i, j in df.columns]
df = df.reset_index()
print (df)
device_id temp_a temp_b temp_c temp_a_1 temp_b_1 temp_c_1 temp_a_2 \
0 0 0.2 0.8 0.6 0.1 0.9 0.4 NaN
1 1 0.3 0.7 0.2 NaN NaN NaN NaN
2 2 0.5 0.5 0.1 0.1 0.9 0.4 0.7
temp_b_2 temp_c_2
0 NaN NaN
1 NaN NaN
2 0.3 0.9
Explanation:
First, count the position within each group with cumcount on column device_id.
Create a MultiIndex with set_index and the Series g.
Reshape with unstack.
Sort the second level of the columns MultiIndex with sort_index.
Rename the columns with a list comprehension.
Finally, reset_index to move device_id from the index back into a column.
The snippet below prints the intermediate objects these steps produce.
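A small sketch (my addition, using df_raw from the question) to inspect those intermediates:
g = df_raw.groupby('device_id').cumcount()
print(g.tolist())                   # [0, 1, 0, 0, 1, 2] - position of each row within its device_id
wide = df_raw.set_index(['device_id', g]).unstack()
print(wide.columns.tolist()[:4])    # [('temp_a', 0), ('temp_a', 1), ('temp_a', 2), ('temp_b', 0)]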
code:
import numpy as np

device_id_list = df_raw['device_id'].tolist()
device_id_list = list(np.unique(device_id_list))
append_df = pd.DataFrame()
for device_id in device_id_list:
    tmp_df = df_raw.query('device_id=="%s"' % (device_id))
    if len(tmp_df) > 1:
        one_raw_list = []
        for i in range(0, len(tmp_df)):
            one_raw_df = tmp_df.iloc[i:i+1]
            one_raw_list.append(one_raw_df)
        tmp_combine_df = pd.DataFrame()
        for i in range(0, len(one_raw_list)-1):
            next_raw = one_raw_list[i+1].drop(columns=['device_id']).reset_index(drop=True)
            new_name_list = []
            for old_name in list(next_raw.columns):
                new_name_list.append(old_name + '_' + str(i+1))
            next_raw.columns = new_name_list
            if i == 0:
                current_raw = one_raw_list[i].reset_index(drop=True)
                tmp_combine_df = pd.concat([current_raw, next_raw], axis=1)
            else:
                tmp_combine_df = pd.concat([tmp_combine_df, next_raw], axis=1)
        tmp_df = tmp_combine_df
    tmp_df_columns = tmp_df.columns
    append_df_columns = append_df.columns
    append_df = pd.concat([append_df, tmp_df], ignore_index=True)
    if len(tmp_df_columns) > len(append_df_columns):
        append_df = append_df[tmp_df_columns]
    else:
        append_df = append_df[append_df_columns]
print(append_df)
Output:
device_id temp_a temp_b temp_c temp_a_1 temp_b_1 temp_c_1 temp_a_2 \
0 0 0.2 0.8 0.6 0.1 0.9 0.4 NaN
1 1 0.3 0.7 0.2 NaN NaN NaN NaN
2 2 0.5 0.5 0.1 0.1 0.9 0.4 0.7
temp_b_2 temp_c_2
0 NaN NaN
1 NaN NaN
2 0.3 0.9
df = pd.DataFrame({'device_id' : ['0','0','1','2','2','2'],
'temp_a' : [0.2,0.1,0.3,0.5,0.1,0.7],
'temp_b' : [0.8,0.9,0.7,0.5,0.9,0.3],
'temp_c' : [0.6,0.4,0.2,0.1,0.4,0.9],
})
cols_of_interest = df.columns.drop('device_id')
df["C"] = "C_" + (df.groupby("device_id").cumcount() + 1).astype(str)
df.pivot_table(index="device_id", values=cols_of_interest, columns="C")
Output:
temp_a temp_b temp_c
C C_1 C_2 C_3 C_1 C_2 C_3 C_1 C_2 C_3
device_id
0 0.2 0.1 NaN 0.8 0.9 NaN 0.6 0.4 NaN
1 0.3 NaN NaN 0.7 NaN NaN 0.2 NaN NaN
2 0.5 0.1 0.7 0.5 0.9 0.3 0.1 0.4 0.9
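The pivot_table result has MultiIndex columns rather than the flat temp_a_1-style names asked for; if flat names are wanted, one way (my addition, numbering every occurrence including the first) is to flatten them afterwards:
wide = df.pivot_table(index="device_id", values=cols_of_interest, columns="C")
wide.columns = ["{}_{}".format(col, c.replace("C_", "")) for col, c in wide.columns]
wide = wide.reset_index()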
In the R programming language I can do the following:
x <- c(1, 8, 3, 5, 6)
y <- rep("Down",5)
y[x>5] <- "Up"
This would result in a y vector being ("Down", "Up", "Down", "Down", "Up")
Now my x sequence is an output of the predict function on a linear model fit. The predict function in R returns a sequence while the predict function in Spark returns a DataFrame containing the columns of the test-dataset + the columns label and prediction.
By running
y[x$prediction > .5]
I get the error:
Error in y[x$prediction > 0.5] : invalid subscript type 'S4'
How would I solve this problem?
On selecting rows:
Your approach will not work, since x, being the product of Spark predict, is a Spark (and not an R) dataframe; you should use the filter function of SparkR instead. Here is a reproducible example using the iris dataset:
library(SparkR)
sparkR.version()
# "2.2.1"
df <- as.DataFrame(iris)
df
# SparkDataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, Species:string]
nrow(df)
# 150
# Let's keep only the records with Petal_Width > 0.2:
df2 <- filter(df, df$Petal_Width > 0.2)
nrow(df2)
# 116
Check also the example in the docs.
On replacing row values:
The standard practice for replacing row values in Spark dataframes is first to create a new column with the required condition, and then possibly to drop the old column. Here is an example where we replace values of Petal_Width greater than 0.2 with 0 in the df we defined above:
newDF <- withColumn(df, "new_PetalWidth", ifelse(df$Petal_Width > 0.2, 0, df$Petal_Width))
head(newDF)
# result:
Sepal_Length Sepal_Width Petal_Length Petal_Width Species new_PetalWidth
1 5.1 3.5 1.4 0.2 setosa 0.2
2 4.9 3.0 1.4 0.2 setosa 0.2
3 4.7 3.2 1.3 0.2 setosa 0.2
4 4.6 3.1 1.5 0.2 setosa 0.2
5 5.0 3.6 1.4 0.2 setosa 0.2
6 5.4 3.9 1.7 0.4 setosa 0.0 # <- value changed
# drop the old column:
newDF <- drop(newDF, "Petal_Width")
head(newDF)
# result:
Sepal_Length Sepal_Width Petal_Length Species new_PetalWidth
1 5.1 3.5 1.4 setosa 0.2
2 4.9 3.0 1.4 setosa 0.2
3 4.7 3.2 1.3 setosa 0.2
4 4.6 3.1 1.5 setosa 0.2
5 5.0 3.6 1.4 setosa 0.2
6 5.4 3.9 1.7 setosa 0.0
The method also works across different columns; here is an example of a new column taking the value 0 or Petal_Width, depending on a condition on Petal_Length:
newDF2 <- withColumn(df, "something_here", ifelse(df$Petal_Length > 1.4, 0, df$Petal_Width))
head(newDF2)
# result:
Sepal_Length Sepal_Width Petal_Length Petal_Width Species something_here
1 5.1 3.5 1.4 0.2 setosa 0.2
2 4.9 3.0 1.4 0.2 setosa 0.2
3 4.7 3.2 1.3 0.2 setosa 0.2
4 4.6 3.1 1.5 0.2 setosa 0.0
5 5.0 3.6 1.4 0.2 setosa 0.2
6 5.4 3.9 1.7 0.4 setosa 0.0