Apache Spark - calculate correlation - apache-spark

I am trying to calculate correlation between user ratings. I came up with a simple program and now trying to understand the result of pearson correlation.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val user1 = Vectors.dense(10, 2, 3, 3)
val user2 = Vectors.dense(10, 3, 2, 2)
val user3 = Vectors.dense(1, 8, 9, 1)
val user4 = Vectors.dense(3, 9, 8, 2)
val user5 = Vectors.dense(1, 1, 1, 1)
val user6 = Vectors.dense(2, 2, 2, 2)
val users = spark.sparkContext.parallelize(Array(user1, user2, user3, user4, user5, user6))
val corr = Statistics.corr(users)
And this is the matrix result for reference:
1.0 -0.30336465877348895 -0.33033040622002124 0.7679896586280794
-0.30336465877348895 1.0 0.9660056657223798 -0.21945076948288175
-0.33033040622002124 0.9660056657223798 1.0 -0.21945076948288175
0.7679896586280794 -0.21945076948288175 -0.21945076948288175 1.0
Could someone help me interpret this matrix? I was surprised that it contains 4 columns and 4 rows (I have six users as the input)?

There is not much to explain here. As you can read in the API docs, corr(X: RDD[Vector]) returns:
Pearson correlation matrix comparing columns in X.
So four input columns yield a 4x4 matrix; the entries are correlations between the rating columns (items), not between the six users (rows).
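As a quick sanity check outside Spark (my addition, not part of the original answer), the same 4x4 matrix can be reproduced with numpy by stacking the six user vectors as rows and correlating the columns:
import numpy as np

# Rows are the six users, columns are the four rated items.
ratings = np.array([
    [10, 2, 3, 3],
    [10, 3, 2, 2],
    [1, 8, 9, 1],
    [3, 9, 8, 2],
    [1, 1, 1, 1],
    [2, 2, 2, 2],
])

# rowvar=False correlates the columns, matching Statistics.corr(users).
print(np.corrcoef(ratings, rowvar=False))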

Related

Python optimization of time-series data re-indexing based on multiple-parameter multi-variable input and single-value output

I am trying to optimize a function that maximizes the correlation between two (pandas) time series arrays (X and Y). This is done using three parameters (a, b, c) and a third time series array (Z). The Z array is used to reindex the values in the X array (based on the parameters a, b, c) in such a way as to maximize the correlation of the reindexed X array (Xnew) with the Y array.
Below is some pseudo-code to demonstrate what I am trying to do. I have attempted this using LMfit and scipy optimize, but I am not sure how to make this task work in those packages. For example, in LMfit, if I try to minimize the MyOpt function (which passes back a single value of the correlation metric), it complains that I have more parameters than outputs. However, if I pass back the time series of the correlation metric (diff), then the parameter values remain fixed at their input values.
I know the reindexing function I am using works, because rather crude methods similar to the code below give significant changes in the mean (diff) metric passed back.
My knowledge of these optimization packages is not up to scratch for this job, so if anyone has a suggestion on how to tackle this, I would be grateful.
def GetNewIndex(Z, a, b, c):
    old_index = np.arange(0, len(Z))
    index_adj = some_func(a, b, c)
    new_index = old_index + index_adj
    max_old = np.max(old_index)
    new_index[new_index > max_old] = max_old
    new_index[new_index < 0] = 0
    return new_index

def MyOpt(params, X, Y, Z):
    a = params['A']
    b = params['B']
    c = params['C']
    # estimate lag (in samples) based on ambient RH
    new_index = GetNewIndex(Z, a, b, c)
    # assign old values to new locations and convert back to pandas series
    Xnew = np.take(X.values, new_index)
    Xnew = pd.Series(Xnew, index=X.index)
    cc = Y.rolling(1201, center=True).corr(Xnew)
    cc = cc.interpolate(limit_direction='both', limit_area=None)
    diff = 1 - np.abs(cc)
    return np.mean(diff)
#==================================================
X = some long pandas time series data
Y = some long pandas time series data
Z = some long pandas time series data
As = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
Bs = [0, 0 ,0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
Cs = [5, 6, 5, 6, 5, 6, 5, 6, 5, 6, 5, 6]
outs = []
for A, B, C in zip(As, Bs, Cs):
    params = {'A': A, 'B': B, 'C': C}
    out = MyOpt(params, X, Y, Z)
    outs.append(out)
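For reference, here is a hedged sketch (not from the original post) of how a scalar-returning objective like MyOpt above could be handed to scipy.optimize.minimize; the flat parameter packing, the starting point x0, and the choice of Nelder-Mead are assumptions for illustration, not a tested solution.
from scipy.optimize import minimize

# Wrapper that unpacks a flat parameter array into the dict MyOpt expects;
# scipy.optimize.minimize wants fun(x, *args) to return a single scalar.
def objective(x, X, Y, Z):
    params = {'A': x[0], 'B': x[1], 'C': x[2]}
    return MyOpt(params, X, Y, Z)

# Nelder-Mead is gradient-free, which suits a rough correlation-based metric.
x0 = [1.0, 0.0, 5.0]  # hypothetical starting point, e.g. taken from the grid above
result = minimize(objective, x0, args=(X, Y, Z), method='Nelder-Mead')
print(result.x, result.fun)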

How to filter a time series if data exists at least every 6 hours?

I'd like to verify whether there is data at least once every 6 hours per ID, and filter out the IDs that do not meet this criterion.
essentially a filter: "if ID's data not at least every 6h, drop id from dataframe"
I tried to use the same method I use for filtering to one record per day, but I'm having trouble adapting the code.
# add day column from datetime index
df['1D'] = df.index.day
# reset index
daily = df.reset_index()
# count per ID per day. Result is per ID data of non-zero
a = daily.groupby(['1D', 'id']).size()
# filter by right join
filtered = a.reset_index(name='count').merge(df, on='id', how='right')
I cannot figure out how to adapt this for the following 6hr periods each day: 00:01-06:00, 06:01-12:00, 12:01-18:00, 18:01-24:00.
Group by ID, then integer-divide the hour by 6 and count the unique values. In your case the count should be greater than or equal to 4, because there are four 6-hour bins in 24 hours and each day has 4 unique bins, i.e.
Bins = 4
00:01-06:00
06:01-12:00
12:01-18:00
18:01-24:00
Code
mask = df.groupby('id')['date'].transform(lambda x: (x.dt.hour // 6).nunique() >= 4)
df = df[mask]
I propose using pivot_table with resample, which allows changing to arbitrary frequencies. Please see the comments for further explanations.
# build test data. I need a dummy column to use pivot_table later. Any column with numerical values will suffice
from datetime import datetime
import pandas as pd

data = [[datetime(2020, 1, 1, 1), 1, 1],
        [datetime(2020, 1, 1, 6), 1, 1],
        [datetime(2020, 1, 1, 12), 1, 1],
        [datetime(2020, 1, 1, 18), 1, 1],
        [datetime(2020, 1, 1, 1), 2, 1],
        ]
df = pd.DataFrame.from_records(data=data, columns=['date', 'id', 'dummy'])
df = df.set_index('date')
# We need a helper dataframe df_tmp.
# Transform id entries to columns. resample with 6h = 360 minutes = 360T.
# Take mean(); 6h bins without observations will produce NaN values.
# WARNING: It will only work if you have at least one id with observations for every 6h.
df_tmp = pd.pivot_table(df, columns='id', index=df.index).resample('360T').mean()
# Drop MultiColumnHierarchy and drop all columns with NaN values
df_tmp.columns = df_tmp.columns.get_level_values(1)
df_tmp.dropna(axis=1, inplace=True)
# Filter the original dataframe to ids that have data in every 6h bin
mask_id = df.id.isin(df_tmp.columns.to_list())
df = df[mask_id]
I kept your requirements on the timestamps, but I believe you actually want to use the commented lines in my solution.
import pandas as pd
period = pd.to_datetime(['2020-01-01 00:01:00', '2020-01-01 06:00:00'])
# period = pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 06:00:00'])
shift = pd.to_timedelta(['6H', '6H'])
id_with_data = set(df['ID'])
for k in range(4):  # for a day (00:01 --> 24:00)
    period_mask = (period[0] <= df.index) & (df.index <= period[1])
    # period_mask = (period[0] <= df.index) & (df.index < period[1])
    present_ids = set(df.loc[period_mask, 'ID'])
    id_with_data = id_with_data.intersection(present_ids)
    period += shift
df = df.loc[df['ID'].isin(list(id_with_data))]

Iterating over a grouped dataset in Spark 1.6

In an ordered dataset, I want to aggregate data until a condition is met, but grouped by a certain key.
To give some context to my question, I simplify my problem to the problem statement below:
In Spark I need to aggregate strings, grouped by key, until a user stops
"shouting" (the 2nd character in a string is not uppercase).
Dataset example:
ID, text, timestamps
1, "OMG I like bananas", 123
1, "Bananas are the best", 234
1, "MAN I love banana", 1235
2, "ORLY? I'm more into grapes", 123565
2, "BUT I like apples too", 999
2, "unless you count veggies", 9999
2, "THEN don't forget tomatoes", 999999
The expected result would be:
1, "OMG I like bananas Bananas are the best"
2, "ORLY? I'm more into grapes BUT I like apples too unless you count veggies"
Via groupBy and agg I can't seem to set a condition such as "stop when a string whose second character is not uppercase is found".
This only works in Spark 2.1 or above
What you want to do is possible, but it may be very expensive.
First, let's create some test data. As general advice, when you ask something on Stack Overflow please provide something similar to this so people have somewhere to start.
import spark.sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = List(
(1, "OMG I like bananas", 1),
(1, "Bananas are the best", 2),
(1, "MAN I love banana", 3),
(2, "ORLY? I'm more into grapes", 1),
(2, "BUT I like apples too", 2),
(2, "unless you count veggies", 3),
(2, "THEN don't forget tomatoes", 4)
).toDF("ID", "text", "timestamps")
In order to get a column with the collected texts in order, we need to add a new column using a window function.
Using the spark shell:
scala> val df2 = df.withColumn("coll", collect_list("text").over(Window.partitionBy("id").orderBy("timestamps")))
df2: org.apache.spark.sql.DataFrame = [ID: int, text: string ... 2 more fields]
scala> val x = df2.groupBy("ID").agg(max($"coll").as("texts"))
x: org.apache.spark.sql.DataFrame = [ID: int, texts: array<string>]
scala> x.collect.foreach(println)
[1,WrappedArray(OMG I like bananas, Bananas are the best, MAN I love banana)]
[2,WrappedArray(ORLY? I'm more into grapes, BUT I like apples too, unless you count veggies, THEN don't forget tomatoes)]
To get the actual text we may need a UDF. Here's mine (I'm far from an expert in Scala, so bear with me)
import scala.collection.mutable

val aggText: Seq[String] => String = (list: Seq[String]) => {
  // Walk the collected strings in order and stop after the first one that is
  // not "shouted" (i.e. that does not start with two uppercase characters).
  def tex(arr: Seq[String], accum: Seq[String]): Seq[String] = arr match {
    case Seq() => accum
    case Seq(single) => accum :+ single
    case Seq(str, xs @ _*) =>
      if (str.length >= 2 && !(str.charAt(0).isUpper && str.charAt(1).isUpper))
        tex(Nil, accum :+ str)
      else
        tex(xs, accum :+ str)
  }
  val res = tex(list, Seq())
  res.mkString(" ")
}

val textUDF = udf(aggText(_: mutable.WrappedArray[String]))
So, we have a dataframe with the collected texts in the proper order, and a Scala function (wrapped as a UDF). Let's piece it together:
scala> val x = df2.groupBy("ID").agg(max($"coll").as("texts"))
x: org.apache.spark.sql.DataFrame = [ID: int, texts: array<string>]
scala> val y = x.select($"ID", textUDF($"texts"))
y: org.apache.spark.sql.DataFrame = [ID: int, UDF(texts): string]
scala> y.collect.foreach(println)
[1,OMG I like bananas Bananas are the best]
[2,ORLY? I'm more into grapes BUT I like apples too unless you count veggies]
scala>
I think this is the result you want.

How can I combine two Arrays by parallel collections in Spark?

Suppose we have two arrays, Array1(1, 2, 3) and Array2(4, 5, 6).
I want to combine them into a new Array3((1,4),(2,5),(3,6)).
But when I try that in Spark, this is what I get.
code
val data1 = Array(1, 2, 3, 4, 5)
val data2 = Array(2, 3, 4, 5, 6)
val distData1 = sc.parallelize(data1)
val distData2 = sc.parallelize(data2)
val distData3 = distData1 ++ distData2
distData3.foreach(println)
output
1
2
3
4
5
6
How can I combine them correctly?
Update:
In my program (different from the example) I want to do label.zip(features). My features and my labels are both Array[String]. Why won't it work?
<console>:98: error: type mismatch;
found : org.apache.spark.rdd.RDD[Array[String]]
required: scala.collection.GenIterable[?]
You can zip them: data1.zip(data2) on the local arrays, or distData1.zip(distData2) on the RDDs. The RDD version only works if both RDDs have the same number of partitions and the same number of elements per partition. As for the update: the error shows that one side is an RDD[Array[String]] while Array.zip expects a local collection (GenIterable), so you are zipping a local Array with an RDD; either collect the RDD first or zip the two RDDs.

Social Graph of common users of groups

I'm trying to create a social graph using NetworkX. In theory (as I think) everything should work fine, but in practice it works incorrectly.
I have information about some groups in the following format:
members={'Group Name 1':[User 1 ID, User ID 2...],...,'Group Name N' : [User 1 ID,...,User K Id]}
For example:
members={'Group 1' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Group 2' : [10, 11, 12, 13, 14, 9],
'Group 3':[21,22,23,24] }
As the outcome I need a graph in which:
Vertices - social groups
Edges - the existence of common subscribers (user IDs)
Vertex size - user count
Distance between vertices - number of common subscribers (user IDs)
My code:
import networkx

matrix = {}
for i in members:
    for j in members:
        if i != j:
            matrix[i+j] = len(set(members[i]) & set(members[j])) * 1.0 / min(len(set(members[i])), len(set(members[j])))
max_matrix = max(matrix.values())
min_matrix = min(matrix.values())
for i in matrix:
    matrix[i] = (matrix[i] - min_matrix) / (max_matrix - min_matrix)
g = networkx.Graph(directed=False)
for i in members:
    for j in members:
        if i != j:
            g.add_edge(i, j, weight=matrix[i+j])
members_count = {x:len(members[x]) for x in members}
max_value = max(members_count.values()) * 1.0
size = []
max_size = 900
min_size = 100
for node in g.nodes():
    size.append(((members_count[node] / max_value) * max_size + min_size) * 10)
import matplotlib.pyplot as plt
pos=networkx.spring_layout(g)
plt.figure(figsize=(20,20))
networkx.draw_networkx(g, pos, node_size=size, width=0.5, font_size=8)
plt.axis('off')
plt.show()
But I can't understand why edges are drawn for groups which have no common IDs.
NetworkX only uses weight as an attribute of edges; whether an edge exists does not depend on its weight.
In other words, edges with weight 0 are still counted as edges and will be displayed by the drawing function.
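If you want pairs with no common members to get no edge at all, here is a minimal sketch (my suggestion, not part of the original answer) that reuses the members and matrix dicts from the question and only calls add_edge when the two groups actually share at least one ID:
# Only connect groups that share at least one member; pairs with no
# common IDs get no edge, so nothing is drawn between them.
g = networkx.Graph()
for i in members:
    for j in members:
        if i != j and set(members[i]) & set(members[j]):
            g.add_edge(i, j, weight=matrix[i + j])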
