Vega-Lite to Altair - Sorting a chart based on the dynamically updated axis

I'm trying to recreate the pyLDAvis chart in Altair. I have a VL spec with a lot of transforms in it, and I'm having trouble converting it to Altair. All credit goes to here and here for helping me get this far.
I think I'm getting close but I get the following error:
Altair.vegalite.v4.schema.channels.ColorValue, validating 'additionalProperties'
Additional properties are not allowed ('selection' was unexpected)
In the end, I'm most concerned about whether or not I translated all the transforms correctly from VL to Altair.
Any help is much appreciated as I think this would be a nice contribution to the NLP/Topic Modeling community.
import altair as alt
import pandas as pd
import numpy as np

data = {
    'Term': ['algorithm', 'learning', 'learning', 'algorithm', 'algorithm', 'learning'],
    'Freq_x': [1330, 1353, 304.42, 296.69, 157.59, 140.35],
    'Total': [1330, 1353, 1353.7, 1330.47, 1330.47, 1353.7],
    'Category': ['Default', 'Default', 'Topic1', 'Topic1', 'Topic2', 'Topic2'],
    'logprob': [30.0, 27.0, -5.116, -5.1418, -5.4112, -5.5271],
    'loglift': [30.0, 27.0, 0.0975, 0.0891, -0.1803, -0.3135],
    'saliency_ind': [0, 3, 76, 77, 181, 186],
    'x': [np.nan, np.nan, -0.0080, -0.0080, -0.0053, -0.0053],
    'y': [np.nan, np.nan, -0.0056, -0.0056, 0.0003, 0.0003],
    'topics': [np.nan, np.nan, 1.0, 1.0, 2.0, 2.0],
    'cluster': [np.nan, np.nan, 1.0, 1.0, 1.0, 1.0],
    'Freq_y': [np.nan, np.nan, 20.39, 20.39, 14.18, 14.18]}
df = pd.DataFrame(data)

pts = alt.selection(type="single", fields=['Category'], empty='none')

points = alt.Chart().mark_circle(tooltip=True).encode(
    x='mean(x)',
    y='mean(y)',
    size='Freq_y',
    tooltip=['topics', 'cluster'],
    detail='Category',
    color=alt.condition(pts, alt.value('#F28E2B'), alt.value('#4E79A7'))
).add_selection(pts)

trans = alt.Chart().transform_joinaggregate(
    max_fx='max(Freq_x)'
).transform_calculate(
    filterCategory="selector046['Category'] ? selector046['Category'] : []"
).transform_calculate(
    filtered_Freq_x="indexof(datum.filterCategory, datum['Category']) > -1 ? datum['Freq_x'] : null"
).transform_window(
    Sorted='rank()',
    sort=[{'field': "filtered_Freq_x:Q", "order": "descending"}]
)

b1 = alt.Chart().mark_bar().encode(
    x='Freq_x',
    y=alt.Y('Term', sort=alt.SortField("Sorted")),
    tooltip=['Total'],
)

b2 = alt.Chart().mark_bar(color='#F28E2B').encode(
    x='filtered_Freq_x:Q',
    y=alt.Y('Term', sort=alt.SortField("Sorted")),
    tooltip=['Total'],
)

bars_1 = trans + b1
bars_2 = trans + b2

alt.hconcat(points, bars_1 + bars_2, data=df).resolve_legend(
    color="independent",
    size="independent"
)
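I suspect (this is my own guess, not confirmed) that part of the problem is that the transform_calculate expressions still hard-code the Vega-Lite selection name selector046, while Altair generates its own internal name for pts. A minimal sketch of substituting the generated name via pts.name, with everything else as above:
# Sketch only: reference the selection name Altair generates instead of
# hard-coding the Vega-Lite name 'selector046'. pts.name holds that name.
sel = pts.name
trans = alt.Chart().transform_joinaggregate(
    max_fx='max(Freq_x)'
).transform_calculate(
    filterCategory=f"{sel}['Category'] ? {sel}['Category'] : []"
).transform_calculate(
    filtered_Freq_x="indexof(datum.filterCategory, datum['Category']) > -1 ? datum['Freq_x'] : null"
).transform_window(
    Sorted='rank()',
    # the ':Q' suffix is encoding shorthand; inside a transform the plain field
    # name is probably what is intended
    sort=[{'field': 'filtered_Freq_x', 'order': 'descending'}]
)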

Related

How to use fit_transform with an array?

Example of array content:
[
[4.9, 3.0, 1.4, 0.2, 0.0, 2.0],
[4.7, 3.2, 1.3, 0.2, 0.0, 2.0],
[4.6, 3.1, 1.5, 0.2, 0.0, 2.0],
...
]
model = TSNE(learning_rate=100)
transformed = model.fit_transform(data)
I'm trying to apply tSNE to a float array, but I get an error. What should I change?
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (149,) + inhomogeneous part.
Try this example:
from sklearn.manifold import TSNE
import numpy as np
X = np.array([[4.9, 3.0, 1.4, 0.2, 0.0, 2.0], [4.7, 3.2, 1.3, 0.2, 0.0, 2.0]])
model = TSNE(learning_rate=100)  # note: recent scikit-learn releases also require perplexity < n_samples, so a tiny example like this may need e.g. perplexity=1
transformed = model.fit_transform(X)
print(transformed)
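For context on the original error: the "inhomogeneous shape" ValueError almost always means the nested lists do not all have the same length, so NumPy cannot build a 2-D float array from them. A small check to run first (a sketch, assuming the input is a plain Python list of lists called data):
import numpy as np

data = [
    [4.9, 3.0, 1.4, 0.2, 0.0, 2.0],
    [4.7, 3.2, 1.3, 0.2],  # a short row like this would trigger the error
]

lengths = {len(row) for row in data}
if len(lengths) > 1:
    print("rows have unequal lengths:", sorted(lengths))
else:
    X = np.asarray(data, dtype=float)  # safe once every row has the same length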

Pyspark RDD unexpected change in standard deviation

I'm following Raju Kumar's PySpark Recipes, and in recipe 4-5 I found that rdd.stats() and rdd.stats().asDict() give different values for the standard deviation. This goes unnoticed in the book, by the way.
Here is the code to reproduce the finding:
import pyspark
sc = pyspark.SparkContext()
air_speed = [12,13,15,12,11,12,11]
air_rdd = sc.parallelize(air_speed)
print(air_rdd.stats())
print(air_rdd.stats().asDict())
And this is the output:
(count: 7, mean: 12.285714285714286, stdev: 1.2777531299998799, max: 15.0, min: 11.0)
{'count': 7, 'mean': 12.285714285714286, 'sum': 86.0, 'min': 11.0, 'max': 15.0, 'stdev': 1.3801311186847085, 'variance': 1.904761904761905}
Now, I know the stdev in the first case uses the "population" formula, while the second is the unbiased estimator of the population standard deviation (a.k.a. the "sample standard deviation"); see this article for reference. But what I don't understand is why the value changes from one output to the other: it looks like .asDict() should simply change the format of the output, not its meaning.
So, does anybody understand the logic of this change?
I mean it looks like .asDict() should simply change the format of the output, not its meaning.
It doesn't really change the meaning. pyspark.statcounter.StatCounter provides both the sample and the population variants:
>>> stats = air_rdd.stats()
>>> stats.stdev()
1.2777531299998799
>>> stats.sampleStdev()
1.3801311186847085
and you can choose which one should be used when converting to a dictionary:
>>> stats.asDict()
{'count': 7, 'mean': 12.285714285714286, 'sum': 86.0, 'min': 11.0, 'max': 15.0, 'stdev': 1.3801311186847085, 'variance': 1.904761904761905}
>>> stats.asDict(sample=True)
{'count': 7, 'mean': 12.285714285714286, 'sum': 86.0, 'min': 11.0, 'max': 15.0, 'stdev': 1.2777531299998799, 'variance': 1.63265306122449}
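For what it's worth, the two numbers are just the population formula (divide by n) versus the sample formula (divide by n - 1); a quick cross-check with NumPy, independent of Spark:
import numpy as np

air_speed = [12, 13, 15, 12, 11, 12, 11]

print(np.std(air_speed, ddof=0))  # population stdev -> 1.2777..., matches stats.stdev()
print(np.std(air_speed, ddof=1))  # sample stdev     -> 1.3801..., matches stats.sampleStdev()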

How to efficiently deal with nested data in PySpark?

I had a situation here, and found that collect_list in Spark is not efficient when the item is already a list.
Basically, I tried to calculate the mean of a nested list (the size of each list is guaranteed to be the same). When the data set grows to, for example, 10 M rows, it may produce out-of-memory errors. Originally I thought it had something to do with the UDF (used to calculate the mean), but I found that the aggregation part (collect_list of lists) is the real problem.
What I am doing now is to divide the 10 M rows into multiple blocks (by 'user'), aggregate each block individually, and then union them at the end. Any better suggestion on efficiently dealing with nested data? (One alternative is sketched after the example output below.)
Here is a toy example:
import numpy as np
import pyspark.sql.functions as f
from pyspark.sql.types import ArrayType, DoubleType

data = [('user1', 'place1', ['place1', 'place2', 'place3'], [0.0, 0.5, 0.4], [0.0, 0.4, 0.3]),
        ('user1', 'place2', ['place1', 'place2', 'place3'], [0.7, 0.0, 0.4], [0.6, 0.0, 0.3]),
        ('user2', 'place1', ['place1', 'place2', 'place3'], [0.0, 0.4, 0.3], [0.0, 0.3, 0.4]),
        ('user2', 'place3', ['place1', 'place2', 'place3'], [0.1, 0.2, 0.0], [0.3, 0.1, 0.0]),
        ('user3', 'place2', ['place1', 'place2', 'place3'], [0.3, 0.0, 0.4], [0.2, 0.0, 0.4]),
        ]
data_df = sparkApp.sparkSession.createDataFrame(data, ['user', 'place', 'places', 'data1', 'data2'])

data_agg = data_df.groupBy('user') \
    .agg(f.collect_list('place').alias('place_list'),
         f.first('places').alias('places'),
         f.collect_list('data1').alias('data1'),
         # note: 'data1' is collected twice here (as in the original post),
         # which is why data1 and data2 match in the output below
         f.collect_list('data1').alias('data2'),
         )

def average_values(sim_vectors):
    if len(sim_vectors) == 1:
        return sim_vectors[0]
    mat = np.array(sim_vectors)
    mean_vector = np.mean(mat, axis=0)
    return np.round(mean_vector, 3).tolist()

avg_vectors_udf = f.udf(average_values, ArrayType(DoubleType()))
data_agg_ave = data_agg.withColumn('data1', avg_vectors_udf('data1')) \
    .withColumn('data2', avg_vectors_udf('data2'))
The result would be:
+-----+----------------+--------------------+-----------------+-----------------+
| user| place_list| places| data1| data2|
+-----+----------------+--------------------+-----------------+-----------------+
|user1|[place1, place2]|[place1, place2, ...|[0.35, 0.25, 0.4]|[0.35, 0.25, 0.4]|
|user3| [place2]|[place1, place2, ...| [0.3, 0.0, 0.4]| [0.3, 0.0, 0.4]|
|user2|[place1, place3]|[place1, place2, ...|[0.05, 0.3, 0.15]|[0.05, 0.3, 0.15]|
+-----+----------------+--------------------+-----------------+-----------------+
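One alternative worth sketching (my own suggestion, not from the original post): skip collect_list of lists entirely by exploding each vector together with its position, averaging per position, and reassembling the ordered list per user, so Spark only ever aggregates scalars. Assuming data_df from the example above:
import pyspark.sql.functions as f

# Element-wise mean of data1 without collecting lists of lists.
exploded = data_df.select('user', f.posexplode('data1').alias('pos', 'val'))

data1_mean = (exploded
              .groupBy('user', 'pos')
              .agg(f.round(f.avg('val'), 3).alias('val'))
              .groupBy('user')
              # collect (pos, val) structs and sort them so the list keeps its original order
              .agg(f.sort_array(f.collect_list(f.struct('pos', 'val'))).alias('pairs'))
              .select('user', f.col('pairs.val').alias('data1')))
The same pipeline can be repeated for data2; whether it beats the blocked approach at 10 M rows would need to be measured.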

How to convert RDD of dense vector into DataFrame in pyspark?

I have a DenseVector RDD like this
>>> frequencyDenseVectors.collect()
[DenseVector([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]), DenseVector([1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]), DenseVector([0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0])]
I want to convert this into a DataFrame. I tried this:
>>> spark.createDataFrame(frequencyDenseVectors, ['rawfeatures']).collect()
It gives an error like this
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 520, in createDataFrame
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 360, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/session.py", line 340, in _inferSchema
schema = _infer_schema(first)
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/types.py", line 991, in _infer_schema
fields = [StructField(k, _infer_type(v), True) for k, v in items]
File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/types.py", line 968, in _infer_type
raise TypeError("not supported type: %s" % type(obj))
TypeError: not supported type: <type 'numpy.ndarray'>
Old solution:
frequencyVectors.map(lambda vector: DenseVector(vector.toArray()))
Edit 1 - Reproducible code
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.sql.functions import split
from pyspark.ml.feature import CountVectorizer
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.mllib.linalg import SparseVector, DenseVector
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)
sc.setLogLevel('ERROR')
sentenceData = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (0, "I wish Java could use case classes"),
    (1, "Logistic regression models are neat")
], ["label", "sentence"])
sentenceData = sentenceData.withColumn("sentence", split("sentence", r"\s+"))
sentenceData.show()
vectorizer = CountVectorizer(inputCol="sentence", outputCol="rawfeatures").fit(sentenceData)
countVectors = vectorizer.transform(sentenceData).select("label", "rawfeatures")
idf = IDF(inputCol="rawfeatures", outputCol="features")
idfModel = idf.fit(countVectors)
tfidf = idfModel.transform(countVectors).select("label", "features")
frequencyDenseVectors = tfidf.rdd.map(lambda vector: [vector[0],DenseVector(vector[1].toArray())])
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])
You cannot convert an RDD[Vector] directly. It should be mapped to an RDD of objects which can be interpreted as structs, for example RDD[Tuple[Vector]]:
frequencyDenseVectors.map(lambda x: (x, )).toDF(["rawfeatures"])
Otherwise Spark will try to convert the object's __dict__ and end up using an unsupported NumPy array as a field:
from pyspark.ml.linalg import DenseVector
from pyspark.sql.types import _infer_schema
v = DenseVector([1, 2, 3])
_infer_schema(v)
TypeError Traceback (most recent call last)
...
TypeError: not supported type: <class 'numpy.ndarray'>
vs.
_infer_schema((v, ))
StructType(List(StructField(_1,VectorUDT,true)))
Notes:
In Spark 2.0 you have to use the correct local types:
pyspark.ml.linalg when working with the DataFrame-based pyspark.ml API.
pyspark.mllib.linalg when working with the RDD-based pyspark.mllib API.
These two namespaces are no longer compatible and require explicit conversions (for example, How to convert from org.apache.spark.mllib.linalg.VectorUDT to ml.linalg.VectorUDT).
The code provided in the edit is not equivalent to the one from the original question. You should be aware that tuple and list don't have the same semantics. If you map a vector to a pair, use a tuple and convert directly to a DataFrame:
tfidf.rdd.map(
lambda row: (row[0], DenseVector(row[1].toArray()))
).toDF()
Using a tuple (product type) would work for a nested structure as well, but I doubt this is what you want:
(tfidf.rdd
.map(lambda row: (row[0], DenseVector(row[1].toArray())))
.map(lambda x: (x, ))
.toDF())
A list at any place other than the top-level row is interpreted as an ArrayType.
It is much cleaner to use a UDF for the conversion (see Spark Python: Standard scaler error "Do not support ... SparseVector").
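As a rough illustration of that last point (my own sketch, assuming the pyspark.ml vectors produced by the IDF stage in the question's edit), the conversion can stay entirely inside the DataFrame API:
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# UDF that densifies a (possibly sparse) ml vector, avoiding the RDD round-trip.
to_dense = udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())

dense_tfidf = tfidf.withColumn("features", to_dense("features"))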
I believe the problem here is that createDataFrame does not take a DenseVector as an argument. Please try to convert the DenseVector into a corresponding collection (i.e. an array or list). In Scala and Java a
toArray()
method is available; you can convert the DenseVector into an array or list and then try to create the DataFrame.
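A hedged PySpark sketch of that suggestion, assuming the frequencyDenseVectors RDD of [label, DenseVector] pairs from the question's edit: convert each vector to a plain Python list and supply an explicit schema so nothing falls back to NumPy-based inference:
from pyspark.sql.types import StructType, StructField, LongType, ArrayType, DoubleType

schema = StructType([
    StructField("label", LongType(), True),
    StructField("rawfeatures", ArrayType(DoubleType()), True),
])

# DenseVector -> plain list of Python floats
rows = frequencyDenseVectors.map(lambda x: (int(x[0]), x[1].toArray().tolist()))
df = spark.createDataFrame(rows, schema)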

Understanding numpy formatting

I have been reading the numpy array formatting documentation and I cannot achieve what I want to do.
Given a matrix array where each column represents a different field, I want to format each column as integer or double depending on the data that column represents.
Before marking this as a duplicate, consider that I do not want the [(), (), (), ..., ()] structure; I want the [[], [], [], ..., []] type of structure, exactly as it comes but with different types per column.
See my attempts below.
from numpy import array, intc, double
bus_format_str1 = [(" ", intc),
                   ("BUS_TYPE", intc),
                   ("PD", double),
                   ("QD", double),
                   ("GS", double),
                   ("BS", double),
                   ("BUS_AREA", intc),
                   ("VM", double),
                   ("VA", double),
                   ("BASE_KV", double),
                   ("ZONE", intc),
                   ("VMAX", double),
                   ("VMIN", double)]
bus_format_str2 = "|i8, i8, f8, f8, f8, f8, f8, f8, f8, i8, f8, f8"
# original array
Bus = array([[1, 1, 97.6, 44.2, 0, 0, 2, 1.0393836, -13.536602, 345, 1, 1.06, 0.94],
             [2, 1, 0, 0, 0, 0, 2, 1.0484941, -9.7852666, 345, 1, 1.06, 0.94],
             [3, 1, 322, 2.4, 0, 0, 2, 1.0307077, -12.276384, 345, 1, 1.06, 0.94]])
print(Bus)
# Attempts to apply the format
Bus_format1 = array(Bus, dtype=bus_format_str1)
Bus_format2 = array(Bus, dtype=bus_format_str2)
print(Bus_format1)
print(Bus_format2)
Both format strings produce structures that have nothing to do with the original.
So, how do I apply the mentioned independent format per column?
What, exactly, is the source of Bus? When I cut and paste your string
In [50]: Bus = array([[1, 1, 97.6, 44.2, 0, 0, 2, 1.0393836, -13.536602, 345, 1, 1.06, 0.94],
...345, 1, 1.06, 0.94]])
I get an array that is all floats:
In [51]: Bus
Out[51]:
array([[ 1. , 1. , 97.6 , 44.2 ,
0. , 0. , 2. , 1.0393836,
-13.536602 , 345. , 1. , 1.06 , .... ]])
Bus_format2 = array(Bus, dtype=bus_format_str2) really messes things up, replicating each element of Bus over the fields in the dtype:
array([[(1L, 1L, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1L, 1.0, 1.0),
(1L, 1L, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1L, 1.0, 1.0),
....
Notice the inner (). You don't want [(), (), (), ..., ()], but you also want different types per column. You can't have it both ways. If the array is structured, with different types per column, numpy will display it with the [()] form. The [[]] is only for arrays with the same type all around.
You may need to reread the documentation about dtypes and structured arrays.
But maybe you aren't concerned about the numpy representation of the data, but about the print style. But why is that important? Are you writing to a file that demands a particular style? Writing to share with someone else, or for publication?
A structured array with your data and different types per column can be constructed as follows.
Start with a list of lists:
In [66]: Bus = [[1, 1, 97.6, 44.2, 0, 0, 2, 1.0393836, -13.536602, 345, 1, 1.06, 0.94],
                [2, 1, 0, 0, 0, 0, 2, 1.0484941, -9.7852666, 345, 1, 1.06, 0.94],
                [3, 1, 322, 2.4, 0, 0, 2, 1.0307077, -12.276384, 345, 1, 1.06, 0.94]]
A dtype with 13 fields (matching the length of the sublists):
In [67]: dt='i,i,f,f,i,i,i,f,f,i,i,f,f'
Conversion to tuples is required for structured-array input:
In [68]: A=np.array([tuple(x) for x in Bus],dtype=dt)
The result isn't particularly legible, but that's because we have 13 columns, some of which are floats:
In [69]: A
Out[69]:
array([ (1, 1, 97.5999984741211, 44.20000076293945, 0, 0, 2, 1.0393836498260498, -13.536602020263672, 345, 1, 1.059999942779541, 0.9399999976158142),
(2, 1, 0.0, 0.0, 0, 0, 2, 1.0484941005706787, -9.785266876220703, 345, 1, 1.059999942779541, 0.9399999976158142),
(3, 1, 322.0, 2.4000000953674316, 0, 0, 2, 1.0307077169418335, -12.276384353637695, 345, 1, 1.059999942779541, 0.9399999976158142)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<f4'), ('f3', '<f4'), ('f4', '<i4'), ('f5', '<i4'), ('f6', '<i4'), ('f7', '<f4'), ('f8', '<f4'), ('f9', '<i4'), ('f10', '<i4'), ('f11', '<f4'), ('f12', '<f4')])
Use repr if you want to see the dtype along with the data: print(repr(A)). That's a good idea when asking questions about structured arrays.
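If the real concern is the written or printed style rather than the in-memory dtype (the file-output case raised above), one hedged option is to keep Bus as a plain 2-D float array and apply a per-column format only on output, for example with np.savetxt:
import numpy as np

Bus = np.array([[1, 1, 97.6, 44.2, 0, 0, 2, 1.0393836, -13.536602, 345, 1, 1.06, 0.94],
                [2, 1, 0, 0, 0, 0, 2, 1.0484941, -9.7852666, 345, 1, 1.06, 0.94],
                [3, 1, 322, 2.4, 0, 0, 2, 1.0307077, -12.276384, 345, 1, 1.06, 0.94]])

# one format per column: integer-looking columns as %d, the rest as floats
fmt = ['%d', '%d', '%.2f', '%.2f', '%.2f', '%.2f', '%d',
       '%.7f', '%.6f', '%.1f', '%d', '%.2f', '%.2f']
np.savetxt("bus.txt", Bus, fmt=fmt)  # writes one row per line using the per-column formats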
