I am new to Spark 2.
I tried the Spark TF-IDF example:
from pyspark.ml.feature import HashingTF, Tokenizer

sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
featurizedData = hashingTF.transform(wordsData)
for each in featurizedData.collect():
    print(each)
It outputs
Row(label=0.0, sentence=u'Hi I heard about Spark', words=[u'hi', u'i', u'heard', u'about', u'spark'], rawFeatures=SparseVector(32, {1: 3.0, 13: 1.0, 24: 1.0}))
I expected rawFeatures to contain term frequencies like {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2}, because term frequency is:
tf(w) = (Number of times the word appears in a document) / (Total number of words in the document)
In our case tf(w) = 1/5 = 0.2 for each word, because each word appears once in the document.
If we imagine that the rawFeatures dictionary contains the word index as key and the number of appearances of that word in the document as value, why is key 1 equal to 3.0? No word appears in the document 3 times.
This is confusing to me. What am I missing?
TL;DR: It is just a simple hash collision. HashingTF takes hash(word) % numBuckets to determine the bucket, and with a very low number of buckets, like here, collisions are to be expected. In general you should use a much higher number of buckets or, if collisions are unacceptable, CountVectorizer.
In detail: HashingTF uses Murmur hash by default. [u'hi', u'i', u'heard', u'about', u'spark'] will be hashed to [-537608040, -1265344671, 266149357, 146891777, 2101843105]. If you follow the source you'll see that the implementation is equivalent to:
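To make the contrast concrete, here is a minimal sketch in plain Python (not Spark) of the exact-vocabulary idea behind CountVectorizer: every distinct term gets its own index, so nothing can collide. The name term_frequencies and the alphabetical index order are assumptions made for illustration. As far as I know, Spark's CountVectorizer orders its vocabulary by frequency and, like HashingTF, emits raw counts, so the division by document length below is an extra step that matches the tf(w) formula from the question.

```python
from collections import Counter

def term_frequencies(tokens):
    """Give each distinct token its own index and return normalized frequencies."""
    # One index per distinct term: no two terms can share a slot.
    vocabulary = {term: index for index, term in enumerate(sorted(set(tokens)))}
    counts = Counter(tokens)
    total = len(tokens)
    # count / total is the tf(w) definition used in the question.
    return {vocabulary[term]: count / total for term, count in counts.items()}

tokens = ["hi", "i", "heard", "about", "spark"]
tf = term_frequencies(tokens)
print(sorted(tf.items()))  # five separate indices, each with frequency 0.2
```

The price of an exact vocabulary is that it has to be built from the data first, which is why hashing is attractive for large corpora.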
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes
Seq("hi", "i", "heard", "about", "spark")
.map(UTF8String.fromString(_))
.map(utf8 =>
hashUnsafeBytes(utf8.getBaseObject, utf8.getBaseOffset, utf8.numBytes, 42))
Seq[Int] = List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
When you take the non-negative modulo of these values with 32 buckets you'll get [24, 1, 13, 1, 1]:
List(-537608040, -1265344671, 266149357, 146891777, 2101843105)
.map(nonNegativeMod(_, 32))
List[Int] = List(24, 1, 13, 1, 1)
Three words from the list (i, about and spark) hash to the same bucket, each occurs once, hence the result you get.
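The bucket arithmetic can be checked without Spark. The sketch below reuses the hash values listed above as given (it does not recompute Murmur hashes) and counts how many words land in each bucket. bucket_counts is a made-up helper; 2 ** 18 is, to my knowledge, the default numFeatures of HashingTF.

```python
from collections import Counter

# Hash values copied from the answer above; they are not recomputed here.
hashes = {
    "hi": -537608040,
    "i": -1265344671,
    "heard": 266149357,
    "about": 146891777,
    "spark": 2101843105,
}

def bucket_counts(hashes, num_features):
    """Count how many words fall into each bucket for a given number of buckets."""
    # Python's % already returns a non-negative result for a positive modulus,
    # so it behaves like nonNegativeMod here.
    return Counter(h % num_features for h in hashes.values())

small = bucket_counts(hashes, 32)
large = bucket_counts(hashes, 2 ** 18)
print(dict(small))  # {24: 1, 1: 3, 13: 1}
print(len(large))   # 5, every word gets its own bucket
```

With 32 buckets the three colliding words add up to 3.0 in bucket 1; with the larger bucket count these five words no longer collide.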
Related:
What hashing function does Spark use for HashingTF and how do I duplicate it?
How to get word details from TF Vector RDD in Spark ML Lib?
I want to add a column of random values to a dataframe (which has an id for each row) for something I am testing. I am struggling to get reproducible results across Spark sessions, that is, the same random value against each row id. I am able to reproduce the results by using
from pyspark.sql.functions import rand
new_df = my_df.withColumn("rand_index", rand(seed = 7))
but it only works when I run it in the same Spark session. I do not get the same results once I relaunch Spark and run my script.
I also tried defining a udf, to see if I can generate random integers within an interval, using random from Python with random.seed set
import random
from pyspark.sql.types import LongType
random.seed(7)
spark.udf.register("getRandVals", lambda x, y: random.randint(x, y), LongType())
but to no avail.
Is there a way to ensure reproducible random number generation across Spark sessions, such that a row id always gets the same random value? I would really appreciate some guidance :)
Thanks for the help!
I suspect that you are getting the same set of values for the seed, but in a different order depending on your partitioning, which is influenced by how the data is distributed when it is read from disk, and there could be more or less data each time. But I have not seen your actual code.
The rand function generates the same random data for a given seed (what is the point of the seed otherwise), and each partition gets a slice of it. If you look at the output you should be able to spot the pattern.
Here is an example with two dataframes of different cardinality. You can see that the seed gives the same values or a superset of them, so ordering and partitioning play a role, in my opinion.
from pyspark.sql.functions import col, rand
df1 = spark.range(1, 5).select(col("id").cast("double"))
df1 = df1.withColumn("rand_index", rand(seed = 7))
df1.show()
df1.rdd.getNumPartitions()
print('Partitioning distribution: '+ str(df1.rdd.glom().map(len).collect()))
returns:
+---+-------------------+
| id| rand_index|
+---+-------------------+
|1.0|0.06498948189958098|
|2.0|0.41371264720975787|
|3.0|0.12030715258495939|
|4.0| 0.2731073068483362|
+---+-------------------+
8 partitions & Partitioning distribution: [0, 1, 0, 1, 0, 1, 0, 1]
The same again with more data:
...
df1 = spark.range(1, 10).select(col("id").cast("double"))
...
returns:
+---+-------------------+
| id| rand_index|
+---+-------------------+
|1.0| 0.9147159860432812|
|2.0|0.06498948189958098|
|3.0| 0.7069655052310547|
|4.0|0.41371264720975787|
|5.0| 0.1982919638208397|
|6.0|0.12030715258495939|
|7.0|0.44292918521277047|
|8.0| 0.2731073068483362|
|9.0| 0.7784518091224375|
+---+-------------------+
8 partitions & Partitioning distribution: [1, 1, 1, 1, 1, 1, 1, 2]
You can see 4 common random values, and this holds both within a Spark session and across sessions.
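The pattern can be mimicked without Spark. The sketch below assumes that each partition draws from its own generator seeded with seed + partition index, and hands out values to rows in the order they sit in the partition. That is my understanding of how rand behaves, not something verified against the Spark source here. Python's random.Random stands in for Spark's generator, so the numbers differ from the tables above, and rand_column is a made-up helper. The partition layouts are the two distributions printed above.

```python
import random

def rand_column(partitions, seed=7):
    """partitions: list of lists of row ids. Returns {row_id: random value}."""
    values = {}
    for index, rows in enumerate(partitions):
        # Assumption: one generator per partition, seeded by seed + partition index.
        rng = random.Random(seed + index)
        for row_id in rows:
            values[row_id] = rng.random()
    return values

# Layouts taken from the two partitioning distributions shown above.
small = rand_column([[], [1], [], [2], [], [3], [], [4]])
large = rand_column([[1], [2], [3], [4], [5], [6], [7], [8, 9]])

# The value follows the partition slot, not the row id:
# id 1 in the small frame and id 2 in the large frame both sit first in partition 1.
print(small[1] == large[2], small[2] == large[4])
print(small[1] == large[1])
```

Under this model the value a row receives is a function of where the row lands, which is why the same ids can get different values when the layout changes, even with a fixed seed.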
I know it's a bit late, but have you considered hashing IDs, dates, etc., which is deterministic, instead of using the built-in random functions? I'm encountering a similar issue, and I believe my problem can be solved using, for example, xxhash64, which is a built-in PySpark hash function. You can then use the last few digits, or normalize if you know the total range of the hash values, which I couldn't find in its documentation.
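A minimal sketch of the hashing idea in plain Python, assuming the row id is stable across sessions. hashlib.blake2b stands in for xxhash64 here, and stable_uniform and its salt parameter are hypothetical names. For what it is worth, xxhash64 in PySpark returns a signed 64-bit integer as far as I know, which would give the range needed for normalizing.

```python
import hashlib

def stable_uniform(row_id, salt="7"):
    """Deterministic pseudo-random value in [0, 1) derived only from row_id and salt."""
    payload = f"{salt}:{row_id}".encode("utf-8")
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    as_int = int.from_bytes(digest, "big")  # unsigned 64-bit integer
    # Keep the top 53 bits so the result is an exact float strictly below 1.0.
    return (as_int >> 11) / 2 ** 53

# The value depends only on the id and the salt, not on partitioning or session.
values = {row_id: stable_uniform(row_id) for row_id in range(1, 6)}
print(values)
```

Changing the salt plays the role of changing the seed: it yields a different but equally reproducible column.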
source_dataset = tf.data.TextLineDataset('primary.csv')
target_dataset = tf.data.TextLineDataset('secondary.csv')
dataset = tf.data.Dataset.zip((source_dataset, target_dataset))
dataset = dataset.shard(10000, 0)
dataset = dataset.map(lambda source, target: (tf.string_to_number(tf.string_split([source], delimiter=',').values, tf.int32),
tf.string_to_number(tf.string_split([target], delimiter=',').values, tf.int32)))
dataset = dataset.map(lambda source, target: (source, tf.concat(([start_token], target), axis=0), tf.concat((target, [end_token]), axis=0)))
dataset = dataset.map(lambda source, target_in, target_out: (source, tf.size(source), target_in, target_out, tf.size(target_in)))
dataset = dataset.shuffle(NUM_SAMPLES) #This is the important line of code
I would like to shuffle my entire dataset fully, but shuffle() requires a buffer size (the number of samples to pull from), and tf.size() does not work with tf.data.Dataset.
How can I shuffle properly?
I was working with tf.data.FixedLengthRecordDataset() and ran into a similar problem.
In my case, I was trying to only take a certain percentage of the raw data.
Since I knew all the records have a fixed length, a workaround for me was:
import os
totalBytes = sum([os.path.getsize(os.path.join(filepath, filename)) for filename in os.listdir(filepath)])
numRecordsToTake = tf.cast(0.01 * percentage * totalBytes / bytesPerRecord, tf.int64)
dataset = tf.data.FixedLengthRecordDataset(filenames, recordBytes).take(numRecordsToTake)
In your case, my suggestion would be to count the number of records in 'primary.csv' and 'secondary.csv' directly in Python. Alternatively, I think that for your purpose setting the buffer_size argument doesn't really require counting the records. According to the accepted answer about the meaning of buffer_size, a number greater than the number of elements in the dataset will ensure a uniform shuffle across the whole dataset. So just putting in a really big number (one you think will surpass the dataset size) should work.
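To see why a large buffer_size gives a full shuffle, here is a simplified pure-Python model of a buffered shuffle. It illustrates the idea only and is not TensorFlow's actual implementation; buffered_shuffle is a made-up name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Emit items in an order produced by sampling from a fixed-size buffer."""
    rng = random.Random(seed)
    buffer, out = [], []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # still filling the buffer
            continue
        index = rng.randrange(len(buffer))
        out.append(buffer[index])    # emit a random buffered element
        buffer[index] = item         # and replace it with the incoming one
    while buffer:                    # input exhausted: drain in random order
        out.append(buffer.pop(rng.randrange(len(buffer))))
    return out

data = list(range(10))
no_shuffle = buffered_shuffle(data, 1)    # buffer of 1: order is unchanged
partial = buffered_shuffle(data, 3)       # elements can only move a few slots forward
full = buffered_shuffle(data, 1000)       # buffer >= dataset size: whole dataset is mixed
print(no_shuffle, partial, full)
```

With a buffer smaller than the dataset, the element emitted at position i can only come from the first i + buffer_size inputs, so the shuffle is local. Once the buffer holds everything, every permutation is possible.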
As of TensorFlow 2, the length of the dataset can be easily retrieved by means of the cardinality() function.
dataset = tf.data.Dataset.range(42)
# both evaluate to 42
dataset_length_v1 = tf.data.experimental.cardinality(dataset).numpy()
dataset_length_v2 = dataset.cardinality().numpy()
NOTE: When using predicates such as filter, the returned length may be -2, which means the cardinality is unknown. If you do use filter predicates on your dataset, make sure you have calculated the length of your dataset in another manner (for example, the length of the pandas dataframe before applying .from_tensor_slices() to it).
Hello, I am using KMeans to build a topic classifier. My idea is to take Facebook comments from different users so that I have several documents.
My list of documents looks as follows:
list=["comment1","comment2",...,"commentN"]
Then I used tfidf to vectorize every comment and assign it to a specific cluster.
My program is the following:
tfidf = tfidf_vectorizer.fit_transform(list)
tf = tf_vectorizer.fit_transform(list)
print("size of tf",tf.shape)
print("size of tfidf",tfidf.shape)
#Creating clusters from data
kmeans = KMeans(n_clusters=8, random_state=0).fit(tf)
print("printing labels",kmeans.labels_)
#Printing the number of clusters
print("Number of clusters",set(kmeans.labels_))
print("dimensions of matrix labels",(kmeans.labels_).shape)
#Predicting new labels
y_pred = kmeans.predict(tf)
print("dimensions of predict matrix",y_pred.shape)
My output looks as follows:
size of tf (202450, 2000)
size of tfidf (202450, 2000)
printing labels [1 1 1 ..., 1 1 1]
Number of clusters {0, 1, 2, 3, 4, 5, 6, 7}
dimensions of matrix labels (202450,)
dimensions of predict matrix (202450,)
C:\Program Files\Anaconda3\lib\site-packages\sklearn\utils\validation.py:420: DataConversionWarning: Data with input dtype int64 was converted to float64.
warnings.warn(msg, DataConversionWarning)
Now the problem is that I would like to find a way to make sense of these clusters. I mean, class 0 is about sports, class 1 is about politics, and so on. I would appreciate any recommendation for understanding these clusters, or at least a way to get all the comments that belong to a specific cluster so that I can interpret the result. Thanks for the support.
There are multiple approaches.
The easiest approach is to look at the centroid of each cluster; its largest components are a good summary of the words most used in the cluster.
The second approach is to take the sub-matrix of tf-idf rows for the elements assigned to each cluster.
After that you can run PCA on each sub-matrix to extract factors and better understand the composition of each cluster.
Sorry, I do not use scikit-learn, so I cannot help you with code.
Hope that helps.
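The first approach, plus the grouping of comments that the question asks for, can be sketched in plain Python as follows. The vocabulary, centroids, documents and labels are toy stand-ins, and top_terms and group_by_cluster are hypothetical helpers. With scikit-learn I would expect the centroids to come from kmeans.cluster_centers_, the labels from kmeans.labels_, and the vocabulary from the vectorizer (get_feature_names_out() in recent versions), but treat that mapping as an assumption.

```python
from collections import defaultdict

def top_terms(centroid, vocabulary, n=3):
    """Return the n terms with the largest weight in one centroid."""
    ranked = sorted(range(len(centroid)), key=lambda i: centroid[i], reverse=True)
    return [vocabulary[i] for i in ranked[:n]]

def group_by_cluster(documents, labels):
    """Collect the documents assigned to each cluster label."""
    groups = defaultdict(list)
    for document, label in zip(documents, labels):
        groups[int(label)].append(document)
    return dict(groups)

# Toy stand-ins for the real vocabulary, centroids, comments and labels.
vocabulary = ["goal", "match", "team", "election", "vote", "senate"]
centroids = [
    [0.9, 0.7, 0.8, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 0.8, 0.9, 0.6],
]
documents = ["great goal in the match", "vote in the election", "the team won"]
labels = [0, 1, 0]

summaries = [top_terms(centroid, vocabulary) for centroid in centroids]
groups = group_by_cluster(documents, labels)
print(summaries)  # [['goal', 'team', 'match'], ['vote', 'election', 'senate']]
print(groups)
```

Reading the top terms next to a handful of comments from the same cluster is usually enough to give each cluster a working name.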
I have a corpus of wiki pages (baseball, hockey, music, football) which I'm running through tfidf and then through kmeans. After a couple of issues to start (you can see my previous questions), I'm finally getting a KMeansModel... but when I try to predict, I keep getting the same center. Is this because of the small dataset, or because I'm comparing multi-word documents against a query with far fewer words (1-20)? Or is there something else I'm doing wrong? See the code below:
//Preprocessing of data includes splitting into words
//and removing words with only 1 or 2 characters
val corpus: RDD[Seq[String]]
val hashingTF = new HashingTF(100000)
val tf = hashingTF.transform(corpus)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf).cache
val kMeansModel = KMeans.train(tfidf, 3, 10)
val queryTf = hashingTF.transform(List("music"))
val queryTfidf = idf.transform(queryTf)
kMeansModel.predict(queryTfidf) //Always the same, no matter the term supplied
This question seems somewhat related to this one
More a checklist than an answer:
A single-word query or a very short sentence is probably not a good choice, especially when combined with a large feature vector. I would start with significant fragments of the documents from the corpus.
Manually check the similarity between the query and each cluster. Is it even remotely similar to any cluster?
import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import breeze.linalg.functions.cosineDistance
import org.apache.spark.mllib.linalg.{Vector, SparseVector, DenseVector}
def toBreeze(v: Vector): BV[Double] = v match {
case DenseVector(values) => new BDV[Double](values)
case SparseVector(size, indices, values) => {
new BSV[Double](indices, values, size)
}
}
val centers = kMeansModel.clusterCenters.map(toBreeze(_))
val query = toBreeze(queryTfidf)
centers.map(c => cosineDistance(query, c))
Does K-Means converge? Depending on the dataset and the initial centroids, ten or twenty iterations may not be enough. Try increasing this number to one thousand or so and see if the problem persists.
Is your corpus diverse enough to form meaningful clusters? Try to find the assigned cluster for each document in your corpus. Do you get a relatively uniform distribution, or are almost all documents assigned to a single cluster?
Perform a visual inspection. Take your tfidf RDD, convert it to a matrix, apply PCA, plot, color by cluster and see if you get meaningful results.
Plot the centroids as well and check whether they cover the possible clusters. If not, check convergence once again.
You can also check similarities between centroids:
(0 until centers.size)
.toList
.flatMap(i => ((i + 1) until centers.size)
.map(j => (i, j, 1 - cosineDistance(centers(i), centers(j)))))
Is your pre-processing thorough enough? Simple removal of short words most likely won't suffice. I would at least extend it with stopword removal. Some stemming wouldn't hurt either.
K-Means results depend on the initial centroids. Try running the algorithm multiple times and see if the problem persists.
Try a more sophisticated algorithm like LDA.
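The similarity check from the list above can also be tried outside Spark. This is a small sketch in plain Python with sparse vectors stored as {index: value} dicts; the centers and queries are invented toy data and the function names are hypothetical. It also shows one possible explanation (not a diagnosis) for always getting the same center: if a short query shares no terms with any centroid, its squared Euclidean distance to a center is |q|^2 + |c|^2, so the center with the smallest norm wins whatever the query term is.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {index: value} dicts."""
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in set(a) | set(b))

def nearest_center(query, centers):
    """Index of the center closest to the query in Euclidean distance."""
    return min(range(len(centers)), key=lambda i: squared_distance(query, centers[i]))

# Toy centroids; squared norms are 5, 10 and 16.
centers = [{0: 1.0, 1: 2.0}, {1: 1.0, 2: 3.0}, {3: 4.0}]

similarities = [cosine_similarity({3: 1.0}, c) for c in centers]
print(similarities)                       # only the last center shares a term

# Two different one-term queries that overlap with no center at all:
print(nearest_center({7: 1.0}, centers))  # 0
print(nearest_center({8: 1.0}, centers))  # 0
```

If the cosine similarities of a real query are all close to zero, the prediction says more about the norms of the centroids than about the query.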
I was looking at the Spark 1.5 dataframe/row api and the implementation for the logistic regression. As I understand, the train method therein first converts the dataframe to RDD[LabeledPoint] as,
override protected def train(dataset: DataFrame): LogisticRegressionModel = {
// Extract columns from data. If dataset is persisted, do not persist oldDataset.
val instances = extractLabeledPoints(dataset).map {
case LabeledPoint(label: Double, features: Vector) => (label, features)
}
...
And then it proceeds to feature standardization, etc.
What I am confused about is that the DataFrame is of type RDD[Row], and a Row is allowed to hold values of any type; e.g. (1, true, "a string", null) seems to be a valid row of a dataframe. If that is so, what does extractLabeledPoints above mean? It seems to select only Array[Double] as the feature values in Vector. What happens if a column in the dataframe contains strings? Also, what happens to integer categorical values?
Thanks in advance,
Nikhil
Let's ignore Spark for a moment. Generally speaking, linear models, including logistic regression, expect numeric independent variables. This is not in any way specific to Spark / MLlib. If the input contains categorical or ordinal variables, these have to be encoded first. Some languages, like R, handle this in a transparent manner:
> df <- data.frame(x1 = c("a", "b", "c", "d"), y=c("aa", "aa", "bb", "bb"))
> glm(y ~ x1, df, family="binomial")
Call: glm(formula = y ~ x1, family = "binomial", data = df)
Coefficients:
(Intercept)          x1b          x1c          x1d
 -2.357e+01   -4.974e-15    4.713e+01    4.713e+01
...
but what is really used behind the scenes is a so-called design matrix:
> model.matrix( ~ x1, df)
  (Intercept) x1b x1c x1d
1           1   0   0   0
2           1   1   0   0
3           1   0   1   0
4           1   0   0   1
...
Skipping over the details it is the same type of transformation as the one performed by the OneHotEncoder in Spark.
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
val df = sqlContext.createDataFrame(Seq(
Tuple1("a"), Tuple1("b"), Tuple1("c"), Tuple1("d")
)).toDF("x").repartition(1)
val indexer = new StringIndexer()
.setInputCol("x")
.setOutputCol("xIdx")
.fit(df)
val indexed = indexer.transform(df)
val encoder = new OneHotEncoder()
.setInputCol("xIdx")
.setOutputCol("xVec")
val encoded = encoder.transform(indexed)
encoded
.select($"xVec")
.map(_.getAs[Vector]("xVec").toDense)
.foreach(println)
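For readers without R or Spark at hand, the design-matrix encoding shown by model.matrix above can be sketched in a few lines of plain Python. design_matrix is a made-up helper that mimics the R convention (intercept column, first level used as the reference). As far as I know, Spark's OneHotEncoder by default drops the last category instead and adds no intercept column, so take this as an illustration of the idea rather than of either API.

```python
def design_matrix(values, name="x1"):
    """Intercept column plus one 0/1 indicator per level, dropping the first level."""
    levels = sorted(set(values))
    others = levels[1:]  # levels[0] is the reference level and gets no column
    header = ["(Intercept)"] + [f"{name}{level}" for level in others]
    rows = [[1] + [1 if value == level else 0 for level in others] for value in values]
    return header, rows

header, rows = design_matrix(["a", "b", "c", "d"])
print(header)
for row in rows:
    print(row)
```

Each string value becomes a row of numbers, which is the form a linear model can actually consume.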
Spark goes one step further: all features, even if the algorithm allows nominal/ordinal independent variables, have to be stored as Double using a spark.mllib.linalg.Vector. In spark.ml this is a DataFrame column; in spark.mllib it is a field in spark.mllib.regression.LabeledPoint.
Depending on the model, the interpretation of the feature vector can differ, though. As mentioned above, for a linear model the features are interpreted as numerical variables. For Naive Bayes they are considered nominal. If a model accepts both numerical and nominal variables and treats each group differently, as decision / regression trees do, you can provide the categoricalFeaturesInfo parameter.
It is worth pointing out that dependent variables should be encoded as Double as well but, unlike independent variables, may require additional metadata to be handled properly. If you take a look at the indexed DataFrame you'll see that StringIndexer not only transforms x, but also adds attributes:
scala> org.apache.spark.ml.attribute.Attribute.fromStructField(indexed.schema(1))
res12: org.apache.spark.ml.attribute.Attribute = {"vals":["d","a","b","c"],"type":"nominal","name":"xIdx"}
Finally some Transformers from ML, like VectorIndexer, can automatically detect and encode categorical variables based on the number of distinct values.
Copying clarification from zero323 in the comments:
Categorical values have to be encoded as Double before being passed to MLlib / ML estimators. There are quite a few built-in transformers, like StringIndexer or OneHotEncoder, which can be helpful here. If an algorithm treats categorical features differently from numerical ones, as DecisionTree does for example, you identify which variables are categorical using categoricalFeaturesInfo.
Finally some transformers use special attributes on columns to distinguish between different types of attributes.