Parameters for ranking in getStream

Is there any parameter available in Stream that helps rank feeds by distance, i.e. so that a person sees feed items from nearby places first and items from more distant places later?

Related

Creating radar image from web api data

To get familiar with front-end web development, I'm creating a weather app. Most of the tutorials I found display the temperature, humidity, chance of rain, etc.
Looking at the Dark Sky API, I see the "Time Machine Request" returns observed weather conditions, and the response contains a 'precipIntensity' field: The intensity (in inches of liquid water per hour) of precipitation occurring at the given time. This value is conditional on probability (that is, assuming any precipitation occurs at all).
So, it made me wonder: could I create a 'radar image' of precipitation intensity?
Assuming other weather apis are similar, is generating a radar image of precipitation as straightforward as:
Create a grid of latitude/longitude coordinates.
Submit a request for weather data for each coordinate.
Build a color-coded grid of received precipitation intensity values and smooth between them.
Or would that be considered a misuse of the data?
Thanks,
Mike
This would most likely end up as a very low-resolution product. I will explain.
Weather observations come in from sources ranging from mesonet stations and airports to programs like the Citizen Weather Observer Program. All of these thousands of inputs feed into the NOAA MADIS system, a centralized server that stores all observations. The companies that provide the APIs pull their data from MADIS.
The problem with the observed conditions is twofold. First, the stations are highly clustered in urban areas. In Texas, for example, there are hundreds of stations in Central Texas near the cities of San Antonio and Austin, but 100 miles west there is essentially nothing. Generating a radar image from these observations would involve extreme interpolation.
The second problem is observation time. The input from rain gauges is often delayed by several minutes to an hour or more, which would give inaccurate data.
If you wanted a gridded system, the best answer would be to use MRMS (Multi-Radar Multi-Sensor) data from the NWS. It is not an API; these are .grib files that must be downloaded and processed. There is a live viewer, and if you want to work on the data itself you can use the NOAA Weather and Climate Toolkit to view and/or process it through the GUI or in batch (you can export to GeoTIFF and colorize it with GDAL tools). The actual MRMS data is located here; for the basic usage you are looking for, you could use the latest data in the "MergedReflectivityComposite" folder. (That would be how other radar apps show rain.) If you want actual precipitation intensity, check the "PrecipRate" folder.
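As a rough illustration of that export/colorize step, here is a minimal sketch assuming GDAL's Python bindings are installed with GRIB support and an MRMS file has already been downloaded and gunzipped locally; the file name and the dBZ color ramp are placeholders, not official values:

```python
from osgeo import gdal

# Hypothetical local copy of an MRMS composite reflectivity GRIB2 file.
src = "MRMS_MergedReflectivityComposite.grib2"

# 1) Convert the GRIB message to a plain GeoTIFF.
gdal.Translate("reflectivity.tif", src, format="GTiff")

# 2) Write a simple color ramp (dBZ value -> R G B) and apply it with color-relief.
with open("colors.txt", "w") as f:
    f.write("5 4 233 231\n20 2 144 3\n35 253 228 2\n50 212 0 0\n65 248 0 253\n")

gdal.DEMProcessing("reflectivity_colored.tif", "reflectivity.tif",
                   "color-relief", colorFilename="colors.txt")
```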
For anything else except radar (warning polygons, etc) the NWS has an API that is located here.
If you have other questions, I will be happy to help.

How can I use KMeans to cluster tweets in Spark?

I'd like to cluster tweets based on topic (e.g. all Amazon tweets in one cluster, all Netflix tweets in another, etc.). The thing is, all the incoming tweets are already filtered on these keywords, but they're jumbled up, and I'm just categorizing them as they come in.
I'm using Spark Streaming and am looking for a way to vectorize these tweets. Because the tweets arrive as a stream rather than a batch, I don't have access to the entire corpus of tweets.
If you have a predefined vocabulary with potentially multiple terms selected simultaneously - e.g. a set of non-mutually-exclusive tweet categories that you are interested in - then you can have a binary vector in which each bit represents one of the categories.
If the categories are mutually exclusive, then what could you hope to achieve by clustering? Specifically, there would be no "gray area" in which some observations belong to CategorySet-A, others to CategorySet-B, and others to some in-between combination. If every observation is hard-capped at one category, then you have discrete points, not clusters.
If instead you wish to cluster based on similar sets of words - then you might need to know the "vocabulary" up-front - which in this case means: "what are the tweet terms that I care about". In that case you can use a bag of words model https://machinelearningmastery.com/gentle-introduction-bag-words-model/ to compare the tweets - and then cluster based on the generated vectors.
Now if you are uncertain of the vocabulary a priori - which is the likely case here, since you do not know what the content of the next tweet will be - then you will likely resort to re-clustering on a regular basis as you gain new words. You can then use an updated bag of words that includes the newly "seen" terms. Note that this incurs processing cost and latency. To avoid the cost/latency, you have to decide ahead of time which terms to restrict your clustering to, which may be possible if you're interested in a targeted subject.
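To make the bag-of-words-plus-clustering idea concrete, here is a minimal PySpark sketch using the DataFrame-based pyspark.ml API; the tweets are made-up stand-ins for whatever the streaming source delivers, and vocabSize is an arbitrary cap:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("tweet-clustering").getOrCreate()

# Hypothetical sample tweets; in practice these come from the streaming source.
tweets = spark.createDataFrame(
    [("Amazon prime delivery was late again",),
     ("Loving the new Netflix series",),
     ("Netflix keeps buffering tonight",),
     ("Ordered a laptop on Amazon",)],
    ["text"],
)

tokenizer = Tokenizer(inputCol="text", outputCol="words")
# vocabSize caps the bag-of-words vocabulary; new terms require refitting (re-clustering).
vectorizer = CountVectorizer(inputCol="words", outputCol="features", vocabSize=1000)
kmeans = KMeans(k=2, seed=42, featuresCol="features")

model = Pipeline(stages=[tokenizer, vectorizer, kmeans]).fit(tweets)
model.transform(tweets).select("text", "prediction").show(truncate=False)
```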

Number of training samples for a text classification task

Suppose you have a set of transcribed customer service calls between customers and human agents, where on average each call's length is 7 minutes. Customers will mostly call because of issues they have with the product. Let's assume that a human can assign one label per axis per call:
Axis 1: What was the problem from the customer's perspective?
Axis 2: What was the problem from the agent's perspective?
Axis 3: Could the agent resolve the customer's issue?
Based on the manually labeled texts, you want to train a text classifier that predicts a label for each call on each of the three axes. But labeling recordings takes time and costs money. On the other hand, you need a certain amount of training data to get good prediction results.
Given the above assumptions, how many manually labeled training texts would you start with? And how do you know that you need more labeled training texts?
Maybe you've worked on a similar task before and can give some advice.
UPDATE (2018-01-19): There's no right or wrong answer to my question. Ok, ideally, somebody worked on exactly the same task, but that's very unlikely. I'll leave the question open for one more week and then accept the best answer.
This would be tricky to answer but I will try my best based on my experience.
In the past, I have performed text classification on three datasets; the number in brackets indicates how big each dataset was: restaurant reviews (50K sentences), Reddit comments (250K sentences), and developer comments from issue tracking systems (10K sentences). Each of them had multiple labels as well.
In each of the three cases, including the one with 10K sentences, I achieved an F1 score of more than 80%. I am stressing this dataset specifically because I was told by some that it is too small.
So, in your case, assuming you have at least 1,000 instances (calls that include a conversation between customer and agent) of 7-minute calls on average, this should be a decent start. If the results are not satisfying, you have the following options:
1) Use different models (MNB, Random Forest, Decision Tree, and so on, in addition to whatever you are using); see the sketch after this list.
2) If point 1 gives more or less similar results, check the ratio of instances across all the classes you have (on the three axes you are talking about here). If the classes are imbalanced, get more data, or try different balancing techniques if you cannot get more data.
3) Another way would be to classify at the sentence level rather than the message or conversation level, to generate more data and individual labels for sentences rather than for the message or the conversation itself.
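As a rough illustration of point 1, here is a minimal scikit-learn sketch that compares a few of those models with cross-validated macro F1; the transcripts and labels below are made-up stand-ins for the manually labelled calls, handled one axis at a time:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data: call transcripts and their labels for one axis.
calls = [
    "my router keeps dropping the connection",
    "the wifi disconnects every few minutes",
    "I was charged twice on my last invoice",
    "there is a duplicate payment on my bill",
]
labels = ["connectivity", "connectivity", "billing", "billing"]

# Compare several simple models on the same TF-IDF features.
for clf in (MultinomialNB(), DecisionTreeClassifier(), RandomForestClassifier()):
    pipe = make_pipeline(TfidfVectorizer(), clf)
    scores = cross_val_score(pipe, calls, labels, cv=2, scoring="f1_macro")
    print(f"{type(clf).__name__}: mean F1 = {scores.mean():.2f}")
```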

Multi-Target Regression/Interpolations

Looking for some advice on the problem I have in front of me.
I have a data set of movies watched by users. For some of the users, we know that they watched the movie, and what their rating is for that movie. For many others we know they watched the movie, but don't know their rating of that movie.
I am looking for a way to apply a predicted or perhaps interpolated rating to the movies those users watched, based on the larger dataset that has movie ratings. I am trying to find out what the best course of action would be. I have 1.5M users and 20K movies; however, only 10% of those movies are rated, by about 85% of users.
My first approach is to look at cosine similarity and interpolate the rating from the nearest neighbor; if the nearest neighbor doesn't have a value for a specific movie, go to the next nearest until all the movies have a rating. The other approach is to use NNMF to fill in ratings, with twice the features: one binary representation of the movies watched, the other the ratings. So when I want to "predict" for a user, I'll input their binary movie values and it will return their ratings.
My questions are: Does the NNMF approach make sense? I have never used NNMF in that way. Also, are there any other models you think make sense? I am wondering if there is more of a prediction algorithm that could be employed rather than an interpolation.
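A minimal sketch of the NNMF idea described above, assuming scikit-learn's NMF on a toy user-by-movie matrix; note that coding unrated entries as zero is a simplification, since NMF treats them as true zeros unless they are masked or weighted:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy ratings matrix: rows = users, columns = movies, 0 = watched but unrated.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(R)   # user factors
H = model.components_        # movie factors
R_hat = W @ H                # reconstructed matrix = predicted ratings

# Keep known ratings; fill only the unrated (zero) cells with predictions.
predicted = np.where(R == 0, R_hat, R)
print(np.round(predicted, 2))
```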

PredictionIO for Content Recommendation e.g. Tweets

I recently installed PredictionIO.
What I'd like to achieve is this: I'd like to categorize content based on the words included in the text. But how can I import data like raw Tweets into PredictionIO? Is it possible to let PredictionIO run over the content, find strong words, and sort them into categories?
What I'd like to get is something like this: Query for Boston Red Sox --> keywords that should appear would be: baseball, Boston, sports, ...
So I'll add a little to what Thomas said. He's right: it all depends on whether or not you have labels associated with your tweets. If your data is labeled, then this is a text classification problem; look at the text classification reference for more detailed info.
If you're instead looking to cluster, or group, a set of unlabeled observations, then, as Thomas said, your best bet is to incorporate LDA. Look at the latter documentation for more information, but basically, once you run the LDA model you'll obtain an object of type DistributedLDAModel, which has a method topicDistributions that gives you, for each tweet, a vector where each component is associated with a topic and the component entry gives the probability that the tweet belongs to that topic. You can cluster by assigning each tweet the topic with the highest probability.
You also have access to a matrix of size M×N, where M is the number of words in your vocabulary and N is the number of topics, or clusters, you wish to discover in your data. You can roughly interpret the (i, j)-th entry of this topics matrix as the probability that word i appears in a document, given that the document belongs to topic j. Another rule you could use for clustering is to treat each word vector associated with your tweets as a vector of counts. Then you can interpret the (i, j) entry of the product of your word matrix (tweets as rows, words as columns) and the topics matrix returned by LDA as the probability that tweet i belongs to topic j (this follows under certain assumptions; feel free to ask if you want more details). You then assign tweet i to the topic associated with the largest value in row i of the resulting matrix. You can even use this clustering rule for assigning topics to incoming observations once you have used your original set of tweets for topic discovery!
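Here is a minimal sketch of that workflow, assuming the DataFrame-based pyspark.ml LDA API (which exposes the per-tweet topic distribution through transform and the word-by-topic matrix through topicsMatrix); the tweets are made-up examples, not data imported through PredictionIO:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, CountVectorizer
from pyspark.ml.clustering import LDA

spark = SparkSession.builder.appName("tweet-topics").getOrCreate()

# Hypothetical tweets; real ones would come from the import step described below.
tweets = spark.createDataFrame(
    [("I love Boston Red Sox #GoRedSox",),
     ("Baseball time at Fenway Park. Red Sox FTW!",),
     ("Woohoo! I love sports #winning",)],
    ["text"],
)

# Turn raw text into word-count vectors.
words = Tokenizer(inputCol="text", outputCol="words").transform(tweets)
cv_model = CountVectorizer(inputCol="words", outputCol="features").fit(words)
counts = cv_model.transform(words)

lda_model = LDA(k=2, maxIter=20, featuresCol="features").fit(counts)

# Each row gets a "topicDistribution" vector; assign the argmax topic as the cluster.
lda_model.transform(counts).select("text", "topicDistribution").show(truncate=False)

# topicsMatrix() is vocabulary-size x k, roughly P(word | topic) per column.
print(lda_model.topicsMatrix())
```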
Now, for data processing, you can still use the Text Classification reference for transforming your Tweets to word count vectors via the DataSource and Preparator components. As for importing your data, if you have the tweets saved locally on a file, you can use PredictionIO's Python SDK to import your data. An example is also given in the classification reference.
Feel free to ask any questions if anything isn't clear, and good luck!
So, it really depends on whether you have labelled data.
For example:
Baseball :: "I love Boston Red Sox #GoRedSox"
Sports :: "Woohoo! I love sports #winning"
Boston :: "Baseball time at Fenway Park. Red Sox FTW!"
...
Then you would be able to train a model to classify Tweets against these keywords. You might be interested in the templates for MLlib Naive Bayes and Decision Trees.
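For example, here is a minimal PySpark sketch of that supervised route, using the labelled examples above with pyspark.ml's NaiveBayes; the pipeline and column names are only illustrative, not the PredictionIO template itself:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, CountVectorizer, StringIndexer
from pyspark.ml.classification import NaiveBayes

spark = SparkSession.builder.appName("tweet-classification").getOrCreate()

# The labelled examples from above (hypothetical training data).
data = spark.createDataFrame(
    [("Baseball", "I love Boston Red Sox #GoRedSox"),
     ("Sports",   "Woohoo! I love sports #winning"),
     ("Boston",   "Baseball time at Fenway Park. Red Sox FTW!")],
    ["category", "text"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    CountVectorizer(inputCol="words", outputCol="features"),
    StringIndexer(inputCol="category", outputCol="label"),
    NaiveBayes(featuresCol="features", labelCol="label"),
])

model = pipeline.fit(data)
model.transform(data).select("text", "prediction").show(truncate=False)
```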
If you don't have labelled data (really, who wants to manually label Tweets) you might be able to use approaches such as Topic Modeling (e.g., LDA).
I don't think there is a template for LDA, but since it is an active open source project it wouldn't surprise me if someone has already implemented this, so it might be a good idea to ask on the PredictionIO user or dev forums.
