Add Geolocation filter to twitter stream through spark streaming in Java - apache-spark

I want only the tweets related to a particular geolocation. After googling around, I found that this can be achieved by adding extra methods/functionality to the TwitterUtils and TwitterInputDStream classes, but I am unable to do so as these are final classes.
How can we achieve this?
Thanks in advance.

This is the best answer I could find for doing this when creating the stream. However, with this filter we are filtering the tweets after receiving them; there may be a way to filter before receiving them by supplying the filter to the Twitter API (see the sketch after the snippets below). This code has been verified to work and comes from:
http://www.michael-goettsche.de/?p=19#return-note-19-4
JavaDStream<Status> tweetsWithLocation = twitterStream.filter(
    new Function<Status, Boolean>() {
        public Boolean call(Status status) {
            if (status.getGeoLocation() != null) {
                return true;
            } else {
                return false;
            }
        }
    }
);

JavaDStream<String> statuses = tweetsWithLocation.map(
    new Function<Status, String>() {
        public String call(Status status) {
            return status.getGeoLocation().toString() + ": " + status.getText();
        }
    }
);
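As a rough illustration of the push-down idea mentioned above, twitter4j's FilterQuery supports a locations bounding box. TwitterUtils only exposes keyword filters, so wiring this into a DStream would require a custom receiver; this is only a sketch, and the class name and bounding-box coordinates are made-up example values.

import twitter4j.FilterQuery;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

public class GeoFilterSketch {
    public static void main(String[] args) {
        // Bounding box as {south-west longitude, latitude}, {north-east longitude, latitude};
        // these coordinates are placeholder example values.
        double[][] boundingBox = {
            { -122.75, 36.8 },
            { -121.75, 37.8 }
        };
        FilterQuery geoFilter = new FilterQuery().locations(boundingBox);

        // Assumes OAuth credentials are configured in twitter4j.properties.
        // A StatusListener would normally be registered here; to feed Spark Streaming,
        // the listener would call store(status) from inside a custom Receiver instead
        // of using TwitterUtils.createStream.
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.filter(geoFilter);
    }
}

With this approach the Streaming API only delivers geo-tagged tweets inside the box, instead of every tweet being received and filtered afterwards.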

Related

How to save Spark's MatrixFactorizationModel recommendProductsForUsers to HBase

I am new to Spark and I want to save the output of recommendProductsForUsers to an HBase table. I found an example (https://sparkkb.wordpress.com/2015/05/04/save-javardd-to-hbase-using-saveasnewapihadoopdataset-spark-api-java-coding/) showing how to use JavaPairRDD and saveAsNewAPIHadoopDataset to save.
How can I convert JavaRDD<Tuple2<Object, Rating[]>> to JavaPairRDD<ImmutableBytesWritable, Put> so that I can use saveAsNewAPIHadoopDataset?
//Loads the data from hdfs
MatrixFactorizationModel sameModel = MatrixFactorizationModel.load(jsc.sc(), trainedDataPath);
//Get recommendations for all users
JavaRDD<Tuple2<Object, Rating[]>> ratings3 = sameModel.recommendProductsForUsers(noOfProductsToReturn).toJavaRDD();
By using mapToPair. Here is the example from the same source you provided (I changed the types by hand):
JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts = javaRDD.mapToPair(
    new PairFunction<Tuple2<Object, Rating[]>, ImmutableBytesWritable, Put>() {
        @Override
        public Tuple2<ImmutableBytesWritable, Put> call(Tuple2<Object, Rating[]> row) throws Exception {
            // Use the user id (the first element of the tuple) as the HBase row key
            Put put = new Put(Bytes.toBytes(row._1().toString()));
            // Add one cell per value you want to store; the qualifiers and values here are
            // examples, adapt them to your own schema
            put.add(Bytes.toBytes("columnFamily"), Bytes.toBytes("columnQualifier1"), Bytes.toBytes(row._2()[0].product()));
            put.add(Bytes.toBytes("columnFamily"), Bytes.toBytes("columnQualifier2"), Bytes.toBytes(row._2()[0].rating()));
            return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
        }
    });
It goes like this: you create a new instance of Put, supplying it with the row key in the constructor, then for each column you call add, and then you return the Put you created.
This is how I solved the above problem; hope this will be helpful to someone.
JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts1 = ratings3
    .mapToPair(new PairFunction<Tuple2<Object, Rating[]>, ImmutableBytesWritable, Put>() {
        @Override
        public Tuple2<ImmutableBytesWritable, Put> call(Tuple2<Object, Rating[]> arg0)
                throws Exception {
            Rating[] userAndProducts = arg0._2();
            System.out.println("***********" + userAndProducts.length + "**************");
            // Use the user id as the HBase row key
            Put put = new Put(Bytes.toBytes(arg0._1().toString()));
            String recommendedProduct = "";
            for (Rating r : userAndProducts) {
                //Some logic here to convert Ratings into appropriate put command
                // recommendedProduct = r.product;
            }
            put.addColumn(Bytes.toBytes("recommendation"), Bytes.toBytes("product"), Bytes.toBytes(recommendedProduct));
            return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
        }
    });
System.out.println("*********** Number of records in JavaPairRdd: "+ hbasePuts1.count() +"**************");
hbasePuts1.saveAsNewAPIHadoopDataset(newApiJobConfig.getConfiguration());
jsc.stop();
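For completeness, here is a minimal sketch of how the newApiJobConfig used above could be created, assuming the standard HBase TableOutputFormat; the class name and the table name "recommendations" are just example values.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class HBaseJobConfigSketch {
    // Builds the Hadoop job configuration that saveAsNewAPIHadoopDataset expects.
    public static Job createJobConfig() throws IOException {
        Configuration hbaseConf = HBaseConfiguration.create();
        hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, "recommendations"); // example table name
        Job job = Job.getInstance(hbaseConf);
        job.setOutputFormatClass(TableOutputFormat.class);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);
        return job;
    }
}

The Job returned here would then be passed via getConfiguration() to saveAsNewAPIHadoopDataset, as in the snippet above.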
We just open sourced Splice Machine, and we have examples integrating MLlib with querying and storage into Splice Machine. I do not know if this will help, but I thought I would let you know.
http://community.splicemachine.com/use-spark-libraries-splice-machine/
Thanks for the post, very cool.

Google Custom Search API - Search Results

I have somewhat lost touch with custom search engines ever since Google switched from its legacy search engine API in favor of the Google Custom Search API. I'm hoping someone might be able to tell me whether a (pretty simple) goal can be accomplished with the new framework, and any starting help would be great.
Specifically, I am looking to write a program which will read in text from a text file, then use five words from said document in a Google search - the point being to figure out how many results accrue from said search.
An example input/output would be:
Input: "This is my search term" -- quotations included in the search!
Output: there were 7 total results
Thanks so much, all, for your time/help
First you need to create a Google Custom Search project inside your Google account.
From this project you must obtain the Custom Search Engine ID, known as the cx parameter. You must also obtain an API key. Both of these are available from your Google Custom Search API project inside your Google account.
Then, if you prefer Java, here's a working example:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class GoogleCustomSearchAPI {
    public static void main(String[] args) throws Exception {
        String key = "your_key";
        String qry = "your_query";
        String cx = "your_cx";
        //Fetch urls
        URL url = new URL(
                "https://www.googleapis.com/customsearch/v1?key=" + key + "&cx=" + cx + "&q=" + qry + "&alt=json&fields=queries(request(totalResults))");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");
        BufferedReader br = new BufferedReader(new InputStreamReader(
                (conn.getInputStream())));
        //Remove comments if you need to output in JSON format
        /*String output;
        System.out.println("Output from Server .... \n");
        while ((output = br.readLine()) != null) {
            System.out.println(output);
        }*/
        //Print the urls and domains from Google Custom Search
        String searchResult;
        while ((searchResult = br.readLine()) != null) {
            int startPos = searchResult.indexOf("\"link\": \"") + ("\"link\": \"").length();
            int endPos = searchResult.indexOf("\",");
            if (searchResult.contains("\"link\": \"") && (endPos > startPos)) {
                String link = searchResult.substring(startPos, endPos);
                if (link.contains(",")) {
                    String tempLink = "\"";
                    tempLink += link;
                    tempLink += "\"";
                    System.out.println(tempLink);
                } else {
                    System.out.println(link);
                }
                System.out.println(getDomainName(link));
            }
        }
        conn.disconnect();
    }

    public static String getDomainName(String url) throws URISyntaxException {
        URI uri = new URI(url);
        String domain = uri.getHost();
        return domain.startsWith("www.") ? domain.substring(4) : domain;
    }
}
The "&queriefields=queries(request(totalResults))" is what makes the difference and gives sou what you need. But keep in mind that you can perform only 100 queries per day for free and that the results of Custom Search API are sometimes quite different from the those returned from Google.com search
If anybody still needs an example of the CSE (Google Custom Search Engine) API, here is a working method:
public static List<Result> search(String keyword) {
    Customsearch customsearch = null;
    try {
        customsearch = new Customsearch(new NetHttpTransport(), new JacksonFactory(), new HttpRequestInitializer() {
            public void initialize(HttpRequest httpRequest) {
                try {
                    // set connect and read timeouts
                    httpRequest.setConnectTimeout(HTTP_REQUEST_TIMEOUT);
                    httpRequest.setReadTimeout(HTTP_REQUEST_TIMEOUT);
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
            }
        });
    } catch (Exception e) {
        e.printStackTrace();
    }
    List<Result> resultList = null;
    try {
        Customsearch.Cse.List list = customsearch.cse().list(keyword);
        list.setKey(GOOGLE_API_KEY);
        list.setCx(SEARCH_ENGINE_ID);
        Search results = list.execute();
        resultList = results.getItems();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return resultList;
}
This method returns a List of Result objects, so you can iterate through it:
List<Result> results = new ArrayList<>();
try {
    results = search(QUERY);
} catch (Exception e) {
    e.printStackTrace();
}
for (Result result : results) {
    System.out.println(result.getDisplayLink());
    System.out.println(result.getTitle());
    // all attributes
    System.out.println(result.toString());
}
I use these Gradle dependencies:
dependencies {
    compile 'com.google.apis:google-api-services-customsearch:v1-rev57-1.23.0'
}
Don't forget to define your own GOOGLE_API_KEY, SEARCH_ENGINE_ID (cx), QUERY and HTTP_REQUEST_TIMEOUT (e.g. private static final int HTTP_REQUEST_TIMEOUT = 3 * 600000;).
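For example, the constants assumed by the snippets above could be declared like this; all values are placeholders you must replace with your own.

// Placeholder values; substitute your own API key, engine id and query.
private static final String GOOGLE_API_KEY = "your_api_key";
private static final String SEARCH_ENGINE_ID = "your_cx";
private static final String QUERY = "\"This is my search term\""; // quotations included in the search
private static final int HTTP_REQUEST_TIMEOUT = 3 * 600000;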

Groovy addBatch/executeBatch with autoGenerated keys

Has anyone retrieved the auto-generated keys for a database insert while using Groovy SQL's withBatch method? I have the following code:
def Sql target = ...//database connection
target.withBatch { ps ->
    insertableStuff.each { ps.addBatch(it) }
    ps.executeBatch()
    def results = ps.getGeneratedKeys() //what do I do with this?
}
We're using DB2, and I've successfully tested the getGeneratedKeys method with a single statement/result set, but once I wrap the process in a batch, I'm not sure what objects I'm dealing with anymore.
According to IBM, it is possible to get the results back, but their example is using standard JDBC objects, not the groovy ones. Any ideas?
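For reference, the plain JDBC batch pattern looks roughly like the sketch below; the table and column names are placeholders, and whether a driver actually returns keys for a batch is driver-dependent, which is exactly the DB2-specific wrinkle discussed in the answer that follows.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BatchGeneratedKeysSketch {
    // Inserts two rows in one batch and prints whatever generated keys the driver returns.
    public static void insertBatch(Connection connection) throws SQLException {
        String sql = "INSERT INTO MY_TABLE (NAME) VALUES (?)"; // example table/column
        try (PreparedStatement ps = connection.prepareStatement(sql, new String[] { "ISN" })) {
            ps.setString(1, "first");
            ps.addBatch();
            ps.setString(1, "second");
            ps.addBatch();
            ps.executeBatch();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                while (keys.next()) {
                    System.out.println(keys.getInt(1));
                }
            }
        }
    }
}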
I took the Groovy SQL stuff out of the picture to see if I could get something working; I wanted to make sure that DB2 for z/OS actually supported the function, and I was able to get the generated values. I was using IBM's example, but I had to add some extra code to handle the casting that the IBM example relies on.
def Sql target = ...//get database connection
def preparedStatement = target.connection.prepareStatement(statement, ['ISN'] as String[])
// unwrap the pooled statement down to the DB2-specific class to reach getDBGeneratedKeys
ResultSet[] resultSets = ((DB2PreparedStatement) (preparedStatement.getDelegate().getDelegate())).getDBGeneratedKeys()
resultSets.each { ResultSet results ->
    while (results.next()) {
        println results.getInt(1)
    }
}
So... that's a little clunky, but it's functional. Unfortunately, by controlling the statement myself, I lost all of the parameter mapping that Groovy normally does for me.
I was looking through the groovy Sql source code and can see where they are explicitly telling the database connection not to handle parameters, so I'm thinking I'll add a new method to Sql.metaClass that can pass in a list of the auto-generated column names or something to make this more palatable.
I also want to see if there's a way to get the getGeneratedKeys method working so that I don't have to do all of that casting. At the very least, a utility method to safely handle the casting for me.
try {
    withinBatch = true;
    PreparedStatement statement = (PreparedStatement) getAbstractStatement(new CreatePreparedStatementCommand(0), connection, sql);
    configure(statement);
    psWrapper = new BatchingPreparedStatementWrapper(statement, indexPropList, batchSize, LOG, this);
    closure.call(psWrapper);
    return psWrapper.executeBatch();
} catch (SQLException e) {
The CreatePreparedStatementCommand(0) call prevents the creation of a statement which could return the auto-generated keys.
Just to make sure I wasn't crazy, I re-tried the getGeneratedKeys method with a statement that I know works, and I got no results (see below). I had to recursively spin through the delegates to find the IBM class. So... it's not my favorite code, and it's pretty brittle, but it's functional. Now I just need to see if I can still use the withBatch method somehow; I'll obviously need to override some things.
println 'print using getGeneratedKeys'
def results = preparedStatement.getGeneratedKeys()
while (results.next()) {
    println SqlGroovyMethods.toRowResult(results)
}
println 'print using delegate processing'
println getGeneratedKeys(preparedStatement)

private List getGeneratedKeys(PreparedStatement statement) {
    switch (statement) {
        case DelegatingStatement:
            return getGeneratedKeys(DelegatingStatement.cast(statement).getDelegate())
        case DB2PreparedStatement:
            ResultSet[] resultSets = DB2PreparedStatement.cast(statement).getDBGeneratedKeys()
            List keys = []
            resultSets.each { ResultSet results ->
                while (results.next()) {
                    keys << SqlGroovyMethods.toRowResult(results)
                }
            }
            return keys
        default:
            return [SqlGroovyMethods.toRowResult(statement.getGeneratedKeys())]
    }
}
---- Console Output ----
print using getGeneratedKeys
print using delegate processing
[[KEY:7391], [KEY:7392]]
Okay, got it working. I had to hack my way into the Groovy SQL class, and there are some things that I just couldn't do because the methods in the Groovy class were private. As a result, this implementation doesn't support cachedStatements, the isWithinBatch method won't operate correctly in the closure, and there's no access to the number of rows that were updated.
It'd be nice to see some variation of this in the base Groovy code, perhaps with an extension point where you could plug in your own handler (since you wouldn't want the IBM-specific stuff in the base Groovy code), but at least I have a workable solution now.
public class SqlWithGeneratedKeys extends Sql {

    public SqlWithGeneratedKeys(Sql parent) {
        super(parent);
    }

    public List<GroovyRowResult> withBatch(String pSql, String[] keys, Closure closure) throws SQLException {
        return this.withBatch(0, pSql, keys, closure);
    }

    public List<GroovyRowResult> withBatch(int batchSize, String pSql, String[] keys, Closure closure) throws SQLException {
        final Connection connection = this.createConnection();
        List<Tuple> indexPropList = null;
        final SqlWithParams preCheck = this.buildSqlWithIndexedProps(pSql);
        BatchingPreparedStatementWrapper psWrapper = null;
        String sql = pSql;
        if (preCheck != null) {
            indexPropList = new ArrayList<Tuple>();
            for (final Object next : preCheck.getParams()) {
                indexPropList.add((Tuple) next);
            }
            sql = preCheck.getSql();
        }
        PreparedStatement statement = null;
        try {
            statement = connection.prepareStatement(sql, keys);
            this.configure(statement);
            psWrapper = new BatchingPreparedStatementWrapper(statement, indexPropList, batchSize, LOG, this);
            closure.call(psWrapper);
            psWrapper.executeBatch();
            return this.getGeneratedKeys(statement);
        } catch (final SQLException e) {
            LOG.warning("Error during batch execution of '" + sql + "' with message: " + e.getMessage());
            throw e;
        } finally {
            BaseDBServices.closeDBElements(connection, statement, null);
        }
    }

    protected List<GroovyRowResult> getGeneratedKeys(Statement statement) throws SQLException {
        if (statement instanceof DelegatingStatement) {
            return this.getGeneratedKeys(DelegatingStatement.class.cast(statement).getDelegate());
        } else if (statement instanceof DB2PreparedStatement) {
            final ResultSet[] resultSets = DB2PreparedStatement.class.cast(statement).getDBGeneratedKeys();
            final List<GroovyRowResult> keys = new ArrayList<GroovyRowResult>();
            for (final ResultSet results : resultSets) {
                while (results.next()) {
                    keys.add(SqlGroovyMethods.toRowResult(results));
                }
            }
            return keys;
        }
        return Arrays.asList(SqlGroovyMethods.toRowResult(statement.getGeneratedKeys()));
    }
}
Calling it is nice and clean.
println new SqlWithGeneratedKeys(target).withBatch(statement, ['ISN'] as String[]) { ps ->
    rows.each {
        ps.addBatch(it)
    }
}

SharePoint OData API Only Returns 1000 Records

I am trying to query a SharePoint 2013 list using the REST API for all items in the list. The problem is it only returns 1000 records max, and I need to get all of the records. I am using the OData v4 API and auto-generated service references for the site.
I figured it out: I am including the question and answer here in case anyone else needs it.
I created an extension method called SelectAll() that returns all of the records for a given query.
public static List<T> SelectAll<T>(this DataServiceContext dataContext, IQueryable<T> query)
{
    var list = new List<T>();
    DataServiceQueryContinuation<T> token = null;
    var response = ((DataServiceQuery)query).Execute() as QueryOperationResponse<T>;
    do
    {
        if (token != null)
        {
            response = dataContext.Execute(token);
        }
        list.AddRange(response);
    } while ((token = response.GetContinuation()) != null);
    return list;
}
You use it by calling dataContext.SelectAll(query);
I had the same problem and wanted a generic solution that works without providing the query. I use the EntitySetAttribute to determine the list name.
public static List<T> GetAllItems<T>(this DataServiceContext context)
{
    return context.GetAllItems<T>(null);
}

public static List<T> GetAllItems<T>(this DataServiceContext context, IQueryable<T> queryable)
{
    List<T> allItems = new List<T>();
    DataServiceQueryContinuation<T> token = null;
    EntitySetAttribute attr = (EntitySetAttribute)typeof(T).GetCustomAttributes(typeof(EntitySetAttribute), false).First();

    // Execute the query for all items and get the response object.
    DataServiceQuery<T> query = null;
    if (queryable == null)
    {
        query = context.CreateQuery<T>(attr.EntitySet);
    }
    else
    {
        query = (DataServiceQuery<T>)queryable;
    }
    QueryOperationResponse<T> response = query.Execute() as QueryOperationResponse<T>;

    // With a paged response from the service, use a do...while loop
    // to enumerate the results before getting the next link.
    do
    {
        // If nextLink is not null, then there is a new page to load.
        if (token != null)
        {
            // Load the new page from the next link URI.
            response = context.Execute<T>(token);
        }
        allItems.AddRange(response);
    }
    // Get the next link, and continue while there is a next link.
    while ((token = response.GetContinuation()) != null);

    return allItems;
}

Need help with a SharePoint workflow

I am new to SharePoint development and I have a task in hand. I need to add a few lines of code for the following logic:
1. Check if the previous title and the new title of a task item are the same.
2. If not, query the Task list.
3. Find all the items which contain the previous title.
4. Update their titles.
Here is my Pseudocode:
public override void ItemUpdating(SPItemEventProperties properties)
{
    try
    {
        this.DisableEventFiring();
        //Need to write my logic here
        base.ItemUpdating(properties);
    }
    catch (Exception ex)
    {
    }
    finally
    {
        this.EnableEventFiring();
    }
}
Can somebody guide me on how to write the code for the above-mentioned logic? If you have any sample code with similar logic, please share it; it would be helpful for me.
Thanks in advance!
This code might help you out. You may need to adapt it for your needs, but the properties you need to access are the same.
public override void ItemUpdating(SPItemEventProperties properties)
{
    //this will get your title before updating
    var oldName = properties.ListItem["Title"].ToString();
    //and this will get the new title
    var newName = properties.AfterProperties["Title"].ToString();
    if (newName != oldName)
    {
        using (var site = new SPSite("http://yoursitename"))
        using (var web = site.OpenWeb())
        {
            var list = web.Lists["Tasks"];
            var items = list.Items.OfType<SPListItem>().Where(i => (string)i["Title"] == oldName);
            foreach (var item in items)
            {
                item["Title"] = newName;
                item.Update();
            }
        }
    }
    base.ItemUpdating(properties);
}
