HBase Intra-row scanning - node.js

I am trying to find some information on intra-row scanning through the HBase REST API. However, I am having a hard time seeing how it could be done from the documentation, and I am wondering if anyone has used this feature from the REST interface?
References
http://hadoop-hbase.blogspot.jp/2012/01/hbase-intra-row-scanning.html
http://wiki.apache.org/hadoop/Hbase/Stargate
More specifically, I am looking into how I would do this from a Node.js application.
Edit:
I found this, but I still have a hard time seeing how intra-row scanning would be possible...
HBase REST Filter ( SingleColumnValueFilter )

I still have to test this further, but I have written a document which gives a few examples of stringified filters, as per the HBase REST Filter question.
Basically, from what I can see, it is possible to use a FilterList and stack multiple filters in a JSON array.
More details: https://gist.github.com/3979381
Still have to try it, but it seems like this should work. Will update once I have tried it through the REST API.
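For reference, here is a rough sketch of what the request could look like from Node.js, assuming the standard Stargate scanner endpoint and the JSON filter syntax described in the gist above; the host, table and column names are placeholders, and the exact filter field names and base64 encoding should be double-checked against your HBase version.

```javascript
// Rough sketch only: create a Stargate scanner whose filter is a stringified
// FilterList, then read one batch of cells back as JSON. The host, table and
// column names are placeholders, and the filter field names / base64 encoding
// should be verified against your HBase version and the gist above.
var http = require('http');
var url = require('url');

var filter = JSON.stringify({
  type: 'FilterList',
  op: 'MUST_PASS_ALL',
  filters: [
    {
      // ColumnRangeFilter limits which columns of a wide row are returned,
      // which is the "intra-row scanning" trick from the blog post above.
      type: 'ColumnRangeFilter',
      minColumn: Buffer.from('c_0100').toString('base64'),
      minColumnInclusive: true,
      maxColumn: Buffer.from('c_0200').toString('base64'),
      maxColumnInclusive: false
    }
  ]
});

// batch="100" means at most 100 cells per scanner fetch, so a very wide row
// is streamed in chunks instead of being materialized all at once.
var scannerXml = '<Scanner batch="100"><filter>' + filter + '</filter></Scanner>';

var req = http.request({
  host: 'localhost',
  port: 8080,
  path: '/mytable/scanner',
  method: 'PUT',
  headers: {
    'Content-Type': 'text/xml',
    'Content-Length': Buffer.byteLength(scannerXml)
  }
}, function (res) {
  res.resume();
  // The Location header points at the scanner instance we just created.
  var scannerPath = url.parse(res.headers.location).path;
  http.get({
    host: 'localhost',
    port: 8080,
    path: scannerPath,
    headers: { Accept: 'application/json' }
  }, function (cellsRes) {
    var body = '';
    cellsRes.on('data', function (chunk) { body += chunk; });
    cellsRes.on('end', function () {
      // Row keys, qualifiers and values come back base64-encoded;
      // an empty body (HTTP 204) means the scanner is exhausted.
      console.log(body ? JSON.parse(body) : 'scanner exhausted');
    });
  });
});
req.end(scannerXml);
```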

Related

Create Data Catalog column tags by inspecting BigQuery data with Cloud Data Loss Prevention

I want to use DLP to inspect my tables in BigQuery, and then write the findings to policy tags on the columns of the table. For example, I have a (test) table that contains data including an email address and a phone number for individuals. I can use DLP to find those fields and identify them as emails and phone numbers, and I can do this in the console or via the API (I'm using NodeJS). When creating this inspection job, I know I can configure it to automatically write the findings to the Data Catalog, but this generates a tag on the table, not on the columns. I want to tag the columns with the specific type of PII that has been identified.
I found this tutorial that appears to achieve exactly that - but tutorial is a strong word; it's a script written in Java and a basic explanation of what that script does, with the only actual instructions being to clone the git repo and run a few commands. There's no information about which API calls are being made, not a lot of comments in the code, and no links to pertinent documentation. I have zero experience with Java, so I'm not able to work out the process and translate it into NodeJS for my own purposes.
I also found this similar tutorial which also utilises Dataflow, and again the instructions are simply "clone this repo, run this script". I've included the link because it features a screenshot showing what I want to achieve: tagging columns with PII data found by DLP.
So, what I want to do appears to be possible, but I can't find useful documentation anywhere. I've been through the DLP and Data Catalog docs, and through the API references for NodeJS. If anyone could help me figure out how to do this, I'd be very grateful.
UPDATE: I've made some progress and changed my approach as a result.
DLP provides two methods to inspect data: dlp.inspectContent() and dlp.createDlpJob(). The latter takes a storageItem which can be a BigQuery table, but it doesn't return any information about the columns in the results, so I don't believe I can use it.
inspectContent() cannot be run on a BigQuery table directly; it can inspect structured text, which is what the Java script I linked above is utilising: that script queries the BigQuery table, constructs a Table from the results, and passes that Table into inspectContent(), which returns a Findings object containing field names. I want to do exactly that, but in NodeJS. I'm struggling to convert the BigQuery results into the Table format, as NodeJS doesn't appear to have a constructor for that type like Java does.
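For illustration, here is a minimal sketch of that Table-construction step in Node.js, assuming the @google-cloud/bigquery and @google-cloud/dlp clients; the project, dataset, table and infoType choices are placeholders.

```javascript
// Rough sketch: query BigQuery, rebuild the rows as a DLP Table object, and
// inspect it so the findings carry field names. Project, dataset, table and
// infoTypes are placeholders.
const { BigQuery } = require('@google-cloud/bigquery');
const DLP = require('@google-cloud/dlp');

const bigquery = new BigQuery();
const dlp = new DLP.DlpServiceClient();

async function inspectBigQueryTable() {
  const [rows] = await bigquery.query(
    'SELECT email, phone FROM `my-project.my_dataset.my_table` LIMIT 1000'
  );

  // A DLP Table is plain JSON here: headers are FieldIds, rows are lists of Values.
  const headers = Object.keys(rows[0]).map(name => ({ name }));
  const tableRows = rows.map(row => ({
    values: Object.values(row).map(v => ({ stringValue: String(v) })),
  }));

  const [response] = await dlp.inspectContent({
    parent: 'projects/my-project/locations/global',
    inspectConfig: {
      infoTypes: [{ name: 'EMAIL_ADDRESS' }, { name: 'PHONE_NUMBER' }],
      includeQuote: true,
    },
    item: { table: { headers, rows: tableRows } },
  });

  // For table items, each finding's location names the field it was found in,
  // which is what is needed to tag the corresponding column.
  for (const finding of response.result.findings) {
    console.log(
      finding.infoType.name,
      finding.location.contentLocations[0].recordLocation.fieldId.name
    );
  }
}

inspectBigQueryTable().catch(console.error);
```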
I was unable to find Node.js documentation for implementing column-level tags.
However, you might find the official Policy Tags documentation helpful to point you in the right direction. In particular, you might be lacking some of the roles needed to manage column-level tags.

Efficient pagination in Cosmos DB

I need to implement efficient pagination for Cosmos DB with the Node.js API. There are many examples of the implementation with .NET and LINQ, but I could not find anything good for Node.js. The idea is to send the pageSize and pageIndex and get the relevant result.
I already know we can use dbClient.queryDocuments, get the queryIterator, and perform the pagination, but this always requires iterating from the first document in the DB. An example can be found here.
Any idea how to do it in an efficient way?
Unfortunately, Cosmos DB as an engine doesn't have skip/take pagination support yet.
It is, however, a planned feature.
The blogs you've read describe one of the few viable workarounds for now, which of course comes with a cost.
You could write something smarter: instead of iterating through every document from the beginning, keep the request's continuation token and use it with your next request. That way you can have previous and next button logic.
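As a rough sketch of that continuation-token approach, using the documentdb client the question mentions (endpoint, key, collection link and query are placeholders; the newer @azure/cosmos SDK exposes the same idea through fetchNext() and continuationToken):

```javascript
// Rough sketch: fetch one page at a time with maxItemCount and hand the
// x-ms-continuation token back to the caller for the next request.
// Endpoint, key, collection link and query are placeholders.
var DocumentClient = require('documentdb').DocumentClient;

var client = new DocumentClient('https://myaccount.documents.azure.com:443/', {
  masterKey: process.env.COSMOS_KEY
});

var collLink = 'dbs/mydb/colls/mycoll';

function getPage(pageSize, continuationToken, callback) {
  var options = { maxItemCount: pageSize };
  if (continuationToken) {
    options.continuation = continuationToken;
  }

  var iterator = client.queryDocuments(collLink, 'SELECT * FROM c', options);
  iterator.executeNext(function (err, results, headers) {
    if (err) return callback(err);
    // Return the token (e.g. in your API response) so the next request can
    // resume exactly where this page ended instead of starting over.
    callback(null, results, headers['x-ms-continuation']);
  });
}

// First page, then the page after it:
getPage(20, null, function (err, docs, nextToken) {
  if (err) throw err;
  console.log(docs.length, 'documents; continuation token:', nextToken);
  getPage(20, nextToken, function (err, moreDocs) {
    if (err) throw err;
    console.log(moreDocs.length, 'more documents');
  });
});
```

Note that a continuation token only moves forward; jumping straight to an arbitrary pageIndex still means walking the pages before it, so this fits previous/next style paging best.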

Is it possible to get all errors that include a Message substring via the TrackJS API?

We use TrackJS to log JavaScript errors on Stack Overflow Talent. I want to export a CSV of all errors that include the substring "%couldn't load id%" within the Message field.
The API documentation doesn't make it clear whether this is possible. Is it?
Unfortunately we do not offer substring querying capability at this time :( The main use case for the API is bulk export to store your JS error data in other third-party systems (though we are certainly amenable to supporting other use cases ;)).
To that end, though, the API is meant to be quick and able to return lots of results per query. It's no problem to retrieve up to 1000 records per page (using the size query parameter) and filter after the fact.
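As a rough sketch of that export-and-filter approach from Node.js: the size parameter comes from the answer above, but the endpoint path, auth header and response shape are assumptions, so check the TrackJS API docs for the exact values.

```javascript
// Rough sketch: pull a large page of errors and filter on the message text
// client-side, then write the matches to a CSV file. The endpoint path, auth
// header and response shape below are assumptions (check the TrackJS API docs);
// only the `size` query parameter comes from the answer above.
const https = require('https');
const fs = require('fs');

const url = 'https://api.trackjs.com/REPLACE_CUSTOMER_ID/v1/errors?size=1000'; // hypothetical path

https.get(url, { headers: { Authorization: 'REPLACE_API_KEY' } }, res => {
  let body = '';
  res.on('data', chunk => (body += chunk));
  res.on('end', () => {
    const errors = JSON.parse(body).data || []; // response envelope is an assumption
    const matches = errors.filter(
      e => e.message && e.message.includes("couldn't load id")
    );

    const csv = ['timestamp,message']
      .concat(
        matches.map(
          e => `${e.timestamp},"${String(e.message).replace(/"/g, '""')}"`
        )
      )
      .join('\n');

    fs.writeFileSync('errors.csv', csv);
    console.log(`Wrote ${matches.length} matching errors to errors.csv`);
  });
});
```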

Way to dump the relations from Freebase?

I have run through the Google API for Freebase, but it is still confusing.
Is there simple way to dump the relations from Freebase?
I want to dump all entity-name pairs with a specific relation (e.g. marry_with, ...), and I also want the Chinese entity names.
Should I
write MQL to query all entities satisfying the condition? (but the MQL service is going to be retired soon)
or dump all of Freebase and parse it?
or is there another API capable of doing this?
or is another KB (YAGO, DBpedia, Wikidata) easier for doing this?
Which way is easier to work with?
Please point me in some direction. Thanks.
Freebase was retired and Wikidata is the recommended alternative.
You can use the Wikidata Query API to get entities with a specific property.
For instance, the query http://wdq.wmflabs.org/api?q=CLAIM[26] retrieves the IDs of all items having the property spouse (P26).
You can combine this with the Wikidata API, for instance to get labels and aliases in English for the first three items returned by the previous query:
http://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q23|Q24|Q42&languages=en&props=labels|aliases
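As a small sketch, the two calls can be chained from Node.js; the shape of the WDQ response (an items array of bare numeric IDs) is an assumption, and zh is included in the languages since the question asks for Chinese names.

```javascript
// Rough sketch: get the IDs of all items with a spouse (P26) from the WDQ API,
// then fetch Chinese and English labels/aliases for the first three of them
// from the Wikidata API. The WDQ response shape (an `items` array of bare
// numeric IDs) is an assumption.
var http = require('http');
var https = require('https');

function getJson(apiUrl, callback) {
  var lib = apiUrl.indexOf('https:') === 0 ? https : http;
  lib.get(apiUrl, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () { callback(JSON.parse(body)); });
  });
}

getJson('http://wdq.wmflabs.org/api?q=CLAIM[26]', function (wdq) {
  // Turn the numeric IDs into Q-ids, e.g. 23 -> Q23.
  var ids = wdq.items.slice(0, 3).map(function (n) { return 'Q' + n; }).join('|');

  getJson(
    'https://www.wikidata.org/w/api.php?action=wbgetentities&format=json' +
      '&ids=' + ids + '&languages=zh|en&props=labels|aliases',
    function (data) {
      Object.keys(data.entities).forEach(function (qid) {
        var labels = data.entities[qid].labels || {};
        console.log(qid,
          labels.zh && labels.zh.value,
          labels.en && labels.en.value);
      });
    }
  );
});
```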

How to use Solr for multiple data sources?

I am a newbie to Solr and am facing the challenges below.
I have two data sources: a portal and a CMS. I need to provide a Solr search solution for these two sources so that when a user searches from a custom portlet (on the portal), they see results from both sources in the same place, i.e. Solr should fetch results from both sources. The user should also be able to reach those results by clicking on them.
What should I consider for implementing this use case? Should I use multiple Solr cores or a single core? Also, how can I achieve features like faceted search, search filters, stop words, etc.?
Regards.
It should be perfectly fine to go with a single core (and it will also be faster).
To import data from multiple data sources, check out the Solr Data Import Handler configuration:
http://wiki.apache.org/solr/DataImportHandler
and set up two entities, one for each of your data sources.
You will probably also need to set some field to record which data source each imported document came from.
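As an illustrative sketch (the Data Import Handler is configured in XML rather than code), a data-config.xml for two JDBC sources might look roughly like this; the JDBC URLs, table and column names are placeholders, and a literal source column is selected so each document records where it came from.

```xml
<!-- Illustrative sketch of a data-config.xml with one entity per data source.
     The JDBC URLs, table and column names are placeholders; a literal "source"
     column is selected so each document records where it came from. -->
<dataConfig>
  <dataSource name="portalDb" type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/portal" user="solr" password="..."/>
  <dataSource name="cmsDb" type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/cms" user="solr" password="..."/>
  <document>
    <entity name="portalDocs" dataSource="portalDb"
            query="SELECT id, title, body, 'portal' AS source FROM portal_pages">
      <field column="id"     name="id"/>
      <field column="title"  name="title"/>
      <field column="body"   name="text"/>
      <field column="source" name="source"/>
    </entity>
    <entity name="cmsDocs" dataSource="cmsDb"
            query="SELECT id, title, body, 'cms' AS source FROM cms_articles">
      <field column="id"     name="id"/>
      <field column="title"  name="title"/>
      <field column="body"   name="text"/>
      <field column="source" name="source"/>
    </entity>
  </document>
</dataConfig>
```

If the two sources can produce the same primary keys, you will also want to make the Solr id unique (for example by prefixing it with the source name) so documents from one source do not overwrite documents from the other.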
Your question is a little too general to really answer. Go and experiment a bit with the documentation you have; it should not be very hard to get some basic search functionality working.
You can find a lot of info about configuring Solr on LucidWorks wiki:
http://docs.lucidworks.com/display/solr/Faceting
and on Solr wiki: http://wiki.apache.org/solr/
You may also try some books, e.g. http://www.packtpub.com/apache-solr-4-cookbook/book
I figured out a way to do this. We can use Solrj (http://wiki.apache.org/solr/Solrj) as a Java client for Solr. Alfresco content can be exported as XML, and those XMLs can be pushed into Solr using Solrj.