Graylog: How to limit the application of extractors to a subset of sources - graylog2

Trying to use extractors within Graylog, I cannot find a way to limit the pattern matching to one source.
Basically I do a split & index search to extract a field, but I want this extractor to be used only for a subset of my sources.
The documentation seems poor on this.
Any idea?
Thanks
Loïc

It seems this cannot be done easily. There are two options: separate inputs and pipelines. Since extractors are attached to inputs, capturing different sources on different inputs avoids the problem in the first place.
The second solution is to use pipelines (available as of v2). Here's the author confirming this:
This is possible since Graylog 2.0 by using pipeline rules.
http://docs.graylog.org/en/2.1/pages/pipelines.html
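For illustration, a pipeline rule along these lines would restrict an extraction to a single source (the source value, the regex pattern and the field name are placeholders standing in for your actual split & index logic):

rule "extract field only for one source"
when
  has_field("source") AND to_string($message.source) == "www.example.com"
then
  // placeholder: grab the first whitespace-separated token of the message
  let m = regex("^(\\S+)\\s", to_string($message.message));
  set_field("first_token", m["0"]);
end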
P.S. I thought it might be possible to store the full message and extract from that, but I couldn't figure out how to cut the JSON out of the message prior to extraction. A sample message coming from a Docker container might look like this (sending logs over syslog to Graylog):
<30>1 2016-11-26T22:22:38.951321+01:00 www.example.com docker 19459 - - {"name":"my-awesome-app","hostname":"docker24.example.com","pid":1,"level":30,"msg":"happily serving customers","time":"2016-11-26T21:22:38.950Z"}
So the entire field is not valid JSON on its own, and the Graylog JSON extractor would fail on it. The source is there (www.example.com), so it's possible to configure the extractor to run only when this matches, but the question then is how to parse only the JSON section...
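One hedged, untested sketch of how a pipeline rule might cut out just the JSON part before parsing it (it assumes the JSON payload sits at the end of the stored message):

rule "parse json payload of docker syslog message"
when
  to_string($message.source) == "www.example.com"
then
  // grab everything from the first '{' to the last '}' in the stored message
  let json_part = regex("(\\{.*\\})", to_string($message.message));
  let parsed = parse_json(to_string(json_part["0"]));
  // to_map() exists in more recent Graylog releases; older versions may need select_jsonpath() instead
  set_fields(to_map(parsed));
end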

Related

How can we enrich a document on the fly before it is ingested into the OpenSearch index

We have several existing indexes where we need to add some additional attributes whose values should be derived from a lookup on another index. We need to update/enrich the document before ingestion, as updating/re-indexing afterwards would add overhead we might not be able to afford.
From the documentation available, it seems that enrich policies in Elasticsearch are meant to solve something similar, but I was wondering whether there is any workaround for OpenSearch?
Also, roughly one third of our data comes from Python scripts; to enrich that data, is there any way to send a pandas DataFrame directly to Logstash so it can be enriched before ingestion into the index?
I tried enrich policies and the lookup processor, but it turns out these are not supported by OpenSearch.

Create Data Catalog column tags by inspecting BigQuery data with Cloud Data Loss Prevention

I want to use DLP to inspect my tables in BigQuery, and then write the findings to policy tags on the columns of the table. For example, I have a (test) table that contains data including an email address and a phone number for individuals. I can use DLP to find those fields and identify them as emails and phone numbers, and I can do this in the console or via the API (I'm using NodeJS). When creating this inspection job, I know I can configure it to automatically write the findings to the Data Catalog, but this generates a tag on the table, not on the columns. I want to tag the columns with the specific type of PII that has been identified.
I found this tutorial that appears to achieve exactly that - but tutorial is a strong word; it's a script written in Java and a basic explanation of what that script does, with the only actual instructions being to clone the git repo and run a few commands. There's no information about which API calls are being made, not a lot of comments in the code, and no links to pertinent documentation. I have zero experience with Java, so I'm not able to work out the process and translate it into NodeJS for my own purposes.
I also found this similar tutorial which also utilises Dataflow, and again the instructions are simply "clone this repo, run this script". I've included the link because it features a screenshot showing what I want to achieve: tagging columns with PII data found by DLP
So, what I want to do appears to be possible, but I can't find useful documentation anywhere. I've been through the DLP and Data Catalog docs, and through the API references for NodeJS. If anyone could help me figure out how to do this, I'd be very grateful.
UPDATE: I've made some progress and changed my approach as a result.
DLP provides two methods to inspect data: dlp.inspectContent() and dlp.createDlpJob(). The latter takes a storageItem which can be a BigQuery table, but it doesn't return any information about the columns in the results, so I don't believe I can use it.
inspectContent() cannot be run on a BigQuery table directly; it can inspect structured text, which is what the Java script I linked above uses: it queries the BigQuery table, constructs a Table from the results, and passes that Table into inspectContent(), which returns findings that include field names. I want to do exactly that, but in NodeJS. I'm struggling to convert the BigQuery results into the Table format, as NodeJS doesn't appear to have a constructor for that type the way Java does.
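As a rough, untested sketch of that approach in Node.js (the project ID, query and infoTypes are placeholders, and the exact request/response shapes should be checked against the installed @google-cloud/dlp and @google-cloud/bigquery versions):

const {BigQuery} = require('@google-cloud/bigquery');
const DLP = require('@google-cloud/dlp');

const projectId = 'my-project'; // placeholder
const query = 'SELECT email, phone FROM `my-project.my_dataset.my_table` LIMIT 1000'; // placeholder

async function inspectBigQueryRows() {
  const bigquery = new BigQuery({projectId});
  const dlp = new DLP.DlpServiceClient();

  const [rows] = await bigquery.query({query});
  if (rows.length === 0) return;

  // Build the DLP ContentItem table: headers from the column names, rows as string values.
  const columns = Object.keys(rows[0]);
  const headers = columns.map(name => ({name}));
  const tableRows = rows.map(row => ({
    values: columns.map(col => ({stringValue: row[col] == null ? '' : String(row[col])})),
  }));

  const [response] = await dlp.inspectContent({
    // older client versions may expect just `projects/${projectId}` here
    parent: `projects/${projectId}/locations/global`,
    inspectConfig: {
      infoTypes: [{name: 'EMAIL_ADDRESS'}, {name: 'PHONE_NUMBER'}],
      includeQuote: false,
    },
    item: {table: {headers: headers, rows: tableRows}},
  });

  // For table items, each finding should carry the column it was found in.
  for (const finding of response.result.findings || []) {
    const field = finding.location.contentLocations[0].recordLocation.fieldId.name;
    console.log(finding.infoType.name + ' found in column ' + field);
  }
}

inspectBigQueryRows().catch(console.error);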
I was unable to find Node.js documentation for creating column-level tags.
However, you might find the official Policy Tags documentation helpful to point you in the right direction. In particular, you might be missing some of the roles needed to manage column-level tags.

How to use Solr for multiple data sources?

I am a newbie to Solr and am facing the challenges below.
I have two data sources: a portal and a CMS. I need to provide a Solr search solution for these two sources, so that when a user searches in a custom portlet (on the portal), they see results from both sources in one place; in other words, Solr should fetch results from both sources. The user should also be able to open those results by clicking on them.
What should I consider when implementing this use case? Should I use multiple Solr cores or a single core? Also, how can I achieve features like faceted search, search filters, stop words, etc.?
Regards.
It should be perfectly fine to go with a single core (and it will also be faster).
To import data from multiple data sources, check out the Solr Data Import Handler configuration:
http://wiki.apache.org/solr/DataImportHandler
and set up two entities, one for each of your data sources.
You will probably also want to set a field on each imported document that records which data source it came from, as sketched below.
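A data-config.xml for that setup might look roughly like this (the JDBC drivers, connection details, queries and field names are all placeholders):

<dataConfig>
  <dataSource name="portalDb" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/portal" user="solr" password="secret"/>
  <dataSource name="cmsDb" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/cms" user="solr" password="secret"/>
  <document>
    <!-- one entity per data source; each query tags its rows with a literal datasource value -->
    <entity name="portal" dataSource="portalDb"
            query="SELECT id, title, body, 'portal' AS datasource FROM portal_content">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="text"/>
      <field column="datasource" name="datasource"/>
    </entity>
    <entity name="cms" dataSource="cmsDb"
            query="SELECT id, title, body, 'cms' AS datasource FROM cms_content">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <field column="body" name="text"/>
      <field column="datasource" name="datasource"/>
    </entity>
  </document>
</dataConfig>

The datasource field can then be used for filtering (fq=datasource:portal) or for faceting on the source.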
Your question is a little too general to answer precisely. Go and experiment a bit with the documentation you have; it should not be very hard to get some basic search functionality working.
You can find a lot of info about configuring Solr on LucidWorks wiki:
http://docs.lucidworks.com/display/solr/Faceting
and on Solr wiki: http://wiki.apache.org/solr/
You may also try some books, e.g. http://www.packtpub.com/apache-solr-4-cookbook/book
I figured out a way to do this. We can use SolrJ (http://wiki.apache.org/solr/Solrj) as a Java client for Solr. Alfresco content can be exported to XML files, and these XMLs can be pushed into Solr using SolrJ.

Generating different datasets from live dbpedia dump

I was playing around with the different datasets provided on the dbpedia download page and found that they are kind of outdated.
Then I downloaded the latest dump from the dbpedia live site. When I extracted the June 30th file, I just got one huge 37GB .nt file.
I want to get different datasets (like the different .nt files available at the download page) from the latest dump. Is there a script or process to do it?
Solution 1:
You can use the dbpedia live extraction framework: https://github.com/dbpedia/extraction-framework.
You need to configure the proper extractors (e.g. infobox properties extractor, abstract extractor, etc.). It will download the latest Wikipedia dumps and generate the dbpedia datasets.
You may need to make some code changes to get only the required data. One of my colleagues did this for the German datasets. You will still need a lot of disk space for this.
Solution 2 (I don't know whether this is really feasible or not):
Grep for the required properties in the dump. You need to know the exact URIs of the properties you want to extract.
For example, to get all the home pages:
bzgrep 'http://xmlns.com/foaf/0.1/homepage' dbpedia_2013_03_04.nt.bz2 >homepages.nt
This will give you all the N-Triples with home pages, which you can then load into an RDF store.
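If you need several such per-property files, a small shell loop over the predicates you care about is enough (the predicate URIs, the dump file name and the output names below are just examples):

# split the extracted live dump into one .nt file per predicate of interest
for pred in \
  "http://xmlns.com/foaf/0.1/homepage" \
  "http://dbpedia.org/ontology/abstract" \
  "http://www.w3.org/2000/01/rdf-schema#label"
do
  # in N-Triples the predicate always appears as <URI>, so a fixed-string grep is enough
  grep -F "<$pred>" dbpedia_live.nt > "$(basename "$pred").nt"
done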

How should I load the contents of a .txt file to serve on a website?

I am trying to build excerpts for each document returned as a search result on my website. I am using the Sphinx search engine and the Apache web server on Linux CentOS. The function within the Sphinx API that I'd like to use is called BuildExcerpts. This function requires you to pass an array of strings where each string contains a document's contents.
I'm wondering what the best practice is for retrieving the document contents in real time as I serve the results on the web. Currently, these documents are in text files on my system, spread across multiple drives. There are roughly 100 million of them and they take up a few terabytes of space.
It's easy for me to call something like file_get_contents(), but that feels like the wrong way to do this. My databases are already gigantic (100GB+) and I don't particularly want to throw the document contents in there along with the document attributes that already exist. Perhaps that is the best way to do it, however.
Suggestions?
Well, the source text needs to be fetched from somewhere. If you don't want to duplicate it in your database, then you will need to fetch it from the filesystem (using file_get_contents() or similar).
Although the BuildExcerpts function does give you one extra option, "load_files": pass file names instead of document bodies, and Sphinx will read the data from the files for you.
What problem are you experiencing with reading the files directly? Is it too slow? If so, maybe put some caching in front, using memcached perhaps.
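For reference, a rough sketch of the load_files variant with the Sphinx PHP API might look like this (host, port, index name, query and file paths are placeholders):

<?php
require_once 'sphinxapi.php';

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

// file paths, not file contents - load_files tells searchd to read them itself
$docs = array('/data/docs/0001.txt', '/data/docs/0002.txt');
$opts = array(
    'load_files'   => true,
    'before_match' => '<b>',
    'after_match'  => '</b>',
);

$excerpts = $cl->BuildExcerpts($docs, 'documents_index', 'search terms', $opts);
if ($excerpts === false) {
    echo 'BuildExcerpts failed: ' . $cl->GetLastError();
} else {
    foreach ($excerpts as $snippet) {
        echo $snippet, "\n";
    }
}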
