At different random times throughout the day I am going to do a "crawl" of data which I am going to feed into Elasticsearch. This bit is working just fine.
However, the index should reflect only what was found in my most recent crawl, and I currently have no way to remove content left over in the Elasticsearch index from the previous crawl that wasn't found in the new crawl.
From what I can see I have a few options:
A) Delete items based on how old they are. Won't work because index times are random.
B) Delete the entire index and feed it with fresh data. Doesn't seem very efficient and will leave me, for a time, with an empty or partial index.
C) Do an upsert for each item: insert it if it isn't in the index, update its timestamp if it is, then do a second pass to delete any items with an older timestamp.
D) Something better.
What is a logical and efficient way to remove old content in a situation like this?
If I understand what you want to do, and you are sure that each crawl contains the complete data set, I would do this:
Crawl into a time-based index, e.g. index-201504051656
In one go:
Create an alias to the newly created index
Remove the alias from the previous index
Close or delete the old index
That way your application can always talk to the alias, and you are sure that you will always have an index to talk to. Removing a lot of records from an index is relatively heavy; closing or removing an index is relatively cheap.
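As a rough sketch of the swap with the elasticsearch-py client (the index and alias names are made up, and the 7.x-style body= calls are an assumption about your client version):

from datetime import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch()
ALIAS = "crawl"  # the name the application always queries

def publish_crawl(docs):
    # 1. Index the new crawl into a fresh time-based index.
    new_index = "crawl-" + datetime.utcnow().strftime("%Y%m%d%H%M")
    for doc in docs:
        es.index(index=new_index, body=doc)
    es.indices.refresh(index=new_index)

    # 2. Look up whatever index the alias currently points at (if any).
    old_indices = []
    if es.indices.exists_alias(name=ALIAS):
        old_indices = list(es.indices.get_alias(name=ALIAS).keys())

    # 3. Move the alias in one atomic call, then drop the old index.
    actions = [{"add": {"index": new_index, "alias": ALIAS}}]
    actions += [{"remove": {"index": i, "alias": ALIAS}} for i in old_indices]
    es.indices.update_aliases(body={"actions": actions})
    for i in old_indices:
        es.indices.delete(index=i)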
If I make any configuration changes to Logstash, will I see the changes applied to Elasticsearch?
For example, if I change the grok pattern and add new fields, will I be able to see the changes take effect on logs that are already indexed in Elasticsearch?
If not, what should I do? Should I re-index all the old logs that are already indexed in order to see the new fields?
If you add a new field it will be reflected in the mapping type, and the new field will be stored in that index. Every time a document contains new fields, those end up in the index's mappings. This isn't a concern for a small amount of data, but it can become a problem as the mapping grows: having too many fields in an index can lead to a mapping explosion, which can cause out-of-memory errors.
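For reference, Elasticsearch enforces this with the index.mapping.total_fields.limit setting (1000 by default). A hedged sketch of inspecting and raising it with the elasticsearch-py client, using an invented index name:

from elasticsearch import Elasticsearch

es = Elasticsearch()
index = "logstash-2015.04.05"  # illustrative index name

# See how many top-level fields the mapping has accumulated (7.x-style typeless mapping).
mapping = es.indices.get_mapping(index=index)
print("top-level fields:", len(mapping[index]["mappings"].get("properties", {})))

# Raise the field limit only as a deliberate choice; a steadily growing mapping usually
# means the pipeline is emitting too many dynamic fields.
es.indices.put_settings(index=index, body={"index.mapping.total_fields.limit": 2000})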
Any change you make to your Logstash pipeline will only apply to logs ingested after the change; logs already in Elasticsearch are not changed.
If you want to add new fields to documents already in Elasticsearch, you will need to reindex them through Logstash.
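One possible shape for such a reindexing run is a one-off Logstash pipeline like the sketch below; the hosts, index names, and grok pattern are all placeholders, and it assumes the elasticsearch input plugin is installed.

input {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-old"                # the logs that were indexed before the change
    query => '{ "query": { "match_all": {} } }'
  }
}
filter {
  grok {
    # the updated pattern that introduces the new fields
    match => { "message" => "%{IPORHOST:clientip} %{WORD:verb} %{URIPATHPARAM:request}" }
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "logs-reindexed"          # write to a new index instead of overwriting in place
  }
}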
I have nearly 200,000 rows of tuples in my Pandas DataFrame. I injected that data into Elasticsearch. Now, when I run the program, it should check whether the data is already present in Elasticsearch and insert it only if it is not.
I'd recommend not worrying about it and just loading everything into Elasticsearch. As long as your _ids are consistent, the existing documents will be overwritten instead of duplicated. So just be sure to specify an _id for each document and you are fine; the bulk helpers in the elasticsearch-py client all support setting an _id value for each document already.
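A rough sketch of that with the elasticsearch-py bulk helper; the index name and the column used as _id are assumptions about your data:

import pandas as pd
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

# Stand-in for the ~200,000-row DataFrame.
df = pd.DataFrame([
    {"unique_key": "a1", "value": 10},
    {"unique_key": "a2", "value": 20},
])

def actions(frame):
    for row in frame.itertuples(index=False):
        doc = row._asdict()
        yield {
            "_index": "my-data",        # illustrative index name
            "_id": doc["unique_key"],   # any value that is stable across runs
            "_source": doc,
        }

# Re-running this overwrites documents whose _id already exists instead of duplicating them.
helpers.bulk(es, actions(df))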
How can I get the last created document in CouchDB? Maybe I can somehow use the _changes feature of CouchDB? But the documentation says that I can only get a list of documents ordered starting from the first created document, and there is no way to change the order.
So how can I get the last created document?
You can get the changes feed in descending order as it's also a view.
GET /dbname/_changes?descending=true
You can use limit= as well, so:
GET /dbname/_changes?descending=true&limit=1
will give the latest update.
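For instance, a small Python requests sketch against a hypothetical local database; adding include_docs=true also returns the document body:

import requests

resp = requests.get(
    "http://localhost:5984/dbname/_changes",
    params={"descending": "true", "limit": 1, "include_docs": "true"},
)
last_change = resp.json()["results"][0]
print(last_change["id"], last_change["doc"])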
Your only surefire way to get the last created document is to include a timestamp (created_at or something) with your document. From there, you just need a simple view to output all the docs by their creation date.
I was going to suggest using the last_seq information from the database, but the sequence number changes with every single write, and replication also complicates the matter further.
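Here is a minimal sketch of the timestamp-plus-view approach; the database, design document, and field names are all invented, and the view's map function is ordinary CouchDB JavaScript embedded as a string:

import requests

db = "http://localhost:5984/dbname"

# Design document with a view keyed on the created_at timestamp.
design = {
    "views": {
        "by_created_at": {
            "map": "function(doc) { if (doc.created_at) { emit(doc.created_at, null); } }"
        }
    }
}
requests.put(db + "/_design/docs", json=design)

# Latest created document: read the view in descending key order and take one row.
resp = requests.get(
    db + "/_design/docs/_view/by_created_at",
    params={"descending": "true", "limit": 1, "include_docs": "true"},
)
print(resp.json()["rows"][0]["doc"])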
I am using the AdvancedDatabaseCrawler as a base for my search page. I have configured it so that I can search for what I want and it is very fast. The problem is that as soon as you want to do anything with the search results that requires accessing field values, performance takes a serious hit.
The main search results part is fine: even if there are 1000 results returned from the search, I am only showing 10 or 20 results per page, which means I only have to retrieve 10 or 20 items. However, in the sidebar I am listing out various filtering options with the number of results associated with each filtering option (eBay style). In order to retrieve these filter options I perform a relationship search based on the search results. Since the search results only contain SkinnyItems, it has to call GetItem() on every single result to get the actual item in order to get the value that I'm filtering by. In other words it will call Database.GetItem(id) 1000 times! Obviously that is not terribly efficient.
Am I missing something here? Is there any way to configure Sitecore search to retrieve custom values from the search index? If I can search for the values in the index why can't I also retrieve them? If I can't, how else can I process the results without getting each individual item from the database?
Here is an idea of the functionality that I’m after: http://cameras.shop.ebay.com.au/Digital-Cameras-/31388/i.html
Klaus answered on SDN: use faceting with Apache Solr or similar.
http://sdn.sitecore.net/SDN5/Forum/ShowPost.aspx?PostID=35618
I've currently resolved this by defining dynamic fields for every field that I will need to filter by or return in the search result collection. That way I can achieve the faceted searching that is required without needing to grab field values from the database. I'm assuming that by adding the dynamic fields we are taking a performance hit when rebuilding the index, but I can live with that.
In the future we'll probably look at utilizing a product like Apache Solr.
First of all, I'm totally new to FAST, but I already have a couple of issues I need to solve, so I'm sorry if my questions are very basic =)
Well, the problem is that I have a field in the FAST index which in the source document is something like "ABC   12345" (please note the intentional whitespaces) but when stored in the index is in the form "ABC 12345" (please note that now there is a single space).
If I retrieve all the document values then this specific value is OK (with all the whitespaces); my only problem is with the way the value is stored in the index, since I need to retrieve and display it to my user just as it appears in the original document. I don't want to go to the full document just for this value; I want the value that I already have in the index. I think I need to update one of the FAST XML configuration files, but I don't have enough documentation at hand to decide where to perform the change: index_profile.xml? The XMLMapper file?
I've found the answer by myself. I'm using an XMLMapper for my collection; all I had to do was add the ignore-whitespace attribute to the Mapping element and set its value to "false". This solved the problem, and the raw data, when retrieved from the index, now contains the expected inner whitespaces.
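Roughly, the relevant piece of the XMLMapper configuration ended up looking like this; only the ignore-whitespace attribute is the actual fix, and the "..." stands for the existing mapping details that I have left out:

<Mapping ... ignore-whitespace="false">
    ...
</Mapping>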
Thanks.