How can i manually reindex solr using sunspot? - couchdb

I have couchdb. Sunspot was correctly indexing everything. But the Solr server crashed. I need to reindex the whole thing. rake sunspot:reindex wont work as it is tigthly coupled with active record. sunspot.index(model.all) didnt work. the solr core says 0 indexed docs even after doing that. is there a way out?

Post.solr_reindex
There are a number of options that can be passed to solr_reindex. The same options as to index; from the documentation
index in batches of 50, commit after each
Post.index
index all rows at once, then commit
Post.index(:batch_size => nil)
index in batches of 50, commit when all batches complete
Post.index(:batch_commit => false)
include the associated +author+ object when loading to index
Post.index(:include => :author)

What I was looking for was this:
Post.index!(Model.all)
There was something bad happening when I tried to index assuming that batch commits would happen automatically. Any way this worked totally fine for me.

I usually write below command to index models. It works perfectly every time.
For a Model i.e. (Post)
Sunspot.index Post.all
For a Model row i.e. (Post.where(id: 5))
Sunspot.index Post.where(id: 5)
It will work.
Cheers!

Related

Jooq database/schema name mapping

I use jooq to generate objects against a local database, but when running "for real" later in production the actual databases will have different names. To remedy this I use the <outputSchemaToDefault>true</outputSchemaToDefault> config option (maven).
At the same time, we have multiple databases (schemas), and are using a connection pool to the server like "jdbc:mysql://localhost:3306/" (without specifying a database here).
How do I tell jooq which database to use when running queries?
I have tried all config I can think of:
new Settings()
.withRenderSchema(true) // true/false seems to make no difference.
.withRenderCatalog(true) // true/false seems to make no difference.
.withRenderMapping(new RenderMapping()
.withDefaultSchema("my_database") // Seems to have no effect.
// The above 3 configs always give me an error saying "no database selected".
// Adding this gives me 'my_database.my_table' does not exist - while it actually does.
.withSchemata(new MappedSchema()
.withInputExpression(Pattern.compile(".*"))
.withOutput("my_database")
));
I have also tried using a database/schema name, as in not configuring outputSchemaToDefault. But then, adding the MappedSchema code above, but that gives me errors with "'my_databasemy_database.my_table' does not exist", which is correct. I have no clue why that code gives me the database/schema name twice?
Edit:
When jooq tells me that the db.table does not exist, if I put a break point in a good place and get the sql from jooq and run exactly that against my database it does work. But jooq fails to run it.
Also, I'm using version 3.15.3 of jooq.
I solved it. Instead of using .withInputExpression(Pattern.compile(".*")), it seems to work with .withInput("").
I'm still not sure why it works, or if this is the "correct" way of solving it. But at least it is a way forward.
No clue why using the pattern, I got the name twice though. But that one I'll leave alone.

Using Logstash Aggregate Filter plugin to process data which may or may not be sequenced

Hello all!
I am trying to use the Aggregate filter plugin of Logstash v7.7 to correlate and combine data from two different CSV file inputs which represent API data calls. The idea is to produce a record showing a combined picture. As you can expect the data may or may not arrive in the right sequence.
Here is as an example:
/data/incoming/source_1/*.csv
StartTime, AckTime, Operation, RefData1, RefData2, OpSpecificData1
231313232,44343545,Register,ref-data-1a,ref-data-2a,op-specific-data-1
979898999,75758383,Register,ref-data-1b,ref-data-2b,op-specific-data-2
354656466,98554321,Cancel,ref-data-1c,ref-data-2c,op-specific-data-2
/data/incoming/source_1/*.csv
FinishTime,Operation,RefData1, RefData2, FinishSpecificData
67657657575,Cancel,ref-data-1c,ref-data-2c,FinishSpecific-Data-1
68445590877,Register,ref-data-1a,ref-data-2a,FinishSpecific-Data-2
55443444313,Register,ref-data-1a,ref-data-2a,FinishSpecific-Data-2
I have a single pipeline that is receiving both these CSVs and I am able to process and write them as individual records to a single Index. However, the idea is to combine records from the two sources into one record each representing a superset. of Operation related information
Unfortunately, despite several attempts I have been unable to figure out how to achieve this via Aggregate filter plugin. My primary question is whether this is a suitable use of the specific plugin? And if so, any suggestions would be welcome!
At the moment, I have this
input {
file {
path => ['/data/incoming/source_1/*.csv']
tags => ["source1"]
}
file {
path => ['/data/incoming/source_2/*.csv']
tags => ["source2"]
}
# use the tags to do some source 1 and 2 related massaging, calculations, etc
aggregate {
task_id = "%{Operation}_%{RefData1}_%{RefData1}"
code => "
map['source_files'] ||= []
map['source_files'] << {'source_file', event.get('path') }
"
push_map_as_event_on_timeout => true
timeout => 600 #assuming this is the most far apart they will arrive
}
...
}
output {
elastic { ...}
}
And other such variations. However, I keep getting individual records being written to the Index and am unable to get one combined. Yet again, as you can see from the data set there's no guarantee of the sequencing of records - so I am wondering if the filter is the right tool for the job, to begin with? :-\
Or is it just me not being able to use it right! ;-)
In either case, any inputs/ comments/ suggestions welcome. Thanks!
PS: This message is being cross-posted over from Elastic forums. I am providing a link there just in case some answers pop up there too.
The answer is to use Elastic search in upsert mode. Please see the specifics here..
I recommend first that the information reaches you in order so that the filter can take it better, secondly, you could set the options in your pipeline.yml: pipeline.workers: 1 and pipeline.ordered: true, thus guaranteeing the order of processing.

Pymongo - Process only newly updated documents?

So I'm new to Pymongo & MongoDB, and I'm just confused as to how best to go about this problem. I have two collections:
Raw_collection
Processed_collection
Basically, I have raw documents that go into the Raw_collection, after which I process them by dropping some documents based on filters etc, and store the remaining documents into Processed_collection. Specifically, I plan to periodically update the records in Raw_collection as well.
As such, what would be the best way to process only the newly inserted documents to Raw_collection on a successive update? I looked into bulk methods but I'm not sure if that's what I want... this seems like a simple-ish problem to solve, but because of my inexperience I'm not sure what the solution would be. Any help is greatly appreciated, thanks!
So I ended up doing this through pymongo's insert_many method has:
import pandas
import pymongo
insert_raw_collection(): #call First
result = db[collection].insert_many(documents)
obj_id_list = result.inserted_ids
#[ObjectId('54f113fffba522406c9cc20e'), ObjectId('54f113fffba522406c9cc20f')]
return obj_id_list
insert_processed_collection(obj_id_list): # call Second
cursor = raw_collection_pandas_data_frame.find({"_id": {"$in": obid_list}})
for doc in cursor:
if filter(doc) == True
# do something
Basically, I return the list of inserted ObjectId's from the previous insert step and perform a filtering operation so I know which I want to keep.

Automating/Tracking Knex Migrations and Lucid Models

The Situation
I recently started working on a new project using nodejs. I have a background of using Python/Django and C#/.NET (not a huge fan of the latter). Node is awesome, but I must say I miss the ease of building models and automating migrations in Django. I am currently using the AdonisJS framework which leverages Knex. Knex is a powerful library, but the migrations all need to be manually built. Additionally, the AdonisJS ORM that manages the Models is independent of Knex (migration manager). You also do not define field attributes on the Models, which can have benifits for dynamically doing things in the front and back end. All things considered, there is a lot of room for human error, miscommunication and a boat load more typing required. I know the the hot thing these days is to keep it loose and fast, but for this specific project, I am looking for a bit more structure than loosely defined models.
Current State
What I have landed on is building a new Class called tableModel and a field class to define the fields within table model. I have already completed this and I am successfully writing the migration files leveraging mustache. I plan on also automatically writing the Models which I shouldn't have a problem with (fingers crossed).
The Problem
Here is where it gets a little tough and where I need help...I need to track what has been added or removed via migration so I can effectively write ups and downs as the tableModels change over time.
So let's say I add a "tableModel" which creates a migration to create table Foo with fields {id (bigint), user_id(int), name(string255)}
Later I want to add a field called description so I would simply add it to my "tableModel" and then run a build command which would build out the migration.
How do I check what has already been created though so I only do an up() for description?
Then I want to remove the name field so I mark it out in my "tableModel" and run a build migration command. How do I check what has been migrated that now needs to be added in to the down().
Edit: I would add a remove field to the up and the corresponding roll back to the down.
Bonus Round
Let's say I want to change user_id from an int to a bigint, because who makes a foreign key just an int? How do I check not just what needs to be added to the up and down, but also checks if I need to change a property on a field.
Edit: would just write the up. and a corresponding roll back to the down
The Big Question
Basically, how do I define dirty "tableModels" classes
Possible Solution?
I am thinking that maybe I should capture some type of registry or snapshot and then run the comparison when building the migrations and or models, then recapture/snapshot. If this is the route, should I store in a json file, write this to the DB itself, or is there another/better option.
If I create the tableModel instances as constants, could I actually write back to the JS file and capture the snapshot as an attribute? IF this is an option, is Node's file system the way to go and what's the best way to do this? Node keep suprising me so I wouldn't be baffled if any of these are an option.
Help!
If anyone has gone down this path before or knows of any tools I could leverage, I would greatly appreciate it and thank you in advance. Also, if I am headed in a completely wrong direction, then please let me know, I both handle and appreciate all types of feedback.
Example
Something to note, when I define the "tableModel" for a given migration or model, it is an instance of the class, I am not creating an extended class since this is not my orm.
class tableModel {
constructor(tableName, modelName = tableName, fields = []) {
this.tableName = tableName
this.modelName = modelName
this.fields = fields
}
// Bunch of other stuff
}
fooTableModel = new tableModel('fooTable', 'fooModel', fields = [
new tableField.stringField('title'),
new tableField.bigIntField('related_user_id'),
new tableField.textField('description','Testing Default',false,true)
]
)
which equates to:
tableModel {
tableName: 'fooTable',
modelName: 'fooModel',
fields:
[ stringField {
name: 'title',
type: 'string',
_unique: false,
allow_null: null,
fieldAttributes: {},
default_value: null },
bigIntField {
name: 'related_user_id',
type: 'bigInteger',
_unique: false,
allow_null: null,
fieldAttributes: {},
default_value: 0 },
textField {
name: 'description',
type: 'text',
_unique: false,
allow_null: true,
fieldAttributes: {},
default_value: 'Testing Default' } ]
You have the up and down notation mixed up. Those are for migrating the "latest" (runs the up function) and doing rollbacks (runs the down function). Up and down to not relate to dropping or adding table columns.
The migrations up is for any change, and the down is to reverse those changes. So if you wanted to drop a column from some table, you write the command in the up, then write the opposite in the down (you'd add it back in...), such that you can "rollback" and the change is effectively reversed. You have to be careful with such things though, as you can put yourself in a situation where you actually lose data.
Want to add a column? Write it in the up, and drop the column in the down.
One of the major points behind the migrations mechanism is to track the state of changes of your database, as time goes forward. So generally, if you created a table in some migration, then a day or so later you realize you need to drop/add columns, you normally don't go back and edit the existing migration, especially if the migration has already been run. You'd just write a new migration to drop/add your column.
Since you're using knex, there are a couple "knex" tables that get created. By default the one you're looking for is knex_migrations, unless someone specifically modified the settings to change the name of it. This table holds all the migrations that have run against your DB, per batch. From the CLI, assuming you have knex.js installed globally, you can run knex migrate:latest, and that will push all the migrations that exist in your directory to the target database, if they have not yet been run. It does this by way of examining that knex_migrations table. If you roll a change and don't like it, and assuming you've properly done the down function, you can invoke knex migrate:rollback to reverse the change. If there are 3 migration files that have NOT yet been run, invoking knex migrate:latest will run all 3 of those migration files under a new batch #, which is 1 higher than the most recent batch number. Conversely, if you invoke a knex migrate:rollback, it will find the highest batch number (there could be more than 1 migration in a batch...), and invoke the down function on all those files, effectively rollback those changes.
All that said, knex is a "query builder" tool. It's got a ton of helper functions to help build the sql for you. Personally, I find this to be a major distraction. Why spend hours on hours figuring out all the helper functions when I can just go crank out raw SQL and run that. Thus, that's what we've done in our system. we use knex.raw('') and write our own DDL and DML. It works great and does exactly what we need it to. We don't need to go figure out the magic of the query building.
The short answer is that knex will automatically know what has and has not been run for you (again, via that knex_migrations table it creates for you...).
Things can get weird though when it start involving git and different branches. I recommend that if you're writing migrations on some branch, and you need to go do other work, always remember to first perform a rollback of any migrations you've done in that branch BEFORE switching branches. Otherwise you will be in weird DB states that don't coincide with the application code.
I would personally just deal with updating models independently of writing migrations. For example, if I'm adding a description column to some table, then I probably want to manually update the ORM to reflect the change of the new db schema. Generally, I've found trying to use a tool that automagically does that for you (rather, if I change the orm, stuff happens to write all the underlying sql...) usually winds me up in a heap of trouble and I just spend more time trying to un-fudge stuff. But, that's just my 2 cents :)
Here is where it gets a little tough and where I need help...I need to track what has been added or removed via migration so I can effectively write ups and downs as the tableModels change over time.
You could store changes in a DB/txt file and those can act as snapshots. So when you want to rollback to a particular migration, you would find the changes (up/down) made for that mutation and adjust accordingly.
Later I want to add a field called description so I would simply add it to my "tableModel" and then run a build command which would build out the migration. How do I check what has already been created though so I only do an up() for description?
Here you either call the database itself directly and check what fields have already been created. If a field is already their and the attributes are the same, you can either ignore it or stop the transaction all together.
Bonus Round Let's say I want to change user_id from an int to a bigint, because who makes a foreign key just an int? How do I check not just what needs to be added to the up and down, but also checks if I need to change a property on a field.
Again, call the DB itself on the table in question. I know the SQL call would be:
describe [table_name];
After reading the end, I think you answered this yourself, but I think capturing these changes would work best in a NoSql database since you're using Node or PostGres with it's json field.

Sphinx search crashes when there are no indexes

I want a sphinx searchd to start up but there are no indexes populated as yet. I have a separate cron job that pulls data from a data source and then calls the indexer to generate the indexes.
So the first time searchd starts the cron job has not yet run, hence there are no indexes. And searchd fails with errors like:
FATAL: no valid indexes to serve
Is there any way to get around this? e.g. to start earchd even when there are no indexes and if someone searched against it during that time, it just returns no docids. Later when the cron job runs, the indexes will be populated and then searched can query those indexes.
if someone searched against it during that time, it just returns no docids.
That would require an actual index to search againast.
Just create an empty index. Then when indexer runs, it recreates the index (with data this time) and notifies searchd - using --rotate switch.
Example of a way to produce a 'empty' index, as provided by #ctx: (Added Dec, 2014)
source force {
type = xmlpipe2
xmlpipe_command = cat /tmp/test.xml
}
index force {
source = force
path = /path/to/sphinx/datadir/filename
charset_type=utf-8
}
/tmp/test.xml:
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
<sphinx:field name="subject"/>
</sphinx:schema>
</sphinx:docset>
indexer force and now searchd should be able to run.
Alternativly can use something like sql_query = SELECT 1,'' but that does require connection to a real database server.

Resources