How to index plain text files for search in Sphinx

I scanned dozens of articles and forum threads and looked through the official documentation, but couldn't find an answer. This article sounds promising, since it says that "the data to be indexed can generally come from very different sources: SQL databases, plain text files, HTML files", but unfortunately, like all the other articles and forum threads, it is devoted to MySQL.
It is rather strange to hear that Sphinx is so cool, that it can do this and that, that it can do practically anything you want with any data source you like. But where are all the examples with data sources other than MySQL? I am looking for just one tiny, trivial step-by-step example of a Sphinx configuration for the easiest data source in the world: plain text files. Let's say I've installed Sphinx and want to scan my home directory (recursively) to find all plain text files containing "Hello world". What should I do to implement this?
Prerequisites:
Ubuntu
sudo apt-get install sphinxsearch
... what is next????

Before proceeding, have a look at "Sphinx without SQL!".
Ideally, I would do it as follows.
We are going to use Sphinx's sql_file_field to index a table that holds file paths. Here is a PHP script that creates such a table and fills it with the paths of the files in a particular directory (using scandir).
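<?php
// Connect to the same database that sphinx.conf reads from (sql_db = filetest).
$con = mysqli_connect("localhost", "root", "password", "filetest");
// Check the connection before running any query.
if (mysqli_connect_errno()) {
    die("Failed to connect to MySQL: " . mysqli_connect_error());
}
mysqli_query($con, "CREATE TABLE IF NOT EXISTS fileindex (id INT(6) UNSIGNED AUTO_INCREMENT PRIMARY KEY, text VARCHAR(255) NOT NULL)");
$base = '/absolute/path/to/your/dir/';
foreach (scandir($base) as $entry) {
    $path = $base . $entry;
    // Test the full path: a bare $entry would be resolved against the current working directory.
    if (!is_dir($path)) {
        // Escape the path so quotes in file names cannot break the statement.
        $safe = mysqli_real_escape_string($con, $path);
        mysqli_query($con, "INSERT INTO fileindex (text) VALUES ('$safe')");
    }
}
mysqli_close($con);
?>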
Below is the sphinx.conf file that indexes the table of file paths. Note sql_file_field: it makes Sphinx read the file named in the text (file path) column and index that file's contents.
source src1
{
type = mysql
sql_host = localhost
sql_user = root
sql_pass = password
sql_db = filetest
sql_port = 3306 # optional, default is 3306
sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
sql_query_pre = SET NAMES utf8
sql_query = SELECT id,text from fileindex
sql_file_field = text
}
index filename
{
source = src1
path = /var/lib/sphinxsearch/data/files
docinfo = extern
}
indexer
{
mem_limit = 128M
}
searchd
{
log = /var/log/sphinxsearch/searchd.log
pid_file = /var/log/sphinxsearch/searchd.pid
}
After creating the table and saving the configuration as /etc/sphinxsearch/sphinx.conf, run sudo indexer filename --rotate and your index is ready. Then type search followed by a keyword to get results.

How to export sql table,sp,view with nodejs?

I want to export tables, views and stored procedures (sp) from my database to a file.
One way to do that is to back up the database. I can't use this option because the database is at a remote location and I do not have access to the DB server's filesystem.
The other way is to use the "Generate and Publish Scripts" wizard and choose data and schema. That fails during generation (I don't know why, and for now I don't care why).
So my question is: is there a SQL query I can run that iterates over all tables, views and sp, gets the schema and the data, and writes them to a file? (If a table fails to open for some reason, ignore it.)
Can I do it with nodejs? Using sequelize perhaps? I am not sure how to get the table/view/sp schema with sequelize.
I would very much like some guidance here.
If I understood correctly, you'd like to select a table's contents and then export them to a file on your computer, right?
If that's the case, I would do as follows:
1) First import the required modules (in my case I use MSSQL)
const sql = require('mssql');
const fs = require('fs');
I use MSSQL for database management, so in this case I need the mssql module to connect to and query my DB.
fs (the 'File System' module) contains a function to copy a file from one location to another.
2) Then I will configure my database connection:
var config =
{
user: 'your username to log in',
password: 'password to log in',
server: "server path",
database: 'name of the database',
connectionTimeout: 0,
requestTimeout: 0,
pool:{
idleTimeoutMillis: 500,
max: 1
}
};
3) Then I would write the function that runs the query and saves the file:
function commenceQuery()
{
// T-SQL batch; the @names are T-SQL variables. xp_cmdshell must be enabled on the server.
var query =
"DECLARE @output TABLE (id INT IDENTITY, command NVARCHAR(512)) " +
"DECLARE @query VARCHAR(2048) = 'SELECT * FROM yourTable', " +
"@outputFile VARCHAR(2048) = 'Where you want the file to be saved, most probably will be on the database computer', " +
"@connectionString VARCHAR(512) = '-U databaseUserName -P databasePassword -S ' + @@servername, " +
"@bcpQuery VARCHAR(8000) = 'bcp \"@query\" QUERYOUT \"@outputFile\" -T -c -t, -r\\n @connectionString' " +
"SET @bcpQuery = REPLACE (@bcpQuery, '@query', @query) " +
"SET @bcpQuery = REPLACE (@bcpQuery, '@outputFile', @outputFile+'Test_Name.csv') " +
"SET @bcpQuery = REPLACE (@bcpQuery, '@connectionString', @connectionString) " +
"SET @bcpQuery = REPLACE (@bcpQuery, CHAR(10), ' ') " +
"INSERT INTO @output EXEC master..xp_cmdshell @bcpQuery";
var connection = new sql.ConnectionPool(config);
connection.connect()
.then(function()
{
return new sql.Request(connection).query(query);
})
.then(function()
{
// fs.copyFile needs a callback; the promise variant lets errors reach .catch below
return fs.promises.copyFile('/filePath/to/where/output/isSpecified', '/filePath/of/where/you/want/toSave');
})
.catch(function(err)
{
console.error(err);
})
.then(function()
{
// close the pool whether or not the export succeeded
connection.close();
});
}
To explain in detail the query passed to request.query():
DECLARE @output TABLE (id INT IDENTITY, command NVARCHAR(512))
Declares a table variable (similar to a temporary table) that collects the output of the command.
DECLARE @query VARCHAR(2048) = 'SELECT * FROM yourTable'
Declares a variable that holds the actual query string.
@outputFile VARCHAR(2048) = 'Where you want the file to be saved, most probably will be on the database computer'
Declares a variable that holds the folder where the database server should write the file, e.g. C:\Program Files\anyFolder\ (with the trailing separator, because the file name is appended to it).
@connectionString VARCHAR(512) = '-U databaseUserName -P databasePassword -S ' + @@servername
The same credentials as in the config we used above; bcp uses them to log in to the database.
@bcpQuery VARCHAR(8000) = 'bcp "@query" QUERYOUT "@outputFile" -T -c -t, -r\n @connectionString'
The command template that will be executed. Each placeholder that starts with '@' is replaced with its actual value by the statements below.
SET @bcpQuery = REPLACE (@bcpQuery, '@query', @query)
Replaces the @query placeholder with the string the variable contains.
SET @bcpQuery = REPLACE (@bcpQuery, '@outputFile', @outputFile+'Test_Name.csv')
Same as above; the file name Test_Name.csv is appended to the folder.
SET @bcpQuery = REPLACE (@bcpQuery, '@connectionString', @connectionString)
Same as above.
SET @bcpQuery = REPLACE (@bcpQuery, CHAR(10), ' ')
Removes any line breaks.
INSERT INTO @output
Inserts everything the command prints into @output.
EXEC master..xp_cmdshell @bcpQuery
Executes the assembled bcp command through xp_cmdshell.
Hope this helps !! Let me know if you get stuck with something
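The placeholder substitution that the T-SQL does with REPLACE can also be done in Node.js before anything is sent to the server. This is only a sketch: the function name buildBcpCommand, the {curly} placeholder syntax and the sample values are my own assumptions, while the bcp switches are copied from the query above.

```javascript
// Hypothetical helper: assemble the bcp command line from a template.
function buildBcpCommand(options) {
  // "\\n" keeps a literal backslash-n, which bcp reads as the row terminator.
  const template = 'bcp "{query}" QUERYOUT "{outputFile}" -T -c -t, -r\\n {connectionString}';
  return template
    // Function replacers avoid special handling of "$" in the inserted values.
    .replace('{query}', () => options.query)
    .replace('{outputFile}', () => options.outputFile)
    .replace('{connectionString}', () => options.connectionString)
    // Collapse real line breaks, like REPLACE(..., CHAR(10), ' ') does.
    .replace(/\r?\n/g, ' ');
}

// Usage with made-up values:
const command = buildBcpCommand({
  query: 'SELECT *\nFROM yourTable',
  outputFile: 'C:\\export\\Test_Name.csv',
  connectionString: '-S myServer'
});
console.log(command);
```

Building the string in JavaScript keeps the T-SQL batch short, but the command is still executed on the database server, so the output path is a path on that machine.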

Google Apps Script creates sheets version of excel file. Issue with multiple creation of versions.

I found a solution for my original question in another post Google Apps Script creates sheets version of excel file.
While testing the code provided in that answer I ran into another issue. Every time I run the script, it creates the Spreadsheet versions of the .xlsx files again, even if they already exist. I have tried modifying the code within the last if, with no results. Then I went back to running the code as posted, in case I had missed something, but it keeps creating new versions of the same files.
Any idea of what I could do to fix this would be really appreciated.
The code provided in the answer is the following.
// Convert the user's stored excel files to google spreadsheets based on the specified directories.
// There are quota limits on the maximum conversions per day: consumer @gmail = 250.
function convertExcelToGoogleSheets()
{
var user = Session.getActiveUser(); // Used for ownership testing.
var origin = DriveApp.getFolderById("origin folder id");
var dest = DriveApp.getFolderById("destination folder id");
// Index the filenames of owned Google Sheets files as object keys (which are hashed).
// This avoids needing to search and do multiple string comparisons.
// It takes around 100-200 ms per iteration to advance the iterator, check if the file
// should be cached, and insert the key-value pair. Depending on the magnitude of
// the task, this may need to be done separately, and loaded from a storage device instead.
// Note that there are quota limits on queries per second - 1000 per 100 sec:
// If the sequence is too large and the loop too fast, Utilities.sleep() usage will be needed.
var gsi = dest.getFilesByType(MimeType.GOOGLE_SHEETS), gsNames = {};
while (gsi.hasNext())
{
var file = gsi.next();
if(file.getOwner().getEmail() == user.getEmail())
gsNames[file.getName()] = true;
}
// Find and convert any unconverted .xls, .xlsx files in the given directories.
var exceltypes = [MimeType.MICROSOFT_EXCEL, MimeType.MICROSOFT_EXCEL_LEGACY];
for(var mt = 0; mt < exceltypes.length; ++mt)
{
var efi = origin.getFilesByType(exceltypes[mt]);
while (efi.hasNext())
{
var file = efi.next();
// Perform conversions only for owned files that don't have owned gs equivalents.
// If an excel file does not have gs file with the same name, gsNames[ ... ] will be undefined, and !undefined -> true
// If an excel file does have a gs file with the same name, gsNames[ ... ] will be true, and !true -> false
if(file.getOwner().getEmail() == user.getEmail() && !gsNames[file.getName()])
{
Drive.Files.insert(
{title: file.getName(), parents: [{"id": dest.getId()}]},
file.getBlob(),
{convert: true}
);
// Do not convert any more spreadsheets with this same name.
gsNames[file.getName()] = true;
}
}
}
}
You want to convert the Excel files in the origin folder to Google Spreadsheets and put the converted Spreadsheets in the dest folder.
When a converted file with the same name already exists in the dest folder, you don't want to convert it again.
If my understanding is correct, how about this modification?
From:
if(file.getOwner().getEmail() == user.getEmail() && !gsNames[file.getName()])
To:
if(file.getOwner().getEmail() == user.getEmail() && !gsNames[file.getName().split(".")[0]])
Note:
With this modification, a file is not converted when the name of its converted version is already found in the dest folder.
When a filename has an extension, like ###.xlsx, and the file is converted to a Google Spreadsheet, the extension seems to be removed automatically. I think this is why the duplicate files are created: gsNames holds the names without the extension, while file.getName() on the Excel file still includes it. So I used split(".")[0] to compare the names without the extension.
Reference:
split()
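One caveat with split(".")[0]: it cuts the name at the first dot, so a file called report.v2.xlsx would be looked up as report. Below is a small sketch of an alternative that removes only a trailing .xls or .xlsx. The helper name stripExcelExtension is hypothetical, and it assumes the conversion drops only that final extension, which I have not verified.

```javascript
// Hypothetical helper: drop only a trailing .xls/.xlsx, keep other dots.
function stripExcelExtension(fileName) {
  return fileName.replace(/\.xlsx?$/i, '');
}

// Comparison of both approaches on a name that contains an extra dot:
var sampleName = 'report.v2.xlsx';
var viaSplit = sampleName.split('.')[0];
var viaRegex = stripExcelExtension(sampleName);
```

In the script, the condition would then read !gsNames[stripExcelExtension(file.getName())].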

Custom field name mapping in expressionengine

I am making some changes to a site using ExpressionEngine, there is a concept I can't seem to quite understand and am probably coding workarounds that have pre-provided methods.
Specifically with channels and members, the original creator has added several custom fields. I am from a traditional database background where each column in a table has a specific meaningful name. I am also used to extending proprietary data by adding a related table joined with a unique key, again, field names in the related table are meaningful.
However, in EE, when you add custom fields, it creates them in the table as field_id_x and puts an entry into another table telling you what these are.
Now this is all nice from a UI point of view for giving power to an administrator but this is a total headache when writing any code.
Ok, there's template tags but I tend not to use them and they are no good in a database query anyway.
Is there a simple way to run a query on, say, the members table and then address m_field_1 by what it's really called, in my case "addresslonglat"?
There are dozens of these fields in the table I am working on and at the moment I am addressing them with fixed names like "m_field_id_73" which means nothing.
Does anybody know of an easy way to bring the data and its field names together?
Ideally I'd like to do the following:
$result = $this->EE->db->query("select * from exp_member_data where member_id = 123")->row();
echo $result->addresslonglat;
Rather than
echo $result->m_field_id_73;
This should work for you:
<?php
$fields = $data = array();
$member_fields = $this->EE->db->query("SELECT m_field_id, m_field_name FROM exp_member_fields");
foreach($member_fields->result_array() as $row)
{
$fields['m_field_id_'.$row['m_field_id']] = $row['m_field_name'];
}
$member_data = $this->EE->db->query("SELECT * FROM exp_member_data WHERE member_id = 1")->row();
foreach($member_data as $k => $v)
{
// Skip member_id and any column that has no entry in exp_member_fields.
if($k != 'member_id' && isset($fields[$k]))
{
$data[$fields[$k]] = $v;
}
}
print_r($data);
?>
Then just use your new $data array, for example $data['addresslonglat'].

How do I generate document excerpts using Sphinx and the Sphinx PHP API?

I am trying to generate search excerpts from the full text of indexed documents. I am using Sphinx V2.02. My Sphinx indexes work fine and regular results are no problem.
I am loading the document off disk so I've set load_files to TRUE. I've tried both the web path of the file and the direct Linux file path.
Here is my excerpt code:
$options = array( 'load_files' => TRUE );
$docs = array( '/files/0/123/123.txt' );
$words = 'gears';
$excerpts = $sphinxclient->BuildExcerpts( $docs, 'files', $words, $options );
Here is the Sphinx Documentation for Generating Excerpts.
BuildExcerpts returns false every time, rather than returning excerpts. What's happening? Should I be executing this somehow at the same time as my regular query? I've been executing BuildExcerpts on each document returned from the main query.
The BuildExcerpts code above is correct.
The problem is that my 'files' index is distributed and the Sphinx BuildExcerpts call doesn't like that. It seems that BuildExcerpts is really just referencing the config for that index, so you have to reference one of the actual indexes, rather than the distributed index in the BuildExcerpts() call.
For example: I have my files index split into 5 shards, files_0, files_1, etc. Using 'files' as my index breaks BuildExcerpts. Using files_0 or any of my shards works fine.
$options = array( 'load_files' => TRUE );
$docs = array( '/files/0/123/123.txt' );
$words = 'gears';
$excerpts = $sphinxclient->BuildExcerpts( $docs, 'files_0', $words, $options );

Searching the Registry for a key - JScript

Is there a way to search the Registry for a specific key using Windows Scripting Host?
I'm using JavaScript (Jscript/VBScript?) to do so, and the msdn Library doesn't mention any such method: http://msdn.microsoft.com/en-us/library/2x3w20xf(v=VS.85).aspx
Thanks,
So here's an update to the problem:
The problem is a bit more complicated than a direct registry search. I have to look through the installed products on a Windows box to find a specific product entry that I want to delete. The registry path is defined as:
HKEY_LOCAL_MACHINE\Software\Microsoft...\Products.
Within the Products key, the installed products are listed, but their keys are defined as hash codes. Within the product keys are other keys with defined names and defined values. I want to be able to search on the latter keys and values. How can I do that, by-passing the unknown hash codes?
For example, I need to find a product with DisplayVersion key = 1.0.0. The path to that key is:
HKLM\Software\Microsoft\Windows\CurrentVersion\Installer\UserData\Products\A949EBE4EED5FD113A0CB40EED7D0258\InstallProperties\DisplayVersion.
How can I either pick up, or avoid writing, the product key: A949EBE4EED5FD113A0CB40EED7D0258 ??
Assuming you're using JScript via the Windows Scripting Host (and not JavaScript from a browser), you can get the value of a specific key using the RegRead method of the WScript.Shell object:
// MyScript.js
var key = 'HKEY_CURRENT_USER\\SessionInformation\\ProgramCount'
, wsh = WScript.CreateObject('WScript.Shell')
, val = wsh.RegRead(key);
WScript.Echo('You are currently running ' + val + ' programs.');
If you actually need to search for a key or value based on some conditions, rather than read a known registry key, you need to implement your own recursive search algorithm in which registry values of type "REG_SZ" are the leaf nodes.
As an exercise to get more familiar with JScript on the Windows Scripting Host, I've made a small interface to the registry that does exactly this. The example included in the project shows how to perform such a registry search in a WSF script:
<job id="FindDisplayVersions">
<script language="jscript" src="../registry.js"/>
<script language="jscript">
// Search the registry and gather 20 DisplayVersion values.
var reg = new Registry()
, rootKey = 'SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Installer\\UserData\\S-1-5-18\\Products'
, keyRegex = /Products\\(.*?)\\InstallProperties\\DisplayVersion$/
, valRegex = /^1\./
, maxResults = 20
, uids = [];
reg.find(rootKey, function(path, value) {
var keyMatch = keyRegex.exec(path);
if (keyMatch) {
if (valRegex.exec(value)) {
uids.push(keyMatch[1] + '\t=\t' + value);
if (uids.length >= maxResults) { return false; } // Stop searching
}
}
return true; // Keep searching.
});
WScript.Echo(uids.join("\n"));
</script>
</job>
Note that, as @Robert Harvey points out, this could take a really long time if the root key is too deeply connected. Simple testing takes only a few seconds on the key I chose but your mileage may vary; of course, no warranty or fitness for a purpose, don't blame me if your computer blows up.
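To show the shape of such a recursive search without touching a real registry, here is a sketch that walks a nested object instead. Everything in it is an assumption made for illustration: the function name findValues, the sample hive (only the first product key comes from the question) and the use of console.log, which would be WScript.Echo under the Windows Scripting Host. Reading real subkeys would need something like the registry.js interface above.

```javascript
// Objects stand in for registry keys, strings for REG_SZ values (leaf nodes).
function findValues(node, path, visit) {
  for (var name in node) {
    if (!Object.prototype.hasOwnProperty.call(node, name)) { continue; }
    var childPath = path + '\\' + name;
    var child = node[name];
    if (typeof child === 'string') {
      // Leaf: hand it to the callback; returning false stops the whole search.
      if (visit(childPath, child) === false) { return false; }
    } else if (findValues(child, childPath, visit) === false) {
      return false;
    }
  }
  return true;
}

// Made-up sample data; the second product key is invented.
var sampleHive = {
  'A949EBE4EED5FD113A0CB40EED7D0258': {
    InstallProperties: { DisplayName: 'Example', DisplayVersion: '1.0.0' }
  },
  'FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF': {
    InstallProperties: { DisplayVersion: '2.3.1' }
  }
};

var keyRegex = /Products\\(.*?)\\InstallProperties\\DisplayVersion$/;
var matches = [];
findValues(sampleHive, 'Products', function (path, value) {
  var keyMatch = keyRegex.exec(path);
  if (keyMatch && value === '1.0.0') { matches.push(keyMatch[1]); }
  return true; // keep searching
});
console.log(matches.join('\n'));
```

The callback receives the full path of each value, so the unknown product key can be captured with a regular expression instead of being written into the script.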
http://code.google.com/p/jslibs/
If you don't find it there, you have to implement it yourself.
