I need to configure Sphinx to search on parts of words, not only whole words. The official site says I should use the directives
'min_infix_len', 'min_prefix_len' and 'enable_star', and all of them are set in my config file.
But searching on part of a word still doesn't work:
a * at the beginning or end of a word doesn't help.
source src1
{
    type     = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db   = ajax
    sql_port = 3306

    sql_query = \
        SELECT id, bookauthor FROM authors

    sql_query_info = SELECT * FROM authors WHERE id=$id
}

index index1
{
    source         = src1
    path           = /var/data/src1
    morphology     = stem_en
    min_word_len   = 1
    charset_type   = sbcs
    enable_star    = 1
    min_infix_len  = 3
    min_prefix_len = 3
}

searchd
{
    listen    = 9312
    log       = /var/log/searchd.log
    query_log = /var/log/query.log
    pid_file  = /var/log/searchd.pid
}
You can only use one of these two:
min_infix_len = 3
min_prefix_len = 3
Prefix indexing only supports a wildcard at the end of a word (something*);
infix indexing lets the wildcard match anywhere inside the word.
Remove one of them.
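For reference, here is a minimal sketch (not from the question) of what each setting buys you at query time, assuming the stock sphinxapi.py client that ships with Sphinx and the searchd instance above; the index name comes from the config, and the query terms are just examples:

import sphinxapi  # the Python API client bundled with the Sphinx distribution

cl = sphinxapi.SphinxClient()
cl.SetServer('localhost', 9312)                 # searchd listen port from the config above
cl.SetMatchMode(sphinxapi.SPH_MATCH_EXTENDED2)  # extended query syntax, so * wildcards are parsed

# With min_prefix_len (plus enable_star = 1): a trailing star matches word beginnings.
prefix_res = cl.Query('auth*', 'index1')

# With min_infix_len instead: stars on both sides match fragments inside words.
infix_res = cl.Query('*utho*', 'index1')

for res in (prefix_res, infix_res):
    if res:  # Query() returns None on connection or query errors
        print(res['total_found'], [m['id'] for m in res['matches']])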
I am trying to build a dynamic concat of fields, based on some configuration settings. The goal is to have a new field with the merged values of 1 to n fields.
language = "JP;EN"
language = list(str(item) for item in language.split(";"))
no_langs = len(language)
# check if columns for multi-language exists
for lang in language:
doc_lang = "doctor.name_" + lang
if doc_lang not in case_df.columns:
case_df_final = AddColumn(case_df, doc_lang)
### combine translations of masterdata
case_df = case_df.withColumn(
"doctor",
F.concat(
F.col(("doctor.name_" + language[0])),
F.lit(" // "),
F.col(("doctor.name_" + language[1])),
),
)
What I would like to achieve is that the new column is built dynamically depending on the number of languages configured. E.g. if only one language is used, the result would be like this:
case_df = case_df.withColumn(
    "doctor",
    F.col("doctor.name_" + language[0])
)
For 2 or more languages it should pick up all the languages based on the order in the list.
Thanks for your help.
I am using Spark 2.4 with Python 3.
The expected output would be the following
Final working code is the following:
# check if columns for multi-language exists
for lang in language:
    doc_lang = "doctor.name_" + lang
    if doc_lang not in case_df.columns:
        case_df = AddColumn(case_df, doc_lang)
    doc_lang_new = doc_lang.replace(".", "_")
    case_df = case_df.withColumnRenamed(doc_lang, doc_lang_new)

doc_fields = list(map(lambda k: "doctor_name_" + k, language))
case_df = case_df.withColumn("doctor", F.concat_ws(" // ", *doc_fields))
Thanks all for the help and hints.
I'm doing a web-scraping university research project. I started from a ready-made GitHub project, but that project does not retrieve all the data.
The project works like this:
Search Google using keywords, for example: (accountant 'email me at' Google).
Extract a snippet.
Retrieve data from this snippet.
The issue is:
The snippets extracted look like this: " ... marketing division in 2009. For more information on career opportunities with our company, email me: vicki#productivedentist.com. Neighborhood Smiles, LLC ..."
The snippet does not show everything; the "..." hides information such as role, location, etc. How can I retrieve all of the information with the script?
from googleapiclient.discovery import build #For using Google Custom Search Engine API
import datetime as dt #Importing system date for the naming of the output file.
import sys
from xlwt import Workbook #For working on xls file.
import re #For email search using regex.

if __name__ == '__main__':
    # Create an output file name in the format "srch_res_yyyyMMdd_hhmmss.xls in output folder"
    now_sfx = dt.datetime.now().strftime('%Y%m%d_%H%M%S')
    output_dir = './output/'
    output_fname = output_dir + 'srch_res_' + now_sfx + '.xls'
    search_term = sys.argv[1]
    num_requests = int(sys.argv[2])

    my_api_key = "replace_with_you_api_key" #Read readme.md to know how to get you api key.
    my_cse_id = "011658049436509675749:gkuaxghjf5u" #Google CSE which searches possible LinkedIn profile according to query.
    service = build("customsearch", "v1", developerKey=my_api_key)

    wb = Workbook()
    sheet1 = wb.add_sheet(search_term[0:15])
    wb.save(output_fname)

    sheet1.write(0, 0, 'Name')
    sheet1.write(0, 1, 'Profile Link')
    sheet1.write(0, 2, 'Snippet')
    sheet1.write(0, 3, 'Present Organisation')
    sheet1.write(0, 4, 'Location')
    sheet1.write(0, 5, 'Role')
    sheet1.write(0, 6, 'Email')

    sheet1.col(0).width = 256 * 20
    sheet1.col(1).width = 256 * 50
    sheet1.col(2).width = 256 * 100
    sheet1.col(3).width = 256 * 20
    sheet1.col(4).width = 256 * 20
    sheet1.col(5).width = 256 * 50
    sheet1.col(6).width = 256 * 50
    wb.save(output_fname)

    row = 1 #To insert the data in the next row.

    #Function to perform google search.
    def google_search(search_term, cse_id, start_val, **kwargs):
        res = service.cse().list(q=search_term, cx=cse_id, start=start_val, **kwargs).execute()
        return res

    for i in range(0, num_requests):
        # This is the offset from the beginning to start getting the results from
        start_val = 1 + (i * 10)

        # Make an HTTP request object
        results = google_search(search_term,
                                my_cse_id,
                                start_val,
                                num=10 #num value can be 1 to 10. It will give the no. of results.
                                )

        for profile in range(0, 10):
            snippet = results['items'][profile]['snippet']
            myList = [item for item in snippet.split('\n')]
            newSnippet = ' '.join(myList)
            contain = re.search(r'[\w\.-]+#[\w\.-]+', newSnippet)

            if contain is not None:
                title = results['items'][profile]['title']
                link = results['items'][profile]['link']
                org = "-NA-"
                location = "-NA-"
                role = "-NA-"

                if 'person' in results['items'][profile]['pagemap']:
                    if 'org' in results['items'][profile]['pagemap']['person'][0]:
                        org = results['items'][profile]['pagemap']['person'][0]['org']
                    if 'location' in results['items'][profile]['pagemap']['person'][0]:
                        location = results['items'][profile]['pagemap']['person'][0]['location']
                    if 'role' in results['items'][profile]['pagemap']['person'][0]:
                        role = results['items'][profile]['pagemap']['person'][0]['role']

                print(title[:-23])

                sheet1.write(row, 0, title[:-23])
                sheet1.write(row, 1, link)
                sheet1.write(row, 2, newSnippet)
                sheet1.write(row, 3, org)
                sheet1.write(row, 4, location)
                sheet1.write(row, 5, role)
                sheet1.write(row, 6, contain[0])

                print('Wrote {} search result(s)...'.format(row))
                wb.save(output_fname)
                row = row + 1

    print('Output file "{}" written.'.format(output_fname))
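The snippet field returned by the Custom Search API is always truncated, so the parts hidden behind "..." have to come from the result pages themselves. One possible extension, sketched here only as an idea and not part of the original script, is to follow each result's link and run a regex over the full page text (this assumes the requests package is installed and that the target pages are publicly reachable):

import re
import requests  # extra dependency, not used by the original script

def fetch_full_page_emails(link, timeout=10):
    """Download the page behind a search result and pull email-like strings
    out of the full HTML instead of the truncated snippet."""
    try:
        resp = requests.get(link, timeout=timeout,
                            headers={'User-Agent': 'Mozilla/5.0'})
        resp.raise_for_status()
    except requests.RequestException:
        return []
    # The snippet regex above uses '#' as the separator; real pages normally use '@'.
    return re.findall(r'[\w\.-]+@[\w\.-]+', resp.text)

Calling fetch_full_page_emails(link) inside the result loop would give full-page matches alongside the snippet, though role and location would still need page-specific parsing.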
I know there are a billion regex questions on Stack Overflow, but I can't understand why my URI matcher isn't working in Node.
I have the following:
var uri = "file:tmp.db?mode=ro"
function parseuri2db(uri){
var regex = new RegExp("(?:file:)(.*)(?:\\?.*)");
let dbname = uri.match(regex)
return dbname
}
I'm trying to identify only the database name, which I expect to be:
After an uncaptured file: group
Before an optional ? + parameters to end of string.
While I'm using:
var regex1 = new RegExp("(?:file:)(.*)(?:\\?.*)");
I thought the answer was actually more like:
var regex2 = new RegExp("(?:file:)(.*)(?:\\??.*)");
With a 0 or 1 ? quantifier on the \\? literal. But the latter fails.
Anyway, my result is:
console.log(parseuri2db(conf.db_in.filename))
[ 'file:tmp.db?mode=ro',
  'tmp.db',
  index: 0,
  input: 'file:tmp.db?mode=ro' ]
This seems to be capturing the whole string in the first element, rather than just the single capture group I asked for.
My questions are:
What am I doing wrong that I'm getting multiple captures?
How can I rephrase this to capture my capture groups with names?
I expected something like the following to work for (2):
function parseuri2db(uri){
  // var regex = new RegExp("(?:file:)(.*)(?:\\?.*)");
  // let dbname = uri.match(regex)
  var regex = new RegExp("(?<protocol>file:)(?<fname>.*)(<params>\\?.*)");
  let [, protocol, fname, params] = uri.match(regex)
  return dbname
}
console.log(parseuri2db(conf.db_in.filename))
But:
SyntaxError: Invalid regular expression: /(?<protocol>file:)(?<fname>.*)(<params>\?.*)/: Invalid group
Update 1
The answer to my first question is that I needed to keep the second group from matching the ? literal:
"(?:file:)([^?]*)(?:\\??.*)"
That particular Node version's regex engine does not support named groups.
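For what it's worth, the named-group idea itself is fine once the engine supports it (JavaScript's (?<name>...) syntax is available from Node 10 / ES2018 onwards). Below is the corrected pattern from Update 1 with named groups added, shown in Python's re module purely as an illustration; parse_uri is a hypothetical helper, not code from the question:

import re

# Mirrors the corrected pattern from Update 1, with named groups added and
# the "?params" part made optional.
URI_RE = re.compile(r'(?P<protocol>file:)(?P<fname>[^?]*)(?:\?(?P<params>.*))?')

def parse_uri(uri):
    """Hypothetical helper: return the database name from a file: URI."""
    m = URI_RE.match(uri)
    return m.group('fname') if m else None

print(parse_uri('file:tmp.db?mode=ro'))  # -> tmp.db
print(parse_uri('file:tmp.db'))          # -> tmp.db (the params part is optional)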
Can MATLAB remove the path from a URL and leave only the domain part? Does MATLAB have any function to strip the trailing path?
Let's say, example 1:
input: http://www.mathworks.com/help/images/removing-noise-from-images.html
output: http://www.mathworks.com
This regexp pattern should do the trick:
>> str = 'http://www.mathworks.com/help/images/removing-noise-from-images.html';
>> out = regexp(str,'\w*://[^/]*','match','once')
out =
'http://www.mathworks.com'
The search pattern '\w*://[^/]*' says look for a string that starts with some "word" characters ('\w*') corresponding to the protocol (e.g. http, https, rtsp), followed by the ubiquitous ://, and then any number of characters that are not a forward slash ([^/]*).
Edit: The 'once' option should eliminate a nested cell.
UPDATE: just the hostname, allowing inputs with no protocol.
>> str = {'http://www.mathworks.com/help/images/removing-noise-from-images.html';
          'https://www.mathworks.com/help/matlab/ref/strcmpi#dfvfv.html';
          'google.com/voice'}
>> out = regexp(str,'([^/]*)(?=/[^/])','match','once')
out =
'www.mathworks.com'
'www.mathworks.com'
'google.com'
UPDATE 2: regexp madness!
>> str = {'http://www.mathworks.com/help/images/removing-noise-from-images.html';
          'https://www.mathworks.com/help/matlab/ref/strcmpi#dfvfv.html';
          'google.com/voice';
          'http://monkey.org/';
          'stackoverflow.com/';
          'meta.stackoverflow.com'};
>> out = regexp(str,'.*?[^/](?=(/([^/]|$)|$))','match','once')
out =
'http://www.mathworks.com'
'https://www.mathworks.com'
'google.com'
'http://monkey.org'
'stackoverflow.com'
'meta.stackoverflow.com'
% hostname.m
function hostnames = hostname(str)
hostnames = regexp(str,'.*?[^/](?=(/([^/]|$)|$))','match','once');
Code:
function output_url = domain_name(input_url)

c1 = strfind(input_url,'//');
ind1 = strfind(input_url,'/');

if isempty(c1) && isempty(ind1)
    output_url = input_url; % For case like - www.mathworks.com
    return;
end

if ~isempty(c1)
    if numel(ind1)>2
        output_url = input_url(1:ind1(3)-1); % For cases like - http://www.mathworks.com/ or http://www.mathworks.com/something/
    else
        output_url = input_url; % For case like - http://www.mathworks.com
    end
else
    output_url = input_url(1:ind1(1)-1); % For cases like - www.mathworks.com/ or www.mathworks.com/something/
end

return;
Example runs:
%% Long URLs with extensions
disp(domain_name('www.mathworks.com/help/images/removing-noise-from-images.html'))
disp(domain_name('http://www.mathworks.com/help/images/removing-noise-from-images.html'))
%% Short URLs without HTTP://
disp(domain_name('www.mathworks.com'))
disp(domain_name('www.mathworks.com/'))
%% Short URLs with HTTP://
disp(domain_name('http://www.mathworks.com'))
disp(domain_name('http://www.mathworks.com/'))
Return:
www.mathworks.com
http://www.mathworks.com
www.mathworks.com
www.mathworks.com
http://www.mathworks.com
http://www.mathworks.com
An alternative, and probably more efficient, method would be to use REGEXP, but apparently I prefer numbers.
Edit 1: If you prefer to process a bunch of URLs at the same time, you may use a cell array. Obviously, the output would be a cell array too. Look at the following MATLAB script to get a feel for it -
% Input
in_urls_cell = [{'http://mathworks.com/'},{'mathworks.com/help/matlab/ref/strcmpi.html'},{'mathworks.com/help/matlab/ref/strcmpi#dfvfv.html'}];

% Get domain name
out_urls_cell = cell(size(in_urls_cell));
for count = 1:numel(in_urls_cell)
    out_urls_cell(count) = {domain_name(cell2mat(in_urls_cell(count)))};
end

% Display only domain name
for count = 1:numel(out_urls_cell)
    disp(cell2mat(out_urls_cell(count)));
end
The above script returns -
http://mathworks.com
mathworks.com
mathworks.com
I have a search form on my e-commerce site. The search engine is Sphinxsearch.
I have products with SKUs like (04078, PS04078, DS04078, 04078-1, 04078-2, 4078-3).
The problem is that I cannot figure out how to configure Sphinx to get the results I need:
searching for '04078' gives me only the item with SKU 04078, not all 6 items.
How do I get all 6 items in the result set?
My conf:
source products
{
    type     = mysql
    sql_host = #
    sql_user = #
    sql_pass = #
    sql_db   = #
    sql_port = #

    sql_query_pre = SET CHARACTER SET utf8
    sql_query = \
        SELECT id,price,name,sku,producer_name \
        FROM products

    #sql_attr_string  = post_title
    #sql_field_string = post_content

    sql_query_info = SELECT * FROM products WHERE id=$id
}

index products
{
    source  = products
    path    = /var/data/products
    docinfo = extern
    mlock   = 0

    charset_type = utf-8
    html_strip   = 1
    html_remove_elements = style, script

    enable_star   = 1
    min_word_len  = 1
    min_infix_len = 3
}
The new regexp_filter option in the 2.1.1-beta sounds just the ticket to munge the product codes into a consistent form...
http://sphinxsearch.googlecode.com/svn/trunk/doc/sphinx.html#conf-regexp-filter
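As a rough illustration of the kind of normalization a regexp_filter rule could apply at indexing and query time (the exact pattern here is only a guess based on the SKUs listed in the question), stripping letter prefixes and -N suffixes maps most variants onto the bare code:

import re

# Hypothetical normalization: drop a leading letter prefix (PS, DS, ...) and a
# trailing "-N" suffix so the SKU variants index under the same bare code.
skus = ["04078", "PS04078", "DS04078", "04078-1", "04078-2", "4078-3"]
print([re.sub(r'^[A-Za-z]+|-\d+$', '', s) for s in skus])
# ['04078', '04078', '04078', '04078', '04078', '4078']

Note that '4078-3' still normalizes to '4078', so the leading-zero variant would need an extra rule or the infix/star settings already present in the config.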