Setting up Sphinx Search

Setting up Sphinx Search - ubuntu-14.04

I've been following online tutorials to set up Sphinx Search, and I have the test database working. However, I am having trouble getting my own database to work.
sphinx.conf:
source src1
{
    type = mysql
    sql_host = localhost
    sql_user = root
    sql_pass = MyPassword
    sql_db = MyDatabase
    sql_port = 3306
    sql_query = \
        SELECT listing_id, title, description, image_id \
        FROM listings
    sql_attr_uint = listing_id
    sql_query_info = SELECT listing_id, title, description, image_id FROM listings
}
index test1
{
    source = src1
    path = /var/lib/sphinxsearch/data/test1
    docinfo = extern
    charset_type = sbcs
}
searchd
{
    listen = 9312
    log = /var/log/sphinxsearch/searchd.log
However, when I run:
sudo indexer --all --rotate
the output in PuTTY is:
using config file '/etc/sphinxsearch/sphinx.conf'...
indexing index 'test1'...
WARNING: attribute 'listing_id' not found - IGNORING
WARNING: Attribute count is 0: switching to none docinfo
WARNING: collect_hits: mem_limit=0 kb too low, increasing to 24576 kb
collected 3 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 3 docs, 49 bytes
total 0.002 sec, 16740 bytes/sec, 1024.94 docs/sec
total 2 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total 6 writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=911)
Then, when I run "search df", for example, I get:
Sphinx 2.0.4-id64-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/etc/sphinxsearch/sphinx.conf'...
FATAL: 'sql_query_info' value must contain '$id'
I am running Sphinx Search on Ubuntu 14.04 using an account called "user" that is listed in the sudoers file.
I have lost my mind with this, so I would appreciate someone's help.
Thanks

Your sql_query_info is invalid. It needs to contain $id, as the message says.
However, I would highly recommend not using the search tool at all: it's broken, and the articles recommending it are outdated. sql_query_info is only used by search, so you can skip it too.
Move right on to starting searchd, and use test.php if you don't have an application yet. Using test.php to test your index is MUCH better.
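To illustrate what the $id placeholder is for: the search tool expands $id to each matched document ID and runs the resulting query to fetch the row it displays. The sketch below only mimics that substitution in plain Python. The WHERE listing_id=$id clause and the helper name expand_query_info are assumptions made for illustration; they are not part of the question's config or of Sphinx itself.

```python
def expand_query_info(template: str, doc_id: int) -> str:
    """Mimic the $id expansion applied to sql_query_info (illustrative only)."""
    if "$id" not in template:
        # Same complaint the search tool printed in the question.
        raise ValueError("'sql_query_info' value must contain '$id'")
    return template.replace("$id", str(int(doc_id)))


# The query from the question has no $id, so there is nothing to expand.
BROKEN = "SELECT listing_id, title, description, image_id FROM listings"

# Hypothetical fix: look up a single row by its document ID.
FIXED = BROKEN + " WHERE listing_id=$id"

print(expand_query_info(FIXED, 42))
```

If the search tool is skipped as suggested, the sql_query_info line can simply be dropped from the source block.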

Related

Submitted metrics not showing up on prometheus endpoint

I have some code that looks like this. It is supposed to collect some custom metrics and expose them to Prometheus.
def collect_metrics():
    registry = prometheus_client.CollectorRegistry()
    label_names = ['parent', 'namespace','team', 'name', 'status']
    sib = Gauge(f'disk_sizeInBytes','Gets the size of the disk in bytes.', label_names, registry=registry)
    msib = Gauge(f'disk_maxSizeInMegabytes', 'Gets or sets the maximum size of the disk in megabytes, which is the size of memory allocated for the disk.', label_names, registry=registry)
    ...
    sib.labels(parent=parent_name, namespace=namespace_name, team=team, name=disk_name, status=disk_status).set(disk_list[dp]["sizeInBytes"])
    msib.labels(parent=parent_name, namespace=namespace_name, team=team, name=disk_name, status=disk_status).set(disk_list[dp]["maxSizeInMegabytes"])
    print(f'{datetime.datetime.now()} | disk_name: {disk_name} | sib: {disk_list[dp]["sizeInBytes"]} | msib: {disk_list[dp]["maxSizeInMegabytes"]}')
    ...
if __name__ == '__main__':
    ...
    start_http_server(8005)
    collect_metrics()
The code runs without any errors, but none of my metrics appear at the endpoint http://localhost:8005/, although I do see some default metrics, such as:
# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 403.0
python_gc_objects_collected_total{generation="1"} 0.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_gc_objects_uncollectable_total Uncollectable object found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 39.0
python_gc_collections_total{generation="1"} 3.0
python_gc_collections_total{generation="2"} 0.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="4",version="3.10.4"} 1.0
Can someone point out what the issue is here?

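A couple of things:
Remove registry = prometheus_client.CollectorRegistry()
Remove registry=registry from the Gauge declarations, so the gauges land on the default registry, which is the one start_http_server serves unless told otherwise.
Add a loop to keep the process running.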
import time
from prometheus_client import Gauge
from prometheus_client import start_http_server

def collect_metrics():
    label_names = ['parent', 'namespace', 'team', 'name', 'status']
    # No registry argument: both gauges register on the default registry.
    sib = Gauge(
        'disk_sizeInBytes',
        'Gets the size of the disk in bytes.',
        label_names,
    )
    msib = Gauge(
        'disk_maxSizeInMegabytes',
        'Gets or sets the maximum size of the disk in megabytes, which is the size of memory allocated for the disk.',
        label_names,
    )
    sib.labels(
        parent="parent_name",
        namespace="namespace_name",
        team="team",
        name="disk_name",
        status="disk_status",
    ).set(10.0)
    msib.labels(
        parent="parent_name",
        namespace="namespace_name",
        team="team",
        name="disk_name",
        status="disk_status",
    ).set(5.0)

if __name__ == '__main__':
    ...
    start_http_server(8005)
    collect_metrics()
    # Keep the process alive so the endpoint stays up.
    while True:
        time.sleep(5)
The endpoint then exposes the custom metrics:
# HELP disk_sizeInBytes Gets the size of the disk in bytes.
# TYPE disk_sizeInBytes gauge
disk_sizeInBytes{name="disk_name",namespace="namespace_name",parent="parent_name",status="disk_status",team="team"} 10.0
# HELP disk_maxSizeInMegabytes Gets or sets the maximum size of the disk in megabytes, which is the size of memory allocated for the disk.
# TYPE disk_maxSizeInMegabytes gauge
disk_maxSizeInMegabytes{name="disk_name",namespace="namespace_name",parent="parent_name",status="disk_status",team="team"} 5.0
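The reason the original gauges were invisible can be seen without starting a server. The sketch below, assuming prometheus_client is installed, renders the default registry and a separate CollectorRegistry with generate_latest and checks which metric shows up where. The metric names demo_hidden_bytes and demo_visible_bytes are made up for illustration.

```python
from prometheus_client import REGISTRY, CollectorRegistry, Gauge, generate_latest

# A gauge registered on a separate registry, as in the question.
custom_registry = CollectorRegistry()
hidden = Gauge(
    "demo_hidden_bytes",
    "Registered on a separate registry.",
    registry=custom_registry,
)
hidden.set(10.0)

# A gauge registered on the default registry, as in the answer.
visible = Gauge("demo_visible_bytes", "Registered on the default registry.")
visible.set(5.0)

# generate_latest renders a registry in the text exposition format.
default_output = generate_latest(REGISTRY).decode()
custom_output = generate_latest(custom_registry).decode()

print("visible in default registry:", "demo_visible_bytes" in default_output)
print("hidden in default registry: ", "demo_hidden_bytes" in default_output)
print("hidden in custom registry:  ", "demo_hidden_bytes" in custom_output)
```

If a separate registry is intentional, start_http_server(8005, registry=custom_registry) should serve that registry instead, though the default python_gc and python_info metrics would then no longer appear.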

Nextflow with Azure Batch - Cannot find a matching VM image

While trying to set up Nextflow with Azure Batch (nf-core), I am getting the following error. I tried this on multiple workflows (sarek, atacseq, etc.) and I get the same error:
N E X T F L O W ~ version 22.04.0
Pulling nf-core/atacseq ...
downloaded from https://github.com/nf-core/atacseq.git
Launching `https://github.com/nf-core/atacseq` [rhl6d5529] DSL1 - revision: 1b3a832db5 [1.2.1]
Downloading plugin nf-azure@0.13.1
----------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
nf-core/atacseq v1.2.1
----------------------------------------------------
Run Name : rhl6d5529
Data Type : Paired-End
Design File : https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/design.csv
Genome : Not supplied
Fasta File : https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/reference/genome.fa
GTF File : https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/reference/genes.gtf
Mitochondrial Contig : MT
MACS2 Genome Size : 1.2E+7
Min Consensus Reps : 1
MACS2 Narrow Peaks : No
MACS2 Broad Cutoff : 0.1
Trim R1 : 0 bp
Trim R2 : 0 bp
Trim 3' R1 : 0 bp
Trim 3' R2 : 0 bp
NextSeq Trim : 0 bp
Fingerprint Bins : 100
Save Genome Index : No
Max Resources : 6 GB memory, 2 cpus, 12h time per job
Container : docker - nfcore/atacseq:1.2.1
Output Dir : ./results
Launch Dir : /
Working Dir : /nextflow/atacseq/rhl6d5529
Script Dir : /.nextflow/assets/nf-core/atacseq
User : root
Config Profile : test,azurebatch
Config Description : Minimal test dataset to check pipeline function
Config Contact : Venkat Malladi (@vsmalladi)
Config URL : https://azure.microsoft.com/services/batch/
----------------------------------------------------
Uploading local `bin` scripts folder to az://nextflow/atacseq/rhl6d5529/tmp/66/bd55d79e42999df38ba04a81c3aa04/bin
[- ] process > CHECK_DESIGN -
[- ] process > CHECK_DESIGN [ 0%] 0 of 1
[- ] process > CHECK_DESIGN [ 0%] 0 of 1
Error executing process > 'CHECK_DESIGN (design.csv)'
Caused by:
Cannot find a matching VM image with publisher=microsoft-azure-batch; offer=centos-container; OS type=linux; verification type=verified
[58/55b7f7] process > CHECK_DESIGN (design.csv) [100%] 1 of 1, failed: 1
Error executing process > 'CHECK_DESIGN (design.csv)'
Caused by:
Cannot find a matching VM image with publisher=microsoft-azure-batch; offer=centos-container; OS type=linux; verification type=verified
I tried looking into the Nextflow source code and found that the error originates in AzBatchService.groovy (line linked below).
https://github.com/nextflow-io/nextflow/blob/0e593e6ab82880810d8139a4fe6e3c47ff69a531/plugins/nf-azure/src/main/nextflow/cloud/azure/batch/AzBatchService.groovy#L442
I did some further digging in my Azure Batch account. Basically, I wanted to confirm whether the list of supported images returned by the Azure Batch account includes the one required for this pipeline. I could confirm that the server did indeed respond with the required image -
What could be the issue here? I remember running the exact same pipeline a few weeks back and it did work a few times. Am I missing something?

Just had another look through the Azure Cloud docs and think this might be relevant:
By default, Nextflow creates CentOS 8-based pool nodes, but this
behavior can be customised in the pool configuration. Below the
configurations for image reference/SKU combinations to select two
popular systems.
Ubuntu 20.04:
sku = "batch.node.ubuntu 20.04"
offer = "ubuntu-server-container"
publisher = "microsoft-azure-batch"
CentOS 8 (default):
sku = "batch.node.centos 8"
offer = "centos-container"
publisher = "microsoft-azure-batch"
I think the issue here is a mismatched nodeAgentSkuId. Nextflow is expecting a CentOS 8 node agent SKU, but you have a CentOS 7 SKU. If it's not possible to change the nodeAgentSkuId somehow, you should be able to override the node agent SKU that Nextflow uses by adding this to your nextflow.config:
azure.batch.pools.<name>.sku = 'batch.node.centos 7'
Where <name> is the pool identifier:
azure.batch.pools.<name>.sku
Specify the ID of the Compute Node agent SKU which the pool identified with <name> supports (default: batch.node.centos 8, requires nf-azure@0.11.0).
https://www.nextflow.io/docs/edge/azure.html#advanced-settings

How to use regex to find all disk numbers that are offline in a string

I'm using Node.js to execute a cmd script. The output from cmd comes back as a string, and I need to find out which disk numbers have the status Offline.
For example, the string I need to search is:
"\r\nMicrosoft DiskPart version 10.0.19041.610\r\n\r\nCopyright (C) Microsoft Corporation.\r\nOn computer: DESKTOP-HACFL5A\r\n\r\n Disk ### Status Size Free Dyn Gpt\r\n -------- ------------- ------- ------- --- ---\r\n Disk 0 Online 238 GB 1024 KB *\r\n Disk 1 Online 14 GB 9 GB *\r\n\r\nDisk 1 is now the selected disk.\r\n"
Using the following regex, I can tell that there are 2 disks online, but how do I get their disk numbers? I need the regex output to store all the offline disk numbers in an array.
var str = JSON.stringify(stdout);
var matches = str.match( /Online/g )
console.log("matches is: ", matches)
matches is: [ 'Online', 'Online' ]

This would work:
\d+(?= Offline)
var str = `Microsoft DiskPart version 10.0.19041.610
Copyright (C) Microsoft Corporation.
On computer: DESKTOP-HACFL5A
Disk ### Status Size Free Dyn Gpt
-------- ------------- ------- ------- --- ---
Disk 0 Offline 238 GB 1024 KB *
Disk 1 Online 14 GB 9 GB *
Disk 1 is now the selected disk.`;
var matches = str.match( /\d+(?= Offline)/g );
console.log("matches is: ", matches)

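I got it to work with the following code. Note that the regex matches Online, as in the sample output; replace it with Offline to collect the offline disk numbers.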
var str = JSON.stringify(stdout);
const t = str.replace(/\s+/g, ' ').trim()
const regex = /\bDisk (\d+) Online\b/g;
console.log(Array.from(t.matchAll(regex), m => m[1]));
Thanks everyone.
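For comparison, here is the same idea as a minimal Python sketch. It assumes the raw DiskPart output keeps its column padding and \r\n line endings, so it matches runs of whitespace with \s+ instead of a single space. The function name offline_disks and the sample text are made up for illustration.

```python
import re
from typing import List

# One table row per line: "Disk", the disk number, then the status column.
OFFLINE_ROW = re.compile(r"^\s*Disk\s+(\d+)\s+Offline\b", re.MULTILINE)


def offline_disks(diskpart_output: str) -> List[int]:
    """Return the numbers of all disks whose status column reads Offline."""
    return [int(number) for number in OFFLINE_ROW.findall(diskpart_output)]


# Hypothetical sample modelled on the output in the question.
sample = (
    "\r\n  Disk ###  Status         Size     Free     Dyn  Gpt\r\n"
    "  --------  -------------  -------  -------  ---  ---\r\n"
    "  Disk 0    Offline         238 GB  1024 KB        *\r\n"
    "  Disk 1    Online           14 GB     9 GB        *\r\n"
    "  Disk 2    Offline         500 GB      0 B        *\r\n"
    "\r\nDisk 1 is now the selected disk.\r\n"
)

print(offline_disks(sample))
```

Anchoring the pattern at the start of a line keeps the trailing "Disk 1 is now the selected disk." message from being mistaken for a table row.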

Mysql seconds_behind master very high

Hi, we have MySQL master-slave replication: the master is MySQL 5.6 and the slave is MySQL 5.7. Seconds_Behind_Master is 245,000. How can I make the slave catch up faster? Right now it takes more than 6 hours to work through 100,000 seconds of lag.
My slave has 128 GB of RAM. Below is my my.cnf:
[mysqld]
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
innodb_buffer_pool_size = 110G
# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
# These are commonly set, remove the # and set as required.
basedir = /usr/local/mysql
datadir = /disk1/mysqldata
port = 3306
#server_id = 3
socket = /var/run/mysqld/mysqld.sock
user=mysql
log_error = /var/log/mysql/error.log
# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
join_buffer_size = 256M
sort_buffer_size = 128M
read_rnd_buffer_size = 2M
#copied from old config
#key_buffer = 16M
max_allowed_packet = 256M
thread_stack = 192K
thread_cache_size = 8
query_cache_limit = 1M
#disabling query_cache_size and type, for replication purpose, need to enable it when going live
query_cache_size = 0
#query_cache_size = 64M
#query_cache_type = 1
query_cache_type = OFF
#GroupBy
sql_mode=STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
#sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
enforce-gtid-consistency
gtid-mode = ON
log_slave_updates=0
slave_transaction_retries = 100
#replication related changes
server-id = 2
relay-log = /disk1/mysqllog/mysql-relay-bin.log
log_bin = /disk1/mysqllog/binlog/mysql-bin.log
binlog_do_db = brandmanagement
#replicate_wild_do_table=brandmanagement.%
replicate-wild-ignore-table=brandmanagement.t\_gnip\_data\_recent
replicate-wild-ignore-table=brandmanagement.t\_gnip\_data
replicate-wild-ignore-table=brandmanagement.t\_fb\_rt\_data
replicate-wild-ignore-table=brandmanagement.t\_keyword\_tweets
replicate-wild-ignore-table=brandmanagement.t\_gnip\_data\_old
replicate-wild-ignore-table=brandmanagement.t\_gnip\_data\_new
binlog_format=row
report-host=10.125.133.220
report-port=3306
#sync-master-info=1
read-only=1
net_read_timeout = 7200
net_write_timeout = 7200
innodb_flush_log_at_trx_commit = 2
sync_binlog=0
sync_relay_log_info=0
max_relay_log_size=268435456

There are lots of possible solutions, but I'll start with the simplest one. Do you have enough network bandwidth to send all the changes over the network? You're using the "row" binlog format, which may be good for random, unindexed updates. But if you're changing a lot of data using indexes only, then the "mixed" binlog format may be better.
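For a rough sense of scale, the figures in the question (245,000 seconds of lag, shrinking by 100,000 seconds in a little over 6 hours) can be turned into an estimate. This is only back-of-the-envelope arithmetic that assumes the lag keeps shrinking at a constant rate; the function name is made up for illustration.

```python
def hours_to_catch_up(lag_seconds: float,
                      observed_drop_seconds: float,
                      observed_hours: float) -> float:
    """Estimate the remaining catch-up time, assuming a constant rate."""
    drop_per_hour = observed_drop_seconds / observed_hours
    return lag_seconds / drop_per_hour


# Figures taken from the question.
estimate = hours_to_catch_up(245_000, 100_000, 6)
print(f"about {estimate:.1f} hours at the current rate")
```

Because the question says each 100,000 seconds takes more than 6 hours, the result is a lower bound, and a burst of writes on the master would push it further out.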

Force lshosts command to return megabytes for "maxmem" and "maxswp" parameters

When I type "lshosts" I am given:
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
server1 X86_64 Intel_EM 60.0 12 191.9G 159.7G Yes ()
server2 X86_64 Intel_EM 60.0 12 191.9G 191.2G Yes ()
server3 X86_64 Intel_EM 60.0 12 191.9G 191.2G Yes ()
I am trying to get lshosts to return maxmem and maxswp in megabytes, not gigabytes. I am trying to send Xilinx ISE jobs to my LSF cluster, but the software expects integer megabyte values for maxmem and maxswp. From debugging, it appears that the software grabs these parameters using the lshosts command.
I have already checked in my lsf.conf file that:
LSF_UNIT_FOR_LIMTS=MB
I have tried searching the IBM Knowledge Base, but to no avail.
Do you use a specific command to specify maxmem and maxswp units within the lsf.conf, lsf.shared, or other config files?
Or does LSF force return the most practical unit?
Any way to override this?

LSF_UNIT_FOR_LIMITS should work, if you completely drained the cluster of all running, pending, and finished jobs. According to the docs, MB is the default, so I'm surprised.
That said, you can use something like this to transform the results:
$ cat to_mb.awk
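function to_mb(s) {
    # Position of the unit suffix in "KMG": K=1, M=2, G=3.
    e = index("KMG", substr(s, length(s)))
    # Numeric part without the suffix (awk strings are 1-indexed).
    m = substr(s, 1, length(s) - 1)
    return m * 10^((e-2) * 3)
}
{ print $1 " " to_mb($6) " " to_mb($7) }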
$ lshosts | tail -n +2 | awk -f to_mb.awk
server1 191900 159700
server2 191900 191200
server3 191900 191200
The to_mb function should also handle 'K' or 'M' units, should those pop up.
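The same conversion can be sketched in Python for anyone wrapping lshosts in a script. This is a minimal sketch with assumptions: it mirrors the awk version's decimal factors (1G = 1000M), rounds to whole megabytes because the consuming software wants integers, and expects every maxmem/maxswp value to end in K, M, or G. The names to_mb and lshosts_in_mb are made up for illustration.

```python
from typing import List, Tuple

# Power of 1000 relative to megabytes for each unit suffix.
# Assumption: decimal units, matching the awk script.
_EXPONENT = {"K": -1, "M": 0, "G": 1}


def to_mb(value: str) -> int:
    """Convert a value such as '191.9G' to whole megabytes."""
    number, unit = float(value[:-1]), value[-1].upper()
    return round(number * 1000.0 ** _EXPONENT[unit])


def lshosts_in_mb(output: str) -> List[Tuple[str, int, int]]:
    """Return (host, maxmem_mb, maxswp_mb) for each row of lshosts output."""
    rows = []
    for line in output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 7:
            continue  # ignore blank or truncated lines
        rows.append((fields[0], to_mb(fields[5]), to_mb(fields[6])))
    return rows


# Sample taken from the question.
sample = """HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
server1 X86_64 Intel_EM 60.0 12 191.9G 159.7G Yes ()
server2 X86_64 Intel_EM 60.0 12 191.9G 191.2G Yes ()
"""

for host, maxmem, maxswp in lshosts_in_mb(sample):
    print(host, maxmem, maxswp)
```

In practice the text would come from running lshosts (for example via subprocess.run) rather than from a literal string.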

If LSF_UNIT_FOR_LIMITS is defined in lsf.conf, lshosts will always print the output as a floating-point number, and in some versions of LSF the parameter is defined as 'KB' in lsf.conf upon installation.
Try searching lsf.conf for any definitions of the parameter and commenting them all out so that the parameter is left undefined. I think in that case lshosts defaults to printing the values as integers in megabytes.
(Don't ask me why it works this way.)
