elasticsearch stops adding documents after some point - python-3.x

I'm new to elasticsearch and want to index many sentences to search them efficiently.
At first I tried bulk adding to an index, but that didn't work for me, so now I'm adding sentences one by one using the following piece of (python) code:
import json
import pycurl

def add_document(c, index_name, js, _id):
    # js is the document to index; json.dumps serializes it
    # (skip the dumps call if js is already a JSON-encoded string)
    c.setopt(c.POST, 1)
    c.setopt(c.URL, 'localhost:9200/%s/sentence/%i' % (index_name, _id))
    c.setopt(c.POSTFIELDS, json.dumps(js))
    c.perform()

c = pycurl.Curl()
add_document(c, 'myIndexName', 'someJsonString', 99)
Where I'm incrementing the id, and an example of a json input string would be:
{"sentence_id": 2, "article_name": "Kegelschnitt", "paragraph_id": 1, "plaintext": "Ein Kegelschnitt ist der zweidimensionale Sonderfall einer Quadrik .", "postags": "Ein/ART Kegelschnitt/NN ist/VAFIN der/ART zweidimensionale/ADJA Sonderfall/NN einer/ART Quadrik/NE ./$."}
So far so good, seems to work. I suspect that getting this to work in a bulk import way is a lot more efficient, but since this is a one-time only process, efficiency is not my primary concern.
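(For reference, a bulk version with the elasticsearch-py client could look roughly like the sketch below. This is an untested sketch against an older client version that still accepts the sentence mapping type; the index name and the my_sentences iterable are placeholders.)

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(['localhost:9200'])

def actions(sentences):
    # sentences is assumed to be an iterable of dicts shaped like the JSON example above
    for doc_id, doc in enumerate(sentences):
        yield {
            '_index': 'myIndexName',
            '_type': 'sentence',
            '_id': doc_id,   # must be unique across the whole data set
            '_source': doc,
        }

helpers.bulk(es, actions(my_sentences))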
I'm using this query (on the command line) to get an overview of my indices:
curl 'localhost:9200/_cat/indices?v'
Which gives me (for the relevant index):
health status index pri rep docs.count docs.deleted store.size pri.store.size
yellow open wiki_dump_jan2019 5 1 795502 276551 528.1mb 528.1mb
Similarly, the query:
curl -XGET 'localhost:9200/wiki_dump_jan2019/sentence/_count?pretty' -H 'Content-Type: application/json' -d '{"query": {"match_all": {}}}'
returns
{
  "count" : 795502,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}
This tells me that I have 795,502 sentences in my index.
My problem is that in total I do over 23 million inserts. I realise that there may well be some duplicate sentences, but I checked this and found over 21 million unique sentences. My Python code executed fine, with no errors, and I checked the elasticsearch logs and did not find anything alarming there. I'm a bit unsure about the number of docs.deleted in the index (276,551, see above), but I understand that this may have to do with re-indexing and duplicates and should not necessarily be a problem (and in any case, the total number of docs and the docs.deleted combined are still way below my number of sentences).
The only thing I could find that comes close to my problem was this post: elasticsearch stops indexing new documents after a while, using Tire. However, the following query:
curl -XGET 'localhost:9200/_nodes/stats/process?filter_path=**.max_file_descriptors'
returns:
{"nodes":{"hoOAMZoCTkOgirg6_aIkUQ":{"process":{"max_file_descriptors":65536}}}}
so from what I understand it defaulted to the maximum value upon installation, and this should not be the issue.
Can anyone shed some light on this?
UPDATE: OK, I guess I'm officially stupid. My issue was that I used the sentence_id as the document id in the adding/inserting process. This sentence_id comes from one particular source article, so the maximum number of docs (sentences) in my index would be the highest sentence_id (the longest article in my data set apparently had 795,502 sentences). It just kept overwriting all entries after every document... Sorry for having wasted your time if you read this. NOT an elasticsearch issue; the bug was in my Python code (outside of the displayed function above).
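(In case it helps anyone else hitting the same symptom: the fix is simply to make the document id unique across the whole data set, or to let elasticsearch generate one. A rough, untested sketch of the latter, based on the snippet above:)

import json
import pycurl

def add_document_autoid(c, index_name, doc):
    # POST without an explicit id; elasticsearch assigns a unique one,
    # so sentences from different articles can no longer overwrite each other
    c.setopt(c.POST, 1)
    c.setopt(c.URL, 'localhost:9200/%s/sentence' % index_name)
    c.setopt(c.HTTPHEADER, ['Content-Type: application/json'])
    c.setopt(c.POSTFIELDS, json.dumps(doc))
    c.perform()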

Related

Number of results not as it should be on boolean Solr query

I have a Solr instance running with about 200 entries in its database. I want to search for strings with OR but fail to get a working query.
When running a simple query like q=fieldname:"string", I get 13 results. When running another query like q=fieldname:"otherstring", I get 18 results. In the end I would expect the combined query to return 27 results, because together there are 31 results and 4 of them are the same ones, as they contain both strings.
The problem comes when I want to search for both of these strings at once: it returns all kinds of results, but not the expected 27. I found this site describing how it should work and tried a couple of different things:
q=fieldname:"string otherstring" gives me 10
q=fieldname:"otherstring string" gives me 0
q=fieldname:"string otherstring"~1 gives me 10
q=fieldname:"otherstring string"~1 gives me 1
q=fieldname:"(string otherstring)" gives me 37 but some are not related at all
q=(+fieldname:"string" +fieldname:"otherstring") gives the same as above
I could go on, as I tried more of these combinations. Can anyone help me get a query with the correct number of results, or explain what I am doing wrong?
If you want to perform an OR query, use OR explicitly:
q=fieldname:"string" OR fieldname:"otherstring"
The other versions will give varying results depending on the value of q.op and the query parser in use.
q=fieldname:("string" OR "otherstring")
should be semantically identical.
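(Not from the original answer, but as a quick sanity check from Python, the explicit OR query can be issued with the requests library; the core name "mycore" and the field name are placeholders for your setup:)

import requests

params = {
    "q": 'fieldname:"string" OR fieldname:"otherstring"',
    "rows": 0,        # only the numFound count is of interest here
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
print(resp.json()["response"]["numFound"])   # should report 27 if 4 documents overlap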

Netsuite - Transfer Inventory error

I have been using NetSuite for only a short time, and already hate it. I am sorry if this is a stupid question, but I haven't been able to find an answer so far, either in the Netsuite docs, StackOverflow or other websites. In fact, the answers I found have resulted in an error.
My company requires a script to transfer inventory based on an EDI input file. Reading the file is no problem, even parsing it is working. However, actually inserting the data is proving problematic.
I have been able to insert normal records, but Inventory Transfer records are giving me problems.
From Stack Overflow I found and adapted some code into the following:
var xfer = nlapiCreateRecord("inventorytransfer");
xfer.setFieldValue("trandate", FormatDate("20160101"));
xfer.setFieldValue("location", 9);
xfer.setFieldValue("transferlocation", 9);
nlapiSelectNewLineItem('invt');
nlapiSetLineItemValue("invt","invtid",1, 189);
nlapiSetLineItemValue("invt","adjustqtyby", 1, "5");
nlapiCommitLineItem('invt');
var id = nlapiSubmitRecord(xfer);
The FormatDate function just exchanges the date from the text file into a system date NetSuite can understand.
However, when I run this code I get the following error:
USER_ERROR: You must enter at least one line item for this transaction.
I thought inserting the line item was the reason to use nlapiSelectNewLineItem, but I guess not. Also nlapiCreateNewLineItem doesn't seem to exist.
The values I am inserting are all just test data, as I'm testing this in the debugger. Location 9 exists, as does item 189.
My full script finds these id's based on string values from the text files. But since this is the section that doesn't work I have set it apart to test.
Can anyone help with this?
You did not specify the type of script you are using, but it looks like you are not setting the line items on the record object; you are setting them on the current record instead. Below is the suggested code.
Also, there is no sublist named invt; it should be inventory. Likewise, there is no field called invtid; you most probably want to set the item, whose field id is item. You might want to refer to the SuiteScript Records Browser for help with the correct ids.
var xfer = nlapiCreateRecord("inventorytransfer");
xfer.setFieldValue("trandate", FormatDate("20160101"));
xfer.setFieldValue("location", 9);
xfer.setFieldValue("transferlocation", 9);
xfer.selectNewLineItem('inventory');
xfer.setCurrentLineItemValue("inventory", "item", 189);
xfer.setCurrentLineItemValue("inventory","adjustqtyby", "5");
xfer.commitLineItem('inventory');
var id = nlapiSubmitRecord(xfer);
If you are using Bin/Lot Numbered Items, please see the help topic "Sample Scripts for Advanced Bin / Numbered Inventory Management".

Drop an Index from SQL Server by Object ID

I am using SQL Server 2012 and have somehow ended up with an index named:
<Name of Missing Index, sysname,>
I'm not sure how or when it happened. I've tried dropping this index using:
DROP INDEX EMAIL_ADDRESS.<Name of Missing Index, sysname,>
It won't work, of course, since I receive the expected error message:
Incorrect syntax near '<'.
Querying the DMVs in SQL Server tells me I should drop this index, so it's even more frustrating that I cannot. This has been one of those little things that has gnawed at me for a few years now. I've looked for answers a few times over the years and have probably poured 4 hours into finding a way to drop an index by something other than its name. Nothing.
Can someone help me? Running this query:
SELECT * FROM SYS.INDEXES WHERE NAME = '<Name of Missing Index, sysname,>'
Produces an OBJECT_ID of 281104092. Is there a way to drop the object using this ID? There must be, right? Am I just stuck with this crazy index forever?
Try dropping it with the database owner (schema) name. It worked for me.
DROP INDEX ccl2.[TBL_HamdunSoft].[<Name of Missing Index, sysname,>]
Or
DROP INDEX dbo.[TBL_HamdunSoft].[<Name of Missing Index, sysname,>]

Using indexed types for ElasticSearch in Titan

I currently have a VM running Titan over a local Cassandra backend and would like the ability to use ElasticSearch to index strings using CONTAINS matches and regular expressions. Here's what I have so far:
After titan.sh is run, a Groovy script is used to load in the data from separate vertex and edge files. The first stage of this script loads the graph from Titan and sets up the ES properties:
config.setProperty("storage.backend","cassandra")
config.setProperty("storage.hostname","127.0.0.1")
config.setProperty("storage.index.elastic.backend","elasticsearch")
config.setProperty("storage.index.elastic.directory","db/es")
config.setProperty("storage.index.elastic.client-only","false")
config.setProperty("storage.index.elastic.local-mode","true")
The second part of the script sets up the indexed types:
g.makeKey("property").dataType(String.class).indexed("elastic",Edge.class).make();
The third part loads in the data from the CSV files, this has been tested and works fine.
My problem is, I don't seem to be able to use the ElasticSearch functions when I do a Gremlin query. For example:
g.E.has("property",CONTAINS,"test")
returns 0 results, even though I know this field contains the string "test" for that property at least once. Weirder still, when I change CONTAINS to something that isn't recognised by ElasticSearch, I get a "no such property" error. I can also perform exact string matches and any numerical comparisons, including greater than or less than; however, I expect the default indexing method is being used instead of ElasticSearch in these instances.
Due to the lack of errors when I try to run a more advanced ES query, I am at a loss on what is causing the problem here. Is there anything I may have missed?
Thanks,
Adam
I'm not quite sure what's going wrong in your code. From your description everything looks fine. Can you try the following script (just paste it into your Gremlin REPL):
config = new BaseConfiguration()
config.setProperty("storage.backend","inmemory")
config.setProperty("storage.index.elastic.backend","elasticsearch")
config.setProperty("storage.index.elastic.directory","/tmp/es-so")
config.setProperty("storage.index.elastic.client-only","false")
config.setProperty("storage.index.elastic.local-mode","true")
g = TitanFactory.open(config)
g.makeKey("name").dataType(String.class).make()
g.makeKey("property").dataType(String.class).indexed("elastic",Edge.class).make()
g.makeLabel("knows").make()
g.commit()
alice = g.addVertex(["name":"alice"])
bob = g.addVertex(["name":"bob"])
alice.addEdge("knows", bob, ["property":"foo test bar"])
g.commit()
// test queries
g.E.has("property",CONTAINS,"test")
g.query().has("property",CONTAINS,"test").edges()
The last 2 lines should return something like e[1t-4-1w][4-knows-8]. If that works and you still can't figure out what's wrong in your code, it would be good if you can share your full code (e.g. in Github or in a Gist).
Cheers,
Daniel

Python - CSV Module, Getting Information From a File

Here is the situation:
The first problem I'm having is with obtaining information from a CSV file. The purpose of the code I'm writing is to get a bunch of information on ZCTAs (zip codes) for a number of different cohorts (there are six currently being used, but the code is meant to be flexible enough to handle any number of cohorts). One file contains the population, by cohort, for each ZCTA. Another file has the number of 'cases' (cases of cancer observed) for each cohort, for each ZCTA. Another file has the crude rate for each cohort for the state of Iowa (the focus of this research), i.e. the rate at which one can 'expect' to see cancer cases in a population, by cohort. There are a couple of other files, but these are the focus, as this is where my issue shows up.
What my code does, initially, is read the population file and get the population of each cohort by ZCTA. Each ZCTA and its information is stored in a list, which is then stored in a nested list of lists containing all of the ZCTAs. The code then gets the crude rate. The crude rate is multiplied by the appropriate cohort population, for each ZCTA, and summed over all of the cohorts within each ZCTA, to get the total number of people we can EXPECT to see having cancer for each ZCTA. The population is also summed up. This information is stored in another list, as well as a list containing all of the ZCTAs. This information will be the focus (the list of all of the ZCTAs, each containing the total population and the total number of expected cases).
So, the problem is that I then need to take this newly acquired list, get the number of OBSERVED cases for each cohort, sum those together, append the sum to the appropriate ZCTA and write it to a new file. I have code that does this fine, EXCEPT that the bottom 22 or so ZCTAs don't get the number of observed cases. I don't know if it is the code or something else, but it works for all of the other 906 ZCTAs and just misses the bottom 22.
The reader will find sample data for the files I've discussed (the observed case file, and the output file) at: Gist
Here is the code I'm using:
import csv

expectedcsv = open('ExpectedCases.csv', 'w', newline='')
expectedwriter = csv.writer(expectedcsv, delimiter=',')
expectedHeader = ['zcta', 'expected', 'pop', 'observed']
expectedwriter.writerow(expectedHeader)

for zcta in zctaPop:
    caseCounter = 0
    # reopen the case file for every ZCTA so the reader starts at the top again
    thecasescsv = open('NewCaseFile.csv', 'r', newline='')
    thecasesreader = csv.reader(thecasescsv, delimiter=',')
    for case in thecasesreader:
        if case[0] == zcta[0]:
            for i in range(3, len(case)):
                caseCounter += int(case[i])
    zcta.append(caseCounter)
    expectedwriter.writerow(zcta)
    thecasescsv.close()

expectedcsv.close()
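(Just a thought, not from the original post: a dict-based lookup reads the case file only once and also makes it obvious when a ZCTA has no matching row. A rough sketch, assuming the same column layout as above, a case file without a header row, and the zctaPop list from earlier in the script:)

import csv

# Read the case file once and total the cohort columns per ZCTA.
observed_by_zcta = {}
with open('NewCaseFile.csv', 'r', newline='') as f:
    for case in csv.reader(f):
        observed_by_zcta[case[0]] = sum(int(x) for x in case[3:])

with open('ExpectedCases.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['zcta', 'expected', 'pop', 'observed'])
    for zcta in zctaPop:
        # .get() makes missing ZCTAs explicit instead of silently leaving them out
        writer.writerow(zcta + [observed_by_zcta.get(zcta[0], 0)])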
Something else I would also like to bring up is that later on in the code, the actual purpose of all of this is to create an SMR filter for each grid point. The grid points are somewhat arbitrary; they have been placed (via coordinates) over the entire state of Iowa. The SMR is the number of observed cases divided by the number of expected cases. The threshold, that is, how many expected cases for a particular filter, is set by the user. So, if a user wants a filter created on 150 expected cases (for each grid point), the code goes through the ZCTAs, summing up the expected cases until more than 150 are found. The distance to this last ZCTA is the 'radius' of the filter.
To do this, I built a distance matrix (the distance from each grid point to every ZCTA) and then sorted it, nearest to furthest. Because of the size of the file (2300 x 930), I have to read this file line by line and get all of the information from other files. So, starting with the nearest ZCTA, I get the population, expected cases, and observed cases (the problem with this file was discussed above) and add each of these to its respective counter (one for population, one for observed and one for expected). Then it goes to the next closest ZCTA and does the same, until the threshold is exceeded.
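(A rough sketch of this threshold walk, not taken from the original code; it assumes the distance matrix row is already sorted nearest-to-furthest and that the per-ZCTA totals are available in a dict keyed by ZCTA id:)

def filter_radius(sorted_zcta_distances, totals, threshold):
    """Walk ZCTAs nearest-to-furthest until the expected-case threshold is exceeded.

    sorted_zcta_distances: list of (zcta_id, distance) pairs, nearest first
    totals: dict mapping zcta_id -> (expected, population, observed)
    """
    expected_sum = pop_sum = observed_sum = 0.0
    radius = 0.0
    for zcta_id, distance in sorted_zcta_distances:
        expected, pop, observed = totals.get(zcta_id, (0.0, 0.0, 0.0))
        expected_sum += expected
        pop_sum += pop
        observed_sum += observed
        radius = distance
        if expected_sum > threshold:
            break
    smr = observed_sum / expected_sum if expected_sum else 0.0
    return radius, smr, pop_sum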
The problem here is that I couldn't use the CSV module to read these files, as I was already reading from another file and the reader's position would be lost. So, I had to use a plain file read(), which then required some interesting use of maketrans and .translate. I'm not sure it's efficient or works well. Everything seems to be fine, but without the above problem being fixed, it's impossible to tell. I have included the code below, but I was wondering if anybody had any better ideas/suggestions?
expectedCSV = open('ExpectedCases.csv', 'r', newline='')
table = str.maketrans('\r', ' ')
content = expectedCSV.read()
expectedCSV.close()
content = content.translate(table)
content = content.split(sep='\n')
newContent = []
for item in content:
    newContent.append(item.split(sep=','))
content = ' '
for item in newContent:
    # currentZcta and the running totals are defined earlier in the full script
    if item[0] == currentZcta:
        expectedTotal += float(item[1])
        totalPop += float(item[2])
        totalObservedCount += float(item[3])
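(Again just a suggestion rather than part of the original post: the csv module can load ExpectedCases.csv into a dict up front, so there is no need for the maketrans/translate workaround or for keeping a second reader position while iterating over the distance matrix. currentZcta and the running totals are assumed from the snippet above:)

import csv

# Load the expected/population/observed totals into a dict keyed by ZCTA.
expected_rows = {}
with open('ExpectedCases.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row written earlier
    for row in reader:
        expected_rows[row[0]] = (float(row[1]), float(row[2]), float(row[3]))

# Later, while walking the sorted distance matrix:
if currentZcta in expected_rows:
    expected, pop, observed = expected_rows[currentZcta]
    expectedTotal += expected
    totalPop += pop
    totalObservedCount += observed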
Also, I couldn't figure out how to color the methods blue and the variables red, as some of the more awesome users of this site do. I would be very much interested in learning how to do that for future posts.
If anybody needs more info or anything clarified to help answer/formulate a solution, please, by all means, ask! Thanks for taking the time to read!
So, I ended up "solving" this by computing the observed cases along with the expected cases and population, opening the file for each ZCTA as it is computed. This did not really solve the issue I was dealing with, but rather found a way around it. I'm somewhat disappointed that more people didn't view and/or respond to this. If someone comes up with an answer to the actual problem, by all means post it here. -Mike
