Searching an array of strings in a file - node.js

I have a text file, say testFile.txt, and an array of strings to search for in the file, say ['year', 'weather', 'USD 34235.00', 'sportsman', 'ಕನ್ನಡ']. I can break the file into tokens with the NodeJS natural library and maybe create a large array (~100-200x the number of entries in the string array) out of it. Then sort both arrays and start the search. Or use lodash directly?
The result is Found when at least one string from the search array occurs in the text file; otherwise it is NotFound.
What are some of the options to implement such a search?

I'd suggest using a Set for the large array of tokens, then iterating through the search terms array and checking whether the token set has each term. If the terms array is also large, you could consider using a Set for it as well (MDN docs for Set).
You can see a performance comparison between an array and a set for large numbers of elements in this comment.
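If you want to verify the difference yourself, here is a quick micro-benchmark sketch (the array size and the needle are made up for illustration; it just contrasts the array's linear scan with the set's hash lookup):

const bigArray = Array.from({ length: 1e6 }, (_, i) => `token${i}`)
const bigSet = new Set(bigArray)
const needle = 'token999999' // worst case for the array: the last element

console.time('array includes') // O(n) linear scan
bigArray.includes(needle)
console.timeEnd('array includes')

console.time('set has') // O(1) average-case hash lookup
bigSet.has(needle)
console.timeEnd('set has')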
Below is a demo snippet:

const tokens1 = ['ಕನ್ನಡ', 'asdasd', 'zxczxc', 'sadasd', 'wqeqweqwe', 'xzczxc']
const tokens2 = ['xzczcxz', 'asdqwdaxcxzc', 'asdxzcxzc', 'wqeqwe', 'zxczcxzxcasd']
const terms = ['year', 'weather', 'USD 34235.00', 'sportsman', 'ಕನ್ನಡ']

const set1 = new Set(tokens1)
const set2 = new Set(tokens2)

const find = (tokensSet, termsArray) => {
  for (const term of termsArray) {
    if (tokensSet.has(term)) {
      return 'Found'
    }
  }
  return 'Not Found'
}

console.log(find(set1, terms)) // Found ('ಕನ್ನಡ' is in tokens1)
console.log(find(set2, terms)) // Not Found
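To build the token set from the file in the first place, here is a minimal sketch using the natural package the question mentions (the file name and the choice of WordTokenizer are assumptions; natural ships several tokenizers):

const fs = require('fs')
const natural = require('natural')

const text = fs.readFileSync('testFile.txt', 'utf8')
const tokenizer = new natural.WordTokenizer()
// WordTokenizer splits on non-alphanumeric characters, so multi-word terms
// like 'USD 34235.00' and non-Latin text may need a different tokenizer
const tokens = new Set(tokenizer.tokenize(text))

console.log(find(tokens, terms)) // reusing find and terms from the snippet above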

Related

Transforming large array of objects to csv using json2csv

I need to transform a large array of JSON (that can have over 100k positions) into a CSV.
This array is created directly in the application, it's not the result of an uploaded file.
Looking at the documentation, I've thought about using the parser, but it says:
For that reason it is rarely a good idea to use it unless your data is very small or your application doesn't do anything else.
Because the data is not small and my app will do other things than creating the CSV, I don't think it'll be the best approach, but I may be misunderstanding the documentation.
Is it possible to use the other options (async parser or transform) with already-created data (and not a stream of data)?
FYI: It's a Nest application, but I'm using this node.js lib.
Update: I've tried it with an array with over 300k positions, and it went smoothly.
Why do you need any external modules?
Converting JSON into a javascript array of javascript objects is a piece of cake with the native JSON.parse() function.
const fs = require('fs/promises'); // promise-based fs API; the awaits assume an async context
let jsontxt = await fs.readFile('mythings.json', 'utf8');
let mythings = JSON.parse(jsontxt);
if (!Array.isArray(mythings)) throw new Error("Oooops, stranger things happen!");
And, then, converting a javascript array into a CSV is very straightforward.
The most obvious and absurd case is just mapping every element of the array into a string that is the JSON representation of that element, and then joining the resulting strings into a single string separated by newlines (\n). You end up with a useless CSV with a single column containing every element of your original array. It's good for nothing but, heck, it's a CSV!
let csvtxt = mythings.map(JSON.stringify).join("\n");
await fs.writeFile("mythings.csv",csvtxt,"utf8");
Now, you can feel that you are almost there. Replace the useless mapping function with your own
let csvtxt = mythings.map(mapElementToColumns).join("\n");
and choose a good mapping between the fields of the objects of your array, and the columns of your csv.
function mapElementToColumns(element) {
  return `${JSON.stringify(element.id)},${JSON.stringify(element.name)},${JSON.stringify(element.value)}`;
}
or, in a more thorough way
function mapElementToColumns(fieldnames) {
  return function (element) {
    // note: falsy values (0, false, '') also fall back to the empty '""' column here
    let fields = fieldnames.map(n => element[n] ? JSON.stringify(element[n]) : '""');
    return fields.join(',');
  }
}
that you may invoke in your map
mythings.map(mapElementToColumns(["id","name","value"])).join("\n");
Finally, you might decide to use an automated "all fields in all objects" approach, which requires that all the objects in the original array share a similar field schema.
You extract all the fields of the first object of the array, and use them as the header row of the csv and as the template for extracting the rest of the elements.
let fieldnames = Object.keys(mythings[0]);
and then use this field names array as parameter of your map function
let csvtxt= mythings.map(mapElementToColumns(fieldnames)).join("\n");
and, also, prepending them as the CSV header (csvtxt is a string at this point, so it has no unshift; just concatenate):
csvtxt = fieldnames.join(',') + "\n" + csvtxt;
Putting all the pieces together...
function mapElementToColumns(fieldnames) {
  return function (element) {
    let fields = fieldnames.map(n => element[n] ? JSON.stringify(element[n]) : '""');
    return fields.join(',');
  }
}

const fs = require('fs/promises');
let jsontxt = await fs.readFile('mythings.json', 'utf8');
let mythings = JSON.parse(jsontxt);
if (!Array.isArray(mythings)) throw new Error("Oooops, stranger things happen!");
let fieldnames = Object.keys(mythings[0]);
let csvtxt = mythings.map(mapElementToColumns(fieldnames)).join("\n");
csvtxt = fieldnames.join(',') + "\n" + csvtxt;
await fs.writeFile("mythings.csv", csvtxt, "utf8");
And that's it. Pretty neat, huh?

Flattening conditional nested lists to a single flat list

Lodash Flatten nested lists.
I have a function in my Typescript script, where I convert some objects.
Unfortunately, some of the objects are nested, so the return type will sometimes be a nested list.
const values: (IRobotics | IRobotics[][])[] // IRobotics is an interface type I described
I used the lodash flatMap function to try to flatten it out, but it still leaves one nested level for each of the objects.
When the response comes, it only removes the outer list.
I tried using the _.flatMap() function initially,
const parsedValues = _.flatMap(values) as IRobotics[];
however, the values within the nested lists are still nested one level too deep (see image for example).
The types are correct, but I just want a single list of objects and not a nested list (which happens in the cases where it is IRobotics[][]).
Does anybody have a nice solution to flatten a conditional two-layer array from here?
Entire function for reference. The values attribute is where the initial values are returned
static createFromeData(item: EveryMatrix) {
  const values = Object.entries(item).map(entry => {
    return this.createFromEntry(entry);
  });
  const nestedValues = values.filter(entry => Array.isArray(entry));
  const parsedValues = _.flatMap(values) as IRobotics[];
  const IRobotics = new IRoboticsEveryMatrix(parsedValues);
  return IRobotics;
}
Use Array.flatMap() instead of Array.map():
const values = Object.entries(item).flatMap(entry =>
  this.createFromEntry(entry)
);
And if your environment doesn't support it, use _.flatMap():
const values = _.flatMap(Object.entries(item), entry =>
  this.createFromEntry(entry)
);
If you want to flatten an array of arrays (because you need the nested list for nestedValues), you can use Array.flat() or _.flatten():
const parsedValues = values.flat() as IRobotics[];
Or
const parsedValues = _.flatten(values) as IRobotics[];
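One thing to watch: both Array.flat() and _.flatten() remove only one level of nesting by default, and the question's IRobotics[][] entries are two levels deep. A small sketch with made-up data showing the depth argument:

const mixed = [1, [[2], [3]], 4]  // same shape as (IRobotics | IRobotics[][])[]
mixed.flat()           // [1, [2], [3], 4] – one level still nested
mixed.flat(2)          // [1, 2, 3, 4]
_.flatten(mixed)       // [1, [2], [3], 4] – one level, like flat()
_.flattenDeep(mixed)   // [1, 2, 3, 4]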

Search text for matching large number of strings

I have a use case where I have to check whether a text I receive as input contains any of the 3 million strings I have.
I tried regex matching, but once the list of strings crossed 50k, the performance became very bad.
I am doing this for each word in the search list:
inText = java.util.regex.Pattern.compile("\\b" + findStr + "\\b",
        java.util.regex.Pattern.CASE_INSENSITIVE).matcher(inText).replaceAll(repl);
I understand we can use search indexes like Lucene, but I feel those primarily work for finding a particular term in predefined text, while my use case is the opposite: I need to send a large text and check whether any of the predefined strings occur in it.
I think you could look at it the other way around: your predefined strings are documents stored in the inverted index, and your incoming text is a query that you test against those documents. Since the predefined strings will not change much, this will be very performant.
I prepared some Elasticsearch code that will do the trick.
public void add(String string, String id) {
    IndexRequest indexRequest = new IndexRequest(INDEX, TYPE, id);
    indexRequest.source(string);
    index(INDEX, TYPE, id, string);
}

@Test
public void scoring() throws Exception {
    // adding your predefined strings
    add("{\"str\":\"string1\"}", "1");
    add("{\"str\":\"alice\"}", "2");
    add("{\"str\":\"bob\"}", "3");
    add("{\"str\":\"string2\"}", "4");
    add("{\"str\":\"melanie\"}", "5");
    add("{\"str\":\"moana\"}", "6");

    refresh(); // otherwise we would not find anything
    indexExists(INDEX); // verifies that index exists
    ensureGreen(INDEX); // ensures cluster status is green

    // querying with your text separated by spaces; if the hits length is bigger than 0, you're good
    SearchResponse searchResponse = client().prepareSearch(INDEX)
            .setQuery(QueryBuilders.termsQuery("str", "string1", "string3", "melani"))
            .execute().actionGet();
    SearchHit[] hits = searchResponse.getHits().getHits();
    assertThat(hits.length, equalTo(1));
    for (SearchHit hit : hits) {
        System.out.println(hit.getSource());
    }
}

How to get from CouchDB only certain fields of certain documents for a single request?

I have thousands of documents with the same structure, for example:
{
  "key_1": "value_1",
  "key_2": "value_2",
  "key_3": "value_3",
  ...
}
And I need to get, let's say, key_1, key_3 and key_23 from some set of documents with known IDs; for example, I need to process only 5 documents while my DB contains several thousand. Each time I have a different set of keys and document IDs. Is it possible to get that information in one request?
You can use a list function (see: this, this, and this).
Since you know the IDs, you can then query _all_docs with the list function:
POST /{db}/_design/{ddoc}/_list/{func}/_all_docs?include_docs=true&columns=["key_1","key_2","key_3"]
Accept: application/json
Content-Length: {whatever}

{
  "keys": [
    "docid002",
    "docid005"
  ]
}
The list function needs to look at documents, and send the appropriate JSON for each one. Not tested:
(function (head, req) {
  send('{"total_rows":' + head.total_rows + ',"offset":' + head.offset + ',"rows":[');
  var columns = JSON.parse(req.query.columns);
  var delim = '';
  var row;
  while (row = getRow()) {
    var doc = {};
    // iterate over the column names themselves, not the array indices
    for (var i = 0; i < columns.length; i++) {
      var k = columns[i];
      doc[k] = row.doc[k];
    }
    row.doc = doc;
    send(delim + toJSON(row));
    delim = ',';
  }
  send(']}');
})
Whether this is a good idea, I'm not sure. But if your documents are big and bandwidth savings are important, it might be.
Yes, that’s possible. Your question can be broken up into two distinct problems:
Getting only a part of the document (in your example: key_1, key_3 and key_23). This can be done using a view. A view is saved into a design document. See the wiki for more info on how to create views.
Retrieving only certain documents, which are defined by their ID. When querying views, you can specify not only a single ID (or rather, key) but also an array of keys, which is what you would need here. Again, see the section on querying views in the wiki for explanations and examples.
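To illustrate the view part (the emitted field names are hard-coded here, so this sketch only fits if the wanted keys are known when the view is written; the compound-key answer below lifts that restriction): a map function that emits the document ID as the view key and just the wanted fields as the value:

function (doc) {
  // the view key is the doc id, so the view can be queried with a keys array
  emit(doc._id, { key_1: doc.key_1, key_3: doc.key_3, key_23: doc.key_23 });
}

You can then POST a body like {"keys": ["docid002", "docid005"]} to the view and get just those documents' fields back in one request.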
Even though you only need a subset of values from a document, you may find that the system as a whole performs better if you just ask for the entire document and then select the values you need from that result.
To get only the specific key/value pairs, you need to create a view whose entries have a compound key consisting of the doc ID and the doc item name, with the corresponding doc item as the value.
So your map function would look something like:
function (doc) {
  for (var i = 1; i < doc.keysInDoc; i++) {
    var k = "key_" + i;
    emit([doc._id, k], doc[k]);
  }
}
You can then use a multi-key lookup, with each key of the form ["docid12345", "key_1"], ["docid56789", "key_23"], etc.
So a query like:
http://host:5984/db/_design/design/_view/view?keys=[["docid002","key_8"],["docid005","key_7"]]
will return
{"total_rows":84,"offset":67,"rows":[
{"id":"docid002","key":["docid002","key_8"],"value":"value d2_k8"},
{"id":"docid005","key":["docid005","key_12"],"value":"value d5_k12"}
]}

Efficient way to search object from list of object in c#

I have a list which contains more than 75 thousand objects. To search for an item in the list, I am currently using the following code:
from nd in this.m_ListNodes
where nd.Label == SearchValue.ToString()
select nd;
Is this code efficient?
How often do you need to search the same list? If you're only searching once, you might as well do a straight linear search - although you can make your current code slightly more efficient by calling SearchValue.ToString() once before the query.
If you're going to perform this search on the same list multiple times, you should either build a Lookup or a Dictionary:
var lookup = m_ListNodes.ToLookup(nd => nd.Label);
or
var dictionary = m_ListNodes.ToDictionary(nd => nd.Label);
Use a dictionary if there's exactly one entry per label; use a lookup if there may be multiple matches.
To use these, for a lookup:
var results = lookup[SearchValue.ToString()];
// results will now contain all the matching results
or for a dictionary:
WhateverType result;
if (dictionary.TryGetValue(SearchValue.ToString(), out result))
{
    // Result found, stored in the result variable
}
else
{
    // No such item
}
No. It would be better if you used a Dictionary or a HashSet with the label as the key. In your case a Dictionary is the better choice:
var dictionary = new Dictionary<string, IList<Item>>();
// somehow fill the dictionary
IList<Item> result;
if (!dictionary.TryGetValue(SearchValue.ToString(), out result))
{
    // if you need an empty list instead of null
    // when the SearchValue isn't in the dictionary
    result = new List<Item>();
}
// result contains all items that have the key SearchValue
