Best way to search through a very big dataset? - search

I have text files that contain about 12gbs worth of tweets and need to search through this dataset off of keywords. What is the best way to go about doing this?
Familiar with Java, Python, R. I don't think my computer can handle the files if, for example, I do some sort of script that goes through each text file in python

"Oh, Python, or any other language, can most-certainly do it." Might take a few seconds, but the job will get done. I suggest that the best approach to your problem is: "straight ahead." Write scripts that process the files one line at a time.
Although "12 gigabytes" sounds enormous to us, to any modern-day machine it's really not that big at all.
Build hashes (associative arrays) in memory as needed. Generally avoid database-operations (other than "SQLite" database files, maybe ...), but, if you happen to find yourself needing "indexed file storage," SQLite is a terrific tool.
. . . with one very-important caveat: "when using SQLite, use transactions, even when reading." By default, SQLite will physically-commit every write and physically-verify every read, unless you are in a transaction. Then, and only then, it will "lazy read/write," as you might have expected it to do all the time. (And then, "that sucker's f-a-s-t...!")

If you want to be exact, then you need to see at every file once, so if your computer can't take that load, then say goodbye to exactness.
Another approach, would be to use approximation algorithms which are faster than the exact ones, but come in the expense of loosing accuracy.
That should get you started and I will stop my answer here, since the topic is just too broad to continue with from here.

Related

Can I reuse a cucumber/gherkin Example block?

I've got two different scenarios that use the same example block. I need to run the example block for two different times of the day and I'm looking for a succinct way to do this (without copy+pasting my example block).
I'm replacing the yyymmdd with an actual date in my stepdef.
I'd like to reuse my Example block because in real life it's a MUCH longer list.
Scenario Outline: File arrives in the morning
Given a file <file> arrives in the morning
When our app runs
Then The file should be moved to <newFile>
And the date should be today
Examples:
|Filename|NewFilename|
|FileA|NewFileA_yyyymmdd|
|FileB|NewFileB_yyyymmdd|
Scenario Outline: File arrives in the evening
Given a file <file> arrives in the evening
When our app runs
Then The file should be moved to <newFile>
And the date should be tomorrow
Examples:
|Filename|NewFilename|
|FileA|NewFileA_yyyymmdd|
|FileB|NewFileB_yyyymmdd|
I'm implementing this in java, though I don't know if that's a relevant detail.
No, this is not supported in the Gherkin syntax. I don't often advised copy-and-paste, but this is one case where it is warranted due to a missing feature of the language.
Generally this should not be a big deal, as the example size should be small. It you really need a large number of examples then recreating this test in code only (Java, Python, C#, etc.) might be the best idea. Most unit test libraries offer some form of data driven tests that might provide a DRYer, more maintainable solution than Gherkin.
This is something thats better tested at a lower level. What you are testing here is your file renaming algorithm. You could write a unit test to do this which would
run much much faster (100 1000 or even 10K times faster is perfectly realistic)
be much more expressive
deal with edge cases better
Once you have that done I would write a single scenario that deals with the whole end to end process, and just ensures that the file is moved and renamed e.g.
Given a file has arrived
When our app runs
Then the file should be moved
And it should be renamed
And the new name should contain the current date
Cukes are expensive to create and particularly to run, so you need to get lots of functionality exercised for each one. When you use outlines and create identical scenarios you are just wasting loads of runtime and adding complexity for little benefit.

Maximum/Optimal number of Cucumber test scenarios in one file

I have a Cucumber feature file with over 66 scenarios! The title of the feature file does represent what the scenarios are all about.
But 66 (200 steps) feels like quite a large number. Does this suggest that my feature title is too broad?
What is the maximum number of scenarios one should have in a single feature file (from a best practice point of view)?
Thanks in advance :)
Although I don't know your system and feature file, I can surely say that there is a misunderstanding of scenarios and their purpose.
The purpose of scenarios is to bring a clarification for the feature by examples. Usually, people tend to write scenarios to cover all use cases. If you do scenarios that way, the feature loses the ability to be human-readable.
Keep in mind that acceptance tests are expensive to write and expensive to change. Write the minimum scenarios. If there is a scenario that doesn't bring any additional value for the understanding of the feature, then that scenario shouldn't be there. Move all use cases into a lower level of testing - unit tests.
In most cases, the feature has the number of scenarios in units, or tens if it's a complex feature.
Edit: If the number of scenarios would go close to 10, I would rather split the feature file into more files describing deeper part of the feature.
Yes, 200 is an unusually large number of scenarios for a single file. It is likely to be hard to find a particular scenario in the file or to keep it organized. (Multiple smaller files are easier to organize; a directory of files is easier for people to understand and maintain than a long file with comments or worse yet some uncommented ordering scheme.) It will also take a long time to run the file, which will make development difficult.
More importantly, 200 scenarios for a single feature might mean that the feature is extremely complex or that it is very broad. In either case it can probably be broken up into multiple smaller feature files. It also might mean that there are too many scenarios. There might be a scenario for every value of some variable (it might be sufficient to write a single scenario and not worry about different values) or a scenario for every detail of every feature (it might be better to write unit tests, which are smaller and more focused and faster, for details).
But, as with any software metric about the size of a piece of code, there might be a typical size, but every problem is different. Your feature might really be that complex. We can't say without understanding the domain and seeing the feature file.

Massive number of XML edits

I need to load a mid-sized XML file into memory, make many random access modifications to the file (perhaps hundreds of thousands), then write the result out to STDIO. Most of these modifications will be node insertion/deletions, as well as character insertion/deletions within the text nodes. These XML files will be small enough to fit into memory, but large enough that I won't want to keep multiple copies around.
I am trying to settle on the architecture/libraries and am looking for suggestions.
Here is what I have come up with so far-
I am looking for the ideal XML library for this, and so far, I haven't found anything that seems to fit the bill. The libraries generally store nodes in Haskell lists, and text in Haskell Data.Text objects. This only allows linear Node and Text inserts, and I believe that the Text inserts will have to do full rewrite on every insert/delete.
I think storing both nodes and text in sequences seems to be the way to go.... It supports log(N) inserts and deletes, and only needs to rewrite a small fraction of the tree on each alteration. None of the XML libs are based on this though, so I will have to either write my own lib, or just use one of the other libs to parse then convert it to my own form (given how easy it is to parse XML, I would almost just as soon do the former, rather than have a shadow parse of everything).
I had briefly considered the possibility that this might be a rare case where Haskell might not be the best tool.... But then I realized that mutability doesn't offer much of an advantage here, because my modifications aren't char replacements, but rather add/deletes. If I wrote this in C, I would still need to store the strings/nodes in some sort of tree structure to avoid large byte moves for each insert/delete. (Actually, Haskell probably has some of the best tools to deal with this, but I would be open to suggestions of a better choice of language for this task if you feel there is one).
To summarize-
Is Haskell the right choice for this?
Does any Haskell lib support fast node/text insert/deletes (log(N))?
Is sequence the best data structure to store a list of items (in my case, Nodes and Chars) for fast insert and deletes?
I will answer my own question-
I chose to wrap an Text.XML tree with a custom object that stores nodes and text in Data.Sequence objects. Because haskell is lazy, I believe it only temporarily holds the Text.XML data in memory, node by node as the data streams in, then it is garbage collected before I actually start any real work modifying the Sequence trees.
(It would be nice if someone here could verify that this is how Haskell would work internally, but I've implemented things, and the performance seems to be reasonable, not great- about 30k insert/deletes per second, but this should do).

what is the fastest word search on index?

i'm coding a query engine to search through a very large sorted index file. so here is my plan, use binary search scan together with Levenshtein distance word comparison for a match. is there a better or faster ways than this? thanks.
You may want to look into Tries, and in many cases they are faster than binary search.
If you were searching for exact words, I'd suggest a big hash table, which would give you results in a single lookup.
Since you're looking at similar words, maybe you can group the words into many files by something like their soundex, giving you much shorter lists of words to compute the distances to. http://en.wikipedia.org/wiki/Soundex
In your shoes, I would not reinvent the wheel - rather I'd reach for the appropriate version of the Berkeley DB (now owned by Oracle, but still open-source just as it was back when it was owned and developed by the UC at Berkeley, and later when it was owned and developed by Sleepycat;-).
The native interfaces are C and Java (haven't tried the latter actually), but the Python interface is also pretty good (actually better now that it's not in Python's standard library any more, as it can better keep pace with upstream development;-), C++ is of course not a problem, etc etc -- I'm pretty sure you can use if from most any language.
And, you get your choice of "BTree" (actually more like a B*Tree) and hash (as well as other approaches that don't help in your case) -- benchmark both with realistic data, btw, you might be surprised (one way or another) at performance and storage costs.
If you need to throw multiple machines at your indexing problem (because it becomes too large and heavy for a single one), a distributed hash table is a good idea -- the original one was Chord but there are many others now (unfortunately my first-hand experience is currently limited to proprietary ones so I can't really advise you here).
after your comment on David's answer, I'd say that you need two different indexes:
the 'inverted index', where you keep all the words, each with a list of places found
an index into that file, to quickly find any word. Should easily fit in RAM, so it can be a very efficient structure, like a Hash table or a Red/Black tree. I guess the first index isn't updated frequently, so maybe it's possible to get a perfect hash.
or, just use Xapian, Lucene, or any other such library. There are several widely used and optimized.
Edit: I don't know much about word-comparison algorithms but I guess most aren't compatible with hashing. In that case, R/B Trees or Tries might be the best way.

When is it a good use of time to refactor string literals?

I'm starting on a project where strings are written into the code most of the time. Many strings might only be used in a few places but some strings are common throughout many pages.
Is it a good use of my time to refactor the literals into constants being that the app is pretty well established and runs well? What would be the long-term benefits to doing so?
One common thing to consider would be i18n. If you (or your muckity-mucks) ever want to sell your product in Mexico or France (etc.) you're going to appreciate having those string literals not littered throughout the code base.
EDIT: I realize this doesn't directly answer your question, so I'm voting up some of the other answers re: rule of three, and the like. I understand you're talking about an existing code base, so it's a little late to talk about incorporating i18n from the start. It's so easy to do when you're in the habit from the start.
I like to apply the rule of three when refactoring. If it happens three or more times, then the code needs to be updated.
Only if this project needs to be supported into the future is this a good use of time. If you will be regularly maintaining/expanding this system; however, this is a great idea.
1) There is a large degree of risk associated with string literals as a single misspelling can usually only be detected at run time. The reduced risk of run time errors is a serious advantage as they can be embarrassing/frustrating.
2) Also, should they ever need to be changed, for example when they are used to reference another system (like table names, server names etc) they can be very difficult to update when those other system names change. Centralize them and it's a trivial issue.
If a string is used in more than one place, refactor it. If it is only used in one place, leave it alone.
If you've refactored out all your common strings, it makes it easier to internationalize/translate them. It's even easier if they're all in properties files, or whatever your language equivalent is.
Is it a good use of my time to refactor the literals into constants being that the app is pretty well established and runs well?
No, you better leave it like it is.
What would be the long-term benefits to doing so?
If no one ever touch that code, the benefits are none.
What you can do, however is avoid adding new literals. But I would pretty much leave existing the way they are.
You could probably refactor them in your free to sleep better.
Probably there are some other bugs already that need your attention. Fix those instead.
Finally, if you manage you add "refactoring" to your task list, go ahead!!!
I agree with JMD, just keep in mind that there is more to i18n than changing strings(currencies, UI must be adpated to Right-to-left languages etc)
Even if you don´t wish to 18n your application it would be useful to refactor your strings, since that string that is used today only once, maybe reused tomorrow several times, and if it´s hardcoded you may be not aware of it and star replicating string all over the place.
Best let sleeping dogs lie. If you need to change a string that's used eighteen-lumpy times, yeah, go ahead and turn it into a constant somewhere. If you find yourself working in a module that has a string that could be constant-ize, do it if you feel like it. But going through the whole app changing all strings to constants... that should be on the very bottom of the to-do list.

Resources