What language would provide stable processing of millions of dirty addresses into standardized format? - nlp

I'm currently using Node.js to build a program that takes uncleaned, mistyped, dirty addresses and converts them into a standardized format, with all components identified or filled in, for further digital use (for example, so that a computer can recognize all the addresses, which is improbable given how dirty they can be).
However, I feel I've finally hit the limits of Node.js for this task: there is so much data that a run either takes hours, or multithreading becomes unstable and crashes constantly. So I'd like to hear the pros and cons of other languages that might be a good fit for this use case.

I guess Python will do the job.
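For what it's worth, the expensive part here is the cleaning logic, not the language; the shape of a rule-based normalization pass looks much the same anywhere. A minimal Python sketch (the suffix table is a tiny hypothetical sample; real tables such as the USPS suffix list are far larger, and libraries like libpostal or usaddress already cover much of this ground):

    import re

    # Tiny hypothetical abbreviation table; real ones (e.g. USPS suffixes) are far larger.
    SUFFIXES = {"st": "Street", "str": "Street", "ave": "Avenue", "av": "Avenue",
                "rd": "Road", "blvd": "Boulevard", "dr": "Drive"}

    def normalize(raw):
        # Collapse whitespace and strip stray trailing punctuation.
        s = re.sub(r"\s+", " ", raw.strip().strip(",."))
        if not s:
            return s
        parts = s.split(" ")
        # Expand a recognized trailing suffix abbreviation.
        last = parts[-1].lower().rstrip(".")
        if last in SUFFIXES:
            parts[-1] = SUFFIXES[last]
        # Title-case words, leaving house numbers alone.
        return " ".join(p if p.isdigit() else p.capitalize() for p in parts)

    print(normalize("123  MAIN st."))  # -> "123 Main Street"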

Related

Securely running user's code

I am looking to create an AI environment where users can submit their own code for the AI and let them compete. The language could be anything, but something easy to learn like JavaScript or Python is preferred.
Basically I see three options with a couple of variants:
1. Make my own language, e.g. a JavaScript clone with only very basic features like variables, loops, conditionals, arrays, etc. This is a lot of work if I want to properly implement common language features.
1.1. Take an existing language and strip it to its core. Just remove lots of features from, say, Python until there is nothing left but the above (variables, conditionals, etc.). Still a lot of work, especially if I want to keep up to date with upstream (though I could also just ignore upstream).
2. Use a language's built-in features to lock it down. I know from PHP that you can disable functions, and, searching around, similar solutions seem to exist for Python (with lots and lots of caveats). For this I'd need a good understanding of all the language's features, so as not to miss anything.
2.1. Make a preprocessor that rejects code with dangerous stuff (preferably whitelist-based). Similar to option 1, except that I only have to implement the parser rather than all the features: the preprocessor has to understand the language well enough that you can have a variable named "eval" but not call the function named "eval". Still a lot of work, but more manageable than option 1 (see the sketch just after this list).
2.2. Run the code in a very locked-down environment: chroot, no unnecessary permissions... perhaps in a virtual machine or container. Something along those lines. I'd have to research how to achieve this and how to get the results back in a secure way, but that seems doable.
3. Manually read through all code. Doable on a small scale or with moderators, though still tedious and error-prone (I might miss stuff like if (user.id = 0)).
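To make 2.1 concrete, here is a toy whitelist checker in Python built on the standard ast module. The whitelist is hypothetical, and a real checker would also have to handle attribute access, getattr tricks, dunder methods, and much more:

    import ast

    ALLOWED_CALLS = {"print", "len", "range", "abs", "min", "max"}  # hypothetical whitelist

    def check(source):
        """Flag calls to anything outside the whitelist, plus all imports.
        Note it distinguishes calling eval() from merely naming a variable eval."""
        problems = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Call):
                name = node.func.id if isinstance(node.func, ast.Name) else "<non-name call>"
                if name not in ALLOWED_CALLS:
                    problems.append(f"line {node.lineno}: call to {name!r} not allowed")
            elif isinstance(node, (ast.Import, ast.ImportFrom)):
                problems.append(f"line {node.lineno}: imports not allowed")
        return problems

    print(check("eval = 42\nprint(eval)"))  # [] -- naming a variable eval is fine
    print(check("eval('2+2')"))             # flags the call to eval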
The way I imagine 2.2 working is like this: run both AIs in virtual machines (or something) and constrain them to communicate with the host machine only (no other Internet or LAN access). Each AI runs in a separate machine, and they communicate with each other (well, with the playing field, and thereby they see each other's positions) through an API running on the host.
Option 2.2 seems the most doable, but also relatively hacky... I let someone's code loose in a virtualized or locked-down environment, hoping that will keep them in while giving them free rein to DoS or break out of the environment. Then again, most other options are not much better.
TL;DR: in essence my question is: how do I let people give me 'logic' for an AI (which I think is most easily done using code) and then run that without compromising the functionality of the system? There must be at least 2 AIs working on the same playing field.
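As a rough sketch of what 2.2's innermost layer could look like in Python on Unix (the limits, paths, and lose-on-timeout rule are all placeholders of mine, and setrlimit alone is not a sandbox; you'd still want the container or VM around it):

    import resource
    import subprocess
    import sys

    def run_untrusted(path, cpu_seconds=2):
        """Run a submitted Python script with crude resource caps (Unix only).
        One layer of defense in depth, not a real sandbox by itself."""
        def set_limits():
            resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))  # CPU time
            resource.setrlimit(resource.RLIMIT_AS, (256 * 2**20, 256 * 2**20))   # 256 MB memory
            resource.setrlimit(resource.RLIMIT_NPROC, (16, 16))                  # no fork bombs
        try:
            return subprocess.run([sys.executable, path],
                                  capture_output=True, text=True,
                                  timeout=cpu_seconds + 1, preexec_fn=set_limits)
        except subprocess.TimeoutExpired:
            return None  # e.g. treat an overrunning AI as forfeiting the turn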
This is really just a plugin system, so researching how others implement plugins is a good starting point. In particular, I'd look at web browsers like Chrome and Safari and their plugin systems.
A common theme in modern plugin systems is process isolation. Ideally you should run the plugin in its own process space in a sandbox. On OS X, look at XPC, which is designed explicitly for this problem. On Linux (or more portably), I would probably look at NaCl (Native Client). The JVM is also designed to provide sandboxing, and offers a rich selection of languages. (That said, I don't personally consider the JVM a very strong sandbox. It's had a history of security problems.)
In general, my preference on these kinds of projects is a language-agnostic API. I most often use REST APIs (or "REST-like"). This allows the plugin to be highly restricted, while not restricting the language choice. I like simple HTTP for communications whenever possible because it has rich support in numerous languages, so it puts little restriction on the plugin. In fact, given your description, you wouldn't even have to run the plugin on your hardware (and certainly not on the main server). Making the plugins remote clients removes many potential concerns.
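To sketch what such a language-agnostic API could look like, here is a toy playing-field endpoint using only Python's standard library. The routes and JSON shapes are invented for illustration, and a real version would need authentication, turn enforcement, and move validation:

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    STATE = {"board": [], "turn": 1}  # placeholder playing field

    class Field(BaseHTTPRequestHandler):
        def do_GET(self):  # AIs poll the shared game state
            body = json.dumps(STATE).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

        def do_POST(self):  # AIs submit a move as JSON
            length = int(self.headers["Content-Length"])
            move = json.loads(self.rfile.read(length))
            STATE.setdefault("moves", []).append(move)  # validate in real code!
            self.send_response(204)
            self.end_headers()

    HTTPServer(("127.0.0.1", 8000), Field).serve_forever()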
But ultimately, I think something like your "2.2" is the right direction.

Does different language = different performance in CouchDB lists?

I am writing a list function in CouchDB. I want to know if using a faster language than JavaScript would boost performance (I was thinking Python, just because I know it).
Does anyone know if this is true, and has anyone tested whether it is true?
Generally, the different view engines are going to give you about the same speed.
Except Erlang, which is much faster.
The reason is that Erlang is what CouchDB is written in; for all other languages, the data needs to be converted into standard JSON, sent to the view server, then converted back to the native Erlang format for writing.
But this performance "boost" only happens on view generation, which typically happens out-of-line of a request, or only on the changed documents.
In other words, in real-world usage the performance difference between view servers is irrelevant most of the time.
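A quick way to get a feel for that serialization overhead, since it is what every document pays when it crosses the view-server boundary (pure illustration; the document is made up and actual numbers depend entirely on your machine and document shape):

    import json, time

    # A hypothetical document, stand-in for whatever your views process.
    doc = {"_id": "x", "type": "post", "tags": ["a", "b"], "n": 42, "body": "lorem " * 100}

    start = time.perf_counter()
    for _ in range(100_000):
        json.loads(json.dumps(doc))  # one encode/decode per document crossing the boundary
    print(f"{time.perf_counter() - start:.2f}s for 100k JSON round-trips")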
Here is the list of all the view server implementations: http://wiki.apache.org/couchdb/View_server
I've never used the python ones, but if that is where you are comfortable, go for it.
You can use the V8 engine for Couch if you want. A guy from IrisCouch wrote couchjs to do this (I've seen him on Stack Overflow quite a bit too).
https://github.com/iriscouch/couchjs
Also, for views, filtered replication, things like that, you can write the functions in Erlang instead of JavaScript. I've done that and seen around a 50% performance increase.
Seems you can write list functions in Erlang: http://tisba.de/2010/11/25/native-list-functions-with-couchdb/

Haskell for mission-critical systems [duplicate]

I've been curious to understand whether it is possible to apply the power of Haskell to the embedded real-time world, and while googling I found the Atom package. I'd assume that in a complex case the code might exhibit all the classical C bugs (crashes, memory corruption, etc.), which would then need to be traced back to the original Haskell code that caused them. So, this is the first part of the question: "If you have experience with Atom, how did you deal with the task of debugging low-level bugs in the compiled C code and fixing them in the original Haskell code?"
I searched for some more examples of Atom; this blog post mentions resulting C code of 22 KLOC (and, obviously, no code is shown), and the included example is a toy. This and this reference have a bit more practical code, but that is where it ends. The reason I put "sizable" in the subject is that I'm most interested in whether you can share experiences of working with generated C code in the range of 300 KLOC+.
As I am a Haskell newbie, there may obviously be other ways I did not find due to my unknown unknowns, so any other pointers for self-education in this area would be greatly appreciated. This is the second part of the question: "What would be some other practical methods (if any) of doing real-time development in Haskell?" If multicore is also in the picture, that's an extra plus :-)
(About using Haskell itself for this purpose: from what I read in this blog post, garbage collection and laziness make Haskell rather nondeterministic scheduling-wise, but maybe in the two years since, something has changed. The Real World Haskell programming question on SO was the closest I could find to this topic.)
Note: "real-time" above would be closer to "hard real-time". I'm curious whether it is possible to ensure that the pause time when the main task is not executing stays under 0.5 ms.
At Galois we use Haskell for two things:
Soft real-time (OS device layers, networking), where 1-5 ms response times are plausible. GHC generates fast code, and has plenty of support for tuning the garbage collector and scheduler to get the right timings.
For true real-time systems, EDSLs are used to generate code in other languages that provide stronger timing guarantees, e.g. Cryptol, Atom and Copilot.
So be careful to distinguish the EDSL (Copilot or Atom) from the host language (Haskell).
Some examples of critical systems (and in some cases real-time systems), either written in or generated from Haskell, produced by Galois:
EDSLs
Copilot: A Hard Real-Time Runtime Monitor -- a DSL for real-time avionics monitoring
Equivalence and Safety Checking in Cryptol -- a DSL for cryptographic components of critical systems
Systems
HaLVM -- a lightweight microkernel for embedded and mobile applications
TSE -- a cross-domain (security level) network appliance
It will be a long time before there is a Haskell system that fits in small memory and can guarantee sub-millisecond pause times. The community of Haskell implementors just doesn't seem to be interested in this kind of target.
There is healthy interest in using Haskell or something Haskell-like to compile down to something very efficient; for example, Bluespec compiles to hardware.
I don't think it will meet your needs, but if you're interested in functional programming and embedded systems you should learn about Erlang.
Andrew,
Yes, it can be tricky to debug problems through the generated code back to the original source. One thing Atom provides is a means to probe internal expressions, then leaves it up to the user how to handle those probes. For vehicle testing, we build a transmitter (in Atom) and stream the probes out over a CAN bus. We can then capture this data, format it, and view it with tools like GTKWave, either in post-processing or in real time. For software simulation, probes are handled differently: instead of getting probe data from a CAN protocol, hooks are made into the C code to lift the probe values directly. The probe values are then used in the unit-testing framework (distributed with Atom) to determine whether a test passes or fails, and to calculate simulation coverage.
I don't think Haskell, or other garbage-collected languages, are very well suited to hard real-time systems, as GCs tend to amortize their runtimes into short pauses.
Writing in Atom is not exactly programming in Haskell, as Haskell here can be seen as purely a preprocessor for the actual program you are writing.
I think Haskell is an awesome preprocessor, and using DSELs like Atom is probably a great way to create sizable hard real-time systems, but I don't know whether Atom fits the bill. If it doesn't, I'm pretty sure it is possible (and I encourage anyone who tries!) to implement a DSEL that does.
Having a very strong pre-processor like Haskell for a low-level language opens up a huge window of opportunity to implement abstractions through code-generation that are much more clumsy when implemented as C code text generators.
I've been fooling around with Atom. It is pretty cool, but I think it is best for small systems. Yes it runs in trucks and buses and implements real-world, critical applications, but that doesn't mean those applications are necessarily large or complex. It really is for hard-real-time apps and goes to great lengths to make every operation take the exact same amount of time. For example, instead of an if/else statement that conditionally executes one of two code branches that might differ in running time, it has a "mux" statement that always executes both branches before conditionally selecting one of the two computed values (so the total execution time is the same whichever value is selected). It doesn't have any significant type system other than built-in types (comparable to C's) that are enforced through GADT values passed through the Atom monad. The author is working on a static verification tool that analyzes the output C code, which is pretty cool (it uses an SMT solver), but I think Atom would benefit from more source-level features and checks. Even in my toy-sized app (LED flashlight controller), I've made a number of newbie errors that someone more experienced with the package might avoid, but that resulted in buggy output code that I'd rather have been caught by the compiler instead of through testing. On the other hand, it's still at version 0.1.something so improvements are undoubtedly coming.
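To illustrate the mux idea outside of Atom (this is a conceptual Python analogy, not Atom code, and Python itself of course gives no timing guarantees):

    # Hypothetical branch computations; in Atom these would be expressions.
    def compute_braking():
        return -1.0

    def compute_acceleration():
        return +1.0

    def mux(cond, a, b):
        # By the time we get here, Python's eager evaluation has already
        # computed BOTH branches; only the selection depends on cond.
        return a if cond else b

    obstacle_ahead = True
    command = mux(obstacle_ahead, compute_braking(), compute_acceleration())
    print(command)  # -1.0

With an ordinary if/else, only one branch would run and the timing would depend on the condition; the mux shape does the work of both branches every time, which is the property a hard real-time schedule wants.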

Finding Vulnerabilities in Software

I'm interested to know the techniques that are used to discover vulnerabilities. I know the theory behind buffer overflows, format string exploits, etc., and I have also written some of them. But I still don't see how to find a vulnerability in an efficient way.
I'm not looking for a magic wand, only for the most common techniques. I think reading through the whole source is an epic task for some projects, assuming you even have access to the source. Manually fuzzing the input isn't comfortable either. So I'm wondering about tools that help.
For example, I don't understand how the dev teams can find vulnerabilities to jailbreak iPhones so fast. They don't have the source code, they can't execute arbitrary programs, and since there is only a small number of default programs, I wouldn't expect a large number of security holes. So how do they find this kind of vulnerability so quickly?
Thank you in advance.
On the lower layers, manually examining memory can be very revealing. You can certainly view memory with a tool like Visual Studio, and I would imagine that someone has even written a tool to crudely reconstruct an application based on the instructions it executes and the data structures it places into memory.
On the web, I have found many sequence-related exploits by simply reversing the order in which an operation occurs (for example, an online transaction). Because the server is stateful but the client is stateless, you can rapidly exploit a poorly-designed process by emulating a different sequence.
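A toy illustration of that kind of sequence flaw, with entirely hypothetical endpoints, assuming the third-party requests library:

    import requests  # third-party; pip install requests

    BASE = "https://shop.example.com"  # hypothetical target
    s = requests.Session()

    # Intended flow: /cart/add -> /order/pay -> /order/confirm.
    # A stateful server with a stateless client may not enforce the order:
    s.post(f"{BASE}/cart/add", json={"item": 1234})
    r = s.post(f"{BASE}/order/confirm")  # skip straight to confirm, never paying
    print(r.status_code)                 # a 200 here would be the tell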
As to the speed of discovery: I think quantity often trumps brilliance...put a piece of software, even a good one, in the hands of a million bored/curious/motivated people, and vulnerabilities are bound to be discovered. There is a tremendous rush to get products out the door.
There is no efficient way to do this, as firms spend a good deal of money to produce and maintain secure software. Ideally, their work on securing software does not start with looking for vulnerabilities in the finished product; many vulns have already been eradicated by the time the software ships.
Back to your question: it will depend on what you have (working binaries, complete/partial source code, etc.). On the other hand, the goal is not finding ANY vulnerability, but the ones that count (e.g., those that matter to the client of the audit, or to the software owner). Right?
This will help you understand the inputs and functions you need to worry about. Once you have localized these, you may already have a feeling for the software's quality: if it isn't very good, then fuzzing will probably find you some bugs. Otherwise, you need to start understanding these functions and how the input is used within the code, to see whether the code can be subverted in any way.
Some experience will help you weigh how much effort to put into each task and when to push further. For example, if you see some bad practices being used, delve deeper. If you see crypto being implemented from scratch, delve deeper. Etc.
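Since fuzzing keeps coming up: a minimal mutation fuzzer fits in a few lines of Python. The seed file and ./target binary below are placeholders, and real fuzzers (AFL and friends) add coverage feedback and corpus management on top of this basic idea:

    import random
    import subprocess

    seed = open("sample.gif", "rb").read()  # hypothetical seed input for a hypothetical ./target

    for i in range(1000):
        data = bytearray(seed)
        # Flip a handful of random bytes per iteration.
        for _ in range(random.randint(1, 8)):
            data[random.randrange(len(data))] = random.randrange(256)
        with open("fuzzed.bin", "wb") as f:
            f.write(data)
        proc = subprocess.run(["./target", "fuzzed.bin"], capture_output=True)
        if proc.returncode < 0:  # on POSIX, negative means killed by a signal
            print(f"iteration {i}: crashed with signal {-proc.returncode}")
            with open(f"crash_{i}.bin", "wb") as f:
                f.write(data)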
Aside from buffer overflow and format string exploits, you may want to read a bit about code injection (a lot of what you'll come across will be web/DB related, but dig deeper). AFAIK this was a huge force in jailbreaking the iThingies. Saurik's MobileSubstrate allows (allowed?) you to load third-party .dylibs and call any code contained in them.

When to mix languages?

What are some situations where languages should be mixed?
I'm not talking about using ASP.NET with C# and HTML, or an application written in C accessing a SQL database through SQL queries. I'm talking about things like mixing C++ with Fortran, or Ada with Haskell, for example.
[EDIT]
First of all: thank you for all your answers.
When I asked this question I had in mind that you always read "every language has its special purpose".
In general, you can get almost everything done in any language by using special libraries. But if you are interested in learning different languages, why not pick the programming language that serves your purpose best, instead of a library that solves a problem your language wasn't originally designed for?
For example, in video games we use different languages for different purposes:
Application (game) code: has to be fast, organized, and most of the time cross-platform (at least Windows; macOS should be envisaged), often on constraint-heavy platforms (consoles), so C++ (and sometimes C and asm) is used.
Development tools: level-design tools generate data that the game code will play with. These tools don't need to run on the target platform (though it's easier to debug if they can), so they are often made with fast-development languages such as C#, Python, etc.
Scripting system: some parts of the game will have to be tweaked by the designers, using variables or scripts. It's far easier and cheaper to embed a scripting language than to write one, so Lua or similar scripting languages are often used.
Web application: sometimes a game needs to provide some data online, most often in a database accessed with SQL. The web application is then written in a language that might be C#, Ruby (RoR), Python, PHP, or anything else that is good for the job. As it's about the web, you then have to use HTML/JavaScript too.
etc...
In my game I use HTML/Javascript for GUI too.
[EDIT]
To answer your edit: the language you know best is not always the most efficient tool for the work. That's why, for example, I use C++ for my home-made game, because I know it best (I could use a lot of other languages, as the targets are Win/Mac/Linux, not consoles), but I use Python for everything related to the build process, file manipulation, etc. I don't know Python in depth, but it's far easier to do quick file manipulation with it than with C++. I wouldn't use C++ for a web application, for obvious reasons.
In the end, you use what is efficient for the job. That's what you learn by working under real-world constraints, with money, time and quality in mind.
Well, the most obvious (and the most common) situation would be when you use some high level language to make most of your program, reaping the benefits of fast development and robustness, while using some lower level language like C or even assembly to gain speed where it is important.
Also, many times it is necessary to interface with other software written in some other language. A good example here are APIs exposed by the operating system - they're usually written with C in mind (though I remember some old MacOS versions using Pascal). If you don't have a native binding for your language-compiler infrastructure, you have to write some interface code to "glue" your program with "the other side".
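A small concrete example of such glue in Python, using the standard ctypes module to call a C function from the system C library on a Unix-like system (library names differ across platforms, hence find_library):

    import ctypes
    import ctypes.util

    # Locate and load the system C library ("libc.so.6" on Linux, etc.).
    libc = ctypes.CDLL(ctypes.util.find_library("c"))

    # Declare strlen's C signature so ctypes converts arguments correctly.
    libc.strlen.argtypes = [ctypes.c_char_p]
    libc.strlen.restype = ctypes.c_size_t

    print(libc.strlen(b"hello from C"))  # -> 12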
There are also some domain-specific languages that are tuned specifically to efficiently express some type of computation. You usually don't write your entire program in them, just some parts where it is the appropriate tool. Prolog is a good example.
Last but not least, you sometimes have heaps of old, tested code written in another language at hand which you could benefit from. Instead of reinventing the wheel in a new and better language, you may simply want to interface it with your new program. This is probably the most common (if not the only) case where languages geared toward similar uses are mixed together (was that C++ and Fortran you mentioned?).
Generally, different languages have different strengths and weaknesses. You should try to use the appropriate tool for the job at hand. If the benefits from expressing some parts of the program in a different language are greater than the problems this introduces, then it's best to go for it.
I know in your question you sort of ruled this out, but different languages are used for different domains.
Right now I am working on a data visualizer. The data is in a database, so of course there is some SQL, but that hardly counts because it's small and required frequently. The data is turned into a series of graphs; I'm using R, which is like MATLAB but open source. It is a unique statistical language with some advanced plotting features.
A data visualizer isn't just a graph generator, so there needs to be a way to browse and navigate this pile of image files. We opted to use HTML with embedded JavaScript to build an offline "application" that can be easily distributed. It's offline in the sense that it is self-contained: the HTML is carefully generated, and the JS inside it is carefully crafted to allow the user to browse thousands of images, sorting or filtering by a number of criteria.
How do you carefully craft JavaScript and HTML based on a database structure that changes as the rest of my team makes progress? They are generated by a Perl program (a single-pass script, really) that reads the DB for some structure and key information and then outputs over 300 kilobytes of HTML/JS. It's not entirely trivial HTML either: image maps carefully aligned with the R plots and some onclick() JavaScript allow the user to actually interact with a plain image plot, so the whole thing feels like a real data browser/visualizer application.
That's four 'languages', five if you count SQL, just to make a single end product.
I don't think doing this in a single language would be a good choice, because we are exploiting the capabilities of a real web browser to give us a free GUI and frontend.
An excellent current example would be writing methods for creating XML documents in VB.NET, which has an "XML Literals" feature that C# lacks. Since they're both .NET languages, there's no reason not to call one from the other:
Public Function GetEmployeeXml(ByVal salesTerritoryKey As Integer) As XElement
    Using context As New AdventureWorksDW2008Entities
        Dim x = <x>
                    <%= From s In context.DimSalesTerritory _
                        Where s.SalesTerritoryKey = salesTerritoryKey _
                        Select _
                        <SalesTerritory
                            region=<%= s.SalesTerritoryRegion %>
                            country=<%= s.SalesTerritoryCountry %>>
                            <%= From e In s.DimEmployee _
                                Select _
                                <Employee firstName=<%= e.FirstName %> lastName=<%= e.LastName %>>
                                    <%= From sale In e.FactResellerSales _
                                        Select _
                                        <Sale
                                            orderNumber=<%= sale.SalesOrderNumber %>
                                            price=<%= sale.ExtendedAmount %>/> %>
                                </Employee> %>
                        </SalesTerritory> %>
                </x>
        Return x
    End Using
End Function
The biggest reason you would mix languages is that one language has advantages in certain areas where another has advantages elsewhere, so you try to harness the capabilities of both by throwing them together. A common example is using C and ASM together: C is more abstract and makes it easy to do the more complex stuff in a program, while ASM lets you do base-level work with the hardware and the processor. So they are often mixed, given the nature of C and its use in embedded systems.
Depending on what paradigm a language falls under, it has its pros and cons. For example, one might want to use the GUI capabilities of C#, but use the efficient backtracking or artificial-intelligence capabilities of Prolog (a language in the logic paradigm) to compute some details before displaying them.
A simplified example
Imagine trying to write a program that allows a user to play Chess against a computer.
(Potentially) I would:
1.) Create the visuals with Windows GUI Libraries.
2.) Calculate the possible moves using Prolog, and choose the best/most viable move.
3.) Retrieve the results from step 2 from Prolog in my C# code, and render the results.
4.) Allow for the gameplay and rapid development of visuals and UI in C#, and rely on Prolog for the calculations/backtracking.
This most often happens when you already have code written in two or more different languages and you notice that it makes sense to combine the programs. Rewriting is expensive and takes time.
In the financial world you often have to keep programs alive (or replace them) for 50 years. With technology replacement every 10 years, new contracts (mortgages, life insurance) are created in the newest language/environment. The four older systems just handle the monthly payments and changes to existing contracts. To know how the company is doing, you need to integrate data from all five systems.
I suppose keeping the old ones alive is cheaper than migrating everything to the newest technology each time. From a risk-avoidance point of view it makes sense.
Web applications should probably not be mixed-language for the developer. Smalltalk does just fine, with GemStone for persistence and Seaside as the web application framework. The additional languages (JavaScript) can be hidden in the framework.
When 2 heads are better than one.
I've commonly seen games where Flash was embedded with C#, with the AI and other heavy code running off a C++ DLL.
When forced to
Writing new code to augment a new system supported by an old framework
For instance, in games (which are pretty hardcore applications) you usually have a very tight C++ engine that does all the heavy lifting, and a scripting language (such as Lua) that's accessible and suited to making that collection of special cases we call a 'game' happen.
People here have covered most of the reasons. I'll just add this one:
There is also the case of graceful degradation. For instance, I am working on a legacy intranet, and we're changing little by little from one language to another, so at this point we have different languages in the same system.
Device programming: typically you would want to program the UI in the OS's native UI language (Java for Android, for example), but would need to program the device drivers in something that gets more into the low level (like C).
