I am looking for a natural language tool that can automatically de-identify English text. For example, every email address should be renamed or obscured. Proper names should also be de-identified, as should addresses and so on.
There is a MITRE Identification Scrubber Toolkit. I don't know how well it works.
My questions:
Are there any other tools out there?
Does anyone have experience with the MITRE tool? How well does it work?
Thanks.
De-identification (perhaps more often referred to as anonymization) is a very active research area, as its success is obviously a requirement for the use of authentic text corpora in fields such as NLP for healthcare, medicine and the like. I recommend that you look at the tools listed in the answer to this question on CrossValidated. If you follow the links further, you will find research papers describing how these tools work, with further references and evaluation results.
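To make the simplest part of the question concrete (obscuring email addresses), here is a minimal sketch in Python. It is not how the MITRE toolkit or any of the referenced tools work. The pattern, the function name `scrub_emails` and the `[EMAIL]` placeholder are assumptions chosen for illustration, and the regular expression covers only common address shapes. Names and street addresses cannot be handled this way and generally need a trained named-entity recognizer.

```python
import re

# Deliberately simple pattern (an assumption for this sketch); real email
# syntax is much broader than what this matches.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_emails(text, placeholder="[EMAIL]"):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub(placeholder, text)


print(scrub_emails("Contact jane.doe@example.com or bob@mail.example.org."))
# prints: Contact [EMAIL] or [EMAIL].
```

A fixed placeholder loses the information that two mentions refer to the same address. If that matters, the replacement could instead map each distinct address to a numbered token.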
I am looking for some resources pertaining to the parsing and understanding of English (or just human language in general). While this is obviously a fairly complicated and wide field of study, I was wondering if anyone had any book or internet recommendations for study of the subject. I am aware of the basics, such as searching for copulas to draw word relationships, but anything you guys recommend I will be sure to thoroughly read.
Thanks.
Check out WordNet.
You probably want a book like "Representation and Inference for Natural Language - A First Course in Computational Semantics"
http://homepages.inf.ed.ac.uk/jbos/comsem/book1.html
Another way is looking at existing tools that already do the job on the basis of research papers: http://nlp.stanford.edu/index.shtml
I've used this tool once, and it's very nice. There's even an online version that lets you parse English and draws dependency trees and so on.
So you can start taking a look at their papers or the code itself.
Anyway, take into consideration that, in any field, what you get from such generic tools is almost never exactly what you want, in the sense that the semantics these tools assign are not what you would expect. In most cases, given a specific constrained domain, it's preferable to roll your own parser and do your best to avoid any ambiguities beforehand.
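As a rough illustration of what "rolling your own parser" for a constrained domain can look like, here is a minimal Python sketch. The one-rule command grammar (`move <steps> <direction>`), the `DIRECTIONS` set and the function name `parse_command` are all hypothetical, invented for this example. The point is only that a tiny fixed grammar leaves no room for ambiguity.

```python
# Hypothetical vocabulary for a toy command language.
DIRECTIONS = {"north", "south", "east", "west"}


def parse_command(sentence):
    """Parse 'move <steps> <direction>' into a dict, or raise ValueError."""
    tokens = sentence.lower().split()
    if len(tokens) != 3 or tokens[0] != "move":
        raise ValueError(f"expected 'move <steps> <direction>', got {sentence!r}")
    _, steps, direction = tokens
    if not steps.isdigit():
        raise ValueError(f"step count must be a non-negative integer, got {steps!r}")
    if direction not in DIRECTIONS:
        raise ValueError(f"unknown direction {direction!r}")
    return {"action": "move", "steps": int(steps), "direction": direction}


print(parse_command("Move 3 north"))
# prints: {'action': 'move', 'steps': 3, 'direction': 'north'}
```

Anything outside the grammar is rejected with an error instead of being guessed at, which is the trade-off against a general-purpose parser.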
The process that you describe is called natural language understanding. There are various algorithms and software tools that have been developed for this purpose.
The title may seem slightly self-contradictory, and I accept that you can't really learn a language quickly. However, an experienced programmer who already has knowledge of a few languages and different styles (functional, OO, imperative etc.) often wants to get started quickly. I've seen a few websites doing effective "translations" in the form of "just show me syntax equivalence". I can't remember the sites now, but for related languages (e.g. Perl/PHP) it's quite common.
Is there a better resource that covers more languages? Is there a resource that covers idioms as well as syntax? I think this would be incredibly useful for doing small amounts of work on existing code bases where you are not familiar with the language. Looking at the existing code, as we know, is not always a good indicator of quality. Likewise, for "learn by doing" weekend project I always have the urge to write reasonably idiomatic, clean code from the start. Such a resource could also link to known good example projects of varying sizes for those that prefer to learn by reading. Reading a well-written medium sized code base can also be much more practical when access to development environments might be limited.
I think it's possible to find tutorials and summaries for individual languages that provide some of this functionality in disparate web locations but I'm hoping there is a good, centralised, comparative place that the busy programmer can turn to.
You generally have two main things to overcome:
Syntax
Reference
Syntax you can pick up fairly quickly with a language tutorial and a stack of sample code.
Reference (library/API calls) you need to find a proper guide to; perhaps the language reference, or perhaps google...
With those two in place, following a walkthrough (to get you used to using the development environment) will have you pretty much ready - you'll be able to look up what you want to say (reference), and know how to say it (syntax).
This, of course, applies principally to procedural/oop languages; languages that require a paradigm switch (ML/Haskell) you should go to lectures for ;)
(and for the weirder moments, there's SO!)
In the past, my preferred approach has been "learning by doing". For example, I knew a little C++ and a lot of C#/.NET, but I had to write an FTP tool in Python.
So I sat down for an hour and went over the syntax differences using a tutorial, then I built the form itself and looked at the generated code. Then I found an open-source Python FTP client and took pieces of its code (not by copy and paste; I typed it out myself to see, feel and remember the code!).
After a few hours I got it.
So: the mix is the best. A book, a piece of good code, the will to learn and a free night with plenty of coffee.
At the risk of sounding cheesy, I would start with the language's website tutorial and/or FAQ, followed by asking more specific questions here. SO is my centralized location for programming knowledge.
I remember when I learned Perl. I was asked to modify some Perl code at work and I'd never seen the language before. I had experience with several other languages, however, so it wasn't hard to figure out the syntax with the online Perl docs in one window and the code in another, side-by-side. I don't know that solely reading existing code is necessarily the best way to learn. In my case, I didn't know Perl but I could tell that the person who originally wrote the code didn't know Perl either. I'm not sure I could've distinguished between good Perl and really confusing Perl. It would've been nice to be able to ask questions here at the time.
Language isn't important. What is important is learning your way around designing algorithms and the proper application of design patterns. Focus on the technique, not the language that implements a certain technique. Once you understand the proper development techniques, any programming language will become really easy, no matter how obscure it is...
When you put a focus on a language, you're restricting your own knowledge.
http://devcheatsheet.com/ seems to be a step in the right direction: it aggregates cheat sheets/quick references and they are (somewhat) manually reviewed. It's also wide-ranging. It still comes up short a bit in terms of "idiomatic" quick reference: for example, the page on Ruby doesn't mention yield.
Rosetta Code appears to be an excellent resource that includes hints on coding idiomatically and moves from simple (like for-loops) to things like drawing. I haven't checked out how comprehensive it is, but there are a large number of languages and tasks listed. The drawbacks re: original question are:
Some of the linking is not accurate (navigating Python->ForLoop will take you to the top of the ForLoop page, not the Python section). It's a wiki, this can be improved.
Ideally you could "slice" the wiki however you chose to see e.g. the top 20 tasks for two languages side-by-side.
http://hyperpolyglot.org/ seems to be an almost perfect match for what I was looking for. The quality is not always there, or idiom can be lacking, but it has the same intention and is pretty comprehensive.
Have you ever tried learning a language while on a project? I have, and from my personal experience I can say that it takes courage, effort, time, thinking, lots of caffeine and no sleep. Sometimes this has to be done without choice; other times you choose to do it, for example if you are working on a personal project.
What I normally do in this kind of situation, and I believe everyone does, is "build" on top of my current knowledge of languages, structures, syntax and logic. What I find difficult to cope with, is the difference of integrity in some cases. Some languages offer a good background for future learning and "language study", they pose as a good source of information or a frame of reference and can give a "firm" grasp of what's to come. Other languages form or introduce a new way of thinking and are harder to get used to.
Sometimes you unintentionally think in a specific language, and being introduced to a new way of thinking, a new language, can cause confusion or make you get lost between the "borders" of your new and your current knowledge of languages.
What can be a good solution in this case? What should be used to broaden the knowledge of the new language, a new way of thinking, and maintain or incorporate the current knowledge of other languages inside the "borders" of the new language?
I find I need to do a project to properly learn a language, but those can be personal projects. When I learned Python on the job, I first expected (and found) a significant slowdown in my productivity for a while. I read the standard tutorials, coding standards and I lurked on the Python list for a while, which gave me a much better idea of the best practices of the language.
Doing things like coding dojos and stuff when learning a language can help you get a feel for things. I just recently changed jobs and went back to Java, and I spent some time working on toy programs just to get back in the feel for things (I'm also reading Effective Java, 2nd edition as my previous major experience had been with Java 1.4).
I think that, in some respects, no matter what the impetus for learning the language, you have to start by imitating good patterns in the new language. Whether that means finding a good book with excellent code examples, good online tutorials, or following the lead of a more experienced developer, you have to absorb what it means to write good code in a particular language first. Once you have developed a level of comfort, you can start branching out and experimenting with alternatives to the patterns that you've learned, looking for ways to apply things you've learned from other languages, but keeping within the "rules" of the language. Eventually, you'll get to the point where you know you can "break the rules" that you learned earlier because you have enough experience to know when they do/don't apply.
My personal preference, even when forced to learn a new language, is to start with some throwaway code. Even starting from good tutorials, you'll undoubtedly write code that you will later look back on and wonder how you could have been so stupid. I prefer, if possible, that my first foray into a language be code that will be thrown away and not come back to haunt me later. The alternative is to spend a lot of time refactoring as you learn more and more. Eventually, you'll end up doing this, too.
I would like to mention ALT.NET here
Self-organizing, ad-hoc community of developers bound by a desire to improve ourselves, challenge assumptions, and help each other pursue excellence in the practice of software development.
So in the spirit of ALT.NET, it is challenging but useful to reach out of your comfort zone to learn new languages. Some things that really helped me are as follows:
Understand the history behind a language or script. Knowing evolution helps a lot.
Pick the right book. Research StackOverflow and Amazon.com to find the right book to help you ease the growing pains.
OOP is fairly common in most of the mature languages, so you can skip many of the chapters related to OOP in many books. Syntax learning will be a gradual process. I commonly bookmark some quick handy guides for that.
Read as many community forums as possible to understand the common pitfalls of the new language.
Attend some local meetups to interact with the community and share your pains.
Take one pitch at a time by building small, not-so-complicated applications and thereby gaining momentum.
Make sure you create a reference frame for what you need to learn. Things like how security, logging, multithreading are handled.
Be open-minded. You can be critical, but if you hate something then do not learn that language.
Finally, I think it is worthwhile to learn one strong language like C# or Java, one functional language and one scripting language like Ruby or Python.
These things helped me tremendously, and I think they will help all software engineers and architects to gear up for any development environment.
I learned PHP after I was hired to be the project lead on the Zend Framework project.
It helped that I had 20 years of professional programming background, and good knowledge of C, Java, Perl, JavaScript, SQL, etc. I've also gravitated towards dynamic scripting languages for most of my career. I've written applications in awk, frameworks in shell, macro packages in troff, I even wrote a forum using only sed.
Things to help learn a language on the job:
Reading code and documentation.
Listening to mailing lists and blogs of the community.
Talking to experts in the language, fortunately several of whom were my immediate teammates.
Writing practice code, and asking for code reviews and coaching (Zend_Console_Getopt was my first significant PHP contribution).
Learning the tools that go along with the language. PHPUnit, Xdebug, phpDoc, phing, etc.
Of course I did apply what I knew from other programming languages. Many computer science concepts are language-universal. The differences of a given language are often simply idiomatic, a way of stating something that can be done another way in another language. This is especially true for languages like Perl or PHP, which both borrow a lot of idioms from earlier languages.
It also helped that I took courses in Compiler Design in college. Having a good foundation in how languages are constructed makes it easier to pick up new languages. At some level, they're all just ways of abstracting runtime stacks and object references.
If you're a junior member of the team and don't know the language, this is not necessarily an issue at all. As long as there is some code review and supervision, you can be productive.
Language syntax is one issue, but architectural differences are a more important concern. Many languages are also development platforms, and if you don't have experience with the platform, you don't know how to create a viable solution architecture. So if you're the project lead or working solo, you'd better have some experience on the platform before you do your design work.
For example, I would say an experienced C# coder with no VB experience would probably survive a VB.NET project just fine. In fact, it would be more difficult for a developer who only had experience in C#/ASP.NET to complete a C# WPF project than a VB ASP.NET project. An experienced PHP developer might hesitate a bit on a ColdFusion project, but they probably won't make any serious blunders because they are familiar with a script based web development architecture.
Many concepts, such as object modelling and database query strategies, translate just fine between languages. But there is always a learning curve for a new platform, and sometimes it can be quite nasty. The worst case is that the project must be thrown out because the architecture is too wrong to refactor.
I like to learn a new language while working on a project, because a real project will usually force me to learn aspects of the language that I might otherwise skip. One of the first things I like to do is read code in that language, and jump in. I find resources (such as books and various internet sites) to help as I go along.
Then, after I've been working on it for a while, I like to read (or re-read) books or other resources on the language. By this time I have some knowledge, so this will help solidify some things and also point out areas where I am flat-out wrong in my understanding. For instance, I can see that I was making incorrect assumptions about similarities between languages.
This also applies to tools -- after using a tool for a while and learning the basics, reading (or skimming) the documentation can teach me a lot.
In my opinion, you should try to avoid that. I know, most of the time you can't, but in any case try not to mix the new language with the old one, and never add old habits, practices and patterns to the mixture.
Always try to find resources that will help you get through the new language in the way the language works, not in the way other languages do; that will never have a happy ending, and if it does it will be very hard to modify it to the right way.
Cheers.
Yes I have.
I mean, is there another way? The only language I ever learned that was not on a project was ABC basic, which was what you used on my first computer.
I would recommend if you start with a certain language, stick with it. I only say that because many times in the past I tried more and more different ones, and the one I started out with was the best :D
Every time I have/want to learn a new language, I force myself to find something to code.
But to be sure I did it well, I always want to be able to check my code and what it outputs.
To do so, I just try to do the same kind of stuff with languages I know and to compare the outputs. For that, I created a little project (hosted on Github) with an exercise sheet and the correction for every language I learnt. It's a good way to learn in my opinion because it gives you a real little project.
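The output comparison described above can be automated. Below is a minimal sketch in Python, not the author's actual GitHub project: the helper names `run` and `same_output` are hypothetical, and both "implementations" are stand-in one-liners run through the Python interpreter so that the example is self-contained. In practice each command would invoke the solution written in a different language.

```python
import subprocess
import sys


def run(cmd, stdin_text=""):
    """Run a command, feed it stdin_text, and return what it wrote to stdout."""
    result = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, check=True
    )
    return result.stdout


def same_output(cmd_a, cmd_b, stdin_text=""):
    """True if both commands print exactly the same text for the same input."""
    return run(cmd_a, stdin_text) == run(cmd_b, stdin_text)


# Stand-ins for "the exercise solved in a language I know" and
# "the same exercise solved in the language I am learning".
reference = [sys.executable, "-c", "print(sum(range(10)))"]
candidate = [sys.executable, "-c", "print(45)"]

print(same_output(reference, candidate))
# prints: True
```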
I have no programming experience but am interested in learning a language.
So reading this section "http://wiki.freaks-unidos.net/weblogs/azul/principles-of-software#extend-your-language-to-match-your-domain" made me curious about programming a single application in 2 or more languages.
How is it actually done?
A few thoughts:
The page you linked to explains pretty clearly how it's done
If you are interested in learning a language, this is probably not the place to start
Programming a single application in two or more languages is only marginally related to the linked document.
Still, in the face of all that, I'll try to give an example of how this works by analogy.
Suppose you need to work with a group of people on some technical task--ranking chess puzzles by difficulty or testing marshmallows for contamination or something. Suppose further that one of the people on your team speaks only Japanese, another only Portuguese, and the third only Esperanto.
Assuming you are blessed with the ability to speak all of these languages fluently, your best bet is to make up an artificial language specialized to the task at hand; this is called a Domain Specific Language, or DSL. It should have all the terminology you need to talk about knights and rooks or silicate nanoparticles or whatever for the task, and not much else. Teach this to each of your team members, and then you can give them all their instructions at the same time. They can talk to each other about what they are doing and ask for help (so long as it's related to something covered by your language) as if they all spoke the same language.
That's roughly what he's talking about.
I think you may be trying to run before you can walk. The concepts in there probably require a little programming experience to start with.
The thrust of the article (which is, frankly, poorly expressed) is that when you are programming you often encounter tasks that benefit from a declarative syntax, i.e. you should be able to express the intent of what you want to do and leave the implementation details to a library. A good example is querying a database: it's (usually) much more readable to declaratively describe what you want and let some middleware figure out the best way to do it. SQL and LINQ are two examples of a declarative mechanism for querying data.
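To make the database example concrete, here is a small sketch in Python using the standard-library `sqlite3` module. It also shows one common way a single application ends up using two languages: Python as the host language, with SQL embedded as strings for the querying. The `puzzles` table and its rows are made up for illustration.

```python
import sqlite3

# Hypothetical in-memory data set, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE puzzles (name TEXT, difficulty INTEGER)")
conn.executemany(
    "INSERT INTO puzzles VALUES (?, ?)",
    [("fork", 2), ("smothered mate", 5), ("back rank", 3)],
)

# Imperative style: fetch everything, then spell out HOW to filter and sort.
hard_imperative = []
for name, difficulty in conn.execute("SELECT name, difficulty FROM puzzles"):
    if difficulty >= 3:
        hard_imperative.append(name)
hard_imperative.sort()

# Declarative style: state WHAT is wanted and let the database engine decide how.
query = "SELECT name FROM puzzles WHERE difficulty >= ? ORDER BY name"
hard_declarative = [name for (name,) in conn.execute(query, (3,))]

print(hard_imperative)
print(hard_declarative)
# both lines print: ['back rank', 'smothered mate']
```

Both versions give the same answer. The declarative one leaves filtering and ordering to the database engine, which is free to choose an index or a different strategy without the Python code changing.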
This is a very interesting topic, but honestly if you have no programming experience it's probably more of a 201 subject than a 101 subject, get your basics down first.