10 March 2012

Stanford NLP Class

These are my notes on Stanford's online NLP class.

The first few lectures said that a lot of the hard work in NLP, notably in tokenizers, is done with regular expressions. This was not entirely surprising as a good fraction of the string processing I have done in my professional career has been done with regular expressions.

Programming exercises can be done in Python or Java. I chose Python as I have found it well suited to simple string manipulation programs in the past.

The first programming exercise is to extract phone numbers and email addresses from web pages. A training set of Stanford computer science faculty home pages was supplied along with some starter code to show the required formatting. The starter code helpfully computed lists of true positives, false positives and false negatives.
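As a rough illustration of what such an extractor and its scoring might look like, here is a minimal sketch. The patterns are deliberately simple, and the names `EMAIL_RE`, `PHONE_RE`, `extract_contacts` and `score` are invented for this example; they are not taken from the course's starter code, and the sample text is made up.

```python
import re

# Hypothetical, deliberately simple patterns (not the course's solution).
# Email: user part, "@", then a domain with at least one dot.
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._-]+)\s*@\s*([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
)
# US-style phone: optional parentheses around the area code,
# then 3 digits and 4 digits separated by spaces, dots or hyphens.
PHONE_RE = re.compile(r"\(?(\d{3})\)?[\s.-]*(\d{3})[\s.-]+(\d{4})")


def extract_contacts(text):
    """Return (emails, phones) found in text, in a normalized form."""
    emails = [f"{user}@{domain}" for user, domain in EMAIL_RE.findall(text)]
    phones = ["-".join(parts) for parts in PHONE_RE.findall(text)]
    return emails, phones


def score(guesses, gold):
    """Split guesses into true positives, false positives, false negatives."""
    guesses, gold = set(guesses), set(gold)
    true_pos = sorted(guesses & gold)
    false_pos = sorted(guesses - gold)
    false_neg = sorted(gold - guesses)
    return true_pos, false_pos, false_neg


sample = "Contact: jane@cs.example.edu or (650) 555-0100."
emails, phones = extract_contacts(sample)
tp, fp, fn = score(emails, ["jane@cs.example.edu", "bob@cs.example.edu"])
```

Here the one address found counts as a true positive, and the gold address that was never matched shows up as a false negative, which is the signal to loosen a pattern.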

My experience with problems like these is to

  1. get the test samples to pass, by 
    1. loosening matches and adding more detection to detect all the addresses and phone numbers
    2. tightening matches to avoid false positives
  2. while taking care to make decisions that are likely to generalize well to as yet unseen samples
Step 2 takes some judgement, as it is not clear what will generalize well. For example, in the samples " DOT " was used to mask "." in email addresses. It seems wise to match all cases of " DOT ", but then I found that one faculty member used " DOM " as the alias for ".". The question was then whether to generalize from " DOT " and " DOM " to " DO<any character> ", or to treat " DOM " as a one-off, as it had been observed only once.
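The trade-off can be sketched with two candidate patterns. This is only an illustration under assumptions: the names `NARROW_DOT`, `BROAD_DOT` and `unmask_dots` are made up here, "any character" is read as any single non-whitespace character, and the example strings are invented rather than taken from the training pages.

```python
import re

# Option A: match only the aliases actually observed in the samples.
NARROW_DOT = re.compile(r"\s+(?:DOT|DOM)\s+")
# Option B: generalize to "DO" plus any single non-whitespace character.
BROAD_DOT = re.compile(r"\s+DO\S\s+")


def unmask_dots(text, pattern):
    """Replace each masked dot matched by pattern with a literal '.'."""
    return pattern.sub(".", text)


masked = "cs DOT example DOM edu"
prose = "funded by DOE grants"

narrow_masked = unmask_dots(masked, NARROW_DOT)
broad_masked = unmask_dots(masked, BROAD_DOT)
narrow_prose = unmask_dots(prose, NARROW_DOT)
broad_prose = unmask_dots(prose, BROAD_DOT)
```

Both options recover `cs.example.edu` from the masked string, but the broad one also rewrites ordinary prose such as " DOE ", which is the false-positive risk that makes the choice a judgement call.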
