# Regular Expressions

'''
Basics:
    sub   - replaces every occurrence of a given pattern within a string
            with something else, returns the result
    match - returns a match object if the pattern matches at the start of
            a string (None otherwise)... useful for testing for matches
'''

import re

# from Gulliver's Travels by Jonathan Swift (Part III, Chapter II)
text = '''
After he had left me, I placed all my words, with their interpretations, in
alphabetical order. And thus, in a few days, by the help of a very faithful
memory, I got some insight into their language. The word, which I interpret
the flying or floating island, is in the original Laputa, whereof I could
never learn the true etymology. Lap, in the old obsolete language, signifies
high; and untuh, a governor; from which they say, by corruption, was derived
Laputa, from Lapuntuh. But I do not approve of this derivation, which seems
to be a little strained. I ventured to offer to the learned among them a
conjecture of my own, that Laputa was quasi lap outed; lap, signifying
properly, the dancing of the sunbeams in the sea, and outed, a wing; which,
however, I shall not obtrude, but submit to the judicious reader.
'''

# we'll transform the text into a bare-bones string with words only and no punctuation

# we don't care about punctuation: replace every character that is not a
# letter, digit, or underscore with a space (raw string keeps \w intact)
text = re.sub(r"[^\w]", " ", text)

# nor do we care about case
text = text.lower()

# we might now have runs of whitespace: collapse each run into a single space
text = re.sub(r"\s+", " ", text)

# get rid of any whitespace at the beginning and end of the string
text = text.strip()

print(text)
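
# A minimal sketch of the `match` basic listed in the header docstring, which
# the script itself never exercises. The `sample` string and the variable
# names below are illustrative assumptions, not part of the original script.
import re

sample = "laputa was quasi lap outed"

# re.match only tests the pattern at the very start of the string
starts_with_lap = re.match(r"lap", sample)
print(starts_with_lap is not None)

# "outed" occurs later in the string, so re.match returns None for it
outed_at_start = re.match(r"outed", sample)
print(outed_at_start is None)

# re.search scans the whole string and returns the first occurrence instead
found = re.search(r"outed", sample)
print(found.group(), found.start())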