Well, it was my 22nd birthday 11 days back, and while the real-world was quite uneventful, I managed to create a small stir in the virtual-world.
For this birthday, I decided to do something cool and what is cooler (and a greater sign of laziness) than an AI program that replies to all the birthday wishes on my Facebook wall? This was definitely cool and quite possible given a basic understanding of HTTP and some Artificial Intelligence. After experimenting for 2 days with the Facebook Graph API and FQL, I had all the know-how to create my little bot.
Note: This is from a guy who has never taken a single course on Natural Language Processing and who has next to zero exposure programming NLP programs. Basically, I am a complete NLP noob and this hack is something I am really proud of.
But one major problem still remained: How to create a NLP classifier that would classify wall-posts as birthday wishes? I tried looking for a suitable dataset so I could build either a Support-Vector Machine or Naive Bayes Classifier, but all my search attempts were futile. Even looking for related papers and publications were in vain. That’s when I decided to come up with a little hack of my own. I had read Peter Norvig’s amazing essay on How to Build a Toy Spell Checker and seen how he had used his intuition to create a classifier when he lacked the necessary training dataset. I decided to follow my intuition as well and since my code was in Python (a language well suited for NLP tasks), I started off promptly. Here is the code I came up with:
The first thing I do is create a list of keywords one would normally find in a birthday wish, things like “happy”, “birthday” and “returns”. My main intuition was that when wishing someone, people will use atleast 2 words in the simplest wish, e.g. “Happy Birthday”, so any messages just containing the word “Happy” will be safely ignored, and thus I simply have to check the message to see if atleast 2 such keywords exist in the message.
What I do first is remove all the punctuations from the message and get all the characters to lower-case to avoid string mismatching due to case sensitivity. Then I split the message into a list of words, the delimiter being the default whitespace. This is done by :
</p> <p>s = ''.join(c for c in message if c not in string.punctuation and c in string.printable)<br /> t = s.lower().split()</p> <p>
However, I later realized that there exist even lazier people than me who simply use wishes like “HBD”. This completely throws off my Atleast-2-Words theory, so I add a simple hack to check for these abbreviations and put in the expanded form into the message. Thus, I created a dictionary to hold these expansions and I simply check if the abbreviations are present. If they are, I add the expanded form of the abbreviation to a new list that contains all the other non-abbreviated message words added in verbatim [lines 15-20]. Since I never check for locations of keywords, where I add the expanded forms are irrelevant.
Then the next part is simple, bordering on trivial. I iterate through the list of words in my message and check if it is one of the keywords and simply maintain a counter telling me how many of the keywords are present. Python made this much, much easier than C++ or Java.
But alas, another problem: Some people have another bad habit of using extra characters, e.g. “birthdayyyy” instead of “birthday” and this again was throwing my classifier off. Yet another quick fix: I go through all the keywords and check if the current word I am examining has the keyword as a substring. This is done easily in Python strings using the count method [lines 31-34].
Finally, I simply apply my Atleast-2-Words theory. I check if my counter has a value of 2 or more and return True if yes, else False, thus completing a 2 class classifier in a mere 40 lines of code. In a true sense, this is a hack and I didn’t expect it to perform very well, but when put to work, it really managed to do a splendid job and managed to flummox a lot of my friends who tried posting messages that they thought could fool the classifier. Safe to say, I had the last laugh.
Hope you enjoyed reading this and now have enough intuition to create simple classifiers on your own. If you find any bugs or can provide me with improvements, please mention them in the comments.