2 Month Notice

Well, technically it’s been more than 2 months since I joined my new workplace, but I guess it is high time I gave an update as to what I am up to.

As expected from working at one of the top tech companies in the world, there is a lot of work (and fun), but there is great potential for learning as well. And man, have I learned a lot!! In my first month here, I worked on a Windows 8 Modern UI app and understood the architecture of building a Modern app hands-on. Not only that, but I also had to integrate the app with a web service using JavaScript (which, by the way, is my weakest programming language), and after a lot of fumbling in the dark, I can now bend JS to my absolute will (Evil laughter)!!

In my free time, I got together with a senior of mine, Prakhar Gupta, who works at the same company albeit in the Bangalore office, and quickly coded up a Windows Phone 8 app. The app basically acts as a birthday reminder for all those like me who are poor at remembering dates. Expect to see the app in the Windows Marketplace soon! Along the way, I have also been drawn to Cloud Computing, thanks to the amazing Windows Azure (they have tutorials on creating Android apps with an Azure back-end), and hope to soon gain certifications in Cloud Computing. This, along with some other projects that I really can’t talk about (Non-Disclosure Agreement, you see), has made my life coding bliss!!

Oh, and did I mention that I have also started development on the Leap Motion? Expect to see more on that and Kinect development in my next few posts. This is from a practical standpoint. From a knowledge standpoint, I am learning every day. I have learned about good design and best practices while coding in C# and am also re-exploring functional programming with F#. SQL and database querying now seem to come more naturally than ever, and I have also started looking into query execution plans to further optimize my SQL code. I have also been trying to read up on the Common Language Runtime (CLR), which so far looks great, both in the way it handles managed modules and in the variety of support it provides for different languages. But with all the work and coding going on, I am having a hard time actually making time for myself to read more. Will have to stretch more on the reading front!

In the pipeline are some more apps (maybe on Android?) and reading papers and texts on NLP (for WishWasher) and Computer Vision (which is still my favoured field). I do seem to be loaded with work, but hopefully, I will keep inventing things and inspiring you to try new things. Keep an eye out for more on this domain.


Birthday Wish NLP Hack

Well, it was my 22nd birthday 11 days back, and while the real world was quite uneventful, I managed to create a small stir in the virtual world.

For this birthday, I decided to do something cool and what is cooler (and a greater sign of laziness) than an AI program that replies to all the birthday wishes on my Facebook wall? This was definitely cool and quite possible given a basic understanding of HTTP and some Artificial Intelligence. After experimenting for 2 days with the Facebook Graph API and FQL, I had all the know-how to create my little bot.

Note: This is from a guy who has never taken a single course on Natural Language Processing and who has next to zero exposure to writing NLP programs. Basically, I am a complete NLP noob and this hack is something I am really proud of.

But one major problem still remained: how to create an NLP classifier that would classify wall-posts as birthday wishes? I tried looking for a suitable dataset so I could build either a Support Vector Machine or a Naive Bayes classifier, but all my search attempts were futile. Even looking for related papers and publications was in vain. That’s when I decided to come up with a little hack of my own. I had read Peter Norvig’s amazing essay, How to Write a Spelling Corrector, and seen how he had used his intuition to create a classifier when he lacked the necessary training dataset. I decided to follow my intuition as well, and since my code was in Python (a language well suited for NLP tasks), I started off promptly. Here is the code I came up with:

The first thing I do is create a list of keywords one would normally find in a birthday wish, things like “happy”, “birthday” and “returns”. My main intuition was that when wishing someone, people will use at least 2 words in even the simplest wish, e.g. “Happy Birthday”. Any message containing just the word “Happy” can therefore be safely ignored, and I simply have to check whether at least 2 such keywords exist in the message.

What I do first is remove all the punctuation from the message and convert all the characters to lower-case to avoid string mismatches due to case sensitivity. Then I split the message into a list of words, the delimiter being the default whitespace. This is done by:

    s = ''.join(c for c in message if c not in string.punctuation and c in string.printable)
    t = s.lower().split()

However, I later realized that there exist even lazier people than me who simply use wishes like “HBD”. This completely throws off my At-Least-2-Words theory, so I added a simple hack to check for these abbreviations and put the expanded form into the message. I created a dictionary to hold these expansions and I simply check if the abbreviations are present. If they are, I add the expanded form of the abbreviation to a new list that also contains all the other, non-abbreviated message words added in verbatim [lines 15-20]. Since I never check for the locations of keywords, where I add the expanded forms is irrelevant.
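The expansion step can be sketched roughly as below. This is a minimal sketch and not the original code from the post: the `ABBREVIATIONS` table and the function name `expand_abbreviations` are hypothetical, and only “HBD” comes from the text above.

```python
# Hypothetical abbreviation table; only "hbd" is mentioned in the post,
# "bday" is an assumed extra entry for illustration.
ABBREVIATIONS = {
    "hbd": ["happy", "birthday"],
    "bday": ["birthday"],
}


def expand_abbreviations(words):
    """Return a new word list with known abbreviations expanded."""
    expanded = []
    for word in words:
        # Known abbreviations contribute their expansion;
        # every other word is copied over verbatim.
        expanded.extend(ABBREVIATIONS.get(word, [word]))
    return expanded


print(expand_abbreviations(["hbd", "dude"]))  # ['happy', 'birthday', 'dude']
```

The words are expected to be lower-cased already, since the lookup is an exact dictionary match.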

Then the next part is simple, bordering on trivial. I iterate through the list of words in my message, check whether each one is a keyword, and maintain a counter telling me how many keywords are present. Python made this much, much easier than C++ or Java.

But alas, another problem: some people have the bad habit of using extra characters, e.g. “birthdayyyy” instead of “birthday”, and this again was throwing my classifier off. Yet another quick fix: I go through all the keywords and check if the current word I am examining has the keyword as a substring. This is done easily with Python strings using the count method [lines 31-34].
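A rough sketch of the substring check, under a few assumptions: the keyword list below contains only the three words named earlier in the post (the full list is not shown), and `count_keyword_hits` is a hypothetical name.

```python
# Only the three keywords named in the post; the full list is not shown.
KEYWORDS = ["happy", "birthday", "returns"]


def count_keyword_hits(words, keywords=KEYWORDS):
    """Count the words that contain at least one keyword as a substring."""
    hits = 0
    for word in words:
        # str.count returns how many times the keyword occurs in the word,
        # so any value above zero means the keyword is a substring.
        if any(word.count(keyword) > 0 for keyword in keywords):
            hits += 1
    return hits


print(count_keyword_hits(["happyyy", "birthdayyyy", "bro"]))  # 2
```

One trade-off of substring matching is that it also fires on unrelated words that happen to contain a keyword, such as “unhappy”.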

Finally, I simply apply my At-Least-2-Words theory. I check if my counter has a value of 2 or more and return True if yes, else False, thus completing a two-class classifier in a mere 40 lines of code. In a true sense, this is a hack and I didn’t expect it to perform very well, but when put to work, it did a splendid job and managed to flummox a lot of my friends who tried posting messages that they thought could fool the classifier. Safe to say, I had the last laugh.
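Putting the steps together, the whole classifier might look something like the sketch below. It is a reconstruction from the description, not the original code: the keyword list and abbreviation table are assumed, and `is_birthday_wish` and `MIN_KEYWORDS` are hypothetical names.

```python
import string

# Assumed data: only these keywords and "HBD" are named in the post.
KEYWORDS = ["happy", "birthday", "returns"]
ABBREVIATIONS = {"hbd": ["happy", "birthday"]}
MIN_KEYWORDS = 2  # the "at least 2 words" threshold


def is_birthday_wish(message):
    """Return True if the message looks like a birthday wish."""
    # Drop punctuation and non-printable characters, then lower-case and split.
    cleaned = ''.join(c for c in message
                      if c not in string.punctuation and c in string.printable)
    words = cleaned.lower().split()

    # Expand abbreviations; all other words pass through unchanged.
    expanded = []
    for word in words:
        expanded.extend(ABBREVIATIONS.get(word, [word]))

    # Count words containing any keyword as a substring ("birthdayyyy").
    hits = sum(1 for word in expanded
               if any(word.count(keyword) > 0 for keyword in KEYWORDS))
    return hits >= MIN_KEYWORDS


for text in ["Happy Birthday!!", "HBD", "Happy to see you"]:
    print(text, "->", is_birthday_wish(text))
```

With these assumed lists, the first two messages are classified as wishes and the third is not, since it contains only one keyword.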

Hope you enjoyed reading this and now have enough intuition to create simple classifiers on your own. If you find any bugs or can provide me with improvements, please mention them in the comments.


Artificial Intelligence Vs Computer Security

I encountered a very interesting thing about Google a few days back and I’ve been itching to blog about it. Since this post is about something that touches almost everybody with an internet connection (namely, Web Search), I’ll try keeping it as low-tech as possible. Also, the issue here is not exactly Security as we know it, but it can be thought of as a branch of it, and it gives me a way to bring up this point.

First, a short description of what happened: I was using Google image search to look for images of penthouses, and Google would not return any images. Instead, what I saw was a message stating that the word “penthouse” had been filtered out of the search query, and since the query was that word only, the resulting query was blank. It was only when I deactivated search filtering that I realized what was making Google filter out the word. Apparently, “Penthouse” is the name of a famous adult magazine. You have got to be kidding me.

Google Fail

Now this brings me to the main point. For those of you who are unaware of the way Google Search works, I’ll put it down in one paragraph. There are 2 main components to their search: Natural Language Processing and Web Page Indexing. Natural Language Processing (or NLP) is a sub-field of Artificial Intelligence which tries to give machines the ability to comprehend human speech and language. We think understanding human language is easy, since we are humans, but this is a very hard problem for machines due to the subtleties of human language like ambiguity and sarcasm. Google sends the query to one of its many servers, which runs its NLP programs and tries to make heads or tails of our query as well as pick out the important words on which to return results. So a query like “What is WordPress?” will have the keywords “what”, indicating the type of sentence/query, and “wordpress”, which gives the servers a hint on what to narrow their search to. There are many sub-algorithms used to achieve this, like stemming, word normalization, etc., and since Google uses Statistical NLP, there is also a lot of Machine Learning involved. The second component, Web Page Indexing, is simply the way Google’s servers store various webpages so as to allow efficient retrieval of matching, relevant webpages. This is where the famous PageRank algorithm comes into play, the algorithm that started Google from the minds of 2 Stanford students. Now, I may not be an expert in these fields, but I did attend a lecture on this technology by two experts from Microsoft Research at Microsoft, so you can take my word for now if you feel you are a bit lost.
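To give a feel for the PageRank idea mentioned above, here is a minimal sketch of the simplified textbook version, computed by repeated iteration over a tiny link graph. It is in no way Google’s production system: the four-page `web` graph is made up, the damping factor of 0.85 is the commonly quoted value, and the function assumes every linked page also appears as a key in the dictionary. Strictly speaking, PageRank scores how important a page is rather than how pages are stored.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank over a dict mapping page -> list of linked pages."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank

    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            # A page with no links spreads its rank evenly over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Made-up toy web: "c" is linked to by every other page, "d" by none.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page everyone links to (“c”) ends up with the highest score and the page nobody links to (“d”) with the lowest, which is the intuition behind the algorithm.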

So we have AI doing most of the heavy lifting. Where does Security come into play? Well, the filtering of my search results can be construed as the security part. An important part of security is Access Control, which governs what people can and cannot do on a system. Google filtered my results in order to protect me from data that I might have found offensive, but in doing so it also removed legitimate search results.

Now the bigger question is, did I fail to get results because the AI failed to understand my intent, or because the security measures put into place were too stringent to allow the results to come through? From the AI perspective, I would say that the system failed because of something called High Bias and a lack of smoothing in Machine Learning. However, from a Security perspective, I can argue that Google should know better than to block all results for the word “penthouse” and be so strict about it. So even though they have all these PhDs and other super-smarties working for them, this sure was a dumb mistake. This leads to the point of debate: is Computer Security affecting Artificial Intelligence and vice versa, and in what ways? This is an interesting question for me, as I am personally involved in AI research but, thanks to some friends at Amrita University, have developed a keen interest in security. I am sure that, with some probing, we can find other such examples of conflict. This little problem of Google’s will (hopefully) be resolved in the next few weeks, but the point of debate may not be resolved so easily.