I came across a very interesting thing about Google a few days back and I’ve been itching to blog about it. Since this post is about something that touches almost everybody with an internet connection (namely, Web Search), I’ll try to keep it as low tech as possible. Also, the issue here is not exactly Security as we know it, but it can be thought of as a branch of it, and it gives me a way to bring up a larger point.
First, a short description of what happened: I was using Google image search to look for images of penthouses, and Google would not return any images. Instead, I saw a message stating that the word “penthouse” had been filtered out of the search query, and since the query was that word only, the resulting query was blank. Only when I deactivated search filtering did I realize what was making Google filter out the word. Apparently, “Penthouse” is the name of a famous adult magazine. You have got to be kidding me.
Now this brings me to the main point. For those of you who are unaware of the way Google Search works, I’ll put it down in one paragraph. There are two main components to their search: Natural Language Processing and Web Page Indexing. Natural Language Processing (or NLP) is a sub-field of Artificial Intelligence which tries to give machines the ability to comprehend human speech and language. We think understanding human language is easy, since we are humans, but it is a very hard problem for machines because of subtleties like ambiguity and sarcasm. Google sends our query to one of its many servers, which runs its NLP programs and tries to make heads or tails of the query, as well as pick out the important words on which to return results. So a query like “What is WordPress?” will have the keywords “what”, indicating the type of sentence/query, and “wordpress”, which gives the servers a hint about what to narrow their search on. There are many sub-algorithms used to achieve this, like Stemming, Word normalization, etc., and since Google uses Statistical NLP, there is also a lot of Machine Learning involved. The second component, Web Page Indexing, is simply the way Google’s servers store various webpages so as to allow efficient retrieval of matching, relevant webpages. This is where the famous PageRank algorithm comes into play: it ranks pages by the links pointing to them, and it is the algorithm with which two Stanford students started Google. Now, I may not be an expert in these fields, but I did attend a lecture on this technology by two experts from Microsoft Research at Microsoft, so you can take my word for now if you feel a bit lost.
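For readers who do want a peek under the hood, here is a toy sketch of the two components described above: picking keywords out of a query, and looking them up in an index of pages. Everything in it is my own assumption for illustration (the word lists, the crude `stem` rule, the `example.org` pages and the function names); it is nowhere near what Google actually runs, and it does not attempt ranking at all.

```python
import re
from collections import defaultdict

# Hypothetical word lists, chosen only for this illustration.
QUESTION_WORDS = {"what", "who", "where", "when", "why", "how"}
STOP_WORDS = {"is", "a", "an", "the", "of", "to"}


def normalize(text):
    """Word normalization: lowercase the text and split it into tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Very crude stemming: strip a trailing 'ing' or a plural 's'."""
    if token.endswith("ss"):
        return token  # leave words like "wordpress" alone
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(query):
    """Return (query_type, keywords) for a raw query string."""
    tokens = normalize(query)
    query_type = next((t for t in tokens if t in QUESTION_WORDS), None)
    keywords = [
        stem(t) for t in tokens
        if t not in QUESTION_WORDS and t not in STOP_WORDS
    ]
    return query_type, keywords


def build_index(pages):
    """Inverted index: map each stemmed word to the pages containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in normalize(text):
            index[stem(token)].add(url)
    return index


def search(index, query):
    """Return the pages that contain every keyword of the query."""
    _, keywords = analyze(query)
    if not keywords:
        return set()  # nothing left to search for
    results = set(index.get(keywords[0], set()))
    for keyword in keywords[1:]:
        results &= index.get(keyword, set())
    return results


# Made-up pages, purely for demonstration.
pages = {
    "example.org/wordpress": "WordPress is a blogging platform",
    "example.org/penthouses": "Images of luxury penthouses",
}
index = build_index(pages)

print(analyze("What is WordPress?"))
print(search(index, "penthouse images"))
```

The point of the index is that a query never has to scan every page: each keyword is looked up directly, and the sets of matching pages are intersected. Notice also what happens when no keywords survive: `search` has nothing to look up and returns nothing.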
So we have AI doing most of the heavy lifting. Where does Security come into play? Well, the filtering of my search results can be construed as the security part. An important part of security is Access Control: controlling what people can and cannot do on a system. Google filtered my results in order to protect me from content that I might have found offensive, but in doing so it also removed legitimate search results.
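To make the over-blocking concrete, here is a minimal sketch of a filter that behaves the way the message I saw suggests: it strips any blocked word from the query without looking at context. The blocklist, the `filter_query` name and the `safe_search` flag are all hypothetical; I have no idea how Google’s filter is actually built.

```python
import re

# Hypothetical blocklist; the real filter's rules are not public.
BLOCKED_WORDS = {"penthouse"}


def filter_query(query, safe_search=True):
    """Return (filtered_query, removed_words) for a raw query string."""
    words = re.findall(r"\w+", query.lower())
    if not safe_search:
        return " ".join(words), []
    kept = [w for w in words if w not in BLOCKED_WORDS]
    removed = [w for w in words if w in BLOCKED_WORDS]
    return " ".join(kept), removed


print(filter_query("penthouse"))
print(filter_query("penthouse apartment floor plan"))
print(filter_query("penthouse", safe_search=False))
```

With filtering on, a query consisting only of the blocked word comes back empty, which matches what I saw. A filter like this has no way of telling an apartment hunter from someone looking for the magazine, because it only ever sees the word and never the intent.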
Now the bigger question is: did I fail to get results because the AI failed to understand my intent, or because the security measures put into place were too stringent to allow the results to come through? From the AI perspective, I would say that the system failed because of something called High Bias and a lack of smoothing in Machine Learning. However, from a Security perspective, I can argue that Google should know better than to block all results for the word “penthouse” and be so heavy-handed about it. So even though they have all these PhDs and other super-smarties working for them, this sure was a dumb mistake. This leads to a point of debate: whether Computer Security is affecting Artificial Intelligence and vice versa, and in what ways. This is an interesting question for me, as I am personally involved in AI research but, thanks to some friends at Amrita University, have developed a keen interest in security. I am sure that, with some probing, we can find other such examples of conflict. This little problem of Google’s will (hopefully) be resolved in the next few weeks, but the point of debate may not be resolved so easily.