Search engines indexing more than you wanted?
Category: Website design
Last week, Wired ran a story on a publication released by the NSA about internet research techniques. The document was released after a freedom of information request and covers a range of topics, including how you can use Google to be a spy. The PDF document described how you can use advanced search operators in mainstream search engines to find things which may not have been intended for public viewing.
While the information is unlikely to surprise anyone who knows how to use the more advanced features search engines offer, for many it may come as a shock to see how easily sensitive information can end up appearing in Google's search results.
Modern search engines like Google are extremely powerful. Fortunately, you do not need to understand how search engines work to use them. Unfortunately, this also means few people understand the risks search engines can pose. In this blog post we are going to look at the risks highlighted in the NSA publication and help you understand how you can protect your sensitive information.
The first rule to know is that if Google can index something, it will! There are ways you can tell search engines that you do not want something indexed, but they all rely on the search engines complying with your request voluntarily. While I would always expect major search engines like Google and Bing to follow your instructions, for the purposes of security you need to assume that they will ignore you completely. You also need to remember that there are plenty of other search engines out there that may not be as nice. If it is hosted online and publicly available, then it could appear in the results.
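One such voluntary mechanism, which this post does not name, is a robots.txt file. As a minimal sketch, Python's standard library can show what a well-behaved crawler would conclude from a given set of rules. The rules and the /private/ path below are made up for illustration, and the check says nothing about crawlers that choose to ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; the /private/ path is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://mywebsite.com/private/report.pdf",
        "https://mywebsite.com/index.html",
    ):
        print(url, is_crawl_allowed(ROBOTS_TXT, "Googlebot", url))
```

A result of False here only means a compliant crawler should stay away. The file itself remains reachable by anyone who has the link, so this is a request, not a security control.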
One of the most common mistakes we see is assuming that because a web page is behind a user login, all the files that page links to are also secure. The web page is just one avenue to those files. Securing that access route does not mean you have secured every access route.
Search engines use a range of methods to find content to add to their indexes. One which many people are not aware of is email. Running an email service like Gmail provides Google with a number of benefits, including the ability to use the links in emails which pass through its system to discover new content. If you send someone a link to a PDF hosted on your website which is not for public release, and either you or the person receiving the email is using Gmail, you could find that PDF appearing in Google's search index for all to see. What's more, just because an email address does not end in gmail.com does not necessarily mean it is not a Gmail account.
Another big mistake people make is assuming that Google cannot read most file types. While Google cannot read every file type, it can read more than most people realise. Just because it isn't a web page does not mean that Google cannot open it up, have a read, understand what it is and help people find it via its search results. Even those files which Google cannot read can still appear in the index.
What can you do about it?
If you think you may have fallen victim to an overly efficient Google indexing things you didn't want it to, the first thing to do is go straight to the horse's mouth. We can do this by using the very same advanced search techniques to check which of our files Google knows about. The three main ones you will want to be aware of are the site operator (e.g. site:mywebsite.com), the filetype operator (e.g. filetype:pdf) and quote marks.
The site operator allows you to narrow down the search results to only those from a specific site. In this instance we can use it to see what Google can find on your website. To use it all you need to do is type site: followed by your domain name (without the www).
Depending on how large your website is, the site operator may well be all you need, but for larger sites you may need to narrow the search down a bit. We can do this by adding the filetype operator. For this we want to type filetype: followed by the file extension we are looking for. Together you should end up with something like this: site:mywebsite.com filetype:pdf
If you suspect there may be some of your Excel spreadsheets in Google's index, you can use the site operator to search your site and the filetype operator to narrow the search down to just files ending in xls.
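If you have several file types to check, it can help to generate the queries rather than type each one. The following is a small sketch in Python. The function name build_site_query and the list of extensions are our own choices for illustration, and mywebsite.com stands in for your own domain.

```python
from typing import Optional


def build_site_query(domain: str, filetype: Optional[str] = None) -> str:
    """Build a query limited to one site and, optionally, one file extension."""
    domain = domain.strip().lower()
    # The site operator is used without the leading "www."
    if domain.startswith("www."):
        domain = domain[len("www."):]
    query = f"site:{domain}"
    if filetype:
        # Accept "xls" or ".xls" and normalise to the bare extension
        query += f" filetype:{filetype.strip().lstrip('.').lower()}"
    return query


if __name__ == "__main__":
    # Example extensions only; adjust to the kinds of files you host
    for extension in ("pdf", "xls", "doc"):
        print(build_site_query("www.mywebsite.com", extension))
```

Each printed line can be pasted into the search box as it is.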
The last trick to learn is using quote marks. When you search for something in Google normally, you may notice that the results do not always mirror the phrase you searched for. This is because Google is trying to understand what you are looking for, so it will include similar or alternative phrases. By adding quote marks to the beginning and end of our search query, we can tell Google we only want to see results which contain that exact phrase.
Using quote marks can be a great way to check if something specific has been indexed. All you need is to find a unique section of text which only exists in that document.
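As a rough sketch, the exact-phrase check can also be turned into a link you can bookmark and re-run. The helper names and the sample phrase below are hypothetical, and the search URL format is assumed to be the familiar https://www.google.com/search?q=... form.

```python
from urllib.parse import urlencode


def build_exact_phrase_query(phrase: str, domain: str = "") -> str:
    """Wrap phrase in quote marks, optionally limiting results to one site."""
    # Drop embedded double quotes and tidy whitespace so the phrase stays one unit
    cleaned = " ".join(phrase.replace('"', " ").split())
    query = f'"{cleaned}"'
    if domain:
        query = f"site:{domain} {query}"
    return query


def to_search_url(query: str) -> str:
    """Return a search URL for query (URL format is an assumption)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    # Hypothetical phrase taken from a document that should not be public
    query = build_exact_phrase_query("internal pricing for spring", "mywebsite.com")
    print(query)
    print(to_search_url(query))
```

Pick a phrase long enough to be unique to the document, otherwise unrelated pages will show up in the results.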
Once you are happy that nothing has been indexed which shouldn't be, the aim of the game is to check that neither search engines nor users can publicly access information which is supposed to stay private. Ensuring that your website and server are set up correctly is a must. If you have to host sensitive information online, take some time to try to access that information by alternative routes, such as via a directory listing or by typing the file path directly into a browser.
Ideally, we always recommend avoiding hosting anything sensitive online if you can help it. When you have no choice, only host it online for as long as required. Removing sensitive information once the need to host it has passed can greatly reduce the risk of unauthorised access.
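Typing file paths into a browser can be tedious for more than a handful of files, so here is a minimal sketch of the same check in Python. It requests each URL without any login and reports the ones that come back successfully. The function names and example URLs are hypothetical, and the demo at the bottom uses a stand-in for the network call so it runs offline.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for an anonymous HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx answers are raised as HTTPError; the code is what we want
        return error.code


def find_exposed(
    urls: Iterable[str], fetch: Callable[[str], int] = fetch_status
) -> Dict[str, int]:
    """Return the URLs that answered with a 2xx status when not logged in."""
    exposed = {}
    for url in urls:
        status = fetch(url)
        if 200 <= status < 300:
            exposed[url] = status
    return exposed


if __name__ == "__main__":
    # Stand-in responses so the demo needs no network; omit `fetch` for real checks
    fake_responses = {
        "https://mywebsite.com/files/": 200,
        "https://mywebsite.com/files/report.pdf": 403,
    }
    print(find_exposed(fake_responses, fetch=fake_responses.__getitem__))
```

Treat the output as a starting point. Redirects are followed, so a file that redirects to a login page can still report success, and some servers do not answer HEAD requests properly. Anything flagged is worth checking by hand in a browser where you are not logged in.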