How to prevent your files from being indexed when your application uses S3 for storage and CloudFront for distribution

#webdev #aws #cloudfront #python

When would you face such a scenario?

When hosting a web application on S3 behind a CloudFront distribution in AWS, we sometimes need to keep certain files (direct download links) out of search engine results. Files like PDFs, Docs, MP3s, videos, spreadsheets, PPTs, etc., get indexed with their direct download link in the SERP. This isn't desirable: users get the file straight from the SERP without visiting your application/site, which decreases site visits. To stop these files from ever being indexed in a SERP, or to deindex files that have already been indexed, we need to serve them with the HTTP response header X-Robots-Tag: noindex.

Common mistakes-

Tons of articles & guides claim that if you block the search engine bots from accessing those files or folders, they simply won't be indexed.

Well, that's not true at all. The blocked bots will not crawl those folders, but if they find links to those files from other sources [ e.g., pages with internal links to those files ], the URLs can still end up indexed. A blocked bot also never fetches the file, so it can never see a noindex header on it; the files have to stay crawlable for the header to work. We have faced this issue with some of our clients before and thus I decided to make a comprehensive guide on this topic.
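
To make the difference concrete, here is a small sketch using Python's standard urllib.robotparser. The robots.txt content and the example.com URLs are made up for illustration. It only shows that a Disallow rule answers the question "may this bot fetch the URL?"; it says nothing about whether the URL may be indexed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a downloads folder for every bot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /downloads/
"""


def crawler_may_fetch(url, user_agent="Googlebot"):
    """Return True if the robots.txt rules above allow fetching the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# The PDF cannot be fetched, so any header on it is never seen by the bot.
pdf_allowed = crawler_may_fetch("https://example.com/downloads/report.pdf")
# Other pages stay crawlable and may link to the PDF.
page_allowed = crawler_may_fetch("https://example.com/blog/post.html")
```

A bot that respects this file never downloads report.pdf, so it has no way to read an X-Robots-Tag header on it.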

Complications-

There is no direct method to attach an arbitrary HTTP response header such as X-Robots-Tag to the files present inside the S3 bucket. There is an option to add custom user-defined metadata to those files, but every key gets the prefix x-amz-meta-. CloudFront will serve the files with those user-defined headers. So if we check the files for response headers with any HTTP headers testing method available [e.g., checking with https://securityheaders.com/ ], we can see these custom user-defined headers present on them. Suppose we want to add the header X-Robots-Tag: noindex to those files in the S3 bucket. We have to add it as x-amz-meta-X-Robots-Tag: noindex, and the crawlers won't recognise that prefixed name as X-Robots-Tag.
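
If you upload files from code rather than through the console, the metadata can be set at upload time. The sketch below only builds the arguments for boto3's S3 put_object call; the bucket name, key and helper name are placeholders of mine, not part of the original setup. Note that boto3 adds the x-amz-meta- prefix to each Metadata key by itself, and S3 returns user-defined metadata names in lowercase.

```python
def build_put_object_args(bucket, key, body, robots_value="noindex"):
    """Return keyword arguments for boto3's S3 client put_object call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # boto3 sends this as the header x-amz-meta-x-robots-tag,
        # so the prefix must not be written here.
        "Metadata": {"x-robots-tag": robots_value},
    }


# Placeholder bucket and key, for illustration only.
args = build_put_object_args(
    "my-example-bucket", "downloads/report.pdf", b"%PDF-1.4"
)

# To actually upload (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("s3").put_object(**args)
```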

Proposed solution-

We have to use a Lambda@Edge function to edit the origin response headers when those files are accessed via the CloudFront URL [ or the custom domain you've connected to your CloudFront distribution ]. In the HTTP response, we have to remove x-amz-meta- from the key name of the user-defined header, so the crawlers will find X-Robots-Tag: noindex as an HTTP header while accessing those files and follow its directive.

Execution mechanism-

CloudFront dispatches four events [Viewer Request, Viewer Response, Origin Request, Origin Response].

These four events can be handled with Lambda@Edge, a feature of AWS Lambda that runs functions to customize the content CloudFront delivers. Lambda@Edge scales automatically and runs in CloudFront locations close to the viewer.

In our case, we have to use the CloudFront Origin Response event with Lambda@Edge to modify the header.

(Origin Response - Dispatched right after CloudFront gets a response from the origin and before the object in the response is cached.)
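
A minimal sketch of such an Origin Response handler in Python could look like the following. It assumes the S3 metadata arrives in the event as x-amz-meta-x-robots-tag (CloudFront keys the headers dictionary by lowercase header name); the function and constant names are my own choices, so adapt them to your setup.

```python
META_HEADER = "x-amz-meta-x-robots-tag"  # header as delivered by S3
TARGET_HEADER = "x-robots-tag"           # header crawlers understand


def lambda_handler(event, context):
    """Rename the S3 metadata header to X-Robots-Tag on an origin response."""
    response = event["Records"][0]["cf"]["response"]
    headers = response["headers"]

    # Remove the prefixed header if present and keep its value(s).
    meta_values = headers.pop(META_HEADER, None)
    if meta_values:
        headers[TARGET_HEADER] = [
            {"key": "X-Robots-Tag", "value": item["value"]}
            for item in meta_values
        ]

    # Returning the response lets CloudFront cache and serve it.
    return response
```

Because the function runs on Origin Response, the renamed header is stored in the CloudFront cache together with the object, so the function is not invoked again on cache hits.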

Steps at a glance-

Create a Lambda function in the us-east-1 region with a runtime that Lambda@Edge supports (Node.js or Python) [ here in the example, we've used Python 3.7 ]
Select the default service role, then add a CloudFront trigger: select your CloudFront distribution and the Origin Response event
Add the custom user-defined header x-amz-meta-X-Robots-Tag: noindex to those files in the S3 bucket.
Write the necessary code to remove x-amz-meta- from x-amz-meta-X-Robots-Tag
Create an invalidation in your CloudFront distribution so that cached copies are fetched again from the origin and pass through the function.
Then check your files with any method to check HTTP headers, and you'll find the X-Robots-Tag: noindex header on those files.
So when the crawlers get this response header, they will know that these files shouldn't be indexed. Eventually, indexed files will be removed from the SERP.
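
For the header check in the steps above, a short script works as well as an online tool. This is only a sketch with the standard library: fetch_headers sends a HEAD request to a URL you pass in, and is_noindex does a simple comma-separated match that ignores bot-specific forms such as "googlebot: noindex". Both function names are mine.

```python
import urllib.request


def is_noindex(headers):
    """Return True if the header mapping has X-Robots-Tag with noindex."""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            directives = [part.strip().lower() for part in value.split(",")]
            if "noindex" in directives:
                return True
    return False


def fetch_headers(url):
    """Send a HEAD request and return the response headers as a dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return dict(response.headers.items())


# Offline example with a hand-written header set.
sample = {"Content-Type": "application/pdf", "X-Robots-Tag": "noindex, nofollow"}
sample_result = is_noindex(sample)

# Against a live file (replace with your own CloudFront URL):
#   is_noindex(fetch_headers("https://your-domain.example/downloads/report.pdf"))
```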

This is one of the few definitive ways to prevent your files from being indexed.