How does Google index PDF files?

PDFs in Google search results

Q: Can Google index any type of PDF file?

A: Generally we can index textual content (written in any language) from PDF files that use various kinds of character encodings, provided they’re not password protected or encrypted. The general rule of thumb is that if you can copy and paste the text from a PDF document into a standard text document, we should be able to index that text.

Q: What happens with the images in PDF files?
A: Currently the images are not indexed. In order for us to index your images, you should create HTML pages for them.

Q: How are links treated in PDF documents?
A: Generally links in PDF files are treated similarly to links in HTML: they can pass PageRank and other indexing signals, and we may follow them after we have crawled the PDF file. It’s currently not possible to “nofollow” links within a PDF document.

Q: How can I prevent my PDF files from appearing in search results; or if they already do, how can I remove them?
A: The simplest way to prevent PDF documents from appearing in search results is to add an X-Robots-Tag: noindex in the HTTP header used to serve the file. If they’re already indexed, they’ll drop out over time if you use the X-Robots-Tag with the noindex directive. For faster removals, you can use the URL removal tool in Google Webmaster Tools.
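
As a rough illustration of the header described in that answer, here is a minimal sketch using Python's standard-library file server. The helper robots_headers, the class NoIndexPDFHandler, the function serve, and port 8000 are all hypothetical names and choices made for this example. A production site would more likely set the header in its web server configuration.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def robots_headers(path: str) -> dict:
    """Return extra response headers for a request path (hypothetical helper)."""
    # Drop any query string or fragment before checking the extension.
    clean = path.split("?", 1)[0].split("#", 1)[0]
    if clean.lower().endswith(".pdf"):
        return {"X-Robots-Tag": "noindex"}
    return {}


class NoIndexPDFHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds X-Robots-Tag: noindex to PDF responses."""

    def end_headers(self):
        # Append the extra headers just before the header block is closed.
        for name, value in robots_headers(self.path).items():
            self.send_header(name, value)
        super().end_headers()


def serve(port: int = 8000) -> None:
    """Serve the current directory; the port is an arbitrary assumption."""
    ThreadingHTTPServer(("", port), NoIndexPDFHandler).serve_forever()
```

Calling serve() from a directory that contains PDFs and then requesting one of them should show the X-Robots-Tag line among the response headers, while HTML pages are served without it.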

Q: Is it considered duplicate content if I have a copy of my pages in both HTML and PDF?
A: Whenever possible, we recommend serving a single copy of your content. If this isn’t possible, make sure you indicate your preferred version by, for example, including the preferred URL in your Sitemap or by specifying the canonical version in the HTML or in the HTTP headers of the PDF resource. For more tips, read our Help Center article about canonicalization.
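
For the HTTP header option mentioned in that answer, the canonical is typically expressed as a Link header with rel="canonical". The sketch below only builds the header name and value. The function canonical_link_header and the example.com URL are assumptions for illustration, and how the header gets attached to the PDF response depends on your server.

```python
from typing import Tuple


def canonical_link_header(canonical_url: str) -> Tuple[str, str]:
    """Build a Link header naming the preferred URL (hypothetical helper)."""
    # An absolute URL avoids ambiguity about which host is preferred.
    if not canonical_url.startswith(("http://", "https://")):
        raise ValueError("canonical URL should be absolute")
    return ("Link", f'<{canonical_url}>; rel="canonical"')


# Example: the PDF response points at the HTML version of the same content.
name, value = canonical_link_header("https://www.example.com/guide.html")
print(f"{name}: {value}")
```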

Q: How can I influence the title shown in search results for my PDF document?
A: We use two main elements to determine the title shown: the title metadata within the file, and the anchor text of links pointing to the PDF file. To give our algorithms a strong signal about the proper title to use, we recommend updating both.

Pros & Cons of Using PDFs

Pros

  1. Easy to Create. PDFs can be very helpful for smaller teams or those with limited resources, and for people without knowledge of HTML. (P.S. Learning HTML, CSS, and JavaScript is free, and knowing at least the basics is a requirement.)
  2. Contains Metadata. When crafting the title and description for a PDF, follow the same rules as a normal webpage.
  3. Contains Links. Search engines can crawl PDF links and are able to pick up the anchor text.
  4. Indexable Content. To ensure that the text is readable, it should be created as text, not as an image, so it is best to create the PDF from the originating program, like Word or Illustrator.

Cons

  1. Lack of Navigation. When a visitor lands on a PDF rather than a webpage, they have no simple way to reach other pages on the site.
  2. Length of Document. Because it’s so easy to save a document as a PDF file, authors rarely break a PDF up into multiple, smaller documents. That is not ideal for SEO in some cases, because longer documents contain more text and often cover multiple topics.
  3. Lack of Page Organization/Control. Certainly one of the greatest benefits of using a content management system for a website is page organization and control. PDFs, however, don’t often work within the organizational structures of CMS as pages but rather as downloads.
  4. Lack of Code Editing Capabilities. Certainly one of the benefits of HTML pages is the flexibility that HTML authors have to edit the website code. PDFs do not offer that flexibility.
  5. Can’t Implement Structured Markup. Structured markup and the rich snippets it can generate have been shown through various studies to improve SERP visibility and click-through rate in organic search. But because PDFs cannot include HTML, they do not receive this benefit.
  6. Lack of Tracking Mechanisms. I find the greatest disadvantage of using PDFs to be the lack of tracking mechanisms I can apply to PDF documents. Google Analytics can perform tracking through onclick event tracking for PDF downloads, but other tracking within the PDF is not as simple. For more on tracking check out this post and this post by LunaMetrics.

How to Optimize PDF Documents for Search

The general rule for all of these is: write it as if it’s a webpage.

Checklist for PDF Optimization:

  • Search-friendly filenames. Treat the filename as you would treat the filename of any other webpage. Use words that are useful to users and consider search volume.
  • Keyword-optimized titles
  • Informative, concise descriptions
  • Company name in “Author” field
  • Use several relevant keywords in the “Keyword” field. This is equivalent to meta keywords; it may not be a ranking factor in the major search engines, but it’s still good form.
  • Make sure to fill out all available fields – there is an option to view “Additional Metadata” (in Adobe Acrobat)
  • Add tags and accessibility options to your document. In Adobe Acrobat, you can go up to the Advanced menu and find a sub-menu for Accessibility. The options within this sub-menu will let you add functionality for your document to be read by screen readers and magnifiers. You should also be able to add Tags in order to better categorize your document.
  • Don’t forget about alt text for images
  • Add links back to relevant pages on the main website
  • Write-protect the document. You want to make it difficult for others to edit your documents, both for protection of your intellectual property and for preservation of the links back to your site.
  • Offer an HTML version of the document. Use the HTTP header served with the PDF to point the canonical back to the HTML page.
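
The filename advice at the top of the checklist can be sketched as a small helper that turns a document title into a lowercase, hyphen-separated filename. The function pdf_filename is hypothetical, and hyphens as word separators are a common convention assumed here, not something the checklist itself specifies.

```python
import re
import unicodedata


def pdf_filename(title: str) -> str:
    """Turn a human-readable title into a search-friendly PDF filename."""
    # Fold accented characters to their closest ASCII equivalents.
    folded = unicodedata.normalize("NFKD", title)
    ascii_title = folded.encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    if not slug:
        raise ValueError("title contains no usable characters")
    return f"{slug}.pdf"


print(pdf_filename("How Does Google Index PDF Files?"))
# prints: how-does-google-index-pdf-files.pdf
```

The helper only handles formatting. Choosing words that are useful to users still has to happen in the title you pass in.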

Resources