Regex of All gTLD & ccTLDs

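The following regex matches each gTLD and ccTLD listed in it, followed by a trailing slash. The set of delegated TLDs changes over time, so the list may need updating before use.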

(\.ac/|\.ad/|\.ae/|\.af/|\.ag/|\.ai/|\.al/|\.am/|\.an/|\.ao/|\.aq/|\.ar/|\.as/|\.at/|\.au/|\.aw/|\.ax/|\.az/|\.ba/|\.bb/|\.bd/|\.be/|\.bf/|\.bg/|\.bh/|\.bi/|\.bj/|\.bl/|\.bm/|\.bn/|\.bo/|\.br/|\.bs/|\.bt/|\.bv/|\.bw/|\.by/|\.bz/|\.ca/|\.cc/|\.cd/|\.cf/|\.cg/|\.ch/|\.ci/|\.ck/|\.cl/|\.cm/|\.cn/|\.co/|\.cr/|\.cu/|\.cv/|\.cx/|\.cy/|\.cz/|\.de/|\.dj/|\.dk/|\.dm/|\.do/|\.dz/|\.ec/|\.ee/|\.eg/|\.eh/|\.er/|\.es/|\.et/|\.eu/|\.fi/|\.fj/|\.fk/|\.fm/|\.fo/|\.fr/|\.ga/|\.gb/|\.gd/|\.ge/|\.gf/|\.gg/|\.gh/|\.gi/|\.gl/|\.gm/|\.gn/|\.gp/|\.gq/|\.gr/|\.gs/|\.gt/|\.gu/|\.gw/|\.gy/|\.hk/|\.hm/|\.hn/|\.hr/|\.ht/|\.hu/|\.id/|\.ie/|\.il/|\.im/|\.in/|\.io/|\.iq/|\.ir/|\.is/|\.it/|\.je/|\.jm/|\.jo/|\.jp/|\.ke/|\.kg/|\.kh/|\.ki/|\.km/|\.kn/|\.kp/|\.kr/|\.kw/|\.ky/|\.kz/|\.la/|\.lb/|\.lc/|\.li/|\.lk/|\.lr/|\.ls/|\.lt/|\.lu/|\.lv/|\.ly/|\.ma/|\.mc/|\.md/|\.me/|\.mg/|\.mh/|\.mk/|\.ml/|\.mm/|\.mn/|\.mo/|\.mp/|\.mq/|\.mr/|\.ms/|\.mt/|\.mu/|\.mv/|\.mw/|\.mx/|\.my/|\.mz/|\.na/|\.nc/|\.ne/|\.nf/|\.ng/|\.ni/|\.nl/|\.no/|\.np/|\.nr/|\.nu/|\.nz/|\.om/|\.pa/|\.pe/|\.pf/|\.pg/|\.ph/|\.pk/|\.pl/|\.pm/|\.pn/|\.pr/|\.ps/|\.pt/|\.pw/|\.py/|\.qa/|\.re/|\.ro/|\.rs/|\.ru/|\.rw/|\.sa/|\.sb/|\.sc/|\.sd/|\.se/|\.sg/|\.sh/|\.si/|\.sj/|\.sk/|\.sl/|\.sm/|\.sn/|\.so/|\.sr/|\.st/|\.su/|\.sv/|\.sy/|\.sz/|\.tc/|\.td/|\.tf/|\.tg/|\.th/|\.tj/|\.tk/|\.tl/|\.tm/|\.tn/|\.to/|\.tp/|\.tr/|\.tt/|\.tv/|\.tw/|\.tz/|\.ua/|\.ug/|\.uk/|\.um/|\.us/|\.uy/|\.uz/|\.va/|\.vc/|\.ve/|\.vg/|\.vi/|\.vn/|\.vu/|\.wf/|\.ws/|\.ye/|\.yt/|\.yu/|\.za/|\.zm/|\.zw/|\.aero/|\.asia/|\.biz/|\.cat/|\.com/|\.coop/|\.edu/|\.gov/|\.info/|\.int/|\.jobs/|\.mil/|\.mobi/|\.museum/|\.name/|\.net/|\.org/|\.pro/|\.tel/|\.travel/)

The following regex cleans a list of inbound links so that you can sum the number of inbound links coming from each domain. It captures everything up to and including the TLD and adds a tab after the TLD. This separates the root domain from the rest of the URI.

Find:

(.*)(\.ac/|\.ad/|\.ae/|\.af/|\.ag/|\.ai/|\.al/|\.am/|\.an/|\.ao/|\.aq/|\.ar/|\.as/|\.at/|\.au/|\.aw/|\.ax/|\.az/|\.ba/|\.bb/|\.bd/|\.be/|\.bf/|\.bg/|\.bh/|\.bi/|\.bj/|\.bl/|\.bm/|\.bn/|\.bo/|\.br/|\.bs/|\.bt/|\.bv/|\.bw/|\.by/|\.bz/|\.ca/|\.cc/|\.cd/|\.cf/|\.cg/|\.ch/|\.ci/|\.ck/|\.cl/|\.cm/|\.cn/|\.co/|\.cr/|\.cu/|\.cv/|\.cx/|\.cy/|\.cz/|\.de/|\.dj/|\.dk/|\.dm/|\.do/|\.dz/|\.ec/|\.ee/|\.eg/|\.eh/|\.er/|\.es/|\.et/|\.eu/|\.fi/|\.fj/|\.fk/|\.fm/|\.fo/|\.fr/|\.ga/|\.gb/|\.gd/|\.ge/|\.gf/|\.gg/|\.gh/|\.gi/|\.gl/|\.gm/|\.gn/|\.gp/|\.gq/|\.gr/|\.gs/|\.gt/|\.gu/|\.gw/|\.gy/|\.hk/|\.hm/|\.hn/|\.hr/|\.ht/|\.hu/|\.id/|\.ie/|\.il/|\.im/|\.in/|\.io/|\.iq/|\.ir/|\.is/|\.it/|\.je/|\.jm/|\.jo/|\.jp/|\.ke/|\.kg/|\.kh/|\.ki/|\.km/|\.kn/|\.kp/|\.kr/|\.kw/|\.ky/|\.kz/|\.la/|\.lb/|\.lc/|\.li/|\.lk/|\.lr/|\.ls/|\.lt/|\.lu/|\.lv/|\.ly/|\.ma/|\.mc/|\.md/|\.me/|\.mg/|\.mh/|\.mk/|\.ml/|\.mm/|\.mn/|\.mo/|\.mp/|\.mq/|\.mr/|\.ms/|\.mt/|\.mu/|\.mv/|\.mw/|\.mx/|\.my/|\.mz/|\.na/|\.nc/|\.ne/|\.nf/|\.ng/|\.ni/|\.nl/|\.no/|\.np/|\.nr/|\.nu/|\.nz/|\.om/|\.pa/|\.pe/|\.pf/|\.pg/|\.ph/|\.pk/|\.pl/|\.pm/|\.pn/|\.pr/|\.ps/|\.pt/|\.pw/|\.py/|\.qa/|\.re/|\.ro/|\.rs/|\.ru/|\.rw/|\.sa/|\.sb/|\.sc/|\.sd/|\.se/|\.sg/|\.sh/|\.si/|\.sj/|\.sk/|\.sl/|\.sm/|\.sn/|\.so/|\.sr/|\.st/|\.su/|\.sv/|\.sy/|\.sz/|\.tc/|\.td/|\.tf/|\.tg/|\.th/|\.tj/|\.tk/|\.tl/|\.tm/|\.tn/|\.to/|\.tp/|\.tr/|\.tt/|\.tv/|\.tw/|\.tz/|\.ua/|\.ug/|\.uk/|\.um/|\.us/|\.uy/|\.uz/|\.va/|\.vc/|\.ve/|\.vg/|\.vi/|\.vn/|\.vu/|\.wf/|\.ws/|\.ye/|\.yt/|\.yu/|\.za/|\.zm/|\.zw/|\.aero/|\.asia/|\.biz/|\.cat/|\.com/|\.coop/|\.edu/|\.gov/|\.info/|\.int/|\.jobs/|\.mil/|\.mobi/|\.museum/|\.name/|\.net/|\.org/|\.pro/|\.tel/|\.travel/)(.*)

Replace with:

$1$2\t$3
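
The find and replace above can also be run in a script. The following is a minimal Python sketch, assuming an abbreviated TLD list (the TLDS list and the function names are hypothetical and only for illustration; substitute the full alternation from above). It keeps the same three capture groups and tallies links per root domain.

```python
import re
from collections import Counter

# Abbreviated, illustrative TLD list; the full alternation above is much longer.
TLDS = ["com", "org", "net", "jp", "uk", "de"]
TLD_GROUP = "(" + "|".join(rf"\.{tld}/" for tld in TLDS) + ")"

# Same three-group shape as the Find pattern: prefix, TLD plus slash, remainder.
FIND = re.compile("(.*)" + TLD_GROUP + "(.*)")


def split_domain(url: str) -> str:
    """Insert a tab after the TLD, mirroring the $1$2\\t$3 replacement."""
    return FIND.sub(r"\1\2\t\3", url, count=1)


def count_links_by_domain(urls):
    """Count inbound links per root domain (the text before the tab)."""
    return Counter(split_domain(url).split("\t", 1)[0] for url in urls)


if __name__ == "__main__":
    links = [
        "http://www.example.com/blog/post",
        "http://www.example.com/about",
        "http://example.jp/news",
    ]
    for domain, total in count_links_by_domain(links).items():
        print(domain, total)
```

Because the first group is greedy, a URL whose path also contains a TLD followed by a slash is split at the last such occurrence, so results are worth spot-checking.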

International Content

Multi-regional and Multilingual Sites

A multilingual website is any website that offers content in more than one language. A multi-regional website is one that explicitly targets users in different countries. Some sites are both multi-regional and multilingual (for example, a site might have different versions for the USA and for Canada, and both French and English versions of the Canadian content).

Managing Multilingual Versions of Your Site

Make sure the page language is obvious

Google uses only the visible content of a page to determine its language. It does not use code-level language information such as lang attributes. Avoid side-by-side translations (e.g. English navigation with Spanish content, or vice versa).

Google recommends using robots.txt to block search engines from crawling content that is programmatically translated. Google states: automated translations don’t always make sense and could be viewed as spam.
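
One way to confirm that a robots.txt rule blocks machine-translated pages is to test it with Python's standard urllib.robotparser module. This is a minimal sketch; the /auto-translated/ directory, the example URLs, and the is_crawlable helper are assumptions made for illustration, not paths or names from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps crawlers out of an assumed
# /auto-translated/ directory holding machine-translated pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /auto-translated/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.example.com/auto-translated/es/page",
        "http://www.example.com/es/page",
    ):
        print(url, is_crawlable(ROBOTS_TXT, url))
```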

Make sure each language version is easily discoverable

  • Keep content for each language on separate URLs.
  • Do NOT use cookies to show translated versions of the page.
  • Consider cross-linking each language version of a page. Cross-linking can be done with a language selection feature. The feature should take users to another translation of the exact same page. This way, users who accidentally end up on the wrong language can easily switch languages.
  • Avoid automatic redirection based on the user’s perceived language. This could prevent users and search engines from viewing all available content.

Choosing a URL Structure

There are three recommended ways to launch language-specific content: purchase a ccTLD, launch a subdomain, or host content in a subdirectory.

Options for Hosting    SEO Considerations
ccTLD                  Domain Authority
Subdomain              Geotargeting
Subdirectory           Maintenance Cost

From the perspective of the organic search marketer, choosing how to host international content comes down to domain authority, geotargeting, and maintenance costs.

Domain Authority is a score (on a 100-point scale) developed by Moz, an SEO company, that predicts how well a website will rank in search engines. Moz uses 40 ranking signals to determine how each domain scores, but the score is largely affected by the quantity and quality of links pointing to a domain.

Moz (and Google) looks at the quantity and quality of links a domain has accrued over time to determine if the site deserves to be ranking for a user’s query. How international content is implemented will determine to what degree the new content takes advantage of the established authority of the company’s root domain.

Implementation    Shared Domain Authority
ccTLD             None
Subdomain         Some
Subdirectory      Most

Setting up a separate country code top-level domain means that none of the authority from the current domain (e.g. the .com domain) will be applied to the new domain; they are separate entities. The differences between a subdomain and a subdirectory implementation are harder to quantify. SEO experts have debated the effect that hosting content on a subdomain has on rankings. It is known that search engines keep separate metrics for subdomains, but Google's official stance is that it treats subdomains as part of the main website, as if they were subdirectories. Experience has shown this not to be true: when moving content from blog.example.com to example.com/blog/, a site can experience a drastic improvement in rankings.

If a company has the time and money to invest in developing, maintaining, and growing a separate domain, there can be clear benefits to doing so. But for companies simply looking to have a copy of their site translated for international users, a subdomain or subdirectory implementation will make the most sense.

ccTLDs aka Country Code Top Level Domains

Example: example.jp
Pros:
  • Clear geotargeting
  • Server location irrelevant (ccTLD overrides server location)
  • Easy separation of sites
  • Increased click-through rate from users who prefer local domains
Cons:
  • Expensive (can have limited availability)
  • Requires more infrastructure. Requires IT resources to set up and maintain
  • Strict ccTLD requirements (sometimes)
  • Splits link authority among several domains. Will not benefit from the established authority of the .com domain.

Subdomains with gTLD

Example: jp.example.com
Pros:
  • Easy to set up
  • Can use Search Console geotargeting
  • Allows different server locations
  • Easy separation of sites
  • Will maintain some of the authority of the root domain, but not all of it
Cons:
  • Users might not recognize geotargeting from the URL alone (is “de” the language or country?)
  • Does not benefit from root domain authority as much as a subdirectory would
  • Requires IT resources to set up and maintain
  • Splits link authority among several subdomains

Subdirectories with gTLDs

Example: example.com/jp/
Pros:
  • Easy to set up and manage
  • Can use Google Search Console geotargeting
  • Low maintenance (same host)
  • Receives full benefit of the established root domain's authority
  • Additional links to international directories help the whole domain. A rising tide lifts all boats.
Cons:
  • Users might not recognize geotargeting from the URL alone
  • Single server location
  • Separation of sites harder
  • May not perform as well in country-specific search engines (e.g. google.co.jp)
  • Potentially confusing to a user expecting a ccTLD
  • Much weaker signal to search engines than a ccTLD

URL Parameters

Example: example.com?loc=jp
Pros:
  • None (not recommended)
Cons:
  • URL-based segmentation difficult
  • Users might not recognize geotargeting from the URL alone
  • Geotargeting in Search Console is not possible
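
To compare the four URL structures side by side, the sketch below builds each form for a given domain and country code. The function name and dictionary keys are hypothetical, and it assumes a simple two-label domain such as example.com.

```python
def international_urls(domain: str, code: str) -> dict:
    """Build the four URL forms discussed above for one country code.

    Assumes a simple domain of the form name.tld (for example, example.com).
    """
    name = domain.rsplit(".", 1)[0]  # strip the existing TLD
    return {
        "ccTLD": f"{name}.{code}",
        "subdomain": f"{code}.{domain}",
        "subdirectory": f"{domain}/{code}/",
        "url_parameter": f"{domain}?loc={code}",
    }


if __name__ == "__main__":
    for structure, url in international_urls("example.com", "jp").items():
        print(f"{structure}: {url}")
```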

How to Choose

Thanks to the authority of the root domain, subdirectories can give new international content an immediate leg up in search results. But subdirectories are also the hardest to geotarget. Google Search Console offers the ability to geotarget subdomains and subdirectories, but this signal is not as strong as a ccTLD's signal. Geotargeting is an important consideration because Google uses device location to ensure it is serving the most relevant results to its users. Not properly geotargeting content, or failing to add language tags to translated content, can impact how a site appears in international search results.

A company also needs to consider maintenance costs. Evaluating how much of a company’s resources can be dedicated to the upkeep and growth of international content is fundamental to ranking that content. Search engine optimization is an iterative process; it requires a lot of research, analysis, and implementation to achieve meaningful results. It also requires staying on top of technical issues, keyword research, and link building efforts so that search engines see that the site is worth ranking.

When it comes to content creation, it is not recommended that a company use a programmatic translator. Google uses natural language processing to determine if content has been generated programmatically. If the translation of content is poor, it will result in poor performance in organic search results. The recommended process is to have a native speaker translate the content for the company. This ensures the best user experience and avoids organic ranking issues.

One of the benefits of using a ccTLD or a subdomain is that they provide clear separation of content from the root domain. However, if resources are limited, this also means that every new international site requires additional time and money to maintain. That means additional time doing keyword research and building links, and additional budget allocated to fixing technical issues. Few companies have the resources, or are willing to allocate the resources, to support this effort, which is part of the reason why so many international companies use a subdomain or subdirectory implementation.

International Targeting

To ensure that your content reaches the correct audience, you will use two general mechanisms: URL-level targeting and site-wide targeting.

Scenarios where rel="alternate" hreflang="x" is recommended:

  • You keep the main content in a single language and translate only the template, such as the navigation and footer. Pages that feature user-generated content, like forums, typically do this.
  • Your content has small regional variations with similar content in a single language. For example, you might have English-language content targeted to the US, GB, and Ireland.
  • Your site content is fully translated. For example, you have both German and English versions of each page.

URL Level Targeting

  • Add one or more HTML link elements to the page header. Use the <link rel="alternate" hreflang="x" href="alternateURL"> tag in the <head> section of your pages to list alternate language versions for each page. Each page should provide an hreflang tag that links to all other language variants of itself, as well as a tag that refers back to itself.

Single Language Example:

<link rel="alternate" hreflang="es" href="http://es.example.com/" />

Multiple Language Example:

<link rel="alternate" hreflang="es" href="http://es.example.com/" />
<link rel="alternate" hreflang="fr" href="http://fr.example.com/" />
  • Add one or more link elements to the HTTP header. You can use an HTTP header to indicate a different language version of a URL. This is helpful for non-HTML files (like PDFs). You could also achieve this using a sitemap.

Single Language Example:

Link: <http://es.example.com/>; rel="alternate"; hreflang="es"

Multiple Language Example:

Link: <http://es.example.com/>; rel="alternate"; hreflang="es",<http://de.example.com/>; rel="alternate"; hreflang="de"
  • Sitemap. Instead of using markup, you can submit language version information in a Sitemap.

Single Language Example:

  <url>
    <loc>http://www.example.com/spanish/</loc>
    <xhtml:link 
                 rel="alternate"
                 hreflang="es"
                 href="http://www.example.com/spanish/"
                 />
  </url>

Multiple Language Example:

  <url>
    <loc>http://www.example.com/spanish/</loc>
    <xhtml:link 
                 rel="alternate"
                 hreflang="es"
                 href="http://www.example.com/spanish/"
                 />
    <xhtml:link 
                 rel="alternate"
                 hreflang="en"
                 href="http://www.example.com/english/"
                 />
  </url>
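
The three formats above carry the same information, so they can be generated from one mapping of language codes to URLs. The following Python sketch illustrates this under stated assumptions: the VARIANTS mapping and the function names are hypothetical, and the output simply mirrors the examples above rather than any official tooling.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical mapping of hreflang values to the URL of each language variant.
VARIANTS = {
    "en": "http://www.example.com/english/",
    "es": "http://www.example.com/spanish/",
}


def html_link_tags(variants):
    """Return one <link> element per variant for the page <head>."""
    return [
        f'<link rel="alternate" hreflang={quoteattr(lang)} href={quoteattr(url)} />'
        for lang, url in variants.items()
    ]


def http_link_header(variants):
    """Return a single Link header listing every variant."""
    parts = [
        f'<{url}>; rel="alternate"; hreflang="{lang}"'
        for lang, url in variants.items()
    ]
    return "Link: " + ",".join(parts)


def sitemap_url_entry(page_url, variants):
    """Return a sitemap <url> entry for page_url listing every variant."""
    lines = ["<url>", f"  <loc>{escape(page_url)}</loc>"]
    for lang, url in variants.items():
        lines.append(
            f'  <xhtml:link rel="alternate" hreflang={quoteattr(lang)} '
            f"href={quoteattr(url)} />"
        )
    lines.append("</url>")
    return "\n".join(lines)


if __name__ == "__main__":
    print("\n".join(html_link_tags(VARIANTS)))
    print(http_link_header(VARIANTS))
    print(sitemap_url_entry(VARIANTS["es"], VARIANTS))
```

Including the page's own URL in the mapping produces the self-referencing tag described under URL Level Targeting.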

More Granular Targeting: Add Language and Country Values

For more granular targeting, you can use the hreflang attribute to indicate language and country combinations (e.g. en-ie, en-ca, en-us).
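
A quick shape check can catch malformed values before they are published. The sketch below is a simplification: it only accepts a two-letter language code with an optional two-letter country code, does not cover other valid forms, and does not verify that the codes are actually assigned (for example, "en-uk" passes the shape check even though the country code for the United Kingdom is "gb"). The function name is hypothetical.

```python
import re

# Two-letter language code, optionally followed by a hyphen and a
# two-letter country code (e.g. "en" or "en-ie"). Shape only.
HREFLANG_SHAPE = re.compile(r"[a-z]{2}(?:-[a-z]{2})?", re.IGNORECASE)


def looks_like_hreflang(value: str) -> bool:
    """Return True if value has the language or language-country shape."""
    return HREFLANG_SHAPE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("en", "en-ie", "en-ca", "en_US", "english"):
        print(candidate, looks_like_hreflang(candidate))
```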

Site Wide Targeting

In addition to making sure your site URLs map to alternate language variants, you will also likely use geographic-specific domains or configure your entire site structure to deliver content to a specific geographic and language preference.

Once you have configured multi-language or multi-regional sites and pages, you can use two sections in the International targeting pages to keep your international presence healthy:

  1. The Language section: this helps you ensure your hreflang tags use the correct locale codes (language and optional country). It also lets you check that alternate pages have tags that link back to the pages on your site.
  2. The Country section: you can use this tool to set a site-wide country target for your entire site, if necessary.
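
The link-back requirement can also be checked offline. The following is a minimal sketch, assuming you have already collected each page's declared hreflang alternates into a dictionary (the data structure and the function name are assumptions for illustration). It reports every page that points to an alternate which does not point back.

```python
def missing_return_links(declared: dict) -> list:
    """Find (page, alternate) pairs where the alternate does not link back.

    declared maps each page URL to a dict of {hreflang: alternate URL}
    as found in that page's hreflang tags.
    """
    problems = []
    for page, alternates in declared.items():
        for target in alternates.values():
            if target == page:
                continue  # self-reference, nothing to check
            if page not in declared.get(target, {}).values():
                problems.append((page, target))
    return problems


if __name__ == "__main__":
    en = "http://www.example.com/english/"
    es = "http://www.example.com/spanish/"
    pages = {
        en: {"en": en, "es": es},
        es: {"es": es},  # missing the link back to the English page
    }
    print(missing_return_links(pages))
```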


Anchoring

Definition: A cognitive bias that describes the common human tendency to rely too heavily on the first piece of information offered (the “anchor”) when making decisions. During decision making, anchoring occurs when individuals use an initial piece of information to make subsequent judgments. Once an anchor is set, other judgments are made by adjusting away from that anchor, and there is a bias toward interpreting other information around the anchor.

Canonicals

What is a canonical?

A canonical link is an HTML element used to distinguish the “original” page from derivative pages carrying the same content. It is used to prevent duplicate content issues on the site and tells search engines which page it should index.

 

[Figure: canonical link diagram]

 

How are canonical links used?

How a canonical is used depends on the site and the types of content it contains. Here are the six common instances where canonicals should be used:

  1. Self Referring Canonicals
  2. Duplicate Pages
  3. View All Pages
  4. Faceted Navigation
  5. Non-HTML Content
  6. Cross-Domain Syndication

Self-Referring

This type of canonical points to the page it appears on. It is used as a confidence indicator to confirm that the page the search engine has found is indeed the page that should be indexed. This type of canonical is particularly useful when redirecting pages to a new location. A search engine will follow a 301 redirect and use the self-referring canonical to confirm that the page it has arrived on is the new page that should be indexed.

Example of a self-referring canonical:

URL
http://www.example.com/breakdancing-grizzly-bear

Canonical
<link rel="canonical" href="http://www.example.com/breakdancing-grizzly-bear" />
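
To check whether a page carries a self-referring canonical, the page's HTML can be parsed and the canonical href compared with the page URL. This is a minimal sketch using Python's standard html.parser module; the class and function names are hypothetical, and the comparison is an exact string match with no URL normalization.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel = (attributes.get("rel") or "").lower()
        if tag == "link" and rel == "canonical" and self.canonical is None:
            self.canonical = attributes.get("href")


def is_self_referring(page_url: str, html: str) -> bool:
    """Return True if the page's canonical link points at page_url itself."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical == page_url


if __name__ == "__main__":
    url = "http://www.example.com/breakdancing-grizzly-bear"
    page = f'<html><head><link rel="canonical" href="{url}" /></head></html>'
    print(is_self_referring(url, page))
```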

Duplicate Content

In its most basic form, duplicate content means that two or more URLs have the same content. Normally this is not done on purpose; rather, the Content Management System (CMS) produces multiple URLs that render the same content.

An important thing to remember about duplicate content is that if a URL can be modified and the site still renders the content on the original URL, then you have a potential duplicate content issue.

Common ways modifying a URL can produce duplicate content:

http vs https

These would technically be considered duplicates

http://www.example.com/services
https://www.example.com/services

www vs non-www. This happens when a CMS does not force the domain to use either www or non-www. Having a www in the URL is really declaring a subdomain, so being able to render content on both the www and non-www versions of the URL is like serving the same content from two different hostnames.

These would technically be considered duplicates

http://example.com/services
http://www.example.com/services

Capitalization. If you can modify a URL by capitalizing one or more of its characters and the content still renders, that is considered duplicate content. It would be rare to see this type of duplicate being indexed by search engines, but it can have an effect on the way a page accumulates authority. If another site links to a piece of content using capitalization, authority will be passed to that URL, instead of attributing authority to the lower case version of the link.

These would technically be considered duplicates

http://www.example.com/services
http://www.example.com/seRvices

Development Sites. When a site is undergoing a redesign, a development site is typically set up to test the new site in a live environment. If the developers fail to add a noindex tag to the page, then there is potential for duplicate content issues. Development sites are usually hosted on a subdomain or separate domain. In either case, developers should include a noindex tag and block all search engines from crawling that content.

These would technically be considered duplicates

http://www.example.com/
http://dev.example.com/
http://www.development-domain.com/
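
The protocol, www, and capitalization variants above can be detected in a crawl export by reducing each URL to a comparison key. The sketch below is one possible approach, with hypothetical function names. It assumes paths are meant to be case-insensitive on the site in question, which is not true of every server, and it does not detect development sites on other hostnames.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def duplicate_key(url: str) -> str:
    """Reduce a URL to a key that ignores scheme, www, case, and trailing slash."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is already lowercased
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path.lower().rstrip("/") or "/"
    return host + path


def group_duplicates(urls):
    """Return lists of URLs that collapse to the same key."""
    groups = defaultdict(list)
    for url in urls:
        groups[duplicate_key(url)].append(url)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    crawl = [
        "http://www.example.com/services",
        "https://www.example.com/services",
        "http://example.com/services",
        "http://www.example.com/seRvices",
        "http://www.example.com/about",
    ]
    for group in group_duplicates(crawl):
        print(group)
```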

View All Pages

Duplicate content can be created when a website has a single view all page and individual pages that contain pieces of content from the view all page. This is common with publishers who produce list-type content, where the view all page has all ten items on one page but also breaks each item out onto its own page.

[Figure: view all canonical diagram]

The problem with this type of content is that it often competes with itself in organic rankings. To prevent this, the site should add a canonical from breakout pages to the view all page. This eliminates duplicate content issues and consolidates link metrics, making the view all page the one page that will be indexed and ranked.

Examples of a view all page canonical: 

View All URL
http://www.example.com/top-5-bill-murray-movies

Individual Pages
URL: http://www.example.com/top-5-bill-murray-movies/groundhog-day
Canonical: http://www.example.com/top-5-bill-murray-movies

URL: http://www.example.com/top-5-bill-murray-movies/ghostbusters
Canonical: http://www.example.com/top-5-bill-murray-movies

URL: http://www.example.com/top-5-bill-murray-movies/lost-in-translation
Canonical: http://www.example.com/top-5-bill-murray-movies

URL: http://www.example.com/top-5-bill-murray-movies/caddyshack
Canonical: http://www.example.com/top-5-bill-murray-movies

URL: http://www.example.com/top-5-bill-murray-movies/scrooged
Canonical: http://www.example.com/top-5-bill-murray-movies
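
The pattern in these examples can be expressed as a small helper that returns the canonical tag for any breakout page under a view all URL. This is a sketch under the assumption that breakout pages live directly beneath the view all path, as in the examples above; the function name is hypothetical.

```python
from typing import Optional


def view_all_canonical(page_url: str, view_all_url: str) -> Optional[str]:
    """Return the canonical tag for a view all page or one of its breakout pages.

    Returns None when page_url is not the view all page or a page beneath it.
    """
    base = view_all_url.rstrip("/")
    if page_url.rstrip("/") == base or page_url.startswith(base + "/"):
        return f'<link rel="canonical" href="{base}" />'
    return None


if __name__ == "__main__":
    view_all = "http://www.example.com/top-5-bill-murray-movies"
    for slug in ("groundhog-day", "ghostbusters", "scrooged"):
        print(view_all_canonical(f"{view_all}/{slug}", view_all))
```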

Faceted navigation

Infinite scroll

Non-HTML Content

Cross-domain Duplicate Content

Sometimes content is cross-published on multiple sites that are owned by the same company. This is still duplicate content, and each copy has the ability to compete for rankings. To ensure the correct domain ranks for an article or piece of content, a cross-domain canonical can be added to the page.

Resources:

Cross-domain URL selection – Search Console Help

Handling Legitimate Cross-Domain Canonicals – Google Webmaster Blog

Cross-Domain Canonical The New 301? – Whiteboard Friday – Moz

Does Google support cross-domain rel="canonical"? – Google Webmaster on YouTube

How is it implemented?

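There are two ways to implement a canonical link. The first, and most common, is by adding a <link> HTML tag to the <head> of a page. The second is by sending a rel="canonical" Link HTTP header, which suits non-HTML content such as PDFs.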
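
As an illustration, a canonical can be written either as a <link> element or as an HTTP Link header with rel="canonical", the latter being an option for files that have no <head>. The sketch below formats both; the function names and the PDF URL are hypothetical.

```python
def canonical_link_tag(url: str) -> str:
    """Format the canonical as an HTML <link> element for the page <head>."""
    return f'<link rel="canonical" href="{url}" />'


def canonical_http_header(url: str) -> str:
    """Format the canonical as an HTTP Link response header."""
    return f'Link: <{url}>; rel="canonical"'


if __name__ == "__main__":
    print(canonical_link_tag("http://www.example.com/breakdancing-grizzly-bear"))
    # Hypothetical PDF URL, used only to show the header form.
    print(canonical_http_header("http://www.example.com/downloads/white-paper.pdf"))
```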

Additional Resources

Ecommerce SEO: Product Variation, Colors, and Sizes – Merkle

rel=canonical: the ultimate guide – Yoast