Regular Expressions for SEO: How and Where to Use Them?

Vasiliy Pupkin Dmitriy

Regular expressions are a useful tool that makes life easier for many SEO specialists. 

Sometimes you come across regular expressions in .htaccess or Google Analytics, and at first, everything seems very confusing, but as soon as you start to understand RegEx, you realize how these constructs make life easier and become a powerful tool for working with data.

In this post, we explain in detail what regular expressions (RegEx) are, give examples, and show how to put them into practice. The article will be useful to everyone who deals with data processing in SEO.

What are Regular Expressions?

A regular expression (or simply RegExp) is a specific pattern or construction for finding occurrences (of any kind) in a text string.

Using this language, you can extract some data from the text, for example, phone numbers, email addresses, any pieces of text, and so on.

RegExp is often used by programmers to validate input data or write parsers, but SEO specialists also come across regular expressions when working with Google Analytics, with RewriteRule in .htaccess, or in text editors to quickly find and replace strings.

Regular Expression basics

Let's look at a popular example: using regular expressions to set up a redirect from the non-www version of a site to the www domain.

RewriteCond %{HTTP_HOST} !^www\. [NC]

RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]

The regular expressions here are the patterns that follow %{HTTP_HOST} and RewriteRule. What do these dots and other signs mean?

It looks confusing at first. To read it, you need to understand the RegExp syntax.

"^" - caret, circumflex or just a tick. Line start

This character marks the start of a line (unless it is used inside a "[ ]" construction). For example, to find keywords that start with the word "buy", the expression is simply ^buy. Without the caret, every keyword containing "buy" is found, wherever the word appears in the line.

For example, you can use this in advanced Google Analytics filters.

You may ask: why use regular expressions where you can do without them?

Google Analytics already has a "starts with" filter. That is true, and this example only illustrates the syntax. Further on, we will see that combining different constructions solves problems that are hard to solve without regular expressions.

"$" is the dollar sign. End of line

Unlike the caret, the dollar sign marks the end of a line. So the expression Kyiv$ finds all phrases that end with the word "Kyiv".
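A minimal Python sketch of the two anchors. The sample phrases are made up for illustration, and Python's re module only stands in for a Google Analytics filter, whose matching rules may differ in details:

```python
import re

# Hypothetical search phrases, not taken from any real report.
phrases = [
    "buy a ladder",
    "where to buy a ladder",
    "ladder shops in Kyiv",
    "Kyiv ladder shops",
]

# "^" pins the match to the start of the line, "$" to its end.
starts_with_buy = [p for p in phrases if re.search(r"^buy", p)]
contains_buy = [p for p in phrases if re.search(r"buy", p)]
ends_with_kyiv = [p for p in phrases if re.search(r"Kyiv$", p)]

print(starts_with_buy)  # ['buy a ladder']
print(contains_buy)     # ['buy a ladder', 'where to buy a ladder']
print(ends_with_kyiv)   # ['ladder shops in Kyiv']
```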

"." - dot. Any character

The dot stands for any character, but only one. A dot on its own is rarely used; more often it appears together with other signs, for example, ".*".

"*" - multiplication sign, asterisk. Any number of previous characters

An asterisk means that the character (or group of characters) written before it may repeat any number of times, including zero times.

Together with the previous “dot” character, we get the convenient expression “.*”, which means any number of any characters. For example, take the rule from the redirect above:

RewriteRule ^(.*)$ http://www.%{HTTP_HOST}/$1 [R=301,L]

Now it becomes clearer: the pattern ^(.*)$ matches the address of any page, so the rule redirects every page to a new URL.

"+" is a plus sign. Any positive number of previous characters.

The plus sign differs from the previous “*” sign in one respect: the preceding character (or group) must occur at least once.

"?" - question mark. Optional occurrence of last character

The question mark indicates that the last character or group may or may not occur in the text (that is, their occurrence is not required).

It is convenient when you don't know, for example, whether there will be a slash at the end of the address or not:

^/articles/?$

Or, for example, when you're looking for keywords and want to catch a certain misspelling:

buy a ladd?er

This expression finds all the keywords that contain “buy a ladder” or the misspelled “buy a lader”.
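The difference between the three quantifiers can be checked with a short Python sketch. The test strings and paths are hypothetical and chosen only to show the zero-repetition case:

```python
import re

# "*" accepts zero or more repetitions of the preceding character.
star_matches_zero = bool(re.fullmatch(r"go*gle", "ggle"))     # zero "o" is fine
# "+" needs at least one repetition.
plus_matches_zero = bool(re.fullmatch(r"go+gle", "ggle"))     # no "o", so no match
plus_matches_many = bool(re.fullmatch(r"go+gle", "gooogle"))  # three "o" is fine

# "?" makes the trailing slash optional, as in ^/articles/?$
articles = re.compile(r"^/articles/?$")
paths = ["/articles", "/articles/", "/articles/seo"]  # hypothetical paths
matched = [p for p in paths if articles.search(p)]

print(star_matches_zero, plus_matches_zero, plus_matches_many)  # True False True
print(matched)  # ['/articles', '/articles/']
```

Note that "?" applies only to the character or group directly before it; it does not stand for "any character".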

"( )" - round brackets. Grouping structures.

As in mathematics, parentheses in regular expressions are used for grouping. Other rules, such as the "*", "+" or "?" signs, can then be applied to the whole group of characters.

For example, we need to redirect all users from the "domain.com/blog/" subfolder to the blog.domain.com subdomain:

RewriteRule ^blog/(.*)$ http://blog.domain.com/$1 [R=301,L]

Here the ^blog/(.*)$ rule means that the address starts with blog/, followed by any sequence of characters (for example, the address of a blog article). The parentheses capture that sequence, and $1 inserts it into the new URL.
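To see what the group captures, the same pattern can be tried in Python. This is only an illustration of the regular expression, not of how Apache processes the rule, and the function name blog_redirect_target is made up for the example:

```python
import re

def blog_redirect_target(path):
    """Return the new URL for a path under blog/, or None if the pattern does not match."""
    match = re.match(r"^blog/(.*)$", path)
    if match is None:
        return None
    # match.group(1) plays the role of $1 in the RewriteRule.
    return "http://blog.domain.com/" + match.group(1)

print(blog_redirect_target("blog/regex-for-seo"))  # http://blog.domain.com/regex-for-seo
print(blog_redirect_target("shop/ladders"))        # None
```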

"|" - vertical line. "OR" sign.

The vertical line is the OR operator, used when we need to list several options in the search. Let's say we are looking for keywords where the words "buy" or "purchase" occur:

buy|purchase

Or we want to see statistics for several sections - articles (/articles/) and press releases (/pr/):

^/(articles|pr)/
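A small Python sketch of the section filter, run against a handful of hypothetical paths. It shows that the alternatives inside the parentheses are tried only right after the leading slash:

```python
import re

section = re.compile(r"^/(articles|pr)/")

# Hypothetical page paths.
paths = ["/articles/regex", "/pr/2024-launch", "/products/pr/", "/press/"]
matched = [p for p in paths if section.search(p)]

print(matched)  # ['/articles/regex', '/pr/2024-launch']
```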

Or let's take another example. Suppose we want to close the folders “admin”, “login”, “register” and some others from being indexed by search engines.

To avoid touching the website code, you can do this with a few lines in .htaccess using the X-Robots-Tag HTTP header, which most search engines understand. One possible form, assuming Apache 2.4 with mod_headers enabled:

<If "%{REQUEST_URI} =~ m#^/(admin|login|register)/#">
Header set X-Robots-Tag "noindex, nofollow"
</If>

The best-known and most widely used search engines (Google, Yandex, Bing, DuckDuckGo) understand most directives that restrict indexing. But in some countries people use local search engines, which may not understand, for example, indexing instructions sent in HTTP headers.

Before optimizing a site for a particular location, you should find out what search engines local residents use and study the peculiarities of working with them.

"[ ]" - square brackets. Any of the listed characters.

You can list characters in square brackets and one of them may appear in the searched text. If the first character in this construction is "^" (hat/tick), then the expression works the other way around — the character should not match what is listed in brackets.

In order not to list some popular sequences, such as the entire alphabet or a series of numbers, you can use a range: 0-9 means the range from 0 to 9, a-c is the range of characters from "a" to "c".

Let's say I want to know how people found the site when looking for guidelines or instructions (articles whose titles start with "Top 10..." or "Top 15...").

^Top [0-9]+
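A short Python sketch of character classes and ranges. The titles are hypothetical; the point is that a pattern anchored with "^" must describe the line from its very first character:

```python
import re

titles = ["10 best ladders", "Top 15 SEO tools", "How to buy a ladder"]

# Titles that begin with a number.
starts_with_number = [t for t in titles if re.search(r"^[0-9]+", t)]
# Titles that begin with "Top" followed by a number.
starts_with_top_n = [t for t in titles if re.search(r"^Top [0-9]+", t)]

# "^" inside the brackets negates the class: runs of characters that are neither digits nor spaces.
non_digit_runs = re.findall(r"[^0-9 ]+", "Top 15")
# A range such as a-c lists the characters from "a" to "c".
a_to_c = re.findall(r"[a-c]", "abcdef")

print(starts_with_number)  # ['10 best ladders']
print(starts_with_top_n)   # ['Top 15 SEO tools']
print(non_digit_runs)      # ['Top']
print(a_to_c)              # ['a', 'b', 'c']
```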

"{ }" - curly brackets. Repeating a character multiple times.

Curly brackets are used to specify exactly how many times a character or group of characters should occur. If two numbers are indicated in brackets, separated by a comma, then this will be the interval "from and to".

For example, to find a zip code in text that is 6 digits long and starts with 14, you can use the following regular expression.

14[0-9]{4}

Here we have indicated 14, and then a sequence of numbers repeated 4 times, the total length will be 6.

More complex example:

www\.domain\.[a-z]{2,6}

This expression matches the main domain in any domain zone that is 2 to 6 letters long, including www.domain.ua and www.domain.travel.
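Both curly-bracket patterns can be tried in Python. The sample texts are invented, and the \b word boundary in the second zip-code pattern is an addition for illustration: without it, the pattern also matches six digits inside a longer number.

```python
import re

# Hypothetical text with one zip code and one long order number.
text = "Ship to 143000 or 190000; order no. 9914123456"

loose_zip = re.findall(r"14[0-9]{4}", text)       # also matches inside the order number
strict_zip = re.findall(r"\b14[0-9]{4}\b", text)  # only standalone 6-digit numbers

hosts = "www.domain.ua www.domain.travel www.domain.c www.domainXcom"
domains = re.findall(r"www\.domain\.[a-z]{2,6}", hosts)

print(loose_zip)   # ['143000', '141234']
print(strict_zip)  # ['143000']
print(domains)     # ['www.domain.ua', 'www.domain.travel']
```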

An even more complex example: we need data for 2-, 3- and 4-word queries separately. To do this, in the Google Analytics keyword report, we use a filter such as:

^[^\s]+(\s[^\s]+){2}$

The "\s" sequence means a whitespace character; it separates the words. Here, [^\s]+ says that the phrase must start with one or more non-space characters (the first word), and (\s[^\s]+) describes a space followed by another word.

The "space + word" group must occur exactly 2 times (the “( ){2}” construction). This is how we get a list of all 3-word phrases and the data on them. Replacing {2} with {1} or {3} gives the 2-word and 4-word phrases.
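The same filter can be generated for any phrase length. The helper n_word_pattern below is a hypothetical name, and the queries are sample data; it is a sketch of the idea rather than a Google Analytics feature:

```python
import re

def n_word_pattern(n):
    """Build a pattern that matches phrases of exactly n whitespace-separated words."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # One leading word, then (n - 1) repetitions of "space + word".
    return re.compile(r"^[^\s]+(\s[^\s]+){%d}$" % (n - 1))

queries = ["ladder", "buy ladder", "buy a ladder", "buy a ladder online"]
by_length = {
    n: [q for q in queries if n_word_pattern(n).search(q)]
    for n in (2, 3, 4)
}

print(by_length)
# {2: ['buy ladder'], 3: ['buy a ladder'], 4: ['buy a ladder online']}
```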

"\" is a backslash. Screening service characters.

Regular Expression syntax uses periods, question marks, and other characters that you may also want to search for as ordinary text. In this case, the backslash helps: placed before a special character, it screens (escapes) it, so the character is matched literally.

For example, to search for a dot, we screen it as "\."; the same works for other signs.

For example, in Google Analytics, we set up one of the goals — the use of internal search.

A person uses search if we see "/?q=" in the URL.

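In the goal settings it looks like this: "/\?q=". The question mark is escaped because it is a special character; the equals sign is not special and needs no backslash.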
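A Python sketch of why the escaping matters here. The URLs are hypothetical; re.escape is a standard-library helper that adds the backslashes for you:

```python
import re

urls = ["/?q=ladder", "/catalog/?q=steps", "/faq=1", "/q=1"]  # hypothetical URLs

unescaped = re.compile(r"/?q=")  # here "?" makes the slash optional
escaped = re.compile(r"/\?q=")   # here "\?" is a literal question mark

unescaped_hits = [u for u in urls if unescaped.search(u)]
escaped_hits = [u for u in urls if escaped.search(u)]

# re.escape builds a pattern that matches the given text literally.
literal = re.compile(re.escape("/?q="))
literal_hits = [u for u in urls if literal.search(u)]

print(unescaped_hits)  # all four URLs, including the false positives
print(escaped_hits)    # ['/?q=ladder', '/catalog/?q=steps']
print(literal_hits)    # ['/?q=ladder', '/catalog/?q=steps']
```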

There are other special characters in regular expressions; a complete list can be found on Wikipedia. But the ones above should be enough for the basic tasks of an SEO specialist.

Where can an SEO specialist use RegEx?

Google Analytics

Google Analytics is considered one of the main SEO tools.

Google Analytics supports regular expressions, which allows you to create more flexible definitions for filters, goals, segments, audiences, content groups, channel groups, and more.

Very often, analyzing the behavior and paths of users on the website helps to find new and effective SEO methods for website promotion. RegEx can be used to segment the most popular pages and then analyze the popularity of groups of pages.

For example, using RegEx to segment pages allows you to analyze traffic and bounce rate based on content types on a much larger scale than using traditional filters.

Google Search Console

Determining user intent is an important task for an SEO specialist, and regular expressions help to segment data by the main intent of users, that is, to determine the reason why someone is looking for something. This is an important component of any digital marketing strategy.

RegEx is most commonly used for branded and non-branded analysis. Once a RegEx pattern describes the queries to match, the data can be segmented in a couple of clicks.

RegEx patterns can be used to segment your audience based on what they were thinking and searching for when they found your website.

RegEx filters can also be used to split URLs into groups, so you can understand where the traffic is going and what is driving it. The intent of the customers who find a website is usually consistent with the pages they land on.

Rankings

RegEx can be used to segment ranking data based on page types for the highest ranked URL for a keyword.

Using the same RegEx patterns as in GSC, it is possible to analyze rankings by keyword segment, for example, how the site ranks in SERPs for branded and non-branded keywords.

Website audit in crawlers

RegEx can be used to create patterns that match a string or text. When auditing a site, it can be used to:

  • Segment crawled pages by URL pattern, which keeps crawl analysis manageable for large groups of pages on a corporate site.

  • Search for text on pages while crawling.

Log analysis

Regular expressions also help parse the log files that record how search bots crawl your site.

Log files are usually broken down and parsed based on the User-Agent for different search engine bots.

Since log files for large sites can contain a large number of pages, using RegEx patterns to segment crawled URLs simplifies general analysis and allows filtering based on complex criteria.
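As a rough sketch of this kind of analysis, the Python snippet below counts bot requests per top-level site section. Everything in it is an assumption made for illustration: the log lines are invented, they follow the common Apache "combined" format (a real server may log differently), and the list of bot names is deliberately short.

```python
import re
from collections import Counter

# Hypothetical log lines in Apache "combined" format.
log_lines = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /catalog/mobile/apple HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.11 - - [10/Oct/2023:13:56:01 +0000] "GET /blog/regex-for-seo HTTP/1.1" 200 7300 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
    '192.0.2.12 - - [10/Oct/2023:13:56:40 +0000] "GET /catalog/mobile/htc HTTP/1.1" 200 4100 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/118.0 Safari/537.36"',
    '192.0.2.10 - - [10/Oct/2023:13:57:12 +0000] "GET /catalog/mobile/htc/desirev HTTP/1.1" 200 6200 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]

# Request path and User-Agent from one log line.
line_re = re.compile(
    r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
# Search bots we care about (an incomplete, illustrative list).
bot_re = re.compile(r"(Googlebot|bingbot|YandexBot)", re.IGNORECASE)
# First folder of the path, e.g. "catalog" in /catalog/mobile/apple.
section_re = re.compile(r"^/([^/]+)/")

hits = Counter()
for line in log_lines:
    parsed = line_re.search(line)
    if parsed is None:
        continue  # line is not in the expected format
    bot = bot_re.search(parsed.group("agent"))
    if bot is None:
        continue  # ordinary visitor, not a search bot
    section = section_re.match(parsed.group("path"))
    key = (bot.group(1).lower(), section.group(1) if section else "(root)")
    hits[key] += 1

for (bot_name, section_name), count in sorted(hits.items()):
    print(bot_name, section_name, count)
```

With the sample lines above, the Chrome visit is skipped and the counter ends up with two Googlebot hits in "catalog" and one bingbot hit in "blog".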

Examples of using Regex in SEO tools

  1. An example of using regular expressions in Google Search Console

For example, we are looking for all mentions of coronavirus in queries:

(?i)([ck]ovid|coron[ao]\s?virus)

This query finds all matches, case-insensitive, of the following phrases: kovid, covid, coronavirus, coronovirus, corona virus, corono virus.
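Before pasting a pattern like this into a report, it can be checked against a few sample queries. The sketch below uses Python's re module, which accepts the same (?i) flag; Search Console uses its own regex engine, so treat this as an approximation. The queries are made up.

```python
import re

pattern = re.compile(r"(?i)([ck]ovid|coron[ao]\s?virus)")

queries = [
    "Covid symptoms",
    "kovid test",
    "corona virus news",
    "coronovirus map",
    "CORONAVIRUS",
    "corona beer",
    "virus scan",
]
matched = [q for q in queries if pattern.search(q)]

print(matched)
# ['Covid symptoms', 'kovid test', 'corona virus news', 'coronovirus map', 'CORONAVIRUS']
```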

Or we are looking for the queries users type to find a gastronomy establishment called Fairy House:

(?i)(cafe|restaurant|bar)\s(f[ae]ir[yi]|fe[ij]ri|fabulous)\s?(h[ao]u[sz]e*|how[sz]|house)

With this query, we find many variations of the name, from "bar fairy house" or "cafe fairy house" to "restaurant fairy house", taking into account various likely spelling errors.

The order of characters listed in brackets [ ] does not matter: in the example above, [sz] and [zs] match exactly the same characters. Order matters only inside a range, which must run from the lower character to the higher one (a-z, not z-a).

  2. An example of use in Google Analytics

For example, if you want to exclude statistics about visits to your site by your employees, you can set up a regular expression filter for the view, which will determine all the IP addresses of the company. 

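Let's say this is the IP range from 198.51.100.1 to 198.51.100.25. To avoid entering each of the 25 IP addresses, create a regular expression like ^198\.51\.100\.([1-9]|1[0-9]|2[0-5])$ to match the entire range. A looser pattern such as 198\.51\.100\.\d* is shorter, but it also matches addresses outside the range, for example 198.51.100.26 or 198.51.100.250.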
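A quick Python check of two candidate patterns for this range: a loose one that accepts any digits in the last octet, and a stricter one that spells out 1 to 25. The variable names are arbitrary, and the test addresses are chosen to sit on both sides of the range boundary:

```python
import re

loose = re.compile(r"198\.51\.100\.\d*")
strict = re.compile(r"^198\.51\.100\.([1-9]|1[0-9]|2[0-5])$")

ips = [
    "198.51.100.1",
    "198.51.100.25",
    "198.51.100.26",
    "198.51.100.250",
    "198.51.100.0",
]

loose_hits = [ip for ip in ips if loose.search(ip)]
strict_hits = [ip for ip in ips if strict.search(ip)]

print(loose_hits)   # all five addresses
print(strict_hits)  # ['198.51.100.1', '198.51.100.25']
```

The alternation covers the last octet in three parts: 1-9, 10-19 and 20-25.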

If you need a filter that only includes campaign data from two cities, you can create a regular expression like Dnipro|Kyiv (Dnipro or Kyiv).

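  3. Restricting the WordPress admin panel in .htaccess to your IP address only


Order Deny,Allow

Deny from all

Allow from 200.20.21.145


Here 200.20.21.145 stands for your IP address. The Order directive must be written without a space after the comma. The rules apply to the folder that contains the .htaccess file, for example, an .htaccess file placed in the wp-admin folder. These directives use the Apache 2.2 syntax; on Apache 2.4 without mod_access_compat, the equivalent is a single line: Require ip 200.20.21.145.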

  4. Highlighting non-branded search traffic

Let's say you have an online store called "goodshop.com". Through Google Analytics, you would like to separate search traffic for queries that do not contain your store name from branded search traffic.

Tracking the dynamics of changes in unbranded traffic is one way to evaluate the effectiveness of website SEO measures. 

To solve this problem, you can create a personalized report in Google Analytics with an assigned filter that will filter out branded queries. 

There can be a lot of different spellings for your store name (don't forget about typos and incorrect keyboard layout). Using a regular expression will eliminate the need to multiply the filter fields for each of the options.

Regular expression

goodshop|good shop|good-shop|пщщвырщз

Here пщщвырщз is what "goodshop" turns into when it is typed with the Russian keyboard layout switched on.

Special signs used:

| - the OR operator

When forming this regular expression, we simply list all the main possible queries related to the name of your store in order to exclude them from the report. And don't forget to set the match type to "Regular Expression" in the filter settings.
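The split itself can be sketched in Python. The queries are hypothetical, and the compact pattern good[ -]?shop is an optional shortening added for illustration; it is not part of the filter described above:

```python
import re

# Spelling variants of the store name, joined with the OR operator.
brand = re.compile(r"goodshop|good shop|good-shop", re.IGNORECASE)
# An equivalent, more compact form: optional space or hyphen between the words.
compact = re.compile(r"good[ -]?shop", re.IGNORECASE)

queries = [
    "goodshop discount",
    "Good Shop delivery",
    "buy a ladder",
    "good-shop reviews",
    "good ladder shop",
]

branded = [q for q in queries if brand.search(q)]
non_branded = [q for q in queries if not brand.search(q)]

print(branded)      # ['goodshop discount', 'Good Shop delivery', 'good-shop reviews']
print(non_branded)  # ['buy a ladder', 'good ladder shop']
```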

5. Choose a specific category of pages on the site

Sometimes, when studying statistics on the interaction of visitors with the content of the site, it is necessary to select a certain group of pages for analysis. 

For example, you may want to compare engagement rates for pages from a particular section of the catalog. Suppose we have a website selling various electronic gadgets.

The site has a section on mobile phones with a three-level hierarchy:

- 1st level — the main page of the subdirectory on mobile phones:

/catalog/mobile/

- 2nd level - brand pages that list the mobile phones of a particular brand:

/catalog/mobile/apple

/catalog/mobile/samsung

/catalog/mobile/htc

- 3rd level - directly product cards:

/catalog/mobile/apple/iphone5

/catalog/mobile/samsung/galaxys3

/catalog/mobile/htc/desirev

We need to highlight only the product pages in the Google Analytics content report.

Regular expression

/catalog/mobile/.+/.+

Special signs used:

. - stands for any single character: punctuation mark, letter, number.

+ - means that the previous character is repeated one or more times.

In this case, the combination of special characters .+ denotes any string consisting of at least one arbitrary character. With a clear site structure, we know that the product card URL consists of four fragments separated by a slash.

We only need to specify the first two of them explicitly in the regular expression, since we only need pages from the section on mobile phones.

At the same time, we know that the product card of a mobile phone certainly contains two more parts: brand and model. We set them by using two combinations of .+, separated by a slash.

Thus, we have defined a template for the address of the product card page, which we copy into the report filter field.
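A Python sketch of this template against the catalog URLs listed above. The stricter variant with [^/]+ and the extra /reviews path are assumptions added for illustration: since the dot also matches a slash, .+/.+ would accept deeper pages as well, if the site had any.

```python
import re

loose = re.compile(r"/catalog/mobile/.+/.+")
# Stricter variant: each fragment may contain anything except a slash.
strict = re.compile(r"^/catalog/mobile/[^/]+/[^/]+/?$")

paths = [
    "/catalog/mobile/",
    "/catalog/mobile/apple",
    "/catalog/mobile/apple/iphone5",
    "/catalog/mobile/htc/desirev",
    "/catalog/mobile/apple/iphone5/reviews",  # hypothetical deeper page
]

loose_hits = [p for p in paths if loose.search(p)]
strict_hits = [p for p in paths if strict.search(p)]

print(loose_hits)   # the two product cards and the deeper /reviews page
print(strict_hits)  # ['/catalog/mobile/apple/iphone5', '/catalog/mobile/htc/desirev']
```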

6. Tracking of actions and conversions on a website

Sometimes regular expressions can come in handy when setting goals in Google Analytics. 

Let's take as an example a website of a foreign language school that offers its students courses in four foreign languages. The site has an application form that visitors can submit indicating a foreign language to learn.

At the same time, visitors can choose more than one language as the desired language to learn. The project management sets itself the task of finding out how often visitors choose more than one language. 

Accordingly, we need to set up a goal in Google Analytics. After submitting the form, a page with one of the following URLs is displayed.

/order?lang=eng

/order?lang=eng&esp

/order?lang=eng&esp&ita

/order?lang=eng&esp&ita&fra

Obviously, in the goal settings, you need to specify a regular expression that will match the last three URLs.

Regular expression

/order\?lang=.{3}&

Special signs used:

{ } - in curly brackets, you can specify the number of repetitions of the previous character. Accordingly, the combination .{3} denotes a sequence of any three characters.

In our case, these three characters are the language code. Don't forget to escape the question mark with a backslash, since it is also a special character in regular expressions.

Thus, the regular expression matches all URLs that have three characters after the equals sign followed by an ampersand. These pages are displayed when more than one language is specified in the submitted form, which is exactly what needed to be tracked.
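The goal pattern can be verified against the four confirmation URLs with a few lines of Python (Python's re module is used here as a stand-in for the Google Analytics matcher):

```python
import re

goal = re.compile(r"/order\?lang=.{3}&")

urls = [
    "/order?lang=eng",
    "/order?lang=eng&esp",
    "/order?lang=eng&esp&ita",
    "/order?lang=eng&esp&ita&fra",
]
multi_language = [u for u in urls if goal.search(u)]

print(multi_language)
# ['/order?lang=eng&esp', '/order?lang=eng&esp&ita', '/order?lang=eng&esp&ita&fra']
```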

How to check if a regular expression is correct

In order to avoid errors when using regular expressions, you can test them before implementation by applying an advanced filter in any analytics reports.

You can also check regular expressions on a test set using special tools. We recommend using the Regex Pal website or the RegExp Tester browser extension.

For those who are just starting to learn regular expressions, as well as for experienced professionals who want to test their skills in using RegEx, we highly recommend visiting this website. Here you can practice writing regular expressions in a playful way.



