Regex to detect domain name mismatches

LuisHermogenes · October 10, 2021, 1:56pm

Hi all,

I am trying to create a regular expression that would detect an exact mismatch of a domain name between source and target.

For instance, I have “sitename.com”, “sitename.de”, “sitename.com.ca”, and “sitename.com.mx”.
I can easily find mismatches for the domains that do not contain “sitename.com”.
For example, this detects if “sitename.de” is used in the source and not in the target:

Source: “(sitename.[a-z]{1,3})=1”
Target: -@1

However, because the last two (“sitename.com.ca” and “sitename.com.mx”) contain “.com”, Xbench considers them a match for “sitename.com”, even if I select “Match whole word” or use the “End of word” regex (e.g., “sitename.[a-z]{1,3}>”).

I’ve even tried creating a key term list with the domain names, but it seems Xbench treats the second period (“sitename.com.XX”) as the end of the word, so it thinks “sitename.com” is the full match.
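
The end-of-word behavior described above can be reproduced outside Xbench. This is only a rough sketch in Python, assuming that Python's `\b` word boundary behaves like Xbench's end-of-word anchor (that equivalence is an assumption, not something confirmed for Xbench):

```python
import re

# Assumption: \b stands in for Xbench's ">" end-of-word anchor.
pattern = re.compile(r"sitename\.[a-z]{1,3}\b")

match = pattern.search("Visit sitename.com.ca today")
matched_text = match.group(0)

# The period after "com" counts as a word boundary, so the match
# stops there instead of covering the whole domain.
print(matched_text)  # sitename.com
```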

Any help here? Thanks!

omartin · October 10, 2021, 9:49pm

Hi Luis,

I would use the following regex to find all those domain names:

Source: "(sitename((\.[a-z]+))+)=1"
Target: -@1

Search mode: regular expressions
Match Whole Word and PowerSearch: on.

By the way, replace sitename with the domain you want to find.
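
To see why the repeated group matters, here is a small Python sketch of the same pattern without the `=1` part, which is Xbench syntax. Python's regex engine is used only as a stand-in; Xbench's engine may differ in details:

```python
import re

# Same shape as the Xbench source pattern: "sitename" followed by one or
# more ".label" parts. The repetition is greedy, so it takes every label.
domain_re = re.compile(r"(sitename(\.[a-z]+)+)")

def find_domains(text):
    """Return every full domain found in text (hypothetical helper)."""
    return [m.group(1) for m in domain_re.finditer(text)]

print(find_domains("See sitename.com and sitename.com.mx."))
# ['sitename.com', 'sitename.com.mx']
```

The sentence-final period is not captured because no letters follow it.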

An alternative that catches all site names would be to change the source term to "([a-z0-9\-]{1,63}((\.[a-z]+))+)=1"

However, you may get too many false positives.
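
As a rough illustration of the false hits the generic pattern can produce, here is the same expression (minus the Xbench-specific `=1`) run in Python on a made-up sentence. The sample text is invented for this sketch:

```python
import re

# Generic pattern: any label of 1-63 characters followed by ".label" parts.
generic_re = re.compile(r"([a-z0-9\-]{1,63}((\.[a-z]+))+)")

sample = "Open readme.txt, e.g. from sitename.com."
hits = [m.group(1) for m in generic_re.finditer(sample)]

# File names and abbreviations look like domains to this pattern.
print(hits)  # ['readme.txt', 'e.g', 'sitename.com']
```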

I hope this helps.

Best regards,
Oscar.

LuisHermogenes · October 11, 2021, 8:17am

Hi Óscar,

Thanks for your quick reply! However, the issue is still happening.
It does not detect an issue when the source is “sitename.com” and the target is “sitename.com.ca”, for instance. I think this is because Xbench does not read “sitename.com.ca” as a single word but as three words (sitename, com, and ca), as it treats the period as a parsing character.
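
A minimal Python sketch of the symptom, assuming the target check behaves like a plain containment test (an assumption about Xbench, not documented behavior). The negative lookahead is just one possible way to make such a test stricter, not a known Xbench feature:

```python
import re

source_domain = "sitename.com"
target_text = "Visite sitename.com.ca para obtener ayuda."

# Plain containment: the shorter domain is "found" inside the longer one,
# so no mismatch would be reported.
naive_found = source_domain in target_text

# Stricter test: reject a hit that continues with another ".label".
strict_re = re.compile(re.escape(source_domain) + r"(?![.][a-z])")
strict_found = strict_re.search(target_text) is not None

print(naive_found, strict_found)  # True False
```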

What can we do?

omartin · October 11, 2021, 8:44am

The following search works fine:

Source: "(sitename(\.[a-z]+)+)=1"
target: -@1

Search mode: regular expressions
Match Whole Word and PowerSearch: on.

LuisHermogenes · October 11, 2021, 9:51am

Hi Oscar,

Sure, but I mean the other way around. It will detect when “sitename.com.ca” appears in the source and not in the target, but not the reverse.

I tried reversing the expression, and it works:

Source: -@1
target: “(sitename(.[a-z]+)+)=1”

Can you confirm whether this detects the same issues as the search you showed me above? I.e.:

Source: “(sitename(.[a-z]+)+)=1”
target: -@1

EDIT: Meaning, would it detect all issues of mismatches in source and target?

Thanks!

omartin · October 11, 2021, 10:19am

This search will detect all segments that contain a domain name in the source that is missing in the target.

You should create a checklist entry for each search.

LuisHermogenes · October 11, 2021, 10:42am

Hi Oscar.
Got it, I’ll create two searches, one for missing in source and one for missing in target.

Thanks!
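
For anyone checking the same thing outside Xbench, the two searches can be approximated in one pass. This is a hedged Python sketch that compares whole domain strings per segment; `domain_mismatches` is a hypothetical helper, and it does not reproduce Xbench's own matching rules:

```python
import re

# "sitename" plus one or more ".label" parts; no capture groups, so
# findall() returns the full matched domains.
DOMAIN_RE = re.compile(r"sitename(?:\.[a-z]+)+")

def domain_mismatches(source, target):
    """Return (domains only in source, domains only in target)."""
    src = set(DOMAIN_RE.findall(source))
    tgt = set(DOMAIN_RE.findall(target))
    return sorted(src - tgt), sorted(tgt - src)

print(domain_mismatches("Go to sitename.com now",
                        "Vaya a sitename.com.ca ahora"))
# (['sitename.com'], ['sitename.com.ca'])
```

The first list corresponds to the "missing in target" search and the second to the "missing in source" search.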
