ApSIC Xbench Forum

Regex with word boundary and anchors inside groups

Hello,

I am trying to find improperly translated segments where a number written without thousands separators in the source must have thousands separators in the target.

Examples:
(1) 10000 => 10 000
(2) 200000 => 200 000
(3) It costs 20.222222 per unit. => the rule must not match this segment because the number is after the decimal point.
(4) 10000m3 => 10 000m3

So, the following conditions must be met:

  • The first digit in source must not be preceded by a comma, a period, or another digit.
  • The first digit may be preceded by a symbol (e.g. a currency sign or a mathematical operator).
  • The last digit may be at a word boundary or followed by a unit of measurement / symbol without a space (i.e. no word boundary).

My initial expression was this (a similar string would work in .NET regular expressions):
“(^|[^.,0-9])([0-9]{1,3})=1([0-9]{3})=2(>|[:letter:]|[:symbol:])”
However, this expression fails in my tests, producing very few matches.
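
For comparison, the conditions above can be sketched as a single pattern in an engine that supports lookarounds. The example below uses Python's `re` module, not Xbench syntax, so it does not show what Xbench itself accepts. The names `UNSEPARATED` and `find_unseparated` are made up for illustration, and the final lookahead is a simplification: it only requires that no further digit follows, which covers a word boundary, a unit, or a symbol.

```python
import re

# Assumed translation of the conditions into Python's `re` syntax (not Xbench).
UNSEPARATED = re.compile(
    r"(?<![.,0-9])"   # not preceded by a period, a comma, or another digit
    r"([0-9]{1,3})"   # leading group of 1 to 3 digits
    r"([0-9]{3})"     # trailing group of exactly 3 digits
    r"(?![0-9])"      # no further digit follows (boundary, unit, or symbol)
)


def find_unseparated(text):
    """Return (leading, trailing) digit groups for each 4 to 6 digit number."""
    return [m.groups() for m in UNSEPARATED.finditer(text)]


if __name__ == "__main__":
    for sample in ("10000", "200000", "It costs 20.222222 per unit.", "10000m3"):
        print(sample, "->", find_unseparated(sample))
```

With lookarounds, the start-of-segment case and the word-boundary case no longer need their own alternatives, which is why one pattern is enough here.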

In the end I had to create 4 different expressions to avoid the initial and final groups, and together these expressions cover all the real errors:

  1. “^([0-9]{1,3})=1([0-9]{3})=2>” – matches at start of segment, word boundary at end
  2. ^.,0-9=1([0-9]{3})=2>” – matches at other places inside segment, word boundary at end
  3. “^([0-9]{1,3})=1([0-9]{3})=2([:letter:]|[:symbol:])” – matches at start of segment, symbol or unit of measurement after the number
  4. ^.,0-9=1([0-9]{3})=2([:letter:]|[:symbol:])” – matches at other places inside segment, symbol or unit of measurement after the number

As far as I understand, < or > cannot be used inside alternative groups by themselves, and ^ cannot be used by itself either.
Is there any way to create a single expression rather than 4 permutations?

Thank you!

It appears that the forum cleaned up parts of the regex, so I am reposting the two affected expressions (items 2 and 4 above) with spaces inserted to avoid the clean-up:

  1. " [ ^ \ . , 0-9 ] ([0-9])=1([0-9]{3})=2>" – matches at other places inside segment, word boundary at end
  2. “[ ^ \ . , 0-9] ([0-9])=1([0-9]{3})=2([:letter:]|[:symbol:])” – matches at other places inside segment, symbol or unit of measurement after the number

You could start with this, which finds sequences of 4 or more digits that appear in both source and target:

  • Source: "(<[:digit:]{4,}>)=1" -"\.@1"
  • Target: @1
  • Regex: Enabled
  • PowerSearch: Enabled

This can be read as:

Find in the source text a sequence of 4 or more consecutive digits (which are captured in variable @1) that is not preceded by a dot and that also appears in the target text.
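
The logic of this checklist item can be approximated outside Xbench as follows. This is only a sketch in Python under assumptions: `flag_segment` and `DIGIT_RUN` are hypothetical names, `\b` stands in for the `<` and `>` word-boundary markers, and a plain substring test stands in for the `@1` lookup in the target. How Xbench evaluates these internally may differ.

```python
import re

# Assumed equivalent of "<[:digit:]{4,}>": a word-bounded run of 4+ digits.
DIGIT_RUN = re.compile(r"\b[0-9]{4,}\b")


def flag_segment(source, target):
    """Return digit runs from source that reappear unchanged in target."""
    flagged = []
    for match in DIGIT_RUN.finditer(source):
        digits = match.group(0)
        # Mirror of -"\.@1": skip runs that occur right after a dot in source.
        if "." + digits in source:
            continue
        # Mirror of "Target: @1": the same digits appear verbatim in target.
        if digits in target:
            flagged.append(digits)
    return flagged


if __name__ == "__main__":
    print(flag_segment("It costs 10000 units.", "Cela coûte 10000 unités."))
    print(flag_segment("It costs 10000 units.", "Cela coûte 10 000 unités."))
```

Because of the word boundaries, a number followed directly by a unit (such as 10000m3) is not caught by this sketch and would need its own rule.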

If you have more specific cases, such as a particular unit of measurement (UoM) directly after the digits, I recommend adding a separate checklist item for each of them.

Hello,
Thank you very much!
Denis