Regex with word boundary and anchors inside groups

Denis77 · March 26, 2021, 7:40am

Hello,

I am trying to find improperly translated segments where a number without thousand separators in source needs to have those thousand separators in target.

Examples:
(1) 10000 => 10 000
(2) 200000 => 200 000
(3) It costs 20.222222 per unit. => the rule must not match this segment because the number is after the decimal point.
(4) 10000m3 => 10 000m3

So, the following conditions must be met:

The first digit in source must not be preceded by , or . or another digit.
The first digit may be preceded by a symbol (e.g. currency or mathematical operators).
The last digit may be at a word boundary or followed by a unit of measurement / symbol without a space (i.e. no word boundary).
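
For illustration only, the three conditions above can be sketched as one pattern in Python's `re` syntax, with lookarounds standing in for the leading and trailing groups. This is not the regex dialect used in the expressions in this thread, and the names `UNSEPARATED`, `find_unseparated` and `with_separators` are hypothetical. The sketch assumes plain ASCII digits and a space as the thousand separator.

```python
import re

# A run of 4 or more ASCII digits that is
#  - not preceded by '.', ',' or another digit (lookbehind), and
#  - not followed by another digit (lookahead), so a letter, a symbol,
#    a space or the end of the segment may follow.
UNSEPARATED = re.compile(r"(?<![.,0-9])[0-9]{4,}(?![0-9])")

def find_unseparated(segment: str) -> list:
    """Return every digit run in the segment that lacks thousand separators."""
    return UNSEPARATED.findall(segment)

def with_separators(digits: str, sep: str = " ") -> str:
    """Format a digit run with the assumed separator, e.g. '10000' -> '10 000'."""
    return f"{int(digits):,}".replace(",", sep)

if __name__ == "__main__":
    for text in ["10000", "200000", "It costs 20.222222 per unit.", "10000m3"]:
        hits = find_unseparated(text)
        print(text, "=>", [with_separators(h) for h in hits])
```

Because the lookarounds consume no characters, the start-of-segment case and the "followed by a unit" case need no separate alternatives in this dialect.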

My initial expression was this (a similar string would work in .NET regular expressions):
“(^|[^.,0-9])([0-9]{1,3})=1([0-9]{3})=2(>|[:letter:]|[:symbol:])”
However, this expression fails in my tests, producing very few matches.

In the end I had to create 4 different ones to avoid the initial and final groups, and these expressions cover all the real errors:

“^([0-9]{1,3})=1([0-9]{3})=2>” – matches at start of segment, word boundary at end
“^.,0-9=1([0-9]{3})=2>” – matches at other places inside segment, word boundary at end
“^([0-9]{1,3})=1([0-9]{3})=2([:letter:]|[:symbol:])” – matches at start of segment, symbol or unit of measurement after the number
“^.,0-9=1([0-9]{3})=2([:letter:]|[:symbol:])” – matches at other places inside segment, symbol or unit of measurement after the number

As far as I understand, < or > cannot be used inside alternative groups by themselves, and ^ cannot be used by itself either.
Is there any way to create a single expression rather than 4 permutations?

Thank you!

Denis77 · March 26, 2021, 7:43am

It appears that the regex was cleaned up by the forum, so I am reiterating the regex code with spaces inserted to avoid clean-up:

" [ ^ \ . , 0-9 ] ([0-9])=1([0-9]{3})=2>" – matches at other places inside segment, word boundary at end
“[ ^ \ . , 0-9] ([0-9])=1([0-9]{3})=2([:letter:]|[:symbol:])” – matches at other places inside segment, symbol or unit of measurement after the number

pcondal · March 27, 2021, 4:25pm

You could start with this for sequences of 4 or more digits that are found both in the source and in the target:

Source: "(<[:digit:]{4,}>)=1" -"\.@1"
Target: @1
Regex: Enabled
PowerSearch: Enabled

This can be read as:

Find in the source text a sequence of 4 or more consecutive digits (which are captured in variable @1) that is not preceded by a dot and that also appears in the target text.
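
As a rough analogue of that reading, here is a minimal Python sketch of the same source/target check. It only approximates the checklist entry above: the function name `flag_segment` is hypothetical, the "not preceded by a dot" condition is expressed as a lookbehind, and the target test is assumed to be a plain substring search.

```python
import re

# 4+ consecutive digits delimited by word boundaries,
# with no dot immediately before the first digit.
SOURCE_NUMBER = re.compile(r"(?<!\.)\b[0-9]{4,}\b")

def flag_segment(source: str, target: str) -> list:
    """Return digit runs from the source that reappear unchanged in the target."""
    flagged = []
    for number in SOURCE_NUMBER.findall(source):
        # Assumption: the target check is a simple "contains" test.
        if number in target:
            flagged.append(number)
    return flagged

if __name__ == "__main__":
    print(flag_segment("Total: 10000 units", "Total: 10000 units"))
    print(flag_segment("Total: 10000 units", "Total: 10 000 units"))
```

A number followed directly by a unit, such as 10000m3, is not caught here because of the trailing word boundary, which is why a separate rule is needed for that case.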

If you have more specific cases, such as a particular unit of measurement (UoM) directly after the digits, I recommend that you add separate checklist items for each of them.

Denis77 · March 28, 2021, 8:02am

Hello,
Thank you very much!
Denis
