Hello,
I am trying to find improperly translated segments where a number without thousand separators in source needs to have those thousand separators in target.
Examples:
(1) 10000 => 10 000
(2) 200000 => 200 000
(3) It costs 20.222222 per unit. => the rule must not match this segment because the digits come after the decimal point.
(4) 10000m3 => 10 000m3
So, the following conditions must be met:
- The first digit in source must not be preceded by a comma, a period, or another digit.
- The first digit may be preceded by a symbol (e.g. a currency sign or a mathematical operator).
- The last digit may be at a word boundary or followed by a unit of measurement / symbol without a space (i.e. no word boundary).
My initial expression was this (a similar string would work in .NET regular expressions):
“(^|[^.,0-9])([0-9]{1,3})=1([0-9]{3})=2(>|[:letter:]|[:symbol:])”
However, this expression fails in my tests, producing very few matches.
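For comparison, here is a minimal sketch of how the conditions listed above might look as a single pattern in a conventional regex flavor. It uses Python's re module rather than the QA tool's own syntax, so the lookbehind and lookahead are an assumption about what a .NET-style engine would accept, and the names UNSEPARATED, find_unseparated_numbers and with_separator are made up for illustration. "Not followed by another digit" stands in for "word boundary, letter, or symbol".

```python
import re

# Assumption: Python/.NET-style syntax, not the QA tool's own flavor.
# (?<![.,0-9])  first digit is not preceded by a comma, period, or digit
# ([0-9]{1,3})  leading group of one to three digits
# ([0-9]{3})    final group of exactly three digits
# (?![0-9])     no further digit follows (boundary, letter, or symbol is fine)
UNSEPARATED = re.compile(r"(?<![.,0-9])([0-9]{1,3})([0-9]{3})(?![0-9])")


def find_unseparated_numbers(text):
    """Return the 4- to 6-digit numbers in text that lack a thousands separator."""
    return [m.group(0) for m in UNSEPARATED.finditer(text)]


def with_separator(text, sep=" "):
    """Insert sep between the two captured digit groups of every match."""
    return UNSEPARATED.sub(lambda m: m.group(1) + sep + m.group(2), text)


if __name__ == "__main__":
    for sample in ["10000", "200000", "It costs 20.222222 per unit.", "10000m3"]:
        print(repr(sample), find_unseparated_numbers(sample), repr(with_separator(sample)))
```

Like my expressions, this only covers numbers of four to six digits; longer numbers would need additional groups of three.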
In the end I had to create 4 different expressions to avoid the initial and final groups, and together these expressions cover all the real errors:
- “^([0-9]{1,3})=1([0-9]{3})=2>” – matches at start of segment, word boundary at end
- “[^.,0-9]([0-9]{1,3})=1([0-9]{3})=2>” – matches at other places inside segment, word boundary at end
- “^([0-9]{1,3})=1([0-9]{3})=2([:letter:]|[:symbol:])” – matches at start of segment, symbol or unit of measurement after the number
- “[^.,0-9]([0-9]{1,3})=1([0-9]{3})=2([:letter:]|[:symbol:])” – matches at other places inside segment, symbol or unit of measurement after the number
As far as I understand, < or > cannot be used on their own as an alternative inside a group, and neither can ^.
Is there any way to create a single expression rather than 4 permutations?
Thank you!