Issue with a non-greedy quantifier

Hi!

I have a source text that contains two variables in square brackets. I would like to match both of them, but only them, without the text in between. I have always used the “non-greedy” quantifier for this purpose, and I used it here as well:
\[(.*?)\]

I expected this to work, but it matches the same text as it would without the “?” (which makes the quantifier non-greedy). In other words, it behaves exactly like \[(.*)\]

As you can see, I’ve used the same regexp as in Xbench. Am I missing something, or is it a program issue? Can you help me solve this?

Thank you in advance for any comment on this!

Xbench uses POSIX Regex, which does not have the non-greedy operator found in the Perl or .NET flavors.

Therefore with this Regex flavor you must make your expression non-greedy like this:

\[[^\[]*\]
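
As a rough illustration outside Xbench, here is a minimal sketch in Python, whose re module is a Perl-style engine and therefore does accept .*? (unlike the POSIX flavor described above). The sample segment is made up for the example. It suggests that the negated character class finds the same matches as the lazy quantifier, while the plain greedy .* runs from the first [ to the last ].

```python
import re

# Hypothetical sample segment with two bracketed variables.
segment = "Open [file] and then [folder]"

greedy = re.findall(r"\[.*\]", segment)       # greedy star
negated = re.findall(r"\[[^\[]*\]", segment)  # negated class, as suggested above
lazy = re.findall(r"\[.*?\]", segment)        # lazy star (not available in POSIX)

print(greedy)   # ['[file] and then [folder]']
print(negated)  # ['[file]', '[folder]']
print(lazy)     # ['[file]', '[folder]']
```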

That said, with regard to highlighting, Regex mode currently highlights only the first match (i.e. the match that makes the segment part of the search results), although this will be changed in the future to show all matches in the segment.

Thank you very much for your help! This explains the issue. I’ve tested it and it works, thanks!

Hello!

I have the same problem. I want to check that anything between the <codeph> and </codeph> tags is left untranslated, so the regex I used was <codeph>.*?</codeph> (with a backslash \ before the < and > characters; I’m not sure why they are not shown here). But this is what I get:

Of course, I need it to look like the 2nd example in the screenshot. Thank you very much!

Kind regards,

Bogdan Dusa

The following regex search should suit your needs:

Source: "(\<codeph\>[^\<]*\</codeph\>)=1"
Target: @1
Search mode: Regular Expressions
PowerSearch: On. (Press Ctrl+P)

Hi Bogdan!

That is because in Xbench a star * is always greedy and you cannot make it lazy. Try this pattern instead:
\<codeph\>[^\<]*\</codeph\>
The difference is that between the tags you match everything that is not <, so the match cannot run on as far as possible; it stops as soon as a < appears.
PS. If you want a backslash to be visible when you write here, you have to type a double backslash (\\).
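
To make the difference concrete, here is a small sketch in Python (not Xbench; Python does not need the backslashes before < and >). The sample sentence is modeled on the kind of segment discussed in this thread and is only an assumption. With the greedy dot the whole span from the first opening tag to the last closing tag becomes one match; with the negated class each tag pair is matched separately.

```python
import re

# Hypothetical segment containing two <codeph> elements.
text = ("For example, <codeph>Report 2018 or Report 2019</codeph> returns the "
        "same results as <codeph>Report 2018, Report 2019</codeph>.")

greedy_matches = re.findall(r"<codeph>.*</codeph>", text)
class_matches = re.findall(r"<codeph>[^<]*</codeph>", text)

print(len(greedy_matches))  # 1
print(class_matches)
# ['<codeph>Report 2018 or Report 2019</codeph>',
#  '<codeph>Report 2018, Report 2019</codeph>']
```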

Hello!

Thank you for your explanations, I now understand the logic behind it. For some reason that I don’t understand, the combination you suggested (source = “(\<codeph\>[^\<]*\</codeph\>)=1”, target = @1) won’t work (a “No errors found!” message pops up), so I used the same code for both source and target: “\<codeph\>[^\<]*\</codeph\>”. My Xbench version is 3.0, Build 1498, 64-bit edition.

Anyway, this partially solves my problem, as I understand that for the time being Xbench is only able to highlight the first match, although a segment may contain several matches. Thank you in advance for letting us know when this issue is solved in a future Xbench update!

Kind regards,

Bogdan Dusa

I think the pattern worked but simply returned no hits. The reason is that you have to use the ‘minus expression’ at the beginning of the target. In the source you captured a group, for example <codeph>Report2018 or Report2019</codeph>, and by using only @1 in the target you asked Xbench to find that same text in the target. So it is not surprising that it returned no matches, as you most probably have all these instances translated in the file (at least that’s what I can tell from the screenshots). It would return matches if the phrases between the tags had been left untranslated. To get the opposite result, i.e. to display phrases between the tags that were needlessly translated, you have to use -"@1" in the target. So once again:
Source: "(\<codeph\>[^\<]*\</codeph\>)=1"
Target: -"@1"
Search mode: Regular Expressions
PowerSearch: On (Press Ctrl+P)
Ignore tags: Off
When you use this, you won’t have to bother with Xbench highlighting only the first match. It will show you all the instances where the text between <codeph> tags is translated, and it should be 100% accurate.
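
The idea behind the capture plus -"@1" check can be sketched in Python as follows. This is only a rough analogue of the behavior described in this post, not how Xbench is implemented; the function name and the sample segment pairs are made up for the illustration.

```python
import re

CODEPH = re.compile(r"<codeph>[^<]*</codeph>")

def kept_untranslated(source: str, target: str) -> bool:
    """True when every <codeph> span of the source appears verbatim in the target."""
    return all(span in target for span in CODEPH.findall(source))

# Hypothetical source/target pairs; the second one has the tagged text translated.
segments = [
    ("Run <codeph>Report 2018</codeph> now.", "Rulati <codeph>Report 2018</codeph> acum."),
    ("Run <codeph>Report 2019</codeph> now.", "Rulati <codeph>Raport 2019</codeph> acum."),
]

# Keep only the pairs where the tagged text is missing from the target.
flagged = [pair for pair in segments if not kept_untranslated(*pair)]
print(len(flagged))  # 1
```

Only the pair whose tagged text differs between source and target ends up in the report, which is the point of searching for the absence of the captured text.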

Hi, Kacper

Actually, it doesn’t :). It still displays only the first instance where the text between <codeph> tags is translated.

Nevertheless, what I need is exactly this: to highlight the text between the tags both in source and in target, so searching with -"@1" (i.e. searching for text that does not exist in the target) is not an option.
Building on your solution, I found that the Word wildcards feature matches my needs better, although still only partially. Here are the codes I used:
Source: (\<codeph\>[!\<]@\</codeph\>)
Target: @1
Search mode: MS Word Wildcard
PowerSearch: Off (or On, with quotation marks surrounding the codes in source and target)
And here’s the result:


So, with these settings, all instances in the source are displayed; the problem lies with the target text. Still, it’s the best option for the time being.
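
A side note on the wildcard syntax, with a hedged sketch: in Microsoft Word’s wildcard notation, [!x] means any character except x and @ means one or more of the preceding item, so [!\<]@ roughly corresponds to the regex [^<]+ rather than [^<]*. Assuming Xbench follows Word’s meaning here (I have not verified this), the only practical difference is that an empty tag pair is not matched. The Python below uses a made-up sample to show that difference.

```python
import re

# Regex counterparts, assuming Word's meaning of [!x] and @ applies in Xbench.
one_or_more = re.compile(r"<codeph>[^<]+</codeph>")   # like [!\<]@ in wildcard mode
zero_or_more = re.compile(r"<codeph>[^<]*</codeph>")  # like [^\<]* in regex mode

# Hypothetical sample with one empty and one non-empty tag pair.
sample = "<codeph></codeph> and <codeph>Report 2018</codeph>"

plus_matches = one_or_more.findall(sample)
star_matches = zero_or_more.findall(sample)

print(plus_matches)  # ['<codeph>Report 2018</codeph>']
print(star_matches)  # ['<codeph></codeph>', '<codeph>Report 2018</codeph>']
```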

Kind regards,

Bogdan

Hi Kacper,

I’m not sure why my Xbench version keeps giving me different results for the same codes…
Anyway, it works without “@1” in the target. Instead, I use the same code as in the source, with either Regex codes or MS Word Wildcards:
Option 1:
Source: (\<codeph\>[^\<]*</codeph\>)
Target: (\<codeph\>[^\<]*</codeph\>)
Search mode: Regular Expression
PowerSearch: Off
Option 2:
Source: (\<codeph\>[!<]@\</codeph\>)
Target: (\<codeph\>[!<]@\</codeph\>)
Search mode: MS Word Wildcard
PowerSearch: Off

Here’s the result:

Thank you very much for the support!

Kind regards,

Bogdan

Hi Bogdan,

What exactly do you mean by different results?

Hi Kacper,

Sorry, I think I misused one code in the target and it gave me different results. Here are all the tests I made:

Source: “(\<codeph\>[^\<]*</codeph\>)=1”
Target: @1
Search mode: Regular Expression
PowerSearch: On
Result:
No match found (“No errors found!” message pops-up).

Source: “(\<codeph\>[!\<]@</codeph\>)=1”
Target: @1
Search mode: MS Word Wildcard
PowerSearch: On
Result: Only the 1st instance in the source is highlighted, with unwanted text highlighted in the target (for illustration purposes, what is highlighted is in bold below).
For example, <codeph>Report 2018 or Report 2019</codeph> returns the same results as Report 2018, Report 2019.
De exemplu, <codeph>Raport 2018 sau Report 2019 returnează aceleaşi rezultate ca <codeph>Raport 2018, Raport 2019</codeph>.

Source: “(\<codeph\>[^\<]*</codeph\>)”
Target: @1
Search mode: Regular Expression
PowerSearch: On or Off
Result:
Error - Target term error: Undefined variable (@1)

Source: “(\<codeph\>[!\<]@</codeph\>)”
Target: @1
Search mode: MS Word Wildcard
PowerSearch: Off
Result:
No match found (“No errors found!” message pops-up).

Source: “(\<codeph\>[!\<]@</codeph\>)”
Target: @1
Search mode: MS Word Wildcard
PowerSearch: On
Result: Unwanted text highlighted in both source and target (strangely, the highlight in the target stops just before the digit 9)
For example, <codeph>Report 2018 or Report 2019</codeph> returns the same results as Report 2018, Report 2019.
De exemplu, <codeph>Raport 2018 sau Report 2019 returnează aceleaşi rezultate ca <codeph>Raport 2018, Raport 2019</codeph>.

Now, with the same codes in both source and target fields, Regular Expression or MS Word Wildcard, with quotation marks (for illustration purposes, I’m only showing the codes for Regular Expression, but the same goes for MS Word Wildcard):

Source: “(\<codeph\>[^\<]*</codeph\>)”
Target: “(\<codeph\>[^\<]*</codeph\>)”
Search mode: Regular Expression
PowerSearch: Off
Result:
No match found (“No errors found!” message pops-up).

Source: “(\<codeph\>[^\<]*</codeph\>)”
Target: “(\<codeph\>[^\<]*</codeph\>)”
Search mode: Regular Expression
PowerSearch: On
Result:
Only the 1st instance highlighted in both source and target, without any unwanted text being also highlighted:
For example, <codeph>Report 2018 or Report 2019</codeph> returns the same results as Report 2018, Report 2019.
De exemplu, <codeph>Raport 2018 sau Report 2019 returnează aceleaşi rezultate ca <codeph>Raport 2018, Raport 2019</codeph>.

With the same codes in both source and target fields, Regular Expression or MS Word Wildcard, without quotation marks (for illustration purposes, I’m only showing the codes for Regular Expression, but the same goes for MS Word Wildcard):

Source: (\<codeph\>[^\<]*</codeph\>)
Target: (\<codeph\>[^\<]*</codeph\>)
Search mode: Regular Expression
PowerSearch: On
Result:
Only the 1st instance highlighted in both source and target, without any unwanted text being also highlighted:
For example, <codeph>Report 2018 or Report 2019</codeph> returns the same results as Report 2018, Report 2019.
De exemplu, <codeph>Raport 2018 sau Report 2019 returnează aceleaşi rezultate ca <codeph>Raport 2018, Raport 2019</codeph>.

Source: (\<codeph\>[^\<]*</codeph\>)
Target: (\<codeph\>[^\<]*</codeph\>)
Search mode: Regular Expression
PowerSearch: Off
Result:
Exactly what I need - only the text between the <codeph> tags is highlighted:
Only the 1st instance highlighted in both source and target, without any unwanted text being also highlighted:
For example, <codeph>Report 2018 or Report 2019</codeph> returns the same results as Report 2018, Report 2019.
De exemplu, <codeph>Raport 2018 sau Report 2019</codeph> returnează aceleaşi rezultate ca <codeph>Raport 2018, Raport 2019</codeph>.

To sum up, the solution for highlighting only the text between those tags is to use the same codes in both source and target, in either Regular Expression or MS Word Wildcard mode, without quotation marks and with PowerSearch unchecked (off).

Kind regards,

Bogdan

Hi,

I do not use MS Word Wildcard mode, so I cannot comment on it. I use only regex mode in Xbench. Just a note: if you enclose the search text in quotation marks, like "text", you always have to use PowerSearch. So if you used quotation marks without PowerSearch, it returned nothing, and that is the expected behavior.

As for the -"@1" variant: if you use it, you will get only the segments where the text was translated in the target (or, literally, where the text between the tags differs between source and target), that is, the segments with an issue. But of course you won’t get any highlight in the target, because you are not matching anything in it; you are searching for the absence of something.

If you need to highlight both in source and target, indeed the only option is to use the same pattern in both source and target, without using capturing groups. So like this:
source: "\<codeph\>[^\<]*\</codeph\>"
target: "\<codeph\>[^\<]*\</codeph\>"
And if you use quotation marks, then indeed it will highlight only the first instance.

But if you do not use quotation marks (in fact you don’t need them here if you do not use capturing groups):
source: \<codeph\>[^\<]*\</codeph\>
target: \<codeph\>[^\<]*\</codeph\>
It highlights in both source and target and also it highlights both instances! See the screenshot:

The only problem (quite a big one…) is that it will highlight both translated and untranslated phrases. So if you generate such a report, someone will have to go through it manually and check which ones are correct and which are not.
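
For comparison, this “same pattern on both sides” approach can be imitated in Python. It is a sketch under assumptions, not Xbench’s behavior: the helper name is made up, the [[ ]] markers merely stand in for highlighting, and the segment pair is modeled on the example from this thread. Every tag pair is marked in both source and target, whether or not its content was translated.

```python
import re

CODEPH = re.compile(r"<codeph>[^<]*</codeph>")

def highlight(segment: str) -> str:
    """Wrap every <codeph>...</codeph> span in [[ ]] markers."""
    return CODEPH.sub(lambda m: "[[" + m.group(0) + "]]", segment)

# Segment pair modeled on the example in this thread.
source = ("For example, <codeph>Report 2018 or Report 2019</codeph> returns the "
          "same results as <codeph>Report 2018, Report 2019</codeph>.")
target = ("De exemplu, <codeph>Raport 2018 sau Report 2019</codeph> returneaza "
          "aceleasi rezultate ca <codeph>Raport 2018, Raport 2019</codeph>.")

marked_source = highlight(source)
marked_target = highlight(target)
print(marked_source)
print(marked_target)
```

Both segments get two marked spans each, so a reviewer still has to compare them by eye to spot the translated ones.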

Hi Kacper,

I’m not sure why you’re talking about translated and untranslated phrases, or tags that are different between the source and target. Correct me if I’m wrong, but I think there is no such thing, since the text between the tags is rendered by the same code: [^\<]*, i.e. any character (be it alpha- or non-alphanumeric), except for the < character, which occurs 0 or more times. In this way, Xbench does not distinguish between English or Romanian texts to flag that they are different. It just highlights a sequence of characters (which may be identical or not) between those tags.
Yep, what I need in the end is to have all the texts between the tags highlighted in order to go through each of them to check whether it has been accidentally translated or not.

Also, may I ask which version of Xbench you use? The interface looks different from the one I use (Xbench 3.0 Build 1498 64-bit Edition).

Kind regards,

Bogdan

Hi Bogdan,

Mine is 1490. So it seems yours is newer!

I understood that the text between the tags shouldn’t be translated, right?

So if I had such a task, I would approach it as follows: I would use the capturing-group option to capture the tags and the text in between, "(\<codeph\>[^\<]*</codeph\>)=1", and then look for the absence of the matched text in the target with -"@1". For example, if the source text is <codeph>do not translate</codeph> and it looks exactly the same in the target, the segment won’t be flagged in the report. And this is what I would want to achieve: keeping correctly handled segments out of the report, because otherwise I would still have to filter them out. It would flag only the incorrectly handled segments, i.e. those where the text between the tags WAS translated. And yes, if you just use [^\<]+ in both source and target, instead of capturing it in the source and requiring the same phrase in the target, it will highlight everything between the tags in both, no matter whether it differs from the source or not.

It’s just a matter of what you want to achieve. If you’d like to see only the incorrect segments, use the capturing group; if you’d like to see everything regardless of how it was handled, use the whole pattern with [^\<]* in both source and target.

You can read about capturing groups here: https://www.regular-expressions.info/ or here: https://www.rexegg.com/regex-capture.html