MultiTerm- KeyTerm Mismatches - Reducing False positives results

Cony · February 19, 2016, 9:23pm

I have a client with a very large glossary. A Studio MultiTerm term base has been created, and the file has been prepped so that if a source term has multiple target terms, each target term has its own row. As you can imagine, this creates a huge list of false positives. Is there a quick way to program a Personal Checklist in ApSIC Xbench? What is the best way to do this? The glossary currently has 6K rows.
For example:
Example 1
health care | atención médica
health care | atención de salud
health care | cuidado de la salud
Example 2
discharge | 1. A flowing out or pouring forth; emission; secretion. | secreción
discharge | 1. A flowing out or pouring forth; emission; secretion. | supuración
discharge | 1. A flowing out or pouring forth; emission; secretion. | flujo
discharge | 1. A flowing out or pouring forth; emission; secretion. | exudación
discharge | 2. The point at which the patient leaves the hospital | alta hospitalaria
discharge | 2. The point at which the patient leaves the hospital | dar de alta

Also, if only part of a source key term appears in the text, it also comes up as a key term mismatch.

Any insight is appreciated!

pcondal · February 20, 2016, 12:11am

You can avoid this issue if you define these entries as synonyms when you prepare the MultiTerm term base.

For example, if in MultiTerm you have one concept with “health care” as English term and “atención médica”, “atención de salud”, and “atención de la salud” as synonyms, the Xbench Key Terms check will only flag an issue if none of the synonyms is found.
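To make the synonym logic concrete, here is a minimal Python sketch of the behaviour described above. It is not how Xbench or MultiTerm are implemented. The row data, the function names, and the plain case-insensitive substring matching are all assumptions made for illustration, and the grouping is keyed by source term only, whereas a real term base groups terms per concept.

```python
from collections import defaultdict

# Hypothetical glossary rows: one (source term, target term) pair per row,
# as in the flattened file described in the first post.
rows = [
    ("health care", "atención médica"),
    ("health care", "atención de salud"),
    ("health care", "cuidado de la salud"),
]


def group_synonyms(rows):
    """Collect every target term listed for the same source term."""
    grouped = defaultdict(list)
    for source, target in rows:
        grouped[source.lower()].append(target.lower())
    return dict(grouped)


def is_mismatch(source_seg, target_seg, term, synonyms):
    """Flag only if the source term is present and no synonym is in the target."""
    if term not in source_seg.lower():
        return False
    return not any(s in target_seg.lower() for s in synonyms)


glossary = group_synonyms(rows)
print(is_mismatch("Health care is a right.",
                  "La atención de salud es un derecho.",
                  "health care", glossary["health care"]))
```

With one row per target term, each row would be checked on its own and two of the three rows would report a mismatch for this segment. With the targets grouped as synonyms, the segment passes as soon as any one of them is found.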

Cony · February 20, 2016, 12:28am

Thank you for the prompt response!

Cony · February 24, 2016, 7:48pm

In the sample ApSIC QA report above, is there a way to avoid false positives when only part of the source term is flagged? For example, Key Term Mismatch (health / salud) is reported when the source segment is actually “Health Care Ambassadors Program”.

Hope this makes sense.

omartin · February 25, 2016, 7:27am

This check cannot be performed via the Key Term Mismatch check, because that check finds any segment that contains the source term while the target term (including synonyms) is missing.

Checklists enable you to perform this check in the following way:

Source term: "health" -"Health Care Ambassadors Program"
Target term: -"salud"
Enable the Powersearch check box.

In this way, Xbench will evaluate those strings that contain “health” but will omit those that contain “Health Care Ambassadors Program”.

You can include synonyms in the target if a term is translated in different ways, as is the case for “health care”:

Source: "health care"
Target: -"atención médica" -"atención de salud" -"atención de la salud"
Enable the Powersearch check box.
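As a rough illustration of how such a checklist item combines its conditions, the following Python sketch evaluates the same logic. It is not Xbench code. The function name, its parameters, and the case-insensitive substring matching are assumptions made for the example.

```python
def checklist_hit(source, target, source_all, source_none, target_none):
    """Return True when the segment would be reported by the item.

    source_all  -- strings that must all appear in the source
    source_none -- strings that must not appear in the source (the -"..." part)
    target_none -- strings that must not appear in the target (the -"..." part)
    """
    src, tgt = source.lower(), target.lower()
    return (
        all(s.lower() in src for s in source_all)
        and not any(s.lower() in src for s in source_none)
        and not any(t.lower() in tgt for t in target_none)
    )


# Source: "health" -"Health Care Ambassadors Program" / Target: -"salud"
print(checklist_hit(
    "Health is wealth.", "La sanidad es riqueza.",
    source_all=["health"],
    source_none=["Health Care Ambassadors Program"],
    target_none=["salud"],
))
```

Listing several negated strings in the target, as in the “health care” item, means the segment is reported only when none of the accepted translations is present.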

Raphael_Toussaint · February 26, 2016, 3:27pm

Checklists enable you to perform this check in the following way:

Source term: "health" -"Health Care Ambassadors Program"
Target term: -"salud"
Enable the Powersearch check box.

The only risk with this first approach is that segments containing both “health” and “Health Care Ambassadors Program” will be skipped.

It might be safer to check for it this way:

Source term: "health"
Target term: -"salud" -"atención médica" (or whatever the “health care” part of “Health Care Ambassadors Program” is translated as).
Enable the Powersearch check box.

The most efficient solution should be to add “Health Care Ambassadors Program” as a term in your Keyterm list.
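The risk with the first approach can be reproduced in a few lines of Python. This is only a sketch of the logic, assuming case-insensitive substring matching. The function name and the sample segments are made up for the example.

```python
def first_approach(source, target):
    # Source: "health" -"Health Care Ambassadors Program"
    # Target: -"salud"
    src, tgt = source.lower(), target.lower()
    return (
        "health" in src
        and "health care ambassadors program" not in src
        and "salud" not in tgt
    )


# Hypothetical segment: the program name and a standalone "health" both
# appear in the source, and "salud" is missing from the target.
source = "The Health Care Ambassadors Program promotes health in schools."
target = "El Programa de Embajadores promueve el bienestar en las escuelas."

print(first_approach(source, target))
print(first_approach("Health matters.", "El bienestar importa."))
```

The first call returns False: the exclusion on the program name removes the whole segment from the check, so the missing “salud” goes unreported. The second call returns True, because nothing in the source triggers the exclusion.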

omartin · February 26, 2016, 3:58pm

You are right, segments containing both “health” and “Health Care Ambassadors Program” will be skipped.

The following search using regular expressions as search mode and powersearch will work:

Source: "<health( [^c]|[:punct:])" OR "<health$"
Target: -"<salud>"

This search will display segments that comply with these two conditions:

Source contains health at the end of the string, or followed by punctuation or by any word that does not begin with c.
Target does not contain salud (match whole word).
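For readers who want to try the idea outside Xbench, here is an approximate Python rendering of the two conditions. It assumes that `<` and `>` mark word boundaries (approximated with `\b`) and that `[:punct:]` covers the characters in Python's `string.punctuation`. Neither assumption is confirmed in this thread, and the function name is made up.

```python
import re
import string

# Character class standing in for [:punct:] (an assumption, see above).
PUNCT = re.escape(string.punctuation)

# "<health( [^c]|[:punct:])" OR "<health$"
SOURCE = re.compile(rf"\bhealth(?: [^c]|[{PUNCT}])|\bhealth$", re.IGNORECASE)
# -"<salud>"
TARGET = re.compile(r"\bsalud\b", re.IGNORECASE)


def flagged(source, target):
    """True when the source condition holds and 'salud' is missing in target."""
    return bool(SOURCE.search(source)) and not TARGET.search(target)


print(flagged("Mental health, explained", "Bienestar mental, explicado"))
print(flagged("Health Care Ambassadors Program", "Programa de Embajadores"))
print(flagged("Health centre hours", "Horario del centro"))
```

The first segment is reported, and “Health Care Ambassadors Program” is skipped as intended. “Health centre hours” is skipped as well, because the exclusion applies to every following word that begins with c, not just to “Care”.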

omartin · February 26, 2016, 4:35pm

In general, the following approach can be used to avoid false positives when a source word is included in another term, such as “health” in “Health Care Ambassadors Program”.

Source: "health" OR ("<health>.* Health Care Ambassadors Program" OR "Health Care Ambassadors Program .* <health>")
Target: -"<salud>"

Use Regular Expressions as Search Mode and PowerSearch.

This search will display all segments that contain “health” OR both terms in source but “salud” is missing in target.

Firstly, Xbench looks for all segments that contain health in source and then evaluates if this term is preceded or followed by “Health Care Ambassadors Program”. Then, it displays all segments where “salud” is missing in target.

awijas · March 11, 2016, 1:43pm

I’ve been trying to achieve a very similar result: to have Xbench ignore occurrences of the source term when the context is different, like “health” when it occurs in “healthcare”/“health care”, or “check” in “checkbox”/“check box”, including:

cases when the term appears in both contexts (the one that should be checked and the one that should be ignored, i.e. both “health” and “healthcare” in one segment);
cases when a term appears in yet another context, but it’s not covered by the checklist, and therefore should be shown as error (e.g. “healthcentre”).

I tried the methods described here as well as my own ideas, and nothing seems to work as I’d like it to.

I don’t exactly understand the purpose of using < and > in the target here (as in: -"<salud>"), but in my case > can’t be used because the target terms may have inflected forms.

To describe the problem in the most detailed way possible, I gathered:

source and target (TXT) and bilingual (XLIFF) files;
detailed description of what the result should be and why (RTF);
Xbench checklist (XBCKL) with 5 different methods used (I actually tried many more options that didn’t work either…);
Xbench report with comments of what’s correct and what’s not, and which method was used.

All of the above is here: http://bit.ly/1RbW4Dm
ZIP password: Xb_2016

I’d appreciate any suggestions.

pcondal · March 11, 2016, 7:17pm

@awijas, I believe that some of your checklist items have errors. For example, this one…

Source: "<check" OR -"checkbox|check box"
Target: "-<sprawd"

…means the following: “Show all segments that either have a word that starts with ‘check’ or do not contain ‘checkbox’ or ‘check box’.”

Please note that this expression is also matched by all segments that contain “health”, because they do not contain “checkbox” or “check box” and they do not have a word that starts with “sprawd”. You probably want to remove the OR in the source expression so that it becomes an AND condition (in the Powersearch grammar, the absence of an operator means AND).
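The difference between the OR and the AND reading of the source expression can be checked with a short Python sketch. The helper names are made up, and `\b` is used as a stand-in for Xbench's `<`, which is an assumption.

```python
import re


def has_check_word(src):
    # "<check": a word that starts with "check"
    return re.search(r"\bcheck", src, re.IGNORECASE) is not None


def has_checkbox(src):
    # "checkbox|check box"
    return re.search(r"checkbox|check box", src, re.IGNORECASE) is not None


def source_with_or(src):
    # "<check" OR -"checkbox|check box"
    return has_check_word(src) or not has_checkbox(src)


def source_with_and(src):
    # "<check" -"checkbox|check box"   (no operator means AND)
    return has_check_word(src) and not has_checkbox(src)


for text in ["Health insurance", "Check the settings", "Select the check box"]:
    print(text, source_with_or(text), source_with_and(text))
```

With OR, “Health insurance” satisfies the source condition simply because it does not contain “checkbox” or “check box”. With AND, only “Check the settings” does.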

Also, I recommend using the Test function in the Checklist Manager: Just select the checklist item, right-click and choose Test (also Ctrl+T). It will run the selected checklist item against the files you have loaded in Xbench and it will show the matching segments. This way you will corroborate if the checklist item is effective or if it requires further tweaking. Especially for complex regular expressions, it is a good idea to test them individually against a sizeable corpus.

Also, be careful when using regular expressions: they are tricky, a double-edged sword. For example, the expression "<check$" actually means “show all segments that end with the word ‘check’”, because $ means “end of segment”.

awijas · March 14, 2016, 8:29am

@pcondal:

I know that all the expressions are incorrect, but I haven’t managed to create a correct one. I also know about the Test function, but I just wanted to create a file showing my attempts so that maybe someone can indicate how to modify one of them to achieve the goal I want to achieve.
If I change to AND as you suggest, it won’t catch segments that contain both “check” (without “box”) and “checkbox”/“check box” and at the same time don’t contain “sprawd”.
I know the meaning of $ and other special characters, but it was something @omartin suggested above (as I indicated, I do not fully understand the purpose of some of his suggestions), so I just wanted to check the result.

pcondal · March 14, 2016, 9:20am

@awijas, the AND heuristic mentioned above is the recommended approach for legibility and maintainability, although, since it is a heuristic, there might be false negatives.

If you cannot afford the occasional false negative, you can often dig a little deeper into Regex to find a fully deterministic solution (for one instance of check, not two) instead of a heuristic one. For example, in this case it would be:

Source : "<check$" or "<check[,\.]" or "<check [^b][^o][^x]"
Target: -"<sprawd"

Which can be read as: “Find ‘check’ at the end of the segment, or the word check followed by a comma or period, or the word check plus a space followed by something that is not ‘box’.”
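One detail worth testing is the third alternative: `[^b][^o][^x]` negates each of the three characters separately, which is not quite the same as “not followed by box”. The Python sketch below shows this on a few made-up samples. It uses `\b` as a stand-in for `<`, which is an assumption about the Xbench syntax.

```python
import re

# Python rendering of "<check [^b][^o][^x]".
PATTERN = re.compile(r"\bcheck [^b][^o][^x]", re.IGNORECASE)

samples = ["check box", "check mark", "check book", "check your", "check it"]
results = {s: bool(PATTERN.search(s)) for s in samples}

for sample, matched in results.items():
    print(f"{sample!r}: {matched}")
```

Only “check mark” matches. “check box” is excluded as intended, but so are “check book” (the next word starts with b), “check your” (its second letter is o), and “check it” (fewer than three characters follow the space).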

In general checklist items will be heuristics that can be improved over time as false positives and false negatives are detected. Obviously the limits are the Regex and Powersearch search grammar, which can be very high if mastered, but often the outcome will be a heuristic.

If the heuristic capabilities provided by the search engine still fall short for some of your checks, there is always the option of writing an Xbench QA plugin in C++ or Delphi (samples here), whose flexibility can elevate the processing of the segment to even machine learning levels, while still leveraging the whole Xbench environment (parsing and Edit Source, for example).

awijas · March 14, 2016, 11:15am

@pcondal, thanks, but, unfortunately, your version:

doesn’t cover all my options, e.g. “checkbox” without a space
doesn’t find all errors, e.g. finds nothing for “health”, even though I made the check exactly the same way as for “check”

I made some adjustments:

used [:punct:] to cover all punctuation, since not only commas or spaces may occur
tried to handle the version without a space as well, but with no luck
for both “check” and “health” I got 1 more correct result, and 1 that shouldn’t appear

I don’t want to go into plugins for such a (I thought it should be) simple issue.

My versions:
"<check$" OR "<check[:punct:]" OR "<check[:space:][^b][^o][^x]" OR "<check[^b][^o][^x]"
or
"<check$" OR "<check[:punct:]" OR "<check([:space:]?)[^b][^o][^x]"

See: bit.ly/1P7WCZR
Pass: Xb_2016

pcondal · March 14, 2016, 1:15pm

I’m not sure I follow. I acknowledge I forgot about the “checkbox” exclusion, but doesn’t your solution below address it? Or are you referring to an issue with another specific checklist item in your checklist?

In any case, please note that, without resorting to Xbench QA plugins, the sophistication of your checks is ultimately limited by the Xbench search grammar, which combines POSIX Regex in one search layer with Xbench Powersearch in the layer on top.

awijas · March 15, 2016, 11:15am

Sorry, I didn’t express myself clearly.

The problem is that my solution made Xbench:

find more real errors: “Check book”, “Health centre” (it’s good)
find more false positives: “Check box”, “Health care” (it’s bad - they shouldn’t be found)
still not find some real errors: “Checkbook”, “Healthcentre” (it’s bad - they should be found)

You can compare the RTF file with the XLSX QA report (see link in my previous post) to see exactly what should (not) be found according to my intentions (in RTF) and what is (not) found with the use of my expressions (in XLSX).

I think regexps should be enough to achieve this, especially since most cases are already covered. Maybe some ORs are used incorrectly or some parts of the expression should be enclosed in (), or there’s yet another simple answer?..
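The results listed above can be reproduced outside Xbench with a Python rendering of the second expression posted earlier. This is a sketch under assumptions: `\b` stands in for `<`, `\s` for `[:space:]`, and `string.punctuation` for `[:punct:]`. The lookahead variant is a hypothetical alternative written for Python's `re` module only, and this thread does not confirm that Xbench's regex grammar accepts lookaheads.

```python
import re
import string

PUNCT = re.escape(string.punctuation)

# "<check$" OR "<check[:punct:]" OR "<check([:space:]?)[^b][^o][^x]"
POSTED = re.compile(
    rf"\bcheck$|\bcheck[{PUNCT}]|\bcheck\s?[^b][^o][^x]", re.IGNORECASE
)

# Hypothetical alternative: "check" at the start of a word, unless it
# continues as "box" with or without a space in between.
LOOKAHEAD = re.compile(r"\bcheck(?!\s?box)", re.IGNORECASE)

samples = ["Check box", "Checkbox", "Check book", "Checkbook", "Check the checkbox"]
for text in samples:
    posted = bool(POSTED.search(text))
    lookahead = bool(LOOKAHEAD.search(text))
    print(f"{text!r}: posted={posted} lookahead={lookahead}")
```

In the posted expression, the optional space lets `[^b]` consume the space itself, so “Check box” matches (space, b, o), while “Checkbook” fails because b comes directly after “check”. The lookahead variant excludes “Check box” and “Checkbox”, still matches “Check book” and “Checkbook”, and also matches a segment that contains both a standalone “Check” and “checkbox”.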
