MultiTerm - Key Term Mismatches - Reducing False Positive Results


#1

I have a client with a very large glossary. A Studio MultiTerm termbase has been created, and the file has been prepped so that if a source term has multiple target terms, each target term has its own row. As you can imagine, this creates a huge list of false positives. Is there a quick way to set up a Personal Checklist in ApSIC Xbench? What is the best way to do this? The glossary currently has 6K rows.
For example:
Example 1
health care | atención médica
health care | atención de salud
health care | cuidado de la salud
Example 2
discharge | 1. A flowing out or pouring forth; emission; secretion. | secreción
discharge | 1. A flowing out or pouring forth; emission; secretion. | supuración
discharge | 1. A flowing out or pouring forth; emission; secretion. | flujo
discharge | 1. A flowing out or pouring forth; emission; secretion. | exudación
discharge | 2. The point at which the patient leaves the hospital | alta hospitalaria
discharge | 2. The point at which the patient leaves the hospital | dar de alta

Also, if only part of a source key term appears in the text, it also comes up as a key term mismatch.

Any insight is appreciated!


#2

You can avoid this issue if you define these entries as synonyms when you prepare the MultiTerm term base.

For example, if in MultiTerm you have one concept with “health care” as English term and “atención médica”, “atención de salud”, and “atención de la salud” as synonyms, the Xbench Key Terms check will only flag an issue if none of the synonyms is found.
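
The effect of grouping can be sketched outside Xbench. The following Python snippet is only an illustration under stated assumptions: the function names are hypothetical, matching is a plain case-insensitive substring test, and it is not how Xbench or MultiTerm are implemented. It compares a row-by-row check with a check on one grouped entry, using the glossary rows from the first post.

```python
from collections import defaultdict

# Glossary rows as prepared in the question: one (source, target) pair per row.
rows = [
    ("health care", "atención médica"),
    ("health care", "atención de salud"),
    ("health care", "cuidado de la salud"),
]

def group_synonyms(rows):
    """Collect every target term that shares a source term into one entry."""
    concepts = defaultdict(list)
    for source, target in rows:
        concepts[source].append(target)
    return dict(concepts)

def key_term_mismatch(source_seg, target_seg, source_term, targets):
    """Flag only when the source term is present and none of the targets is."""
    if source_term.lower() not in source_seg.lower():
        return False
    return not any(t.lower() in target_seg.lower() for t in targets)

concepts = group_synonyms(rows)
seg_src = "Access to health care is limited."
seg_tgt = "El acceso a la atención de salud es limitado."

# One check per row: every row whose target is not the one used gets flagged.
per_row = [key_term_mismatch(seg_src, seg_tgt, s, [t]) for s, t in rows]

# One check per grouped entry: any of the synonyms satisfies it.
grouped = key_term_mismatch(seg_src, seg_tgt, "health care", concepts["health care"])

print(per_row, grouped)
```

Grouping by source term alone would merge the two senses of "discharge" from the question, so a real termbase would presumably keep one entry per concept (source term plus definition) rather than per source string.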


#3

Thank you for the prompt response!


#4

In the sample ApSIC QA report above, is there a way to avoid false positives when only part of the source term is being flagged? For example, Key Term Mismatch (health/salud) is reported when the source segment is actually "Health Care Ambassadors Program".

Hope this makes sense.


#5

This check cannot be performed via the key term mismatch check, because that check finds every segment that contains the source term but is missing the target term (including synonyms).

Checklists enable you to perform this check in the following way:

Source term: "health" -"Health Care Ambassadors Program"
Target term: -"salud"
Enable the Powersearch check box.

In this way, Xbench will evaluate those strings that contain “health” but will omit those that contain “Health Care Ambassadors Program”.

You can include synonyms in the target if a term such as “health care” is translated in different ways.

Source: "health care"
Target: -"atención médica" -"atención de salud" -"atención de la salud"
Enable the Powersearch check box.
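
To make the logic of these two checklist items easier to follow, here is a rough Python rendering. It is a sketch under assumptions, not Xbench code: the function names are made up, matching is treated as a case-insensitive substring test, and a leading minus is read as "does not contain".

```python
def contains(text, phrase):
    """Case-insensitive substring test (an assumption about the search mode)."""
    return phrase.lower() in text.lower()

def flag_health(source, target):
    # Source: "health" -"Health Care Ambassadors Program"   Target: -"salud"
    return (
        contains(source, "health")
        and not contains(source, "Health Care Ambassadors Program")
        and not contains(target, "salud")
    )

def flag_health_care(source, target):
    # Source: "health care"   Target: none of the listed synonyms is present
    synonyms = ["atención médica", "atención de salud", "atención de la salud"]
    return contains(source, "health care") and not any(
        contains(target, s) for s in synonyms
    )

print(flag_health("Your health matters", "Su bienestar importa"))
print(flag_health("Health Care Ambassadors Program", "Programa de Embajadores"))
print(flag_health_care("health care costs", "costos sanitarios"))
```

Several negated target terms side by side behave as "none of these is present", which is why listing every accepted synonym keeps a correct translation from being reported.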


#6

Checklists enable you to perform this check in the following way:

Source term: "health" -"Health Care Ambassadors Program"
Target term: -"salud"
Enable the Powersearch check box.

The only risk with this first approach is that segments containing both “health” and “Health Care Ambassadors Program” will be skipped.

It might be safer to check for it this way:

  • Source term: “health”
  • Target term: -"salud" -"atención médica" (or whatever the “health care” part of “Health Care Ambassadors Program” is translated as).
  • Enable the Powersearch check box.

The most efficient solution would probably be to add “Health Care Ambassadors Program” as a term in your key term list.
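
The difference between the two checklist variants can be sketched in Python. This is an illustration only: the function names are hypothetical, matching is assumed to be a case-insensitive substring test, and "atención médica" stands in for however the program name is actually translated.

```python
PROGRAM = "Health Care Ambassadors Program"

def contains(text, phrase):
    return phrase.lower() in text.lower()

def first_approach(source, target):
    # Source: "health" -"Health Care Ambassadors Program"   Target: -"salud"
    return (
        contains(source, "health")
        and not contains(source, PROGRAM)
        and not contains(target, "salud")
    )

def safer_approach(source, target):
    # Source: "health"   Target: -"salud" -"atención médica"
    return (
        contains(source, "health")
        and not contains(target, "salud")
        and not contains(target, "atención médica")
    )

# The program name and a separate "health" appear in the same segment,
# and neither is rendered in the target.
source = "The Health Care Ambassadors Program promotes health."
target = "El programa promueve el bienestar."

print(first_approach(source, target))  # skipped because of the source exclusion
print(safer_approach(source, target))  # reported
```

Note that the second variant still lets a segment through when the target contains the program translation but drops the standalone "salud", so it narrows the gap rather than closing it.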


#7

You are right, segments containing both “health” and “Health Care Ambassadors Program” will be skipped.

The following search using regular expressions as search mode and powersearch will work:

Source: "<health( [^c]|[:punct:])" OR "<health$"
Target: -"<salud>"

This search will display segments that comply with these two conditions:

  1. Source contains health at the end of the string, or followed by punctuation, or followed by any word that does not begin with c.

  2. Target does not contain salud (match whole word).
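
For readers who want to try the two conditions outside Xbench, here is an approximate Python translation. It rests on assumptions: Xbench's `<` and `>` word anchors are rendered as `\b`, `[:punct:]` is rendered as a character class built from `string.punctuation`, and matching is case-insensitive. Python's regex engine is not the one Xbench uses, so results may differ in edge cases.

```python
import re
import string

# Assumed stand-in for [:punct:].
PUNCT = "[" + re.escape(string.punctuation) + "]"

# Source: "<health( [^c]|[:punct:])" OR "<health$"
SOURCE = re.compile(r"\bhealth( [^cC]|" + PUNCT + r")|\bhealth$", re.IGNORECASE)

# Target: -"<salud>"  (whole word "salud" must be absent)
TARGET = re.compile(r"\bsalud\b", re.IGNORECASE)

def flagged(source, target):
    """True when the source condition matches and 'salud' is missing in target."""
    return SOURCE.search(source) is not None and TARGET.search(target) is None

samples = [
    ("Health Care Ambassadors Program", "Programa de Embajadores"),
    ("Mental health is important", "El bienestar mental es importante"),
    ("Protect your health.", "Proteja su salud."),
    ("Health Care Ambassadors Program improves health.", "El programa mejora."),
    ("health concerns", "preocupaciones"),
]
for src, tgt in samples:
    print(flagged(src, tgt), "|", src)
```

Because the exclusion is "next word does not begin with c", this rendering also passes over sources such as "health concerns" or "health centre", not only "Health Care Ambassadors Program".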


#8

In general, the following approach can be used to avoid false positives when a source word is included in another term, such as “health” in “Health Care Ambassadors Program”.

Source: "health" OR ("health>.* Health Care Ambassadors Program" OR "Health Care Ambassadors Program .* <health>")
Target: -"<salud>"

Use Regular Expressions as Search Mode and PowerSearch.

This search will display all segments that contain “health” OR both terms in source but “salud” is missing in target.

Firstly, Xbench looks for all segments that contain health in source and then evaluates if this term is preceded or followed by “Health Care Ambassadors Program”. Then, it displays all segments where “salud” is missing in target.


#9

I’ve been trying to achieve a very similar result: to have Xbench ignore occurrences of the source term when the context is different, like “health” when it occurs in “healthcare”/“health care”, or “check” in “checkbox”/“check box”, including:

  • cases when the term appears in both contexts (the one that should be checked and the one that should be ignored, i.e. both “health” and “healthcare” in one segment);
  • cases when a term appears in yet another context, but it’s not covered by the checklist, and therefore should be shown as error (e.g. “healthcentre”).

I tried the methods described here as well as my own ideas, and nothing seems to work as I’d like it to.

I don’t exactly understand the purpose of using < and > in the target here (as in: -"<salud>"), but in my case > can’t be used because the target terms may have inflected forms.

To describe the problem in the most detailed way possible, I gathered:

  • source and target (TXT) and bilingual (XLIFF) files;
  • detailed description of what the result should be and why (RTF);
  • Xbench checklist (XBCKL) with 5 different methods used (I actually tried many more options that didn’t work either…);
  • Xbench report with comments of what’s correct and what’s not, and which method was used.

All of the above is here: http://bit.ly/1RbW4Dm
ZIP password: Xb_2016

I’d appreciate any suggestions.


#10

@awijas, I believe that some of your checklist items have errors. For example, this one…

  • Source: "<check" OR -"checkbox|check box"
  • Target: "-<sprawd"

…means the following: “Show all segments that either have a word that starts with “check” OR all segments that do not contain “checkbox” or “check box”.”

Please note that this condition is also matched by all segments that contain “health”, because they do not contain “checkbox” or “check box” and they do not have a target word that starts with “sprawd”. You probably want to remove the OR in the source expression so that it becomes an AND condition (in the PowerSearch grammar, terms with no operator between them are combined with AND).

Also, I recommend using the Test function in the Checklist Manager: just select the checklist item, right-click and choose Test (or press Ctrl+T). It runs the selected checklist item against the files you have loaded in Xbench and shows the matching segments. This way you can verify whether the checklist item is effective or requires further tweaking. Especially for complex regular expressions, it is a good idea to test them individually against a sizeable corpus.

Also, be careful when using regular expressions, they are tricky, a double-edged sword. For example, the expression "<check$" actually means show all segments that end with the word “check” because $ means “end of segment”.
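
The difference between the OR and the AND reading of the source expression can be checked with a small Python sketch. It is only an approximation under assumptions: `<` is rendered as `\b`, matching is case-insensitive, and the function names are hypothetical.

```python
import re

def has(pattern, text):
    return re.search(pattern, text, re.IGNORECASE) is not None

def source_with_or(source):
    # "<check" OR -"checkbox|check box"
    return has(r"\bcheck", source) or not has(r"checkbox|check box", source)

def source_with_and(source):
    # "<check" -"checkbox|check box"   (no operator is read as AND)
    return has(r"\bcheck", source) and not has(r"checkbox|check box", source)

for src in [
    "Your health matters",
    "Check the settings",
    "Select the checkbox",
    "Check the checkbox",
]:
    print(source_with_or(src), source_with_and(src), "|", src)
```

With OR, every sample passes the source condition, including the one about "health". With AND, only "Check the settings" passes, and a segment that contains both a standalone "check" and "checkbox" is excluded.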


#11

@pcondal:

  1. I know that all the expressions are incorrect, but I haven’t managed to create a correct one. I also know about the Test function, but I just wanted to create a file showing my attempts so that maybe someone can indicate how to modify one of them to achieve the goal I want to achieve.
  2. If I change to AND as you suggest, it won’t catch segments that contain both “check” (without “box”) and “checkbox”/“check box” and at the same time don’t contain “sprawd”.
  3. I know the meaning of $ and other special characters, but it was something @omartin suggested above (as I indicated, I do not fully understand what’s the purpose of some of his suggestions), so I just wanted to check the result.

#12

@awijas, the recommended approach is the AND heuristic mentioned above, for legibility and maintainability, although since it is a heuristic there might be false negatives.

If you cannot afford the occasional false negatives, you can often dig a little bit more in Regex to find a fully deterministic solution (for one instance of check, not two) instead of a heuristic one. For example, in this case it would be:

  • Source : "<check$" or "<check[,\.]" or "<check [^b][^o][^x]"
  • Target: "-<sprawd"

Which can be read as: “Find ‘check’ at the end of the segment, or the word ‘check’ followed by a comma or period, or the word ‘check’ plus a space followed by something that is not ‘box’.”
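
It may help to try the source pattern on a few strings. The Python sketch below is an approximation under assumptions (`<` rendered as `\b`, case-insensitive matching, Python's regex engine instead of the one in Xbench). One thing it makes visible is that `[^b][^o][^x]` works position by position: it requires three characters after the space, the first not "b", the second not "o", the third not "x".

```python
import re

# "<check$" or "<check[,\.]" or "<check [^b][^o][^x]"
PATTERN = re.compile(r"\bcheck$|\bcheck[,.]|\bcheck [^b][^o][^x]", re.IGNORECASE)

def source_matches(source):
    return PATTERN.search(source) is not None

for src in [
    "Run the check.",          # check followed by a period
    "Check the settings",      # "the": t, h, e all pass
    "Select the check box",    # first character is "b"
    "Open the check book",     # first character is "b" as well
    "Check your inbox",        # second character is "o"
    "Please check it",         # only two characters after the space
]:
    print(source_matches(src), "|", src)
```

In this rendering "check box" is excluded as intended, but so are "check book", "check your" and a short tail such as "check it".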

In general checklist items will be heuristics that can be improved over time as false positives and false negatives are detected. Obviously the limits are the Regex and Powersearch search grammar, which can be very high if mastered, but often the outcome will be a heuristic.

If the heuristic capabilities provided by the search engine still fall short for some of your checks, there is always the option of writing an Xbench QA plugin in C++ or Delphi (samples here), whose flexibility can elevate the processing of the segment to even machine learning levels, while still leveraging the whole Xbench environment (parsing and Edit Source, for example).


#13

@pcondal, thanks, but, unfortunately, your version:

  • doesn’t cover all my options, e.g. “checkbox” without a space
  • doesn’t find all errors, e.g. finds nothing for “health”, even though I made the check exactly the same way as for “check”

I made some adjustments:

  • used [:punct:] to cover all punctuation, since not only commas or spaces may occur
  • tried to handle the version without a space as well, but with no luck
  • for both “check” and “health” I got 1 more correct result, and 1 that shouldn’t appear

I don’t want to go into plugins for such a (I thought it should be) simple issue.

My versions:
"<check$" OR "<check[:punct:]" OR "<check[:space:][^b][^o][^x]" OR "<check[^b][^o][^x]"
or
"<check$" OR "<check[:punct:]" OR "<check([:space:]?)[^b][^o][^x]"

See: bit.ly/1P7WCZR
Pass: Xb_2016


#14

I’m not sure I follow. I acknowledge I forgot about the “checkbox” exclusion, but doesn’t your solution below address it? Or are you referring to an issue with another specific checklist item in your checklist?

In any case, please note that, without resorting to Xbench QA plugins, the sophistication of your checks is ultimately limited by the Xbench search grammar, which is a combination of POSIX Regex in one search layer and Xbench PowerSearch on the layer on top.


#15

Sorry, I didn’t express myself clearly.

The problem is that my solution made Xbench:

  1. find more real errors: “Check book”, “Health centre” (it’s good)
  2. find more false positives: “Check box”, “Health care” (it’s bad - they shouldn’t be found)
  3. still not find some real errors: “Checkbook”, “Healthcentre” (it’s bad - they should be found)

You can compare the RTF file with the XLSX QA report (see link in my previous post) to see exactly what should (not) be found according to my intentions (in RTF) and what is (not) found with the use of my expressions (in XLSX).

I think regexps should be enough to achieve this, especially since most cases are already covered. Maybe some ORs are used incorrectly, or some parts of the expression should be enclosed in (), or there’s yet another simple answer?
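
The three outcomes listed above can be reproduced with a Python approximation of the second pattern from post #13. The sketch below also includes one possible alternative that excludes the unwanted continuation prefix by prefix. Everything here is an assumption for illustration: `<` is rendered as `\b`, `[:punct:]` and `[:space:]` as Python classes, the function names are hypothetical, and the alternative has not been tested in Xbench.

```python
import re
import string

PUNCT = "[" + re.escape(string.punctuation) + "]"

def reported_pattern(term, excluded):
    """Rendering of: <term$ OR <term[:punct:] OR <term([:space:]?)[^b][^o][^x]"""
    negated = "".join("[^%s]" % re.escape(ch) for ch in excluded)
    return re.compile(
        r"\b%s$|\b%s%s|\b%s\s?%s" % (term, term, PUNCT, term, negated),
        re.IGNORECASE,
    )

def prefix_pattern(term, excluded):
    """Candidate: term, optional space, then anything that is not `excluded`.

    For "box" the branches are: end of segment, [^b\\s], b([^o]|$), bo([^x]|$).
    """
    branches = ["$", r"[^%s\s]" % re.escape(excluded[0])]
    for i in range(1, len(excluded)):
        branches.append(
            "%s([^%s]|$)" % (re.escape(excluded[:i]), re.escape(excluded[i]))
        )
    return re.compile(r"\b%s ?(%s)" % (term, "|".join(branches)), re.IGNORECASE)

reported = reported_pattern("check", "box")
candidate = prefix_pattern("check", "box")

for src in ["check book", "check box", "checkbook", "checkbox", "check the check box"]:
    print(
        bool(reported.search(src)),
        bool(candidate.search(src)),
        "|",
        src,
    )
```

In the first pattern the optional space can match nothing, so the space itself is consumed by `[^b]` and "check box" slips through, while "checkbook" fails on `[^b]`. The candidate avoids both effects in this rendering, and `prefix_pattern("health", "care")` behaves the same way for "health care" versus "healthcentre". It handles only one excluded continuation per term and treats "check boxes" like "check box".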