ApSIC Xbench Forum

Regular expression for checking split sentences


#1

Hello,

I was wondering if it is possible to use regular expressions to check for split sentences, for cases where a translator is instructed not to split them and does it anyway. Basically, I want to check whether the target has the same number of sentences as the source.

Example:
Source segment: There is an incredibly long sentence in the source language which would be better split into several shorter sentences in the target language but for whatever reason, the translator mustn’t do so. (1 sentence)

Target segment: Short sentence one. Short sentence two. Short sentence three. (3 sentences)

The desired checker would flag this example as an error. Perhaps it would be possible to use a regexp that checks the number of periods, but that might lead to false positives when there are ordinals in the target segment.

Thank you,
Kate


#2

Hi Kate,

This check would require developing a plugin.

On the Xbench GitHub page there is a plugin sample for Visual Studio C++ that has two functions:

  1. Show all segments that have a suspicious length (source text too long compared to target or vice-versa)
  2. Show the 3 longest target strings

Visual Studio is required to open and compile the C++ project.

Oscar.


#3

Perhaps you can search for a sequence of period + space + uppercase as a heuristic.

As a starting point, you could flag segments where the source does not have such a sequence and the target does. You could be missing cases where the source has two sentences and the target has three, but I think that is unlikely because your segmentation rules will probably prevent the source from having two sentences.

If ordinals in languages such as German produce too many false positives, perhaps you could refine it by searching for “not digit” + period + space + uppercase.
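
To try the heuristic above outside Xbench, here is a rough Python sketch. It is only an illustration of the idea, not Xbench syntax: the names `has_boundary` and `is_suspicious` are made up for this example, and it assumes you can get each segment's source and target as plain strings.

```python
import re

# Candidate boundary: a period, whitespace, then any word character.
# Uppercase is checked in Python so that letters such as "Č" or "Ž" count too.
PLAIN = re.compile(r"\.\s+(\w)")
# Refined variant: the period must not come right after a digit,
# which skips ordinals such as "3. Mai".
NO_DIGIT = re.compile(r"(?<!\d)\.\s+(\w)")


def has_boundary(text, skip_digits=False):
    """Return True if text contains period + space + uppercase letter."""
    pattern = NO_DIGIT if skip_digits else PLAIN
    return any(m.group(1).isupper() for m in pattern.finditer(text))


def is_suspicious(source, target, skip_digits=False):
    """Flag a segment whose target has a boundary while its source has none."""
    return (not has_boundary(source, skip_digits)
            and has_boundary(target, skip_digits))


if __name__ == "__main__":
    src = "There is one long sentence in the source."
    tgt = "Short sentence one. Short sentence two. Short sentence three."
    print(is_suspicious(src, tgt))
```

Like the regex itself, this will still flag a proper name that follows an abbreviation, so treat the hits as candidates to review rather than confirmed errors.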


#4

Thank you both for your replies.

I think the plugin option is not exactly what I was looking for, as the segment lengths would not differ much; only the number of sentences within the segment might be different. It is also way beyond my tech skills :)

pcondal’s suggestion could work well for my intended purpose. Excluding digits might not be necessary because in Czech, my TL, uppercase is used only at the beginning of a sentence and for names/titles.

Thank you again for your ideas!

Edit: I’ve tested pcondal’s idea with the following sequence and it works perfectly on my test file.
Target: \.[:sep:][:letter:]%
Mode: RegEx, Case sensitive checked.