Regular expression for checking split sentences

tomika · March 13, 2019, 6:52pm

Hello,

I was wondering whether regular expressions can be used to detect split sentences when a translator has been instructed not to split them and does it anyway. Basically, I want to check that the target has the same number of sentences as the source.

Example:
Source segment: There is an incredibly long sentence in the source language which would be better split into several shorter sentences in the target language but, for whatever reason, the translator mustn’t do so. (1 sentence)

Target segment: Short sentence one. Short sentence two. Short sentence three. (3 sentences)

The desired checker would flag this example as an error. Perhaps a regexp could compare the number of periods in source and target, but that might lead to false alerts when the target segment also contains ordinals.

Thank you,
Kate

omartin · March 14, 2019, 9:06am

Hi Kate,

This check would require developing a plugin.

On the Xbench GitHub page there is a sample plugin for Visual Studio C++ that has two functions:

Show all segments that have a suspicious length (source text too long compared to target or vice-versa)
Show the 3 longest target strings

Visual Studio is required to open and compile the C++ project.

Oscar.

pcondal · March 15, 2019, 10:34am

Perhaps you can search for a sequence of period + space + uppercase as a heuristic.

A starting point could be to flag segments where the source does not contain such a sequence and the target does. You could miss cases where the source has two sentences and the target has three, but I think that is unlikely, because your segmentation rules will probably prevent the source from having two sentences.

If ordinals in languages such as German produce too many false positives, perhaps you could refine it by searching for “not digit” + period + space + uppercase.
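
Outside Xbench, the same heuristic can be tried in a few lines of Python. This is only a rough sketch, not Xbench syntax: the function name `has_sentence_break` and the `skip_ordinals` flag are made up for illustration, and the uppercase test relies on Python's `str.isupper()` rather than a regex character class.

```python
import re

# Period, one or more whitespace characters, then a word character.
# Whether that character is uppercase is checked afterwards in Python.
BREAK = re.compile(r"\.\s+(\w)")

# Refinement: the period must not be preceded by a digit ("3. Oktober").
BREAK_NO_DIGIT = re.compile(r"(?<!\d)\.\s+(\w)")


def has_sentence_break(text, skip_ordinals=False):
    """Return True if text contains period + space + uppercase letter."""
    pattern = BREAK_NO_DIGIT if skip_ordinals else BREAK
    return any(m.group(1).isupper() for m in pattern.finditer(text))
```

With `skip_ordinals=True`, a German segment such as "Am 3. Oktober war es kalt." should no longer be reported, while a genuinely split segment still is.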

tomika · March 15, 2019, 7:12pm

Thank you both for your replies.

I think the plugin option is not exactly what I was looking for, as the segments' length would not differ much; only the number of sentences within the segment might be different. It is also way beyond my tech skills.

pcondal’s suggestion could work well for my intended purpose. Excluding digits might not be necessary because in Czech, my TL, uppercase is used only at the beginning of a sentence and for names/titles.

Thank you again for your ideas!

Edit: I’ve tested pcondal’s idea with the following sequence and it works perfectly on my test file.
Target: \.[:sep:][:letter:]%
Mode: RegEx, Case sensitive checked.
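
For anyone who wants to run a similar check on exported segment pairs outside Xbench, here is a minimal Python sketch of the "source has no such sequence, target does" rule. It is an approximation under my own assumptions, not a translation of the Xbench expression above: `count_breaks` and `flag_split_segments` are hypothetical helper names, and the input is assumed to be a plain list of (source, target) string pairs.

```python
import re

# Period, whitespace, then a word character; uppercase is checked in Python.
SENTENCE_BREAK = re.compile(r"\.\s+(\w)")


def count_breaks(text):
    """Count period + space + uppercase sequences inside one segment."""
    return sum(
        1 for m in SENTENCE_BREAK.finditer(text) if m.group(1).isupper()
    )


def flag_split_segments(pairs):
    """Return indices of pairs whose target has a break but whose source has none."""
    return [
        i
        for i, (source, target) in enumerate(pairs)
        if count_breaks(source) == 0 and count_breaks(target) > 0
    ]


pairs = [
    ("There is an incredibly long sentence in the source language.",
     "Short sentence one. Short sentence two. Short sentence three."),
    ("One sentence.", "Jedna věta."),
]
flagged = flag_split_segments(pairs)  # indices of suspicious segments
```

Like the search itself, this is a heuristic: names and titles after an abbreviation can still trigger it, so flagged segments need a manual look.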
