# Algorithm for Splitting Text With Groups of Text Separators

## Algorithm

The algorithm requires two connection graphs as input: one for the Group 1 characters and one for the Group 2 characters. How these graphs are obtained is described separately.

### Step 1: Calculate average sentence size (in tokens)

The average sentence size can be calculated by counting the distances (in tokens) between consecutive characters from Group 1. For example:

Input tokens:

Hello! My name is: Max. I live in Kiev. My age is 18 years. This summer i'll go to Paris! And what about you?


Group 1 characters: .!?

The distances (the measured span is shown in bold) would be:

• 1 - **Hello!** My name is: Max. I live in Kiev. My age is 18 years. This summer i’ll go to Paris! And what about you?
• 4 - Hello! **My name is: Max.** I live in Kiev. My age is 18 years. This summer i’ll go to Paris! And what about you?
• 4 - Hello! My name is: Max. **I live in Kiev.** My age is 18 years. This summer i’ll go to Paris! And what about you?
• 5 - Hello! My name is: Max. I live in Kiev. **My age is 18 years.** This summer i’ll go to Paris! And what about you?
• 6 - Hello! My name is: Max. I live in Kiev. My age is 18 years. **This summer i’ll go to Paris!** And what about you?
• 4 - Hello! My name is: Max. I live in Kiev. My age is 18 years. This summer i’ll go to Paris! **And what about you?**

The average size in this case would be: (1 + 4 + 4 + 5 + 6 + 4) / 6 = 4
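The calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: tokens are taken to be whitespace-separated words, a Group 1 character is only counted when it is the last character of a token, and the names `GROUP1`, `sentence_sizes`, and `average_sentence_size` are hypothetical helpers, not part of the original algorithm description.

```python
GROUP1 = ".!?"  # Group 1 characters from the example


def sentence_sizes(text, group1=GROUP1):
    """Return the token counts between consecutive Group 1 characters.

    Assumption: tokens are whitespace-separated, and a Group 1 character
    closes a span only when it is the last character of a token.
    """
    sizes = []
    count = 0
    for token in text.split():
        count += 1
        if token[-1] in group1:
            sizes.append(count)
            count = 0
    return sizes


def average_sentence_size(text, group1=GROUP1):
    """Average of the distances; 0.0 if no Group 1 character is found."""
    sizes = sentence_sizes(text, group1)
    return sum(sizes) / len(sizes) if sizes else 0.0


text = ("Hello! My name is: Max. I live in Kiev. My age is 18 years. "
        "This summer i'll go to Paris! And what about you?")
print(sentence_sizes(text))         # [1, 4, 4, 5, 6, 4]
print(average_sentence_size(text))  # 4.0
```

Note that with this simple tokenization an abbreviation such as "U.S." would also close a span, so the average is only a rough estimate; Step 2 is what decides whether such a character really ends a sentence.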

### Step 2: Splitting sentences

The following rules are applied to each splitter character from Group 1 in the text:

1. If the distance (in tokens) from this splitter to the previous or to the next Group 1 character exceeds the average sentence size by more than 20%, this splitter character is used for splitting sentences; otherwise, go to the next check.
2. Calculate the distance to the connection graphs:
    1. Create a one-character connection graph consisting of the splitter character and the first character of the next token.
    2. Calculate the distances between the one-character connection graph and the Group 1 and Group 2 connection graphs.
    3. If the one-character graph is closer to the Group 1 connection graph, this character is used for sentence splitting.
    4. If the one-character graph is closer to the Group 2 connection graph, this character is NOT used for sentence splitting.
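The decision logic of these rules can be sketched as follows. The text does not specify how the distance between connection graphs is measured, so this sketch takes the two distances as already computed inputs. The function name `is_sentence_split`, its parameters, and the handling of a tie (treated here as "not a split") are assumptions made for illustration only.

```python
def is_sentence_split(prev_size, next_size, avg_size,
                      dist_to_group1, dist_to_group2, margin=0.20):
    """Decide whether a Group 1 character ends a sentence.

    prev_size / next_size: distance in tokens to the previous / next
        Group 1 character.
    avg_size: average sentence size from Step 1.
    dist_to_group1 / dist_to_group2: distances from the one-character
        connection graph to the Group 1 / Group 2 connection graphs
        (how they are computed is outside this sketch).
    margin: the 20% "magic number" from rule 1.
    """
    # Rule 1: a neighbouring span exceeds the average by more than the margin.
    threshold = avg_size * (1 + margin)
    if prev_size > threshold or next_size > threshold:
        return True
    # Rule 2: compare distances to the two connection graphs.
    # Assumption: on a tie the character is not treated as a splitter.
    return dist_to_group1 < dist_to_group2


# Rule 1 fires: 6 tokens is more than 4 * 1.2 = 4.8
print(is_sentence_split(6, 4, 4.0, 0.9, 0.1))  # True
# Rule 1 does not fire; the graph is closer to Group 2
print(is_sentence_split(4, 4, 4.0, 0.9, 0.1))  # False
```

Keeping `margin` as a parameter makes it easy to experiment with values other than 20%.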

By applying the two steps above to each Group 1 character in the text, the text can be split into sentences. Applying the algorithm to the following example input:

... He was born in the U.S.S.R. during the second world war...

allows this sentence to be split correctly.

Here are two examples from real books where a Group 1 character is recognized as not being used for sentence splitting:

• … by U.S. federal laws and …
• … the U.S. unless a copyright …

## Research problems for this step

• The magic number: is 20% the right value?
• There should be a single formula that gives the probability that a particular Group 1 character is used for sentence splitting. Once such a formula exists, the remaining question is how to find the correct threshold.