Segmenting text by direct-speech sentences

Suppose I have a docx file like this:

When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?". My father sat beside me, hugging my shoulders with both of his arms. I said "I Would.". My father replied "That is my boy!"

I want to segment the docx by direct-speech sentences, like this:

sent1 : He said, "Son when you grow up would you be the savior of the broken?"

sent2 : I said "I Would."

sent3 : My father replied "That is my boy!"

I tried using a regex. This is the result:

When I was a young boy my father took me into the city to see a marching band.

He said, "Son when you grow up would you be the savior of the broken?

".

My father sat beside me, hugging my shoulders with both of his arms.

I said "I Would.

".

My father replied "That is my boy!

The regex code:

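import re

# Assumption: python-docx is installed. A .docx is a zip archive, so
# open('text.docx', 'r') cannot read it as plain text.
from docx import Document

SENTENCE_REGEX = re.compile(r'[^!?\.]+[!?\.]')

# Join the paragraph texts into one string; findall() needs a str, not a file object.
text = " ".join(p.text for p in Document('text.docx').paragraphs)

def parse_sentences(text):
    return [x.lstrip() for x in SENTENCE_REGEX.findall(text)]

def print_sentences(sentences):
    print("\n\n".join(sentences))

if __name__ == "__main__":
    print_sentences(parse_sentences(text))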

python regex text-segmentation

Syafiqur Rahman 02.09.2018

comment

"I tried using regex." With what code? - CertainPerformance 02.09.2018

Answers (1)


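import re

txt = '''When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?" My father sat beside me, hugging my shoulders with both of his arms. I said "I Would." My father replied "That is my boy!"'''

# Group 1: the sentence-ending ., ? or !
# Group 2: an optional closing quote, followed by one whitespace character.
pttrn = re.compile(r'(\.|\?|\!)(\'|\")?\s')

# Keep the punctuation and the quote; replace the whitespace with a blank line.
new = re.sub(pttrn, r'\1\2\n\n', txt)

print(new)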

Output:

When I was a young boy my father took me into the city to see a marching band.

He said, "Son when you grow up would you be the savior of the broken?"

My father sat beside me, hugging my shoulders with both of his arms.

I said "I Would."

My father replied "That is my boy!"

PS: As far as I know, endings such as ?"., .". or !". are not allowed in English.
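If the source text does contain the endings mentioned in the PS, one option is to strip the extra period before splitting. The following is only a sketch: the helper name drop_stray_periods and the sample string are made up for illustration and are not part of the answer above.

```python
import re

# Hypothetical helper: matches ., ? or ! followed by a closing quote and then
# a stray period, and keeps only the punctuation and the quote.
STRAY_PERIOD = re.compile(r'([.?!]["\'])\.')

def drop_stray_periods(text):
    """Remove a period that directly follows a closing quote such as ?" or ."."""
    return STRAY_PERIOD.sub(r'\1', text)

if __name__ == "__main__":
    sample = 'He said, "Would you?". I said "I Would.". He replied "Good!"'
    print(drop_stray_periods(sample))
```

After this clean-up, the pattern from the answer sees a closing quote followed by whitespace and can break the line there.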

Zilong Li 02.09.2018

comment

And how do I group it by direct-speech sentences? - Syafiqur Rahman; 02.09.2018

comment

@SyafiqurRahman lst = output.slip(\n\n) - Zilong Li; 02.09.2018

comment

Slip? Did you mean split? When I tried it, there was no attribute slip. - Syafiqur Rahman; 02.09.2018

comment

@SyafiqurRahman Yes, I meant split. Sorry for the typo. You can split a string into a list of strings on any delimiter. - Zilong Li; 03.09.2018
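Putting the comments together, a possible way to get only the direct-speech sentences is to split on the blank-line delimiter and keep the pieces that contain a double quote. This is a sketch under assumptions: the function name direct_speech_sentences is hypothetical, quotes are assumed to be straight double quotes, and a quotation that itself contains several sentences would still be split in the middle.

```python
import re

# Same idea as the answer's pattern: a sentence ends with ., ? or !,
# optionally followed by a closing quote, and then whitespace.
SENTENCE_END = re.compile(r'([.?!]["\']?)\s+')

def direct_speech_sentences(text):
    """Split text into sentences and keep those that contain a double quote."""
    separated = SENTENCE_END.sub(r'\1\n\n', text)
    sentences = separated.split('\n\n')
    return [s.strip() for s in sentences if '"' in s]

if __name__ == "__main__":
    txt = ('When I was a young boy my father took me into the city to see a '
           'marching band. He said, "Son when you grow up would you be the '
           'savior of the broken?" My father sat beside me, hugging my '
           'shoulders with both of his arms. I said "I Would." '
           'My father replied "That is my boy!"')
    for i, sentence in enumerate(direct_speech_sentences(txt), start=1):
        print(f"sent{i} : {sentence}")
```

Filtering on the quote character is a deliberately simple rule; text with typographic quotes or nested quotations would need a different check.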
