Введение:

Токенизация — это процесс преобразования абзаца/корпуса в небольшие части. Токенизация может быть выполнена с использованием библиотеки NLTK. Используя библиотеку NLTK, мы можем разбить абзац на слова или предложения. Это означает, что весь абзац делится на предложения в случае токенизации предложений. Принимая во внимание, что абзац можно разделить на слова с помощью токенизации слов.

Давайте посмотрим, как мы можем реализовать токенизацию с помощью библиотеки NLTK:

Шаг 1. Импорт библиотеки NLTK

import nltk # Library for language processing
nltk.download() # Opens up a NLTK downloader
nltk.download(‘punkt’) # Download the required package

Шаг 2. Назначение переменной для обработки языка

paragraph = “””Marvel’s Spider-Man[a] is a 2018 action-adventure game developed by Insomniac Games and published by Sony Interactive Entertainment. Based on the Marvel Comics superhero Spider-Man, it is inspired by the long-running comic book mythology and adaptations in other media. In the main storyline, the super-human crime lord Mister Negative orchestrates a plot to seize control of New York City’s criminal underworld. When Mister Negative threatens to release a deadly virus, Spider-Man must confront him and protect the city while dealing with the personal problems of his civilian persona, Peter Parker.Gameplay is presented from the third-person perspective with a primary focus on Spider-Man’s traversal and combat abilities. Spider-Man can freely move around New York City, interacting with characters, undertaking missions, and unlocking new gadgets and suits by progressing through the main story or completing tasks. Outside the story, the player is able to complete side missions to unlock additional content and collectible items. Combat focuses on chaining attacks together and using the environment and webs to incapacitate numerous foes while avoiding damage.Development of Spider-Man, the first licensed game by Insomniac in its then-22-year history, began in 2014 and took approximately four years. Insomniac was given the choice of using any character from Marvel’s catalog to work on; Spider-Man was chosen both for his appeal to the employees and the similarities in traversal gameplay to their previous game Sunset Overdrive (2014). The game design took inspiration from the history of Spider-Man across all media but Marvel Comics and Insomniac wanted to tell an original story that was not linked to an existing property, creating a unique universe (known as Earth-1048) that has since appeared in novels, merchandise, and comics.Spider-Man was released for the PlayStation 4 on September 7, 2018. The game received praise for its narrative, characterization, combat, and web-swinging traversal mechanics, although some criticized its open-world design for lacking innovation. A number of reviewers called it one of the best superhero games ever made, some comparing it favorably to the Batman: Arkham series. It won several year-end accolades from a variety of gaming publications, critics, and game award ceremonies. Spider-Man became one of the fastest-selling games of the year, one of the best-selling PlayStation 4 games of all time, and the fastest-selling superhero game in the United States. Spider-Man was followed by a story-based, three-part downloadable content, Spider-Man: The City that Never Sleeps, which was released monthly from October to December 2018. A Game of the Year edition was released in August 2019.””” 

Текст выше взят со страницы википедии Spider Man 2018 Video game.

Когда у нас есть корпус, мы можем выполнить его токенизацию. Как обсуждалось во введении, мы можем выполнить 2 типа токенизации.

Давайте сначала реализуем токенизацию предложения:

sentences = nltk.sent_tokenize(paragraph)

Мы видим, что синтаксис «sent_tokenize» используется для разбиения абзаца на предложения. Теперь давайте посмотрим, из чего состоят переменные предложения.

sentences
[“Marvel’s Spider-Man[a] is a 2018 action-adventure game developed by Insomniac Games and published by Sony Interactive Entertainment.”,
 ‘Based on the Marvel Comics superhero Spider-Man, it is inspired by the long-running comic book mythology and adaptations in other media.’, 
“In the main storyline, the super-human crime lord Mister Negative orchestrates a plot to seize control of New York City’s criminal underworld.”, 
‘When Mister Negative threatens to release a deadly virus, Spider-Man must confront him and protect the city while dealing with the personal problems of his civilian persona, Peter Parker.’, 
“Gameplay is presented from the third-person perspective with a primary focus on Spider-Man’s traversal and combat abilities.”,
 ‘Spider-Man can freely move around New York City, interacting with characters, undertaking missions, and unlocking new gadgets and suits by progressing through the main story or completing tasks.’, 
‘Outside the story, the player is able to complete side missions to unlock additional content and collectible items.’, 
‘Combat focuses on chaining attacks together and using the environment and webs to incapacitate numerous foes while avoiding damage.’, 
‘Development of Spider-Man, the first licensed game by Insomniac in its then-22-year history, began in 2014 and took approximately four years.’, 
“Insomniac was given the choice of using any character from Marvel’s catalog to work on; Spider-Man was chosen both for his appeal to the employees and the similarities in traversal gameplay to their previous game Sunset Overdrive (2014).”, 
‘The game design took inspiration from the history of Spider-Man across all media but Marvel Comics and Insomniac wanted to tell an original story that was not linked to an existing property, creating a unique universe (known as Earth-1048) that has since appeared in novels, merchandise, and comics.’, 
‘Spider-Man was released for the PlayStation 4 on September 7, 2018.’, ‘The game received praise for its narrative, characterization, combat, and web-swinging traversal mechanics, although some criticized its open-world design for lacking innovation.’, 
‘A number of reviewers called it one of the best superhero games ever made, some comparing it favorably to the Batman: Arkham series.’, ‘It won several year-end accolades from a variety of gaming publications, critics, and game award ceremonies.’,
 ‘Spider-Man became one of the fastest-selling games of the year, one of the best-selling PlayStation 4 games of all time, and the fastest-selling superhero game in the United States.’, 
‘Spider-Man was followed by a story-based, three-part downloadable content, Spider-Man: The City that Never Sleeps, which was released monthly from October to December 2018.’,
 ‘A Game of the Year edition was released in August 2019.’]

Как мы видим, переменная предложения теперь имеет целый абзац, разделенный на каждое предложение. Давайте посмотрим общее количество предложений в нашем абзаце.

len(sentences) # len(obj) to check the length of the object
18

Итак, в нашем абзаце 18 предложений.

Теперь давайте посмотрим, как реализовать токенизацию слов:

words = nltk.word_tokenize(paragraph) 

Мы используем синтаксис «word_tokenize», чтобы преобразовать абзац в слова.

Давайте проверим, что имеет переменное слово.

words
[‘Marvel’, “‘s”, ‘Spider-Man’, ‘[‘, ‘a’, ‘]’, ‘is’, ‘a’, ‘2018’, ‘action-adventure’, ‘game’, ‘developed’, ‘by’, ‘Insomniac’, ‘Games’, ‘and’, ‘published’, ‘by’, ‘Sony’, ‘Interactive’, ‘Entertainment’, ‘.’, ‘Based’, ‘on’, ‘the’, ‘Marvel’, ‘Comics’, ‘superhero’, ‘Spider-Man’, ‘,’, ‘it’, ‘is’, ‘inspired’, ‘by’, ‘the’, ‘long-running’, ‘comic’, ‘book’, ‘mythology’, ‘and’, ‘adaptations’, ‘in’, ‘other’, ‘media’, ‘.’, ‘In’, ‘the’, ‘main’, ‘storyline’, ‘,’, ‘the’, ‘super-human’, ‘crime’, ‘lord’, ‘Mister’, ‘Negative’, ‘orchestrates’, ‘a’, ‘plot’, ‘to’, ‘seize’, ‘control’, ‘of’, ‘New’, ‘York’, ‘City’, “‘s”, ‘criminal’, ‘underworld’, ‘.’, ‘When’, ‘Mister’, ‘Negative’, ‘threatens’, ‘to’, ‘release’, ‘a’, ‘deadly’, ‘virus’, ‘,’, ‘Spider-Man’, ‘must’, ‘confront’, ‘him’, ‘and’, ‘protect’, ‘the’, ‘city’, ‘while’, ‘dealing’, ‘with’, ‘the’, ‘personal’, ‘problems’, ‘of’, ‘his’, ‘civilian’, ‘persona’, ‘,’, ‘Peter’, ‘Parker’, ‘.’, ‘Gameplay’, ‘is’, ‘presented’, ‘from’, ‘the’, ‘third-person’, ‘perspective’, ‘with’, ‘a’, ‘primary’, ‘focus’, ‘on’, ‘Spider-Man’, “‘s”, ‘traversal’, ‘and’, ‘combat’, ‘abilities’, ‘.’, ‘Spider-Man’, ‘can’, ‘freely’, ‘move’, ‘around’, ‘New’, ‘York’, ‘City’, ‘,’, ‘interacting’, ‘with’, ‘characters’, ‘,’, ‘undertaking’, ‘missions’, ‘,’, ‘and’, ‘unlocking’, ‘new’, ‘gadgets’, ‘and’, ‘suits’, ‘by’, ‘progressing’, ‘through’, ‘the’, ‘main’, ‘story’, ‘or’, ‘completing’, ‘tasks’, ‘.’, ‘Outside’, ‘the’, ‘story’, ‘,’, ‘the’, ‘player’, ‘is’, ‘able’, ‘to’, ‘complete’, ‘side’, ‘missions’, ‘to’, ‘unlock’, ‘additional’, ‘content’, ‘and’, ‘collectible’, ‘items’, ‘.’, ‘Combat’, ‘focuses’, ‘on’, ‘chaining’, ‘attacks’, ‘together’, ‘and’, ‘using’, ‘the’, ‘environment’, ‘and’, ‘webs’, ‘to’, ‘incapacitate’, ‘numerous’, ‘foes’, ‘while’, ‘avoiding’, ‘damage’, ‘.’, ‘Development’, ‘of’, ‘Spider-Man’, ‘,’, ‘the’, ‘first’, ‘licensed’, ‘game’, ‘by’, ‘Insomniac’, ‘in’, ‘its’, ‘then-22-year’, ‘history’, ‘,’, ‘began’, ‘in’, ‘2014’, ‘and’, ‘took’, ‘approximately’, ‘four’, ‘years’, ‘.’, ‘Insomniac’, ‘was’, ‘given’, ‘the’, ‘choice’, ‘of’, ‘using’, ‘any’, ‘character’, ‘from’, ‘Marvel’, “‘s”, ‘catalog’, ‘to’, ‘work’, ‘on’, ‘;’, ‘Spider-Man’, ‘was’, ‘chosen’, ‘both’, ‘for’, ‘his’, ‘appeal’, ‘to’, ‘the’, ‘employees’, ‘and’, ‘the’, ‘similarities’, ‘in’, ‘traversal’, ‘gameplay’, ‘to’, ‘their’, ‘previous’, ‘game’, ‘Sunset’, ‘Overdrive’, ‘(‘, ‘2014’, ‘)’, ‘.’, ‘The’, ‘game’, ‘design’, ‘took’, ‘inspiration’, ‘from’, ‘the’, ‘history’, ‘of’, ‘Spider-Man’, ‘across’, ‘all’, ‘media’, ‘but’, ‘Marvel’, ‘Comics’, ‘and’, ‘Insomniac’, ‘wanted’, ‘to’, ‘tell’, ‘an’, ‘original’, ‘story’, ‘that’, ‘was’, ‘not’, ‘linked’, ‘to’, ‘an’, ‘existing’, ‘property’, ‘,’, ‘creating’, ‘a’, ‘unique’, ‘universe’, ‘(‘, ‘known’, ‘as’, ‘Earth-1048’, ‘)’, ‘that’, ‘has’, ‘since’, ‘appeared’, ‘in’, ‘novels’, ‘,’, ‘merchandise’, ‘,’, ‘and’, ‘comics’, ‘.’,‘Spider-Man’, ‘was’, ‘released’, ‘for’, ‘the’, ‘PlayStation’, ‘4’, ‘on’, ‘September’, ‘7’, ‘,’, ‘2018’, ‘.’, ‘The’, ‘game’, ‘received’, ‘praise’, ‘for’, ‘its’, ‘narrative’, ‘,’, ‘characterization’, ‘,’, ‘combat’, ‘,’, ‘and’, ‘web-swinging’, ‘traversal’, ‘mechanics’, ‘,’, ‘although’, ‘some’, ‘criticized’, ‘its’, ‘open-world’, ‘design’, ‘for’, ‘lacking’, ‘innovation’, ‘.’, ‘A’, ‘number’, ‘of’, ‘reviewers’, ‘called’, ‘it’, ‘one’, ‘of’, ‘the’, ‘best’, ‘superhero’, ‘games’, ‘ever’, ‘made’, ‘,’, ‘some’, ‘comparing’, ‘it’, ‘favorably’, ‘to’, ‘the’, ‘Batman’, ‘:’, ‘Arkham’, ‘series’, ‘.’, ‘It’, ‘won’, ‘several’, ‘year-end’, ‘accolades’, ‘from’, ‘a’, ‘variety’, ‘of’, ‘gaming’, ‘publications’, ‘,’, ‘critics’, ‘,’, ‘and’, ‘game’, ‘award’, ‘ceremonies’, ‘.’, ‘Spider-Man’, ‘became’, ‘one’, ‘of’, ‘the’, ‘fastest-selling’, ‘games’, ‘of’, ‘the’, ‘year’, ‘,’, ‘one’, ‘of’, ‘the’, ‘best-selling’, ‘PlayStation’, ‘4’, ‘games’, ‘of’, ‘all’, ‘time’, ‘,’, ‘and’, ‘the’, ‘fastest-selling’, ‘superhero’, ‘game’, ‘in’, ‘the’, ‘United’, ‘States’, ‘.’, ‘Spider-Man’, ‘was’, ‘followed’, ‘by’, ‘a’, ‘story-based’, ‘,’, ‘three-part’, ‘downloadable’, ‘content’, ‘,’, ‘Spider-Man’, ‘:’, ‘The’, ‘City’, ‘that’, ‘Never’, ‘Sleeps’, ‘,’, ‘which’, ‘was’, ‘released’, ‘monthly’, ‘from’, ‘October’, ‘to’, ‘December’, ‘2018’, ‘.’, ‘A’, ‘Game’, ‘of’, ‘the’, ‘Year’, ‘edition’, ‘was’, ‘released’, ‘in’, ‘August’, ‘2019’, ‘.’]

word_tokenize преобразовал весь абзац в массив слов.

Давайте проверим количество слов в нашем абзаце

len(words) # len(obj) to check the length of the object
472

Итак, в нашем абзаце 472 слова.