Today we may well type more than we talk, and an enormous amount of text data has been generated every minute since the start of the digital era. Data is the new oil (and even more valuable to those who know how to refine it). However, the human brain is not built for reading large volumes of text. Our field of vision is limited, which keeps a typical person from reading more than 200–300 words per minute. According to Guinness World Records, Maria Teresa Calderon can read more than 50,000 words per minute: 166 times faster than the rest of us (Maria could have finished this blog post by the time you reached the end of the first two lines of this paragraph). Wouldn't it be great if we could all outread Maria with a few lines of Python code? The answer is yes! All it takes is some help from the free software fastText and a little understanding of Python scripting.

The first task is basic natural language processing (NLP): applying a "text classification" method that assigns each row of text data to one of a set of predefined groups. For example, a bank may offer countless financial services that require a written statement from customers. A text classification module (classifier) can read these statements and assign each one to a banking service category. Either way, machines have to be trained to understand human language.

How do we start training the machine? In this post we will show a quick-and-dirty way to do it in Python. First, open a Python shell and load the following dependencies: pandas, fastText, and scikit-learn.

import pandas as pd
path_to_file = './data/consumer_complaints.csv'
df = pd.read_csv(path_to_file, usecols=[1,5], \
    dtype={'consumer_complaint_narrative': object})
df.dropna(inplace=True)

The US Consumer Finance Complaints data frame (df) contains two important columns. We study [product], which represents the available banking services, and teach our computer to recognize customer complaints from [consumer_complaint_narrative]. Before digging into this data frame, we should know how many products are mentioned in the complaints and how many complaints each product has.

from collections import Counter
counts = Counter(df['product'])
counts
# Counter({'Debt collection': 17552, 'Mortgage': 14919, 'Credit reporting': 12526, 'Credit card': 7929, 'Bank account or service': 5711, 'Consumer Loan': 3678, 'Student loan': 2128, 'Prepaid card': 861, 'Payday loan': 726, 'Money transfers': 666, 'Other financial service': 110})
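
The counts show that the classes are far from balanced, which is worth keeping in mind when reading the accuracy figure later on. A small sketch that turns the counts printed above into shares of the whole dataset (the numbers are copied from that output; the variable names are only illustrative):

```python
from collections import Counter

# Counts copied from the Counter output above
counts = Counter({
    'Debt collection': 17552, 'Mortgage': 14919, 'Credit reporting': 12526,
    'Credit card': 7929, 'Bank account or service': 5711,
    'Consumer Loan': 3678, 'Student loan': 2128, 'Prepaid card': 861,
    'Payday loan': 726, 'Money transfers': 666,
    'Other financial service': 110,
})

total = sum(counts.values())
# Print each product with its count and its share of all complaints
for product, n in counts.most_common():
    print(f'{product:<26}{n:>7}{n / total:>8.1%}')
print('total', total)  # total 66806
```

The largest class alone covers roughly a quarter of the data, while the smallest has only 110 complaints.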

Below are a few sample customer complaints about "Debt collection".

for complaint in df[df['product'] == 'Debt collection'].consumer_complaint_narrative.sample(3, random_state=1):
    print(complaint)
    print('________________________________________')
# Had 4 phone calls in one day to my cell phone about debt collecting.
# They are asking to talk to a XXXX XXXX ... ... Not me ... .Never heard of him. They got the wrong number! I keep explaining to them you got the wrong number and they get very rude!
# 
# ________________________________________
# My sister provided Hyundai Motor Finance my phone # while hers was not working. I received a call from their XXXX number and when advised my sister was not available and asked who was calling. Female declined to identify herself or her company. I advised that the cell phone being called belongs to me and they no longer have my permission to dial my number manually or via their automated dialer. Female then hung up on me. My sister took care of the past due payment ( was just an oversight ) and we assumed everything was good. payment rec by HMF on XXXX/XXXX/15. On XXXX/XXXX/15 I recevied another call from HMF. I had my sister call back and they advised her account current and no record of call. Advised could furnish proof of call and requested again that they no longer call me.Rep said nothing he can do. Sister requested supervisor who told her the only way they would guarantee no calls to my phone is if she revokes her permission for ANY phone contact. My sister said she still wants to be contacted, just not at my phone number. Supervisor said nothing he can do. Also said that there were no automated calls out and that the only way she would have been dialed after she revoked permission, would be a manual dial. I called back in and advised that it is my cell and I revoke authorization, that they absolutely were not to manually/auto call my number any further. Ultimately they were calling out to an unauthorized party about a current account ... manually. Intentional harrassment. I received another call today from the XXXX number and a 28 second long dead air voicemail. Sister called back in and of course they had no record of call and were unhelpful again. I called back in, they said i had to get in touch with consumer affair department and because I 'm not a customer, just an innocent third party that they are harrassing that I would have to communicate with them via mail. 
# Finally agreed to send me over via phone so solution could be expedited, but they kept just transferring me back to customer service. Eventually male rep told me that there was nothing I could do about the phone calls.
# 
# ________________________________________
# This item claimed by Action Collections on behalf of Benchmark Apartments from XX/XX/XXXX in the amount of {$2100.00} ( originally was {$1200.00} until I filed a dispute then was increased as of XX/XX/XXXX to {$2100.00} ) is inaccurate because the Benchmark Apartments charged me for inaccurate charges, falsified an important document and then immediately sent me to Action Collections with intentions of receiving illegitimate payment, whilst destroying my credit. This has went on for 7 years and has prevented me from having perfect credit with this one exception to my credit report. I worked with the XXXX as a mediator between myself and the XXXX as well spoke with Attorney General in XXXX, Id whom both agreed this collection was incredulous and a mistake. The XXXX were willing to bring amount owed to {$300.00} through XXXX, however Action Collections still had the collection and were completely unwilling to work with me. Action Collections shredded all the documents I submitted one which was an original move in/out document, which Benchmark falsified and I tried informing Action Collections they forced my signature and I had the original document, they asked me to send in all my receipts and that document. I made copies of everything but submitted in hopes they would see that XXXX never had intentions of performing a move out walk through to verify I had cleaned the apartment and previous damage to apartment before I moved in. XXXX went and had their own cleaning crew come in and then sent me an invoice for cleaning charges, inaccurate rent amount and miscellaneous charges that were never agreed to in contract. I had receipts to show I had cleaned the apartment in hopes of getting my deposit returned and they never met me after moving out. They immediately sent me to collections when I questioned the invoice that was sent. I am requesting the item be removed completely and as soon as possible from my credit report because it is affecting my life negatively. 
# Enclosed are the following attachments ( exhibits ) : ( Exhibit XXXX ) copies of receipts from cleaning services I paid for per my rental contract when moving out the XXXX apartments. I made several attempts to check out with a manager by calling and going down to office but was never returned any phone calls. Finally, I was called back by someone in their office only to obtain my address and was informed that the apartment had been rented. In addition was notified the apartment had been rented. This was important because under the condition I have professional cleaning services done, have proper walk out to show they were done and provide receipts, if the apartment rented prior to end of month I would not be responsible for the {$650.00} rent and I would receive my deposit back of {$300.00}. Instead of my deposit I received a lengthy itemized invoice for inaccurate charges and charging me rent, which was not the correct rent amount in the agreement, ( Exhibit XXXX ). The rent amount in invoice was inaccurate from rental contract ( Exhibit XXXX ). There were also charges for professional services I had already had paid for to have cleaned. I demanded to know why these charges were invoiced to me but never heard back from anyone and then after two months of no answer and or hanging up on me, I received a notice from Action Collections that I was to pay them and my account was turned over to them. I pleaded with Action Collections as well contacted XXXX because I felt I was handled wrongly and never had a chance to hear from Benchmark ( Exhibits XXXX, XXXX, XXXX ). In addition, XXXX XXXX sent a falsified move-in sheet ( Exhibit XXXX ) with my name at top and the signing manager at bot
# 
# ________________________________________

Now we move on to the modeling part. First, the data is split into two sets: a training set and a test set. The training set is used to teach the machine to understand human language, and the test set helps us measure how smart the machine has become.

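from sklearn.model_selection import train_test_split
# Assumed missing step: integer id per product, used later as category_id
df['category_id'] = df['product'].factorize()[0]
train, test = train_test_split(df, test_size=.2, random_state=9)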
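
For intuition, here is a rough pure-Python sketch of what a shuffled 80/20 holdout split does. It is not scikit-learn's implementation, and `holdout_split` is a hypothetical helper; it only assumes, as scikit-learn does for a fractional `test_size`, that the test set size is rounded up. The row count of 66,806 is the sum of the product counts shown earlier.

```python
import math
import random


def holdout_split(rows, test_size=0.2, seed=9):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = math.ceil(test_size * len(rows))  # test size is rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Stand-in for the complaints: one integer per row
train_rows, test_rows = holdout_split(list(range(66806)))
print(len(train_rows), len(test_rows))  # 53444 13362
```

Fixing the seed makes the split repeatable, which is the same role `random_state` plays in the call above.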

To handle the unstructured text data, I use fastText for text classification (it is not easy to install, but trust me, it's worth it!). With fastText, the format of the input data has a huge impact on how the machine learns, so let's convert the input data into the format fastText requires.

def make_fastText_input(df, fname):
    """Write one '__label__<id> <text>' line per complaint to fname."""
    texts = df.consumer_complaint_narrative.values
    cat_ids = df.category_id.values
    with open(fname, 'w', encoding='utf-8') as f:
        for cat_id, text in zip(cat_ids, texts):
            # Drop newlines so that each complaint stays on a single line
            f.write('__label__' + str(cat_id) + ' ' + text.replace('\n', '') + '</s>\n')
path_to_input = './data/customer_conplaint_fastText_input'
make_fastText_input(df=train, fname=path_to_input)

The input file ('./data/customer_conplaint_fastText_input') should look something like the following.

__label__2 My mortgage is owned by XXXX and serviced by PNC Mortgage. My Loan # XXXX. I have been working with PNC for XXXX years now to get a Making home affordable loan modification to save my home from forclosure with no success. I have reviewed all the information and based on guidelines should qualify. PNC has been giving me the run around. I have given them all needed documents and they close my file and make me start over. They keep blaming it on the investor guidelines buy wont allow me to speak with anyone from XXXX. I am never given clear denial reasons. Please help me save my home from forclosure. I am stressing out over this long process that leads no where. I have once again submitted another making home affordable application on XXXX/XXXX/15. </s>
__label__6 i didnt make a inquiry for this bank </s>

Here __label__0, __label__1, …, __label__10 stand for "Debt collection", "Consumer Loan", "Mortgage", "Credit card", "Credit reporting", "Student loan", "Bank account or service", "Payday loan", "Money transfers", "Other financial service", and "Prepaid card", respectively.
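
One way to obtain integer labels like these is to number the products in order of first appearance. The sketch below does that on a few made-up rows and builds the corresponding fastText lines; the rows and both helper names are illustrative assumptions, not part of the original script.

```python
def number_by_first_appearance(values):
    """Map each distinct value to 0, 1, 2, ... in order of first appearance."""
    ids = {}
    for value in values:
        ids.setdefault(value, len(ids))
    return ids


def to_fasttext_line(label_id, text):
    """Build one training line: '__label__<id> <text on a single line>'."""
    return '__label__' + str(label_id) + ' ' + ' '.join(text.split())


# Made-up rows for illustration: (product, complaint text)
rows = [
    ('Debt collection', 'They keep calling\nthe wrong number.'),
    ('Mortgage', 'My loan modification was denied.'),
    ('Debt collection', 'I never owed this debt.'),
]

label_ids = number_by_first_appearance(product for product, _ in rows)
lines = [to_fasttext_line(label_ids[product], text) for product, text in rows]
print(label_ids)  # {'Debt collection': 0, 'Mortgage': 1}
print(lines[0])   # __label__0 They keep calling the wrong number.
```

Note that this sketch joins lines with a space, so words on either side of a newline are not glued together.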
The training data is ready, so let's start training on it.

# Newer releases expose the module as lowercase `fasttext` (import fasttext as fT)
import fastText as fT
classifier = fT.train_supervised(input=path_to_input)
# Read 10M words
# Number of words:  118348
# Number of labels: 11
# Progress: 100.0% words/sec/thread: 4515701 lr:  0.000000 loss: 0.865297 ETA:   0h 0m

We can check how the machine performs by picking a random complaint from the test set.

# pick one complaint from test set
test_01 = test.consumer_complaint_narrative.sample(1, random_state=10).values[0].replace('\n', '')
print(test_01)
# I have been receiving calls from an unknown debt collector. They have not given me any information as to who they are, a physical address, and call me day and night, after I have repeatedly asked to only contact me through mail. Now they have called my work without my permission. I do n't even know how they got my work number, and jeopardizing my job by doing so.
# test session
classifier.predict(test_01)
# (('__label__0',), array([0.85070831]))
# create your own complaint
my_complaint = 'I need a loan asap, show me the money'
# your result
classifier.predict(my_complaint)
# (('__label__5',), array([0.62615275]))
# lol, it said 'Student Loan'

It predicts that test_01 is a "Debt collection" complaint with a probability of 0.85, and that is the correct answer! Keep calm and test more…
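
To read such predictions as product names rather than label ids, a small helper can map them back using the label list given earlier. `decode_prediction` is a hypothetical helper, not part of fastText, and it assumes the prediction has the `(labels, probabilities)` shape shown in the output above.

```python
# Product names in label order, as listed earlier in this post
PRODUCTS = [
    'Debt collection', 'Consumer Loan', 'Mortgage', 'Credit card',
    'Credit reporting', 'Student loan', 'Bank account or service',
    'Payday loan', 'Money transfers', 'Other financial service',
    'Prepaid card',
]
PREFIX = '__label__'


def decode_prediction(prediction):
    """Turn (('__label__5',), [0.62...]) into [('Student loan', 0.62...)]."""
    labels, probabilities = prediction
    return [
        (PRODUCTS[int(label[len(PREFIX):])], float(p))
        for label, p in zip(labels, probabilities)
    ]


print(decode_prediction((('__label__5',), [0.62615275])))
# [('Student loan', 0.62615275)]
```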

# making test file
path_to_test = './data/customer_conplaint_fastText_test'
make_fastText_input(df=test, fname=path_to_test)
# testing the test file
classifier.test(path=path_to_test)
# (13362, 0.8083370752881305, 0.8083370752881305)
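
The tuple returned by `test` holds the number of examples, followed by precision and recall at 1. Every complaint has exactly one true label and the model returns one prediction for it, so the two values coincide and can be read as accuracy. A quick sketch of the arithmetic, using the numbers from the output above:

```python
# Values copied from the classifier.test output above
n_examples, precision_at_1, recall_at_1 = (13362, 0.8083370752881305, 0.8083370752881305)

# Number of complaints whose top prediction matched the true label
n_correct = round(n_examples * precision_at_1)
print(n_correct)                        # 10801
print(f'{n_correct / n_examples:.2%}')  # 80.83%
```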

Roughly speaking, the model classified the text in the test set with 80.83% accuracy (about 10,800 of 13,362 complaints). So how fast can the computer read?

import time
start = time.time()
result = classifier.test(path=path_to_test)
end = time.time()
print(end - start)
# 0.3329501152038574

Reading 13,362 complaints, or about 2,526,411 words, takes 0.33 seconds, so the classifier can read more than 455 million words per minute (according to this classifier on my laptop*). That is more than a million times faster than an ordinary person, or roughly nine thousand times faster than Maria.
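
The speed comparison is simple arithmetic, sketched here with the figures quoted in this post (the variable names are only illustrative):

```python
words_read = 2_526_411            # words in the test set
seconds = 0.3329501152038574      # measured time from the run above
human_wpm = 300                   # upper end of a typical reader
maria_wpm = 50_000                # record figure quoted earlier

machine_wpm = words_read / seconds * 60
print(f'{machine_wpm / 1e6:.0f} million words per minute')  # 455 million words per minute
print(f'{machine_wpm / human_wpm:,.0f}x a typical reader')  # about 1.5 million times
print(f'{machine_wpm / maria_wpm:,.0f}x Maria')             # about 9,100 times
```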

In conclusion, natural language processing (NLP) is about making computers understand human language. The task we just completed, text classification, is a supervised learning approach, and supervised learning is a subset of machine learning, a field largely concerned with building machines that perform tasks like the one shown above. By the way, the next blog post will cover more sophisticated modern methods, such as RNN and LSTM, with more accurate results, so stay tuned...