Наивный машинный перевод и LSH

Мы собираемся создать проект по внедрению системы машинного перевода с хешированием, чувствительным к локальности. Проект основан на том, что я сделал в задании курса «Обработка естественного языка с классификацией и векторными пространствами». Причина, по которой я пишу этот пост, - чтобы лучше понять процесс.

Слово встраивает данные для английских и французских слов

Цель проекта - создать систему перевода английских слов на французские. Приведенные нами данные представляют собой вложения слов на английском и французском языках. Полный набор данных для английских встраиваний составляет около 3,64 гигабайта, а французских - около 629 мегабайт. Для завершения проекта на Coursera нам понадобится подмножество вложений.

en_embeddings_subset = pickle.load(open("en_embeddings.p", "rb"))
fr_embeddings_subset = pickle.load(open("fr_embeddings.p", "rb"))

Данные во вложениях представляют собой словари, в которых ключ соответствует слову, а значения представляют собой 300-мерный массив с этими конкретными вложениями слов.

Нам нужно загрузить два словаря, отображающих английские и французские слова. Учебный и тестовый словарь делятся поровну. 5000 на обучение и на тестирование.

Создание матриц встраивания и преобразования

Когда у нас есть словари английских и французских слов, мы должны преобразовать их в матрицы. Функция get_matrices принимает в качестве входных данных en_fr, en_embeddings, fr_embeddings. Он возвращает матрицу X и матрицу Y, где каждая строка в X является вложением слова для английского слова, а та же строка в Y является вложением слова для французской версии этого английского слова.

def get_matrices(en_fr, french_vecs, english_vecs):
# X_l and Y_l are lists of the english and french word embeddings
    X_l = list()
    Y_l = list()
# get the english words (the keys in the dictionary) and store in a set()
    english_set = english_vecs.keys()
# get the french words (keys in the dictionary) and store in a set()
    french_set = french_vecs.keys()
# store the french words that are part of the english-french dictionary (these are the values of the dictionary)
    french_words = set(en_fr.values())
# loop through all english, french word pairs in the english french dictionary
    for en_word, fr_word in en_fr.items():
# check that the french word has an embedding and that the english word has an embedding
        if fr_word in french_set and en_word in english_set:
# get the english embedding
            en_vec = english_vecs[en_word]
# get the french embedding
            fr_vec = french_vecs[fr_word]
# add the english embedding to the list
            X_l.append(en_vec)
# add the french embedding to the list
            Y_l.append(fr_vec)
# stack the vectors of X_l into a matrix X
    X = np.vstack(X_l)
# stack the vectors of Y_l into a matrix Y
    Y = np.vstack(Y_l)
return X, Y

Перевод как линейное преобразование вложений

Когда у нас есть английские и французские вложения, мы должны создать матрицу преобразования R. Вы можете несколько векторов Engish emebddings с помощью R, чтобы получить новые вложения f. Затем вы можете вычислить ближайших соседей к f во французских вложениях, чтобы получить слово, наиболее похожее на английское.

Мы можем описать перевод как задачу минимизации.

Норма Фробениуса матрицы 𝐴 (при условии, что она имеет размерность 𝑚, 𝑛) определяется как квадратный корень из суммы абсолютных квадраты его элементов.

В реальных приложениях норма Форбениуса часто заменяется ее значением в квадрате, деленным на 𝑚, где m - количество примеров.

Чтобы реализовать механизм трансляции, мы должны вычислить функцию потерь. Прежде всего, мы получим Y, умножив X и R. Затем мы должны вычислить разницу между XR- Y. В конце мы должны вычислить квадрат нормы Фробениуса разности и разделить его на 𝑚 .

def compute_loss(X, Y, R):
   # m is the number of rows in X
    m = X.shape[0]
    
    # diff is XR - Y
    diff = np.dot(X,R)-Y
# diff_squared is the element-wise square of the difference
    diff_squared = diff**2
# sum_diff_squared is the sum of the squared elements
    sum_diff_squared = np.sum(diff_squared)
# loss i the sum_diff_squard divided by the number of examples (m)
    loss = sum_diff_squared/m
    return loss

Это был первый шаг. На втором этапе мы должны вычислить потери градиента относительно матрицы преобразования R. Градиент кодирует, насколько небольшое изменение R влияет на изменение функции потерь. Это дает нам направление, в котором мы должны уменьшить R, чтобы минимизировать потери.

def compute_gradient(X, Y, R):
    # m is the number of rows in X
    m = X.shape[0]
# gradient is X^T(XR - Y) * 2/m
    gradient = np.dot(X.transpose(),np.dot(X,R)-Y)*(2/m)
    return gradient

Третий шаг - найти оптимальное R с помощью алгоритма градиентного спуска. Градиентный спуск - это итерационный алгоритм, который используется для поиска оптимума функции. Обычно мы тренируемся с фиксированным количеством итераций вместо того, чтобы ждать, пока убыток упадет ниже определенного порога.

def align_embeddings(X, Y, train_steps=100, learning_rate=0.0003):
    np.random.seed(129)
# the number of columns in X is the number of dimensions for a word vector (e.g. 300)
    # R is a square matrix with length equal to the number of dimensions in th  word embedding
    R = np.random.rand(X.shape[1], X.shape[1])
for i in range(train_steps):
        if i % 25 == 0:
            print(f"loss at iteration {i} is: {compute_loss(X, Y, R):.4f}")
        # use the function that you defined to compute the gradient
        gradient = compute_gradient(X,Y,R)
# update R by subtracting the learning rate times gradient
        R -= learning_rate * gradient
    return R

Тестирование перевода

Мы только приближаем функцию перевода с английского языка на французский вложения с помощью матрицы линейного преобразования . В большинстве случаев мы не можем получить точное вложение. В этом случае действительно полезен алгоритм k-ближайших соседей. Используя 1-NN с 𝐞𝐑 в качестве входных данных, мы можем искать вложение 𝐟 (в виде строки) в матрицу 𝐘, которая является ближайший к преобразованному вектору 𝐞𝐑.

Чтобы найти ближайшего соседа, нам нужно ввести вектор v, набор возможных кандидатов для поиска ближайшего соседа и k ближайших соседей. Метрика расстояния должна основываться на косинусном сходстве.

def nearest_neighbor(v, candidates, k=1):
    
    similarity_l = []
# for each candidate vector...
    for row in candidates:
        # get the cosine similarity
        cos_similarity = cosine_similarity(v,row)
# append the similarity to the list
        similarity_l.append(cos_similarity)
        
    # sort the similarity list and get the indices of the sorted list
    sorted_ids = np.argsort(similarity_l)
# get the indices of the k most similar candidate vectors
    k_idx = sorted_ids[-k:]
    return k_idx

Проверьте свой перевод и оцените его точность

Надо проверить точность перевода. Функция test_vocabulary принимает английскую матрицу вложения X, французскую матрицу вложения Y и матрицу R. Возвращает точность перевода из X в Y по R.

Точность измеряется отношением правильных прогнозов ко всем прогнозам.

def test_vocabulary(X, Y, R):
    # The prediction is X times R
    pred = np.dot(X,R)
# initialize the number correct to zero
    num_correct = 0
# loop through each row in pred (each transformed embedding)
    for i in range(len(pred)):
        # get the index of the nearest neighbor of pred at row 'i'; also pass in the candidates in Y
        pred_idx = nearest_neighbor(pred[i],Y)
# if the index of the nearest neighbor equals the row of i... \
        if pred_idx == i:
            # increment the number correct by 1.
            num_correct += 1
# accuracy is the number correct divided by the number of rows in 'pred' (also number of rows in X)
    accuracy = num_correct / len(pred)
return accuracy

LSH и поиск документов

Мы собираемся реализовать более эффективную версию k-ближайших соседей, используя хеширование с учетом локальности. Мы применяем этот алгоритм к твитам.

Получение вложения документа

Обычно порядок слов имеет значение в текстовом документе. Рассмотрим эти два предложения.

Яблочный пирог лучше пиццы пепперони.

Пицца пепперони лучше яблочного пирога

Предложения имеют противоположное значение из-за порядка слов.

В некоторых случаях игнорирование порядка слов может позволить нам обучить эффективные и действенные модели. Аппорач называется моделью документа «Пакет слов».

Встраивание документа создается путем суммирования всех вложений слов в документе. Если слово «вложение» неизвестно, мы можем просто проигнорировать это слово.

def get_document_embedding(tweet, en_embeddings): 
  doc_embedding = np.zeros(300)
    # process the document into a list of words (process the tweet)
    processed_doc = process_tweet(tweet)
    for word in processed_doc:
        # add the word embedding to the running total for the document embedding
        doc_embedding += en_embeddings.get(word,0)
    return doc_embedding

Теперь нам нужно сохранить все векторы документа в словаре.

def get_document_vecs(all_docs, en_embeddings):
# the dictionary's key is an index (integer) that identifies a specific tweet
    # the value is the document embedding for that document
    ind2Doc_dict = {}
# this is list that will store the document vectors
    document_vec_l = []
for i, doc in enumerate(all_docs):
        # get the document embedding of the tweet
        doc_embedding = get_document_embedding(doc,en_embeddings)
# save the document embedding into the ind2Tweet dictionary at index i
        ind2Doc_dict[i] = doc_embedding
# append the document embedding to the list of document vectors
        document_vec_l.append(doc_embedding)
# convert the list of document vectors into a 2D array (each row is a document vector)
    document_vec_matrix = np.vstack(document_vec_l)
return document_vec_matrix, ind2Doc_dict

Ищу твиты

Будем искать твиты. У нас есть вектор размерности (m, d), где m - количество твитов, а d - размерность вложений. Мы собираемся использовать косинусное сходство, чтобы найти, какие твиты похожи друг на друга.

idx = np.argmax(cosine_similarity(document_vecs, tweet_embedding))
print(all_tweets[idx])

Поиск самых похожих твитов с LSH

Мы также можем найти похожие твиты с хешированием сенсита местности. Вместо того, чтобы искать все 10 000 векторов, вы можете просто выполнить поиск по подмножеству, чтобы найти его ближайших соседей.

N_VECS = len(all_tweets)       # This many vectors.
N_DIMS = len(ind2Tweet[1])     # Vector dimensionality.
print(f"Number of vectors is {N_VECS} and each has {N_DIMS} dimensions.")

Получение хеш-числа для вектора

Для каждого вектора мы должны получить уникальный номер, который нужно присвоить хеш-ведру. Мы должны использовать гиперплоскости, чтобы разделить векторное пространство. Мы можем использовать его, чтобы разделить вектор на две части. Все векторы, скалярное произведение которых с вектором нормали к плоскости положительно, находятся по одну сторону от плоскости. Все векторы, скалярное произведение которых на вектор нормали плоскости отрицательное, находятся по другую сторону плоскости. Для вектора мы можем взять его скалярное произведение со всеми плоскостями, затем закодировать эту информацию, чтобы назначить вектор одному хеш-ведру. Для вектора мы можем взять его скалярное произведение со всеми плоскостями, затем закодировать эту информацию, чтобы назначить вектор одному хеш-ведру. Если вектор находится на той же стороне, что и вектор нормали, закодируйте его на 1.

def hash_value_of_vector(v, planes):
    # for the set of planes,
    # calculate the dot product between the vector and the matrix containing the planes
    # remember that planes has shape (300, 10)
    # The dot product will have the shape (1,10)
    dot_product = np.dot(v,planes)
# get the sign of the dot product (1,10) shaped vector
    sign_of_dot_product = np.sign(dot_product)
# set h to be false (eqivalent to 0 when used in operations) if the sign is negative,
    # and true (equivalent to 1) if the sign is positive (1,10) shaped vector
    h = sign_of_dot_product>=0
# remove extra un-used dimensions (convert this from a 2D to a 1D array)
    h = np.squeeze(h)
# initialize the hash value to 0
    hash_value = 0
n_planes = planes.shape[1]
    for i in range(n_planes):
        # increment the hash value by 2^i * h_i
        hash_value += np.power(2,i)*h[i]
# cast hash_value as an integer
    hash_value = int(hash_value)
return hash_value

Создание хеш-таблицы

Нам нужно создать функцию хеш-таблицы, которая принимает векторные слова, назначает их ведру и сохраняет вектор там.

def make_hash_table(vecs, planes):
# number of planes is the number of columns in the planes matrix
    num_of_planes = planes.shape[1]
# number of buckets is 2^(number of planes)
    num_buckets = 2**num_of_planes
# create the hash table as a dictionary.
    # Keys are integers (0,1,2.. number of buckets)
    # Values are empty lists
    hash_table = {i:[] for i in range(num_buckets)}
# create the id table as a dictionary.
    # Keys are integers (0,1,2... number of buckets)
    # Values are empty lists
    id_table = {i:[] for i in range(num_buckets)}
# for each vector in 'vecs'
    for i, v in enumerate(vecs):
        # calculate the hash value for the vector
        h = hash_value_of_vector(v,planes)
# store the vector into hash_table at key h,
        # by appending the vector v to the list at key h
        hash_table[h].append(v)
# store the vector's index 'i' (each document is given a unique integer 0,1,2...)
        # the key is the h, and the 'i' is appended to the list at key h
        id_table[h].append(i)
return hash_table, id_table

Создание всех хеш-таблиц

Теперь мы можем хэшировать векторы и сохранять их в хеш-таблице. Чем больше сегментов вы используете, тем точнее будет ваш поиск, но также и больше времени.

После того, как у нас есть хеш-таблица, мы можем приблизительно рассчитать k-nn.

def approximate_knn(doc_id, v, planes_l, k=1, num_universes_to_use=N_UNIVERSES):
# Vectors that will be checked as possible nearest neighbor
    vecs_to_consider_l = list()
# list of document IDs
    ids_to_consider_l = list()
# create a set for ids to consider, for faster checking if a document ID already exists in the set
    ids_to_consider_set = set()
# loop through the universes of planes
    for universe_id in range(num_universes_to_use):
# get the set of planes from the planes_l list, for this particular universe_id
        planes = planes_l[universe_id]
# get the hash value of the vector for this set of planes
        hash_value = hash_value_of_vector(v, planes)
# get the hash table for this particular universe_id
        hash_table = hash_tables[universe_id]
# get the list of document vectors for this hash table, where the key is the hash_value
        document_vectors_l = hash_table[hash_value]
# get the id_table for this particular universe_id
        id_table = id_tables[universe_id]
# get the subset of documents to consider as nearest neighbors from this id_table dictionary
        new_ids_to_consider = id_table[hash_value]
# remove the id of the document that we're searching
        if doc_id in new_ids_to_consider:
            new_ids_to_consider.remove(doc_id)
            print(f"removed doc_id {doc_id} of input vector from new_ids_to_search")
# loop through the subset of document vectors to consider
        for i, new_id in enumerate(new_ids_to_consider):
# if the document ID is not yet in the set ids_to_consider...
            if new_id not in ids_to_consider_set:
                # access document_vectors_l list at index i to get the embedding
                # then append it to the list of vectors to consider as possible nearest neighbors
                document_vector_at_i = document_vectors_l[i]
# append the new_id (the index for the document) to the list of ids to consider
                vecs_to_consider_l.append(document_vector_at_i)
                ids_to_consider_l.append(new_id)
# also add the new_id to the set of ids to consider
                # (use this to check if new_id is not already in the IDs to consider)
                ids_to_consider_set.add(new_id)
# Now run k-NN on the smaller set of vecs-to-consider.
    print("Fast considering %d vecs" % len(vecs_to_consider_l))
# convert the vecs to consider set to a list, then to a numpy array
    vecs_to_consider_arr = np.array(vecs_to_consider_l)
# call nearest neighbors on the reduced list of candidate vectors
    nearest_neighbor_idx_l = nearest_neighbor(v, vecs_to_consider_arr, k=k)
# Use the nearest neighbor index list as indices into the ids to consider
    # create a list of nearest neighbors by the document ids
    nearest_neighbor_ids = [ids_to_consider_l[idx]
                            for idx in nearest_neighbor_idx_l]
return nearest_neighbor_ids