11–26 Nov 2021
Europe/Budapest timezone

Methods for Interpreting Text Autoencoders

Not scheduled
20m
Online lecture

Speaker

Bálint Gyires-Tóth (Budapest University of Technology and Economics)

Description

Calculating text statistics, such as term frequency and string similarity, is an essential part of processing and modeling symbol sequences, including text, source code, and DNA sequences. However, computing these statistics can be computationally demanding: even for datasets of moderate size it can be cumbersome to integrate them into research workflows or commercial products, and as the dataset grows it can become infeasible altogether. Therefore, surrogate models can be trained to approximate these statistics, either directly with supervised learning or indirectly with self-supervised approaches.
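As a point of reference for the computational cost involved, the sketch below implements the exact dynamic-programming versions of two of the string statistics mentioned in the abstract (Levenshtein distance and longest common subsequence). This is illustrative only, not code from the talk: each pair of strings costs roughly O(len(a) × len(b)) operations, and computing the statistics over all pairs in a large corpus is what becomes infeasible.

```python
def levenshtein(a: str, b: str) -> int:
    # Classic edit-distance DP with a rolling row to keep memory at O(len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def longest_common_subsequence(a: str, b: str) -> int:
    # Length of the longest common subsequence, same rolling-row DP pattern.
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```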
In this talk, a novel method is introduced for analyzing text autoencoders in terms of their reconstruction loss and the representation learned in the bottleneck. The method uses the reconstruction loss of the autoencoder to approximate text statistics such as term frequency and string similarity metrics (including Levenshtein distance and longest common subsequence). The performance of convolutional neural network and Long Short-Term Memory based autoencoders is investigated on public datasets (Penn Treebank, DBpedia, Yelp Review Polarity).
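To make the setting concrete, the following is a minimal PyTorch sketch of an LSTM sequence autoencoder with a bottleneck, together with a per-sequence reconstruction loss of the kind that could be related to text statistics. The architecture, dimensions, and function names here are assumptions for illustration, not the authors' actual models or experimental setup.

```python
import torch
import torch.nn as nn

class LSTMAutoencoder(nn.Module):
    """Token sequences -> bottleneck vector -> reconstructed token logits."""
    def __init__(self, vocab_size, emb_dim=64, bottleneck_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, bottleneck_dim, batch_first=True)
        self.decoder = nn.LSTM(bottleneck_dim, bottleneck_dim, batch_first=True)
        self.out = nn.Linear(bottleneck_dim, vocab_size)

    def forward(self, tokens):                       # tokens: (batch, seq_len)
        emb = self.embed(tokens)
        _, (h, _) = self.encoder(emb)                # h: (1, batch, bottleneck_dim)
        z = h[-1]                                    # bottleneck representation
        # Feed the bottleneck vector to the decoder at every time step.
        dec_in = z.unsqueeze(1).expand(-1, tokens.size(1), -1)
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out)                     # (batch, seq_len, vocab_size)

def reconstruction_loss(model, tokens):
    """One reconstruction-loss value per sequence, usable as a surrogate signal."""
    logits = model(tokens)
    loss = nn.functional.cross_entropy(
        logits.transpose(1, 2), tokens, reduction="none")  # (batch, seq_len)
    return loss.mean(dim=1)
```

Under this setup, the per-sequence losses can then be compared against exact statistics (e.g., term frequency or edit distance to a reference) to study what the autoencoder has learned.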
The results help interpret what text autoencoders learn, and they are also a step towards understanding which properties might be represented in text embeddings.

Title

Methods for Interpreting Text Autoencoders

Authors: Bálint Gyires-Tóth, Marco H A Inácio
Affiliation: Budapest University of Technology and Economics, Department of Telecommunications and Media Informatics

Primary authors

Bálint Gyires-Tóth (Budapest University of Technology and Economics)
Dr Marco H A Inácio

Presentation materials

There are no materials yet.