Description
Calculating text statistics, such as term frequency and string similarity, is an essential part of processing and modeling symbol sequences, including text, source code, and DNA sequences. However, computing these statistics can be computationally demanding: even for datasets of moderate size, it can be cumbersome to integrate them into research or a commercial product, and as the dataset grows it can become infeasible. Therefore, surrogate models can be trained to approximate these statistics, either directly with supervised learning or indirectly with self-supervised approaches.
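To make the computational demand concrete, consider one of the statistics in question computed exactly. The sketch below (a generic dynamic-programming implementation in Python, not code from the talk) computes the Levenshtein distance in O(m·n) time per pair of strings, so an all-pairs similarity matrix over N documents costs on the order of N² such computations:

```python
def levenshtein(a: str, b: str) -> int:
    """Exact edit distance via dynamic programming: O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution / match
        prev = curr
    return prev[-1]

# All-pairs similarity over N strings needs N * (N - 1) / 2 such calls,
# which is what becomes infeasible as the dataset grows.
```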
In this talk, a novel method is introduced for analyzing text autoencoders in terms of their reconstruction loss and the representation learned in the bottleneck. The method uses the autoencoder's reconstruction loss to approximate text statistics such as term frequency and string similarity metrics, including Levenshtein distance and longest common subsequence. The performance of convolutional neural network (CNN) and Long Short-Term Memory (LSTM) based autoencoders is investigated on public datasets (Penn Treebank, DBpedia, Yelp Review Polarity).
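The abstract does not spell out the exact training or inference setup, but a minimal sketch of the underlying idea might look as follows (PyTorch; the character-level setup, layer sizes, and the `proxy_distance` helper are illustrative assumptions, not the authors' configuration): an LSTM autoencoder is trained to reconstruct sequences through a bottleneck, and the cross-entropy incurred when decoding one sequence from another's bottleneck is read as a learned stand-in for their similarity.

```python
import torch
import torch.nn as nn

class CharAutoencoder(nn.Module):
    """LSTM encoder-decoder; the encoder's final state is the bottleneck.
    Layer sizes here are illustrative assumptions."""
    def __init__(self, vocab_size: int, emb: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        _, state = self.encoder(self.embed(src))       # bottleneck state
        dec, _ = self.decoder(self.embed(tgt), state)  # teacher-forced decode
        return self.out(dec)                           # logits per position

def proxy_distance(model: CharAutoencoder,
                   a: torch.Tensor, b: torch.Tensor) -> float:
    """Hypothetical helper: encode sequence `a` (1-D tensor of token ids),
    then measure how costly it is to decode sequence `b` from that
    bottleneck; the loss acts as a learned proxy for string similarity."""
    logits = model(a.unsqueeze(0), b.unsqueeze(0))
    return nn.functional.cross_entropy(logits.squeeze(0), b).item()
```

Under this reading, a well-trained bottleneck would yield losses that correlate with exact statistics such as the Levenshtein distance computed above.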
The results help interpret what text autoencoders learn, and they are also a step towards understanding which properties may be represented in text embeddings.
Title
Methods for Interpreting Text Autoencoders
| authors | Bálint Gyires-Tóth, Marco H A Inácio |
| --- | --- |
| affiliation | Budapest University of Technology and Economics, Department of Telecommunications and Media Informatics |