About Me

Hello! My name is Jacob Krantz and I am a PhD student at Oregon State University studying artificial intelligence. I am advised by Dr. Stefan Lee. My research interests lie in the space of Embodied AI, where decision-making agents accomplish tasks through interaction with an environment (either simulated or physical). I focus on developing agents that can reason from both vision and natural language, with one particular context being navigation. Much of my research takes place in photorealistic simulated environments; it is here that agents can accumulate vast amounts of experience (and ideally skills) before transferring to the real world.

In Fall 2020, I was a research intern at Facebook AI Research (FAIR) working with Oleksandr Maksymets. I stayed on as a part-time student researcher through February 2021. In 2019 I graduated from Gonzaga University after studying Computer Science and Physics. I was advised by Dr. Paul De Palma. In 2018 I was an REU Fellow at the University of Colorado, Colorado Springs and was advised by Dr. Jugal Kalita.

In my free time, I am an avid mountain athlete. From rock climbing and ski mountaineering to trail running, you can find me in the mountains all year long.

Research Interests

Embodied AI

Agent Navigation

Deep Learning

Recent News

May, 2021 Our Visually Grounded Interaction and Language (ViGIL) Workshop is happening June 10 at NAACL 2021.
May, 2021 Our RxR-Habitat Challenge at the CVPR 2021 Embodied AI Workshop concludes May 31.
Nov, 2020 Our paper proposing Localization from Embodied Dialog appeared at EMNLP 2020.
Aug, 2020 Our paper on vision-and-language navigation in continuous environments (VLN-CE) appeared at ECCV 2020.

Research Projects

Where Are You? Localization from Embodied Dialog

Meera Hahn    Jacob Krantz    Dhruv Batra    Devi Parikh    James M. Rehg

Stefan Lee    Peter Anderson

EMNLP, 2020

We present WHERE ARE YOU? (WAY), a dataset of ∼6k dialogs in which two humans – an Observer and a Locator – complete a cooperative localization task. The Observer is spawned at random in a 3D environment and can navigate from first-person views while answering questions from the Locator. The Locator must localize the Observer in a detailed top-down map by asking questions and giving instructions. Based on this dataset, we define three challenging tasks: Localization from Embodied Dialog or LED (localizing the Observer from dialog history), Embodied Visual Dialog (modeling the Observer), and Cooperative Localization (modeling both agents). In this paper, we focus on the LED task – providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices. Our best model achieves 32.7% success at identifying the Observer’s location within 3m in unseen buildings, vs. 70.4% for human Locators.

Beyond the Nav-Graph: Vision and Language Navigation in Continuous Environments

Jacob Krantz    Erik Wijmans    Arjun Majumdar    Dhruv Batra    Stefan Lee

ECCV, 2020

We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions. By being situated in continuous environments, this setting lifts a number of assumptions implicit in prior work that represents environments as a sparse graph of panoramas with edges corresponding to navigability. Specifically, our setting drops the presumptions of known environment topologies, short-range oracle navigation, and perfect agent localization. To contextualize this new task, we develop models that mirror many of the advances made in prior settings as well as single-modality baselines. While some of these techniques transfer, we find significantly lower absolute performance in the continuous setting – suggesting that performance in prior navigation-graph settings may be inflated by the strong implicit assumptions.

Language-Agnostic Syllabification with Neural Sequence Labeling

Jacob Krantz    Maxwell Dulin    Paul De Palma

IEEE International Conference on Machine Learning and Applications, 2019

The identification of syllables within phonetic sequences is known as syllabification. This task is thought to play an important role in natural language understanding, speech production, and the development of speech recognition systems. The concept of the syllable is cross-linguistic, though formal definitions are rarely agreed upon, even within a language. In response, data-driven syllabification methods have been developed to learn from syllabified examples. These methods often employ classical machine learning sequence labeling models. In recent years, recurrence-based neural networks have been shown to perform increasingly well for sequence labeling tasks such as named entity recognition (NER), part-of-speech (POS) tagging, and chunking. We present a novel approach to the syllabification problem which leverages modern neural network techniques. Our network is constructed with long short-term memory (LSTM) cells, a convolutional component, and a conditional random field (CRF) output layer. Existing syllabification approaches are rarely evaluated across multiple language families. To demonstrate cross-linguistic generalizability, we show that the network is competitive with state-of-the-art systems in syllabifying English, Dutch, Italian, French, Manipuri, and Basque datasets.

Abstractive Summarization Using Attentive Neural Techniques

Jacob Krantz    Jugal Kalita

International Conference on Natural Language Processing, 2018

In a world of proliferating data, the ability to rapidly summarize text is growing in importance. Automatic summarization of text can be thought of as a sequence-to-sequence problem. Another area of natural language processing that solves a sequence-to-sequence problem is machine translation, which is rapidly evolving due to the development of attention-based encoder-decoder networks. This work applies these modern techniques to abstractive summarization. We perform analysis on various attention mechanisms for summarization with the goal of developing an approach and architecture aimed at improving the state of the art. In particular, we modify and optimize a translation model with self-attention for generating abstractive sentence summaries. The effectiveness of this base model along with attention variants is compared and analyzed in the context of standardized evaluation sets and test metrics. However, we show that these metrics are limited in their ability to effectively score abstractive summaries, and propose a new approach based on the intuition that an abstractive model requires an abstractive evaluation.

Syllabification by Phone Categorization

Jacob Krantz    Maxwell Dulin    Paul De Palma    Mark VanDam

Genetic and Evolutionary Computation Conference Companion, 2018

Syllables play an important role in speech synthesis, speech recognition, and spoken document retrieval. A novel, low-cost, and language-agnostic approach to dividing words into their corresponding syllables is presented. A hybrid genetic algorithm constructs a categorization of phones optimized for syllabification. This categorization is used on top of a hidden Markov model sequence classifier to find syllable boundaries. The technique shows promising preliminary results when trained and tested on English words.

Press Coverage

VentureBeat article introducing our work to bring Vision and Language Navigation closer to reality via continuous environments (VLN-CE).