Hexiang (Frank) Hu
Ph.D. Student at USC
Deep Learner
I am passionate about Machine Learning, Computer Vision, and Natural Language Processing.


Hexiang Hu is a Computer Science Ph.D. candidate in the Viterbi School of Engineering at the University of Southern California (USC), working with Prof. Fei Sha. Prior to this, he was a Ph.D. student in the Henry Samueli School of Engineering and Applied Science at the University of California, Los Angeles (UCLA). He earned his Bachelor's degrees in Computer Science from Zhejiang University and Simon Fraser University with honors. He worked with Prof. Greg Mori during his undergraduate studies. His research interests include Machine Learning, Computer Vision, and Natural Language Processing. [ CV ]


Summer 2020
Intern @ Google Research
Summer 2019
Intern @ Intel AI
Spring 2019
Visitor @ Berkeley AI Research Lab
Summer 2018
Intern @ Facebook AI Research
2017 -
PhD student @ USC
Large Scale Machine Learning, Vision and Language
Supervisor: Prof. Fei Sha
2016 - 2017
PhD student @ UCLA
Deep Learning, Vision
Supervisor: Prof. Fei Sha

Selected Publications

Learning the Best Pooling Strategy for Visual Semantic Embedding

We propose a Generalized Pooling Operator (GPO), which learns to automatically adapt itself to the best pooling strategy for different feature modalities in visual semantic embedding models, requiring no manual tuning while staying effective and efficient.
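The idea of a learned pooling operator can be sketched as a weighted sum over per-dimension sorted feature values, where the rank weights are learnable. This is a minimal illustration under that assumption, not the paper's implementation; the class name and shapes are illustrative.

```python
import torch
import torch.nn as nn

class GeneralizedPooling(nn.Module):
    """Sketch of a generalized pooling operator: aggregates a set of
    N feature vectors into one vector via a learned weighted sum over
    the sorted values of each dimension. With uniform weights it
    reduces to mean pooling; peaked weights recover max/k-max pooling."""
    def __init__(self, num_elements):
        super().__init__()
        # one learnable weight per rank position, shared across dimensions
        self.weights = nn.Parameter(torch.ones(num_elements) / num_elements)

    def forward(self, x):
        # x: (batch, N, dim) -- sort each dimension's N values descending
        sorted_x, _ = torch.sort(x, dim=1, descending=True)
        w = torch.softmax(self.weights, dim=0).view(1, -1, 1)
        return (sorted_x * w).sum(dim=1)  # (batch, dim)
```

Because the weights are learned end-to-end, the model can interpolate between mean-, max-, and k-max-style pooling per feature modality without manual tuning.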

CVPR 2021 (Oral) in Virtual Meeting
Learning to Represent Image and Text using Denotation Graph

In this paper, we propose learning representations from a set of implied, visually grounded expressions between images and text, automatically mined from image-text datasets. In particular, we use denotation graphs to represent how specific concepts (such as sentences describing images) can be linked to abstract and generic concepts (such as short phrases) that are also visually grounded.

EMNLP 2020 (Oral) in Virtual Meeting
Learning Adaptive Classifiers Synthesis for Generalized Few-Shot Learning

We investigate the problem of generalized few-shot learning (GFSL) using dictionary-based classifier synthesis.

IJCV 2021
BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps

We propose BabyWalk, a novel navigation agent that learns to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially.

ACL 2020 (Oral) in Seattle, WA
Few-Shot Learning via Embedding Adaptation with Set-to-Set Functions

We propose a novel approach to adapt the instance embeddings to the target classification task with a set-to-set function, yielding embeddings that are task-specific and discriminative.
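One natural instantiation of a set-to-set function is self-attention over the support-set embeddings, so each embedding is refined in the context of the others. The sketch below is a minimal illustration under that assumption; the class name and dimensions are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class SetToSetAdapter(nn.Module):
    """Sketch: jointly adapt a set of instance embeddings with
    multi-head self-attention, producing task-specific embeddings."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, support):
        # support: (n_instances, dim), treated as one unordered set
        x = support.unsqueeze(0)                 # add batch dim
        adapted, _ = self.attn(x, x, x)          # contextualize the set
        return (x + adapted).squeeze(0)          # residual connection
```

Self-attention is a good fit here because it is permutation-equivariant: the adapted embedding of each instance does not depend on the order in which the support set is presented.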

CVPR 2020 in Seattle, WA
Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation

One important limitation of MAML is that it seeks a common initialization shared across all tasks, which makes it struggle to adapt to tasks drawn from a multimodal distribution. This paper proposes a generic method that augments MAML with the capability of identifying the task mode using a model-based learner, so that it can adapt quickly with a few gradient updates.

NeurIPS 2019 (Spotlight) in Vancouver, BC
Binary Image Selection (BISON): Interpretable Evaluation of Visual Grounding

This paper presents an alternative evaluation task for visual-grounding systems: given a caption, the system is asked to select the image that best matches the caption from a pair of semantically similar images. The system's accuracy on this Binary Image SelectiON (BISON) task is not only interpretable, but also measures the ability to relate fine-grained text content in the caption to visual content in the images.

ICCV 2019 Workshop (CLVL) in Seoul, South Korea
Engaging Image Captioning Via Personality

We define a new task, Personality-Captions, where the goal is to be as engaging to humans as possible by incorporating controllable style and personality traits. We collect and release a large dataset of 201,858 such captions conditioned on 215 possible personality traits.

CVPR 2019 in Long Beach, CA
Synthesized Policies for Transfer and Adaptation across Tasks and Environments

In this paper, we consider the problem of learning to simultaneously transfer across both environments (ENV) and tasks (TASK), and, more importantly, to do so by learning from only sparse (ENV, TASK) pairs out of all possible combinations. We propose a compositional neural network that expresses a meta-rule for composing policies from environment and task embeddings.

NIPS 2018 (Spotlight) in Montreal, QC
Learning Structured Inference Neural Networks with Label Relations

We propose a generic structured model that leverages diverse label relations to improve image classification performance. It employs a novel stacked label prediction neural network, capturing both inter-level and intra-level label semantics. The design of this framework naturally extends to leveraging partial observations in the label space to infer the rest of the label space.

CVPR 2016 & T-PAMI
Being Negative but Constructively: Lessons Learnt from Creating Better Visual Question Answering Datasets

We show that the design of the decoy answers has a significant impact on how and what learning models learn from the datasets. In particular, the resulting learner can ignore the visual information, the question, or both while still doing well on the task.

NAACL-HLT 2018 (Oral) in New Orleans, LA
Learning Answer Embedding for Visual Question Answering

We propose a novel probabilistic model for visual question answering.

CVPR 2018 in Salt Lake City, UT