I'm a research scientist at Sakana AI. Before that, I was a research scientist at Google DeepMind (formerly Google Brain) based in Tokyo. I received my PhD in Computer Science from the University of Tokyo, my M.S. from Waseda University, and my B.S. from Shanghai Jiao Tong University. My research interests are in reinforcement learning, robotics, evolutionary algorithms, and generative models.
We introduce a finetuning method that enables large language models to scale test-time compute using the diffusion framework. Accuracy improves with more diffusion steps, and the model can adaptively allocate compute via ODE solvers and guided generation. Our method applies to any model trained with cross-entropy, leaves the original weights untouched, complements standard finetuning and search, and bridges the autoregressive and diffusion paradigms.
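The step-count/accuracy tradeoff at the heart of this idea can be illustrated with a toy fixed-step ODE solver: spending more solver steps buys a more accurate answer, which is the same knob the diffusion framework exposes at test time. This is only an analogy sketch, not the paper's method; `euler_solve` and the example ODE are my own illustrative choices.

```python
import numpy as np

def euler_solve(f, x0, t1, steps):
    """Fixed-step Euler integration: more steps = more compute, less error."""
    x, dt = x0, t1 / steps
    for _ in range(steps):
        x = x + dt * f(x)
    return x

# Toy ODE dx/dt = -x with exact solution x(1) = exp(-1) * x0.
x0 = 1.0
exact = np.exp(-1.0)
coarse = euler_solve(lambda x: -x, x0, 1.0, 4)    # few steps, cheap, coarse
fine = euler_solve(lambda x: -x, x0, 1.0, 64)     # many steps, costly, accurate
assert abs(fine - exact) < abs(coarse - exact)
```

The same dial appears in diffusion-based generation: a coarse solve is fast but approximate, and the model can spend extra steps only where the task warrants it.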
We present ASAL, the first method to apply vision-language foundation models to Artificial Life (ALife). ASAL automates the discovery of lifelike simulations by finding target behaviors, generating open-ended novelty, and illuminating diverse phenomena. It works across multiple ALife substrates—Boids, Lenia, Game of Life, and more—and has led to the discovery of previously unseen lifeforms. By enabling human-aligned, scalable exploration, ASAL introduces a powerful new paradigm for accelerating ALife research beyond manual trial-and-error.
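The core search primitive behind this kind of foundation-model-guided discovery can be sketched as ranking candidate simulations by how well their rendered frames match a text prompt in a shared embedding space (as a CLIP-style vision-language model would provide). The sketch below stubs out the embeddings as plain vectors; `score_simulations` and the cosine scoring are illustrative assumptions, not ASAL's actual implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_simulations(sim_embeddings, target_embedding):
    """Rank candidate simulations by how closely their rendered frames
    (embedded by a vision-language model) match a target text embedding.
    Returns indices sorted from best match to worst."""
    return sorted(
        range(len(sim_embeddings)),
        key=lambda i: cosine(sim_embeddings[i], target_embedding),
        reverse=True,
    )

# Two stub "frame embeddings" and a target prompt embedding.
frames = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
target = np.array([0.0, 1.0])
best_first = score_simulations(frames, target)  # simulation 1 matches best
```

An outer evolutionary or search loop can then mutate simulation parameters and keep the highest-scoring candidates, replacing manual trial-and-error with human-aligned automatic evaluation.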
We present Transformer², a self-adaptive framework that enables large language models to handle unseen tasks in real time by adjusting only the singular components of their weight matrices. Using a two-pass inference process with a task dispatcher and RL-trained expert vectors, Transformer² outperforms methods like LoRA with fewer parameters and greater efficiency. It generalizes across architectures and modalities, offering a scalable path toward dynamic, self-organizing AI systems.
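Adjusting only the singular components of a weight matrix can be sketched in a few lines: decompose W = U diag(S) Vᵀ, then rescale the singular values S by an expert vector while the singular vectors U and Vᵀ stay frozen. In Transformer² the expert vectors are trained with RL and selected by a dispatcher; here `apply_expert` and the hand-set vector `z` are a minimal illustrative stand-in.

```python
import numpy as np

def apply_expert(W, z):
    """Rescale only the singular values of W by the expert vector z,
    leaving the singular vectors (U, Vt) untouched."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(S * z) @ Vt

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))

z = np.ones(4)                  # identity expert: reconstructs W exactly
W_same = apply_expert(W, z)

z_task = np.array([1.2, 1.0, 0.8, 0.5])  # hypothetical task-specific expert
W_adapted = apply_expert(W, z_task)
```

Because only one scalar per singular value is learned, each expert is tiny compared to a LoRA adapter, which is where the parameter-efficiency claim comes from.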
We introduce CycleQD, a skill-focused training method for language models that applies Quality Diversity with cyclic task adaptation, model merging–based crossover, and SVD-based mutation. By rotating the focus across tasks, CycleQD avoids data imbalance issues and simplifies objective design. Applied to LLAMA3-8B-INSTRUCT, it outperforms traditional fine-tuning on coding, OS, and database tasks, matching GPT-3.5-Turbo’s performance while preserving strong language abilities. The method is general and extends beyond language to domains like image segmentation.
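The two genetic operators named above can be sketched directly on weight matrices: model-merging crossover as a linear interpolation of two parents' weights, and SVD-based mutation as a perturbation of singular values only. The function names and the Gaussian perturbation scheme below are illustrative assumptions, not CycleQD's exact operators.

```python
import numpy as np

def crossover(W_a, W_b, t=0.5):
    """Model-merging crossover: interpolate two parents' weights."""
    return (1 - t) * W_a + t * W_b

def svd_mutation(W, sigma=0.1, rng=None):
    """SVD-based mutation: jitter the singular values of W while
    keeping the singular vectors fixed."""
    rng = rng or np.random.default_rng()
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S_new = S * (1.0 + sigma * rng.normal(size=S.shape))
    return U @ np.diag(S_new) @ Vt
```

A Quality Diversity loop would apply these operators per layer, cycling which task defines fitness each generation while the other tasks act as behavior descriptors.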
We introduce Neural Attention Memory Models (NAMMs), a learned memory management system that enhances both the efficiency and performance of transformers by selectively focusing on relevant context. Unlike prior rule-based methods, NAMMs evolve atop pre-trained models and condition only on attention matrices, making them broadly applicable. Trained on a small set of tasks, NAMMs improve performance across long-context benchmarks while drastically reducing input size, and they generalize zero-shot across architectures and modalities—including vision and reinforcement learning.
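Conditioning memory management only on attention matrices can be illustrated with a much-simplified stand-in for the learned scorer: score each cached token by the attention it receives and evict the rest. NAMMs evolve a network to produce these scores; the sum-of-attention heuristic and `prune_kv_cache` below are my own simplification for illustration.

```python
import numpy as np

def prune_kv_cache(attn, keep):
    """Toy memory manager: score each cached token (key) by the total
    attention it receives across queries, and return the indices of the
    top-`keep` tokens to retain, in original order."""
    scores = attn.sum(axis=0)                    # one score per cached token
    keep_idx = np.sort(np.argsort(scores)[-keep:])
    return keep_idx

# Rows = queries, columns = cached tokens (keys).
attn = np.array([[0.1, 0.8, 0.1],
                 [0.2, 0.7, 0.1]])
retained = prune_kv_cache(attn, keep=2)  # drops the least-attended token
```

Because the input is just an attention matrix, the same manager can be dropped onto any transformer, which is what enables the zero-shot transfer across architectures and modalities.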