
🛬 Attending

🇰🇷 ICML 2026, Seoul, KR


Thank you for support grants from:

Coefficient Giving

Thinking Machines Lab

Zulip for Open-Source Project


<aside> <img src="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-806f-bf66-007a2758f01d" alt="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-806f-bf66-007a2758f01d" width="40px" /> Google Scholar

</aside>

<aside> <img src="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8076-afbe-007af223fbba" alt="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8076-afbe-007af223fbba" width="40px" /> Research Gate

</aside>

<aside> <img src="attachment:5a1e77b6-a69a-4a9c-8282-b0e70b4de887:Less_Wrong_LOGO.png" alt="attachment:5a1e77b6-a69a-4a9c-8282-b0e70b4de887:Less_Wrong_LOGO.png" width="40px" /> LessWrong

</aside>

Socials


<aside> <img src="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8012-8d55-007a1a6ff476" alt="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8012-8d55-007a1a6ff476" width="40px" />

LinkedIn

</aside>

<aside> <img src="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8059-bfa4-007a8ad3da3f" alt="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8059-bfa4-007a8ad3da3f" width="40px" /> BlueSky

</aside>

<aside> <img src="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8059-811e-007a23ecdf0a" alt="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8059-811e-007a23ecdf0a" width="40px" /> X/Twitter

</aside>


Meeting w/ me

Please email me before scheduling any meetings here; I only take calls after an explicit invitation and with people I know.

The rise of reasoning agents with greater autonomy is a double-edged sword that enables unique failure modes: agentic reasoning can be logically or factually invalid (reasoning not done right), and even when it is valid, agents can work against you in their decision-making and interactions with each other (reasoning done right, but for the wrong purpose).

My research focuses on making AI smarter and making smarter AI safer, especially in multi-agent settings (which I believe are an inevitable trend for future deployment at scale):

Line A: Reasoning Done Right (How to Make AI Smarter): My vertical focus is the science of evaluation to better inform post-training, often with the help of actionable interpretability methods. Ultimately, this line of work contributes to AI4Science that accelerates scientific discovery; I'm particularly intrigued by AI applications in fields where physics and chemistry collide, such as (exo-)planetary and environmental science.

I try to follow several principles in developing evaluation benchmarks:

  1. Avoid toy models where possible: a sim-to-real gap is always there, but we should minimize it and meaningfully reflect real-world workflows in agentic settings (CauSciBench).
  2. If we have to use toy models, the design should serve a purpose (usually, to disentangle a specific capability/propensity from results that often reflect a mixture of many); in this line I've tackled multimodal reasoning (SeePhys) and logical reasoning with formal verification (Lean+TCS).
  3. If we have to use toy models, the design should be principled and grounded in theory; some of our current effort involves building a comprehensive taxonomy and benchmarks for AI deception/sycophancy grounded in cognitive science (GT-HarmBench, and 2 more coming this month!).

Line B: Reasoning For Good (How to Make Smarter Agents Safer?): Reasoning and agency also enable novel threat models such as deception, scheming, and collusion. My vertical focus is identifying the key triggers/minimum conditions that suppress or encourage such agentic misalignment. Recently, I've been trying to probe

  1. how agents react differently in realistic (quasi-deployment) vs. fictional (quasi-evaluation) scenarios, and more broadly how prompt sensitivity can easily make or break AI safety evaluation results (for example, see our recent results on how previously reported alignment faking is highly sensitive to prompt formulation)
  2. how agents interact with each other in deceptive, even covert ways, such as code-switching and more sophisticated steganographic, encoded reasoning. The divergence of the reasoning space from the action space is probably the most critical failure mode for agentic misalignment (especially in large-scale deployment)

Eventually, I believe the best way to ensure scalable agentic safety is to learn safety constraints via consequence-aware reasoning (CoI, Chain-of-Implication), similar to how legal deterrence works for humans.

Useful:

The following represent only my personal opinions:

Science of Evaluation

Science of Post-Training

Education


<aside> 🐔 PhD

Advisor:

</aside>

<aside> 🐥 MSc. Interdisciplinary Science ETH (CS and Physics) ETH Zurich, Switzerland (2024-2025)

Thesis Advisor: Prof. Zhijing Jin, Prof. Bernhard Schölkopf

</aside>

<aside> 🐣 BSc. Interdisciplinary Science ETH (CS, Physics and Chemistry) ETH Zurich, Switzerland (2023-2025)

Thesis Advisor: Prof. Mrinmaya Sachan

</aside>