
🛬 Attending

🇰🇷 ICML 2026, Seoul, KR


Thank you for support grants from:

Coefficient Giving

Thinking Machines Lab

Zulip for Open-Source Project


<aside> <img src="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-806f-bf66-007a2758f01d" alt="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-806f-bf66-007a2758f01d" width="40px" /> Google Scholar

</aside>

<aside> <img src="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8076-afbe-007af223fbba" alt="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8076-afbe-007af223fbba" width="40px" /> Research Gate

</aside>

<aside> <img src="attachment:5a1e77b6-a69a-4a9c-8282-b0e70b4de887:Less_Wrong_LOGO.png" alt="attachment:5a1e77b6-a69a-4a9c-8282-b0e70b4de887:Less_Wrong_LOGO.png" width="40px" /> LessWrong

</aside>

Socials


<aside> <img src="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8012-8d55-007a1a6ff476" alt="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8012-8d55-007a1a6ff476" width="40px" />

LinkedIn

</aside>

<aside> <img src="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8059-bfa4-007a8ad3da3f" alt="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8059-bfa4-007a8ad3da3f" width="40px" /> BlueSky

</aside>

<aside> <img src="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8059-811e-007a23ecdf0a" alt="notion://custom_emoji/0d39d0ab-438c-4f29-be70-03aa9d912057/2c928fcd-40c2-8059-811e-007a23ecdf0a" width="40px" /> X/Twitter

</aside>


Meeting w/ me

Please email me before scheduling any meetings here; I only take calls after an explicit invitation and with people I know.

The rise of reasoning agents with greater autonomy is a double-edged sword that enables unique failure modes: agentic reasoning can be logically or factually invalid (reasoning not done right), and even when it is valid, agents can work against you in their decision-making and interactions with each other (reasoning done right, but for the wrong purpose).

My research focuses on making AI smarter and making smarter AI safer, especially in multi-agent settings (which I believe are an inevitable trend for future deployment at scale):

Line A: Reasoning Done Right (How to Make AI Smarter): My vertical focus is the science of evaluation to better inform post-training, often with the help of actionable interpretability methods. Ultimately, this line of work contributes to AI4Science that accelerates scientific discovery; I'm particularly intrigued by AI applications in fields where physics and chemistry collide, such as (exo-)planetary and environmental science.

I try to follow several principles in developing evaluation benchmarks:

  1. Avoid toy models where possible: a sim-to-real gap is always there, but we should minimize it and meaningfully reflect real-world workflows in agentic settings (CauSciBench).
  2. If we have to use toy models, the design should serve a purpose (usually, to disentangle a specific capability/propensity from results that often reflect a mixture of many); in this line I've tackled multimodal reasoning (SeePhys) and logical reasoning with formal verification (Lean+TCS).
  3. If we have to use toy models, the design should be principled and grounded in theory; some of our current effort involves building a comprehensive taxonomy and benchmarks for AI deception/sycophancy grounded in cognitive science (GT-HarmBench, and 2 more coming this month!).

Line B: Reasoning For Good (How to Make Smarter Agents Safer?): Reasoning and agency also enable novel threat models such as deception, scheming, and collusion. My vertical focus is identifying the key triggers/minimum conditions that suppress or encourage such agentic misalignment. Recently, I've been trying to probe

  1. how agents react differently in realistic (quasi-deployment) vs. fictional (quasi-evaluation) scenarios, and more broadly how prompt sensitivity can easily make or break AI safety evaluation results (for example, see our recent results on how previously reported alignment faking is highly sensitive to prompt formulation)
  2. how agents interact with each other in deceptive, even covert ways, such as code-switching and more sophisticated steganographic, encoded reasoning. The divergence of the reasoning space from the action space is probably the most critical failure mode for agentic misalignment (especially in large-scale deployment)

Eventually, I believe the best way to ensure scalable agentic safety is to learn safety constraints via consequence-aware reasoning (CoI, Chain-of-Implication), similar to how legal deterrence works for humans.

Useful:

The following represent only my personal opinions:

Science of Evaluation

Science of Post-Training

Education


<aside> 🐔 PhD

Advisor:

</aside>

<aside> 🐥 MSc. Interdisciplinary Science ETH (CS and Physics) ETH Zurich, Switzerland (2024-2025)

Thesis Advisor: Prof. Zhijing Jin, Prof. Bernhard Schölkopf

</aside>

<aside> 🐣 BSc. Interdisciplinary Science ETH (CS, Physics and Chemistry) ETH Zurich, Switzerland (2023-2025)

Thesis Advisor: Prof. Mrinmaya Sachan

</aside>