Yuan-Sen Ting (丁源森)

The Ohio State University

Expediting Discoveries in Astronomy with A.I. Agents

NSF awarded over $200 million for AI Research Institutes

~ 2 centers

~ 2 centers

Physical Sciences

~ 3 centers

7 centers x 15M ~ 100M

Environmental Sciences

Biological Sciences

Hype, myth, or real deal?

Why hasn't astronomy had its
"AlphaFold" moment yet?

YST, Annual Review of Astronomy and Astrophysics, arXiv: 2510.10713

Most AI in Astronomy focuses on extending statistical methods

[Figure: constraints in the \Omega_M (dark matter density) vs. \sigma_8 (growth amplitude) plane]

E.g.,
simulation-based
inferences

Sihao Cheng, YST+, 2020

Applying A.I. to individual tasks
will have limited impact in astrophysics

The complexity of astronomy is too low for AI

My niece

Highly non-Gaussian

Weakly non-Gaussian

Cosmic large-scale structure

Astronomy is not biology

Data / Observation

Theory / Hypothesis

Analysis Pipelines

True

False

Biology faced fundamental bottlenecks from individual tasks

Data / Observation

Theory / Hypothesis

Analysis Pipelines

True

False

AlphaFold

Most astronomical tasks already have working heuristics

Data / Observation

Theory / Hypothesis

ΛCDM

True

False

Toward Agentic Research for Astronomy

Data

Theory

State of the research

Making "plans"

Harness reasoning

Beyond just individual task optimizations

A.I. in Math Olympiads

A.I. in Astronomy Olympiads

Pinheiro, ..., YST+, 2025

In an open-world setting, can large language models match human researchers at expediting
data exploration?

Sun, YST+, 2024, 2025

??

Can A.I. agents understand spectral data (spectral energy distribution) from JWST?

Real-world reasoning extends far beyond algorithmic formalism

A default fit with
an SED model

Extinction model ?


Young stellar population?


Many real-world problems aren't simple optimization problems

The objective goes beyond minimizing a single error metric.

Many tasks may require modifying assumptions / physical models, not just optimizing over all parameters

Action spaces are vast and hard to parameterize.
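As a minimal sketch of what "beyond a single error metric" can mean in practice, a fit can be summarised by several numbers at once, for example the chi-square together with the number of photometric bands fitted within 1σ. The `FitReport` class, the `evaluate_fit` function, and the toy fluxes below are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FitReport:
    """Several complementary views of the same fit (illustrative only)."""
    chi_square: float
    bands_within_1sigma: int
    n_bands: int


def evaluate_fit(observed, predicted, sigma):
    """Summarise a fit with more than one number.

    observed, predicted, sigma: equal-length sequences of fluxes,
    model fluxes, and 1-sigma uncertainties (toy inputs here).
    """
    # Normalised residual per band.
    residuals = [(o - p) / s for o, p, s in zip(observed, predicted, sigma)]
    chi_square = sum(r * r for r in residuals)
    # Count bands whose model flux lies within 1 sigma of the data.
    within = sum(1 for r in residuals if abs(r) <= 1.0)
    return FitReport(chi_square, within, len(residuals))


# Toy example: two bands fit well, two bands are off by 2 sigma.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 2.0, 2.0, 4.2]
sigma = [0.2, 0.2, 0.5, 0.1]
report = evaluate_fit(observed, predicted, sigma)
print(report)
```

Two fits with similar chi-square can differ in how many bands they reproduce, which is one reason a single scalar objective may hide the failure mode that matters.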

Can a large-language model learn
from its own experience?

Human "intuition" + experience

Introducing Mephisto*

* In the classic tale of Faust, Mephisto is a demon who tempts the scholar Faust with knowledge and power in exchange for his soul.

A collaboration of multiple AI agents (LLM models)

Proposing actions

Execute actions

State evaluation

Knowledge distillation


Enabling AI to collect "knowledge" through exploration

Knowledge base

1. Proposing actions - e.g., different physical models / parameter ranges
2. Executing actions - write configuration files and run the codes autonomously
3. State evaluation - evaluate the results (beyond a single error metric)
4. Knowledge distillation - summarise useful actions given the previous state
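The four steps can be sketched as a simple loop. This is a toy stand-in, not the Mephisto implementation: the "agents" are plain functions, the "physical model" is a single number, and every name (`propose_action`, `execute_action`, `evaluate_state`, `distill`, `run_walker`) is hypothetical. In a real system each function would be an LLM call or a run of a fitting code.

```python
import random


def propose_action(knowledge, rng):
    """Stand-in for the proposing agent: choose a parameter change.

    Most of the time it reuses an action the knowledge base
    recorded as helpful; otherwise it explores at random.
    """
    helpful = [k["action"] for k in knowledge if k["improved"]]
    if helpful and rng.random() < 0.7:
        return rng.choice(helpful)
    return rng.choice([-1.0, -0.1, 0.1, 1.0])


def execute_action(state, action):
    """Stand-in for writing a configuration file and running the code."""
    return state + action


def evaluate_state(state, target=3.0):
    """Stand-in for state evaluation; lower is better."""
    return (state - target) ** 2


def distill(knowledge, action, old_score, new_score):
    """Record whether the action helped, for use in later proposals."""
    knowledge.append({"action": action, "improved": new_score < old_score})


def run_walker(n_iterations=30, seed=0):
    rng = random.Random(seed)
    state, knowledge = 0.0, []
    score = evaluate_state(state)
    for _ in range(n_iterations):
        action = propose_action(knowledge, rng)        # 1. propose
        candidate = execute_action(state, action)      # 2. execute
        new_score = evaluate_state(candidate)          # 3. evaluate
        distill(knowledge, action, score, new_score)   # 4. distill
        if new_score < score:                          # keep improvements only
            state, score = candidate, new_score
    return state, score, knowledge


final_state, final_score, knowledge = run_walker()
print(final_state, final_score, len(knowledge))
```

The design point is the feedback path: what is distilled in step 4 changes what is proposed in step 1, so later iterations can reuse what earlier ones found.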

Mephisto - deployed as "walkers" in the action space

[Figure: chi-square of the fit (axis ticks 5.1 to 6.4) vs. number of learning iterations (0 to 30), comparing Mephisto with the GPT-4o baseline ("think without knowledge")]

LLMs with self-improvement outperform native LLMs

Fitting JWST JADES data

Sun, YST+, 2024

Example of learned "knowledge"

" If the fit is overestimated in the UV and optical bands,

increasing the E_BV_lines parameter may lead to a better fit by accounting for more dust attenuation in these bands. "

Sun, YST+, 2025

Mephisto operates as walkers exploring the "hypothesis space"

With COSMOS2020 SEDs

Mephisto finds better solutions using only 1% of the trials that brute force methods require

YST+, 2025g

Fitting equivalent widths used to require human judgments

YST+, 2025g

E.g., deciding whether there's an unresolved blend of lines

YST+, 2025g

E.g., adjusting for the continuum

YST+, 2025g

What took a trained postdoc six months now costs ~$100 with agents

Liu, YST+ 2024

Agents can sift through hundreds of millions of ASAS-SN
light curves and reason their way to interesting outliers

Pesta & YST, in prep.

[Figure: two phase-folded light curves; magnitude (about 11 to 16) vs. phase (-0.5 to 0.5)]

Caught in the brief, unstable evolutionary semi-detached phase 

A rare alignment of
a massive Supergiant
in a 13-year orbit

P=13 years

P=2.3 days
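For readers unfamiliar with the phase axis in these light curves, here is a minimal sketch of phase folding, assuming a known period and using the 2.3-day value only as an example. The function name `fold_phase`, the reference epoch `t0`, and the observation times are hypothetical.

```python
def fold_phase(times, period, t0=0.0):
    """Map observation times to orbital phase in [-0.5, 0.5).

    times:  observation times, in the same unit as `period`
    period: assumed orbital period
    t0:     reference epoch that defines phase zero (assumed)
    """
    phases = []
    for t in times:
        cycles = (t - t0) / period
        # Shift by half a cycle so the result is centred on zero.
        phases.append((cycles + 0.5) % 1.0 - 0.5)
    return phases


# Toy observation times in days, folded on a 2.3-day period.
times = [0.0, 1.15, 2.3, 2.875]
phases = fold_phase(times, period=2.3)
print(phases)
```

Plotting magnitude against these phases stacks all orbital cycles on top of each other, which is what makes a short periodic signal visible in sparse survey data.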

Graduate student / Postdoc

The Plot Twist

A.I. still struggles with many tasks that are easy for humans

Princeton Language and Intelligence Lab, June 2024

Human accuracy: ~80%

GPT-4o: ~47%

Can A.I. reason about scientific charts?

ARC Prize Foundation (ARC-AGI-2, 2025)

Spatial Pattern Reasoning

Human Panel : ~ 100%

GPT-5 : ~10%

Moravec's Paradox (1988)

Things that seem easy for humans might be hard for computers, and vice versa

Reversing the evolution of "intelligence"

Evolution Timeline: What came first vs. last

Conversational abilities
are the easiest to imitate

A lot of our holistic abilities were developed much earlier

Easy-for-AI

Complex calculations

Easy-for-Human

Logical inference (?)

Memorizing information

Language

Coding

Spatial reasoning

Common sense physics
(water flows downhill)

Basic motor skills

Visual reasoning

Understanding context

A.I. in Astronomy Olympiads

Pinheiro, ..., YST+, 2025

Visual reasoning remains a limiting factor for AI agents

Pinheiro, ..., YST+, 2025

Yang,... YST+, 2025, ICCV

YST+, 2025d

Visualizing the knowledge graph in astronomy

Sun, YST+, 2024b

de Haan, YST+, 2025

Score (%)

Cost per 1 SED Source (USD)

AstroSage-8B
(de Haan, YST+ 2025a)

AstroSage-70B
(de Haan, YST+ 2025b)

For astronomy Q&A, AstroSage-70B delivers GPT-5-level performance while costing 20x less

de Haan, YST+, 2024, 2025

AI-capable tasks are getting exponentially cheaper

de Haan, YST+, 2025

To physicists who think AI can't do their jobs ...

they are in for a surprise.

But for AI-for-Science hype?

They're just as wrong.


Epistemology: What counts as knowledge?

Supported by the Alfred P. Sloan Foundation and CCAPP / OSU

YST+ 2026, Nature Astronomy, in-press.

"One particularly useful conception of understanding emphasizes several interconnected capacities: characterizing the features of a system, communicating those characteristics so that others can mentally reconstruct them .... "

YST+ 2026, Nature Astronomy, in-press.

"... On this way of thinking, understanding is a matter of making the world intelligible to communities of inquirers. "

Narrative matters, rhetoric matters, context matters

Ernest Hemingway's six-word story

For sale:

Baby shoes,

Never worn.

"Scientific understanding in complex domains shares something of this character.

The ‘knowledge’ encoded in a successful model of galaxy formation
is not fully captured by its equations or even its predictions; it includes the tacit understanding of which features matter, why they matter, and how they connect to the broader enterprise of astronomy."

YST+ 2026, Nature Astronomy, in-press.

YST+ 2026, Nature Astronomy, in-press.

"We may discover what astronomy has always tacitly known: that understanding the universe is a distinctly human project—even when, especially when, we have non-human collaborators in the endeavour."

Extra Slides

Annotated
Labelled Data

supervised
tasks

Unlabelled Data

foundational models

Interacting with "physical" world

AI
astronomer

Example of learned "knowledge"

" If there is a gross underestimation in the MWIR bands,

consider exploring a wider range of fracAGN values in the agn module to improve the fit in these bands "

[Figure: chi-square of the fit and number of photometry bands fitted within 1σ vs. number of learning iterations (0 to 30), with regions labelled "Exploration" and "Exploitation"]

Why this plateau ??

LLMs with self-play RL outperform native LLMs

Sun, YST+, 2024

Explaining James Webb's "little red dot" galaxies with Mephisto

[Figure: flux vs. wavelength [micron]]

Sun, YST+, 2025

A seamless and interpretable AI-human collaboration

Learn from the data

Summarize "knowledge"

Examine and include prior knowledge


Expedite discovery

Use the learned knowledge as context

Quantifying the growth of the field -- by groups of concepts

[Figure: count of scientific concepts [thousands] (axis ticks 7 to 11) vs. year (2000 to 2020)]

Sun, YST+, 2024b

[Figure: concept count [thousands] (axis ticks 0.3 to 1.5) vs. year (2000 to 2020) for numerical simulation and statistics]

Sun, YST+, 2024b

The number of ML concepts in astronomy has not grown

[Figure: count of machine learning concepts [thousands] (axis ticks 0.3 to 1.5) vs. year (2000 to 2020); examples: Linear Regression, Gaussian Process, Random Forest, ...; annotated values: 152, 210, 230]

Sun, YST+, 2024b

Quantifying the cross-domain interaction:
How technical concepts inspire scientific ones

Knowledge graph via the literature-citation metric

[Diagram: concepts (e.g., Concept A: Dark Matter; Concept B: Plasmon) are contained in papers (e.g., Ting et al.; Einstein et al.), and papers are linked by citation]

Distance from concept A to concept B = a paper-level citation quantity, averaged over all papers containing concept A

[Diagram: the same graph for the technical concept "Neural Networks" and the scientific concept "Large-Scale Structure"]
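The exact distance used in the knowledge graph is not spelled out here, so the following is only one hypothetical way to build a citation-based distance between concepts, not the metric from the cited work. It assumes a distance defined as the fewest citation hops from a paper containing concept A to any paper containing concept B, averaged over all papers containing A. The paper labels `P1` to `P4`, the function names, and the toy graph are invented for illustration.

```python
from collections import deque


def hops_to_concept(start, cites, papers_with_b):
    """Fewest citation hops from `start` to any paper containing concept B.

    Breadth-first search over the citation graph; returns None if no
    such paper is reachable.
    """
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        paper, hops = queue.popleft()
        if paper in papers_with_b:
            return hops
        for cited in cites.get(paper, ()):
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, hops + 1))
    return None


def concept_distance(concept_a, concept_b, contains, cites):
    """Average hop count over all papers containing concept A (assumed metric)."""
    papers_a = [p for p, concepts in contains.items() if concept_a in concepts]
    papers_b = {p for p, concepts in contains.items() if concept_b in concepts}
    hops = [hops_to_concept(p, cites, papers_b) for p in papers_a]
    reachable = [h for h in hops if h is not None]
    return sum(reachable) / len(reachable) if reachable else None


# Toy graph: which concepts each paper contains, and which papers it cites.
contains = {
    "P1": {"Dark Matter"},
    "P2": {"Dark Matter", "Plasmon"},
    "P3": {"Plasmon"},
    "P4": {"Dark Matter"},
}
cites = {"P1": ["P3"], "P4": ["P1"]}

distance = concept_distance("Dark Matter", "Plasmon", contains, cites)
print(distance)
```

Because the average runs only over papers containing concept A, a measure of this kind is generally asymmetric: the distance from A to B need not equal the distance from B to A.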

Cross-domain linkage shows a two-phase evolution

[Figure: log average linkage (axis ticks -4.6 to -4.0) vs. year (2000 to 2020) for numerical simulation x scientific concepts, with curves for N-body simulation and hydrodynamical simulation. Two phases are marked: "Technology development", when simulations are being developed and the linkage with scientific concepts (e.g., Large-Scale Structure) is decoupled, and "Technology deployment", when simulations are being deployed to the sciences and the linkage increases.]

Sun, YST+, 2024b

Interest in AI x Astronomy outpaces technological development

[Figure: log average linkage (axis ticks -4.6 to -4.0) vs. year (2000 to 2020) for ML x scientific concepts, with curves for Gaussian process and multi-layer perceptron]

We don't understand how people intuitively understand plots

AI is still 20-50 points worse than humans

Brute force fine-tuning can close the gap in simple descriptive tasks, but not in visual reasoning tasks 

Yang,... YST+, 2025, ICCV

Our concepts show finer granularity than keywords

YST+, 2025d


The temporal evolution of concept co-occurrences in papers

[Figure: concept co-occurrence matrix; rows and columns: Cosmology, Galaxy, High-energy, Sun/Star, Exoplanet, Simulation, Instrument, AI/Stat]

Applications of AI in Stats

YST+, 2025d

We also need a capable model that can run cost-efficiently ...

capable model

vs.

cost efficiency

e.g., GPT-5

In the SED case study, we need ~0.1M tokens per source

= USD 1 per source ...

1B sources = $1 billion
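The arithmetic behind this estimate can be written out as a short sketch. The token count and per-source cost are the slide's round numbers; the per-million-token price is not stated and is only what those two numbers imply.

```python
# Round numbers taken from the slide.
TOKENS_PER_SOURCE = 100_000       # ~0.1M tokens per SED source
USD_PER_SOURCE = 1.0              # ~USD 1 per source
N_SOURCES = 1_000_000_000         # 1B sources

# Implied price per million tokens (derived, not quoted on the slide).
usd_per_million_tokens = USD_PER_SOURCE / (TOKENS_PER_SOURCE / 1_000_000)

# Total cost and total token volume for the full survey.
total_usd = USD_PER_SOURCE * N_SOURCES
total_tokens = TOKENS_PER_SOURCE * N_SOURCES

print(f"implied price: ${usd_per_million_tokens:.0f} per 1M tokens")
print(f"total cost:    ${total_usd:,.0f}")
print(f"total tokens:  {total_tokens:.1e}")
```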

e.g., Roman Space Telescope, Euclid Space Telescope

~ the build cost

Can we improve lightweight
open-weights LLMs to perform well on astronomical tasks?

Natural Language Processing experts

Oak Ridge
National Lab

Argonne
National Lab

 AstroMLab (astromlab.org)

Harvard-Smithsonian ADS

U. Illinois
Urbana-Champaign

The first extensive benchmarking effort in astronomy


Knowledge Recall

YST+, 2025a

AstroBench: High quality astronomy QA benchmark dataset

Nguyen, YST+ 2023

Benchmark multiple choice question - example

What is the primary reason for the decline in the number density of luminous quasars at redshifts greater than 5?

(A) A decrease in the overall star formation rate, leading to fewer potential host galaxies for quasars.

(B) An increase in the neutral hydrogen fraction in the intergalactic medium, which obscures the quasars’ light.

(C) A decrease in the number of massive black hole seeds that can form and grow into supermassive black holes.

(D) An increase in the average metallicity of the Universe, leading to a decrease in the efficiency of black hole accretion.

Special thanks to

Beyond just benchmarking astronomical knowledge recall

Warren Buffett:
"The trick is, when there is nothing to do, do nothing."

Still, it is not very scalable

LLaMA-3.1 70b throughput on four H100 GPUs

= ~ 100 tokens / second

1 SED source = 15 GPU minutes

1B sources = 10M GPU days

A cluster with 10,000 H100 GPUs 
running for 3 years
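The scaling estimate can be reproduced with a few lines of arithmetic. The sketch below takes the slide's values at face value and makes two assumptions that are not stated: the ~0.1M tokens per source comes from the SED case study, and the four-GPU node is treated as a single unit of compute when counting "GPU minutes".

```python
# Values from the slides.
TOKENS_PER_SECOND = 100           # LLaMA-3.1 70B on four H100 GPUs
TOKENS_PER_SOURCE = 100_000       # ~0.1M tokens per SED source (case study)
MINUTES_PER_SOURCE = 15           # the slide's rounded figure
N_SOURCES = 1_000_000_000         # 1B sources
CLUSTER_GPUS = 10_000

# Time per source implied by the throughput (about 16.7 minutes,
# which the slide rounds to 15).
minutes_from_throughput = TOKENS_PER_SOURCE / TOKENS_PER_SECOND / 60

# Total compute, using the slide's 15-minute figure.
gpu_days = N_SOURCES * MINUTES_PER_SOURCE / (60 * 24)

# Wall-clock time on the cluster.
cluster_years = gpu_days / CLUSTER_GPUS / 365.25

print(f"minutes per source from throughput: {minutes_from_throughput:.1f}")
print(f"total: {gpu_days / 1e6:.1f}M GPU days")
print(f"cluster time: {cluster_years:.2f} years")
```

This gives roughly 10M GPU days and just under 3 years on 10,000 GPUs, consistent with the rounded values above.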

= 0.03 USD

= 40 USD

Huang's Law

Compute Power

Year

CPU Moore's Law is plateauing

GPU is
picking up the pace

LLMs are getting very cheap, very quickly

The price drop has an e-folding time of approximately
3 months
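An e-folding time of 3 months means the price falls by a factor of e every 3 months. Assuming a pure exponential decline, which is an idealisation, the sketch below shows how to convert that into a relative price after a given time and into a halving time. The function names are hypothetical.

```python
import math


def relative_price(months, efold_months=3.0):
    """Price relative to today after `months`, for an exponential decline."""
    return math.exp(-months / efold_months)


def halving_time(efold_months=3.0):
    """Months needed for the price to halve under the same assumption."""
    return efold_months * math.log(2)


for m in (3, 6, 12):
    print(f"after {m:2d} months: {relative_price(m):.3f} of the starting price")
print(f"halving time: {halving_time():.2f} months")
```

Under this assumption the price halves roughly every two months.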

YST, AstroMLab+, 2025

[Figure sequence: score (%) vs. cost per 1 SED source (USD) for open-weight and proprietary models, shown for releases before July 2024 and then in successive +3-month steps. Models labelled across the frames: Google Gemma-2, Google Gemini-1.5, DeepSeek v2, Alibaba Qwen-2.5, Meta LLaMA 3, Yi 01, X's Grok, Stepfun, Microsoft Phi-3.5, Nvidia's Nemotron, DeepSeek v3 / R1, OpenAI (o3), Google Gemini-2.0, Microsoft Phi-4, MiniMax 01, Gemini-2.5-Pro, Claude-3.7-Sonnet, Meta LLaMA 4. Some proprietary models are marked "Experimental / Not Released".]

1B sources = $1 billion

e.g., Roman Space Telescope, Euclid Space Telescope

~ the build cost (July 2024)

3% of the build cost (March 2025)

Mephisto achieves the same success rate with 1/30 of the cost

March 2025

Sun, YST+, 2025

GPT-4o

QwQ-32B

Data-poor, Theory-rich

Collecting
more data

???

Data-rich, Theory-poor

Roman, HSC, Euclid, DESI, SDSS, PFS

What A.I. agent
can solve

Interesting astronomy problems


JWST SED Fitting

Summary :

Modern LLMs' reasoning capabilities make AI agents an exciting new paradigm for astronomical research.

Though limited in reasoning, Mephisto analyzes and navigates SED physical models as effectively as humans.

Nonetheless, expectations should be tempered: Moravec's paradox leaves AI capabilities too uneven for full autonomy.

Fine-tuning models, building ecosystems, and proper benchmarking to enable cost-effective, well-rounded AI agents is the path forward.