Presentation Schedule
Revisiting Psycholinguistic Norms: Comparing Human and GPT-/DeepSeek-Derived Ratings on Concreteness, Imageability, Familiarity, Valence, and Arousal of 25,000+ Two-Character Chinese Words (94040)
Friday, 11 July 2025 15:45
Session: ECE Poster Session
Room: SOAS, Brunei Suite (Ground Floor)
Presentation Type: Poster Presentation
In typical psycholinguistic norming studies, participants rate individual words on lexical variables (e.g., concreteness, valence). These ratings allow researchers to select stimuli for controlling or manipulating lexical variables (Tse et al., 2021) and to examine how these variables influence performance in lexical processing tasks, addressing questions in word recognition (Tse & Yap, 2018). However, collecting human rating data is time-consuming and labor-intensive. Recently, Large Language Models (LLMs) (e.g., GPT-4o) have been employed to approximate human ratings using conversational probes (e.g., Martínez et al., 2025; Trott, 2024). Extending this approach, our study investigated the relationship between human ratings (Chan & Tse, 2024) and ratings derived from two LLMs (GPT-4o-Turbo, DeepSeek-R1-FW) for the concreteness, imageability, familiarity, valence, and arousal of more than 25,000 two-character Chinese words. Among GPT, DeepSeek, and human ratings, valence yielded the strongest intercorrelation (mean = .82), followed by concreteness (.69), arousal (.64), imageability (.63), and familiarity (.58). Across these five variables, GPT and DeepSeek correlated similarly with human ratings (both mean = .65), and both of these correlations were lower than the correlation between GPT and DeepSeek themselves (.71). We further examined how these ratings predict lexical decision and naming performance (Tse et al., 2017, 2023), while controlling for orthographic, phonological, and semantic factors (Tse et al., 2023). Results indicate that although LLM-derived valence and familiarity ratings aligned with human ratings in predicting lexical decision and naming performance, the predictions diverged for concreteness, imageability, and arousal. These findings suggest caution in replacing human ratings with LLM-derived values when norming lexical variables.
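The intercorrelation analysis described above can be illustrated with a minimal sketch. This is not the study's analysis code; the function and the toy rating values below are hypothetical, shown only to make the Pearson-correlation comparison between human and LLM-derived norms concrete.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length rating lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative toy data (not the study's ratings): valence ratings for six
# hypothetical words, once from human raters and once from an LLM probe.
human_valence = [1.2, 3.4, 5.0, 2.1, 4.3, 6.0]
llm_valence   = [1.0, 3.0, 4.8, 2.5, 4.0, 5.5]

print(f"human-LLM valence correlation: {pearson_r(human_valence, llm_valence):.3f}")
```

In the study itself, such correlations were computed per variable (valence, concreteness, arousal, imageability, familiarity) and across three rating sources (human, GPT, DeepSeek), yielding the mean intercorrelations reported above.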
Authors:
Xi Cheng, The Chinese University of Hong Kong, Hong Kong
Xi Huang, The Chinese University of Hong Kong, Hong Kong
Yuen-Lai Chan, Lingnan University, Hong Kong
Chi-Shing Tse, The Chinese University of Hong Kong, Hong Kong
About the Presenter(s)
Professor Chi-Shing Tse is a University Professor/Principal Lecturer at The Chinese University of Hong Kong in Hong Kong.
Connect on LinkedIn
https://www.linkedin.com/in/chi-shing-tse-44b7451b6/