Carnegie Mellon University
February 28, 2025

Data Visualization Mapped MS-DAS Graduate Research Project

By Sarah Bender

Research Data Services Librarian Alfredo González-Espinoza wants to make it easier for CMU researchers to find related projects and opportunities for collaboration across disciplines.

KiltHub, CMU’s institutional repository, contains around 4,000 theses uploaded by graduate students across the university. But within KiltHub, there’s no easy way to explore how a research topic might be connected to other areas of campus. González-Espinoza began looking for ways to gather and display this data, making it accessible to students and encouraging them to engage more with other researchers.

Chehak Arora, a student in the Mellon College of Science’s Master of Science in Data Analytics for Science (MS-DAS) program, reached out to González-Espinoza.

“I’ve always been passionate about working with data. I enjoy uncovering insights from complex datasets and applying analytical techniques to solve real-world problems,” Arora said. “Over winter break, I wanted to make the most of my time and work on something meaningful.”

Arora is a member of the Tartan Research Data Alliance led by González-Espinoza and Open Knowledge Librarian Emily Bongiovanni. The program aims to create a community of practice around research data management, foster good research practices, introduce researchers to the support and services the libraries offer, and identify networking opportunities and potential collaborations.

Arora asked González-Espinoza to help connect her with data projects happening at Carnegie Mellon, so she could practice her skills and contribute to meaningful work being done by the CMU community.

“This project was particularly interesting because it provided valuable hands-on experience with machine learning models for text data,” she said. “I learned a lot about how clustering algorithms work, especially in terms of identifying patterns and grouping similar data points.”

The researchers applied natural language processing techniques and large language models such as BERT to thesis metadata, organizing and analyzing the theses based on their content.

Over the course of the project, Arora said she learned how to use the BERT-based Sentence Transformers package to process text from thesis abstracts and transform it into a semantic map.
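To give a rough sense of what "semantic" means here, the sketch below compares short numeric vectors with cosine similarity, a common way to measure how closely two embeddings point in the same direction. It is only an illustration: the three-dimensional vectors and thesis titles are made up for this example, whereas a real sentence-embedding model would produce vectors with hundreds of dimensions from the abstracts themselves, and the article does not say which similarity measure the project used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumption: invented for illustration only).
toy_embeddings = {
    "robot task planning": [0.9, 0.1, 0.0],
    "robot motion control": [0.8, 0.2, 0.1],
    "political rhetoric history": [0.0, 0.2, 0.9],
}


def most_similar(title, embeddings):
    """Return the other title whose vector is closest in direction to `title`."""
    query = embeddings[title]
    others = [t for t in embeddings if t != title]
    return max(others, key=lambda t: cosine_similarity(query, embeddings[t]))


print(most_similar("robot task planning", toy_embeddings))
```

In this toy setup the two robotics titles end up closest to each other, which mirrors the idea that theses with related themes sit near one another on the map.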

González-Espinoza and Arora used HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) and other techniques to visualize a high-dimensional map of research similarities in a two-dimensional, interactive space.

“Imagine a star map, but instead of stars, you're looking at every graduate thesis in KiltHub. Each point represents a thesis, and the space between them shows how related their topics are — the closer together, the more similar their research themes,” González-Espinoza explained. “This creates a fascinating ‘galaxy’ of CMU graduate research.”

The interactive scatter plot, which was first shared as a part of the Love Data Week 2025 (Feb. 10-14) celebration, enables users to zoom, pan and hover over data points to explore research trends, document titles and topic distributions across colleges. Among the most popular thesis topics are robotic task planning, computer vision and political rhetoric history.

“One surprising insight was the overlap between theses from different colleges,” Arora said. “It was interesting to see how research topics from seemingly distinct disciplines had thematic or methodological similarities. This highlighted the interdisciplinary nature of many academic fields and showed how research topics can be interconnected across domains.”

The visualization captures the semantic meaning of the research, rather than just keywords, to show conceptual relationships between different works. With the visualization, users can explore connections between different fields and find potential areas for new research. It can even be used to identify unique research opportunities for collaboration by highlighting gaps inside a field or across disciplines.

“Working on this project was an exciting experience for me. I gained a deep understanding of these techniques, and now I’m applying them to my capstone project,” said Arora, who is now studying sentiment analysis in banking. “It was exciting to see how a model can be applied to different datasets.”
