July 5, 2025

New Study Evaluates Text Embedding Models for Built Asset Data Alignment

News Summary

A recent study has benchmarked various text embedding models to assess their effectiveness in automating the alignment of complex built asset data with technical concepts. This research aims to fill the gap in comprehensive evaluations of text embedding technologies within this specialized domain. The findings indicate significant variability in model performance, emphasizing the importance of tailored assessments for effective asset management and the future exploration of domain-specific adaptations.

Advancements in Automating Built Asset Information Alignment

Accurate mapping of built asset information to data classification systems is a crucial element in ensuring effective asset management. The complexity of built asset data, primarily made up of technical text elements, has traditionally necessitated manual alignment, heavily reliant on the expertise of domain professionals. Recent advancements in contextual text representation learning, known as text embedding, present promising avenues for automating this intricate data alignment process.

Benchmark Evaluations of Text Embedding Models

Despite ongoing enhancements in text embedding technology, previous studies have not conducted a comprehensive evaluation of how state-of-the-art text embedding models perform within the context of built asset data. To fill this gap, a new study benchmarks various text embedding models, aiming to assess their effectiveness in aligning built asset information with technical concepts.

The study introduces datasets derived from two recognized built asset data classification dictionaries. The benchmark covers six datasets spanning three task types: clustering, retrieval, and reranking. The results show significant performance variability across models.

Results and Findings from Benchmarking

The benchmarking results contradict the common assumption that larger models perform better. This underscores the need for domain-specific evaluation rather than reliance on general benchmarks alone. The research also finds that model performance depends heavily on text length and type, with longer text inputs yielding better results.

Beyond these performance observations, the study identifies areas for future research, including stronger domain adaptation techniques and instruction-tuning strategies to refine model performance. Such improvements matter for asset management, which is essential to maintaining the long-term functionality of infrastructure.

The Importance of Digital Twins and Data Source Alignment

The increasing integration of digital technologies in infrastructure management calls for enriched digital twins that support real-time operations and asset management. Aligning diverse data sources with established models improves accessibility for stakeholders and enhances software interoperability.

Aligning built asset data is difficult, primarily because different disciplines use different terminologies and formats. Architects, structural engineers, and subcontractors, for example, often use different terms for the same elements, which makes manual data alignment time-consuming and error-prone. This is the case for automated alignment.

Methodology and Datasets Utilized in the Study

The study relies on text embeddings, which represent text as contextualized numeric vectors, to capture intricate terminology. Text embedding has advanced considerably with the advent of pre-trained transformer models, including BERT and GPT. The research evaluates 24 state-of-the-art text embedding models across six datasets covering a wide range of subdomains related to built asset data.
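As a rough illustration of embedding-based alignment, the sketch below ranks candidate classification concepts by cosine similarity to an asset description. It is not the study's code: the `embed` function is a toy character n-gram counter standing in for a real transformer embedding model, and the concept strings are invented for the example.

```python
import math
from collections import Counter


def embed(text: str, n: int = 3) -> Counter:
    """Toy stand-in for a text embedding model: character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(description: str, concepts: list[str]) -> list[tuple[str, float]]:
    """Rank candidate classification concepts by similarity to a description."""
    query = embed(description)
    scored = [(concept, cosine(query, embed(concept))) for concept in concepts]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical classification entries, invented for illustration only.
concepts = [
    "air handling unit",
    "structural steel beam",
    "electrical distribution board",
]
ranking = align("steel beam, structural", concepts)
print(ranking[0][0])
```

With a real embedding model, only `embed` would change; the ranking step stays the same. A surface-level n-gram representation like this one cannot match synonyms across disciplines, which is the gap contextual embeddings are meant to close.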

The datasets cover architectural, structural, mechanical, and electrical domains and together contain more than 10,000 data entries. For comparability, the benchmarking follows the Massive Text Embedding Benchmark (MTEB) framework, which provides standardized evaluation metrics.

Performance Assessment and Future Directions

The clustering tasks within this study involve grouping similar built products based on textual representation similarities, while the retrieval and reranking tasks evaluate the ability of models to identify relevant product descriptions based on user queries. The results reveal substantial performance disparities based on text input specifics, reaffirming the limited transferability of general benchmarks to specialized fields.
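The article does not say which metrics the study reports. As an assumption, the sketch below implements two measures commonly used for these task types: binary-relevance nDCG@k for a ranked retrieval result and V-measure for a clustering. The function names and sample labels are illustrative only.

```python
import math
from collections import Counter


def ndcg_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance nDCG@k for one query's ranked result list."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


def _entropy(labels: list) -> float:
    """H(labels) in nats."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(a: list, b: list) -> float:
    """H(a | b) in nats, estimated from paired label lists."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(c / n * math.log(c / b_counts[b_label])
                for (_, b_label), c in joint.items())


def v_measure(true_labels: list, predicted: list) -> float:
    """Harmonic mean of homogeneity and completeness."""
    h_true, h_pred = _entropy(true_labels), _entropy(predicted)
    homogeneity = (1.0 - _conditional_entropy(true_labels, predicted) / h_true
                   if h_true else 1.0)
    completeness = (1.0 - _conditional_entropy(predicted, true_labels) / h_pred
                    if h_pred else 1.0)
    total = homogeneity + completeness
    return 2 * homogeneity * completeness / total if total else 0.0


# Illustrative inputs: one query with a single relevant document,
# and three products grouped into two predicted clusters.
print(ndcg_at_k(["doc3", "doc1", "doc7"], relevant={"doc1"}))
print(v_measure(["door", "door", "pump"], [0, 0, 1]))
```

In a benchmark, these per-query and per-dataset scores would be averaged across queries and datasets before models are compared.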

Conclusions drawn from this research indicate that data quality and training strategies play a more crucial role in achieving effective text alignment than the size of the model itself. Future research should focus on the creation of diverse, multilingual datasets and the evaluation of domain adaptation methods in the management of built asset information. The resources, including datasets and software, are readily accessible on platforms like GitHub and Hugging Face, fostering ongoing advancements in the automated alignment of built asset data.

Author: Construction NY News
