Lexical Semantic Alignment: A Contrastive Framework for Diachronic English Lexicon Evolution Modeling
Author: Anonymous · Date: 2026-03-17
This thesis introduces a novel contrastive lexical semantic alignment framework for modeling diachronic evolution of the English lexicon, addressing a critical gap in traditional static semantic analysis, which assumes fixed word meanings across time. Framing semantic shift as a systematic, measurable transformation rather than random fluctuation, the work aligns independently trained time-period word embeddings onto a shared coordinate system using meaning-stable anchor words as a reference backbone. Built on a contrastive learning design that pulls apart representations of a word whose meaning has shifted across eras while grouping unchanged uses close together, the framework standardizes quantitative detection of subtle meaning shifts invisible to manual qualitative analysis. Validated on 19th–21st century English corpus data, it outperforms existing baseline methods by reducing alignment noise and accurately tracking gradual evolution, including clear examples like the metaphorical shift of "wire" from a physical material to a communication concept. Empirical results confirm that modern technological and societal change accelerates semantic shift, with core vocabulary remaining stable while peripheral terms adapt rapidly to new concepts. This rigorous, reproducible framework delivers major benefits for digital humanities, enabling fast, consistent analysis of massive historical text archives, and improves NLP system performance on historical-language tasks like information retrieval and sentiment analysis. It also lays an adaptable foundation for future diachronic research across languages, linking traditional linguistic theory to robust computational practice.
Chapter 1 Introduction
We open with an introduction that lays down the theoretical and practical groundwork for our thesis, “Lexical Semantic Alignment: A Contrastive Framework for Diachronic English Lexicon Evolution Modeling.” Its core focus is the dynamic, time-driven shift in word meaning, a phenomenon called semantic shift, which in computational linguistics lets us interpret historical texts, track cultural changes, and refine natural language processing systems built to operate across distinct time periods. We anchor our work on the idea that semantic changes are not random fluctuations but systematic transformations that we can model and quantify by aligning lexical representations from different time frames. This framing turns a nebulous linguistic trend into a measurable, researchable problem.
At its core, studying diachronic lexicon evolution means analyzing word usage patterns drawn from massive text corpora that span decades or even centuries. Traditional semantic analysis methods, however, often rely on static vector space models in which each word sits at a fixed coordinate in a high-dimensional space, an approach that breaks down on historical data because it assumes word meanings stay constant through time. To close this gap, we put forward a framework built on lexical semantic alignment, a process that maps vector spaces from different time periods onto a single common coordinate system. Anchor words, terms that have maintained consistent meanings across centuries, form the reliable backbone of this alignment process: we use these stable terms as a fixed base to rotate and adjust disparate semantic spaces until they are congruent with one another.
To put our alignment framework into practice, we first split the diachronic corpus into distinct temporal slices, each corresponding to a specific historical era, and for every slice we train word embeddings independently to capture the semantic nuances that define both everyday and formal language use in that time frame. After training these era-specific embeddings, we use a contrastive learning mechanism to spot and measure the divergence in word usage between a chosen source time period and a target time period, letting us flag which words have undergone major, measurable semantic shifts. We then apply orthogonal transformation matrices, derived directly from our identified anchor words, to align historical embeddings with a modern reference space, turning the once-subjective task of detecting semantic change into a precise geometric calculation of angular distance or cosine similarity between word vectors before and after alignment.
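As a concrete sketch of the anchor-based rotation step, the orthogonal map can be recovered with a Procrustes solution over the anchor vectors. The code below is a minimal illustration rather than the thesis's exact implementation: the toy data, dimensions, and the `procrustes_align` name are our own, and real era embeddings would come from independently trained models.

```python
import numpy as np

def procrustes_align(source, target, anchor_ids):
    """Rotate the source-era space onto the target-era space using
    only meaning-stable anchor words as the reference backbone."""
    A = source[anchor_ids]              # anchors in the source era
    B = target[anchor_ids]              # the same anchors in the target era
    # SVD of the cross-covariance yields the optimal orthogonal map
    # (the classic orthogonal Procrustes solution).
    U, _, Vt = np.linalg.svd(A.T @ B)
    W = U @ Vt                          # orthogonal: W @ W.T == I
    return source @ W                   # every source vector, rotated

# Toy check: make the "historical" space an exact rotation of the
# "modern" one, so a perfect alignment is recoverable.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 50))
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
tgt = src @ Q
aligned = procrustes_align(src, tgt, anchor_ids=np.arange(60))
print(np.allclose(aligned, tgt, atol=1e-6))  # True
```

Because the learned map is constrained to be orthogonal, distances and angles inside each era's space are preserved; only the global orientation changes, which is what makes cross-era cosine comparisons meaningful.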
Looking at the real-world value of this framework, we see it has huge utility for digital humanities, where automated semantic alignment lets researchers sift through massive archives of literature and historical documents with a speed and consistency no manual annotation can match, while also giving lexicographers a tool to trace term etymologies and evolution with empirical precision. In natural language processing, understanding these diachronic shifts is necessary to build better machine learning models, since standard models trained on modern data often fail when used on older texts. This framework gives NLP systems the ability to adapt to the unique temporal context of historical language use. With this adaptive capacity, systems perform far better at tasks like information retrieval and sentiment analysis across diverse historical datasets.
In the end, we argue that studying lexical evolution needs a rigorous, standardized approach, one that moves past just anecdotal stories of semantic change to more concrete, evidence-based inquiry. Our proposed contrastive framework advances this field by formalizing the alignment process, making sure every measurement of semantic drift is reproducible and rooted in statistical methods, not just subjective linguistic observation. This creates a clear, solid path for future research to explore both word change and its underlying forces. We can now examine closely not just how words shift over time but also the linguistic and cognitive pressures that drive these transformative changes.
Chapter 2 A Contrastive Lexical Semantic Alignment Framework for Diachronic English Lexicon Evolution
2.1 Theoretical Foundations of Lexical Semantic Alignment in Diachronic Linguistics
We locate the theoretical base of lexical semantic alignment in diachronic linguistics at the overlap between historical philology and modern computational modeling, a space concerned with quantifying how individual word meanings shift, slowly or rapidly, over long stretches of time. At the heart of this work sits diachronic lexical semantics, which examines the mechanisms and paths of meaning change across different historical phases of a language. Lexical semantic change is not a single event but covers many processes: sense shift, where a word picks up a new meaning and may let go of an old one, and semantic drift, where a word’s use slowly, often unnoticeably, moves away from its original semantic anchor. These core ideas give us the precise descriptive terminology needed to grasp the lexicon’s fluid nature over extended time frames. We also recognize that the link between a word’s spoken or written form and its commonly understood meaning almost never stays fixed across the decades and centuries of a language’s existence.
Traditional historical linguists have long documented these evolutionary patterns of word meaning, mostly through close qualitative analysis of surviving written evidence from different historical eras. Early work in this area catalogued the specific directions semantic change can take, marking clear patterns such as narrowing, broadening, amelioration, and pejoration. While these studies revealed much about the cognitive and socio-cultural forces driving language evolution, they remained largely descriptive, lacking concrete ways to measure how much change had happened over time or to align semantic spaces across hundreds of years. This historical perspective shows that individual words do not exist in isolation; their shifts are tied to the language’s broader lexical system. This realization, that individual words shift relative to an entire interconnected lexical system, is what makes semantic alignment a necessary component of modern computational approaches to diachronic language change.
In modern computational linguistics, the core reason we rely on semantic alignment to model cross-temporal lexicon changes is the inherent instability of lexical distributions as usage and context shift over extended periods. In distributional semantics, we represent word meanings as vectors in a high-dimensional space built from co-occurrence patterns in text. But when we compare texts from two different time periods, the statistical properties of these spaces differ substantially, because vocabulary, grammar, and topical focus all shift over time; comparing word vectors directly across eras is therefore flawed, since the underlying geometry of the semantic space has moved. Semantic alignment acts as the necessary methodological corrective by adjusting these spaces to fit a single, shared, comparable coordinate system. This adjustment ensures that the relative distances between words reflect genuine semantic shifts, not statistical noise or structural differences between historical corpora.
Connecting modern distributional semantics to traditional diachronic language theory means closing the gap between the discrete, qualitative descriptions of change that have guided historical linguists for decades and continuous, quantitative measures of meaning shift over time. Traditional diachronic theories tell us what types of changes happen in word meaning, as when a concept extends through loose metaphorical links, while distributional semantics supplies the mathematical tools to measure exactly how much meaning change has occurred over specific time frames. The alignment process turns the traditional linguistic idea that meaning depends heavily on context into concrete, usable research practice. By stabilizing word relationships over time, alignment lets us track semantic drift as a clear path through the vector space, giving us a consistent, standard way to spot and analyze diachronic language evolution. This mix of old and new theories lays a solid base for a contrastive alignment framework that stays true to how the English lexicon has changed over its long history.
2.2 Contrastive Dimension Design for Cross-Temporal Lexical Semantic Comparison
We build our cross-temporal lexical semantic comparison framework around contrastive dimension design, the core mechanism that lets us measure how word meanings evolve over time with consistent, quantifiable precision. In computational linguistics, this comes down to setting up specific vector directions and metric spaces that reliably distinguish stable semantic senses from those that have undergone measurable shift. The framework draws on contrastive learning, a method well suited to capturing cross-time semantic differences because it prioritizes relative distances and structural links between word representations across periods over fixed coordinate positioning, which fixes notable flaws in older static embedding methods. Contrastive learning pushes our models to move representations of the same word closer together when its core meaning stays steady over time, while pulling those representations apart when the word’s meaning shifts in significant, measurable ways. This setup works especially well for modeling long-term semantic evolution because it naturally adapts to the messy, non-linear reality of word meaning change, where shifts often follow irregular patterns rather than simple trajectories or uniform rates of change.
To put this framework into real-world practice, we follow a strict, structured process to build time-aware contrastive dimensions, centered entirely on creating targeted sets of positive and negative sample pairs for model training; positive pairs draw from the same word used in different periods where its core meaning stays consistent or contextual usage shows clear, unbroken semantic continuity. For instance, the word “star” used to discuss astronomical objects in the 19th and 20th centuries forms a positive pair, guiding our model to map these two time-specific uses close together in the embedding space. Negative pairs follow the exact opposite logic, pairing the same word across different periods where its meaning has shifted sharply or taken on entirely new connotations. By treating these shifted uses as negative samples, we train the framework to spot and separate distinct semantic groups, marking cases where words have generalized, specialized, or taken on negative tones.
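The pairing logic above can be sketched as an InfoNCE-style loss over one training word. This is an illustration under our own assumptions: the toy vectors stand in for era-specific embeddings of “star”, and the `temperature` value is arbitrary rather than a tuned hyperparameter from the thesis.

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one word: low when the anchor sits near
    its positive (semantically continuous) cross-era pair and far from
    the negative (semantically shifted) samples."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(anchor, positive) / temperature)
    neg = sum(np.exp(cos(anchor, n) / temperature) for n in negatives)
    return -np.log(pos / (pos + neg))

star_1800s = np.array([1.0, 0.1, 0.0])    # "star": astronomical sense
star_1900s = np.array([0.9, 0.2, 0.1])    # same sense, later era
shifted    = [np.array([0.0, 1.0, 0.3])]  # a drifted, unrelated use
stable_loss  = contrastive_loss(star_1800s, star_1900s, shifted)
swapped_loss = contrastive_loss(star_1800s, shifted[0], [star_1900s])
print(stable_loss < swapped_loss)  # True: semantic continuity is rewarded
```

Training on many such pairs shapes the embedding space so that stable senses cluster across eras while shifted senses separate, which is exactly the geometry the contrastive dimensions are meant to encode.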
These carefully designed contrastive dimensions work because they can capture the many distinct ways word meanings shift over time; a dimension tuned to spot specialization, for example, will push representations of a word’s old, broad general use far away from its new, narrow, technical application in the embedding space. A separate dimension, built to detect shifts toward negative connotations, will recognize when a word’s inherent tone darkens and move those time-specific representations apart to reflect the semantic change. This targeted, change-focused approach goes far beyond basic word co-occurrence statistics used in older models.
This framework offers real, measurable value for diachronic linguistic research because it gives us a standardized, repeatable way to automatically find and categorize semantic shifts at scales we could never reach with slow, labor-intensive manual qualitative analysis alone. Our design makes the model resistant to noise common in historical text corpora, while keeping it sensitive to the small, subtle shifts that shape the long-term history of the English language. By tying model learning to these contrastive dimensions, we get a clear, accurate map of word meaning change across different time periods.
2.3 Construction of the Diachronic English Lexicon Evolution Modeling Framework
We construct the Diachronic English Lexicon Evolution Modeling Framework as a systematic architectural effort that uses computational linguistics to capture language’s dynamic nature, with its core focus on solving the basic challenge of aligning lexical semantics across time periods so words can be compared accurately even as their meanings shift over centuries. The model’s operational logic begins by splitting large historical text corpora into discrete, chronologically ordered intervals, such as decades or centuries, a stratification that lets the system treat language as a fluid continuum rather than a static entity. This setup paves the way for detailed, granular analysis of word meaning evolution across different historical eras.
The first step of putting the framework into practice involves training lexical embeddings independently for each specific time slice, using context-aware algorithms to generate dense vector representations that capture the syntactic and semantic nuances of vocabulary as it existed within that exact historical window, laying down the raw semantic data needed for each studied era. But vector spaces trained on different corpora are naturally misaligned, due to random neural network initialization and the unique linguistic distribution patterns of each time period, making direct comparison of these vectors yield mathematically invalid results. To bridge this persistent alignment gap, the framework integrates a specialized contrastive lexical semantic alignment module.
The framework’s core algorithmic innovation lies in this contrastive alignment mechanism, which avoids traditional mapping methods that force all words into a rigid, identical structure and instead operates to preserve semantic stability while allowing measurable deviation, starting with identifying a subset of anchor words shown to have unchanging meanings across selected time periods as reference points for the alignment process. The model then learns a transformation function that maps vector spaces from different time slices into a single, shared high-dimensional semantic coordinate system, guided by a contrastive objective that minimizes distance between invariant anchor word embeddings across time while maximizing separation of unrelated words. This results in a unified semantic space that balances temporal continuity for stable words with adaptive drift for others.
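A minimal numerical sketch of this objective follows, under our own toy assumptions (random 8-dimensional "era" embeddings, ten anchors, a hinge margin, and plain gradient descent; the thesis's actual optimizer and hyperparameters are not specified here). The attractive term pulls anchors toward their cross-era twins, while the hinge term pushes them away from randomly drawn unrelated words.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_words, n_anchor = 8, 30, 10
src = rng.normal(size=(n_words, dim))   # era-A embeddings (toy data)
tgt = rng.normal(size=(n_words, dim))   # era-B embeddings (toy data)
anchors = np.arange(n_anchor)           # meaning-stable reference words

W = np.eye(dim)                         # learned cross-era transformation
lr, margin = 0.05, 4.0
losses = []
for step in range(300):
    mapped = src[anchors] @ W
    diff_pos = mapped - tgt[anchors]    # attract: anchor toward its twin
    neg = rng.integers(n_anchor, n_words, size=n_anchor)
    diff_neg = mapped - tgt[neg]        # repel: anchor vs. unrelated word
    sq_neg = np.sum(diff_neg**2, axis=1)
    viol = sq_neg < margin              # hinge active only inside margin
    losses.append(np.sum(diff_pos**2)
                  + np.sum(np.maximum(0.0, margin - sq_neg)))
    grad = 2 * src[anchors].T @ diff_pos
    grad -= 2 * src[anchors][viol].T @ diff_neg[viol]
    W -= lr * grad / n_anchor
print(losses[0] > losses[-1])  # the contrastive objective decreases
```

In practice W would often be constrained (for example, kept orthogonal) and trained with mini-batches, but the two opposing force terms sketched here are the essence of the contrastive alignment objective.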
Once embeddings are projected into this shared space, the framework moves to quantifying semantic change, translating geometric relationships within the aligned vector space into concrete, measurable metrics of lexical evolution by calculating cosine distance or similar similarity scores between a target word’s embedding in an earlier time slice and its counterpart in a later one. A small distance between embeddings signals high semantic stability for the word, while a large, noticeable divergence indicates a substantial shift in how the word was used and understood. The framework also analyzes shift direction relative to other semantic clusters to spot narrowing, broadening, or radical change.
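The distance computation in this quantification step reduces to a few lines. The sketch below assumes two dictionaries of already-aligned vectors; the names `emb_early`/`emb_late` and the toy vectors are ours, for illustration only.

```python
import numpy as np

def semantic_change(word, emb_early, emb_late):
    """Cosine distance between a word's embeddings in two eras,
    assuming both live in the shared aligned space: values near 0 mean
    a stable meaning, values toward 2 mean a substantial shift."""
    a, b = emb_early[word], emb_late[word]
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos

emb_early = {"water": np.array([1.0, 0.0]), "wire": np.array([1.0, 0.0])}
emb_late  = {"water": np.array([0.98, 0.05]),  # stable core word
             "wire":  np.array([0.1, 1.0])}    # shifted peripheral word
print(semantic_change("water", emb_early, emb_late))  # near 0: stable
print(semantic_change("wire", emb_early, emb_late))   # near 1: shifted
```

Ranking all words by this score surfaces the strongest shift candidates automatically; the direction of the movement relative to neighboring clusters then distinguishes narrowing, broadening, and radical change.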
This framework holds significant practical value for both computational linguistics and digital humanities, as it standardizes procedures for diachronic analysis to move beyond casual anecdotal evidence of language change and provide a strict, quantitative methodology that lets researchers detect subtle lexical evolution patterns invisible to the human eye. These patterns offer deeper insights into the cultural and cognitive shifts that drive long-term language change, while the ability to model dynamics in a continuous, aligned space supports development of more sophisticated historical language processing tools. This enhances our collective ability to interpret and preserve linguistic heritage stored in vast textual archives.
2.4 Empirical Validation of the Framework Using 19th–21st Century English Corpus Data
To put our proposed contrastive lexical semantic alignment framework through thorough empirical validation, we conduct strict, systematic examinations using comprehensive English corpus data that spans the entire 19th, 20th, and 21st centuries. All the base data we use for this validation work comes from large, long-term text archives built specifically to capture the full scope of written English across those three distinct historical eras, and we carefully split the entire corpus into back-to-back, non-overlapping time chunks, each lasting a single decade or a full 25 years, to create a detailed, high-resolution timeline of how lexical use shifts over time. This structured slicing lets us track even small, gradual shifts in word meaning over extended periods of time. We start our entire preprocessing workflow with strict, methodical data cleaning, which carefully removes non-text elements, fixes inconsistent character coding, and cuts out content that adds no useful analytical information. After completing every step of the data cleaning process, we break the polished, consistent text into separate, discrete lexical units through a standard tokenization process, then use lemmatization to standardize different inflected forms of the same word, grouping all grammatical variations under a single lemma label so we can accurately pick up on semantic shifts without being distracted by unrelated syntactic changes.
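The cleaning, tokenization, and lemmatization steps above can be sketched as a small pipeline. This is a deliberately simplified stand-in: a real corpus would need encoding repair and a full lemmatizer (for example, from spaCy or NLTK), and the tiny lemma table here is illustrative only.

```python
import re

# Toy lemma table standing in for a real lemmatizer; the mappings
# below are illustrative, not a complete morphology of English.
LEMMAS = {"wires": "wire", "wired": "wire", "stars": "star"}

def preprocess(raw_text):
    """Clean, tokenize, and lemmatize one document from a time slice."""
    text = re.sub(r"<[^>]+>", " ", raw_text)         # strip markup remnants
    text = re.sub(r"[^a-z\s]", " ", text.lower())    # drop non-letter noise
    tokens = text.split()                            # simple tokenization
    return [LEMMAS.get(t, t) for t in tokens]        # group inflected forms

print(preprocess("The <b>wires</b> hummed; stars wired the night."))
# ['the', 'wire', 'hummed', 'star', 'wire', 'the', 'night']
```

Collapsing inflected forms under one lemma is what keeps the later distance measurements focused on semantic shift rather than incidental grammatical variation.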
We carefully design our experimental setup to test how well the framework models long-term semantic change by pitting it against well-established baseline methods used in linguistic research. These baselines include standard static embedding alignment tools like the commonly used Procrustes analysis, as well as time-focused referencing models such as the widely accepted Alignment-Based Framework, while our framework uses a contrastive learning goal that treats words from directly adjacent time chunks as matching pairs and random words from unrelated contexts as non-matching pairs to refine the overall semantic embedding space. This targeted pairing strategy helps us fine-tune the framework to spot meaningful semantic shifts over time. We carefully select specific evaluation metrics to measure two key factors: the extent to which word meanings drift over time, and how accurately we can predict their long-term evolutionary paths. For common, widely used lexical items, we focus on calculating the cosine distance between embedding vectors taken from different time chunks to spot subtle changes in how closely related their meanings are, and we also isolate newly coined neologisms to study how quickly they integrate into the lexicon and how their meanings settle into stable, fixed patterns over time.
Our quantitative results clearly show that the contrastive framework works significantly better than existing methods at picking up on small, subtle semantic changes in lexical items over extended periods. When put head-to-head against other established methods in controlled comparative tests, our model picks up on slow, gradual meaning shifts far more easily, cutting down on the background noise that comes from alignment mistakes in older, less precise approaches, and it does a far better job of tracking the unique semantic paths of polysemous words, which traditional methods often muddle together by mixing up their distinct senses. This cuts down on confusion that would otherwise hide true semantic evolution patterns over time. When it comes to newly coined neologisms, the model effectively tracks how quickly their meanings settle into clear, fixed definitions from their initial vague contextual associations. We run targeted case studies on common nouns and verbs taken from fast-changing industrial and technological sectors, and these tests show the contrastive approach maintains strong, consistent semantic continuity, correctly telling apart true, long-term semantic evolution from temporary contextual blips caused by random noise in the underlying corpus data.
More in-depth analysis through targeted, specific case studies shows that the framework has real practical use for studying how the English lexicon evolves over long historical periods. Take the common term “wire” for example; the model tracks its semantic path with remarkable precision, from referring strictly to a physical metal material in the 19th century to standing for metaphorical communication concepts in the 20th and 21st centuries, while it also captures how new words like “online” grow in meaning and expand their functional uses across different contexts over time. These concrete examples clearly show the framework’s ability to capture real-world lexical shifts over time. Our empirical findings from this comprehensive validation process reveal broad, overarching patterns of how the English lexicon evolved from the 19th to the 21st century. We can see a clear, distinct trend of faster semantic change that lines up directly with rapid technological and societal advances in the modern era, and the framework shows that while core vocabulary stays relatively stable over time, less common peripheral words change often with new terms appearing and senses shifting quickly, pointing to an adaptive expansion where new semantic spaces open up to fit emerging ideas without breaking the core linguistic system’s structure. Our successful validation shows that the framework works effectively as a tool for computational historical linguistics, offering a solid, consistent way to measure the dynamic changes that shape language over time.
Chapter 3 Conclusion
This research’s conclusion synthesizes theoretical insights and methodological advances from our investigation of English lexicon evolution over time, framed through lexical semantic alignment. It anchors its core claim in the idea that language change follows structured shifts in semantic spaces rather than random, unpatterned movement: by modeling vocabulary evolution as a path through a high-dimensional vector space, the work shows we can measure the small, gradual meaning shifts words undergo over successive decades of use.
It moves past traditional lexicography, which often relies on hands-on, qualitative analysis of word meanings, by offering a computational framework that captures the ever-shifting, dynamic nature of semantic drift in language over extended periods.
The whole project centers on using contrastive learning to align historical and contemporary word embeddings.
This method effectively closes the temporal divide between different linguistic states, keeping intact the semantic ties hidden in raw data even as orthographic and syntactic variations crop up naturally over years of language use; by lining up these separate vector spaces, the framework lets us compare precisely how concepts were understood in different eras, uncovering both underlying continuity and sudden breaks in the lexicon.
The step-by-step process laid out in this study includes strict data gathering, preprocessing, and model training, all designed to validate the semantic alignment’s results; first, large diachronic corpora are tokenized and standardized to create a representative sample of English use across distinct time periods, ensuring each slice reflects the linguistic norms of its specific decade or century, then word embeddings are trained separately for each slice to capture that era’s unique semantic nuances.
The key implementation step uses a contrastive alignment framework, which frames the alignment task as an optimization problem where the model learns to group semantically matching words from different eras closer together while pushing unrelated ones apart.
This requires careful adjustment of hyperparameters, like temperature coefficients and negative sample set sizes, to capture fine semantic details without overfitting to random noise.
The aligned embeddings that come out of this rigorous, multi-step process form the solid base for analyzing semantic change, giving researchers the precise tools to calculate measures like shifts in cosine similarity to spot words that have undergone major, noticeable changes in meaning over extended time periods.
Looking at this work’s real-world value reveals great potential across multiple academic and technical fields. In computational linguistics, it provides a set of standardized tools for historical linguistics, letting researchers automatically detect semantic shifts at a scale that manual, labor-intensive analysis could never reach.
This ability is especially useful for digital humanities scholars who want to study cultural and intellectual history through the slow, steady changes in language use over decades and centuries.
It also has direct, practical uses for improving modern natural language processing systems, especially when handling archival data and adapting models to new domains.
By learning how word meanings shift gradually and consistently over time, models can better interpret old, unadapted historical texts, leading to more accurate information retrieval and sentiment analysis across different time frames; it also adds to broader theoretical ideas about semantic stability, showing which factors—like how often a word is used or its part of speech—most affect how fast a word’s meaning changes.
The approach’s proven robustness means it can be tweaked and adapted to work for other languages and language families, creating a universal way to model how languages change over extended periods.
This research sets a new standard for putting diachronic linguistics into practice, linking abstract semantic theory to real, usable computational tools that scholars and engineers can build on for future studies.
