Cross-Modal Alignment: A Contrastive Framework for Idiom Sense Disambiguation in Code-Switching Texts

Chapter 1 Introduction

Cross-modal alignment represents a sophisticated computational paradigm designed to bridge heterogeneous data domains, enabling the synthesis of information from disparate modalities such as textual and acoustic inputs. At its core, this concept rests on the principle of projecting features from different sensory modalities into a shared, high-dimensional latent space where semantic proximity can be rigorously measured and compared. The fundamental definition of this process involves the mathematical mapping of distinct vector representations, ensuring that concepts expressed through different channels—for instance, a written sentence and its spoken counterpart—occupy similar coordinate positions within the unified embedding space. This alignment is not merely a concatenation of data but a deep structural harmonization that preserves the intrinsic semantic relationships while filtering out modality-specific noise. In the context of computational linguistics, the core principle relies on the assumption that although the surface forms of text and speech differ vastly, the underlying cognitive intent and semantic meaning remain constant. Therefore, the operational objective is to train neural networks to recognize this invariance, effectively ignoring the superficial discrepancies in signal structure to focus on the shared informational content.

The operational pathway for achieving cross-modal alignment typically involves a contrastive learning framework, which functions by treating paired and unpaired samples differently to refine the model’s discriminative capabilities. In a standard implementation, the system processes a batch of multimodal inputs, such as code-switching utterances and their corresponding textual transcriptions, through modality-specific encoders. These encoders transform raw data into feature vectors, which are then projected into the shared latent space. The critical mechanism involves the calculation of similarity metrics, often utilizing cosine similarity, between every possible pair of text and audio embeddings in the current batch. The objective function is constructed to maximize the similarity score for positive pairs—where the text and audio originate from the same semantic instance—while simultaneously minimizing the similarity for negative pairs—unrelated segments that serve as distractors. Through this iterative process of error minimization, the model learns to draw matching modalities closer together and push non-matching ones apart, effectively constructing a robust semantic map where the modality barrier is dissolved.
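The pairwise similarity step described above can be sketched as follows. This is a minimal illustration rather than the framework's actual implementation: the function names, the two-dimensional toy embeddings, and the convention that row i of each list comes from the same instance are all assumptions made for the example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def similarity_matrix(text_embs, other_embs):
    """Return S where S[i][j] is the cosine similarity of text i and paired item j."""
    t = [l2_normalize(v) for v in text_embs]
    o = [l2_normalize(v) for v in other_embs]
    return [[sum(a * b for a, b in zip(ti, oj)) for oj in o] for ti in t]

# Toy batch (hypothetical values): row i of each list is assumed to be a matched pair.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
audio_embs = [[0.9, 0.1], [0.2, 0.8]]
S = similarity_matrix(text_embs, audio_embs)
```

Under this convention the diagonal of S holds the positive pairs and every off-diagonal entry is an in-batch negative, which is the quantity the objective function then raises or lowers.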

The practical application value of this technology becomes particularly evident when addressing the complexities of idiom sense disambiguation within code-switching texts, a task that challenges traditional natural language processing methods. Code-switching, the alternating use of two or more languages within a single discourse, introduces high variability and syntactic irregularity that often confuse monolingual models. Furthermore, idioms possess a figurative meaning that cannot be deduced from the literal definitions of their constituent parts, adding another layer of semantic ambiguity. Cross-modal alignment addresses these challenges by incorporating acoustic features, such as intonation, rhythm, and stress patterns, which often carry crucial paralinguistic cues that signal irony, emphasis, or metaphorical intent. By aligning the textual transcript with the corresponding audio signal, the framework can leverage these prosodic features to resolve ambiguities that are insoluble when relying on text alone. For example, the specific pitch contour or pause pattern used when uttering an idiom can distinguish between a literal interpretation and a figurative one. Consequently, this approach significantly enhances the accuracy of sense disambiguation systems, providing a more nuanced understanding of multilingual communication. This advancement is vital for the development of robust downstream applications, including real-time translation services, cross-cultural sentiment analysis tools, and intelligent dialogue systems that must operate effectively in linguistically diverse environments. The integration of cross-modal alignment thus marks a pivotal step toward more resilient and human-like artificial intelligence, capable of navigating the intricate subtleties of natural language.

Chapter 2 A Contrastive Cross-Modal Alignment Framework for Idiom Sense Disambiguation in Code-Switching Texts

2.1 Cross-Modal Data Construction and Preprocessing for Code-Switching Idiom Contexts

The construction of a robust cross-modal dataset is the foundational step in developing a contrastive framework for idiom sense disambiguation in code-switching texts. This process involves not only aggregating linguistic data characterized by language alternation but also aligning those textual segments with non-linguistic modalities that capture the semantic nuances of idioms. Data acquisition therefore targets openly accessible corpora and previously constructed datasets known to contain frequent code-switching, with the objective of isolating text sequences in which target idioms appear in mixed-language environments. Following extraction, the next phase retrieves or generates the corresponding visual or auxiliary modal information. This multimodal pairing is designed to distinguish between the literal and figurative interpretations of the idioms. For instance, a visual sample aligned with the figurative sense depicts the pragmatic meaning or metaphorical context of the idiom, whereas a counterpart aligned with the literal sense illustrates the word-for-word physical scene. This deliberate juxtaposition gives the model the contrastive signals necessary for accurate disambiguation.

Once the raw data is collected, a preprocessing pipeline is applied to standardize the inputs and enhance model performance. The initial stage addresses text normalization for code-switching sequences. Given the informal register often associated with code-switching, this involves cleaning the text by removing special characters, correcting orthographic inconsistencies, and tokenizing the mixed-language utterances according to the tokenization conventions of each language involved. In parallel, the non-text modal data undergoes its own processing. Visual data, such as images or video frames, are resized to uniform dimensions, normalized to standard pixel value ranges, and often augmented to improve model generalization. Auxiliary data, if present, is converted into vectorized formats suitable for neural network processing.
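A minimal sketch of the cleaning and pixel-scaling steps is given below. It assumes that "special characters" means anything other than letters, digits, whitespace, apostrophes, and hyphens, and that pixel intensities are 8-bit integers. Both function names are hypothetical, and language-specific tokenization, resizing, and augmentation are deliberately left out.

```python
import re
import unicodedata

def normalize_code_switched(text):
    """Apply Unicode NFKC normalization, drop stray symbols, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Keep word characters (letters and digits in any script), whitespace,
    # apostrophes, and hyphens; replace everything else with a space.
    text = re.sub(r"[^\w\s'\-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def scale_pixels(pixels, max_value=255):
    """Map raw integer pixel intensities into the range [0, 1]."""
    return [p / max_value for p in pixels]

# Hypothetical English-Spanish utterance with noisy punctuation.
cleaned = normalize_code_switched("She really  spilled the beans!!! no manches?? #chisme")
scaled = scale_pixels([0, 255])
```

Because `\w` is Unicode-aware in Python 3, the same pattern keeps characters from non-Latin scripts, which matters when the two languages do not share an alphabet.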

Parallel to structural normalization, the dataset requires precise semantic labeling. Annotators review each data sample to assign idiom sense labels, marking whether the instance conveys a literal or figurative meaning within the specific code-switching context. This ground truth is essential for supervising the contrastive learning objective. After annotation, the dataset is partitioned into distinct subsets for training, validation, and testing. This division is typically performed using a stratified sampling strategy so that the distribution of idiom senses and language switching patterns remains consistent across all subsets. Keeping the subsets disjoint prevents data leakage, and the matched distributions support a reliable evaluation of the model's generalization capabilities.
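The stratified partition can be illustrated with the sketch below. The 80/10/10 ratios, the seed, the field names `sense` and `switch`, and the toy samples are assumptions for the example; the source does not state the actual split sizes.

```python
import random
from collections import defaultdict

def stratified_split(samples, key, ratios=(0.8, 0.1, 0.1), seed=13):
    """Split samples into train/val/test while keeping each stratum's share similar."""
    groups = defaultdict(list)
    for s in samples:
        groups[key(s)].append(s)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for stratum in sorted(groups):
        items = groups[stratum]
        rng.shuffle(items)
        n_train = int(len(items) * ratios[0])
        n_val = int(len(items) * ratios[1])
        train += items[:n_train]
        val += items[n_train:n_train + n_val]
        test += items[n_train + n_val:]  # remainder goes to the test subset
    return train, val, test

# Hypothetical dataset: 20 figurative intra-sentential and 10 literal inter-sentential samples.
samples = [{"id": i, "sense": "figurative", "switch": "intra"} for i in range(20)]
samples += [{"id": 20 + i, "sense": "literal", "switch": "inter"} for i in range(10)]
train, val, test = stratified_split(samples, key=lambda s: (s["sense"], s["switch"]))
```

Stratifying on the pair (sense, switch type) keeps both distributions aligned at once. Each sample is assigned to exactly one subset, so the three subsets are disjoint by construction.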

Analyzing the statistical characteristics of the constructed dataset is vital to validate its soundness and representativeness. An examination of the data distribution reveals the balance between literal and figurative samples, as well as the frequency and variety of code-switching points, such as intra-sentential or inter-sentential switches. Descriptive statistics on sentence length, vocabulary richness, and the diversity of visual contexts indicate how well the dataset covers the complexities of real-world usage. By showing that the dataset spans a wide range of linguistic phenomena and visual contexts, this analysis supports its use for training a contrastive cross-modal alignment framework and, in turn, the reliability and applicability of the disambiguation model.
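The descriptive statistics mentioned above could be computed along the following lines. This is a sketch under assumptions: whitespace tokenization stands in for a proper tokenizer, type-token ratio is used as one possible measure of vocabulary richness, and the field names and example sentences are hypothetical.

```python
from collections import Counter

def describe(samples):
    """Summarize label balance, switch types, mean length, and type-token ratio."""
    tokens = [tok for s in samples for tok in s["text"].lower().split()]
    return {
        "sense_counts": Counter(s["sense"] for s in samples),
        "switch_counts": Counter(s["switch"] for s in samples),
        "mean_length": len(tokens) / len(samples),
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }

# Two hypothetical annotated samples.
samples = [
    {"text": "he kicked the bucket ayer", "sense": "figurative", "switch": "intra"},
    {"text": "ella kicked the bucket by accident", "sense": "literal", "switch": "intra"},
]
stats = describe(samples)
```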

2.2 Contrastive Alignment Mechanism for Multimodal Idiom Sense Representation Learning

The contrastive alignment mechanism for multimodal idiom sense representation learning serves as the foundational architecture designed to resolve semantic ambiguity within code-switching texts by bridging the gap between linguistic and non-linguistic data. This mechanism begins by establishing a robust encoding process for both the text modality, characterized by the complex code-switching context, and the auxiliary non-text modality, which may include visual or auditory features. In the text encoding phase, the system processes the mixed-language input to capture the contextual nuances surrounding the target idiom, generating a preliminary textual representation that reflects the syntactic and semantic properties of the code-switched environment. Simultaneously, the auxiliary modality undergoes a separate encoding procedure where relevant features are extracted and mapped into a high-dimensional vector space. This dual encoding strategy ensures that each idiom sense is initially represented by distinct vectors originating from different modal sources, preserving the unique information inherent to each data type while preparing them for subsequent integration.

Following the initial encoding, the core of the mechanism lies in the sophisticated design of the contrastive alignment framework, which operates on the principle of maximizing agreement between semantically related instances while minimizing agreement between unrelated ones. To achieve this, the framework constructs positive sample pairs by matching the textual representation of a specific idiom sense with its corresponding representation from the non-text modality, thereby creating pairs that reflect the same underlying meaning despite their different surface forms. Conversely, negative sample pairs are generated by pairing the textual representation of one idiom sense with the non-text representation of a different sense, or by shuffling associations within the batch to create mismatched combinations. This rigorous construction of positive and negative pairs provides the necessary supervision signal for the model to learn the subtle boundaries between different idiomatic meanings.
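One way to express the pair construction is a boolean mask over the batch, as sketched below. The label format (idiom, sense) and the function name are assumptions made for illustration; the source does not specify how senses are encoded.

```python
def pair_mask(text_senses, other_senses):
    """mask[i][j] is True when text i and non-text item j carry the same sense label."""
    return [[t == o for o in other_senses] for t in text_senses]

# Hypothetical batch: each label is an (idiom, sense) tuple.
text_senses = [("spill the beans", "figurative"), ("spill the beans", "literal")]
image_senses = [("spill the beans", "figurative"), ("spill the beans", "literal")]
mask = pair_mask(text_senses, image_senses)

# Index pairs (text index, non-text index) for each group.
positives = [(i, j) for i, row in enumerate(mask) for j, same in enumerate(row) if same]
negatives = [(i, j) for i, row in enumerate(mask) for j, same in enumerate(row) if not same]
```

Here the negatives pair the same idiom with the opposite sense, which corresponds to the harder of the two negative-sampling strategies described above. Shuffling within a larger batch would add easier negatives drawn from unrelated idioms.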

The driving force behind this alignment process is the contrastive loss function, which penalizes the model when representations of positive pairs lie far apart in the vector space or when representations of negative pairs lie close together. By employing a loss function such as InfoNCE or a similar variant, the framework pulls the embeddings of identical idiom senses closer together, minimizing the distance between modalities for the same concept. Simultaneously, the loss pushes the embeddings of different idiom senses further apart, widening the separation between distinct semantic categories. This optimization process forces the model to focus on the intrinsic semantic content of the idiom rather than on superficial modality-specific features, thereby reducing the heterogeneity that typically exists between text and non-text data.
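A minimal sketch of an InfoNCE-style loss over a batch similarity matrix follows. It assumes that entry sim[i][i] is the positive pair for row i and that all other entries in the row are negatives. The temperature default of 0.07 is an illustrative value, not one reported in this work, and only the text-to-other direction is shown.

```python
import math

def info_nce(sim, temperature=0.07):
    """Mean InfoNCE loss over the rows of a square similarity matrix.

    For each row i the loss is -log softmax(sim[i] / temperature)[i],
    i.e. the cross-entropy of picking the positive among all candidates.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(sim)

# Hypothetical similarity matrices for a batch of two pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]      # positives score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # negatives score highest
loss_aligned = info_nce(aligned)
loss_misaligned = info_nce(misaligned)
```

The loss is small when each positive dominates its row and large otherwise. A symmetric variant would average this value with the same computation applied to the transposed matrix.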

Through this iterative training procedure, the contrastive alignment mechanism significantly optimizes the multimodal representation of each idiom sense. The resulting representations are not only modality-invariant, meaning they robustly capture the semantic essence regardless of the input source, but also highly discriminative, ensuring that different senses of the same idiom are clearly distinguishable. This capability is particularly crucial in code-switching contexts where linguistic ambiguity is heightened by the mixture of languages and cultural references. By aligning the cross-modal information, the system effectively eliminates modal heterogeneity and enhances the overall distinguishability between different idiom senses, leading to a substantial improvement in the accuracy and reliability of sense disambiguation tasks in complex multilingual scenarios.

2.3 Idiom Sense Disambiguation Module Integrated with Cross-Modal Contrastive Signals

The idiom sense disambiguation module integrated with cross-modal contrastive signals functions as the core decision-making component within the proposed framework, designed to resolve semantic ambiguity by synthesizing textual context with aligned visual information. In the context of code-switching texts, where linguistic nuances are often compounded by the alternation between languages, relying solely on textual encoders risks missing the subtle figurative cues that define the correct usage of an idiom. Consequently, the operational procedure of this module begins with the fusion of aligned cross-modal representations and the context encoding of the target idiom. The system takes the encoded textual vector of the input sentence, specifically focusing on the hidden states corresponding to the target idiom, and combines it with the visual features that have been aligned through the contrastive learning phase. This fusion is not a mere concatenation but a sophisticated interaction mechanism, often implemented via gated fusion or attention-based weighting, which allows the model to dynamically determine the relevance of the visual signal to the current textual context. By integrating these aligned multimodal features, the module generates an enhanced contextual representation that encapsulates both the syntactic structure of the code-switched sentence and the semantic grounding provided by the visual modality.
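The gated fusion option mentioned above can be sketched as a per-dimension convex blend of the two vectors. The parameter shapes, the function names, and the toy values are assumptions for illustration. In a trained model the gate weights and bias would be learned, and the attention-based alternative is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(text_vec, visual_vec, gate_weights, gate_bias):
    """Blend two equal-length vectors with a per-dimension gate.

    gate_weights[k] has length 2 * d and scores the concatenation [text; visual];
    the gate g_k in (0, 1) decides how much of the text feature to keep.
    """
    concat = text_vec + visual_vec
    fused = []
    for k, (t, v) in enumerate(zip(text_vec, visual_vec)):
        g = sigmoid(sum(w * x for w, x in zip(gate_weights[k], concat)) + gate_bias[k])
        fused.append(g * t + (1.0 - g) * v)
    return fused

# Hypothetical 2-dimensional features with an untrained (all-zero) gate.
text_vec = [1.0, 0.0]
visual_vec = [0.0, 1.0]
zero_weights = [[0.0] * 4, [0.0] * 4]
fused_even = gated_fusion(text_vec, visual_vec, zero_weights, [0.0, 0.0])
fused_text = gated_fusion(text_vec, visual_vec, zero_weights, [50.0, 50.0])
```

With a zero gate the output is the plain average of the two inputs. A strongly positive bias saturates the gate and returns the text vector almost unchanged, which is how such a module can discount an uninformative image.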

Once the enhanced contextual representation is constructed, the module proceeds to map this fused vector into the candidate idiom sense space for final classification. This phase involves a predefined repository of candidate sense definitions, where each distinct meaning of the target idiom is represented as a dense vector embedding, typically derived from a gloss dictionary or a semantic knowledge base. The disambiguation process essentially transforms into a matching task, where the system calculates a compatibility score between the fused contextual representation and each candidate sense representation. Mathematically, this is achieved by computing the dot product or cosine similarity between the enhanced text-visual vector and the vector of each candidate sense. The candidate sense that yields the highest matching score is identified as the most semantically compatible interpretation, and is subsequently selected as the final disambiguation result. This operation ensures that the prediction is not merely a guess based on local collocations, but a reasoned inference backed by the holistic understanding of the scene depicted in the associated image.
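The matching step reduces to scoring the fused vector against each candidate sense and taking the highest score. The sketch below uses cosine similarity. The sense names and two-dimensional embeddings are hypothetical stand-ins for gloss-derived vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pick_sense(fused_vec, sense_embeddings):
    """Return (best_sense, scores), where scores maps each candidate to its similarity."""
    scores = {name: cosine(fused_vec, emb) for name, emb in sense_embeddings.items()}
    return max(scores, key=scores.get), scores

# Hypothetical candidate senses and a fused text-visual representation.
sense_embeddings = {"literal": [1.0, 0.0], "figurative": [0.0, 1.0]}
fused = [0.2, 0.9]
best, scores = pick_sense(fused, sense_embeddings)
```

Cosine similarity ignores vector length, so it is the safer choice when the fused vector and the sense embeddings come from encoders with different output scales. A dot product would additionally reward larger norms.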

The logical connection between the cross-modal contrastive signals and the final disambiguation task is fundamental to the system’s ability to handle the complexity of code-switching idioms. The contrastive signals serve as a supervisory guide during the training phase, forcing the visual and textual encoders to align their feature spaces so that the image content and the correct idiom sense are brought closer together. This alignment is crucial because it mitigates the noise introduced by code-switching, where the grammatical structure might be inconsistent or fragmented. When the aligned visual information is injected into the disambiguation module, it acts as an external anchor that stabilizes the semantic interpretation of the text. For instance, if the textual context is ambiguous due to the mixing of languages, the visual modality provides concrete evidence of the physical entities or actions involved, thereby narrowing down the possible senses of the idiom. The integration of these signals ensures that the disambiguation model leverages complementary information from both modalities, resulting in a more robust and accurate classification performance. Ultimately, this approach transforms the disambiguation process from a purely linguistic problem into a multimodal reasoning task, significantly enhancing the model’s practical applicability in real-world scenarios involving diverse and mixed-language inputs.

2.4 Experimental Evaluation and Comparative Analysis on Code-Switching Idiom Datasets

The experimental evaluation and comparative analysis validate the effectiveness of the proposed cross-modal alignment framework in resolving idiom ambiguity within code-switching texts. To support the reliability and reproducibility of the experimental outcomes, a controlled testing environment is established on high-performance computing hardware suited to the training requirements of deep learning models. The hyperparameter configuration is tuned through a grid search in which the learning rate, batch size, and hidden layer dimensions are varied to identify the most effective model settings. Specific attention is directed toward the temperature parameter within the contrastive loss function, because it scales the similarity scores before the softmax: a lower temperature sharpens the distribution and penalizes hard negative samples more strongly, which in turn changes the magnitude of the gradients. The evaluation relies on standard metrics widely adopted in natural language processing, specifically Accuracy and F1-score, which together provide a quantitative measure of the model’s ability to correctly predict idiomatic senses across varying class distributions.
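The two metrics can be computed as in the sketch below. The source does not say which F1 averaging is used, so macro averaging is assumed here because it weights rare senses equally. The label strings and predictions are invented for the example and are not experimental results.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1, so rare senses count as much as common ones."""
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical gold labels and predictions for four instances.
gold = ["figurative", "figurative", "figurative", "literal"]
pred = ["figurative", "figurative", "literal", "literal"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
```

In this toy case the figurative class has an F1 of 0.8 and the literal class 2/3, so the macro F1 falls below the accuracy of 0.75. This is the kind of gap that appears when the class distribution is skewed.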

A diverse set of strong baseline models is selected to benchmark the performance of the proposed approach, ranging from traditional machine learning classifiers to state-of-the-art neural architectures. These baselines include monolingual contextual models like BERT, multilingual variants such as XLM-R, and previous state-of-the-art models specifically designed for idiom disambiguation. The experimental results demonstrate that the proposed framework significantly outperforms these comparative models across all evaluation metrics. This performance gain highlights the limitations of standard approaches that often fail to capture the complex semantic interactions between languages in code-switching scenarios, whereas the cross-modal alignment mechanism effectively bridges the linguistic gap to extract richer contextual features.

To isolate the contribution of each architectural component, ablation studies are conducted. The experimental setup removes the cross-modal alignment module and the contrastive learning objective in turn and records the resulting performance degradation. The analysis reveals that the absence of contrastive learning leads to a noticeable drop in classification performance, indicating that discriminative training is essential for distinguishing between literal and figurative senses. Removing the cross-modal alignment component results in a larger decline, particularly on instances involving heavy intra-sentential code-switching. This finding indicates that aligning the semantic spaces of the textual and non-textual modalities is fundamental to capturing the correct pragmatic meaning of idioms embedded in mixed-language contexts.

The evaluation further extends to a granular analysis of how different code-switching types and idiom sense categories influence disambiguation accuracy. The framework exhibits varying levels of robustness across different switching patterns, such as inter-sentential, intra-sentential, and tag-switching. Results indicate that the model performs exceptionally well on intra-sentential switches where the surrounding context provides abundant cross-lingual cues for alignment. Conversely, cases involving idioms with high semantic similarity between their literal and figurative meanings present a greater challenge, suggesting a boundary for the current discriminative capabilities. Case studies are provided to qualitatively analyze the model’s decision-making process, illustrating how the attention mechanism focuses on relevant code-switched context words to resolve ambiguity. These examples serve to bridge the quantitative results with qualitative understanding, confirming that the framework effectively leverages the complementary information from both languages to achieve superior sense disambiguation.

Chapter 3 Conclusion

In summary, this study has presented a robust contrastive framework designed to address the intricate challenges of idiom sense disambiguation within code-switching texts through the mechanism of cross-modal alignment. The research fundamentally operates on the premise that idiomatic expressions possess a unique semantic density that transcends literal linguistic interpretation, necessitating a computational approach capable of mapping these linguistic inputs to rich, contextual visual representations. By leveraging the synergistic capabilities of dual-encoder architectures, specifically employing pre-trained language models and vision transformers, the proposed methodology effectively bridges the semantic gap between textual descriptions and corresponding visual scenes. The core principle guiding this investigation is that accurate sense disambiguation relies heavily on the model's ability to align the embedding space of multilingual text with the embedding space of relevant imagery, thereby utilizing the concrete grounding provided by visual data to resolve the inherent ambiguity of figurative language.

From an operational perspective, the implementation of this framework involves a rigorous process of data preprocessing, feature extraction, and contrastive optimization. The procedure begins by encoding code-switched sentences and associated images into high-dimensional vector spaces, where the model is trained to maximize the cosine similarity between positive text-image pairs while minimizing the similarity between negative pairs. This contrastive learning paradigm forces the network to learn fine-grained semantic relationships, effectively distinguishing between the literal and figurative meanings of idioms based on visual context. The technical integrity of the system is further bolstered by the integration of a specialized fusion layer, which interacts with the multimodal features to capture complex dependencies that single-modality models would inevitably overlook. Through this structured pathway, the framework not only identifies the correct sense of an idiom but also enhances the general representational quality of the entire code-switched input.

The practical application value of this research extends significantly beyond the immediate scope of linguistic disambiguation. In the domain of natural language processing, specifically within the context of social media analysis and cross-cultural communication, the ability to accurately interpret code-switched idioms is crucial for maintaining semantic coherence. Automated systems equipped with this cross-modal alignment capability can substantially improve the performance of downstream tasks such as machine translation, sentiment analysis, and information retrieval. Furthermore, by establishing a standardized procedure for incorporating visual context into linguistic understanding, this work provides a scalable blueprint for handling other forms of figurative language and low-resource languages where textual data alone is insufficient for disambiguation. The capacity to ground abstract linguistic concepts in concrete visual reality represents a pivotal advancement in the pursuit of more human-like artificial intelligence.

Ultimately, the significance of this study lies in its validation of cross-modal alignment as a fundamental solution to the ambiguity problem inherent in code-switching environments. The results demonstrate that visual information is not merely supplementary but often essential for accurate comprehension of figurative speech in multilingual contexts. By rigorously defining the operational parameters and successfully demonstrating the efficacy of the contrastive framework, this thesis contributes a theoretically sound and practically viable tool to the field of computational linguistics. Future developments built upon this foundation will likely focus on optimizing computational efficiency and expanding the range of supported modalities, yet the core insight remains that aligning distinct sensory inputs provides the most reliable pathway toward achieving true semantic understanding in complex, mixed-language scenarios.
