Harnessing the Power of Multimodal Deep Learning for DIY Product Matching

  • Accurately matching product images and descriptions is a major challenge in product matching because product data is often unreliable and inconsistent on both the vendor and the customer side. Unimodal methods, which process only a single modality such as image or text, have been researched extensively, but they often fail to capture all important product features. Multimodal deep learning, a rapidly growing area of machine learning, addresses this by combining information from multiple modalities, such as image, text, and audio, in order to capture all critical product information. Inspired by recent advances in this field, this thesis employs a multimodal neural network trained with a bidirectional triplet loss function, which maps similar image and text embeddings closer to each other in an embedding space. The model uses a Convolutional Neural Network as backbone on both the image and the text side. What sets this thesis apart is that the text backbone operates at the character level instead of relying on common word embedding methods, a choice supported by recent results in multimodal deep learning that show character-level approaches to be particularly effective. The research is conducted in collaboration with Parsionate GmbH, an IT consulting company specializing in data management and located in Stuttgart, which provides a HORNBACH dataset of DIY products for the experiments. The optimal hyperparameters are identified using grid search. The experimental results show that the multimodal neural network with bidirectional triplet loss outperforms unimodal methods when evaluated on the top 5 most similar products.
Notably, the multimodal network with a character-level Convolutional Neural Network for text processing and ResNet50 for image processing outperforms all variants based on word embedding methods. These findings strongly suggest that further investigation of multimodal neural networks with character-level approaches opens up new avenues for research and subsequent application in the product matching domain.
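
To make the bidirectional triplet loss mentioned in the abstract more concrete, the following is a minimal NumPy sketch of one common formulation. It is not taken from the thesis: the function names, the cosine similarity, the margin of 0.2, and the use of all other items in the batch as negatives are assumptions chosen for illustration. Row i of `image_emb` and row i of `text_emb` are assumed to describe the same product.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def bidirectional_triplet_loss(image_emb, text_emb, margin=0.2):
    """Hinge-based triplet loss summed over both retrieval directions.

    Assumption (not from the thesis): every non-matching item in the batch
    serves as a negative, and the margin value is illustrative.
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(text_emb, dtype=float))

    sim = img @ txt.T        # sim[i, j]: similarity of image i and text j
    pos = np.diag(sim)       # similarity of each matching image/text pair

    # Image -> text: image i is the anchor, text j != i is a negative.
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    # Text -> image: text j is the anchor, image i != j is a negative.
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])

    # Exclude the diagonal, which holds the positive pairs themselves.
    off_diagonal = ~np.eye(sim.shape[0], dtype=bool)
    total = cost_i2t[off_diagonal].sum() + cost_t2i[off_diagonal].sum()
    return total / sim.shape[0]


# Toy batch: three products whose image and text embeddings match exactly.
aligned_images = np.eye(3)
aligned_texts = np.eye(3)
loss_aligned = bidirectional_triplet_loss(aligned_images, aligned_texts)

# Same batch with the texts assigned to the wrong products.
shuffled_texts = np.eye(3)[[1, 2, 0]]
loss_shuffled = bidirectional_triplet_loss(aligned_images, shuffled_texts)
```

In this toy batch the correctly paired embeddings incur no loss, while the mismatched pairing is penalized in both the image-to-text and the text-to-image direction. In a trained model the embeddings would come from the image and text backbones rather than being fixed arrays.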

Download full text files

  • Masterarbeit_Kazim_Ali_Mazhar.pdf
    eng

    accessible only within the university network

Metadata
Author: Kazim Ali Mazhar
URN: urn:nbn:de:bsz:753-opus4-31663
Advisor: Gabriele Gühring
Document Type: Master's Thesis
Language: English
Year of Completion: 2023
Publishing Institution: Hochschule Esslingen
Granting Institution: Hochschule Esslingen
Date of final exam: 2023/10/30
Release Date: 2024/06/26
Number of Pages: 119
Faculty: Informatik und Informationstechnik