2023 | Laboratoire des Sciences du Numérique de Nantes

We are seeking candidates for a PhD fellowship in Computer Science, in a collaboration between LS2N (France) and NII (Japan), on the following topics: Ontology Learning, Graph Embeddings and GNNs, Semi-Supervised Learning, and Knowledge Graphs.

Applications are open from May 2023 until a candidate is selected.

Combining knowledge graph embedding and prior-knowledge-based semi-supervised learning for ontology learning from large-scale data.

Keywords: Ontology learning, Knowledge Graph Completion, Prior Knowledge, Clustering, Relation Prediction, Knowledge Graph Embedding, Graph Neural Network.
Laboratory: DUKe team, LS2N (Laboratory of Digital Sciences of Nantes, France), in collaboration with NII and AIST (Tokyo, Japan)
Supervisors: Mounira Harzallah and Fabrice Guillet
Funding: CNRS financial support of 2135 € per month (gross salary), plus NII financial support for the internship in Japan.
Start date: 1st of October 2023
Duration: 3 years
Requirements:
Education Level: MSc
Field: Computer Science, Data Science, Web Science, Computational Linguistics, Artificial Intelligence
Candidate Profile: knowledge of data mining and machine learning; knowledge of the Semantic Web and NLP is strongly appreciated but not mandatory; knowledge of programming languages, mainly Python.
Language: English
Applications will be evaluated on a rolling basis until the position is filled. Interested candidates should submit a CV, a cover letter, transcripts of records for the last three years, and the names and addresses of two references. Applications should be sent to mounira.harzallah@univ-nantes.fr and fabrice.guillet@univ-nantes.fr

PhD Description

Background. The popularity of ontologies and the easy access to a large number of textual resources have strongly motivated the automatic construction of ontologies using artificial intelligence techniques. Three types of construction approaches are distinguished: distributional approaches, knowledge graph-based approaches and pattern-based approaches [Xu et al., 2019; Chen et al., 2020]. In this thesis, we will focus on distributional approaches, more specifically clustering, and on graph-based approaches. Generally, clustering makes it possible to handle a large amount of data. However, it faces two main difficulties: labelling the clusters, and forming semantically consistent clusters that are relevant to the ontology domain. In our previous work, we developed a prior knowledge-driven LDA to tackle these two difficulties [Huang et al., 2021; Xu et al., 2020]. However, clustering-based approaches also suffer from the sparsity of the term representation space [Shwartz et al., 2016]. Graph-based approaches extract triples (subject, predicate, object) from texts, then align and link them to form knowledge graphs (e.g. Yago, DBpedia). They can process a large number of texts and build very large graphs, but they suffer from data heterogeneity, because the same concept can be denoted by different terms in distinct triples and the same term can have several meanings [Nguyen and Ichise, 2012; Kertkeidkachorn and Ichise, 2018].
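To make the heterogeneity issue concrete, the following minimal Python sketch builds a tiny graph from extracted triples and merges two terms that denote the same concept. The triples, the alias table and the helper names (`canonical`, `build_graph`) are assumptions made for illustration only; they are not taken from the cited works, and real alignment is far harder than a lookup table.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object) as they might be
# extracted from text; "car" and "automobile" denote the same concept.
triples = [
    ("car", "is_a", "vehicle"),
    ("automobile", "has_part", "engine"),
    ("truck", "is_a", "vehicle"),
]

# Assumed alias table mapping surface terms to a canonical concept label.
aliases = {"automobile": "car"}


def canonical(term):
    """Return the canonical label for a term (the term itself if no alias is known)."""
    return aliases.get(term, term)


def build_graph(triples):
    """Group outgoing (predicate, object) edges by canonical subject."""
    graph = defaultdict(set)
    for subject, predicate, obj in triples:
        graph[canonical(subject)].add((predicate, canonical(obj)))
    return dict(graph)


raw_subjects = {subject for subject, _, _ in triples}
graph = build_graph(triples)
```

Without alignment, the three triples yield three distinct subject nodes; once the alias is applied, both edges about cars are attached to the single node `car`, leaving two subject nodes in total.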

PhD purpose. The purpose of this thesis is to develop a new approach for automatic ontology construction that combines semi-supervised clustering methods driven by prior knowledge (seed knowledge, local knowledge, domain knowledge, DBpedia, etc.) [Jagarlamudi et al., 2012; Xu et al., 2019; Huang et al., 2021] with knowledge graph embedding [Ebisu and Ichise, 2018]. This new approach aims to overcome the scientific challenges of data heterogeneity and data sparsity. By representing the terms to be clustered as subgraphs and their vector embeddings, the problem of text sparsity can be addressed and the quality of the clusters can be improved. Graph embedding has grown rapidly in recent years [Zhang et al., 2020]. It aims to automatically learn a low-dimensional feature representation for each node in a graph. Graph embeddings are used to build machine learning models for various tasks, and our goal is to exploit them to improve ontology learning. The approach to be developed in this thesis will also infer hypernym relations between terms within each cluster. The objective of this task is threefold: 1) to evaluate the quality of the clusters, 2) to refine their description space through an iterative loop of clustering, hypernym relation extraction and re-clustering, and 3) to evaluate and improve the quality of the knowledge graphs from which the term subgraphs are extracted.
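As a rough illustration of how seed knowledge can drive clustering over term embeddings, here is a minimal Python sketch of a seeded k-means. It is a deliberately simple stand-in, not the prior knowledge-driven LDA of the cited work; the 2-D vectors, the seed terms and the function names (`mean`, `seeded_kmeans`) are hypothetical, and in practice the vectors would be learned from the subgraph around each term.

```python
import math

# Hypothetical 2-D term embeddings (hand-made for the example).
embeddings = {
    "dog": (0.9, 0.1),
    "cat": (0.8, 0.2),
    "wolf": (1.0, 0.0),
    "oak": (0.1, 0.9),
    "pine": (0.2, 0.8),
    "birch": (0.0, 1.0),
}

# Assumed prior knowledge: one seed term per core concept.
seeds = {"animal": "dog", "tree": "oak"}


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def seeded_kmeans(embeddings, seeds, iterations=10):
    """k-means whose centroids start at the vectors of the seed terms."""
    centroids = {label: embeddings[term] for label, term in seeds.items()}
    assignment = {}
    for _ in range(iterations):
        # Assign every term to its nearest centroid (Euclidean distance).
        assignment = {
            term: min(centroids, key=lambda c: math.dist(vec, centroids[c]))
            for term, vec in embeddings.items()
        }
        # Seed terms always stay in their own cluster.
        for label, term in seeds.items():
            assignment[term] = label
        # Recompute each centroid from its current members.
        for label in centroids:
            members = [embeddings[t] for t, l in assignment.items() if l == label]
            centroids[label] = mean(members)
    return assignment


clusters = seeded_kmeans(embeddings, seeds)
```

Because each cluster is initialised from a seed concept, it inherits that concept's label, which is one simple way prior knowledge can help with the cluster labelling difficulty mentioned in the background.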

The positioning and significance of this research. Since ontologies are crucial for AI applications, many research studies address ontology learning. However, they investigate the sparsity and heterogeneity problems separately. The first originality of our research is to combine knowledge graph representation and prior-knowledge-driven clustering in order to solve the sparsity and heterogeneity problems simultaneously. Knowledge graphs and graph embedding deal with the sparsity problem, while prior-knowledge-driven clustering deals with the heterogeneity problem. The second originality of our research is to semantically enrich the graph embedding by integrating prior knowledge from the core ontology into the embedding process. Focusing on improving the embedding process itself, Sun et al. [2020] show that embedding-based approaches perform well when training is performed on the text corpus from which the graph is constructed. However, when this corpus is unavailable or small, the graph embedding is based exclusively on the graph structure, which weakens the performance of these approaches. In this case, taking into account the semantics of certain entities or properties of the graph could be a relevant way to semantically enrich the input of the graph embedding. This enrichment could be done using a domain ontology or its core ontology.

Therefore, we would like to develop an original approach that benefits, on the one hand, from the power of graph embedding techniques for clustering entities and, on the other hand, from the semantic quality of an ontology to drive and refine the learning. A core ontology will be used as a seed knowledge model to improve the quality of the graph embedding as well as the clustering.

The day dedicated to the Data Science and Decision thematic cluster (Sciences des Données et de la Décision, SDD) was a success, with 45 participants. The cluster's six teams, COMBI, DUKe, GDD, MéForBio, modelis and TASC, were all represented, with strong participation from PhD students.

The theme of the day was “The data life cycle”. We had the pleasure of welcoming Stéphane Pesant, a researcher at EMBL-EBI in the United Kingdom, who shared his expertise on the data life cycle in marine research projects. As data manager for the Tara Oceans expeditions, he presented the whole process, from collection to publication.

The morning continued with a short presentation from each team, giving PhD students the opportunity to present their work.

Marinna Gaudin, from the COMBI team, presented the central question of her thesis: “To what extent can the biogeography of plankton interactions provide information on the response of plankton communities to climate change?”

Guillaume Raschia, from the DUKe team, presented the thesis work of Aurélie Suzanne, who found an efficient solution for processing temporal event streams in real time.

Four PhD students from the GDD team highlighted their work on making data highly accessible on the web. They discussed transforming data into knowledge graphs, augmenting data using the web, and efficient and flexible access to the web of data.

Honglu Sun, from the MéForBio team, presented the challenges of a thesis on identifying the parameters of hybrid models (continuous and discrete) of gene regulatory networks from time-series data.

After a short introduction by Olivier Peton, head of the modelis team, David Tremblet presented his thesis topic, which is part of the European project ASSISTANT. He explained how he works on data-driven production planning with significant cost savings, while taking into account the uncertainty in predicting resource consumption.

Finally, after a short introduction by Samir Loudni, head of the TASC team, three PhD students from the team gave a playful introduction to their research problems: discrete scheduling models in a combinatorial optimisation context; the use of constraint solvers to sample combinatorial problems in a generic way; and learning to rank association rules with the Choquet integral, iteratively and taking user feedback into account.

In the afternoon, three round tables were held in parallel. Each lasted one hour and was followed by a brief report to all participants.

Data modelling and the new challenges it raises, and how to face them (ethics, quality of the datasets used, quality of the results, privacy, GDPR compliance, etc.).
The place of AI in our concerns (is it the central subject of our research, or a tool used at one stage of the solutions we propose?).
Good practices for producing sound experimental protocols (what size of data to generate? which benchmarks? which quality measures? etc.).