Type Prediction Combining Linked Open Data and Social Media

Yaroslav Nechaev, Francesco Corcoglioniti, and Claudio Giuliano

   Check out the slides from our ACM CIKM 2018 talk
   Read the preprint version

To cite the paper, please use the following BibTeX entry:

@inproceedings{nechaev2018cikm,
  author    = {Yaroslav Nechaev and Francesco Corcoglioniti and Claudio Giuliano},
  title     = {Type Prediction Combining Linked Open Data and Social Media},
  booktitle = {Proceedings of the 27th {ACM} International Conference on Information
               and Knowledge Management, {CIKM} 2018, Torino, Italy, October 22-26, 2018},
  pages     = {1033--1042},
  year      = {2018}
}

Abstract

Linked Open Data (LOD) and social media often contain the representations of the same real-world entities, such as persons and organizations. These representations are increasingly interlinked, making it possible to combine and leverage both LOD and social media data in prediction problems, complementing their relative strengths: while LOD knowledge is highly structured but also scarce and obsolete for some entities, social media data provide real-time updates and increased coverage, albeit being mostly unstructured.

In this paper, we investigate the feasibility of using social media data to perform type prediction for entities in a LOD knowledge graph. We discuss how to gather training data for such a task, and how to build an efficient domain-independent vector representation of entities based on social media data. Our experiments on several type prediction tasks using DBpedia and Twitter data show the effectiveness of this representation, both alone and combined with knowledge graph-based features, suggesting its potential for ontology population.

Supplementary Material

Complete experimental results
class_scores.pdf 44.7KB      class_scores.tsv 27.0KB

These files list the types (i.e., predicted classes) for the 8 type prediction tasks considered in the submission, both as a PDF table and as a TSV file. For each class, the files report the number of samples with that class in the ground truth, as well as the class-wise precision (P), recall (R), and F1 scores obtained by the reference "Social" approach. Confidence intervals computed with the bootstrap (percentile) method are also reported.
Derived from eval-scores.zip
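
The confidence intervals above follow the bootstrap (percentile) method. As a purely illustrative sketch of that computation (not the actual code in "eval.py"), a class-wise F1 interval can be obtained by resampling the predictions with replacement; the names y_true, y_pred, and target_class below are placeholders:

import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_ci(y_true, y_pred, target_class, n_boot=1000, alpha=0.05, seed=42):
    # Percentile-bootstrap confidence interval for the F1 score of one class:
    # resample <gold, predicted> pairs with replacement, recompute F1 on each
    # resample, and take the alpha/2 and 1-alpha/2 percentiles of the scores.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rng = np.random.RandomState(seed)
    n = len(y_true)
    scores = [f1_score(y_true[idx], y_pred[idx], labels=[target_class], average="macro")
              for idx in (rng.randint(0, n, n) for _ in range(n_boot))]
    return np.percentile(scores, [100 * alpha / 2, 100 * (1 - alpha / 2)])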

avg_scores.pdf 64.1KB      avg_scores.tsv 8.8KB

These files contain the average performance of each prediction approach on the 8 type prediction tasks considered in the submission, reported both as a PDF table and as a TSV file. For each <task, approach> pair, the files report macro-averaged (across types) and micro-averaged precision (P), recall (R), and F1 scores. Also reported are confidence intervals computed with the bootstrap (percentile) method and the statistical significance of score differences with respect to the reference "Social" approach (significantly better/worse scores are marked with '+'/'-', respectively).
Derived from eval-scores.zip
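
Macro-averaging here is the unweighted mean of the class-wise scores, whereas micro-averaging pools the true/false positives and negatives over all classes before computing the scores. A minimal, self-contained illustration (with made-up labels, not data from the experiments):

from sklearn.metrics import precision_recall_fscore_support

# Made-up gold and predicted type labels, for illustration only.
y_true = ["Person", "Organization", "Person", "Place"]
y_pred = ["Person", "Organization", "Place", "Place"]

# Macro: average the per-class P/R/F1; micro: compute P/R/F1 from global counts.
macro = precision_recall_fscore_support(y_true, y_pred, average="macro")
micro = precision_recall_fscore_support(y_true, y_pred, average="micro")
print("macro P/R/F1:", macro[:3])
print("micro P/R/F1:", micro[:3])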

Experimental code and data
eval-all.zip 748.4MB      eval-scores.zip 4.6MB

These files contain the code and the data (all of it, or only the score-related part, respectively) for the experiments reported in the submission, to allow "partial" reproducibility of results – "partial" because Twitter's terms of use do not allow us to also provide the source Twitter stream data from which the social features were derived.

The files contain a Python script "eval.py" and its associated configuration "eval.json", which together implement the evaluation workflow depicted in the following figure.

The starting point consists of the ground truth <entity, profile, type> data in folder /groundtruth (file groundtruth.tsv.gz) and of the RDF and social feature families in folder /features (~760MB of data overall); each family consists of a ".svm.gz" file, where each line is a feature vector in libsvm/svm-light format, and a ".ids.gz" file with the profile names corresponding to those vectors. Based on this data and on the configuration "eval.json", the first "setup" step performed by "eval.py" produces a set of libsvm <label, feature vector> files in folder /libsvm, one per <task, approach> combination (~5.4GB of data overall).

The next step trains SVM models (with optimal hyperparameters) and tests them on the ground truth data according to the nested cross-validation (CV) scheme described in the submission. Each <task, approach> nested-CV run is executed in a separate process, with all the approaches of a task run in parallel to speed up computation; the partial results of these runs, consisting of predictions, optimal hyperparameters, and log files, are collected in folder /partials. The final step implemented by "eval.py" reads those partial files and produces a number of TSV and PDF reports, namely: TSV files with the merged predictions and hyperparameter settings; TSV and PDF files with class-wise prediction scores; TSV and PDF files with micro- and macro-averaged scores; and a precision/recall plot.
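
As a rough sketch of the training/testing step in scikit-learn terms (the file name, fold counts, classifier, and hyperparameter grid below are illustrative placeholders, not the actual settings in "eval.json"):

from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_predict
from sklearn.svm import LinearSVC

# Load one <task, approach> file produced by the "setup" step; ".gz" files are
# decompressed on the fly by load_svmlight_file.
X, y = load_svmlight_file("libsvm/sometask.social.svm.gz")  # hypothetical path

# Nested CV: the inner loop selects the SVM hyperparameters, the outer loop
# produces the out-of-fold predictions that are later scored in the reports.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]}, cv=inner_cv)
y_pred = cross_val_predict(model, X, y, cv=outer_cv, n_jobs=-1)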

For all steps implemented by "eval.py", if an output file is already present it is not recomputed, so results are preserved across runs and the script can be halted and resumed without losing the intermediate results computed so far (a minimal sketch of this skip-if-present pattern is shown after the requirements list below). The difference between "eval-all.zip" and "eval-scores.zip" is that the former, larger file contains all the code and data (with entities and user profiles anonymized), while the latter, smaller file contains a TSV file with the predictions instead of the ground truth and feature files, and thus only allows performing the "report generation" step in the figure. Therefore, download "eval-all.zip" if you want to execute the whole pipeline, or "eval-scores.zip" if you are only interested in producing the reports from the predictions we already computed. In both cases, to run "eval.py" you need:

  • 8GB free RAM
  • 7GB free disk space
  • a Linux / Mac OS X / Unix-like environment (the script performs some file/process manipulations by calling sh)
  • Python 3 with numpy, pandas, scikit-learn, and matplotlib
  • "pigz" utility available on PATH (to speed up reading/writing gzip files)
  • pdflatex available on PATH (for generating the PDF reports)

The execution of the whole pipeline via "eval.py" takes ~7 hours on a 10-core machine (Xeon E5-2630 CPU).
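
As mentioned above, "eval.py" skips any step whose output file already exists. A minimal sketch of this skip-if-present pattern (illustrative, not the actual code):

import os

def run_step(output_path, compute):
    # Recompute only if the output is missing, so a halted run can be resumed
    # without redoing the work completed in previous runs.
    if os.path.exists(output_path):
        return
    compute(output_path)  # e.g., write libsvm files, train models, render a report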

groundtruth.zip 1.0MB

This file contains a Bash script "groundtruth.sh" and a DBpedia-Twitter alignment file "alignments.tql.gz" that can be used to extract the ground truth data from DBpedia 2016-04 (instructions are contained in the script). Running the script requires setting up a local Virtuoso instance populated with DBpedia 2016-04 data, as well as installing RDFpro locally and making it available on PATH. Even if you do not plan to execute the script, inside it you can find the actual SPARQL queries we used to extract the ground truth data from DBpedia for the 8 tasks; these queries document the kinds of aggregation and cleanup we performed to derive the ground truth.
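
The authoritative queries are the ones embedded in "groundtruth.sh". Purely as an illustration of the kind of access involved, type assertions can be retrieved from a local Virtuoso instance with a Python client such as SPARQLWrapper; the endpoint URL and the query below are assumptions and not the actual queries used:

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical query against a local Virtuoso endpoint loaded with DBpedia
# 2016-04; it merely lists entities with their DBpedia ontology types and is
# NOT one of the queries shipped in groundtruth.sh.
sparql = SPARQLWrapper("http://localhost:8890/sparql")  # default Virtuoso endpoint
sparql.setQuery("""
    SELECT ?entity ?type
    WHERE {
        ?entity a ?type .
        FILTER(STRSTARTS(STR(?type), "http://dbpedia.org/ontology/"))
    }
    LIMIT 100
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["entity"]["value"], row["type"]["value"])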