This dataset was crawled from three popular online social networks (OSNs): Twitter, Facebook and Foursquare. We collected it as follows.
We first gathered a set of Singapore-based Twitter users, that is, users who declared Singapore as their location in their user profiles. From these users, we retrieved the subset who declared their Facebook or Foursquare accounts in their short bio descriptions. In total, we collected 1,998 Twitter-Facebook user identity pairs (known as TW-FB ground truth matching pairs) and 3,602 Twitter-Foursquare user identity pairs (known as TW-FQ ground truth matching pairs).
To simulate a real-world setting, where a user identity in the source OSN may not have a corresponding matching user identity in the target OSN, we expanded the datasets by adding Twitter, Facebook and Foursquare users who are connected to users in the TW-FB and TW-FQ ground truth matching pair sets. Note that isolated users, who have no links to other users, were removed from the datasets.
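The removal of isolated users described above can be sketched as follows. This is only a minimal illustration, not the procedure actually used to build the dataset: the function name `remove_isolated_users`, the representation of links as pairs of user identifiers, and the choice to ignore self-links are all assumptions made for the example.

```python
def remove_isolated_users(users, links):
    """Return the users that have at least one link to another user.

    `users` is an iterable of user identifiers and `links` is an iterable
    of (user_a, user_b) pairs. Both formats are assumptions for this
    sketch, not the dataset's actual file layout.
    """
    connected = set()
    for user_a, user_b in links:
        # A self-link does not connect a user to anyone else, so skip it.
        if user_a != user_b:
            connected.add(user_a)
            connected.add(user_b)
    # Preserve the original order of `users` while dropping isolated ones.
    return [user for user in users if user in connected]


if __name__ == "__main__":
    example_users = ["u1", "u2", "u3", "u4"]
    example_links = [("u1", "u2"), ("u4", "u4")]
    print(remove_isolated_users(example_users, example_links))
```

In this toy input, `u3` has no links and `u4` only links to itself, so both are treated as isolated.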
After collecting the datasets, we extracted the following user features using the OSNs' APIs.
• Username: The username of the account.
• Screen name: The natural name of the user account. It is usually formed using the first and last name of the user.
• Profile Image: The thumbnail or image provided by the user to visually present herself.
• Network: The relationship links between users.
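For readers who want to load the data into memory, the four features listed above could be held in a record like the one below. This is a hedged sketch only: the class name `UserRecord`, every field name, and the use of an opaque `user_id` are assumptions for illustration, and the released (anonymized) files may be organized quite differently.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class UserRecord:
    """One user identity in one OSN (hypothetical layout, not the dataset's)."""

    user_id: str          # assumed opaque identifier for the anonymized user
    osn: str              # assumed label, e.g. "twitter", "facebook", "foursquare"
    username: str         # Username feature
    screen_name: str      # Screen name feature
    profile_image: str    # Profile Image feature (e.g. a path or encoded value)
    neighbors: Set[str] = field(default_factory=set)  # Network feature


if __name__ == "__main__":
    user = UserRecord("id_001", "twitter", "name_a", "Name A", "img_001")
    user.neighbors.add("id_002")
    print(user)
```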
Table 1 summarizes the statistics of our dataset. Due to privacy concerns, the data is anonymized.
Kindly cite the following paper if you use the dataset:
UNSUPERVISED USER IDENTITY LINKAGE VIA FACTOID EMBEDDING
IEEE International Conference on Data Mining, Nov 17-20, 2018
Wei Xie, Xin Mu, Roy Ka-Wei Lee, Feida Zhu and Ee Peng Lim
Last updated on 09 Dec 2019.