In our recent work on cross-language entity linking (McNamee et al., 2011) we experimented with linking foreign-language mentions of people to nodes in English Wikipedia. We used a 2008 snapshot of English Wikipedia, which has been used in the NIST TAC-KBP evaluations (2009-2014).
We assembled a cross-language entity linking test set that covers 21 languages and five writing systems (Arabic, Chinese, Cyrillic, Greek, and Roman). Languages included in the test set: Albanian, Arabic, Bulgarian, Chinese, Croatian, Czech, Danish, Dutch, Finnish, French, German, Greek, Italian, Macedonian, Portuguese, Romanian, Serbian, Spanish, Swedish, Turkish, and Urdu. The test set includes 55,000 foreign queries, plus English versions of each of the queries.
The test set is available for download:
We obtained ground truth assessments for this task efficiently by combining multi-aligned parallel texts, crowdsourcing, and significant English-language automation. We describe the construction of our test set in detail in this paper: