Svensk analogi 2.0

Svensk semantisk och syntaktisk likhet

I. IDENTIFYING INFORMATION
Title*	Swedish analogy test set v1.1
Subtitle	Swedish semantic and syntactic similarity test set
Created by*	Tosin Adewumi (tosin.adewumi@ltu.se), ML Group, LTU
Publisher(s)*	Språkbanken Text (sb-info@svenska.gu.se)
Link(s) / permanent identifier(s)*	https://spraakbanken.gu.se/en/resources/superlim
License(s)*	CC BY 4.0
Abstract*	The Swedish analogy test set follows the format of the original Google version. However, it is bigger and balanced across the 2 major categories, having a total of 20,638 samples, made up of 10,381 semantic and 10,257 syntactic samples. It is also roughly balanced across the syntactic subsections. There are 5 semantic subsections and 6 syntactic subsections. The dataset was constructed, partly using the samples in the English version, with the help of tools dedicated to Swedish translation and it was proof-read for corrections by two native speakers (with a percentage agreement of 98.93\%).
Funded by*	Vinnova (grant no. 2019-02996)
Cite as	[1]
Related datasets	Part of the SuperLim collection (https://spraakbanken.gu.se/en/resources/superlim).

II. USAGE
Key applications	Intrinsic evaluation of Swedish word embeddings
Intended task(s)/usage(s)	Given a word pair A and B and a word C, find a word D such that A is to B as C is to D (A:B::C:D)
Recommended evaluation measures	Accuracy
Dataset function(s)	Few-shot training ('prompting'), testing
Recommended split(s)	A few-shot training set (aka 'prompt', 10%), test set (90%). The prompt was added with the GPT-like models in mind. For those models that do not need a prompt, it can be ignored.

III. DATA
Primary data*	Text
Language*	Swedish
Dataset in numbers*	Total of 20,638 samples; 10,381 semantic samples and 10,257 syntactic samples. Those are split into 2045 train samples and 18,593 test samples. No effort was made to control the balance of syntactic and semantic samples in train and test, the split was random.
Nature of the content*	Each sample contains 2 pairs of words. Hence, there are 4 similar words per line.
Format*	TSV/JSONL with 5 columns/objects: four words and a category. The word to be predicted is called 'label', the given words 'pair1_element1', 'pair1_element2', and 'pair2_element1'.
Data source(s)*	Partly based on the English version by: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. New additions were made using the following online tools: https://bab.la and https://en.wiktionary.org/wiki/
Data collection method(s)*	Two Swedish native speakers proof-read the finished version. The inter-agreement score was calculated. This was after compilation from part of the English version (Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.), which was translated. Additional data source is en.wiktionary.org/wiki
Data selection and filtering*	The dataset was postprocessed and corrected by Lars Borin and Aleksandrs Berdicevskis
Data preprocessing*	Does not apply
Data labeling*	Does not apply
Annotator characteristics	Two Swedish native speakers

IV. ETHICS AND CAVEATS
Ethical considerations
Things to watch out for

V. ABOUT DOCUMENTATION
Data last updated*	2023-03-05, Gerlof Bouma
Which changes have been made, compared to the previous version*	Minor format changes
Access to previous versions	Work in progress
This document created*	2021-05-20, Tosin Adewumi
This document last updated*	2023-03-05, Gerlof Bouma
Where to look for further details	[1],[2]
Documentation template version*	v1.1

VI. OTHER
Related projects

References	[1] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Corpora compared: The case of the swedish gigaword & wikipedia corpora. arXiv preprint arXiv:2011.03281.
	[2] Adewumi, T. P., Liwicki, F., & Liwicki, M. (2020). Exploring Swedish & English fastText Embeddings with the Transformer. arXiv preprint arXiv:2007.16007.

Fil	Storlek	Modifierad	Licens
sweanalogy.zip an archive with the dataset in JSONL and TSV formats and the documentation sheet (zip)	178.63 KB	2023-03-30	CC BY 4.0 attribution

Del av samling

Typ

Språk

Storlek

Kontakt