SwissCrawl

The largest corpus of written Swiss German to date.

What is it?

As part of the SwissTranslation project, SwissCrawl is a corpus of 500,000+ Swiss German (GSW) sentences gathered from crawling the web between September and November 2019.

More precisely, it contains 562,521 sentences from 62K URLs across 3,472 domains. 89% of those sentences have a high Swiss German probability (confidence > 99%). With lengths varying between 25 and 998 characters, they are representative of how native speakers write in forums and on social media. As such, they may contain slang and ASCII emoticons, and do not always have proper capitalisation or punctuation. Nevertheless, we believe they are of great value for fostering research on Swiss German NLP.

What is the format of the data?

The corpus is available as a CSV file with the following columns:

text The actual sentence.
url The URL of the page where the first occurrence of the text was found.
crawl_proba The confidence (0.0-1.0) of being Swiss German, as computed by our Language Identification (LID) model.
date The date the text was found.
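As a minimal sketch of working with this format, the snippet below parses a tiny inline sample (the rows are illustrative, not actual corpus entries) and keeps only sentences with a high LID confidence, mirroring the crawl_proba column described above:

```python
import csv
import io

# A tiny sample in the corpus format (illustrative values, not actual
# corpus rows).
sample = """text,url,crawl_proba,date
Das isch en Bispiel-Satz.,https://example.com/page,0.9987,2019-10-02
Maybe not Swiss German.,https://example.com/other,0.4210,2019-11-15
"""

# Keep only rows with a high Swiss German confidence, as computed by
# the LID model.
high_conf = [
    row for row in csv.DictReader(io.StringIO(sample))
    if float(row["crawl_proba"]) > 0.99
]

print(len(high_conf))  # only the first sample row passes the 0.99 threshold
```

The same filter applied to the full CSV file reproduces the high-confidence subset mentioned above (89% of sentences at confidence > 99%).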

How was it created?

The SwissCrawl corpus is part of the SwissTranslation project.

The sentences were gathered using a customised toolchain especially suited for low-resource languages. The code is publicly available at https://github.com/derlin/swisstext-lrec. If you are interested in using it for other languages, feel free to contact the authors or open an issue on GitHub.

How can I access it?

SwissCrawl is released under the Creative Commons CC BY-NC 4.0 licence and is free for non-commercial use only. Simply send us a request by e-mail briefly explaining your purposes, and we will get back to you as soon as possible.

Do you have more information?

The corpus and its creation are explained more thoroughly in the following article (arXiv:1912.00159, LREC 2020 proceedings), which you can also use to cite SwissCrawl:

@InProceedings{linder2020crawler,
  author = {Linder, Lucy and Jungo, Michael and Hennebert, Jean and Musat, Claudiu Cristian and Fischer, Andreas},
  title = {Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month = {May},
  year = {2020},
  address = {Marseille, France},
  publisher = {European Language Resources Association},
  pages = {2706--2711},
  abstract = {This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than half a million sentences, it was generated using a customized web scraping tool that could be applied to other low-resource languages as well. The approach demonstrates how freely available web pages can be used to construct comprehensive text corpora, which are of fundamental importance for natural language processing. In an experimental evaluation, we show that using the new corpus leads to significant improvements for the task of language modeling.},
  url = {https://www.aclweb.org/anthology/2020.lrec-1.329}
}

The documentation for the toolchain is also available at https://derlin.github.io/swisstext.