The wikitext long term dependency language modeling dataset

Stephen Merity - September 26, 2016

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

Compared to the preprocessed version of Penn Treebank (PTB), WikiText-2 is over 2 times larger and WikiText-103 is over 110 times larger. The WikiText dataset also features a far larger vocabulary and retains the original case, punctuation and numbers - all of which are removed in PTB. As it is composed of full articles, the dataset is well suited for models that can take advantage of long term dependencies.

The following paper introduces the dataset in detail. If you use this dataset in published work, please cite:

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016.
Pointer Sentinel Mixture Models
Published results: WikiText-103

Validation and testing perplexities for WikiText-103. Lower is better.

Publication Model Parameters Validation Testing
Grave et al. 2016 LSTM - - 48.7
Grave et al. 2016 Neural cache model (size = 100) - - 44.8
Grave et al. 2016 Neural cache model (size = 2000) - - 40.8
Published results: WikiText-2

Validation and testing perplexities for WikiText-2. Lower is better.

Publication Model Parameters Validation Testing
Merity et al. 2016 Zoneout + Variational LSTM 20M 108.7 100.9
Grave et al. 2016 LSTM - - 99.3
Merity et al. 2016 Variational LSTM (code from Gal 2015) 20M 101.7 96.3
Grave et al. 2016 Neural cache model (size = 100) - - 81.6
Merity et al. 2016 Pointer LSTM (window = 100) 21M 84.8 80.8
Grave et al. 2016 Neural cache model (size = 2000) - - 68.9
Download

Word level: For the datasets, you can download them from:

Each file contains wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens. No processing is needed other than replacing newlines with <eos> tokens.

Raw / character level: For the datasets, you can download them from:

The raw tokens before the addition of <unk> tokens. They should only be used for character level work or for creating newly derived datasets.

Examples

 = Gold dollar =

 The gold dollar or gold one @-@ dollar piece was a coin struck as a regular issue by the United States Bureau of the Mint from 1849 to 1889 . The coin had three types over its lifetime , all designed by Mint Chief Engraver James B. Longacre . The Type 1 issue had the smallest diameter of any United States coin ever minted .
 A gold dollar had been proposed several times in the 1830s and 1840s , but was not initially adopted . Congress was finally galvanized into action by the increased supply of bullion caused by the California gold rush , and in 1849 authorized a gold dollar . In its early years , silver coins were being hoarded or exported , and the gold dollar found a ready place in commerce . Silver again circulated after Congress in 1853 required that new coins of that metal be made lighter , and the gold dollar became a rarity in commerce even before federal coins vanished from circulation because of the economic disruption caused by the American Civil War .
 Gold did not again circulate in most of the nation until 1879 ; once it did , the gold dollar did not regain its place . In its final years , it was struck in small numbers , causing speculation by hoarders . It was also in demand to be mounted in jewelry . The regular issue gold dollar was last struck in 1889 ; the following year , Congress ended the series .
 = Super Mario Land =

 Super Mario Land is a 1989 side @-@ scrolling platform video game , the first in the Super Mario Land series , developed and published by Nintendo as a launch title for their Game Boy handheld game console . In gameplay similar to that of the 1985 Super Mario Bros. , but resized for the smaller device 's screen , the player advances Mario to the end of 12 levels by moving to the right and jumping across platforms to avoid enemies and pitfalls . Unlike other Mario games , Super Mario Land is set in Sarasaland , a new environment depicted in line art , and Mario pursues Princess Daisy . The game introduces two Gradius @-@ style shooter levels .
 At Nintendo CEO Hiroshi Yamauchi 's request , Game Boy creator Gunpei Yokoi 's Nintendo R & D1 developed a Mario game to sell the new console . It was the first portable version of Mario and the first to be made without Mario creator and Yokoi protégé Shigeru Miyamoto . Accordingly , the development team shrunk Mario gameplay elements for the device and used some elements inconsistently from the series . Super Mario Land was expected to showcase the console until Nintendo of America bundled Tetris with new Game Boys . The game launched alongside the Game Boy first in Japan ( April 1989 ) and later worldwide . Super Mario Land was later rereleased for the Nintendo 3DS via Virtual Console in 2011 again as a launch title , which featured some tweaks to the game 's presentation .
 Initial reviews were laudatory . Reviewers were satisfied with the smaller Super Mario Bros. , but noted its short length . They considered it among the best of the Game Boy launch titles . The handheld console became an immediate success and Super Mario Land ultimately sold over 18 million copies , more than that of Super Mario Bros. 3 . Both contemporaneous and retrospective reviewers praised the game 's soundtrack . Later reviews were critical of the compromises made in development and noted Super Mario Land 's deviance from series norms . The game begot a series of sequels , including the 1992 Super Mario Land 2 : 6 Golden Coins , 1994 Wario Land : Super Mario Land 3 , and 2011 Super Mario 3D Land , though many of the original 's mechanics were not revisited . The game was included in several top Game Boy game lists and debuted Princess Daisy as a recurring Mario series character .
 = = = Sinclair Scientific Programmable = = =

 The Sinclair Scientific Programmable was introduced in 1975 , with the same case as the Sinclair Oxford . It was larger than the Scientific , at 73 by 155 by 34 millimetres ( 2 @.@ 9 in × 6 @.@ 1 in × 1 @.@ 3 in ) , and used a larger  battery , but could also be powered by mains electricity .
 It had 24 @-@ step programming abilities , which meant it was highly limited for many purposes . It also lacked functions for the natural logarithm and exponential function . Constants used in programs were required to be integers , and the programming was wasteful , with start and end quotes needed to use a constant in a program .
 However , included with the calculator was a library of over 120 programs that that performed common operations in mathematics , geometry , statistics , finance , physics , electronics , engineering , as well as fluid mechanics and materials science . The full library of standard programs contained over 400 programs in the Sinclair Program Library .
Dataset statistics

In comparison to the Mikolov processed version of the Penn Treebank (PTB), the WikiText datasets are larger. WikiText-2 aims to be of a similar size to the PTB while WikiText-103 contains all articles extracted from Wikipedia. The WikiText datasets also retain numbers (as opposed to replacing them with N), case (as opposed to all text being lowercased), and punctuation (as opposed to stripping them out).

Penn Treebank WikiText-2 WikiText-103
Train Valid Test Train Valid Test Train Valid Test
Articles - - - 600 60 60 28,475 60 60
Tokens 887,521 70,390 78,669 2,088,628 217,646 245,569 103,227,021 217,646 245,569
Vocab 10,000 33,278 267,735
OoV 4.8% 2.6% 0.4%
Contact information

If you have questions about the dataset or want to report new results, contact Stephen Merity.