Collobert & Weston embeddings from a run called
20100116-redo-baseline-with-100dims on mammouth. They are induced as
described in the paper:

    Joseph Turian, Lev-Arie Ratinov and Yoshua Bengio (2010),
    "Word Representations: A Simple and General Method for
    Semi-Supervised Learning"

on the RCV1 corpus, cleaned as described in the paper (roughly 37M
words of news text).

The following table details some of the training information:

    embedding size
        The size of the embeddings induced.
    training updates
        Number of training examples used. (We can train indefinitely,
        as described in the paper, but this is where we stopped.)
    model learning rate
        Learning rate used to adjust the model parameters.
    embedding learning rate
        Learning rate used to adjust the embeddings.
    pre-update training error
        The error achieved by the model on training examples, measured
        before the model was updated on them. (range: 0-1)

========================================================================================================
embedding size | training updates | model learning rate | embedding learning rate | pre-update train err
========================================================================================================
            25 |       2280000000 |               1e-08 |                   1e-07 |           0.00339721
            50 |       2270000000 |               1e-09 |                   1e-06 |             0.003038
           100 |       2030000000 |               1e-09 |                   1e-06 |           0.00294992
           200 |       1750000000 |               1e-09 |                   1e-06 |           0.00300134
========================================================================================================

The following files are available:

codesnapshot.20100116-redo-baseline-with-100dims.tar.gz
    Snapshot of the code used to induce these embeddings.
embeddings-leastcommon.*.png
    t-SNE visualization of the least frequent words.
embeddings-midcommon.*.png
    t-SNE visualization of words of intermediate frequency.
embeddings-mostcommon.*.png
    t-SNE visualization of the most frequent words.
embeddings-original.*.txt.gz
    Original embeddings, without any scaling. The first column is the
    word; the remaining columns are the dimensions of the embedding.
embeddings-randomized.*.png
    t-SNE visualization of a random selection of words.
embeddings-scaled.*.txt.gz
    Embeddings scaled by 0.1/stddev(embeddings), as described in the
    ACL 2010 paper. These are the embeddings you should use by default
    if you just want word features. The first column is the word; the
    remaining columns are the dimensions of the embedding.
hyperparameters.language-model.yaml
    The default hyperparameters used in the runs, unless specified
    otherwise above.
README.txt
    This file.
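
If you just want per-word features, the scaled embedding files are the
ones to load. The sketch below assumes only the format stated above (a
gzipped text file, one word per line followed by whitespace-separated
dimension values); the filename, function name, file encoding, and use
of NumPy are illustrative and not part of this distribution.

    import gzip

    import numpy as np

    def load_embeddings(path):
        """Read a gzipped embeddings file: each line is a word followed by
        whitespace-separated floating-point dimension values."""
        words, rows = [], []
        # Encoding is an assumption; adjust if your copy uses a different one.
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                fields = line.split()
                if not fields:
                    continue  # skip blank lines, if any
                words.append(fields[0])
                rows.append([float(x) for x in fields[1:]])
        return words, np.array(rows)

    # Illustrative filename; substitute the actual embeddings-scaled.*.txt.gz file.
    words, E = load_embeddings("embeddings-scaled.50dims.txt.gz")
    word_to_vec = dict(zip(words, E))   # look up one embedding per word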
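
The relation between the original and the scaled files is the single
scaling factor mentioned above. A minimal sketch of that
transformation, assuming E_original is the matrix loaded from an
embeddings-original.*.txt.gz file:

    # Scale by 0.1 divided by the standard deviation over all embedding values,
    # which is what "scaled by 0.1/stddev(embeddings)" describes above.
    E_scaled = E_original * (0.1 / E_original.std())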