Embeddings from the hierarchical log-bilinear (HLBL) model. They were induced as described in:

    Joseph Turian, Lev-Arie Ratinov and Yoshua Bengio (2010),
    "Word representations: A simple and general method for semi-supervised learning"

on the RCV1 corpus, cleaned as described in the paper (roughly 37M words of news text).

Thanks to Andriy Mnih for inducing these embeddings for us, using his HLBL code. He writes that the embeddings were trained for 100 epochs (3.7B training updates), using a context window of 5 words and a learning rate of 1e-3.

The following files are available:

hlbl-embeddings-original.*.txt.gz
    Original embeddings, without any scaling. The first column is the
    word; the remaining columns are the dimensions of the embedding.

hlbl-embeddings-scaled.*.txt.gz
    Embeddings scaled by 0.1/stddev(embeddings), as described in the
    ACL 2010 paper. These are the embeddings you should use by default
    if you just want word features. The first column is the word; the
    remaining columns are the dimensions of the embedding.

README.txt
    This file.

scale-embeddings.py
    Script used to scale the embeddings.
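
If you want to read these files programmatically, the following is a minimal
Python sketch (not the original scale-embeddings.py) that loads an embeddings
file and applies the 0.1/stddev scaling described above. The example filename
suffix, the whitespace-separated format, and the choice of taking the standard
deviation over all matrix entries are assumptions, not guarantees about the
distributed script.

    # Sketch only: load HLBL embeddings and rescale to stddev 0.1.
    # Assumes whitespace-separated lines: word followed by float dimensions.
    import gzip
    import numpy as np

    def load_embeddings(path):
        """Read a (possibly gzipped) embeddings file into (words, matrix)."""
        opener = gzip.open if path.endswith(".gz") else open
        words, rows = [], []
        with opener(path, "rt", encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                words.append(parts[0])
                rows.append([float(x) for x in parts[1:]])
        return words, np.array(rows)

    def scale_embeddings(matrix, target=0.1):
        """Scale so the overall standard deviation becomes `target`.

        Whether stddev is computed over all entries (as here) or per
        dimension is an assumption about the original script."""
        return matrix * (target / matrix.std())

    if __name__ == "__main__":
        # Replace EXAMPLE with the actual suffix of the file you downloaded.
        words, emb = load_embeddings("hlbl-embeddings-original.EXAMPLE.txt.gz")
        scaled = scale_embeddings(emb)
        for w, row in zip(words, scaled):
            print(w, " ".join("%g" % x for x in row))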