Test Collections (Under Construction)
Test collections are standard data sets used to measure the effectiveness of
information retrieval systems.  Most were originally developed to support
research on IR, but practitioners often find them useful as well.  Here are a
few widely used ones:
RCV1 (Reuters Corpus Volume 1): A large, high-quality, recently released
collection of news stories.  Likely to become the new standard benchmark in
text categorization research.  
TREC-AP: A text categorization task based on the Associated Press articles
used in the NIST TREC evaluations.