In conjunction with the celebration of 50 years of campus journalism at the University of Waterloo, Imprint made a push to put all 50 years' worth of its archives online. I took the lead on this, as I was managing the Imprint website at the time.
We received scanned copies of all the archives in PDF format. To incorporate them into our existing system and make them easy to browse, we needed certain pieces of information for each issue, such as the volume and issue number. The files were all named according to a convention that encoded every required piece except a key one: the date. There was no easy way to get it other than opening each PDF file and recording the date manually, and with 50 years' worth of archives, that meant a lot of PDF files to sift through.
Given access to a healthy and willing volunteer base, I decided to use “community” power to speed up the process. I set up a quick PHP script that extracted the volume and issue information from each file name, and on top of that I put a simple interface that presented a PDF file and asked the user to enter its date. With volunteers working at the task in their spare time, we managed to codify 50 years' worth of archives (1,000+ issues) in less than a week, whereas going at it alone would easily have taken me over a month.
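The original PHP script is not reproduced here. As a rough illustration of the idea, here is a minimal Python sketch of the two pieces involved: pulling the volume and issue number out of a file name with a regular expression, and checking the date a volunteer types in. The file-name convention (`imprint_v12_i05.pdf`), the `YYYY-MM-DD` date format, and the names `parse_archive_filename` and `parse_volunteer_date` are all assumptions made up for this example, not the actual convention or code used for the archives.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional

# Hypothetical naming convention, e.g. "imprint_v12_i05.pdf".
# The real archive files may well have been named differently.
FILENAME_PATTERN = re.compile(
    r"imprint_v(?P<volume>\d+)_i(?P<issue>\d+)\.pdf",
    re.IGNORECASE,
)


class IssueInfo(NamedTuple):
    volume: int
    issue: int


def parse_archive_filename(filename: str) -> Optional[IssueInfo]:
    """Return the volume and issue encoded in the file name, or None."""
    match = FILENAME_PATTERN.fullmatch(filename)
    if match is None:
        return None
    return IssueInfo(int(match.group("volume")), int(match.group("issue")))


def parse_volunteer_date(text: str) -> Optional[date]:
    """Validate a date typed in by a volunteer (assumed format: YYYY-MM-DD)."""
    try:
        return datetime.strptime(text.strip(), "%Y-%m-%d").date()
    except ValueError:
        # Covers both malformed input and impossible dates such as Feb 30.
        return None


if __name__ == "__main__":
    print(parse_archive_filename("imprint_v12_i05.pdf"))
    print(parse_volunteer_date("1978-06-15"))
```

Rejecting bad input at entry time matters more than usual in a setup like this, since many different people are typing dates and a typo would otherwise end up in the database unnoticed.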
Take a walk down history lane here: Imprint Archives
Update (2008-06-04): I recently upgraded the archives to make use of the Scribd API to generate on-the-fly “iPaper” versions of PDFs in the archives. The main benefits of this are savings in bandwidth usage (since some of the PDFs are ~100MB) and a faster, enriched user experience.
Tools used: Healthy and willing volunteers, PHP, MySQL, Regular Expressions