Talk:Million Book Project

Status

It's 2005 and a quick Google search doesn't suggest much about the current status of the project; the latest info in the FAQ at http://www.library.cmu.edu/Libraries/MBP_FAQ.html#current is only recent as of June 2004. Does anyone have more information on this?

-- Schultz.Ryan 01:11, 12 Jan 2005 (UTC)

I note that the following annotation of the web page shows that work is continuing.

 March 20, 2006 -- http://www.library.cmu.edu/Libraries/MBP_FAQ.html 
 Denise Troll, Associate Dean of University Libraries, troll@andrew.cmu.edu

Ms. Denise Troll Covey currently has the title Principal Librarian for Special Projects, Carnegie Mellon. She can be contacted at the e-mail address shown above.

Recent addition

The mass of text below was cut and pasted into the article by Denise Troll Covey. I've moved it here in case anyone is up to turning it into an encyclopedia article and wants to re-add it. -- Stbalbach 21:15, 30 April 2007 (UTC)

Million Book Project update as of April 30, 2007

The Million Book Project has exceeded its goal of digitizing one million books by 2007. The Project inspired other large-scale digitization projects, including Google Book Search, by changing worldwide thinking about the presentation of material found in books.

Leveraging the $3,000,000 provided by the National Science Foundation for equipment and travel, the Million Book Project attracted international partners and matching funds exceeding $100 million. To date the Project has scanned over 1.4 million books in China, India and Egypt, and made great strides in research areas relevant to large-scale, multi-lingual database storage and retrieval.

Though the initial term of the Million Book Project has ended, much work remains to be done. Project partners plan to continue to work together on the following issues:

Intellectual property: Copyright remains the biggest barrier to creating the digital library. In the United States, all materials published after 1963 are protected by copyright for the life of the author plus seventy years. Materials published prior to 1923 are out of copyright. In the interim from 1923 through 1963, copyright required renewal. Estimates are that 90% of the materials published during this period were not renewed and are therefore out of copyright. However, renewal records must be consulted for each title to determine its copyright status. Copyright renewal records were scanned to enable online consultation, and later re-keyed by Distributed Proofreaders to improve accuracy and facilitate searching. Project partner Michael Lesk developed the search system. Nevertheless, the labor cost of manually searching individual titles is prohibitive for large-scale projects. Partners at the Internet Archive are developing software to automate this process.

Machine translation and summarization: The vision of the universal digital library includes automatic translation from any language to any language of both queries submitted and content retrieved. Million Book Project director and director of the Language Technologies Institute (LTI) at Carnegie Mellon, Dr. Jaime Carbonell, has been exploring context-based machine translation, a technique that mines the broad resources of the web to find examples to facilitate translation. Project partners in China and India are also working on machine translation. India, a country with eighteen official languages, is heavily invested in this work. LTI is also developing summarization technology. Automated summaries can help address the dual problems of information overload and lack of time by quickly enabling users to determine relevance if not find exactly what they need. In combination with machine translation, automated summaries can provide people with access to information that might never be translated into their native language. The implications for teaching, learning, research and innovation would be profound.

Improving and providing centralized access to the metadata: The initial plan of the Million Book Project was to host the entire collection at Carnegie Mellon and to have mirror sites around the world. File transfer, however, turned out to be a significant problem for technical and political reasons. Given these hurdles and developments in distributed computing over the past five years, the current plan is for each country to host the material that it scans, but to provide centralized access to the metadata. Inaccuracies and non-standard cataloging practices must be addressed to make this possible. This will be a primary focus of work over the next year.

Usability: The books in the Million Book collection are stored as TIFF files, one file per page. The files are large and fetching each page can be tedious over inferior or busy networks. Project director Dr. Raj Reddy is exploring correcting the optical character recognition text to provide HTML versions of the books or converting the books to Portable Document Format (PDF). HTML and PDF files are much smaller than TIFF files, so transmission would be much faster. Time is a critical factor for students and faculty. The time between page fetches affects reading comprehension. Work must be done to improve the usability of the collection.

Growing the collection: In addition to the work and research described above, project partners aim to continue efforts begun in 2005 to create a critical mass of best practices literature in agriculture around the world. In partnership with the Food and Agriculture Organization, the National Agriculture Library and relevant university libraries, additional agricultural materials will be scanned and added to the Million Book collection. Project partners will also continue to add to the collection books and other materials in different languages and disciplines.

Diversity and education: The Million Book Project has always had goals in support of diversity and education. Our efforts to provide a multi-lingual digital library aim to address the inordinate amount of web content in the English language and the inordinate amount of web content of dubious quality for teaching, learning and scholarship. In conjunction with work on machine translation and summarization, the Project looks to a future where all people can find the quality information they need free-to-read on the web.

Many students rely on the web as their information resource, turning first to Google or another internet search engine to mine the surface web and only secondarily to library licensed, restricted-access resources in the deep web. Print is a third, somewhat unpopular choice. The lack of quality information on the surface web and its impact on student learning is a primary driver of the Million Book Project. The problem is particularly acute in the sciences, where little relevant information is out of copyright and therefore readily available for digitization. Public policy and innovative technology, like machine translation and summarization of factual book content, must be explored to meet the needs and expectations of students, scholars, and lifelong learners everywhere.

The best practices and research initiatives that result from the project will continue to be shared with librarians and scientists worldwide through formal and informal channels. Applied research and best practices will enhance the quality of digitized materials, storage and delivery systems, and ultimately the users’ experience. Providing powerful tools and free-to-read access to materials in many disciplines will support education and lifelong learning. Free access to agricultural collections can help reduce hunger and food insecurity. The Million Book collection will be indexed by Google and other popular search engines. The Project will continue to drive research agendas in many areas, from computer science to public policy.

It's 2009, what has happened since 2007?

Any updates? -- Preceding unsigned comment added by 99.22.220.61 (talk) 00:18, 29 March 2009 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified one external link on Million Book Project. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Added archive https://web.archive.org/web/20120108100043/http://www.ulib.org/ULIBAboutUs.htm to http://www.ulib.org/ULIBAboutUs.htm

When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 18 January 2022).

If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers. -- InternetArchiveBot (Report bug) 04:11, 12 June 2017 (UTC)

External links modified

Hello fellow Wikipedians,

I have just modified 3 external links on Million Book Project. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:

Added archive http://arquivo.pt/wayback/20090723154731/http://www.ulib.org/ to http://www.ulib.org/
Added archive https://web.archive.org/web/20130720095617/http://vlibrary.net/ to http://www.vlibrary.net/
Added archive http://arquivo.pt/wayback/20090723154731/http://www.ulib.org/ to http://www.ulib.org/

Cheers. -- InternetArchiveBot (Report bug) 08:46, 9 December 2017 (UTC)