califa newspapers

Published on January 21, 2008

Author: Bina

Source: authorstream.com

Content

Handling Digital Newspapers:
- Geri Ingram, OCLC Digital Collection Services Manager, Customer Services
- October 18, 2007

Slide2:
- Why digitize newspapers?

Why digitize?
- Because your public wants access
- Planning is an access issue
- Selection is an access issue
- Processing is an access issue
- Metadata is an access issue
- Preservation is an access issue
- Funding is an access issue

The newspaper paradox - the good, the bad and the ugly:
- Widespread use, profoundly embraced, long-lived record in America
- Cheaply produced for an ephemeral existence
- Difficult to make searchable, yet everyone from historians to senior genealogists is searching them online
- Difficult to preserve, yet fundamental to the historical record

Planning is an access concern:
- Project mission answers
  - What is it that we are providing?
    - Intellectual access to newspapers
    - Now and in future (i.e., preservation)
  - For whose benefit?
    - Present in context
    - For local, regional, global audiences
  - How are we providing access?
    - Browsing, searching, full-text, clippings…?

Plan collaborative projects:
- Enjoy synergies
- Complementary material
- End users enjoy consistent and richer experience
- Staff skill-sets shared
- Projects that are grouped by format are more efficiently done
- By a team whose skills can be leveraged

Selection is an access concern:
- Let the users drive: what do they want, and how do they want it?
- End users often prefer to search by topic, subject, keyword, and by dates
- Not always concerned with format at first
- Papers are just another material type, now with a digital format.

Reformatting what you have - a modest proposal:
- Consider whether you have already selected these works
- Titles cataloged? Filmed? Card indexes?
- Should digitization be a mainstreamed part of processing operations?
- Do you have complete runs?
- Even if incomplete, may complement a topical repository

Topical repository comprising many formats:

Processing is an access concern:
- How your users will access your materials informs critical processing decisions
- Processing involves
  - Digitization (scanning paper or microfilm)
  - Optical Character Recognition (OCR)
  - Metadata generation

From do-it-yourself to turnkey:
- Desktop scanning from small-format paper (newsletters) with OCR on the fly, no article segmentation
- Large format scanner for standard newspaper
- Outsourced scanning from (paper or) film

Digitization - scanning:
- Options: favor the right source for access
- Scan
  - Scan printed materials or
  - Scan from microfilm
- Note: some funders will not bear the cost of scanning from paper

Digitization: the quality of the source:
- Unique issues with historical collections
- If microfilm, quality may vary over time and vendors
- If paper, will you film? Is the paper itself important? If so, film to preserve. Which gives better resolution?
- Best practice: sample across time and materials from contributing partners, to test feasibility of long, complete runs

Digitization basics:
- Resolution
  - This is the ability to distinguish fine spatial detail
  - It is usually expressed as dots-per-inch (dpi) or pixels-per-inch (ppi)
  - These terms are synonymous, but dpi usually refers to printed images and ppi to screen images

Digitization basics:
- Resolution
  - This is also sometimes referred to in absolute terms
  - Actual pixel dimensions are given, 3000 x 2400 for example
  - These are the pixel dimensions of a 10 x 8 inch image scanned at 300 dpi.

Digitization basics:
- Bit depth
  - This is the number of bits (binary digits) used to define each pixel.
  - The greater the bit depth, the greater the number of tones (grayscale or color) that can be represented.
  - Black and white (bitonal) = 1 bit per pixel
  - Grayscale = 8 bits per pixel (256 shades of gray)
  - Color = 24 bits per pixel (16.7 million color tones)

Slide18:
- From the Cornell Digital Imaging Tutorial

Slide19:
- Some common problems
  - 1. Curved characters and images
  - 2. 'Noise'
  - 3. Broken characters
  - 4. Scratches

Optical Character Recognition (OCR):
- Necessary for full-text searching
- Depending upon display software, generated searchable text may be
  - Edited
  - Hidden

Technical concerns for display and OCR:
- Output files may include: TIFF archival masters, JPEG2000, JPG, bound PDFs
- Scanning resolution, bit depth
  - Higher resolution, larger file size (more bytes)
  - Colors create very large files
  - OCR performs best with appropriate resolution (no noise please!)
- Image processing (de-skew, crop, sharpen, page segmentation, article segmentation)
- To zone or not to zone for article-level handling
- Note that some funding may not cover extra cost for article segmentation processing

Metadata application is an access concern:
- Searchable information rules!
- We live in a time of unfathomable recall; we need precision searching
- Users search metadata, but for newspapers, need full-text searching for topics, names, etc.
- Information needed in context
- Metadata and structure provide context

Users want precise searching, context-rich results:

First things first - some processes are incremental; some iterative:
- Metadata collection
  - Accurate file naming scheme will generate some
  - Hand keying can be done later to supplement and/or correct important elements
  - Authority control tools available for verification

Start somewhere! Present full-text, even with minimal metadata:
- Users often search newspapers by personal names, topics/keywords
- Use the presentation tools to create 'canned' queries, e.g., records by type (birth, marriage)
- Where copyright questionable, explain in metadata, and restrict viewing AFTER digitizing.

Recommended metadata elements for digitized newspapers:
- At the 'run' or title level
  - Title
  - Publisher
  - Date published
  - Place of publication
  - Issue
- At the issue level
  - Page number

Article-level segmentation:
- Data can be generated during digitization process
- Some presentation systems can use it

Full article segment highlighting and extraction:

Preservation is an access concern:
- Protect high use, single-copy sources first
- Intellectual versus artifactual value
- Preserve what you've got
- And invite partners with complementary runs
- Preserve what you've processed (cataloged, filmed, indexed) already

When paper is to be preserved; maintaining original paper:
- Storage space
  - 60 – 70 degrees F.
  - 40 – 50 % relative humidity
- Storage and handling
  - Stored flat
  - Brittle
  - Clippings

Example - preserving for access:
- Active, crumbling, paper collection
- Used by genealogists
  - Local, regional, state
- And by historians
- Fire department unhappy…

Funding is an access concern:
- Demonstrate cradle to grave processing on small testbed
- Local and national funders are motivated by access to historical newspapers
- ONLINE is FUNDABLE!
- Commercial newspapers are selling online ads first, giving away current content.

What is the U.S. National Digital Newspaper Program (NDNP)?
- “The National Digital Newspaper Program (NDNP) is a partnership between the NEH and the Library of Congress … to provide enhanced access to United States newspapers. Over a period of approximately 20 years, … a national, digital resource of historically significant newspapers from all the states and U.S. territories published between 1836 and 1922. … searchable database will be permanently maintained at the Library of Congress (LC) and be freely accessible via the Internet.”
- A prototype of this digital resource: "Chronicling America: Historic American Newspapers"

Who can play?
- Partnership between National Endowment for the Humanities (NEH) and the Library of Congress (LC)
- Offers grant funding to any non-profit US organization
- Provides digitization standards and guidelines that increase efficiencies and cost effectiveness
- Pays for processing from film; no segmentation
- Yearly cycles, collaboration helpful
- Next deadline Nov 7th for projects starting after July 2008

Slide36:
- OCLC Preservation Services and CONTENTdm
- Newspapers: print to digital (based on NDNP guidelines)
  - 1. Start with microfilm
  - 2. Convert to digital: TIFF, 400 dpi, grayscale; OCR
  - 3. Generate data, metadata: available metadata (e.g., title, year, month…); from OCR = ASCII; structured data (METS/ALTO)
  - 4. Generate database: standards-based XML, JPEG2000, PDF; import into CONTENTdm server
  - 5. Search, access, view newspaper

Slide37:
- NDNP technical overview specifications
- Archival Master: TIFF
  - Conforms with TIFF 6.0
  - 8-bit grayscale
  - 400 dpi preferred
  - Uncompressed
  - Only deskewing should be applied
  - Cropped to page edge
  - Additional TIFF tags required
- Production Master: JPEG 2000
  - Conforms with JPEG 2000, Part 1 (.jp2)
  - Use 9-7 irreversible (lossy) filter
  - Compressed to 1/8 of the TIFF or 1 bit/pixel
  - Tiling, but no precincts
  - RDF/Dublin Core metadata in XML box
- Derivative: PDF
  - Compatible with Acrobat 5.0 (PDF 1.4)
  - Image with text behind
  - Image will be a grayscale, 150 dpi JPEG, using a medium (or 40) quality setting
  - XMP/RDF/Dublin Core metadata
- OCR text: ALTO
  - Conforms with ALTO (Analyzed Layout and Text Object) schema
  - ALTO is product of EU-funded METAe project
  - Mapping of OCR'ed text to image coordinates

Resources for evaluation and study:
- http://www.loc.gov/ndnp/
- http://www.neh.gov/projects/ndnp.html
- http://www.loc.gov/chroniclingamerica/

Reference contacts:
- Geri Ingram, Manager, Digital Collection Services, OCLC, [email protected], 760.931.9313
- Gayle Palmer, Manager, Digitization and Preservation Programs, OCLC Western Services, [email protected], 800.854.5753
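
The resolution and bit-depth figures from the digitization basics slides, together with the NDNP master-file specifications, lend themselves to rough storage arithmetic. The sketch below is illustrative only: the function names and the 18 x 24 inch page size are assumptions that do not come from the presentation, and the sizes ignore file headers and embedded metadata.

```python
def pixel_dimensions(width_in, height_in, dpi):
    """Pixel dimensions of a scan: inches multiplied by dots per inch."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw image size in bytes, ignoring headers and metadata."""
    w, h = pixel_dimensions(width_in, height_in, dpi)
    return w * h * bits_per_pixel // 8


# The slide's example: a 10 x 8 inch image scanned at 300 dpi.
print(pixel_dimensions(10, 8, 300))

# Number of tones available at each bit depth named on the slides.
for bits in (1, 8, 24):
    print(bits, "bits per pixel ->", 2 ** bits, "tones")

# Hypothetical 18 x 24 inch newspaper page (the page size is an assumption),
# scanned as an 8-bit grayscale, 400 dpi, uncompressed TIFF master.
tiff = uncompressed_bytes(18, 24, 400, 8)
# JPEG 2000 target of 1/8 of the TIFF, which is 1 bit per pixel
# when the source is 8 bits per pixel.
jp2 = tiff // 8
print(f"TIFF master: about {tiff / 1e6:.1f} MB")
print(f"JPEG 2000 at 1/8: about {jp2 / 1e6:.1f} MB")
```

Multiplying the per-page figure by the number of pages in a run gives a first estimate of the storage a project has to plan for before any scanning starts.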

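The slide on incremental metadata collection notes that an accurate file naming scheme will generate some of the metadata. A minimal sketch of that idea follows, assuming a hypothetical naming pattern of `<title-code>_<YYYY-MM-DD>_p<page>.tif`. The pattern, the function name, and the sample file names are illustrative assumptions, not a scheme taken from the presentation or from NDNP.

```python
import re
from datetime import date

# Hypothetical naming scheme: <title-code>_<YYYY-MM-DD>_p<page>.tif
FILENAME_PATTERN = re.compile(
    r"^(?P<title_code>[a-z0-9]+)_"
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})_"
    r"p(?P<page>\d+)\.tiff?$"
)


def metadata_from_filename(filename):
    """Derive basic issue and page metadata from a scan's file name.

    Returns None when the name does not follow the assumed scheme, so
    such files can be set aside for hand keying later.
    """
    match = FILENAME_PATTERN.match(filename.lower())
    if match is None:
        return None
    try:
        issue_date = date(
            int(match["year"]), int(match["month"]), int(match["day"])
        )
    except ValueError:  # impossible calendar date, e.g. month 13
        return None
    return {
        "title_code": match["title_code"],
        "date_published": issue_date.isoformat(),
        "page_number": int(match["page"]),
    }


print(metadata_from_filename("dailybugle_1907-10-18_p003.tif"))
print(metadata_from_filename("scan0001.tif"))
```

Fields that a file name cannot carry, such as publisher or place of publication, would still need to come from cataloging records or hand keying.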