lis618n03a 04

Published on December 6, 2007

Author: Callia

Source: authorstream.com

Content

LIS618 lecture 4:
Thomas Krichel
2003-10-19

Structure:
Document preprocessing
Practice: Nexis
  document preprocessing
  segment theory and practice
Practice: Factiva

document preprocessing:
There are some operations that may be done to the documents before indexing:
  lexical analysis
  stemming of words
  elimination of stop words
  selection of index terms
  construction of term categorization structures
We will look at those in turn.
In many cases, document preprocessing is not well documented by the provider, but searchers need to be aware of it.

lexical analysis:
Divides a stream of characters into a stream of words.
Seems easy enough, but...
  Should we keep numbers?
  Hyphens: compare "state-of-the-art" with "b-52".
  Removal of punctuation, but consider "333B.C.".
  Casing: compare "bank" and "Bank".

stemming:
In general, users search for the occurrence of a term irrespective of grammar.
Plural, gerund forms, and past tense can be subject to stemming.
There is an important algorithm by Porter.
Evidence about the effect of stemming on information retrieval is mixed.
Stemming is relatively rare these days.

elimination of stop words:
Some words carry no meaning and should be eliminated.
In fact, any word that appears in 80% of all documents is pretty much useless, but consider a searcher for "to be or not to be".
It is better to reduce the index weight of terms that appear very frequently.

index term selection:
In printed indexes, we use nouns only.
Some nouns that appear heavily together can be considered to be one index term, such as "computer science".
Dialog deals with this through phrase indexing.
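As a rough illustration of the idea that words appearing heavily together can form one index term, the sketch below counts adjacent word pairs across a small collection and keeps the pairs that recur. The function name phrase_terms, the min_count threshold, and the sample documents are assumptions made for illustration. This is not Dialog's actual method, and the sketch does not check that the words are nouns.

```python
from collections import Counter


def phrase_terms(documents, min_count=2):
    """Return adjacent word pairs that occur at least min_count times overall."""
    pair_counts = Counter()
    for doc in documents:
        # Naive lexical analysis: lowercase and split on whitespace.
        words = doc.lower().split()
        # Count every pair of neighbouring words in this document.
        pair_counts.update(zip(words, words[1:]))
    return {" ".join(pair) for pair, n in pair_counts.items() if n >= min_count}


# Hypothetical sample collection.
docs = [
    "computer science students",
    "a computer science degree",
    "science of the computer",
]
print(phrase_terms(docs))  # {'computer science'}
```

Only "computer science" occurs twice as an adjacent pair, so it is the only candidate phrase term here.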
Most web engines, however, index all words, and all of them individually.

thesauri:
A thesaurus is a list of words and, for each word, a list of related words:
  synonyms
  broader terms
  narrower terms
It is used
  to provide a consistent vocabulary for indexing and searching
  to assist users with locating terms for query formulation
  to allow users to broaden or narrow a query

use of thesauri:
Thesauri are limited to experimental systems, or some high-quality systems; see http://www.sosig.ac.uk/roads/cgi-bin/thesaurus.pl for an example, or look at Nexis.
It can be confusing to users.
Frequently the relationship between terms in the query is badly served by the relationships in the thesaurus.
Thus thesaurus expansion of an initial query (if performed automatically) can lead to bad results.

Back to Nexis: word limits:
The following are always considered word limits:
  hyphens
  slashes
  parentheses
  spaces

plurals:
Nexis indexes plural and possessive forms as the singular. But in power search, you can use the following:
  PLURAL (term)    only the plural of term
  SINGULAR (term)  only the singular of term
  ALLCAPS (term)   only capitals of term
  NOCAPS (term)    no capitals of term
  CAPS (term)      capitalized term only

Document preprocessing in Nexis:
ampersand: if it is surrounded by blanks, Nexis treats it as "and". If it is not, Nexis treats it as a normal character, as in company(at&t).
apostrophe: works if not followed by "s", in which case it is a possessive.
at-sign: used for sections in case law, ignored otherwise, e.g. in email addresses: presidentwhitehouse.com
colon and comma are read as a space unless adjacent characters are numbers.
hyphen, / and \ are read as a space.
percent and pound sign mean themselves and are not equivalent to anything.
" ? $ ; are all ignored.
® is replaced by the word "R", ™ is replaced by the word "TM".
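The character rules above can be approximated in a few lines. This is a minimal sketch and not Nexis's actual code. The function name normalize_nexis_like is hypothetical, "ignored" is assumed to mean "deleted", "adjacent characters are numbers" is assumed to mean a digit on both sides, and the apostrophe and case-law uses of the at-sign are left out.

```python
import re


def normalize_nexis_like(text: str) -> str:
    """Approximate the character rules listed above (illustrative only)."""
    # An ampersand surrounded by blanks is read as "and"; otherwise it is kept.
    text = re.sub(r"(?<=\s)&(?=\s)", "and", text)
    # The registered and trademark signs become the words "R" and "TM".
    text = text.replace("®", " R ").replace("™", " TM ")
    # Colon and comma are read as a space unless both neighbours are digits.
    text = re.sub(r"(?<!\d)[:,]|[:,](?!\d)", " ", text)
    # Hyphen, slash and backslash are read as a space.
    text = re.sub(r"[-/\\]", " ", text)
    # The at-sign and the characters " ? $ ; are ignored (assumed: deleted).
    text = re.sub(r'[@"?$;]', "", text)
    # Percent and pound signs are left untouched; collapse runs of blanks.
    return " ".join(text.split())


print(normalize_nexis_like("research & development at AT&T"))
print(normalize_nexis_like("state-of-the-art"))
print(normalize_nexis_like("10:30, then 1,000"))
```

Under these assumptions, "state-of-the-art" becomes four separate words, while "AT&T" and "1,000" pass through unchanged.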
equivalents:
Nexis has a number of "equivalents" where, depending on sources, it replaces one with the other. Contrary to their claims, they also work in quick search.
  First (second, third, etc.)     1st (2nd, 3rd, etc.)
  Monday (All days ex. Sunday)    Mon (Tues, Weds, etc.)
  January (Abbreviations work)    Jan (Feb, Mar, etc.)
  One (all numbers < 20)          1 (2, 3, etc.)
  and                             &
  company                         co
  corporation                     corp
  incorporated                    inc

noise and reserved words:
Noise words are common words.
  In power search, noise words are ignored and replaced by a space.
  In quick search, you can use phrases.
  There is no list of noise words.
Reserved words are
  and
  or
  not
They are used in Boolean expressions. They are not indexed.

Nexis segments:
Nexis does some document preprocessing for characters, discussed in a later slide.
The processed document has a number of field/value pairs that are called segments.
Not every source has every segment.
I make a distinction between native and smart-indexed segments.

some segments in legal docs:
  CITE
  CLASS
  DATE          common search for any date field
  FIRST-ACTION  date
  HISTORY
  ISSUED-BY
  LAST-ACTION   date
  NAME
  REFERENCES
  TEXT          full text
  TITLE         same as name
  TYPE

typical segments in news:
  BYLINE
  CORRECTION
  CORRECTION-DATE
  DATE
  DATELINE      (not a date)
  GRAPHIC
  HEADLINE
  HIGHLIGHT
  LEAD
  HLEAD         is HEADLINE, HIGHLIGHT, & LEAD
  PUBLICATION   name and copyright
  SECTION
  SERIES
  SOURCE
  TICKER
  TYPE

typical smart-indexed segments:
  CITY
  COMPANY
  COUNTRY
  GEOGRAPHIC
  INDUSTRY
  KEYWORD
  ORGANIZATION
  PERSON
  PRODUCT
  SUBJECT
  TICKER
  TYPE
  TERMS         includes all these

segment search:
You can place query terms and connectors in a segment and then search for it.
Example: hlead((drug or substance) w/10 abuse)

using segments for news:
uses power search expressions, plus
  hlead (expression) ?
  headline (expression)
  company (expression)   for a company
  byline (expression)    for the author
  show (expression)      for a television show transcript
expression is a Boolean expression or simple keyword.

power search for legal data:
uses power search expressions, plus
  name (expression)   for the name of a party
  cite (expression)   for a citation expression for case law
  title (expression)  for the title of a law article
expression is a Boolean expression or simple keyword.

Search forms:
There are special forms for
  News
  Company reports
  Market indicators
  Portfolio News and quotes about companies

Personal news alert:
Do a search, then click on “track in personal news” to get to a screen where you can enter
  periodicity
  what documents to be sent
  subject
This works for real estate for me.

Real time news:
This uses a different query language:
  terms are implicitly ANDed
  explicit AND and OR are allowed
  phrases have to be put in quotes
  * stands for any number of characters, not just one as in power search
  parentheses can be used
I have poor experience with this.

Summary on Nexis:
Nexis has a rich set of resources.
It can be searched by inexperienced users, but they are likely to get poor results.
Clever learning about its features can get you quite far; however, the features are not well documented online. There is not enough detail.

Factiva:
Nexis is news with legal "stuff". Factiva is news with business "stuff".
It will only work with Microsoft Internet Explorer!
This violates the most important rule of web site design.
It is because they use ASP technology. A bad choice!

Login to factiva:
We have a public account that will serve up to 30 users concurrently up to 2003-12-31.
  user id: mls003
  password: transcripts
  name space: 16
https://global.factiva.com/factivalogin/login.asp has the login.
Sessions time out after 30 minutes.
More on Factiva:
http://www.factiva.com/factiva has downloadable
  brochures
  case studies
  white papers
  product tour
I looked at the brochure "Inside-Out". Well written, ordered copies.

Free text search:
Similar to Nexis power search.
  operators "and" "or" "not" "w/i", "near/n" where n is a number.
  /f/n requires the preceding expression to be in the first n words in the full text.
  "same" stands for same paragraph.
  "atleastn" requires at least n occurrences.
  "wc" is a word count; use <, > and then a number, e.g. wc<1000.

but as well:
You can add codes from indexing terms.
Note that the + shows that there is more.
When you press the triangle, the code is dropped into the text box.

http://openlib.org/home/krichel:
Thank you for your attention!
