Processing Semi Structured Data 1998

Information about Processing Semi Structured Data 1998

Published on October 24, 2007

Author: Kestrel

Source: authorstream.com

Content

Processing Semi-Structured Data:  Processing Semi-Structured Data Kazimierz Subieta Institute of Computer Science PAS Warsaw, Poland [email protected] http://www.ipipan.waw.pl/~subieta Polish-Japanese Institute of Computer Techniques, Warsaw, Poland October 1998 Motivation: Why Semi-Structured Data?:  Motivation: Why Semi-Structured Data? Some data are really unstructured or semistructured: WWW. HTML pages and WWW databases are untyped, unformatted, with dangling links. There is little doubt that they will require efficient programming techniques to deal with their irregularities and inconsistencies (e.g. for implementing active agents). Interoperability. Heterogeneous and/or distributed databases require resolution of missing or conflicting data values that occur when semantically identical data items may have some attribute values different or missing in some data sources. Schema evolution. As a result of schema evolution some new attributes may appear, some may disappear, some may change type, some single attributes can be changed into repeating ones. This naturally leads to irregular data. Data warehouses. They assume collecting decision-support data from heterogeneous sources. Analytical processing of heterogeneous information (statistical analysis, overview reports, data mining, etc.) requires capabilities of integrated query/programming languages that will make querying and processing irregular data possible . Is the Problem New?:  Is the Problem New? No. But it has many new aspects. The problem of semi-structured data is closely related to the classical problem of null values in databases and has about thirty years of history. Many authors proposed extensions of capabilities of database systems for storing and querying irregular, uncertain or missing information: various extensions of the relational algebra, specialized probabilistic DBMSs nested relational algebras dealing with null values querying disjunctive information or so-called “possible worlds” three/four/multi-valued logics A bibliography on uncertainty management by Dyreson presents about 400 papers devoted to this topic. Null values received the first-class citizenship in the relational model and became a component of many important notions such as outer joins and universal relations. Null Values: the-state-of-the art:  Null Values: the-state-of-the art Currently relational databases, SQL and their object-oriented counterparts introduce null values and some simple means to query/process them as a standard feature. However, it is rather a common opinion that the problem remains unsolved. Up to now the theoretical proposals do not fit well with other features of practical database systems, thus they are not implemented, even in prototypes. Pirotte and Zimanyi: “...a great variety of approaches that are not easily comparable, and (...) few practically usable results”. Dubois, Prade and Testamale: “...the problem is generally not well understood, and any attempt to incorporate support for null values into an implemented system should be considered premature at this time'“. These doubts are actually supported by many professionals, including some of the former proponents of theoretical ideas related to null values. Sources of Null Values:  Sources of Null Values Information is irrelevant, for example, “nickname” for a person who has no nickname. Information is known, but not filled yet because data input lasts some time. For example, a new employee is hired, all his/her data are already introduced to the database, except “salary” which has to be input by another administration division. Information is valid only for fixed time; after it must be nullified (to prevent errors) and then input again or recalculated; for example, weather or stock market forecasts. Information is known and could be filled in, but it is considered inessential for the particular domain of applications. For example, in some airline ticket reservation system the nationality of passengers is not filled in for domestic flights. Information consists of “legacy” records, with limited amount of information, and actual records having an extended set of attributes. The result of some processing is an intermediary data structure containing null-valued elements; e.g. outer joins. None of these cases can be qualified as uncertain information. They are related to specific aspects of data storing and processing. Null Values in Theories:  Null Values in Theories The fundamental defect of relational theories w.r.t. null values is the scope mismatch. It concerns the difference between the scope of theories and the scope in which null values need to be considered in practice. Theories usually address an idealized query language with very limited capabilities. The scope for null values is wider and should include: all advanced query constructs, such as grouping, aggregate functions, quantifiers, ordering; imperative constructs based on queries: creating, updating, inserting, deleting; the interface for embedding a query language into a programming language; database semantic enhancements: views, database procedures, integrity constraints, deductive rules, active rules; object-oriented extensions: ADT-s, types, classes, interfaces, encapsulation, inheritance; static typing systems; privacy and security; transaction mechanisms; interfaces for end users and programming interfaces: graphical query lang., 4GLs, .... The scope mismatch is the main reason that theoretical results in this area can be summarised as absolute zero. Ironically, this does not concerns careers of researchers that developed these “theories”: the result is great! Unions/Variants:  Unions/Variants In the domain of programming languages there is another well-known feature dealing with semi-structured data: “variants” in the Pascal family of languages, or “unions” in the C family. Unions and null values are conceptually similar notions, however, not equivalent. For example, if some record type involves n attributes that can be null valued, then the number of corresponding components of a union is 2n. Unions cover the situation when names of attributes are the same but types are different. The programming language community noticed unions and ignored null values. The database community did vice-versa. Unions are proposed in IDL of OMG CORBA, with an explicit discrimination attribute, and in the ODMG model. Unfortunately, ODMG 2.0 contains no construct to deal with unions in the query language OQL and contains no suggestion how unions will be treated by the assumed strong typing system. The issue is not trivial, especially concerning unions with an explicit discrimination attribute. Actually there is no powerful query language consistently dealing with unions. Null Values in SQL:  Null Values in SQL A null value can be stored in a database and returned by an SQL expression if it cannot be evaluated correctly. This may happen because of null-valued arguments of an operation, as well as in the case of wrong arguments of operations. For example, function sum returns NULL for an empty argument table. SQL does not allow explicit comparisons of nulls and ordinary values but involves a special predicate is [not] null. A special function if_null returns an attribute's value if it is not null, or some constant value otherwise. Comparison of an expression which returns NULL with any other value returns a third truth value UNKNOWN; in the WHERE clause, however, it is equivalent to FALSE. In embedded SQL the admission of null values requires for every such attribute two variables in an underlying host language program. An indicator variable is used to store boolean information determining if the particular value of an attribute is null. Flaws of Null Values in SQL:  Flaws of Null Values in SQL The SQL solutions related to null values are semantically inconsistent. C.J.Date presents a severe criticism of null values in the relational model (“...the null value concept is far more trouble than it is worth”, “...the SQL null value concept introduces far more problems than it solves”). Anomalies implied by null values: Although in SQL officially every null value is distinct (thus A = B returns UNKNOWN if both A and B return nulls), the group by and unique operators treat them as identical. Aggregate functions sum, avg, min, max ignore null values. If relation R contains numerical attributes A and B which could be null valued, than in general select sum(A+B) from R may return the result different from select sum(A) + sum(B) from R. Date concludes that these flaws are not only the property of an inadequate SQL design, but they are the inevitable consequence of the idea of null values. Unfortunately, this idea of null values is continued and even extended in SQL3. Default values:  Default values As an alternative to null values Date and other authors propose the concept of default values. They are ordinary values that are filled in into a tuple in the case of absent information. Default values can be determined in a relational schema. For example, the schema declare SUPPLIER relation SNO char(4) nodefault, SNAME char(15) default( “ “ ), STATUS int default ( -1 ), CITY char(20) default (“???” ), primary key SNO; presents defaults for attributes SNAME, STATUS, and CITY as a string of spaces, integer -1, and a string of three question marks, correspondingly. An advantage of default values over null values is that they need not any special options in a query/programming language to process them. However, default values do not solve all problems, they are not universal and they are error-prone. The New Wave: Semi-Structured Data:  The New Wave: Semi-Structured Data A new wave known under the term “semi-structured data” approaches the problem from a different angle (and actually does not refer to null values), but inherits a lot of problems and doubts related to null values. The main limitation of relational approaches is repeated! Querying semi-structured data is not enough! Any data processing system must eventually deal with complete computational and pragmatic power of user/programmer interfaces. Features related to semi-structured data should be combined with all aspects of database systems: database design, data description, query languages, programming capabilities, typing, object-orientedness, procedures, views, active rules, transactions, security, catalogs,.... Related Works:  Related Works 1. Management of semi-structured data TSIMMIS (Stanford) INRIA Rufus UnQL (Uni Penn) 2. Semantic markup technique HTML generation and semantic markup Lightweight data mark-up 1. Combine Web information service and Database system 2. Intermediate Representation Model for semi-structured data (HTML) importing AVPL (Semantic tag markup) 3. Preserve Data Model and Query Facility Introduce the concept of schema construction 4. Can be used as a Database Entry System independently Applicable to datawarehousing Mark-up, Wrappers and Mediators:  Mark-up, Wrappers and Mediators Wrappers and mediators are new architectural elements of processing unstructured (semi-structured) information. Mark-up: special tags attached to e.g. the content of a Web page allowing to determine a structure of plain text. Wrapper: a piece of software that converts (virtually) “external” interface to “internal” interface. A wrapper allows “to see” and process some data in a manner that is presents some standard. Mediator: a piece of software on higher abstraction level that allows to convert requests (queries) from higher abstraction level to lower level. The concept is close to the concept of database view (updatable). Wrappers and mediators increase the level of data independence. They make it possible the access to data in various formats. They isolate the programmer from details of internal data organization. They increase the level of abstractions allowing to query and process data by interfaces that are more user friendly. The precise meaning of these terms is however not clear. The rules of designing wrappers and mediators are very informal. Current Approaches to Semi-Structured Data:  Current Approaches to Semi-Structured Data After application of mark-up techniques, wrappers and mediators semi-structured data should have some form that is more convenient to formal treatment. In comparison to classical database models (network, hierarchical, relational, object-oriented) the main novelty of the approaches to semi-structured data is that data names are kept together with data values. A data structure is viewed as a graph with nodes representing some values (or internal identifiers) and edges representing data names (i.e. logical access paths). Every instance of some conceptual entities can be assigned different attributes. In Tsimmis objects are triples <identifier, label, value>, where identifier is a unique internal object identification, label is a data name, and value is some atomic value (integer, string, etc.) or a set of objects. The database is a pair <O, N>, where O is a set of objects, and N (a subset of O) is a set of entry points to the database. A similar approach (but without identifiers) is assumed in UnQL. Almost identical assumptions (the most general) are assumed in the stack-based approach. TSIMMIS at Stanford:  TSIMMIS at Stanford Lightweight Object Model: Object Exchange Model (OEM), like AVPL, O2’s Complex Value Lightweight Repository LORE Lightweight Query Language LOREL LOREL select -from-where Clause: SELECT Frodos.Group.Name FROM Frodos WHERE Frodos.Group.Category=“Opera” SQL sugar + path expressions (limited). The retrieval power is impossible to determine. Object-orientation (classes, methods, encapsulation, inheritance): not considered The integration with fully-fledged programming capabilities: not considered. Conclusion: limited goals that will be not enough for many applications. Example - Movie Database:  Example - Movie Database Entry Entry Entry Movie Title “Casablanca” Cast Director “Bogart” “Bacall” Title Cast “Play it again, Sam” Credi Actors “Alan” Actors TV Show Title Cast Episode 1 2 3 Special Guests Data labels (names of objects, names of attributes, etc.) are kept together with data values. Relational Databases are a Particular Case:  Relational Databases are a Particular Case A B C “a” “b” 2 4 3 5 C D 3 4 5 “c” “d” “e” R1 R2 R1 R2 Tup Tup Tup Tup Tup A B C A B C C C C D D D “a” 2 3 “b” 4 5 3 “c” 5 “d” 5 “e” Querying Paradigms:  Querying Paradigms LISP-like. Data structures are lists which are served by special functions (low level). Data structure is a graph. A query determines a sub-graph of it (UnQL). Comprehensions syntax, structural recursion as an intellectual support. Navigation in a labeled graph. Path expressions + conditions determine the target to be retrieved (Lorel). Extensions to SQL syntax: special constructs testing the presence and absence of data (Lorel). Stack based. A query language is based on binding names occuring in a query according to the state of the environment stack. The stack is managed by query operators (Loqis). My Old Experience: The Great Emigration:  My Old Experience: The Great Emigration In late 70-ties I implemented the system “The Great Emigration” devoted to historical research. The system recorded data on Polish emigrants (about 15000) after the unsuccessful November Revolution against Russian domination in 1830-31. The data were collected from police archives all over the Europe, mostly from France. Each emigrant was described by about 50 attributes, such as AttendedSchools, Opinions, StayingPlaces, Duels, Jobs, etc. Each attribute might be optional (null valued), repeated and complex; each occurence of an attribute might be (optionally) associated with dates. To the purpose of this system I implemented a powerful query/programming language, which allowed us to ask non-trivial queries; for example “Identify the groups of emigrants according to the same time of their staying in at least three places”. The generic part of this system we upgraded to a universal DBMS called LINDA. This system had several other applications, e.g. for processing medical data. It may be interesting to note that the LINDA data model was almost identical with the Tsimmis data model (!). The Stack-Based Approach: An Abstract Store Model:  The Stack-Based Approach: An Abstract Store Model I - a set of internal identifiers N - a set of external names V - a set of atomic values < i, n, v > < i1, n, i2 > < i, n, T > T is a set of (any) objects atomic object pointer object complex object Store: A set of objects + A set of identifiers (roots) + some obvious constraints (uniqueness of identifiers, referential integrities) No record, tuple, array, set, and bag constructors in the model: essentially all of them are collections of objects (“environments”). No uniqueness of external names on any level of data hierarchy: modeling bulk data. The model allows us to treat uniformly relational, nested-relational, functional, object-oriented databases. Ordering, encapsulation and classes require some extension of the model. later Tiny Database :  Tiny Database TinyDatabase i1 EMP i9 EMP i5 EMP i13 DEPT i17 DEPT i2 NAME Brown i10 NAME Jones i6 NAME Smith i3 SAL 2500 i7 SAL 2000 i11 SAL 1500 i4 WORKS_IN i13 i12 WORKS_IN i17 i8 WORKS_IN i17 i14 DNAME Toys i18 DNAME Sales i15 LOC Paris i16 LOC London i19 LOC Berlin A Complex Object:  A Complex Object i1 EMP i2 ENO e127 i3 NAME Smith i4 JOB designer i6 SAL 7500 i5 WORKS_IN i17 i10 PREV_JOB i11 COMPANY ICL i12 WHEN 1977-90 i7 PREV_JOB i8 COMPANY IBM i9 WHEN 1975-77 Universality of the Store Model:  Universality of the Store Model Complex hierarchical objects can be defined (no limit in levels) Programming variables can be defined We abstract from the persistence status of objects and variables, i.e. we define in the same way persistent and transient objects Binary relationships (associations) can be defined via pointer objects. Ternary and higher order relationships, attributes of relationships: we do not deal with them - they must be decomposed into binary ones. Bulk data: we deal with sets/bags. They are modeled by the same name assigned to many objects on the same hierarchy level Relational structures: each tuple is understood as an object with subobjects. Relativity: no difference in the treatment of objects on any hierarchy level. big advantage for the universality, minimality and simplicity of semantics. Semi-Structured Data:  Semi-Structured Data Dynamic arrays: the only difference is that names of sub-objects are first-class. Variants/unions: different structures of sub-objects. Null-values: absence of some sub-objects (same approach as for variants). Consistent! ... in contrast to the SQL approach. i22 KIND part-time i21 NAME Brown i20 EMP Null Type-less, Schema-less?:  Type-less, Schema-less? There is misunderstanding concerning the terms “type-less” and “schema-less”. Type-less structures have their advantages (dynamic flexibility) and disadvantages (error prone, storage overhead). However, “type-less” does not mean “schema-less”. An important aspects of types concerns conceptual modeling. Types not only constraint data: they also inform the user what the database contains; i.e. they represent informal data semantics. The user must be informed what the database could contain by a sufficiently precise statement. This statement presents the database schema. It is possible that the schema is not implemented in the system (thus do not present a typing constraint), but it must be known for the user for correct formulation of queries. Such a schema is also necessary for writing wrappers. For example, the designer of a wrapper must take decisions concerning data names (labels of graph edges). Such decisions must be supported by a data schema. Processing Semi-Structured Data: Assumptions of the Stack-Based Approach:  Processing Semi-Structured Data: Assumptions of the Stack-Based Approach A special null value, understood as a generic property of a database system, is avoided. Default values are useful but do not solve the problem of semi-structured data. Default values are the properties (invariants) of classes. The approach unifies null values, unions and repeating data. These cases irregularities are modeled by absent objects. Every kind of objects (atomic, complex, reference objects) can be absent. Such irregularities are served by standard query facilities. The classical environmental stack is used to express the semantics of query operators in terms of the naming, scoping and binding mechanism. The same mechanisms should be used for determining imperative statements, procedures, methods, views and other procedural abstractions. The query/programming interface should have full programming power. Some technique of typing semi-structured data should be provided. Optional, Alternative and Repeating Data:  Optional, Alternative and Repeating Data Classes and Inheritance:  Classes and Inheritance Default Values:  Default Values They are stored inside class objects as their invariants. Research Problems::  Research Problems: Assuming semi-structured data are modeled by absent (sub) objects, there are the following problems: Consistent integration of this idea with object-orientedness Powerful constructs querying data with absent (sub)objects Consistent integration of this idea with imperative constructs Consistent integration of this idea with programming abstractions Consistent integration of this idea with schema declarations and typing Methodologies and technologies for writing wrappers for unstructured data Methodologies for designing integrators of heterogeneous sources Methodologies for analysis and design of IS based on unstructured data. We believe that these problems should be treated with respect to all programming issues. Querying semistructured data is not enough. Simplistic theories concerning semi-structured data are not progress.

Related presentations


Other presentations created by Kestrel

English Literature
16. 02. 2008
0 views

English Literature

birth control
10. 10. 2007
0 views

birth control

ramasetu22may2007 1796
30. 09. 2007
0 views

ramasetu22may2007 1796

purpletree
07. 10. 2007
0 views

purpletree

Verdi
08. 10. 2007
0 views

Verdi

preschool
11. 10. 2007
0 views

preschool

playingwithliterature
12. 10. 2007
0 views

playingwithliterature

mlin2
15. 10. 2007
0 views

mlin2

cartoon pool a d
15. 10. 2007
0 views

cartoon pool a d

J1
16. 10. 2007
0 views

J1

Corps Safety Council 24 Jul 06
18. 10. 2007
0 views

Corps Safety Council 24 Jul 06

reliable multicast
03. 10. 2007
0 views

reliable multicast

Discovery of Cosmic Rays
13. 10. 2007
0 views

Discovery of Cosmic Rays

State of Georgia Sunum
17. 10. 2007
0 views

State of Georgia Sunum

North Africa
23. 10. 2007
0 views

North Africa

slides ch10
01. 11. 2007
0 views

slides ch10

LaoTzu2
04. 12. 2007
0 views

LaoTzu2

drying
07. 12. 2007
0 views

drying

Mikheev SPM210207
15. 10. 2007
0 views

Mikheev SPM210207

benelux intervet 09 12 2004
16. 10. 2007
0 views

benelux intervet 09 12 2004

mpls
29. 10. 2007
0 views

mpls

Network Series Overview
30. 10. 2007
0 views

Network Series Overview

wu zixin defense
01. 11. 2007
0 views

wu zixin defense

Chapter1
02. 11. 2007
0 views

Chapter1

20050812
23. 10. 2007
0 views

20050812

olnes pki interoperability
07. 11. 2007
0 views

olnes pki interoperability

TomGibbs
21. 10. 2007
0 views

TomGibbs

Presentaci n RSC EFQM francais
16. 11. 2007
0 views

Presentaci n RSC EFQM francais

Parts of a Flower
14. 12. 2007
0 views

Parts of a Flower

L1Ch3 2
17. 12. 2007
0 views

L1Ch3 2

dyfi pathfinder presentation
02. 11. 2007
0 views

dyfi pathfinder presentation

Robot
03. 01. 2008
0 views

Robot

2004Summary
03. 01. 2008
0 views

2004Summary

sdocjung e
09. 10. 2007
0 views

sdocjung e

MOM 7
15. 10. 2007
0 views

MOM 7

01 what is AI
04. 01. 2008
0 views

01 what is AI

Do Air Cleaners Work
19. 11. 2007
0 views

Do Air Cleaners Work

transporte maritimo en CA
22. 10. 2007
0 views

transporte maritimo en CA

Wind Hedges905 2 ppt
04. 10. 2007
0 views

Wind Hedges905 2 ppt

richard hansberger
03. 10. 2007
0 views

richard hansberger

EspositoNOW2006
15. 10. 2007
0 views

EspositoNOW2006

Lochbuhler
17. 10. 2007
0 views

Lochbuhler

PRES Canada US Market
29. 10. 2007
0 views

PRES Canada US Market

davereese july99
30. 10. 2007
0 views

davereese july99

884 philmont05 finalpreparations
28. 11. 2007
0 views

884 philmont05 finalpreparations

ch2 voc2
20. 02. 2008
0 views

ch2 voc2

1 74presentation
19. 10. 2007
0 views

1 74presentation

The Causes of War
06. 03. 2008
0 views

The Causes of War

BF CCC conf 2
23. 11. 2007
0 views

BF CCC conf 2

IndustryForumSlidesF eb16
11. 03. 2008
0 views

IndustryForumSlidesF eb16

EAGER UNECE 2007
15. 10. 2007
0 views

EAGER UNECE 2007

spin05 coverity
26. 02. 2008
0 views

spin05 coverity

IM845
24. 03. 2008
0 views

IM845

PresentationAbridged
03. 10. 2007
0 views

PresentationAbridged

chap08 ppt10thedition
08. 04. 2008
0 views

chap08 ppt10thedition

Be Happier Smarter And Healthier
26. 09. 2007
0 views

Be Happier Smarter And Healthier

benefits e
25. 10. 2007
0 views

benefits e

crec ipv6 apan 2
09. 10. 2007
0 views

crec ipv6 apan 2

Cisco Dec4
18. 04. 2008
0 views

Cisco Dec4

Graphplan
30. 10. 2007
0 views

Graphplan

Patoka Final ALS poster
26. 11. 2007
0 views

Patoka Final ALS poster

real options
28. 04. 2008
0 views

real options

Anderson
07. 05. 2008
0 views

Anderson

The Future of Healthcare
08. 05. 2008
0 views

The Future of Healthcare

Lecture22
08. 05. 2008
0 views

Lecture22

prairiechicken
03. 01. 2008
0 views

prairiechicken

PediatricObesityProb lem
30. 04. 2008
0 views

PediatricObesityProb lem

Innovation
02. 05. 2008
0 views

Innovation

PassingGas Dr Railton
02. 05. 2008
0 views

PassingGas Dr Railton

Guide Lines for Bladder Cancer
02. 05. 2008
0 views

Guide Lines for Bladder Cancer

MECH500
02. 05. 2008
0 views

MECH500

l3
23. 11. 2007
0 views

l3

smallpoxmaster
21. 10. 2007
0 views

smallpoxmaster

endeavour 1 00
21. 11. 2007
0 views

endeavour 1 00

backgr
19. 10. 2007
0 views

backgr

Copenhagen
12. 10. 2007
0 views

Copenhagen

presentacion valparaiso
24. 10. 2007
0 views

presentacion valparaiso

ZP582pp
23. 10. 2007
0 views

ZP582pp

pre e301
21. 10. 2007
0 views

pre e301

Riesgo
31. 10. 2007
0 views

Riesgo

Gallick 1 10 07
16. 10. 2007
0 views

Gallick 1 10 07

podcastingagnews
08. 10. 2007
0 views

podcastingagnews

Berton Astronomie par Internet
15. 11. 2007
0 views

Berton Astronomie par Internet

jean michel lauren
11. 04. 2008
0 views

jean michel lauren

Shining Examples 2007
29. 10. 2007
0 views

Shining Examples 2007

lecture9
10. 12. 2007
0 views

lecture9

kathyppt
06. 11. 2007
0 views

kathyppt

Lawrence CAS2K3
28. 12. 2007
0 views

Lawrence CAS2K3

swieta
20. 11. 2007
0 views

swieta