Published on October 24, 2007
Processing Semi-Structured Data: Processing Semi-Structured Data Kazimierz Subieta Institute of Computer Science PAS Warsaw, Poland [email protected] http://www.ipipan.waw.pl/~subieta Polish-Japanese Institute of Computer Techniques, Warsaw, Poland October 1998 Motivation: Why Semi-Structured Data?: Motivation: Why Semi-Structured Data? Some data are really unstructured or semistructured: WWW. HTML pages and WWW databases are untyped, unformatted, with dangling links. There is little doubt that they will require efficient programming techniques to deal with their irregularities and inconsistencies (e.g. for implementing active agents). Interoperability. Heterogeneous and/or distributed databases require resolution of missing or conflicting data values that occur when semantically identical data items may have some attribute values different or missing in some data sources. Schema evolution. As a result of schema evolution some new attributes may appear, some may disappear, some may change type, some single attributes can be changed into repeating ones. This naturally leads to irregular data. Data warehouses. They assume collecting decision-support data from heterogeneous sources. Analytical processing of heterogeneous information (statistical analysis, overview reports, data mining, etc.) requires capabilities of integrated query/programming languages that will make querying and processing irregular data possible . Is the Problem New?: Is the Problem New? No. But it has many new aspects. The problem of semi-structured data is closely related to the classical problem of null values in databases and has about thirty years of history. Many authors proposed extensions of capabilities of database systems for storing and querying irregular, uncertain or missing information: various extensions of the relational algebra, specialized probabilistic DBMSs nested relational algebras dealing with null values querying disjunctive information or so-called “possible worlds” three/four/multi-valued logics A bibliography on uncertainty management by Dyreson presents about 400 papers devoted to this topic. Null values received the first-class citizenship in the relational model and became a component of many important notions such as outer joins and universal relations. Null Values: the-state-of-the art: Null Values: the-state-of-the art Currently relational databases, SQL and their object-oriented counterparts introduce null values and some simple means to query/process them as a standard feature. However, it is rather a common opinion that the problem remains unsolved. Up to now the theoretical proposals do not fit well with other features of practical database systems, thus they are not implemented, even in prototypes. Pirotte and Zimanyi: “...a great variety of approaches that are not easily comparable, and (...) few practically usable results”. Dubois, Prade and Testamale: “...the problem is generally not well understood, and any attempt to incorporate support for null values into an implemented system should be considered premature at this time'“. These doubts are actually supported by many professionals, including some of the former proponents of theoretical ideas related to null values. Sources of Null Values: Sources of Null Values Information is irrelevant, for example, “nickname” for a person who has no nickname. Information is known, but not filled yet because data input lasts some time. For example, a new employee is hired, all his/her data are already introduced to the database, except “salary” which has to be input by another administration division. Information is valid only for fixed time; after it must be nullified (to prevent errors) and then input again or recalculated; for example, weather or stock market forecasts. Information is known and could be filled in, but it is considered inessential for the particular domain of applications. For example, in some airline ticket reservation system the nationality of passengers is not filled in for domestic flights. Information consists of “legacy” records, with limited amount of information, and actual records having an extended set of attributes. The result of some processing is an intermediary data structure containing null-valued elements; e.g. outer joins. None of these cases can be qualified as uncertain information. They are related to specific aspects of data storing and processing. Null Values in Theories: Null Values in Theories The fundamental defect of relational theories w.r.t. null values is the scope mismatch. It concerns the difference between the scope of theories and the scope in which null values need to be considered in practice. Theories usually address an idealized query language with very limited capabilities. The scope for null values is wider and should include: all advanced query constructs, such as grouping, aggregate functions, quantifiers, ordering; imperative constructs based on queries: creating, updating, inserting, deleting; the interface for embedding a query language into a programming language; database semantic enhancements: views, database procedures, integrity constraints, deductive rules, active rules; object-oriented extensions: ADT-s, types, classes, interfaces, encapsulation, inheritance; static typing systems; privacy and security; transaction mechanisms; interfaces for end users and programming interfaces: graphical query lang., 4GLs, .... The scope mismatch is the main reason that theoretical results in this area can be summarised as absolute zero. Ironically, this does not concerns careers of researchers that developed these “theories”: the result is great! Unions/Variants: Unions/Variants In the domain of programming languages there is another well-known feature dealing with semi-structured data: “variants” in the Pascal family of languages, or “unions” in the C family. Unions and null values are conceptually similar notions, however, not equivalent. For example, if some record type involves n attributes that can be null valued, then the number of corresponding components of a union is 2n. Unions cover the situation when names of attributes are the same but types are different. The programming language community noticed unions and ignored null values. The database community did vice-versa. Unions are proposed in IDL of OMG CORBA, with an explicit discrimination attribute, and in the ODMG model. Unfortunately, ODMG 2.0 contains no construct to deal with unions in the query language OQL and contains no suggestion how unions will be treated by the assumed strong typing system. The issue is not trivial, especially concerning unions with an explicit discrimination attribute. Actually there is no powerful query language consistently dealing with unions. Null Values in SQL: Null Values in SQL A null value can be stored in a database and returned by an SQL expression if it cannot be evaluated correctly. This may happen because of null-valued arguments of an operation, as well as in the case of wrong arguments of operations. For example, function sum returns NULL for an empty argument table. SQL does not allow explicit comparisons of nulls and ordinary values but involves a special predicate is [not] null. A special function if_null returns an attribute's value if it is not null, or some constant value otherwise. Comparison of an expression which returns NULL with any other value returns a third truth value UNKNOWN; in the WHERE clause, however, it is equivalent to FALSE. In embedded SQL the admission of null values requires for every such attribute two variables in an underlying host language program. An indicator variable is used to store boolean information determining if the particular value of an attribute is null. Flaws of Null Values in SQL: Flaws of Null Values in SQL The SQL solutions related to null values are semantically inconsistent. C.J.Date presents a severe criticism of null values in the relational model (“...the null value concept is far more trouble than it is worth”, “...the SQL null value concept introduces far more problems than it solves”). Anomalies implied by null values: Although in SQL officially every null value is distinct (thus A = B returns UNKNOWN if both A and B return nulls), the group by and unique operators treat them as identical. Aggregate functions sum, avg, min, max ignore null values. If relation R contains numerical attributes A and B which could be null valued, than in general select sum(A+B) from R may return the result different from select sum(A) + sum(B) from R. Date concludes that these flaws are not only the property of an inadequate SQL design, but they are the inevitable consequence of the idea of null values. Unfortunately, this idea of null values is continued and even extended in SQL3. Default values: Default values As an alternative to null values Date and other authors propose the concept of default values. They are ordinary values that are filled in into a tuple in the case of absent information. Default values can be determined in a relational schema. For example, the schema declare SUPPLIER relation SNO char(4) nodefault, SNAME char(15) default( “ “ ), STATUS int default ( -1 ), CITY char(20) default (“???” ), primary key SNO; presents defaults for attributes SNAME, STATUS, and CITY as a string of spaces, integer -1, and a string of three question marks, correspondingly. An advantage of default values over null values is that they need not any special options in a query/programming language to process them. However, default values do not solve all problems, they are not universal and they are error-prone. The New Wave: Semi-Structured Data: The New Wave: Semi-Structured Data A new wave known under the term “semi-structured data” approaches the problem from a different angle (and actually does not refer to null values), but inherits a lot of problems and doubts related to null values. The main limitation of relational approaches is repeated! Querying semi-structured data is not enough! Any data processing system must eventually deal with complete computational and pragmatic power of user/programmer interfaces. Features related to semi-structured data should be combined with all aspects of database systems: database design, data description, query languages, programming capabilities, typing, object-orientedness, procedures, views, active rules, transactions, security, catalogs,.... Related Works: Related Works 1. Management of semi-structured data TSIMMIS (Stanford) INRIA Rufus UnQL (Uni Penn) 2. Semantic markup technique HTML generation and semantic markup Lightweight data mark-up 1. Combine Web information service and Database system 2. Intermediate Representation Model for semi-structured data (HTML) importing AVPL (Semantic tag markup) 3. Preserve Data Model and Query Facility Introduce the concept of schema construction 4. Can be used as a Database Entry System independently Applicable to datawarehousing Mark-up, Wrappers and Mediators: Mark-up, Wrappers and Mediators Wrappers and mediators are new architectural elements of processing unstructured (semi-structured) information. Mark-up: special tags attached to e.g. the content of a Web page allowing to determine a structure of plain text. Wrapper: a piece of software that converts (virtually) “external” interface to “internal” interface. A wrapper allows “to see” and process some data in a manner that is presents some standard. Mediator: a piece of software on higher abstraction level that allows to convert requests (queries) from higher abstraction level to lower level. The concept is close to the concept of database view (updatable). Wrappers and mediators increase the level of data independence. They make it possible the access to data in various formats. They isolate the programmer from details of internal data organization. They increase the level of abstractions allowing to query and process data by interfaces that are more user friendly. The precise meaning of these terms is however not clear. The rules of designing wrappers and mediators are very informal. Current Approaches to Semi-Structured Data: Current Approaches to Semi-Structured Data After application of mark-up techniques, wrappers and mediators semi-structured data should have some form that is more convenient to formal treatment. In comparison to classical database models (network, hierarchical, relational, object-oriented) the main novelty of the approaches to semi-structured data is that data names are kept together with data values. A data structure is viewed as a graph with nodes representing some values (or internal identifiers) and edges representing data names (i.e. logical access paths). Every instance of some conceptual entities can be assigned different attributes. In Tsimmis objects are triples <identifier, label, value>, where identifier is a unique internal object identification, label is a data name, and value is some atomic value (integer, string, etc.) or a set of objects. The database is a pair <O, N>, where O is a set of objects, and N (a subset of O) is a set of entry points to the database. A similar approach (but without identifiers) is assumed in UnQL. Almost identical assumptions (the most general) are assumed in the stack-based approach. TSIMMIS at Stanford: TSIMMIS at Stanford Lightweight Object Model: Object Exchange Model (OEM), like AVPL, O2’s Complex Value Lightweight Repository LORE Lightweight Query Language LOREL LOREL select -from-where Clause: SELECT Frodos.Group.Name FROM Frodos WHERE Frodos.Group.Category=“Opera” SQL sugar + path expressions (limited). The retrieval power is impossible to determine. Object-orientation (classes, methods, encapsulation, inheritance): not considered The integration with fully-fledged programming capabilities: not considered. Conclusion: limited goals that will be not enough for many applications. Example - Movie Database: Example - Movie Database Entry Entry Entry Movie Title “Casablanca” Cast Director “Bogart” “Bacall” Title Cast “Play it again, Sam” Credi Actors “Alan” Actors TV Show Title Cast Episode 1 2 3 Special Guests Data labels (names of objects, names of attributes, etc.) are kept together with data values. Relational Databases are a Particular Case: Relational Databases are a Particular Case A B C “a” “b” 2 4 3 5 C D 3 4 5 “c” “d” “e” R1 R2 R1 R2 Tup Tup Tup Tup Tup A B C A B C C C C D D D “a” 2 3 “b” 4 5 3 “c” 5 “d” 5 “e” Querying Paradigms: Querying Paradigms LISP-like. Data structures are lists which are served by special functions (low level). Data structure is a graph. A query determines a sub-graph of it (UnQL). Comprehensions syntax, structural recursion as an intellectual support. Navigation in a labeled graph. Path expressions + conditions determine the target to be retrieved (Lorel). Extensions to SQL syntax: special constructs testing the presence and absence of data (Lorel). Stack based. A query language is based on binding names occuring in a query according to the state of the environment stack. The stack is managed by query operators (Loqis). My Old Experience: The Great Emigration: My Old Experience: The Great Emigration In late 70-ties I implemented the system “The Great Emigration” devoted to historical research. The system recorded data on Polish emigrants (about 15000) after the unsuccessful November Revolution against Russian domination in 1830-31. The data were collected from police archives all over the Europe, mostly from France. Each emigrant was described by about 50 attributes, such as AttendedSchools, Opinions, StayingPlaces, Duels, Jobs, etc. Each attribute might be optional (null valued), repeated and complex; each occurence of an attribute might be (optionally) associated with dates. To the purpose of this system I implemented a powerful query/programming language, which allowed us to ask non-trivial queries; for example “Identify the groups of emigrants according to the same time of their staying in at least three places”. The generic part of this system we upgraded to a universal DBMS called LINDA. This system had several other applications, e.g. for processing medical data. It may be interesting to note that the LINDA data model was almost identical with the Tsimmis data model (!). The Stack-Based Approach:An Abstract Store Model: The Stack-Based Approach: An Abstract Store Model I - a set of internal identifiers N - a set of external names V - a set of atomic values < i, n, v > < i1, n, i2 > < i, n, T > T is a set of (any) objects atomic object pointer object complex object Store: A set of objects + A set of identifiers (roots) + some obvious constraints (uniqueness of identifiers, referential integrities) No record, tuple, array, set, and bag constructors in the model: essentially all of them are collections of objects (“environments”). No uniqueness of external names on any level of data hierarchy: modeling bulk data. The model allows us to treat uniformly relational, nested-relational, functional, object-oriented databases. Ordering, encapsulation and classes require some extension of the model. later Tiny Database : Tiny Database TinyDatabase i1 EMP i9 EMP i5 EMP i13 DEPT i17 DEPT i2 NAME Brown i10 NAME Jones i6 NAME Smith i3 SAL 2500 i7 SAL 2000 i11 SAL 1500 i4 WORKS_IN i13 i12 WORKS_IN i17 i8 WORKS_IN i17 i14 DNAME Toys i18 DNAME Sales i15 LOC Paris i16 LOC London i19 LOC Berlin A Complex Object: A Complex Object i1 EMP i2 ENO e127 i3 NAME Smith i4 JOB designer i6 SAL 7500 i5 WORKS_IN i17 i10 PREV_JOB i11 COMPANY ICL i12 WHEN 1977-90 i7 PREV_JOB i8 COMPANY IBM i9 WHEN 1975-77 Universality of the Store Model: Universality of the Store Model Complex hierarchical objects can be defined (no limit in levels) Programming variables can be defined We abstract from the persistence status of objects and variables, i.e. we define in the same way persistent and transient objects Binary relationships (associations) can be defined via pointer objects. Ternary and higher order relationships, attributes of relationships: we do not deal with them - they must be decomposed into binary ones. Bulk data: we deal with sets/bags. They are modeled by the same name assigned to many objects on the same hierarchy level Relational structures: each tuple is understood as an object with subobjects. Relativity: no difference in the treatment of objects on any hierarchy level. big advantage for the universality, minimality and simplicity of semantics. Semi-Structured Data: Semi-Structured Data Dynamic arrays: the only difference is that names of sub-objects are first-class. Variants/unions: different structures of sub-objects. Null-values: absence of some sub-objects (same approach as for variants). Consistent! ... in contrast to the SQL approach. i22 KIND part-time i21 NAME Brown i20 EMP Null Type-less, Schema-less?: Type-less, Schema-less? There is misunderstanding concerning the terms “type-less” and “schema-less”. Type-less structures have their advantages (dynamic flexibility) and disadvantages (error prone, storage overhead). However, “type-less” does not mean “schema-less”. An important aspects of types concerns conceptual modeling. Types not only constraint data: they also inform the user what the database contains; i.e. they represent informal data semantics. The user must be informed what the database could contain by a sufficiently precise statement. This statement presents the database schema. It is possible that the schema is not implemented in the system (thus do not present a typing constraint), but it must be known for the user for correct formulation of queries. Such a schema is also necessary for writing wrappers. For example, the designer of a wrapper must take decisions concerning data names (labels of graph edges). Such decisions must be supported by a data schema. Processing Semi-Structured Data: Assumptions of the Stack-Based Approach: Processing Semi-Structured Data: Assumptions of the Stack-Based Approach A special null value, understood as a generic property of a database system, is avoided. Default values are useful but do not solve the problem of semi-structured data. Default values are the properties (invariants) of classes. The approach unifies null values, unions and repeating data. These cases irregularities are modeled by absent objects. Every kind of objects (atomic, complex, reference objects) can be absent. Such irregularities are served by standard query facilities. The classical environmental stack is used to express the semantics of query operators in terms of the naming, scoping and binding mechanism. The same mechanisms should be used for determining imperative statements, procedures, methods, views and other procedural abstractions. The query/programming interface should have full programming power. Some technique of typing semi-structured data should be provided. Optional, Alternative and Repeating Data: Optional, Alternative and Repeating Data Classes and Inheritance: Classes and Inheritance Default Values: Default Values They are stored inside class objects as their invariants. Research Problems:: Research Problems: Assuming semi-structured data are modeled by absent (sub) objects, there are the following problems: Consistent integration of this idea with object-orientedness Powerful constructs querying data with absent (sub)objects Consistent integration of this idea with imperative constructs Consistent integration of this idea with programming abstractions Consistent integration of this idea with schema declarations and typing Methodologies and technologies for writing wrappers for unstructured data Methodologies for designing integrators of heterogeneous sources Methodologies for analysis and design of IS based on unstructured data. We believe that these problems should be treated with respect to all programming issues. Querying semistructured data is not enough. Simplistic theories concerning semi-structured data are not progress.