An architecture for federated data discovery and lineage over on-prem datasources and public cloud with Apache Atlas

Published on June 20, 2018

Author: Hadoop_Summit

Source: slideshare.net

Content

1. An architecture for integrated data discovery and lineage over on-prem datasources and public cloud with Apache Atlas
Barbara Eckman, Ph.D., Principal Architect
Comcast collects, stores, and uses all data in accordance with our privacy disclosures to users and applicable laws.

2. Our Group’s Mission
Gather and organize metadata and lineage from diverse sources to make data universally discoverable, integrable, and accessible, empowering insight-driven decision making.
• Dozens of tenants and stakeholders
• Millions of messages/second captured
• Tens of PB of long-term data storage
• Thousands of cores of distributed compute

3. Data Discovery and Lineage
• Where can I find data about X?
• How is this data structured?
• Who produced it?
• What does it mean?
• How “mature” is it?
• What attributes in your data match attributes in mine? (e.g., potential join fields)
• How has the data changed in its journey from ingest to where I’m viewing it?
• Where are the derivatives of my original data to be found?

4. Challenges of Heterogeneity
• There are many excellent data discovery tools, both open source and commercial
• BUT they are limited in the scope of data set types supported
  - Only a certain Big Data ecosystem provider
  - Only RDBMSs, text documents, emails
• We need to add new data set types from multiple providers
• We need to integrate metadata from diverse data sets, both in public cloud and on-prem
• We need to integrate lineage from diverse loading jobs, both batch and streaming

5. Outline
• #TBT** to DataWorks Summit 2017
• Reorganization Yields New Requirements (12/2017)
• Challenges encountered and met
• New Integrative Data Discovery and Lineage Architecture
• Next steps
** “Throw Back Tuesday”

6. #TBT to DataWorks Summit 2017

7. Data Platform Architecture (DataWorks Summit 2017) METADATA

8. DataWorks Summit 2017: Key Metadata Technologies
• Apache Avro (avro.apache.org)
• Apache Atlas (atlas.apache.org)

9. What are Avro and Atlas?
Apache Avro
• A data serialization system
  - A JSON-based schema language
  - A compact serialized format
• APIs in a bunch of languages
• Benefits:
  - Cross-language support for dynamic data access
  - Simple but expressive schema definition and evolution
  - Built-in documentation, defaults
Apache Atlas
• Data Discovery, Lineage
  - Browser UI
  - REST/Java and Kafka APIs
  - Synchronous and asynchronous messaging
  - Free-text, typed, graph search
• Integrated Security (Apache Ranger)
• Schema Registry as well as Metadata Repo
Open Source, Extensible
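As a rough illustration of the Avro points above (JSON-based schema, built-in documentation, defaults, schema evolution), here is a minimal sketch using only the Python standard library. The record name, field names, and the `apply_defaults` helper are hypothetical and are not taken from the talk; the helper only imitates, in simplified form, how a reader fills in a field that older data does not carry.

```python
import json

# Hypothetical schema: the record and field names are illustrative only.
SCHEMA_V2 = {
    "type": "record",
    "name": "DeviceEvent",
    "namespace": "example.events",
    "doc": "One event reported by a device.",
    "fields": [
        {"name": "device_id", "type": "string", "default": "",
         "doc": "Identifier of the reporting device."},
        # Field added in a later schema version; its default keeps old data readable.
        {"name": "firmware", "type": ["null", "string"], "default": None,
         "doc": "Firmware version, if the device reported one."},
    ],
}


def apply_defaults(record, schema):
    """Fill fields missing from `record` with the defaults declared in `schema`."""
    resolved = dict(record)
    for field in schema["fields"]:
        if field["name"] not in resolved:
            if "default" not in field:
                raise ValueError(f"no value or default for {field['name']}")
            resolved[field["name"]] = field["default"]
    return resolved


# The schema itself is plain JSON, which is the form a registry would store.
schema_json = json.dumps(SCHEMA_V2, indent=2)

old_record = {"device_id": "abc-123"}  # written before `firmware` existed
new_view = apply_defaults(old_record, SCHEMA_V2)
print(new_view)
```

A real deployment would rely on an Avro library for serialization and schema resolution; the sketch only shows why a default on every field makes adding fields a compatible change.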

10. DataWorks Summit 2017: Atlas Metadata Types
Built-in Atlas Types
• DataSet
• Process
• Hive tables
• Kafka topics
Custom Atlas Entities
• Avro Schemas
  - Reciprocally linked to all other dataset types
• Extensions to Kafka topic
  - Sizing parameters
• AWS S3 Object Store
  - S3 Bucket, Pseudo-Directory
Custom Atlas Processes
• Lineage Processes
  - Avro schema versioning
  - Storing data to S3 objects
• Enrichment Processes on streaming data
  - Re-publishing to Kafka topics

11. Reorganization Yields New Requirements

12. Reorganization Yields New Requirements
• Integrate on-prem data sources
  - Traditional warehousing, RDBMS, Atlas on Hadoop
• Increase signal/noise ratio
• Atlas Connectors for all metadata sources
  - At rest
  - Streaming
• Lineage capture
  - Streaming and batch
• End-user annotations
  - Stakeholders, documentation

13. New Data Source Type: Generic RDBMS
Atlas typedefs
• Instance
• Database (schema)
• Table
• Column
• Index
• Foreign Key
Used for:
• Informatica Metadata Manager, on top of Teradata EDW
• Oracle
• Others to come
Comments:
• Back pointers to parent class at every level of hierarchy
• Owned_refs (aka “composite”) at every level
• Load only whitelisted databases to increase signal, reduce noise
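The two structural comments above (owned references at every level, back pointers to the parent at every level) can be sketched as Atlas-style typedef JSON built in Python. This is a sketch under assumptions: the `example_rdbms_*` type names, the attribute names, and the choice of `DataSet` as supertype are hypothetical and are not the typedefs contributed in ATLAS-2709; the `ownedRef` and `inverseRef` constraint names follow the Atlas v2 typedef JSON format.

```python
import json


def owned_children(attr_name, child_type):
    """Parent-side attribute: a set of children that the parent owns (composite)."""
    return {
        "name": attr_name,
        "typeName": f"array<{child_type}>",
        "isOptional": True,
        "cardinality": "SET",
        "constraints": [{"type": "ownedRef"}],
    }


def back_pointer(attr_name, parent_type, parent_attr):
    """Child-side attribute: a reference back to the owning parent."""
    return {
        "name": attr_name,
        "typeName": parent_type,
        "isOptional": False,
        "cardinality": "SINGLE",
        "constraints": [{"type": "inverseRef", "params": {"attribute": parent_attr}}],
    }


# (parent type, attribute on parent, child type, attribute on child); all names hypothetical.
HIERARCHY = [
    ("example_rdbms_instance", "databases", "example_rdbms_db", "instance"),
    ("example_rdbms_db", "tables", "example_rdbms_table", "db"),
    ("example_rdbms_table", "columns", "example_rdbms_column", "table"),
    ("example_rdbms_table", "indexes", "example_rdbms_index", "table"),
    ("example_rdbms_table", "foreign_keys", "example_rdbms_foreign_key", "table"),
]


def build_typedefs():
    defs = {}

    def entity(name):
        # Create the entity definition on first use, then reuse it.
        return defs.setdefault(
            name, {"name": name, "superTypes": ["DataSet"], "attributeDefs": []}
        )

    for parent, child_attr, child, parent_attr in HIERARCHY:
        entity(parent)["attributeDefs"].append(owned_children(child_attr, child))
        entity(child)["attributeDefs"].append(back_pointer(parent_attr, parent, child_attr))
    return {"entityDefs": list(defs.values())}


typedefs = build_typedefs()
print(json.dumps(typedefs, indent=2))
```

With owned references, deleting a parent entity cascades to its children, which is why the pattern is applied at every level of the hierarchy.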

14. Connectors for all metadata sources
• One codebase for all sources
• Connectors differ in their means of acquiring metadata, but use the same methods to package data into AtlasEntities for publishing via the Kafka API
  - RDBMS
  - Atlas to Atlas (supports different versions)
  - Kafka topics
  - Avro schemas
  - AWS datalake objects
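One way to read the "one codebase" design is a shared base class in which only the acquisition step varies by source. The sketch below is an assumption about how that could look, not the Comcast implementation: the class names, the `fetch`/`run` methods, and the `publish` callable are hypothetical, and a real connector would publish through a Kafka producer rather than a Python list.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Shared packaging and publishing; subclasses only decide how to acquire metadata."""

    type_name = "DataSet"

    def __init__(self, publish):
        self.publish = publish  # callable that sends one entity message

    @abstractmethod
    def fetch(self):
        """Yield raw metadata records as dicts containing at least 'qualifiedName'."""

    def to_entity(self, record):
        # Same packaging for every source: a type name plus an attribute map.
        return {"typeName": self.type_name, "attributes": dict(record)}

    def run(self):
        count = 0
        for record in self.fetch():
            self.publish(self.to_entity(record))
            count += 1
        return count


class KafkaTopicConnector(Connector):
    type_name = "kafka_topic"

    def __init__(self, publish, topics, cluster):
        super().__init__(publish)
        self.topics = topics
        self.cluster = cluster

    def fetch(self):
        for topic in self.topics:
            yield {"qualifiedName": f"{topic}@{self.cluster}", "name": topic}


class RdbmsTableConnector(Connector):
    type_name = "example_rdbms_table"  # hypothetical type name

    def __init__(self, publish, rows):
        super().__init__(publish)
        self.rows = rows  # (database, table) pairs, e.g. from a catalog query

    def fetch(self):
        for database, table in self.rows:
            yield {"qualifiedName": f"{database}.{table}", "name": table}


sent = []  # stand-in for a Kafka producer
topic_count = KafkaTopicConnector(sent.append, ["clicks", "errors"], "example").run()
table_count = RdbmsTableConnector(sent.append, [("sales", "orders")]).run()
print(topic_count, table_count, len(sent))
```

Injecting `publish` keeps the packaging logic testable without a broker and lets every connector share one publishing path.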

15. Dataset Lineage capture
• Generic lineage process typedef
  - Used for both batch and streaming lineage capture
  - Attributes include transforms performed, config parameters
  - May be subclassed to add attributes for individual cases
• Unlike metadata, lineage capture is event-triggered
  - In AWS, a CloudWatch event on a Glue crawler triggers a Lambda function
  - In on-prem Hadoop, an inotify event on HDFS triggers a microservice
• Triggered components assemble the requisite info and publish to the Atlas lineage connector
Acknowledgements: Datalake Team
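The event-triggered flow above can be sketched as a handler that turns one "load finished" event into a lineage process entity. Everything specific here is an assumption for illustration: the event fields, the `example_lineage_process` type name, and the `transforms` and `config` attribute names are hypothetical. The `inputs` and `outputs` attributes are the ones Atlas defines on its built-in `Process` type.

```python
def object_ref(type_name, qualified_name):
    """Reference an existing entity by type and unique attribute."""
    return {"typeName": type_name, "uniqueAttributes": {"qualifiedName": qualified_name}}


def build_lineage_process(qualified_name, inputs, outputs, transforms=(), config=None,
                          type_name="example_lineage_process"):
    """Package one lineage edge set: inputs -> process -> outputs."""
    if not inputs or not outputs:
        raise ValueError("a lineage process needs at least one input and one output")
    return {
        "typeName": type_name,
        "attributes": {
            "qualifiedName": qualified_name,
            "name": qualified_name,
            "inputs": list(inputs),
            "outputs": list(outputs),
            "transforms": list(transforms),  # hypothetical attribute
            "config": dict(config or {}),    # hypothetical attribute
        },
    }


def handle_load_event(event, publish):
    """Handler in the style of a Lambda function or microservice endpoint."""
    process = build_lineage_process(
        qualified_name=f"{event['job']}@{event['finished_at']}",
        inputs=[object_ref(event["source_type"], event["source"])],
        outputs=[object_ref(event["target_type"], event["target"])],
        transforms=event.get("transforms", []),
        config=event.get("config"),
    )
    publish(process)
    return process


published = []  # stand-in for the lineage connector
example_event = {  # hypothetical event shape
    "job": "clicks-to-s3",
    "finished_at": "2018-06-19T10:00:00Z",
    "source_type": "kafka_topic",
    "source": "clicks@example",
    "target_type": "example_s3_pseudo_dir",
    "target": "s3://example-bucket/clicks/",
    "transforms": ["drop_null_ids"],
}
handle_load_event(example_event, published.append)
print(published[0]["attributes"]["qualifiedName"])
```

Because the same entity shape is used whether the trigger is a cloud event or an HDFS event, batch and streaming loads end up in one lineage graph.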

16. End-user annotations: new tag typedefs
Stakeholders
• Individuals
  - Data Business Owner
  - Data Technical Owner
  - Data Steward
  - Delivery Manager
  - Data Architect
• Teams
  - Delivery Team
  - Support Team
Documentation
• array<map<string, string>>
  - Name
  - Description
  - URL
Acknowledgements: Portal team
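A minimal sketch of how these two tags might be expressed as Atlas classification typedefs follows. The tag names, the snake_case attribute names, and the `doc_entry` helper are hypothetical; only the roles and the `array<map<string,string>>` shape come from the slide.

```python
import json

STAKEHOLDER_ROLES = [
    "data_business_owner", "data_technical_owner", "data_steward",
    "delivery_manager", "data_architect", "delivery_team", "support_team",
]


def optional_attr(name, type_name):
    return {"name": name, "typeName": type_name,
            "isOptional": True, "cardinality": "SINGLE"}


def stakeholders_tag():
    """One optional string attribute per individual or team role."""
    return {
        "name": "example_stakeholders",
        "superTypes": [],
        "attributeDefs": [optional_attr(role, "string") for role in STAKEHOLDER_ROLES],
    }


def documentation_tag():
    """A list of documents, each a small string-to-string map."""
    return {
        "name": "example_documentation",
        "superTypes": [],
        "attributeDefs": [optional_attr("documents", "array<map<string,string>>")],
    }


def doc_entry(name, description, url):
    """Build one element of the documents array."""
    return {"name": name, "description": description, "url": url}


tag_typedefs = {"classificationDefs": [stakeholders_tag(), documentation_tag()]}
entry = doc_entry("Runbook", "How to reload the table", "https://example.com/runbook")
print(json.dumps(tag_typedefs, indent=2))
```

Keeping each document as a map of strings lets the portal add further keys later without a typedef change.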

17. New Integrative Data Discovery and Lineage Architecture (diagram)
• Portal UI, on top of a REST/asynchronous API
• Apache Atlas on AWS: integrative metadata store for search
  - Free-text search, SQL-like search, graph search
  - Duplicates metadata from the metadata sources, sufficient to enable discovery
  - Drill down to individual metastores’ UIs for deep exploration
• Connectors: RDBMS Connector, Atlas-to-Atlas Connector, Lineage Connector, DataStream Connector, Model Connector, AWS object Connector, other connectors
• Metadata sources (on-prem and public cloud)
  - Batch data ingest jobs
  - Streaming data ingest jobs
  - Other metadata repos
  - ML pipeline: models, feature engineering jobs
  - Oracle, MySQL, MSSQL, etc.
  - Catalog: AWS S3 datalake, Kinesis streams
  - Avro schemas
  - Kafka topics
  - Informatica MDM, Teradata
  - On-prem Atlas for Hive tables

18. Challenges encountered
Integrating EDW metadata
• Problem:
  - Objects are vastly larger than any handled previously, even though integration is restricted to a whitelist of top-priority databases
• Solution:
  - Increase config params in Java, Kafka, Atlas
  - Use the Kafka API
  - Build a single AtlasEntity object
  - Send a single message for each database
  - Optimize the query to get data from Informatica (pivot)
Schema-on-read legacy in the on-prem datalake
• Problem:
  - Many, many duplicate Hive tables
  - Unknown lineage/semantic equivalence
• Solution:
  - Data-based ML solution to find dups and equivalences (POC in progress)
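The "whitelist, pivot, one message per database" approach for EDW metadata can be sketched as follows. The row layout (database, table, column) and the message shape are assumptions made for illustration; the talk does not describe the actual Informatica query or the AtlasEntity layout.

```python
from collections import defaultdict


def group_by_database(rows, whitelist):
    """Pivot flat (database, table, column) rows into database -> table -> columns."""
    nested = defaultdict(lambda: defaultdict(list))
    for database, table, column in rows:
        if database in whitelist:  # skip everything outside the whitelist
            nested[database][table].append(column)
    return nested


def database_messages(rows, whitelist):
    """Yield exactly one nested message per whitelisted database."""
    for database, tables in sorted(group_by_database(rows, whitelist).items()):
        yield {
            "database": database,
            "tables": [
                {"name": table, "columns": columns}
                for table, columns in sorted(tables.items())
            ],
        }


rows = [  # hypothetical catalog rows
    ("sales", "orders", "order_id"),
    ("sales", "orders", "amount"),
    ("sales", "customers", "customer_id"),
    ("scratch", "tmp1", "x"),
]
messages = list(database_messages(rows, whitelist={"sales"}))
print(messages)
```

One large message per database keeps a database's tables and columns together, at the cost of needing the larger message-size limits mentioned in the solution list.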

19. Next Steps

20. End-to-end metadata repository
• Models are first-class objects, captured with rich metadata (e.g., input file schema, feature set schema, model parameters)
• Feature engineering jobs are first-class objects, captured with rich metadata (e.g., model, data quality threshold, input file schema, owner)
• Build metadata capture on models and feature engineering jobs into the ML pipeline
Metadata repo for discovery and documentation of models
(Diagram labels: Data Lake, Meta-Data, Data Set A, Feature Engineering, Feature Set B, ML Model, Prediction)

21. Extending Avro schema governance to other schema types
• Interactive user app facilitates creation of schemas and enforces compliance with Comcast conventions
  - Each schema is reviewed and approved by at least one human being
• Comcast conventions:
  - Non-vacuous doc comments required to document every attribute
  - All attributes must have default values
  - Unnecessary complexity is discouraged (YAGNI principle)
• Library of commonly used subschemas
  - Available via app, use is encouraged by reviewers
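The first two conventions above lend themselves to an automated pre-check before human review. The function below is a sketch, not the Comcast app: `check_conventions` is a hypothetical name, and treating a doc comment as vacuous when it is empty or merely repeats the field name is an assumed heuristic. Judging whether a doc comment is genuinely informative, or whether a schema is needlessly complex, still needs a reviewer.

```python
def check_conventions(schema):
    """Return a list of convention violations found in an Avro schema given as a dict."""
    problems = []

    def visit(node, path):
        if isinstance(node, list):  # union: check every branch
            for branch in node:
                visit(branch, path)
        elif isinstance(node, dict):
            kind = node.get("type")
            if kind == "record":
                for field in node.get("fields", []):
                    name = field.get("name", "?")
                    field_path = f"{path}.{name}"
                    doc = str(field.get("doc", "")).strip()
                    if not doc or doc.lower() == name.lower():
                        problems.append(f"{field_path}: doc comment is missing or vacuous")
                    if "default" not in field:
                        problems.append(f"{field_path}: no default value")
                    visit(field.get("type"), field_path)
            elif kind == "array":
                visit(node.get("items"), path)
            elif kind == "map":
                visit(node.get("values"), path)

    visit(schema, schema.get("name", "record"))
    return problems


good = {
    "type": "record", "name": "Reading",
    "fields": [{"name": "celsius", "type": "double", "default": 0.0,
                "doc": "Temperature in degrees Celsius."}],
}
bad = {
    "type": "record", "name": "Reading",
    "fields": [{"name": "celsius", "type": "double", "doc": "celsius"}],
}
good_problems = check_conventions(good)
bad_problems = check_conventions(bad)
print(good_problems, bad_problems)
```

Running such a check inside the schema-creation app would let reviewers concentrate on meaning and on the YAGNI question rather than on mechanical omissions.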

22. An architecture for integrated data discovery and lineage over on-prem datasources and public cloud with Apache Atlas
• Throw Back To DataWorks Summit 2017
• Reorganization Yields New Requirements (12/2017)
  - New Atlas data types
  - Connectors for all metadata sources
  - Lineage capture
  - End-user annotations via Tags
• Today’s Architecture
• Challenges encountered and met
• Next steps
• Extra Goodies!

23. Comcast Contributions to Apache Atlas OS Community
Jira Ticket: Description
• ATLAS-2694: Avro schema typedef and support for Avro schema evolution in Atlas
• ATLAS-2696: Typedef extensions for Kafka in Atlas
• ATLAS-2708: AWS S3 data lake typedefs for Atlas
• ATLAS-2709: RDBMS typedefs for Atlas
• ATLAS-2724: UI enhancement for Avro schemas and other JSON-valued attributes (coming soon)
https://issues.apache.org/jira/browse/ATLAS-XXXX

24. Suggested Reading
• Creating A Data-Driven Enterprise in Media
  - Comcast Chapter: How a Focus on Customer Experience Led to a Focus on Data Science
• Data Governance and the Death of Schema on Read
Both can be reached from: https://www.oreilly.com/ideas/data-governance-and-the-death-of-schema-on-read

25. My collaborators: Sonal, Rob, Teja, Sean, Vadim Vaks (Principal Solutions Architect), Gabe

26. [email protected]
