Datastream

● Useful Links

○ Introduction to Datastream for BigQuery
○ Replicate your Oracle data into BigQuery in realtime using Datastream and Data Fusion
○ Data Analytics Deep Dives - Datastream - Postgres to BigQuery
○ Near real-time CDC using DataStream
○ https://www.data-max.io/post/streaming-data-from-postgresql-to-bigquery-with-datastream

● Can/Pros
○ Datastream provides seamless replication of data from operational databases (Oracle, MySQL, PostgreSQL, AlloyDB) into BigQuery.
○ Integrates with Dataflow, Cloud Data Fusion, Pub/Sub, and BigQuery.
○ Supports writing the change event stream into Cloud Storage.
○ Near real-time and serverless.
○ Enables simple, source-independent processing by converting all source-specific data types into a unified Datastream type schema based on Avro types.
● Cons
○ It is just lift and shift (replication only, with no transformation along the way).
○ Cannot write in Iceberg format.
○ No control over writing to a specific partitioned table in BigQuery.
○ No support for JSON and Avro when writing to GCS.
● Limitations
○ How do Datastream and BigQuery handle tables that don't have a primary key? If the source table doesn't have a primary key, the table is treated as append-only, and each event for a given row appears as a separate row in BigQuery (see the deduplication sketch after this list).
○ Throughput: ~5 MB/s, with a 30-MB row size limit for the Cloud Storage destination and a 10-MB row size limit for BigQuery.
○ Some data definition language (DDL) operations aren't supported during replication, including:
■ Dropping a column from the middle of a table. This may cause a data discrepancy because values are associated with the wrong column.
■ Changing the data type of a column. This may cause a data discrepancy because data isn't mapped properly to the correct Datastream unified type, and the data
may get corrupted.
■ Cascading deletes are ignored.
■ Table truncation is ignored.
○ For source-specific limitations, see the following pages:
■ MySQL limitations
■ Oracle limitations
■ PostgreSQL limitations
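○ Because append-only tables keep every change event as its own row, consumers usually have to deduplicate to read the latest state. A minimal sketch of doing that in BigQuery, assuming the replicated table exposes Datastream's datastream_metadata column (uuid, source_timestamp); the project, dataset, table, and key names are placeholders:

```python
# Hedged sketch: deduplicate an append-only Datastream table in BigQuery.
# Assumes the table carries Datastream's datastream_metadata struct
# (uuid, source_timestamp); all names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

sql = """
SELECT * EXCEPT (row_rank)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY order_id  -- placeholder business key
      ORDER BY datastream_metadata.source_timestamp DESC
    ) AS row_rank
  FROM `my_project.my_dataset.orders_append_only`  -- placeholder table
)
WHERE row_rank = 1
"""

# Print the latest version of each row.
for row in client.query(sql).result():
    print(dict(row))
```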
● Notes:
○ High availability: Datastream is a regional service, running in multiple zones in each region. A single-zone failure in a region does not impact the availability or quality of the service in the other zones.
○ Disaster recovery:
■ If there's a failure in a region, then any streams running on that region will be down for the duration of the outage. After the outage is
resolved, Datastream will continue exactly where it left off, and any data that hasn't been written to the destination will be retrieved again
from the source. In this case, duplicates of data may reside in the destination.
■ Or we can switch to a different region (a sketch of this failover follows after these steps):
● Create a stream in a new region or project with the same configuration as the existing stream, but don't select the Backfill historical data checkbox.
● Start the stream that you created.
● After the stream that you created has a status of RUNNING, pause the existing stream.
● Optionally, modify the new stream by selecting the Backfill historical data checkbox. Existing data in tables added to the stream in the future will be
streamed from the source into the destination.
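● A rough sketch of those steps with the google-cloud-datastream Python client; the project, regions, and stream IDs are placeholders, and the exact field and method names are an assumption based on my reading of the v1 client:

```python
# Hedged sketch of a regional failover with google-cloud-datastream (v1 client).
# Resource names are placeholders; in practice the new stream must reference
# connection profiles that exist in the new region.
from google.cloud import datastream_v1
from google.protobuf import field_mask_pb2

client = datastream_v1.DatastreamClient()
old_name = "projects/my-project/locations/us-central1/streams/orders"
new_parent = "projects/my-project/locations/us-east1"  # fallback region

# 1. Create a stream in the new region with the same configuration, but with
#    backfill disabled (the "don't backfill historical data" option).
old = client.get_stream(name=old_name)
new_stream = datastream_v1.Stream(
    display_name=old.display_name + "-dr",
    source_config=old.source_config,            # reuse the source settings
    destination_config=old.destination_config,  # reuse the destination settings
    backfill_none=datastream_v1.Stream.BackfillNoneStrategy(),
)
created = client.create_stream(
    parent=new_parent, stream_id="orders-dr", stream=new_stream
).result()

# 2. Start the new stream by setting its desired state to RUNNING.
created.state = datastream_v1.Stream.State.RUNNING
client.update_stream(
    stream=created, update_mask=field_mask_pb2.FieldMask(paths=["state"])
).result()

# 3. Once the new stream reports RUNNING, pause the existing stream.
old.state = datastream_v1.Stream.State.PAUSED
client.update_stream(
    stream=old, update_mask=field_mask_pb2.FieldMask(paths=["state"])
).result()
```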
○ How are BigQuery costs calculated when used with Datastream? BigQuery costs are calculated and charged separately from Datastream. BigQuery processes the events and applies changes to the underlying table. As the data volume grows, the BigQuery analysis cost increases because BigQuery needs to process more data to apply the changes to the underlying table. The main pricing component is the analysis cost. It depends on the pricing model (on-demand or capacity-based pricing) and on factors like the complexity, frequency, and number of rows modified. Costs can be controlled by allocating reservation slots to the relevant project. Another way to control costs is by changing the frequency of merge operations, which is done through the BigQuery table staleness property: the higher the staleness limit, the fewer merge operations are performed and the lower the cost (see the max_staleness sketch below). We recommend setting the staleness limit based on the maximum of the following two values:
■ Maximum tolerable data freshness of your application
■ Run time of each round of background upsert operations
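■ A minimal sketch of setting the staleness property with the BigQuery Python client; the project, dataset, table, and 15-minute interval are placeholders:

```python
# Hedged sketch: raise max_staleness on a Datastream-managed BigQuery table so
# background merge (upsert) operations run less often. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
client.query(
    """
    ALTER TABLE `my_project.my_dataset.orders`
    SET OPTIONS (max_staleness = INTERVAL '0-0 0 0:15:0' YEAR TO SECOND)  -- 15 minutes
    """
).result()  # wait for the DDL statement to complete
```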
○ When writing to a schemaless destination, such as Cloud Storage, Datastream simplifies downstream processing of data across sources by normalizing data types across all sources. Datastream takes the original source data type (for example, a MySQL or PostgreSQL NUMERIC type or an Oracle NUMBER type) and normalizes it into a Datastream unified type.
○ Non-native options for CDC
■ Fivetran
■ Alooma (acquired by Google)
■ Striim
○ Datastream/Cloud SQL runs not in the client's VPC but in a Google-managed network (which is why private connectivity/VPC peering is needed).
● Datastream has five entities:
○ Private connectivity configurations enable Datastream to communicate with data sources over a secure, private network connection. This
communication happens through Virtual Private Cloud (VPC) peering.
○ Connection profiles represent connectivity information to a specific source or destination database.
○ Streams represent a source and destination connection profile pair, along with stream-specific settings.
○ Objects represent a sub-portion of a stream. For instance, a database stream has a data object for every table being streamed.
○ Events represent every data manipulation language (DML) change for a given object.
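○ A rough sketch of enumerating these entities with the google-cloud-datastream Python client; the project and region are placeholders, and the method names follow my reading of the v1 client:

```python
# Hedged sketch: list Datastream connection profiles, streams, and stream
# objects in one region. The project and location below are placeholders.
from google.cloud import datastream_v1

client = datastream_v1.DatastreamClient()
parent = "projects/my-project/locations/us-central1"

# Connection profiles: connectivity info for a source or destination database.
for profile in client.list_connection_profiles(parent=parent):
    print("profile:", profile.name)

# Streams: a source/destination connection-profile pair plus stream settings.
for stream in client.list_streams(parent=parent):
    print("stream:", stream.name, stream.state)

    # Objects: one per table (or other sub-portion) being streamed.
    for obj in client.list_stream_objects(parent=stream.name):
        print("  object:", obj.display_name)
```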
● Destinations
○ Configure a BigQuery destination
○ Configure a Cloud Storage destination

● What is
○ Cloud SQL
○ Cloud Spanner
○ Iceberg table
● Supported sources
○ Datastream supports Oracle, MySQL, and PostgreSQL (including AlloyDB for PostgreSQL) sources.
○ ?
