Designing a pipeline
- What are the elements inside a pipeline? PCollections, PTransforms, and I/O transforms
- What does a Runner do? Determines what back-end your pipeline will run on
- What are Dataflow transforms called? PTransforms
- What is a PTransform's output called? PCollection
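- Do PTransforms "consume" PCollections? No. A transform considers each individual element of its input PCollection and produces a new PCollection as output, so the same PCollection can be passed to several different transforms. PCollections also do not support random access to individual elements.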
- What are tagged outputs? A way for a single transform to output to multiple PCollections.
- What is a Flatten transform? It merges multiple PCollections of the same type into a single PCollection.
- What is a CoGroupByKey transform? It performs a relational join of two or more keyed PCollections that share the same key type.
- What is a root transform? A root transform creates a PCollection from either an external data source or some local data you specify.
- What are the two kinds of root transform? Read and Create. Read transforms read data from an external source, such as a text file or a database table. Create transforms create a PCollection from an in-memory java.util.Collection.
- Can pipelines consume batch or streaming data? Both.
- What is a bounded PCollection? One that represents a data set of a known, fixed size. It is typically processed as a batch job.
- What is an unbounded PCollection? One that represents a data set that is continuously updated and has no fixed size. It is typically processed as a streaming job.
- Can Pipelines share a PCollection? No. Each PCollection is owned by the specific Pipeline it was created for.
- Can elements in a PCollection be of different types? No. Every element in a PCollection must be of the same type.
- How do you add elements to a PCollection? You can't. PCollections are immutable. A PTransform has to process the existing PCollection and create a new one.
- How does Beam consume streaming data? Beam uses windowing to divide a continuously updating unbounded PCollection into logical windows of finite size. These logical windows are determined by some characteristic associated with a data element, such as a timestamp. Aggregation transforms (such as GroupByKey and Combine) work on a per-window basis — as the data set is generated, they process each PCollection as a succession of these finite windows.
- What is a fixed time window? A window that covers a consistent, non-overlapping time interval in the data stream. Given a timestamped PCollection which might be continuously updating, each window might capture (for example) all elements with timestamps that fall into a 30-second interval.
- What are sliding time windows? A sliding time window also represents time intervals in the data stream; however, sliding time windows can overlap. For example, each window might capture 60 seconds worth of data, but a new window starts every 30 seconds. The frequency with which sliding windows begin is called the period. Therefore, our example would have a window duration of 60 seconds and a period of 30 seconds.
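
The fixed and sliding window answers above come down to simple timestamp arithmetic. The following is a minimal plain-Java sketch of that arithmetic, not Beam's own implementation. The class name WindowSketch and both method names are hypothetical, timestamps are assumed to be whole seconds, and windows are assumed to be aligned to timestamp 0 and half-open, [start, start + size).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it does not use the Beam SDK.
class WindowSketch {

    // Returns the start of the single fixed window that contains the timestamp.
    // Assumes windows are aligned to 0 and cover [start, start + sizeSec).
    static long fixedWindowStart(long timestampSec, long sizeSec) {
        return timestampSec - Math.floorMod(timestampSec, sizeSec);
    }

    // Returns the start of every sliding window that contains the timestamp,
    // latest window first. A new window begins every periodSec seconds and
    // covers [start, start + durationSec).
    static List<Long> slidingWindowStarts(long timestampSec, long durationSec, long periodSec) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = timestampSec - Math.floorMod(timestampSec, periodSec);
        // Walk backwards while the window still reaches past the timestamp.
        for (long start = lastStart; start > timestampSec - durationSec; start -= periodSec) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // 30-second fixed windows: second 75 falls in the window starting at 60.
        System.out.println(fixedWindowStart(75, 30));        // prints 60
        // 60-second duration, 30-second period: second 75 falls in two windows.
        System.out.println(slidingWindowStarts(75, 60, 30)); // prints [60, 30]
    }
}
```

With the 60-second duration and 30-second period from the example, an element at second 75 belongs to two overlapping windows, [30, 90) and [60, 120). With 30-second fixed windows it belongs only to [60, 90). In a real pipeline the SDK's windowing transforms perform this assignment for you.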