A first approach to stream re-partitioning

January 29, 2017

Two days ago, I received reviews on my work regarding distributed stream partitioning. In those, reviewers wanted to know the reason that I did not compare my proposed partitioning algorithms with Flux, which is a re-partitioning operator proposed by Shah et al. in IEEE ICDE 2003, with the article titled: “Flux: An adaptive partitioning operator for continuous query systems”. The reviews made me decide to write a quick review on Flux because it is considered seminal work in the field of load balancing stream processing, and to ensure that this work is similar but orthogonal to my proposed solutions (there is a difference between partitioning and re-partitioning).

In Continuous Query (CQ) systems different parts of a query can be parallelized by partitioning the input into numerous sub-streams each one handled by different processes. The effectiveness of partitioning plays a crucial role on the performance of the CQ system. Unfortunately, data streams tend to have their distributions change over time, which results in repartitioning streams in an online fashion.

Shah et al. proposed Flux, which is an operator that adapts partitioning, according to data characteristic changes. Flux uses hash partitioning and when it detects skew in the input, it employs re-partitioning based on a simple yet efficient model. In addition, Flux offers a smart mechanism of buffering tuples and re-ordering them to battle transient skew.

Strengths

Buffer and re-ordering mechanism to battle transient skew
Load balancing policy that strives to keep imbalance low

Weaknesses

Many parameters require tuning and do not get automatically picked