Ph.D Thesis | |

Ph.D Student | Gabel Moshe |
---|---|

Subject | Communication-Efficient Algorithms for Distributed Stream Mining |

Department | Department of Computer Science |

Supervisors | PROF. Assaf Schuster |

ASSOCIATE PROF. Daniel Keren | |

Full Thesis text |

Recent years have seen an explosion in the number of connected devices, which means not only growth in velocity and volume of data, but also that data sources are increasingly spatially distributed, incurring higher communication costs. Data mining algorithms often assume that data is centralized or that communication is inexpensive: the setting is implicitly assumed to be a data center. In settings like wireless sensor networks, however, communication costs limited battery power. Moreover, most work only considers one-shot computation: computing a result once from a fixed data set. Yet data is increasingly dynamic, and many applications require current results over a recent time window.

We focus on computing approximations over aggregated distributed data streams with reduced communication. We describe three novel distributed approximations for important non-linear functions: variance, Shannon's entropy, and least-squares regression. Using a geometric safe zone approach (also called geometric monitoring), we convert global approximation bounds on the data aggregate to local threshold constraints that each node can check independently. Our algorithms provide deterministic user-defined approximation bounds, while avoiding messages unless they are needed to maintain those bounds. Compared to the centralized solution, our algorithms reduce communication by up to two orders of magnitude on several real data sets that represent real applications, including machine health monitoring, network monitoring with netflows, traffic monitoring, and others. We also extend the formulation of geometric monitoring to variable-sized windows, which is critical to achieve low communication when window sizes are dynamic and vary across nodes.