# Sublinear Algorithms Workshop

## SUMMER SCHOOL AND WORKSHOP

### Synopsis

With the emergence of big data, and the huge combinatorial objects that make up this data, there is a need for understanding data that is too large to store or view in its entirety. Determining the scope of what is possible to understand in such settings is a basic and fundamental endeavor, that has been addressed in multiple ways within the communities of statistics, mathematics, computer science, and information theory. Models that have been studied include: streaming algorithms, where big data streams by, and the algorithm has limited space in which to store information and intermediate computations; sketching algorithms, where the input is compressed (sketched) into a smaller domain in a way that certain operations can be supported on the sketches; compressive sensing algorithms, where a signal is recovered from noisy measurements; sublinear time algorithms, in which algorithms approximate a parameter of the data after viewing a miniscule fraction of the data; local algorithms that compute and make decisions on parts of the output considering only a portion of the input; property testing algorithms, in which a quick determination of what properties the data has must be made in sublinear time in the size of the data; and distribution property testing algorithms, in which properties of the distribution must be understood in time sublinear in the size of the sampled domain.

### Program

The workshop will take place at MIT on August 3-5, 2022 and will be held in the Stata Center, 32 Vassar Street, Room 32-123. We will also have a summer school August 1-2 in the same location - 32-123.

### Organizers

• Piotr Indyk (MIT)
• Michael Mitzenmacher (Harvard)
• Jelani Nelson (UC Berkeley)
• Huy Nguyen (Northeastern)
• Ronitt Rubinfeld (MIT)

### Summer School Program

#### Day 1: Monday, August 1, 2022

Time (EDT) Speaker Title
9:30 - 10:00 Coffee and snacks
10:00 - 11:00 Jelani Nelson (UC Berkeley) Dimensionality Reduction I Slides
11:10 - 12:10 Jelani Nelson (UC Berkeley) Dimensionality Reduction II
12:15 - 2:00 Lunch (provided by organizers)
2:00 - 3:00 Michael Mitzenmacher (Harvard) Algorithms with Predictions I
3:10 - 4:10 Michael Mitzenmacher (Harvard) Algorithms with Predictions II
4:10 - 4:30 Short break
4:30 - 5:30 Ronitt Rubinfeld (MIT) Sublinear-time Algorithms I (Slides)

#### Day 2: Tuesday, August 2, 2022

Time (EDT) Speaker Title
9:30 - 10:00 Coffee and snacks
10:00 - 11:00 Sam Hopkins (MIT) Sum of Squares Methods for Statistical Problems I ( Tselil Schramm lecture Notes)
11:10 - 12:10 Sam Hopkins (MIT) Sum of Squares Methods for Statistical Problems II
12:15 - 2:00 Lunch break
2:00 - 3:00 Guy Bresler (MIT) Average-Case Reduction Techniques I Slides
3:10 - 4:10 Guy Bresler (MIT) Average-Case Reduction Techniques II
4:10 - 4:30 Short break
4:30 - 5:30 Ronitt Rubinfeld (MIT) Sublinear-time Algorithms II (Slides)

### Workshop Program

#### Day 1: Wednesday, August 3, 2022

Time (EDT) Speaker Title
9:00 - 9:30 Coffee and snacks
9:30 - 10:15 Andrew McGregor (UMass Amherst) Reconstructing Graphs via Sampling Random Subgraphs[Abstract]
A generic problem in many areas of computer science, and beyond, is to learn from multiple noisy or partial observations. A simple example could be averaging the results of multiple noisy experiments in order to estimate the acidity of an unknown chemical. Another example could be inferring a message from multiple corrupted copies of the message. Natural questions include how many observations are required to learn with the necessary accuracy and how to efficiently make the inference from these observations. In this talk, we focus on an example from theoretical computer science: learning a graph from multiple random subgraphs where each subgraph is the induced subgraph of a set of randomly sampled nodes. We present both upper and lower bounds for a variety of families of graphs. Joint work with Rik Sengupta.
10:15 - 10:40 Sepehr Assadi (Rutgers) A (Slightly) Sublinear Space Streaming Algorithm for Matchings[Abstract]
Given an n-vertex graph G presented as a stream of edges, how well can we approximate the largest matching in G in a limited space? Two trivial solutions exist for this problem: store the entire input in O(n^2) space and find a maximum matching of G at the end the stream, or greedily maintain a maximal matching in the stream in O(n log n) space to obtain a ½-approximation. Despite extensive attention, this still remains the state-of-the-art for this problem, since the introduction of the problem almost two decades ago. In this talk, we present an algorithm with non-trivial albeit small improvement over these long standing bounds: a deterministic (1-o(1))-approximation streaming algorithm in o(n^2) space. Our results rely on tools from extremal combinatorics for analyzing dense graphs such as Szemeredi’s regularity lemma and Fox’s triangle removal lemma. As such, the space improvements obtained by our algorithms over the trivial n^2 bound is only by some 2^{Theta(log* n)} factors. Nevertheless, this shows that the “right” answer to the problem is not what is dictated by the previous bounds. Joint work with Soheil Behnezhad, Sanjeev Khanna, and Huan Li
10:40 - 11:10 Break
11:10 - 11:55 Madhu Sudan (Harvard) Sketching and Streaming Complexity of Constraint Satisfaction Problems[Abstract]Slides
In this talk we will describe some of our recent work giving new upper and lower bounds on the approximability of constraint satisfaction problems (CSPs) in the streaming and sketching settings. (Streaming algorithms process the input with small space, while sketching algorithms are restricted streaming algorithms that have additional composability requirements.) When the sketching algorithms are constrained to sub-polynomial space in the input length we get a fine dichotomy: CSP approximation problems on n variables are either solvable in polylogarithmic space or require at least sqrt(n) space. Our positive results show the broad applicability of what we call bias-based sketching algorithms'', and our negative results work by abstracting and significantly generalizing previous bounds for the Maximum Cut problem. We will also mention some partial extensions to streaming algorithms, linear space lower bounds, ordering CSPs. Based on joint works with Chi-Ning Chou, Sasha Golovnev, Noah Singer, Ameya Velingker and Santhoshini Velusamy.
11:55 - 12:20 Omri Ben Eliezer (MIT) Practical sublinear algorithms for node sampling in large networks[Abstract]Slides
We consider the task of sampling multiple random nodes in a huge real-world network given limited query access to it (where we start with a single “entry point” node in the network, and in each round we may query an already visited node to reveal its neighbors). Existing algorithms, based on random walks, scale poorly in many cases: their amortized query complexity on large social networks can be up to hundreds of queries per sample. In this talk, I will describe a new and substantially different approach, based on a sublinear-query core-periphery decomposition of the network, that in practice improves upon the query complexity of random walk based algorithms by up to a factor of 20. Joint work with Talya Eden, Dimitris Fotakis and Joel Oren (https://arxiv.org/abs/2110.13324, WSDM’22)
12:20 - 2:00 Lunch Break
2:00 - 2:45 C. Seshadhri (UC Santa Cruz) Partition oracles for minor-closed families, or rather, how to coordinate diffusions[Abstract]Slides
Consider the family of bounded degree graphs in any minor-closed family (such as planar graphs). Let d be the degree bound and n be the number of vertices of such a graph. Graphs in these classes have hyperfinite decompositions, where, for sufficiently small eps > 0, one removes eps*dn edges to get connected components of size independent of n. An important tool for sublinear algorithms and property testing for such classes is the partition oracle, introduced by the seminal work of Hassidim-Kelner-Nguyen-Onak (FOCS 2009). A partition oracle is a local procedure that gives consistent access to a hyperfinite decomposition, without any preprocessing. We design a partition oracle for minor-free graph classes that run in poly(d/eps) time per query, resolving an open problem in sublinear graph algorithms. In the innards of our result, we discover (in a somewhat convoluted manner) tools to coordinate diffusions from different vertices. The aim of this talk is to highlight these tools, and hopefully inspire more applications of the techniques.
2:45 - 3:10 Will Rosenbaum (Amherst College) Bias Reduction for Sum Estimation[Abstract]Slides
In classical statistics and distribution testing, it is often assumed that elements can be sampled exactly from some distribution $P$, and that when an element $x$ is sampled, the probability $P(x)$ of sampling $x$ is also known. In this setting, recent work in distribution testing has shown that many algorithms are robust in the sense that they still produce correct output if the elements are drawn from any distribution $Q$ that is sufficiently close to $P$. This phenomenon raises interesting questions: under what conditions is a "noisy" distribution $Q$ sufficient, and what is the algorithmic cost of coping with this noise? In this talk, we will discuss the problem of estimating a sum of $N$ values $x_1, x_2, \ldots, x_N$ when the values can be sampled according to a distribution $Q$ that is (pointwise) $\gamma$-close to some known distribution $P$. For every positive integer $k$ we define an estimator $\zeta_k$ for $\mu = \sum_i x_i$ whose bias is proportional to $\gamma^k$ (where our $\zeta_1$ reduces to the classical Hansen-Hurwitz estimator). As a special case, we show that if $P$ is uniform and all $x_i \in \{0, 1\}$, for any $\varepsilon > 0$, we can estimate $\mu$ to within additive error $\varepsilon N$ using $m = \Theta({N^{1-\frac{1}{k}} / \varepsilon^{2/k}})$ samples, where $k = \lceil (\log \varepsilon)/ \log \gamma\right\rceil$. We then show that this sample complexity is essentially optimal. Interestingly, our upper and lower bounds demonstrate that the sample complexity need not vary uniformly with the desired error parameter $\varepsilon$: for some values of $\varepsilon$, perturbations in its value have no asymptotic effect on the sample complexity, while for other values, any decrease in its value results in an asymptotically larger sample complexity. This talk is based on joint work with Talya Eden, Jakob Hauen, Shyam Narayanan, and Jakub Tetek.
3:10 - 3:25 Short break
3:25 - 4:10 Maryam Aliakbarpour (Boston University / Northeastern University) Estimation of Entropy in Constant Space[Abstract] Slides
Recent work of Acharya et al. (NeurIPS 2019) showed how to estimate the entropy of a distribution D over an alphabet of size k up to \epsilon additive error by streaming over (k/\epsilon^3)⋅polylog(1/\epsilon) i.i.d. samples and using only O(1) words of memory. In this work, we give a new constant memory scheme that reduces the sample complexity to (k/\epsilon^2)⋅polylog(1/\epsilon). We conjecture that this is optimal up to polylog(1/\epsilon) factors.
4:10 - 6:00 Break/Poster Session

#### Day 2: Thursday, August 4, 2022

Time (EDT) Speaker Title
9:00 - 9:30 Coffee and snacks
9:30 - 10:15 Vahab Mirrokni (Google Research) Recent advances in learning and clustering tera-scale graphs[Abstract]
Motivated by the GraphMining@Google, I will present recent advances in building and clustering graphs with trillions of edges. I will start by describing the Adaptive MPC model and show how it can help solve graph problems in constant #rounds as opposed to polylogarithmic #rounds for the MPC model. Then I'll describe a technique for learning similarity graphs with 10x speedup using the idea of 2hop-spanner. Finally, I'll present a tera-scale hierarchical clustering and present a near-linear-time algorithm for this problem with polylogarithmic depth, and show an implementation of this algorithm can beat the state-of-the-art empirically.
10:15 - 11:00 Sanjeev Khanna (UPenn Sublinear Algorithms for Hierarchical Clustering[Abstract] Slides
Hierarchical clustering is a popular unsupervised learning method for organizing data as a rooted tree structure that simultaneously clusters data at multiple levels of granularity. A well-studied recent objective function views the input as a weighted graph with edges indicating similarity between data points, and focuses on finding a tree that minimizes the cost of hierarchical partitioning. The resulting problem is NP-hard, and previous algorithms for approximating this objective require at least linear time/space. In this talk, we will consider algorithms for hierarchical clustering that use sublinear resources (space, time, and communication). Specifically, we will present sublinear algorithms for hierarchical clustering in the streaming model (space), in the query model (time), and in the MPC model (communication). At the core of our algorithmic results is a connection between hierarchical clustering and a suitably relaxed notion of cut sparsifiers of graphs that lends itself to efficient sublinear algorithms. We complement our algorithmic results by establishing nearly matching lower bounds that rule out algorithms with better performance guarantees in each of these models. This is joint work with Arpit Agarwal, Huan Li, and Prathamesh Patil.
11:00 - 11:30 Break
11:30 - 12:15 Michael Kapralov (EPFL) Streaming lower bounds through boolean Fourier analysis[Abstract]Slides
Fourier analytic methods pioneered in the celebrated analysis of the boolean hidden matching problem [Gavinsky et al'09] have been fundamental to many graph streaming lower bounds in the past few years. I will outline key ideas behind the communication complexity lower bound for boolean hidden matching and then talk about recent works that apply the Fourier analytic approach to multi-player settings, obtaining tight lower bounds for approximating MAX-CUT, subgraph counting and more.
12:15 - 12:40 Rajesh Jayaram (Google NYC) Advances in High-Dimensional Geometric Streaming[Abstract] Slides
Given a set of points living in R^d, one of the most successful approaches to sketching geometric optimization problems, such as Minimum Spanning Tree (MST), Earth Mover's Distance (EMD), and Facility Location, is to recursively partition the space, and record only the number of points which fall in each piece of the partition. The result is an embedding from the initial space into a tree metric corresponding to the recursive partition, which has the benefit of being sketchable in low-space (i.e., the cost of the optimization problem in the tree can be estimated without storing the tree). This method, known broadly as the Quadtree algorithm, gives good approximations for low-dimensional spaces, however the approximation degrades as the dimensionality increases. In this talk, we describe new advances in geometric sketching via data-dependent tree embeddings. While the standard Quadtree is oblivious, i.e., the tree embedding is independent of the point set, we describe a new tree embedding whose edge weights are data-dependent. We show how this embedding reduces the approximations for the aforementioned problems from O(log n log d) to tilde{O}(log n). Despite the fact that the embedding itself now depends on the full data-set, we describe how it is still possible to estimate the cost of EMD and MST within the tree using small-space, resulting in the state of the art streaming algorithms for those problems. Based on joint work with Xi Chen, Amit Levi, and Erik Waingarten.
12:40 - 2:00 Lunch Break
2:00 - 2:45 Elena Grigorescu (Purdue) Local Codes for Insertion and Deletion Errors[Abstract]  Slides
Locally Decodable Codes (LDCs) are error-correcting codes for which individual message symbols can be quickly recovered despite errors in the codeword. LDCs for Hamming errors have been studied extensively in the past few decades, where a major goal is to understand the amount of redundancy that is necessary and sufficient to decode from large amounts of error, with small query complexity. In this talk I will describe our recent results on LDCs and their variants, when the errors are in the form of insertions and deletions, which are more difficult to recover from than classical Hamming errors. Local codes against insertions and deletions are well-motivated by recent progress on DNA storage technologies. I will conclude with several intriguing open problems. Based on joint work with Alex Block, Jeremiah Blocki, Kuan Cheng, Shubhang Kulkarni, Xin Li, Yu Zheng, Minshen Zhu.
2:45 - 3:10 Deeparnab Chakrabarty (Dartmouth) Graph Connectivity and Single Element Recovery via Linear Measurements: Rounds v Query Trade-offs[Abstract]Slides
Given an unknown non-negative, non-zero n-dimensional vector with access via linear measurements (dot-products with n-dimensional vectors), the single element recovery problem is to recover a single positive coordinate of this vector. What is the query complexity when only r-rounds of measurements are allowed? In this talk we'll see an optimal answer to this question, and how it is related to finding a spanning forest in undirected graphs using only "cut queries". We will also discuss some open questions. This is joint work with Sepehr Assadi and Sanjeev Khanna
3:10 - 3:40 Break
3:40 - 4:25 Erik Waingarten (UPenn) The Johnson-Lindenstrauss Lemma for Clustering and Subspace Approximation[Abstract]Slides
The Johnson-Lindenstrauss lemma says that for any set of vectors in a high-dimensional space, there exists an embedding into a much lower dimensional space which approximately preserves all pairwise distances. Here, we explore dimensionality reduction for other geometric optimization problems: given a geometric optimization problem (for example, k-means clustering or principal components analysis) is there always an embedding to a lower dimensional space which approximately preserves the cost of the optimization? In this talk, I will overview a few results in this space, and present a new technique for using coreset constructions in order to get improved dimensionality reduction. Joint with Moses Charikar.
4:25 - 4:50 Quanquan Liu (Northwestern) Massively Parallel Algorithms for Small Subgraph Counting[Abstract]Slides
Over the last two decades, frameworks for distributed-memory parallel computation, such as MapReduce, Hadoop, Spark and Dryad, have gained significant popularity with the growing prevalence of large network datasets. The Massively Parallel Computation (MPC) model is the de-facto standard for studying graph algorithms in these frameworks theoretically. Subgraph counting is one such fundamental problem in analyzing massive graphs, with the main algorithmic challenges centering on designing methods which are both scalable and accurate. Given a graph G=(V, E) with n vertices, m edges and T triangles, the first result we will discuss is an algorithm that outputs a (1+epsilon)-approximation to T, with asymptotically optimal round and total space complexity provided T is at least square root of the average degree. Our result gives a quadratic improvement on the bound on T over previous works. Our second result is an O(\log \log n)-round algorithm for exactly counting the number of triangles whose total space usage is parametrized by the arboricity \alpha of the input graph and uses sublinear O(n^{\delta}) space per machine for any constant \delta > 0. In addition to our theoretical results, we simulate our triangle counting algorithms in real-world graphs obtained from the Stanford Network Analysis Project (SNAP) database. Our results show that both our approximate and exact counting algorithms exhibit improvements in terms of round complexity and approximation ratio, respectively, compared to two previous widely used algorithms for these problems. To appear in APPROX 2022. Joint work with Amartya Shankha Biswas, Talya Eden, Slobodan Mitrović, Ronitt Rubinfeld.

#### Day 3: Friday, August 5, 2022

Time (EDT) Speaker Title
9:00 - 9:30 Coffee and snacks
9:30 - 10:15 Vincent Cohen-Addad (Google Research) Sublinear time algorithms for Euclidean clustering coresets and correlation clustering[Abstract]Slides
For over 20 years, extracting information from massive data has been a major challenge in various fields. How fast can we obtain meaningful information from large datasets? In this talk we will survey recent results on two classic problems: - Computing the mean of a set of points in R^d, for possibly large d. - Computing a clustering of a dataset given pairwise similarities between the data elements. We will see how this can be done in sublinear time without sacrificing much on the quality of the solution. The work is based on the following works: - Improved Coresets and Sublinear Algorithms for Power Means in Euclidean Spaces. Vincent Cohen-Addad, David Saulpic, Chris Schwiegelshohn: NeurIPS 2021: -Correlation Clustering in Constant Many Parallel Rounds. Vincent Cohen-Addad, Silvio Lattanzi, Slobodan Mitrovic, Ashkan Norouzi-Fard, Nikos Parotsidis, Jakub Tarnawski: ICML 2021
10:15 - 10:40 Jessica Shi (MIT) Theoretically and Practically Efficient Parallel Nucleus Decomposition[Abstract]Slides
We study the nucleus decomposition problem, which has been shown to be useful in finding dense substructures in graphs. We present a novel parallel algorithm that is efficient both in theory and in practice. Our algorithm achieves a work complexity matching the best sequential algorithm while also having low depth (parallel running time), which significantly improves upon the only existing parallel nucleus decomposition algorithm (Sariyuce et al., PVLDB 2018). The key to the theoretical efficiency of our algorithm is a new lemma that bounds the amount of work done when peeling cliques from the graph, combined with the use of theoretically-efficient parallel algorithms for clique listing and bucketing. We introduce several new practical optimizations, including a new multi-level hash table structure to store information on cliques space-efficiently and a technique for traversing this structure cache-efficiently. On a 30-core machine with two-way hyper-threading on real-world graphs, we achieve up to a 55x speedup over the state-of-the-art parallel nucleus decomposition algorithm by Sariyuce et al., and up to a 40x self-relative parallel speedup. We are able to efficiently compute larger nucleus decompositions than prior work on several million-scale graphs for the first time.
10:40 - 11:10 Break
11:10 - 11:55 Sepideh Mahabadi (Microsoft Research) Sampling a Near Neighbor in High Dimensions — Who is the Fairest of Them All?[Abstract]
Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points 𝑆 and a radius parameter 𝑟 > 0, the 𝑟 -near neighbor (𝑟 -NN) problem asks for a data structure that, given any query point 𝑞, returns a point 𝑝 within distance at most 𝑟 from 𝑞. In this paper, we study the 𝑟 -NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance 𝑟 from the query should have the same probability to be returned. Locality sensitive hashing (LSH), the theoretically strongest approach to similarity search in high dimensions, does not provide such a fairness guarantee. In this work, we show that LSH based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. This is a joint work with Martin Aumuller, Sariel Har-Peled, Rasum Pagh, and Francesco Silvestri.
11:55 - 12:20 Morteza Zadimoghaddam (Google Research) Data compression and submodular optimization for feature engineering applications[Abstract]
The emergence of massive datasets signifies the importance of data efficiency algorithms. Feature engineering, a common method of reaching data efficiency, is one of the main ingredients in many machine learning pipelines. As an example of feature engineering, we review the role of combinatorial optimization techniques in designing subset selection algorithms. These methods focus on summarizing input features into concise sketches with provable performance guarantees. We look at some recent advances in submodular optimization and its potential application in this area. Another approach to data efficiency is to focus on the content of the features and develop compression techniques that retain the most important pieces of input data with regards to the classification tasks. >
12:20 - 2:00 Lunch Break
2:00 - 2:45 David Woodruff (CMU) Memory Bounds for the Experts Problem[Abstract] Slides
2:45 - 3:10 Paul Valiant (Purdue) Mean Estimation in Low and High Dimensions[Abstract]Slides
3:40 - 4:25 Daniel Kane (UCSD) Ak Testing of Distributions[Abstract] Slides
4:25 - 4:50 Zihan Tan (Rutgers University) Query Complexity of the Metric Steiner Tree Problem[Abstract]Slides