Density-Based clustering in mapReduce with guarantees on parallel time, space, and solution quality

Document Type : Research Paper

Authors

1 Department of Computer Engineering, Sharif University of Technology, Tehran, Iran

2 Department of Computer Engineering, Sharif University of Technology, Tehran, Iran. School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran.

10.22108/toc.2024.138377.2091

Abstract

A well-known clustering problem called Density-Based Spatial Clustering of Applications with Noise~(DBSCAN) involves computing the solutions of at least one disk range query per input point, computing the connected components of a graph, and bichromatic fixed-radius nearest neighbor. MapReduce class is a model where a sublinear number of machines, each with sublinear memory, run for a polylogarithmic number of parallel rounds.
 
Most of these problems either require quadratic time in the sequential model or are hard to compute in a constant number of rounds in MapReduce. In the Euclidean plane, DBSCAN algorithms with near-linear time and a randomized parallel algorithm with a polylogarithmic number of rounds exist.
 
We solve DBSCAN in the Euclidean plane in a constant number of rounds in MapReduce, assuming the minimum number of points in range queries is constant and each connected component fits inside the memory of a single machine and has a constant diameter.

Keywords

Main Subjects


[1] J. Gan and Y. Tao, On the hardness and approximation of Euclidean DBSCAN, ACM Trans. Database Syst., 42 no. 3 (2017) 1–45.
[2] Y. Wang, Y. Gu and J. Shun, Theoretically-efficient and practical parallel DBSCAN, In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, (2020) 2555–2571.
[3] N. Bansal, A. Blum and S. Chawla, Correlation clustering, Mach. Learn., 56 (2004) 89–113.
[4] N. Ailon, M. Charikar and A. Newman, Proofs of conjectures in “aggregating inconsistent information: Ranking and clustering”, Technical report, Technical Report TR-719-05, Department of Computer Science,
Princeton University, 2005.
[5] S. Chawla, K. Makarychev, T. Schramm and G. Yaroslavtsev, Near optimal LP rounding algorithm for correlation clustering on complete and complete k-partite graphs, In Proceedings of the 2015 ACM Symposium on Theory of Computing, ACM, New York, (2015) 219–228.
[6] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, (1996) 226–231.
[7] J. H. Friedman, J. L. Bentley and R. A. Finkel, An algorithm for finding best matches in logarithmic expected time, ACM Transactions on Mathematical Software, 3 no. 3 (1977) 209–226.
[8] J. L. Bentley, Survey of techniques for fixed radius near neighbor searching, Technical report, Stanford Linear Accelerator Center, Calif.(USA), 1975.
[9] J. Han, M. Kamber and J. Pei, Data mining concepts and techniques third edition, University of Illinois at Urbana-Champaign Micheline Kamber Jian Pei Simon Fraser University, 2012.
[10] R. C. Hoetzlein, Fast fixed-radius nearest neighbors: interactive million-particle fluids, In GPU Technology Conference, 18 (2014) pp. 2.
[11] M. de Berg, A. Gunawan and M. Roeloffzen, Faster DBSCAN and HDBSCAN in low-dimensional Euclidean spaces, Internat. J. Comput. Geom. Appl., 29 no. 1 (2019) 21–47.
[12] E. Schubert, J. Sander, M. Ester, H. P. Kriegel and X. Xu, DBSCAN revisited, revisited: why and how you should (still) use DBSCAN, ACM Trans. Database Syst., 42 no. 3 (2017) 21 pp.
[13] P. K. Agarwal, K. Fox, K. Munagala and A. Nath, Parallel algorithms for constructing range and nearest-neighbor searching data structures, In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, (2016) 429–440.
[14] S. Aghamolaei, F. Baharifard and M. Ghodsi, Geometric spanners in the MapReduce model, Computing and combinatorics, Lecture Notes in Comput. Sci., Springer, Cham, 2018 675–687.
[15] S. Aghamolaei, V. Keikha, M. Ghodsi and A. Mohades, Windowing queries using Minkowski sum and their extension to MapReduce, J. Supercomput., 77 (2021) 936–972.
[16] S. Aghamolaei, V. Keikha, M. Ghodsi and A. Mohades, Sampling and sparsification for approximating the packedness of trajectories and detecting gatherings, Int. J. Data Sci. Anal., 15 no. 2 (2023) 201–216.
[17] J. H. Reif and S. Sen, Optimal randomized parallel algorithms for computational geometry, Algorithmica, 7 no. 1 (1992) 91–117.
[18] F. Frei and K. Wada, Efficient circuit simulation in MapReduce, In Proceedings of the 30th International Symposium on Algorithms and Computation, Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019 pp. 21.
[19] F. Frei and K. Wada, Efficient deterministic MapReduce algorithms for parallelizable problems, Journal of Parallel and Distributed Computing, 177 (2023) 28–38.
[20] M. Garofalakis, J. Gehrke and R. Rastogi, Data stream management: A brave new world, In Data Stream Management: Processing High-Speed Data Streams, (2016) 1–9.
[21] L. Arge, External memory data structures, Handbook of massive data sets, Massive Comput., 4, Kluwer Acad. Publ., Dordrecht, 2002 313–357.
[22] G. Yaroslavtsev and A. Vadapalli, Massively parallel algorithms and hardness for single-linkage clustering under ℓp -distances, In Proceedings of the 35th International Conference on Machine Learning, (2018).
[23] D. Nanongkai and M. Scquizzato, Equivalence classes and conditional hardness in massively parallel computations, Distrib. Comput., 35 no. 2 (2022) 165–183.
[24] A. Andoni, A. Nikolov, K. Onak and G. Yaroslavtsev, Parallel algorithms for geometric graph problems, In STOC’14—Proceedings of the 2014 ACM Symposium on Theory of Computing, ACM, New York, (2014) 574–583.
[25] G. Narasimhan and M. Smid, Geometric spanner networks, Cambridge University Press, 2007.
[26] C. D. Toth, J. O’Rourke and J. E. Goodman, Handbook of discrete and computational geometry, CRC press, second edition edition, 2004.
[27] M. T. Goodrich, N. Sitchinava and Q. Zhang, Sorting, searching, and simulation in the MapReduce frame-work, In Proceedings of the 22nd International Symposium on Algorithms and Computation, Springer Science & Business Media, 7074 (2011) 374–383.
[28] L. Barba, P. Bose, M. Damian, R. Fagerberg, W. L. Keng, J. O’Rourke, A. van Renssen, P. Taslakian, S. Verdonschot and G. Xia, New and improved spanning ratios for yao graphs, J. Comput. Geom., 6 no. 2 (2015) 19–53.
[29] D. Bakhshesh and M. Farshi, A lower bound on the stretch factor of yao graph y4, Scientia Iranica, 29 no. 6 (2022) 3244–3248.
[30] R. E. Tarjan and U. Vishkin, An efficient parallel biconnectivity algorithm, SIAM J. Comput., 14 no. 4 (1985) 862–874 (1985).
[31] K. Bringmann, Fine-grained complexity theory: Conditional lower bounds for computational geometry, In Proceedings of the 17th conference on computability in Europe, Springer, (2021) 60–70.
[32] S. Guha and S. Khuller, Approximation algorithms for connected dominating sets, Algorithmica, 20 (1998) 374–387.
[33] H. Karloff, S. Suri and S. Vassilvitskii, A model of computation for MapReduce, In Proceedings of the 21st annual ACM-SIAM symposium on Discrete Algorithms, SIAM, Philadelphia, PA, (2010) 938–948.
[34] P. Beame, P. Koutris and D. Suciu, Communication steps for parallel query processing, J. ACM, 64 no. 6 (2017) 58 pp.
[35] J. L. Bentley, D. F. Stanat and E. H. Williams Jr, The complexity of finding fixed-radius near neighbors, Information Processing Lett., 6 no. 6 (1977) 209–212.
[36] S. Har-Peled, Geometric approximation algorithms, Mathematical Surveys and Monographs, 173, American Mathematical Society, Providence, RI, 2011.
[37] S. Suri and S. Vassilvitskii, Counting triangles and the curse of the last reducer, In Proceedings of the 20th international conference on World wide web, (2011) 607–614.

Articles in Press, Corrected Proof
Available Online from 23 April 2024
  • Receive Date: 12 July 2023
  • Revise Date: 06 April 2024
  • Accept Date: 06 March 2024
  • Published Online: 23 April 2024