Workshop

ERC BigFastData Workshop

October 5-6, 2023
Ecole polytechnique, France

At the end of our 5-year ERC project (BigFastData), we are organizing a workshop on October 5-6, 2023, at Ecole Polytechnique, France, to exchange results and ideas with close colleagues and foster future collaboration.

A tentative program is shown below. Each invited speaker will give either a long talk of 45 minutes or a short talk of 30 minutes, followed by 15 minutes of Q&A, and will also take part in the discussion of other sessions.

We look forward to seeing you in Paris in early October!

Schedule

Day 1

09:00 – 10:00

Yanlei Diao

Talk: A Deep Learning-Enhanced Multi-Objective Optimizer for Large-Scale Cloud Analytics

Abstract:

Data analytics in the cloud has become an integral part of enterprise businesses. Cloud analytics systems, however, still lack the ability to take user objectives such as performance goals and budgetary constraints and automatically configure analytical jobs to achieve these objectives. This talk presents the design of a next-generation cloud optimizer that automatically determines a cluster configuration, including CPU and memory resources as well as numerous other parameters, to best meet the user objectives. At the core of our work is a Deep Learning-based modeling approach that automatically learns a model for each user objective as a complex function of the underlying parameters and a principled multi-objective optimization (MOO) approach that computes a Pareto-optimal set of configurations to reveal tradeoffs between different objectives. We further devise novel optimizations to enable smart recommendations within a second. Evaluation using production workloads shows that our optimizer could reduce task latency by 36-67% and, at the same time, cloud cost by 37-75%, while running in under 200 msec. This talk closes by pointing out new areas that could benefit from our techniques, including autonomics of cloud services and green machine learning.

Biography: 

Yanlei Diao is Professor of Computer Science at Ecole Polytechnique, France, and an adjunct professor at the University of Massachusetts Amherst, USA, where she was previously employed for 17 years, with tenure. She also holds a part-time position at Amazon AWS as an Amazon Scholar. She received her Ph.D. in Computer Science from the University of California, Berkeley, in 2005. Her research interests lie in big data analytics and scalable, intelligent information systems, with a focus on optimization in cloud analytics, data stream analytics, explainable anomaly detection, interactive data exploration, genomic data analysis, and uncertain data management.

Prof. Diao was a recipient of the 2016 ERC Consolidator Award, 2013 CRA-W Borg Early Career Award (one female computer scientist selected each year for outstanding contributions), IBM Scalable Innovation Faculty Award, and NSF Career Award. She has given keynote speeches at the ACM DEBS Conference, Berlin Institute for the Foundations of Learning and Data, Max Planck Institut (MPI) Informatik, IBM Almaden Research Center, Naver Research Labs, Northeastern University, Technische Universitaet Darmstadt, and University of Texas at Austin. She has served as Chair of the ACM SIGMOD Awards Committee, Chair of the SIGMOD Research Highlight Award Committee, Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, and member of the SIGMOD and PVLDB Executive Committees. She was PC Co-Chair of IEEE ICDE 2017 and ACM SoCC 2016, and served on the organizing committees of SIGMOD, PVLDB, and CIDR, as well as on the program committees of many international conferences and workshops.

10:00 – 10:45

Chenghao Lyu

Talk: An Adaptive, Multi-Resolution, and Multi-Objective Parameter Tuning Approach for Spark SQL

Abstract:

Despite extensive research on tuning parameters to optimize latency and cost in big data analytics systems, the distinctive design of Apache Spark and Spark SQL presents additional challenges. First, Spark’s configuration parameters impact SQL execution at various granularity levels. Second, the parallel SQL submodules (query stages) will compete for resources within the same Spark context. Moreover, the Adaptive Query Execution (AQE) mechanism re-optimizes the query plan with real-time statistics during runtime. Unfortunately, the SQL parameters cannot adapt to these changes, resulting in sub-optimal scenarios.
In this talk, I will introduce a novel solution, the first to address these challenges, that enables adaptive, multi-resolution, and multi-objective parameter tuning for Spark. It leverages a model server for performance modeling that takes the runtime status in the context into account, and a meta-optimizer for efficient latency and cost optimization at job submission and during query runtime.

Biography:

Chenghao Lyu is a Ph.D. candidate at the University of Massachusetts Amherst in the USA and a scientific collaborator at Ecole Polytechnique in France.

10:45 – 11:15

Coffee break + Tour of Turing Building

11:15 – 12:15

Prashant Shenoy

Talk: A CarbonFirst Approach for Decarbonizing Cloud Computing

Abstract:

The exponential growth of cloud computing has been a defining trend of our time, fueled by rapidly growing demands from data-intensive and machine learning workloads. Despite the end of Dennard scaling, the cloud’s energy demand grew more slowly than expected over the past decade due to the aggressive implementation of energy-efficiency optimizations. Unfortunately, there are few significant remaining optimization opportunities using traditional methods, and moving forward, the cloud’s continued exponential growth will translate into rising energy demand, which, if left unchecked, will translate to increasing carbon emissions.

In this talk, I will argue for a CarbonFirst approach to designing cloud computing systems by making carbon efficiency a first-class design metric, similar to traditional metrics of performance and reliability. I will explain how today’s systems can be made first carbon-aware by exposing energy and carbon usage information to software platforms and then made carbon-efficient by providing control over the system’s carbon usage. I will present an initial design of a system to enable such carbon awareness and management and present several application case studies on how modern cloud applications can employ these mechanisms to reduce their carbon footprint. I will end with open research challenges in the emerging field of sustainable computing and also discuss how these concepts apply to large-scale data management systems.

Biography:

Prashant Shenoy is currently a Distinguished Professor and Associate Dean in the College of Information and Computer Sciences at the University of Massachusetts Amherst. He received the B.Tech. degree in Computer Science and Engineering from the Indian Institute of Technology, Bombay and the M.S. and Ph.D. degrees in Computer Science from the University of Texas, Austin. His research interests lie in distributed systems and networking, with a recent emphasis on cloud and green computing. He has been the recipient of several best paper awards at leading conferences, including a Sigmetrics Test of Time Award. He serves on the editorial boards of several journals and has served as the program chair of over a dozen ACM and IEEE conferences. He is a fellow of the ACM, the IEEE, and the AAAS.

12:30 – 2:00 pm

Lunch break

2:00 – 3:00

Ioana Manolescu

Talk: Data Graphs for Data Journalism: Querying and Abstractions

Abstract:

Data Journalism, as well as Open Source Intelligence, need to combine and analyze data from highly heterogeneous sources, which can be structured, semistructured, or unstructured (text). This integration setting resembles data lakes, yet it is more challenging in cases when the data structure is partial or lacking. In the SourcesSay AI ANR Chair (https://sourcessay.inria.fr), we developed a novel method to uniformly represent and interconnect data of any model as graphs, enabled also by Information Extraction. This model enables: (a) a direct way of building a graph warehouse from any set of data sources; (b) querying the data graph by combining structured and unstructured search; for the latter, a known hard problem, we propose efficient best-effort algorithms; (c) computing ER-like models, called abstractions, to allow non-expert users to discover data at first glance.

Joint work with A. Anadiotis, O. Balalau, N. Barret, M. Mohanty and others

Biography:

Ioana Manolescu is a senior researcher at Inria Saclay and a part-time professor at Ecole Polytechnique, France. She is the lead of the CEDAR INRIA team focusing on rich data analytics at cloud scale. She is also the scientific director of LabIA, a program run by the French government fostering the adoption of AI solutions in the local and national French public administration.

Ioana is a Senior ACM Member and an associate editor of the VLDB Journal. She has been a recipient of the SIGMOD 2020 Contributions Award, a member of the PVLDB Endowment Board of Trustees, chair of the IEEE ICDE conference, and a program chair of EDBT, SSDBM, ICWE among others.

She has co-authored more than 150 articles in international journals and conferences and co-authored books on “Web Data Management” and on “Cloud-based RDF Data Management”.
Her main research interests include algebraic and storage optimizations for semistructured data, in particular Semantic Web graphs; novel data models and languages for complex data management; and data models and algorithms for fact-checking and data journalism, a topic where she is collaborating with journalists from Le Monde and RadioFrance, the national French radio organization.

3:00 – 3:45

Vincent Jacob

Talk: Anomaly Detection on High-Dimensional Time Series in the AIOps Domain

Abstract:

The widespread adoption of Internet-based services by software companies, as well as the scale and complexity at which they operate, have made incidents in their IT operations increasingly likely, diverse and impactful. This has led to the rapid development of a central aspect of the “Artificial Intelligence for IT Operations” (AIOps) domain, focusing on detecting abnormal patterns in vast amounts of multivariate time series (MTS) data generated by service entities. Although numerous MTS anomaly detection methods have been developed, the state of the art still presents some limitations due to the unique challenges posed by AIOps. These challenges include 1) the presence of complex, noisy and diverse normal behaviors, 2) the wide variety of anomaly types and difficulty in providing detailed anomaly labels, and 3) the need to generalize or quickly adapt to significant shifts in normal behavior from continuously evolving environments. In this work, we identify the limitations of the state of the art with respect to those challenges, and aim to address them through fast domain adaptation and weakly-supervised learning techniques.

Biography:

Vincent Jacob is a PhD candidate at Ecole Polytechnique.

3:45 – 4:00

Coffee break

4:00 – 5:00

Themis Palpanas

Talk: Complex High-Dimensional Vector Analytics at Scale: A Promising Future

Abstract:

There is an increasingly pressing need, in several applications across diverse domains, for techniques able to analyze very large collections of high-dimensional vectors. Examples of such applications come from scientific, manufacturing and social domains, where in several cases they need to apply machine learning techniques for knowledge extraction. It is not unusual for these applications to involve vector collections on the order of hundreds of millions to billions, which are oftentimes not analyzed in their full detail due to their sheer size. In this talk, we describe examples of data sources that produce high-dimensional vectors, and focus on two popular types: data series and deep network embeddings. We discuss the solutions that have been independently developed and are used for each one of these types, and argue that the data series solutions are the overall winners, even on general high-d datasets. Finally, we describe the current efforts in this area, as well as the open research problems.

Biography:

Themis Palpanas is an elected Senior Member of the French University Institute (IUF), a distinction that recognizes excellence across all academic disciplines, and Distinguished Professor of computer science at the University Paris Cite (France), where he is director of the Data Intelligence Institute of Paris (diiP), and director of the data management group, diNo. He received the BS degree from the National Technical University of Athens, Greece, and the MSc and PhD degrees from the University of Toronto, Canada. He has previously held positions at the University of California at Riverside, University of Trento, and at IBM T.J. Watson Research Center, and visited Microsoft Research, and the IBM Almaden Research Center. His interests include problems related to data science (big data analytics and machine learning applications). He is the author of 11 US patents, and 2 French patents. He is the recipient of 3 Best Paper awards, and the IBM Shared University Research (SUR) Award. He has served (among others) on the VLDB Endowment Board of Trustees, as Editor in Chief for the BDR Journal and for PVLDB 2025, PC Chair for IEEE BigData 2023, Research PC Vice Chair for ICDE 2020, Associate Editor for the TKDE and IDA journals, and for PVLDB 2024, 2022, 2019 and 2017, as well as General Chair for VLDB 2013.

5:00 – 6:00

Campus tour

7:00 pm

Dinner in Paris

Day 2

9:00 – 10:00

Michael Franklin

Talk: Designing Data Markets: Platforms for the Data Economy

Abstract:

While it is widely acknowledged that data is one of the most valuable commodities of the 21st century, the development of market platforms for data is still in its early stages. This talk will look at several different types of data markets to see if we can identify some common requirements and functions and use those as the motivation for the design of an architectural framework for such platforms. Particular attention will be paid to areas where data management technologies such as those of interest to the VLDB community can play a role. The goal is to identify new opportunities for data systems research as well as to develop ways to help users and organizations unlock the value and potential of their data resources to enable data-driven discovery.

Biography:

Michael J. Franklin is the Morton D. Hull Distinguished Service Professor of Computer Science and Sr. Advisor to the Provost for Computing and Data Science at the University of Chicago. At Chicago he served as Liew Family Chair of the Computer Science Department and is a co-founder of the Data Science Institute. He is currently on sabbatical as a Visiting Scientist with the Database Group at MIT. Previously he was Thomas M. Siebel Professor of Computer Science at the University of California, Berkeley, where he also served a term as Chair of the Computer Science Division. As Co-Director of the Algorithms, Machines and People Laboratory (AMPLab) he was one of the original creators of Apache Spark, a leading open source platform for advanced data analytics and machine learning that was initially developed at the lab. He is a Member of the American Academy of Arts and Sciences and is a Fellow of the ACM and the American Association for the Advancement of Science. He received the 2022 ACM SIGMOD Systems Award with the team that developed Spark, and is a two-time recipient of the ACM SIGMOD “Test of Time” award. He holds a Ph.D. in Computer Sciences from the Univ. of Wisconsin (1993).

10:00 – 10:30

Coffee break

10:30 – 11:30

Peter Haas

Talk: In-Database Decision Support: Opportunities and Challenges

Abstract:

Decision makers in a broad range of domains, such as finance, transportation, manufacturing, and healthcare, often need to derive optimal decisions given a set of constraints and objectives. Traditional solutions to such constrained optimization problems are typically application-specific, complex, and do not generalize. Further, the usual workflow requires slow, cumbersome, and error-prone data movement between a database and predictive-modeling and optimization packages. All of these problems are exacerbated by the unprecedented size of modern data-intensive optimization problems. The emerging research area of in-database prescriptive analytics aims to provide domain-independent, declarative, and scalable approaches powered by the system where the data typically resides: the database. The goal is to open up prescriptive analytics to a much broader community, amplifying its benefits. In the context of our prior and ongoing work on “package queries”, an important class of optimization problems, we show how deep integration between the DBMS, predictive models, and optimization software creates opportunities for rich prescriptive-query functionality with good scalability and performance. In particular, we discuss some strategies for addressing key challenges related to usability, scalability, data uncertainty, and dynamic environments with changing data and models.

Biography:

Peter J. Haas is a Professor in the Manning College of Information and Computer Sciences at the University of Massachusetts Amherst. Prior to that, he was a Principal Research Staff Member at the IBM Almaden Research Center, where from 1987 to 2017 he pursued research at the interface of information management, applied probability, statistics, and computer simulation. He was also a Consulting Professor in the Department of Management Science and Engineering at Stanford University from 1992 to 2017. He was designated an IBM Master Inventor in 2012, and his ideas have been incorporated into products including IBM’s DB2 database system. He is a Fellow of both ACM and INFORMS, and has received a number of awards from IBM and both the Simulation and Computer Science communities, including VLDB 2016 and EDBT 2018 Best Paper Awards and the 2007 ACM SIGMOD 10-year Best Paper Award for his work on sampling-based exploration of massive datasets. Other work has included the Splash platform for collaborative modeling and simulation, techniques for massive-scale data analytics (matrix completion, dynamic graph analysis, and declarative machine learning), Monte Carlo methods for scalable querying and Bayesian learning over massive uncertain data, automated relationship discovery in databases, query optimization methods, and autonomic computing. He serves on the editorial boards of ACM Transactions on Database Systems and ACM Transactions on Modeling and Computer Simulation, and was an Associate Editor for the VLDB Journal from 2007 to 2013 and for Operations Research from 1995 to 2017. He is the author of over 140 conference publications, journal articles, and books, and has been granted over 30 patents.

11:30 – 12:15

Aman Raghu Malali

Talk: Predictive ML model maintenance

Abstract:

Machine learning (ML) models are used in a multitude of real world applications. These models are usually created assuming that the datasets they are trained upon are representative of the data they will encounter when deployed in production. A change in the underlying distribution of the features that an ML model relies upon can have disastrous consequences. These data distribution changes can be hard to spot and their impact on the quality of the resulting ML model predictions can be even harder to determine. Current model maintenance pipelines for dealing with this type of “data drift” typically either wait until ML model errors are observed before taking action or retrain the model periodically according to some ad hoc schedule. The former approach can incur serious damage by the time errors are observed and corrective action is taken, and the latter approach can either waste resources if retraining is too frequent or incur damage if retraining is not frequent enough. Our current research aims at developing a model maintenance pipeline that is predictive, preemptively re-training models before excessive model losses occur while minimizing false alarms.

Biography:

Aman Raghu Malali is a PhD candidate at the University of Massachusetts Amherst.

12:30 – 2:00 pm

Lunch break

2:00 – 3:00

Minos Garofalakis

Talk: Supporting Real-time Analytics over Big Streaming Data

Abstract:

Massive, continuous data streams arise naturally in several dynamic big data analytics applications, such as enabling observability for complex distributed systems, network-operations monitoring in large ISPs, or incremental federated learning over dynamic distributed data. In such settings, usage information from numerous devices needs to be continuously collected and analyzed for interesting trends and real-time reaction to different conditions (e.g., anomalies/hotspots, DDoS attacks, or concept drifts). The massive, distributed nature of these data streams raises important memory-, time-, and communication-efficiency issues, making it critical to carefully optimize the use of available computation and communication resources. In this talk, I will provide an overview of centralized and distributed data streaming models and some of the key algorithmic tools in the space of streaming data analytics, along with relevant applications and directions for future research.

Biography:

Minos Garofalakis is the Director of the Information Management Systems Institute (IMSI) at the ATHENA Research Center, a Professor at the Technical University of Crete, and also works as a (part-time) research consultant for Huawei’s Edinburgh Research Center. He received his PhD in Computer Science from the University of Wisconsin-Madison in 1998, and previously held senior/principal researcher positions at Bell Labs, Lucent Technologies (1998-2005), Intel Research Berkeley (2005-2007), and Yahoo! Research (2007-2008). He also held an Adjunct Associate Professor position at the EECS Department of UC Berkeley (2006-2008) and worked as a research consultant for Amazon Web Services (2022-2023).

Minos’s research interests are in the broad areas of Big Data Analytics and Large-Scale Machine Learning. He has published over 175 refereed scientific papers in these areas, is the co-editor of a volume on Data Stream Management published by Springer in 2016, and has delivered several invited keynote talks and tutorials in major international events. His work has resulted in 36 US Patent filings (29 patents issued) for companies such as Lucent, Yahoo!, and AT&T. Google Scholar gives over 16,000 citations to his work and an h-index value of 69. Minos is a Fellow of the ACM and IEEE, a Member of Academia Europaea, and a recipient of the TUC “Excellence in Research” Award (2015), the 2009 IEEE ICDE Best Paper Award, the Bell Labs President’s Gold Award (2004), and the Bell Labs Teamwork Award (2003).

3:00 – 3:45

Qi Fan

Talk: Multi-Objective Optimization for Spark-based Data Analytics

Abstract:

Spark has been widely used for data analytics in the cloud. Determining an optimal configuration of a Spark physical plan based on user-specified objectives is a complex task, challenging in three respects. First, a Spark physical plan, or query, can be represented as a Directed Acyclic Graph (DAG) of “query stages,” where the parameters of each stage are controlled at two granularities: the query (i.e., Spark-context parameters such as resources, which are shared among all stages) and the stage (i.e., parameters that can differ from stage to stage). The correlation of parameters across multiple granularities complicates the performance tuning of a query. Second, the parameters of each stage face timing constraints: Spark-context parameters must be set at compile time and cannot change during runtime, while stage-level parameters can be modified during runtime. Third, Multi-Objective Optimization (MOO) is necessary when there are multiple, potentially conflicting user performance objectives such as latency and cost. This talk focuses on the algorithm design to return Pareto optimal configurations for a query with parameters under multi-granularity control and different timing constraints. It captures tradeoffs among various objectives and recommends an optimal configuration based on user preferences. The expectation is to provide recommendations for all stages in a query within a few seconds.

Biography:

Qi Fan is a PhD candidate at Ecole Polytechnique.

3:45 – 4:00

Coffee break

4:00 – 5:00

Nesime Tatbul

Talk: Improving Data Systems through Machine Learning and Observability

Abstract:

Recent advances in machine learning techniques have led to an explosion in their applications across all fields of computer science. In performance-critical domains such as database systems, ML is becoming a key enabler for enhanced efficiency and self-adaptivity in the face of complex and dynamic workloads. For example, it has been shown that learned query optimizers can outperform even highly tuned commercial optimizers, and we are now starting to see industry adoption. This talk will cover results from our recent research in ML-enhanced query optimization with contributions in three practical areas: learning, integration, and debugging. We will then discuss new challenges and opportunities as we extend these results to increasingly complex settings, including the need for better support for collecting and managing observability data.

Biography:

Nesime Tatbul is a senior research scientist at Intel’s Parallel Computing Lab and MIT’s Computer Science and Artificial Intelligence Lab. She serves as a research lead and industry PI for Intel’s MIT university program on Data Systems and AI. Previously, she worked as a faculty member at ETH Zurich, after receiving her Ph.D. degree in computer science from Brown University. Her research interests are broadly in large-scale data management systems and modern data-intensive applications, with a recent focus on learned data systems, time series analytics, and observability data management. Nesime is the recipient of an IBM Faculty Award and a PVLDB Distinguished Associate Editor Recognition, and a co-recipient of an ACM SIGMOD Research Highlight Award, an ACM SIGMOD Best Paper Award, two ACM SIGMOD Best Demonstration Awards, and an ACM DEBS Grand Challenge Award. She has been an active member of the database research community for 20+ years, serving in various leadership roles for the VLDB Endowment, ACM SIGMOD, and others.

5:00 – 5:30

A small hike to the train station (RER B Lozère)
