Professor of Computer Science and Engineering and Data Science
Department of Computer Science and Engineering
Center for Data Science
Courant Institute of
New York University
370 Jay Street, 11th floor
Brooklyn, NY 11201
email: [firstname].[lastname] [ at ] nyu [dot] edu
- Our paper "Proactive Discovery of Fake News Domains from Real-Time Social Media Feeds", which was presented at the Workshop on Computational Methods in
Online Misbehavior (CyberSafety), The Web Conference, 2020, received
the best-paper award.
- In 2018-2019, I was on sabbatical at INRIA, visiting Ioana Manolescu and
- Our paper "A Large-scale Study about Quality and Reproducibility of
Jupyter Notebooks" has received an ACM SIGSOFT Distinguished Paper
Award and a FOSS Impact Paper Award.
Research Interests and Projects
My research is in the general area of
data management. I develop methods and systems that enable a wide range of users to obtain trustworthy insights from data. This work spans topics in large-scale data analysis and integration, visualization, machine learning, provenance management, and web information discovery, as well as application areas including urban analytics, predictive modeling, and computational reproducibility.
- Dataset search. Recent years have seen an explosion in our
ability to collect and catalog immense amounts of data about our
environment, society, and populace. Moreover, with the push towards
transparency and open data, scientists, governments, and
organizations are increasingly making structured data available on
the Web. Combined with advances in analytics and machine learning,
the availability of such data should in theory allow us to make
progress on many of our most important scientific and societal
questions. However, this opportunity is often missed due to a
central technical barrier: it is currently nearly impossible for
domain experts to weed through the vast amount of publicly available
information to discover datasets that are needed for their specific
application. We are working on techniques and new kinds
of queries to support effective dataset discovery. We have also
developed Auctus, a dataset search
engine that supports a rich set of data discovery queries, including
for data augmentation.
- Human-assisted automated machine learning. The Data-Driven
Discovery of Models (D3M) program aims to
develop automated model discovery systems that enable users with
subject matter expertise but no data science background to create
empirical models of real, complex processes. At the NYU VIDA
Center we address two important challenges in automating machine
learning: i) pipeline synthesis and model understanding, and ii)
dataset search and discovery. See
https://vida.engineering.nyu.edu/research/d3m for details.
- Debugging Science and Computational Pipelines. Applications
in domains ranging from large-scale simulations in astrophysics and
biology to enterprise analytics rely on computational pipelines. A
pipeline consists of modules and their associated parameters, data
inputs, and outputs, which are orchestrated to produce a set of
results. If some modules produce unexpected outputs, the pipeline can
crash or lead to incorrect results. Debugging these pipelines is
difficult because there are many potential sources of error, including
bugs in the code, input data, software updates, and improper parameter
settings. We have developed algorithms and an open-source system, BugDoc, to
automatically infer the root causes and derive succinct explanations
of failures and behavior for black-box pipelines.
- Previous projects
- ACM SIGMOD Contributions Award, 2020.
- AAAS Fellow, 2021.
- ACM Fellow, 2014.
- IBM Faculty Award, 2014.
- Google Faculty Research Award, 2013.
- CAREER award. National Science Foundation, 2008.
- IBM Faculty Award, 2008.
- Other awards
Recent and Selected Talks (All talks)
- Editor in Chief for PVLDB, 2022.
- Associate Editor for PVLDB, 2020-2021.
- Member, PVLDB Executive Committee.
- Member, ACM SIGMOD Executive Committee.
- Member, Diversity and Inclusion Task in Databases, 2020--.
- Member, DBCares committee, ACM SIGMOD and VLDB, 2019--.
- Steering Committee, ACM SIGMOD Workshop on Human-In-the-Loop
Data Analytics (HILDA).
- Advisory Board Member, Center for Reproducible Biomedical Modeling, 2018-.
- Member, Harvard Data Science Initiative Trust in Science
External Advisory Board, 2021-.
- Elected Chair, ACM Special Interest Group on Management of
Data (SIGMOD), 2017-2021.
- Council Member, Computing Research Association's Computing
Community Consortium (CCC), 2017-2020.
- PI and Executive Director,
Moore Sloan Data Science Environment, New York University, 2015-2019.
- Past Activities
Sponsors
Our research has been funded by the National Institutes of Health, DARPA
(Memex and D3M),
National Science Foundation (grants
IIS 2106888,
ACI DIBBS 1640864,
IIS 0844572,
IIS 0746500,
CNS 0751152,
IIP SBIR 0712592,
CNS-0524096), Department of Energy (Sandia and Los Alamos National
Laboratory), University of Utah Seed Grant, and Army Small
Business Technology Transfer (STTR) grant #A054-002-0330.
Link to some of Juliana's NSF grants.