Our paper "Proactive Discovery of Fake News Domains from Real-Time Social Media Feeds", which was presented at the Workshop on Computational Methods in
Online Misbehavior (CyberSafety), The Web Conference, 2020, received
the best-paper award.
In 2018-2019, I was on sabbatical at INRIA, visiting Ioana Manolescu and
Jean-Daniel Fekete.
Our paper "A Large-scale Study about Quality and Reproducibility of
Jupyter Notebooks" has received an ACM SIGSOFT Distinguished Paper
Award and a FOSS Impact Paper Award.
Our paper "Anonymizing NYC Taxi Data: Does It Matter?" received
the Best-Paper honorable mention at the IEEE International Conference on Data Science and Advanced
Analytics (DSAA), 2016.
Check out noWorkflow, a new system that captures provenance
of Python scripts and automatically generates a reproducible
package: https://github.com/gems-uff/noworkflow
Together with Bill Howe (UW, organizer), Mike Franklin (UC
Berkeley), Jim Frew (UC Santa Barbara) and Tim Kraska (Brown),
I participated in a panel on teaching data science, which took
place at ACM SIGMOD 2014. We considered the question of whether
introductory courses in databases should be replaced with
introductory courses in data science, retaining some of the
same critical database material on the relational model and
languages, query processing, and scalable systems, but
augmenting it with material in statistics, machine learning,
and visualization. Somewhat surprisingly, most of the crowd was
amenable to this pitch! For more information, see
http://escience.washington.edu/blog/teaching-data-science-instead-databases
In collaboration with the University of Washington and
Berkeley, and under the sponsorship of the Moore and Sloan
foundations, we are working on a new initiative to 'harness the
potential of data scientists and big data'. For more information,
see:
NYU announcement
NYTimes article
To aid the process of creating reproducible experiments, we have
developed ReproZip, a tool that automatically captures the provenance
of existing experiments and packs all the components necessary to
reproduce the results in different environments. The system was
demonstrated at SIGMOD 2013, in New York City, and at the Beyond the PDF 2
Conference, in Amsterdam. The first public release of the tool is
planned for Spring 2013.
For more information about our activities around reproducibility in science, see:
http://www.reproduciblescience.org
Check out the September issue of the IEEE Data Engineering Bulletin: Data
Management beyond Database Systems. It contains a collection
of articles which highlight the importance of cross-domain
synergies and the need to go beyond traditional database systems and
to make database technology more accessible---both easier to use for
end-users and easier to integrate with other systems.
If you'd like to learn how workflows and provenance can be used
to automate the creation of customized applications/mashups, check out
our paper in IEEE Vis 2009: VisMashup: Streamlining the Creation of
Custom Visualization Applications, by Emanuele Santos, Lauro Lins, James Ahrens, Juliana Freire and Claudio Silva.
Reproducibility in Science. We are building infrastructure to simplify the creation, review and sharing of computational experiments.
VisTrails. VisTrails
is an open-source data analysis and visualization system.
It captures detailed provenance for the data exploration process
and uses this information to streamline the creation, execution,
and sharing of computational processes (also known as workflows, dataflows,
or pipelines), which are widely used to construct visualizations and to
perform data analysis and mining.
Provenance Analytics
BirdVis: Visualizing Geo-Temporal
Data. BirdVis is an interactive visualization system that
supports the analysis of spatio-temporal bird distribution models.
Finding and Querying Structured Data on the Web. In this
project, we have addressed the problem of large-scale information
integration to enable on-the-fly queries over structured Web data.
Uncovering Hidden Web data. Our goal in this project is
to develop a scalable infrastructure that automates, to a large
extent, the process of discovering, organizing, and extracting
data from hidden-Web sources. We have built DeepPeep, a new search engine
specialized in Web forms. For more details about this project, see
http://fleixeiras.cs.utah.edu/webdb.
NSF CAREER. This award has supported my group in the development of
new algorithms and infrastructure for efficiently managing workflows and
their provenance and for enabling casual users (who do not
necessarily have programming expertise) to perform exploratory
tasks and solve problems through workflows. In the context of this project, we are
collaborating closely with scientists in different domains and are actively
contributing to the open-source VisTrails system.
Towards an Infrastructure to Create Reproducible Papers. Beyond the PDF Workshop. San Diego. January 19th, 2011.
Publishing Reproducible Results with VisTrails. Juliana Freire and Claudio Silva. SIAM Mini-Symposium on Reproducible Research. Las Vegas. March 4, 2011.
Provenance-Rich Science. Juliana Freire. FORTH. Crete, Greece. June 22, 2011.
A Provenance-Based Infrastructure for Creating Reproducible Papers. AMP 2011 Workshop on Reproducible Research. Vancouver, Canada. July 14, 2011.
Provenance-Rich Science.
Keynote at the DB/IR Day, AT&T Shannon Labs, Florham Park, NJ, October 22, 2010.
Provenance Management for Data Exploration.
Keynote at the International Conference on Data Integration in the Life Sciences (DILS), Sweden, August 2010.
Infrastructure for Understanding Human Knowledge. ICiS Workshop on Integrating, Representing, and Reasoning over Human Knowledge: A Computational Grand Challenge for the 21st Century. Snowbird, August 8, 2010.
Supporting Provenance-Rich Science with VisTrails. CScADS Scientific Data and Analytics for Petascale Computing Workshop. Snowbird, July 26, 2010.
The WebDB Group: Research Overview. Federal University of Amazonas, Manaus, Brazil, June 29th, 2010.