Can licensing mitigate the negative implications of commercial web scraping?

A Virtual CSCW Workshop - Oct 15th, 2023

The rise of prominent AI models such as ChatGPT and Stable Diffusion has brought the scale of commercial web scraping to the attention of content creators and researchers. Billions of webpages and images are used to train these models without content creators' knowledge, sparking extensive criticism and even lawsuits against AI firms. Amid these debates, researchers and legal experts have proposed licensing as a potential way to mitigate content creators' concerns and promote more responsible data reuse. However, it remains unclear which specific licensing terms will be effective and what sociotechnical environments are necessary to facilitate licensing at scale. This workshop will provide a venue for researchers, content creators, and legal experts to address these questions.

Our workshop is open to all who are interested in the intersection of web scraping, data licensing, and copyright. If you are attending CSCW this year, you can add our workshop to your registration to receive notifications (our access code is AccessW19).


Time: Oct 15th, 2023, 11am-12:30pm (US Central)

Speakers:

Bart De Witte and Sreekanth Mukku, Hippo AI Foundation: Regenerative AI in Healthcare: A Framework to Establish Digital Sovereignty through Free Data Flows

Kat Walsh, Creative Commons: Generative AI and Creative Commons

Michael Clemens, University of Utah: Data Scraping with Sound Judgment

Yiwei Wu, UT Austin: A review of licensing discussion in NeurIPS dataset papers

Scott Cambo, Responsible AI Collaborative and Jesse Josua Benjamin, Lancaster University: Analyzing the Language of RAIL Clauses

Kyle Lo and Luca Soldaini, Allen Institute for AI: ImpACT, RAIL, and Beyond


Organizers

Hanlin Li is an assistant professor at the University of Texas at Austin. She studies the social and economic impact of user-generated data and explores approaches to collective, responsible data governance.

Nicholas Vincent is a postdoctoral scholar at the University of California, Davis. His work focuses on the dependence of modern computing technologies, including the broad set of systems called "AI", on human-generated data, with the goal of mitigating the negative impacts of these technologies.

Yacine Jernite leads the ML and Society team at Hugging Face. He works on ML systems governance at the intersection of regulatory and technical tools, with a focus on NLP models and data curation, documentation, and governance.

Nick Merrill is a research fellow at the UC Berkeley Center for Long-Term Cybersecurity. His work aims to shift the way people understand, identify, and implement safeguards against harms and expand the kinds of decision-makers able to do so.

Jesse Josua Benjamin is a Postdoctoral Research Associate at Lancaster University. His research combines Philosophy of Technology and Design Research to investigate emergent AI challenges and Human-Computer Interaction.

Alek Tarkowski is the Director of Strategy at Open Future. He has over 15 years of experience with public interest advocacy, movement building, and research into the intersection of society, culture, and digital technologies.


Call for Participation

We welcome research abstracts, essays, policy briefs, infographics, and multimedia content that address the following aspects of web scraping and licensing. Submissions should outline the author's research interests and how they relate to our workshop topics. Submissions will be reviewed by organizers for relevance and diversity of perspectives. Authors with accepted work will have 5 minutes to present at the virtual workshop, followed by a Q&A with the audience.

  • Understanding the current landscape of web scraping and investigating how firms and developers approach the legal and ethical risks of scraping and aggregating web content.

  • Understanding current practices around licensing among content creators, e.g. how content creators license their content and what rights they would like to preserve when making their content publicly visible.

  • Identifying specific opportunities to operationalize licensing to counter the negative effects of web scraping and other unauthorized data reuses, including but not limited to privacy violations, lack of compensation for content creators, and copyright infringement.

  • Examining parallels between licensing and other creator-oriented responsible AI initiatives, such as data stewardship and refusal.

How to submit:

To apply, please upload your submission via this Google Form by Sep 18, 2023 (extended from Sep 7). If you have any questions about how to submit, please email Hanlin Li at lihanlin@utexas.edu.