One million public Bluesky posts scraped for AI training

The data was posted to an AI company, then later removed after an outcry.
By Chase DiBenedetto
The dataset, composed of one million Bluesky posts, has since been taken down. Credit: Jaap Arriens / NurPhoto via Getty Images

Bluesky is already facing its first major AI scrape, despite its owners' stance that the platform will never train generative AI on user data.

As 404Media reported on Nov. 26, one million public Bluesky posts, complete with identifying user information, were crawled and then uploaded to AI company Hugging Face. The dataset was created by machine learning librarian Daniel van Strien and was intended for use in developing language models and natural language processing, as well as for general analysis of social media trends, content moderation, and posting patterns. It contains users' decentralized identifiers (DIDs) and even has a search function to find content from specific users.

According to the dataset's description, the set "contains 1 million public posts collected from Bluesky Social's firehose API (Application Programming Interface), intended for machine learning research and experimentation with social media data. Each post contains text content, metadata, and information about media attachments and reply relationships."


Bluesky users didn't opt in to such uses of their content, but Bluesky doesn't expressly prohibit them either. The platform's firehose API is an "aggregated, chronological stream of all the public data updates as they happen in the network, including posts, likes, follows, handle changes, and more." Bluesky's API, coupled with the public and decentralized Authenticated Transfer (AT) Protocol the site is built on, means Bluesky content is open and available to the third-party developers the platform is trying to court, 404Media explains.
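
To make concrete how little effort it takes to pull posts out of a stream like the firehose, here is a minimal Python sketch. It is an illustration under stated assumptions, not the method used to build the dataset: the real firehose delivers encoded frames over a network connection, while this example uses a simplified, made-up in-memory stream. The `FirehoseEvent` class, the `extract_posts` function, and the sample DIDs are hypothetical; `app.bsky.feed.post` is the AT Protocol's collection name for posts.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# AT Protocol collection name for posts. Likes and follows use
# "app.bsky.feed.like" and "app.bsky.graph.follow".
POST_COLLECTION = "app.bsky.feed.post"


@dataclass
class FirehoseEvent:
    """Hypothetical, simplified stand-in for one decoded firehose update."""
    did: str         # decentralized identifier of the account
    collection: str  # record type, such as a post, like, or follow
    record: dict     # record contents


def extract_posts(events: Iterable[FirehoseEvent]) -> Iterator[dict]:
    """Yield only post records, each paired with its author's DID."""
    for event in events:
        if event.collection == POST_COLLECTION:
            yield {"did": event.did, "text": event.record.get("text", "")}


# Made-up sample events; the DIDs and text are placeholders, not real data.
sample_stream = [
    FirehoseEvent("did:plc:alice-example", "app.bsky.feed.post", {"text": "Hello, Bluesky"}),
    FirehoseEvent("did:plc:bob-example", "app.bsky.feed.like", {"subject": "..."}),
    FirehoseEvent("did:plc:bob-example", "app.bsky.feed.post", {"text": "First post"}),
]

posts = list(extract_posts(sample_stream))
for post in posts:
    print(post["did"], "->", post["text"])
```

Because every update in the stream carries the account's DID, each collected post stays attributable to the user who wrote it unless the collector deliberately strips that field.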

This could be a major warning sign for the site's millions of new users, many of whom left competitor X in the wake of an alarming new AI training policy. A Bluesky representative responded to 404Media's requests for comment: "Bluesky is an open and public social network, much like websites on the Internet itself. Just as robots.txt files don't always prevent outside companies from crawling those sites, the same applies here. We'd like to find a way for Bluesky users to communicate to outside orgs/developers whether they consent to this and that outside orgs respect user consent, and we're actively discussing how to achieve this."
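
The robots.txt comparison is easier to follow with an example. The Python sketch below uses the standard library's `urllib.robotparser` to read a made-up robots.txt file; the `ExampleAIBot` user agent and the example.com URL are hypothetical. It shows that a robots.txt file only states a site's wishes. A crawler has to choose to check those rules and honor the answer.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: one hypothetical AI crawler is asked to stay out,
# and every other crawler is allowed in.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/profile/alice"
allowed_ai = parser.can_fetch("ExampleAIBot", url)
allowed_other = parser.can_fetch("SomeOtherBot", url)

print("ExampleAIBot allowed:", allowed_ai)
print("SomeOtherBot allowed:", allowed_other)

# Nothing here blocks a request. A crawler that never runs this check,
# or ignores its result, can still fetch the page.
```

A consent signal for Bluesky posts, as the company describes it, would presumably work the same way: it would tell outside developers what users want, and it would depend on those developers respecting it.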

Shortly after the article's publication, the dataset was removed from Hugging Face. "I've removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake," van Strien wrote in a follow-up Bluesky post.

Chase DiBenedetto
Social Good Reporter

Chase joined Mashable's Social Good team in 2020, covering online stories about digital activism, climate justice, accessibility, and media representation. Her work also touches on how these conversations manifest in politics, popular culture, and fandom. Sometimes she's very funny.

