Now that the seal is broken on scraping Bluesky posts into datasets for machine learning, people are trolling users and one-upping each other by making increasingly massive datasets of non-anonymized, full-text Bluesky posts taken directly from the social media platform’s public firehose—including one that contains almost 300 million posts.
Last week, Daniel van Strien, a machine learning librarian at open-source machine learning library platform Hugging Face, released a dataset composed of one million Bluesky posts, including when they were posted and who posted them. Within hours of his first post—shortly after our story about this being the first known, public, non-anonymous dataset of Bluesky posts, and following hundreds of replies from people outraged that their posts were scraped without their permission—van Strien took it down and apologized.
"I've removed the Bluesky data from the repo," he wrote on Bluesky. "While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake." Bluesky’s official account also posted about how crawling and scraping works on the platform, and said it’s “exploring methods for consent.”
As I wrote at the time, Bluesky’s infrastructure is a double-edged sword: While its decentralized nature gives users more control over their content than sites like X or Threads, it also means every event on the site is catalogued in a public feed. There are legitimate research uses for social media posts, but researchers typically follow ethical and legal guidelines that dictate how that data is used; for example, a research paper published earlier this year that used Bluesky posts to look at how disinformation and misinformation spread online relied on a dataset of 235 million posts, but that data was anonymized. The researchers also provide clear instructions for requesting one’s data be excluded.
If there’s one constant across social media, regardless of the platform, it’s the Streisand effect. Van Strien’s original post and apology both went massively viral, and since a lot of people are straddling both Bluesky and Twitter as their primary platforms, the dataset drama crossed over to X, too—where people love to troll. The dataset of one million posts is gone from Hugging Face, but several much larger datasets have taken its place.
There’s a two-million-post dataset by Alpine Dale, who claims to be associated with PygmalionAI, a yet-to-be-released “open-source AI project for chat, role-play, adventure, and more,” according to its site. That dataset description says it “could be used for: Training and testing language models on social media content; Analyzing social media posting patterns; Studying conversation structures and reply networks; Research on social media content moderation; Natural language processing tasks using social media datas.” The goal, Dale writes in the dataset description, “is for you to have fun :)”
The community page for that dataset is full of people saying this either breaks Bluesky’s developer guidelines (specifically “All services must have a method for deleting content a user has requested to be deleted”) or is against the law in European countries, where the General Data Protection Regulation (GDPR) would apply to this data collection.
I asked Neil Brown, a lawyer who specializes in internet law and GDPR, if that’s the case. The answer isn’t a straightforward one. “Merely processing the personal data of people in the EU does not make the person doing that processing subject to the EU GDPR,” he said in an email. To be subject to GDPR, the processing would need to fall within its material and territorial scopes. Material scope involves how the data is processed: “processing of personal data done through automated means or within a structured filing system, including collection, storage, access, analysis, and disclosure of personal information,” according to the law. Territorial scope involves where the person who is doing the data collecting is located, and also where the subjects of that data are located.
“But I imagine that there are some who would argue that this activity is consistent with the EU GDPR,” Brown said. “These arguments are normally based in the thinking that, if someone has made personal data public, then they are ‘fair game’ but, IMHO, the EU GDPR simply does not work that way.”
None of these legal questions have stopped others from creating more and bigger datasets. There’s also an eight-million-post dataset compiled by Alim Maasoglu, who is “currently dedicated to developing immersive products within the artificial intelligence space,” according to their website. “This growing dataset aims to provide researchers and developers with a comprehensive sample of real world social media data for analysis and experimentation,” Maasoglu’s description of the dataset on Hugging Face says. “This collection represents one of the largest publicly available Bluesky datasets, offering unique insights into social media interactions and content patterns.”
It was quickly surpassed—by a lot. There’s now a 298-million-post dataset released by someone with the username GAYSEX. They wrote an imaginary dialogue in their Hugging Face project description between themselves and someone whose posts are in the dataset: “‘NOOO you can't do this!’ Then don't post. If you don't want to be recorded, then don't post it. ‘But I was doing XYZ!!’ Then don't. Look. Just about anything on the internet stays on the internet nowadays. Especially big social network sites. You might want to consider starting a blog. Those have lower chances of being pulled for AI training + there are additional ways to protect blogs being scraped aggressively.” As a co-owner of a blog myself, I can say that being scraped has been a major pain in the ass for us, actually, and generative AI companies training on news outlets is a serious problem this industry is facing—so much so that many major outlets have struck deals with the very big tech companies that want to eat their lunch.
There are at least six more similar datasets of user posts currently on Hugging Face, in varying sizes. Margaret Mitchell, Chief Ethics Scientist at Hugging Face, posted on Bluesky following van Strien’s removal of his dataset: “The best path forward in AI requires technologists to be reflective/self-critical about how their work impacts society. Transparency helps this. Appreciate Bsky for flagging AI ethics &my colleague’s response. Let’s make informed consent a real thing.” When someone replied to her post linking to the two-million-post dataset and asking her to “address” it, she said, “Yes, I'm trying to address as much as I can.”
Like just about every other industry that relies on human creative output, including journalism, music, books, academia, and the arts, social media platforms seem to be taking one of two routes when it comes to AI: strike a deal, or wait and see how fair use arguments shake out in court, where what constitutes “transformative” under copyright law is still being determined. In the meantime, everyone from massive generative AI corporations to individuals on troll campaigns are snapping up data while the area’s still gray.