Many of the biggest websites opted out of Apple Intelligence training

Generative AI systems are trained on content that crawlers scrape from the web. Apple allows publishers to opt out of this use of their content, and a new report says that many of the biggest websites have specifically opted out of Apple Intelligence training.

This includes both Facebook and Instagram, as well as many high-profile news and media sites like The New York Times and The Atlantic.

Apple’s AI training

Large language models like ChatGPT are trained by giving them access to millions of words of source material, ranging from news stories to user comments.

In Apple’s case, the company has for years been using Applebot to train Siri and surface Spotlight suggestions. More recently, the company has also been using Applebot to train Apple Intelligence.

The practice is controversial, as AIs are effectively using copyrighted material to generate their own versions of it. For more niche topics, where source material is scarce, they have even been found to regurgitate entire paragraphs with almost no changes made.

But Apple does this in an ethical way, allowing publishers to opt out, and screening out personal data (though it did get caught out by one third-party source).

“We train our foundation models on licensed data, including data selected to enhance specific features, as well as publicly available data collected by our web-crawler, AppleBot. Web publishers have the option to opt out of the use of their web content for Apple Intelligence training with a data usage control […]

“We apply filters to remove personally identifiable information like social security and credit card numbers that are publicly available on the Internet.”

Apple uses a separate Applebot-Extended user agent, which sites can disallow in their robots.txt file to opt out of AI training while still allowing search indexing, meaning that their pieces can still be included in Spotlight and Siri searches.
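As a rough illustration of how such an opt-out behaves, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt text, the publisher.example URL, and the allowed() helper are all made up for this example; real publishers' files are usually longer and may scope the rule to particular paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary publisher: it opts out of
# Applebot-Extended (AI training) and leaves every other crawler unrestricted.
SAMPLE_ROBOTS_TXT = """\
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether user_agent is permitted for url under this robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ARTICLE_URL = "https://publisher.example/news/story.html"  # placeholder URL

print(allowed(SAMPLE_ROBOTS_TXT, "Applebot", ARTICLE_URL))           # True: search indexing still allowed
print(allowed(SAMPLE_ROBOTS_TXT, "Applebot-Extended", ARTICLE_URL))  # False: opted out of AI training
```

One caveat: Python's parser matches user-agent names loosely and uses the first group that matches, so its answers may not mirror Apple's own matching for files that also contain a separate Applebot group.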

Many big web publishers opting out

Since opting out is done using a publicly accessible robots.txt file, it’s easy to see which sites have done this. Wired checked a number of the biggest news and social media sites.
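A check of this kind can be sketched in a few lines of Python. This is only an assumption about how such a survey might work, not the method Wired used: fetch_robots_txt() and blocks_applebot_extended() are hypothetical helpers, and the second one deliberately looks only for a site-wide "Disallow: /" in a group that names Applebot-Extended, ignoring partial blocks and wildcards.

```python
from urllib.request import Request, urlopen

AI_AGENT = "applebot-extended"


def fetch_robots_txt(domain: str, timeout: float = 10.0) -> str:
    """Download a site's robots.txt (performs a network request)."""
    req = Request(f"https://{domain}/robots.txt",
                  headers={"User-Agent": "robots-check-sketch"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def blocks_applebot_extended(robots_txt: str) -> bool:
    """Return True if a group naming Applebot-Extended disallows the whole site."""
    agents = []        # user agents named by the current group
    in_rules = False   # True once the current group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and AI_AGENT in agents:
                return True
    return False
```

To survey a list of sites, one would call blocks_applebot_extended(fetch_robots_txt(domain)) for each domain and count the True results, handling network errors as needed.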

“WIRED can confirm that Facebook, Instagram, Craigslist, Tumblr, The New York Times, The Financial Times, The Atlantic, Vox Media, the USA Today network, and WIRED’s parent company, Condé Nast, are among the many organizations opting to exclude their data from Apple’s AI training […]

“In a separate analysis conducted this week, data journalist Ben Welsh found that just over a quarter of the news websites he surveyed (294 of 1,167 primarily English-language, US-based publications) are blocking Applebot-Extended.”

Applebot-Extended is a relatively new tag, so it’s likely that more websites will also opt out once awareness increases.

Money is of course one factor

Apple is believed to have struck deals with some media companies, paying a fee in return for the right to use their content for training. It’s likely this is the motivation for at least some sites currently blocking Apple – holding out for a payment offer.

“A lot of the largest publishers in the world are clearly taking a strategic approach,” says Originality AI founder Jon Gillham. “I think in some cases, there’s a business strategy involved—like, withholding the data until a partnership agreement is in place.”

iOS 18.1 beta 3 includes several new Apple Intelligence features, including Photo Clean Up and more notification summaries.

Photo by Kelli McClintock on Unsplash

Disclaimer

We strive to uphold the highest ethical standards in all of our reporting and coverage. We at StartupNews.fyi want to be transparent with our readers about any potential conflicts of interest that may arise in our work. It’s possible that some of the investors we feature may have connections to other businesses, including competitors or companies we write about. However, we want to assure our readers that this will not have any impact on the integrity or impartiality of our reporting. We are committed to delivering accurate, unbiased news and information to our audience, and we will continue to uphold our ethics and principles in all of our work. Thank you for your trust and support.
