Apple has released a technical paper detailing the models it developed to power Apple Intelligence, the suite of generative AI features headed to iOS, macOS, and iPadOS in the coming months.
In the paper, Apple counters claims that it took an ethically dubious approach to training its models, emphasizing that it did not use private user data and instead relied on a mix of publicly available and licensed data for Apple Intelligence.
In July, Proof News reported that Apple had used a dataset called The Pile, which contains subtitles from hundreds of thousands of YouTube videos, to train a family of models for on-device processing. Many YouTube creators whose subtitles ended up in The Pile were unaware of, and had not consented to, this use; Apple later said that it did not intend to use those models to power any AI features in its products.
The technical paper, which sheds light on the Apple Foundation Models (AFM) introduced at WWDC 2024 in June, emphasizes that the training data for these models was gathered in a manner Apple deems “responsible,” by its own standards, at least.
The training data for the AFM models includes both publicly available web data and content licensed from unnamed publishers. According to The New York Times, Apple approached several publishers in late 2023, including NBC, Condé Nast, and IAC, about multi-year agreements worth at least $50 million to train on their news archives. The AFM models were also trained on open-source code hosted on GitHub, spanning languages such as Swift, Python, C, Objective-C, C++, JavaScript, Java, and Go.