Tracing the Evolution of NLP
The interaction between AI and language corpora has been ongoing for several decades. Its beginnings can be traced to the rule-based systems of the 1950s–1980s, which brought computing and linguistics together and gave rise to Natural Language Processing (NLP). As an early offspring of AI, these rule-based systems relied on carefully hand-crafted rules to process language. It wasn’t until the late 1980s that statistical models began to replace them, using large collections of textual corpora to build models that reflected real language use rather than complex, rigid rules. Resources like the Penn Treebank and the rapid growth of the internet provided an unprecedented amount of data for training NLP systems. In 2001, Yoshua Bengio and his team introduced neural language models, setting a new standard for language processing and leading to improvements in NLP applications such as translation, search engines, and voice-activated assistants. During the 2000s and 2010s, NLP advanced further with machine learning techniques and new algorithms like Word2Vec (2013). Language models trained on massive text datasets became increasingly capable of generating human-like text, translating languages, and answering complex questions. They do this by combining the language patterns and knowledge learnt during training to produce new text that is both grammatically correct and meaningful.
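To make the idea behind Word2Vec more concrete, the short sketch below trains word embeddings on a tiny, purely hypothetical corpus; it assumes the open-source gensim library and is an illustration only, not a production setup. Words that appear in similar contexts end up with similar vectors, which is the pattern-learning the paragraph above describes.

```python
# A minimal sketch of training word embeddings with gensim's Word2Vec
# (gensim is assumed to be installed; the toy corpus below is hypothetical).
from gensim.models import Word2Vec

# Each "sentence" is a list of tokens; a real corpus would hold millions of them.
corpus = [
    ["language", "models", "learn", "patterns", "from", "text"],
    ["word2vec", "maps", "words", "to", "dense", "vectors"],
    ["similar", "words", "end", "up", "with", "similar", "vectors"],
]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=50)

# Words that occur in similar contexts get nearby vectors.
print(model.wv.most_similar("words", topn=3))
```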
The shift to neural models in translation services, marked by Google Translate’s adoption of a neural sequence-to-sequence model in late 2016, was a turning point away from traditional statistical methods. The new approach enabled more precise and context-aware translations by interpreting entire sentences holistically. This era saw rapid advances in NLP, driven by deep learning and neural networks, which greatly enhanced language comprehension and generation. Researchers and developers harnessed these technologies to build more refined and effective NLP applications, paving the way for the next wave of breakthroughs in the field. Large Language Models (LLMs), like GPT-3 and its successors, emerged from these advances. Trained on vast datasets, LLMs are capable of an impressive array of language tasks, from generating creative text and translating languages to writing code. Thus, the transfer of human knowledge into machine learning models, with linguistic corpora as their knowledge engine, forms the foundation of modern artificial intelligence.
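As an illustration of what a neural sequence-to-sequence translator looks like in practice, the sketch below uses the Hugging Face transformers library with a publicly available model (Helsinki-NLP/opus-mt-en-fr). Both are assumptions chosen for illustration; this is not Google Translate’s own system.

```python
# A minimal sketch of neural machine translation with a publicly available
# sequence-to-sequence model (the transformers library and the
# Helsinki-NLP/opus-mt-en-fr checkpoint are illustrative choices).
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

# The model encodes the whole sentence before generating the translation,
# which is what allows context-aware output.
result = translator("The conference on language technology starts tomorrow.")
print(result[0]["translation_text"])
```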
Linguistic Corpora: The Knowledge Engine of AI
When we refer to corpora (the Latin plural of corpus) or datasets used to train Large Language Models like GPT-4, we’re talking about a wide range of textual sources. These include books, website content, extensive repositories like Wikipedia, as well as more informal forms of language found in social media posts, product and service reviews, and even public emails. This diversity equips LLMs to understand and generate text across multiple languages, tones, and styles.
For professionals in Natural Language Processing, data science, and data engineering, resources such as Kaggle and GitHub’s Awesome Public Datasets offer accessible collections of public datasets. Some of these datasets are pre-processed and ready for analysis, while others require cleaning and organisation before use. Although many datasets include quantitative data, these repositories also contain rich textual data, ideal for training language models.
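As a simple illustration of the kind of cleaning such a dataset may require before use, the sketch below applies a few basic steps with pandas. The file name "reviews.csv" and the column name are hypothetical stand-ins for a dataset downloaded from a repository like Kaggle.

```python
# A minimal sketch of loading and cleaning a public text dataset with pandas
# (the file name "reviews.csv" and the column names are hypothetical).
import pandas as pd

df = pd.read_csv("reviews.csv")          # e.g., a dataset downloaded from Kaggle

# Basic clean-up before the text can be used for model training:
df = df.dropna(subset=["review_text"])   # drop rows with missing text
df["review_text"] = (
    df["review_text"]
    .str.strip()                           # remove leading/trailing whitespace
    .str.replace(r"\s+", " ", regex=True)  # collapse repeated whitespace
)
df = df.drop_duplicates(subset=["review_text"])  # remove duplicate entries

print(f"{len(df)} cleaned reviews ready for use")
```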
AI engages not only with textual corpora but also with audio corpora, which encompass diverse forms of spoken language data, such as recorded conversations, podcasts, interviews and voice commands, often accompanied by text transcriptions. These audio datasets enable AI models to learn from more than just written words; they also capture the sound, rhythm, tone, and context inherent in spoken language.
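As a small illustration of how audio can be paired with text, the sketch below transcribes a recording with the open-source openai-whisper package. The package choice and the file name "interview.mp3" are assumptions for illustration only.

```python
# A minimal sketch of transcribing audio with the open-source openai-whisper
# package (the package and the file "interview.mp3" are illustrative choices).
import whisper

model = whisper.load_model("base")            # small pretrained speech model
result = model.transcribe("interview.mp3")    # returns text plus timed segments

print(result["text"])                         # full transcription
for segment in result["segments"][:3]:        # per-segment timing and text
    print(segment["start"], segment["end"], segment["text"])
```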
The Impact of Language in Human Life
Each language carries its own unique history, worldview, and nuances, enriching humanity’s diversity and deepening our understanding of the world. We linguists consider language to be a rich cultural, social, and psychological phenomenon. It’s not merely a means of communication but a reflection of a society’s identity, values and shared experiences. Language shapes social interactions, encapsulates cultural heritage and provides insight into human cognition, revealing how we perceive and interpret the world around us. Since language is deeply woven into our identities, it becomes both our shared asset and a collective responsibility.
However, the advancement of AI and NLP technology drives the exchange of language data, turning it into a digital market commodity. In their quest for vast online language data to enhance AI models, tech giants are increasingly resorting to aggressive data collection strategies. When language is turned into data for profit, there’s a risk of losing its depth, context and emotional value, which could weaken cultural identities and limit linguistic diversity. Additionally, making language a commercial asset raises ethical issues around privacy and consent, as large amounts of personal and cultural information are collected, processed, and monetised.
The following instances highlight the ethical and privacy concerns surrounding AI:
According to an article titled “How Tech Giants Cut Corners to Harvest Data for AI,” which appeared in the New York Times on April 6, 2024, OpenAI, Google, and Meta bypassed their corporate policies, altered internal rules, and considered ways to circumvent copyright law to gather online data for training their latest AI systems.
In late 2021, OpenAI encountered a shortage of language data for training its AI models. To address this, researchers developed Whisper, a speech recognition tool capable of transcribing audio from YouTube videos. This approach provided valuable conversational text to enhance AI intelligence, though some within OpenAI expressed concerns that this might violate YouTube’s rules, which prohibit using its content in applications outside its platform. Nonetheless, OpenAI proceeded, with the team—led by President Greg Brockman—transcribing over a million hours of YouTube videos, which were then fed into GPT-4, one of the most advanced AI systems at the time and the foundation for the latest ChatGPT version.
At Meta, conversations among managers, legal experts, and engineers last year centred on acquiring the publishing giant Simon & Schuster to access extensive written works, as per recordings from internal meetings obtained by The New York Times. The discussions also explored gathering copyrighted material from across the web, despite the potential legal repercussions, as negotiating individual licenses with content creators was deemed too time-consuming.
Similarly, Google transcribed YouTube videos to gather text for its AI models, raising concerns about possible copyright violations against the video creators. Additionally, last year, Google broadened its terms of service to potentially allow its AI systems to draw from publicly available Google Docs, Google Maps restaurant reviews, and other online content. According to Google’s privacy team and an internal message seen by The Times, this policy change was partly motivated by the need to feed AI’s growing data appetite.
These actions underscore how online content, from news articles, stories, and forum posts to Wikipedia entries, images, and videos, has become the essential fuel for the thriving AI industry. Today, innovation in AI relies heavily on abundant digital data to train systems capable of instantly generating human-like text, images, sound, and video. This high-stakes AI race has driven companies like OpenAI, Google, and Meta to cut corners, relax their policies, and explore the edges of legal boundaries, as revealed by The New York Times.
An article in Fortune titled “Twitter/X will now allow third parties to train AI models with people’s data—and any disputes must be heard in a Trump-friendly court,” published on October 18, 2024, reveals that Elon Musk’s social media platform has updated its privacy terms, effective November 15. Under the new policy, user data may be used to train AI models for third-party “collaborators,” potentially extending beyond Musk’s own Grok AI initiative and allowing Twitter/X to license data to external companies, much as Reddit does.
The change could create a new revenue stream for the platform as advertising income declines under Musk’s ownership. While users can reportedly opt out, it is currently unclear how to do so.
With the advancement of AI and NLP technologies, unrestricted access to language data raises serious questions of ethics and privacy. Corporations like OpenAI, Google, Meta, and Twitter/X overreach in order to obtain massive amounts of digital language data. As these firms push the boundaries of innovation, they readily breach standard business practices, internal guidelines, and, quite possibly, copyright law to feed their AI engines. OpenAI’s use of Whisper, Meta’s interest in acquiring copyright-protected works, Google’s expansion of its data policies, and Twitter/X’s modification of its privacy policy all point to the same issue: a desperate need to obtain as much data as possible and win the competition at any cost.
The Cambridge Analytica scandal of 2018 draws attention to the risks surrounding data on networked platforms. The political consulting firm gained access to personal information belonging to around 87 million Facebook users. The data was harvested through a quiz app that collected information about participants and, via the permissive access Facebook’s APIs then allowed, about their social connections as well. The controversy, which came to light through the firm’s work with the Trump campaign, raised awareness of the dangers of mishandling personal data during the electoral process.
According to a news article titled “Operation Zero: How ChatGPT Maker OpenAI Says It Stopped an Israeli Company From Disrupting Lok Sabha Election,” published on the Times of India’s online news portal on June 2, 2024, OpenAI, the developer of ChatGPT, announced that it had thwarted an Israeli company’s attempts to interfere in India’s Lok Sabha elections. Dubbed “Operation Zero,” the effort saw the Microsoft-backed AI company act swiftly to halt the misleading use of AI by the Israeli firm STOIC within 24 hours, thereby avoiding any substantial impact on the electoral process.
With the popularity of generative artificial intelligence, attention has also turned to military applications of AI, which attract heavy R&D investment, particularly in autonomous weapon systems (AWS). Fully autonomous weapons may not be available yet, but given the fast pace of military advancement, their arrival does not seem far-fetched. This brings troubling ethical issues to the fore, especially as major powers, the US and China in particular, spearhead AWS research and development. These matters need to be openly discussed and regulated.
Is Language Commodification a Crony-Capitalist and Global-Populist Agenda?
As generative artificial intelligence continues to advance, job prospects are increasingly concentrated within this field. ChatGPT’s projections suggest that careers in AI and data science are set to dominate the job market, becoming both the most lucrative and highly demanded roles over the next half-century. This trend may lead to the marginalisation of other professions and escalate competition for limited opportunities.
Moreover, many of the leading AI systems, such as OpenAI’s ChatGPT and Google’s Gemini (formerly Bard), are under the control of private corporations. This concentration of ownership raises concerns: if these technologies are steered by individual or corporate interests, they could not only disrupt industries but also threaten the sovereignty and stability of nations, creating potential economic and political imbalances. Such centralisation could lead to an uneven distribution of power, in which a few private entities exert significant influence over global information flows and decision-making, affecting international relations and the autonomy of various countries. The implications of such concentrated power demand careful regulation and oversight to prevent these tools from being used to further narrow interests at the expense of broader societal well-being.
Safeguarding Personal Data in the Age of AI
Jennifer King, a privacy and data policy fellow at Stanford University’s Institute for Human-Centered Artificial Intelligence (Stanford HAI), along with Caroline Meinhardt, Stanford HAI’s policy research manager, released a white paper titled Rethinking Privacy in the AI Era: Policy Provocations for a Data-Centric World. In it, King outlines their key insights into emerging privacy threats and possible solutions.
What risks do we face as our data is increasingly traded and utilised by AI systems?
Firstly, AI amplifies longstanding privacy concerns from decades of internet data collection, but on a much larger scale. These data-hungry, opaque systems leave individuals with even less control over what’s collected, how it’s used and whether they can alter or remove their personal information. Today, digital surveillance pervades nearly all aspects of online life, and AI may exacerbate this issue.
Secondly, AI tools trained on internet-sourced data can be misused for harmful purposes. Generative AI, for instance, can retain personal and relational details, enabling targeted scams like spearphishing or AI-driven voice cloning to impersonate people and commit fraud.
Thirdly, personal data shared online, like a resume or photo, is often repurposed to train AI without consent, raising civil rights issues. Predictive AI in hiring, for example, can carry biases; Amazon’s AI hiring tool showed a preference against female applicants. Similarly, facial recognition intended for law enforcement has led to multiple false arrests, particularly of Black men due to biased training data.
Are we so accustomed to companies collecting our data that it’s now too late to act?
While a great deal of data has already been collected, it is still possible to establish stronger regulations, such as requiring users to opt in to data collection or mandating that companies delete data when it is misused.
Today, nearly every online interaction is tracked, and mobile apps often collect location data by default. This system exists largely because, two decades ago, the industry convinced regulators that opting in would stifle the internet’s growth. But now that the internet’s value is proven, this excuse is no longer valid.
Ideally, data should only be collected if users actively consent, such as by signing up for a service, and it should remain private unless they choose to make it public. A decade ago, data privacy was mainly about tracking purchases, but now, vast data collection fuels AI systems, with serious societal and civil rights implications. It’s not too late to change these practices; regulatory standards can still be updated.
Why aren’t data minimisation and purpose limitation rules enough to protect privacy?
While essential, these rules, found in laws like the GDPR, California’s CCPA, and the proposed federal ADPPA, face challenges in implementation. Regulators often struggle to determine when a company has overstepped by collecting excessive data. Where companies such as Amazon or Google justify broad data collection by pointing to their diverse services, it is harder to enforce clear limits. Although these rules are valuable, ensuring that companies comply is not always straightforward.
What are some of the proposed technological solutions to enhance data privacy in AI systems, and how effective are they in limiting unwanted data collection?
The white paper suggests several solutions to AI-related data privacy issues, including switching from opt-out to opt-in data sharing, which could be streamlined with technology.
For instance, Apple’s App Tracking Transparency (ATT) feature, launched in 2021, prompts iPhone users to choose whether they want an app to track their activity. Reports show that 80–90 per cent of users opt out when given this choice.
Another approach is integrating opt-out signals into web browsers, like the Global Privacy Control (GPC) feature, which signals a universal opt-out from the sale or sharing of personal data. Although California’s CCPA recognises such signals, browsers are not required to offer them, and only a few (e.g., Firefox and Brave) currently do. Recently, a California legislator proposed requiring all browsers to honour opt-out signals, which could greatly limit unwanted data collection.
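As a rough illustration of how a website might honour such a signal on the server side, the sketch below checks the Sec-GPC request header in a small Flask application. The application and the opt-out behaviour shown are assumptions for illustration, not a prescribed implementation.

```python
# A minimal sketch of honouring the Global Privacy Control signal on the
# server side with Flask (the app and the do-not-sell behaviour shown here
# are illustrative assumptions).
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/track")
def track():
    # Browsers with GPC enabled send the header "Sec-GPC: 1" with each request.
    gpc_enabled = request.headers.get("Sec-GPC") == "1"
    if gpc_enabled:
        # Treat the signal as an opt-out: skip third-party sharing or sale.
        return jsonify({"tracking": False, "reason": "GPC opt-out honoured"})
    return jsonify({"tracking": True})

if __name__ == "__main__":
    app.run()
```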
Why does the report advocate for collective solutions over individual privacy rights in data protection, and how could data intermediaries help enhance user control?
This report highlights the limitations of focusing solely on individual privacy rights, suggesting we need collective solutions to address data privacy effectively.
In places like California, where data privacy laws exist, few people understand or have the time to exercise their rights. Even if they did, they would need to repeatedly request each company not to sell their data, a process that isn’t permanent.
A collective approach, such as using data intermediaries, could give the public more leverage. By delegating data rights to a trusted intermediary like a data steward or cooperative, individuals could negotiate their rights at scale. Although challenging to implement for consumers, this model has shown promise in business contexts.
As AI systems consume vast amounts of data, they pose significant risks to individual privacy, from intensified digital surveillance to potential misuse for fraudulent activities like identity theft. Given the volume of personal data being generated in India, these issues are particularly critical.
Relying solely on individual privacy rights is insufficient, as many users are either unaware of their data rights or lack the resources to enforce them across all digital platforms. Collective solutions, such as data intermediaries, could help Indian consumers better protect their rights, allowing them to pool their data management power for greater influence.
India’s upcoming data protection laws, modelled on principles like data minimisation and purpose limitation found in global standards (such as GDPR), could help curb excessive data collection. However, enforcing these regulations effectively, especially with large corporations that have broad data needs, will be a challenge.
Adopting opt-in data sharing policies could be a transformative step in India, enhancing user control over personal data. Features like Apple’s App Tracking Transparency and Global Privacy Control demonstrate how technology can support these privacy efforts. Indian policymakers could draw on these examples to mandate similar options across platforms. Hence, India’s digital landscape would benefit from both strong individual privacy rights and collective frameworks to protect user data more comprehensively. As AI-driven data collection grows, India’s regulatory approach must evolve to ensure data privacy, safeguard civil rights, and address the unique challenges posed by a data-centric, AI-powered world.