Data Security in an AI-First Paradigm
Data is always moving. It isn’t new that organizations need to track their data, understand its lineage, know how it moves and evolves, and know who has access to it. But the extremely rapid growth and adoption of new technologies have made data even more fluid, and securing that valuable data more complex, than enterprises ever imagined.
Take the cloud, for instance. The convenience and efficiency that cloud computing brings are undeniable. But there are some clear and present concerns, such as the volume of data produced, the directness and ease of access to that data, and geographical and compliance challenges.
When companies implement systems built for speed and scalability, data becomes more mobile and more susceptible to attack, and it becomes harder to determine the natural limits of a blast radius. As these applications churn and expose your data to the world, it becomes significantly more complicated to understand what is exposed if an API is compromised.
Now we’re facing another major data disruption stacked on top of the cloud and data challenges we’re currently navigating, one that has many experts wondering how to move forward. So, what happens to data and security when businesses start taking an AI-first approach?
The shift toward embedding AI into all facets of business as an ‘everything enabler’ has prompted major concerns, because large language models (LLMs), like ChatGPT, challenge the foundation of computer science, which is based on repeatability and predictability.
We’ve relied on the notion that if a machine learning (ML) model is trained to a high accuracy, it will produce the same output when run again on the same input. With LLMs specifically, however, it’s harder to understand the mechanics of the model and how exactly it produces its output. Consequently, two similar LLM queries are unlikely to generate identical results, and it is often unclear why they differ.
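To make the repeatability point concrete, here is a minimal, purely illustrative Python sketch; the token names and probabilities are invented for illustration and are not drawn from any real model. A classic classifier picks the highest-scoring answer every time, while LLM-style decoding typically samples from a probability distribution, so repeated runs on the same prompt can diverge.

```python
import random

# Toy next-token distribution, purely illustrative.
NEXT_TOKEN_PROBS = {
    "secure": 0.4,
    "encrypt": 0.3,
    "classify": 0.2,
    "audit": 0.1,
}

def deterministic_step() -> str:
    # Classic ML inference: always pick the highest-probability answer,
    # so the same input yields the same output on every run.
    return max(NEXT_TOKEN_PROBS, key=NEXT_TOKEN_PROBS.get)

def sampled_step() -> str:
    # LLM-style decoding: sample from the distribution, so repeated
    # runs on the same prompt can produce different outputs.
    tokens, weights = zip(*NEXT_TOKEN_PROBS.items())
    return random.choices(tokens, weights=weights, k=1)[0]

if __name__ == "__main__":
    print([deterministic_step() for _ in range(3)])  # always identical
    print([sampled_step() for _ in range(3)])        # may differ run to run
```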
Since generative AI tools derive their output from training data and training prompts, some LLMs retain that data so the model can keep learning and become more effective. This raises both data quality and privacy concerns, particularly if the model generates content based on sensitive or personally identifiable information.
There are four major factors around confidentiality and privacy that businesses must keep top of mind:
- Erasure: Whatever training data (propositions) you feed into a model will be extremely hard to isolate and erase once it has been absorbed. Dumping the entire model and retraining it may seem tempting, but what is the point if it is only a point-in-time data set? If the dataset is small, retraining may make sense, but the whole point of a massive LLM is its volume and the way it changes over time.
- Confidentiality: Needless to say, you should not use anything classified as sensitive to train a model, but it is easy to simply blast an entire codebase or knowledge base into one. What might have been a long-standing problem with misclassified or unprotected data in your old systems is now far more visible, which could amplify your data loss and confidentiality problems substantially.
- Processor Obligations: With LLMs, building opt-in/opt-out personal information processing obligations for end users into the system will be either extremely complicated, as you keep adding new ways to use the new propositions, or so broad that it provides no real protection to the end user.
- Data Lineage & Pedigree: How will you know if a certain section of data is real? Imagine someone asking you to explain a result, trace each item back to its source and prove it. Tracking provenance has never been easy in the cloud world, and with LLMs it is significantly more complex.
Within these constructs, and in most circumstances where personal data is used to train an LLM, we must guarantee anonymity.
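As a small, purely illustrative sketch of what an anonymization pass might look like before personal data ever reaches a model, the Python below redacts a few common identifier patterns. The patterns and the `anonymize` helper are simplified assumptions for this example, not a complete or production-grade solution; names, addresses and other identifiers typically require dedicated entity-recognition tooling.

```python
import re

# Simplified, illustrative PII patterns applied before text is used as
# training data or prompt context. Not exhaustive and not production-grade.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,3}[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def anonymize(text: str) -> str:
    """Replace recognizable identifiers with placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

if __name__ == "__main__":
    record = "Contact Jane at jane.doe@example.com or 555-867-5309."
    print(anonymize(record))
    # Contact Jane at [EMAIL] or [PHONE].
```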
Refusing to adopt AI could set you back significantly on the competitive scale. Some early adopters, for example, are already commercializing horizontal and vertical LLM solutions.
There are two realities that we need to embrace to overcome AI paralysis:
First, we need to accept that this new paradigm is going to be messy. The notion of ‘grayscale security’ has always existed, but the era of AI brings unprecedented levels of grayscale. And, at the rate AI is growing and evolving, we’re facing years of this amped-up grayscale state.
It’s going to require a major shift in mindset: we need to stop expecting the binary, the idea that something is a simple yes or no, safe or not safe, good data or bad data.
As we delve deeper into AI everything, the full spectrum of trustworthiness must now be internalized. Some industries, like financial services, insurance and healthcare, might be on the slower end of adoption due to complex regulatory landscapes and legacy mindsets.
Second, we need to balance experimentation with risk mitigation. When it comes down to it, this is not the time to wait and see how the AI revolution will pan out, nor is it the time to shut down exploration entirely in the name of eliminating risk. This is the time to run experiments and contribute to market innovation.
As with any disruption, you’re weighing monetary loss against the loss of opportunity. Once you’ve thoroughly assessed your business and put some boundaries and controls in place to secure your data while still allowing access to it, you should start your own research and development initiatives to pursue safer, more transparent, better understood tooling.
Being at the forefront of driving such cutting-edge innovation will not only give you a competitive advantage; purpose-built AI tooling, whether open source or licensed, will also be much easier to control and secure within the enterprise. Transient models allow for faster iteration and experimentation and reduce broad risk, since you can erase and retrain them. A ‘too-big-to-fail’ model will be too hard to clean up and secure in many ways.
Conclusion
Unfortunately, security has largely been an afterthought for many companies, but as we enter this paradigm shift, it must be woven into the very fabric of your business so that you can mitigate data risk while exploring AI’s opportunities and embracing the unknown.
Breaches have always been, and always will be, imminent, but now more than ever you’ll need to be mindful of how your security strategy changes along with this fast-moving technology. That way, you can continuously understand how your data moves and morphs, classify its importance and manage it while complying with the different regulations that will inevitably emerge around AI.
A multifaceted, agile and ever-evolving cybersecurity approach is essential to endure this disruption. And, as this conversation around AI evolves, we’ll share more on our evolved thinking, too. Stay tuned.
This article was originally published on epam.com/insights.
The authors of this article are:
- Sam Rehman. SVP, Chief Information Security Officer, EPAM.
- Val Tsitlik. VP, Head of Data & Analytics, EPAM.