The challenge of making data science zero trust
On March 21, President Biden warned about cyberattacks from Russia and reiterated the need to improve the state of cybersecurity in the country. We live in a world where adversaries have many ways into our systems. Today’s security professionals therefore need to act on the premise that no part of a network should be trusted. Malicious actors increasingly have free rein in cyberspace, so compromise must be anticipated at every node. The industry calls this “zero trust” architecture. In other words, in the digital world we must now assume that adversaries are everywhere and act accordingly.
A recent executive order from the Biden administration specifically calls for a zero-trust approach to U.S. government data security, building on the Department of Defense’s own zero-trust strategy released earlier this year.
Today’s digital world is so fundamentally insecure that a zero-trust strategy is warranted wherever computing takes place, with one exception: data science.
It is not yet possible to embrace the principles of zero trust while also enabling data science operations and the AI systems they generate. As calls to adopt AI grow, so does the gap between the needs of cybersecurity and an organization’s ability to invest in data science and AI.
Finding ways to apply evolving security practices to data science has become the most pressing policy issue in the tech world.
Data science’s trust problem
Data science is based on human judgment, which means that in the process of creating analytic models, someone, somewhere has to be trusted. How else can we take large volumes of data, evaluate its value, clean and transform it, and then build models based on the insights the data holds?
If we completely removed trusted actors from the lifecycle of the analytic model, which is the logical conclusion of the zero-trust approach, that lifecycle would collapse: there would be no data scientist left to do the modeling.
In fact, data scientists spend only about 20% of their time on activity that could reasonably be considered “data science.” The remaining 80% goes to more laborious work: evaluating, cleaning, and transforming raw datasets to make the data ready for modeling, a process commonly referred to as “data munging.”
Munging is at the heart of all analytics. No munging, no model. And without trust, there can be no munging. Munging requires raw access to the data, it requires the ability to change that data in sometimes unpredictable ways, and it frequently requires unconstrained time with the raw data itself.
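As a concrete illustration, here is a minimal sketch of what munging can look like in practice. The records, field names, and date formats are hypothetical, and real pipelines involve far messier data and heavier tooling; the point is simply that the work requires direct, hands-on access to raw values.

```python
from datetime import datetime

# Hypothetical raw records, as a data scientist might receive them:
# inconsistent casing, stray whitespace, mixed date formats, missing values.
raw_records = [
    {"name": "  Alice ", "signup": "2021-03-14", "age": "34"},
    {"name": "BOB", "signup": "14/03/2021", "age": ""},
    {"name": "carol", "signup": "2021-03-15", "age": "29"},
]

def parse_date(value):
    """Try a few common date formats; munging often means guessing."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None

def munge(records):
    """Normalize names, parse dates, and coerce ages to integers."""
    cleaned = []
    for rec in records:
        cleaned.append({
            "name": rec["name"].strip().title(),
            "signup": parse_date(rec["signup"]),
            "age": int(rec["age"]) if rec["age"] else None,
        })
    return cleaned

cleaned = munge(raw_records)
```

Note that every step here reads and rewrites the raw data directly, which is exactly the kind of unrestricted access zero trust is designed to prevent.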
Now compare the requirements of munging with the demands of zero trust. Here is how the National Institute of Standards and Technology (NIST) describes implementing zero trust in practice:
… protections usually involve minimizing access to resources (such as data and compute resources and applications/services) to only those subjects and assets identified as needing access as well as continually authenticating and authorizing the identity and security posture of each access request …
According to this description, for zero trust to work, every request to access data must be individually and continuously authenticated (“Is the right person asking for the right access to the data?”) and authorized (“Is the requested access permitted or not?”). In effect, this is akin to inserting administrative oversight between writers and their keyboards, reviewing and approving every key before it is pressed. Put simply, the need to munge, which means full, unfettered access to raw data, undermines any core tenet of zero trust.
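The per-request checks in the NIST description can be sketched as follows. The credentials, policy table, and dataset names are hypothetical; a real deployment would verify cryptographic credentials and device posture rather than static dictionaries.

```python
# Per-request authentication and authorization in the spirit of zero trust:
# no session is trusted, so both checks run on every single request.

CREDENTIALS = {"dscientist": "s3cret"}        # identity -> secret (hypothetical)
POLICIES = {"dscientist": {"sales_2021"}}     # identity -> permitted datasets

def authenticate(user, secret):
    """Re-verify identity on every request, never once per session."""
    return CREDENTIALS.get(user) == secret

def authorize(user, dataset):
    """Grant access only to resources this identity explicitly needs."""
    return dataset in POLICIES.get(user, set())

def access(user, secret, dataset):
    if not authenticate(user, secret):
        return "denied: authentication failed"
    if not authorize(user, dataset):
        return "denied: not authorized for this dataset"
    return f"granted: {dataset}"
```

A munging workflow would trip these checks constantly: exploratory work touches datasets no policy anticipated, which is precisely the tension the article describes.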
So what to do?
Zero trust for data science
There are three basic principles that can help align the emerging requirements of zero trust with the needs of data science: minimization, distributed data, and high observability.
We start with minimization, a concept that is embedded in a wide range of data protection laws and regulations and is a longstanding principle in the information security community. The minimization principle mandates that no more data than is necessary can be accessed for any given task. This ensures that if a breach occurs, there is some limit to how much data is exposed. If we think in terms of the “attack surface,” minimization keeps the attack surface as shallow as possible: even a successful attacker will not gain access to all the underlying data, only some of it.
This means that before data scientists engage with raw data, they should verify how much data they need and in what form they need it. Do they need the full social security number? Rarely. Do they need the full date of birth? Sometimes. Hashing, or other basic anonymization or pseudonymization techniques, should be applied as widely as possible as baseline defensive measures. Ensuring that basic minimization is applied to data will help limit the impact of any successful attack, making it the first and best way to apply zero trust to data science.
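A minimal sketch of minimization, assuming a hypothetical record layout: full identifiers are hashed or coarsened before a data scientist ever sees them.

```python
import hashlib

def minimize(record, salt="example-salt"):
    """Return a minimized view of a record; `salt` would be secret in practice."""
    return {
        # Pseudonymize: keep linkability across records, drop the raw SSN.
        "ssn_hash": hashlib.sha256((salt + record["ssn"]).encode()).hexdigest(),
        # Generalize: birth year alone is often enough for modeling.
        "birth_year": record["dob"][:4],
        # Pass through only the fields the model actually needs.
        "purchase_total": record["purchase_total"],
    }

record = {"ssn": "123-45-6789", "dob": "1988-07-21", "purchase_total": 42.50}
safe = minimize(record)
```

One caveat on the design: salted hashes of low-entropy identifiers like SSNs can be brute-forced, so production systems typically use a keyed construction (e.g., HMAC with a protected key) rather than a bare salt; the sketch only shows the shape of the idea.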
Sometimes minimization may not be possible, given the needs of the data scientist and their use case. In healthcare and life sciences, for example, there may simply be no way around using patient or diagnostic data for modeling. In such cases, the following two principles become even more important.
The principle of distributed data requires decentralized data storage to limit the impact of any single breach. If minimization keeps the attack surface shallow, distributed data ensures the surface is as wide as possible, raising the time and resource costs of any successful attack.
For example, while many departments and agencies within the U.S. government have suffered major attacks, one organization has not: Congress. This is not because the First Branch has a better grasp of the nuances of cybersecurity than its peers, but simply because there is no single “Congress” from a cybersecurity perspective. Each of its 540-plus offices manages its own IT resources separately, meaning an intruder would need to successfully infiltrate hundreds of separate environments rather than just one. As Dan Geer warned almost two decades ago, diversity is one of the best defenses against single-source failures. The more distributed data is, the harder it is to compromise in bulk, and the more protected it is over time.
One caveat, however: diverse computing environments are complex, and complexity is itself costly in terms of time and resources. Embracing this kind of diversity cuts against, in many ways, the trend toward single-cloud environments, which are designed to simplify IT needs and move organizations away from siloed data. Data mesh architectures are helping to square this circle, maintaining decentralized storage while unifying access through a single data access layer, meaning some limits on distributed data can be warranted in practice. And this brings us to the final point: high observability.
High observability is the monitoring of as much activity in an environment as possible: enough to form a compelling baseline for what constitutes “normal” behavior so that meaningful deviations from that baseline can be detected. This can be applied at the data layer, tracking what the underlying data looks like and how it changes over time. It can be applied at the query layer, understanding how and when data is being queried, by whom, and what individual queries look like. And it can be applied at the user layer, understanding which individual users are accessing the data and when, and monitoring these factors both in real time and in retrospective audits.
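A toy sketch of query-layer observability: build a baseline from a user's recent daily query counts, then flag days that deviate sharply from it. The counts and the three-sigma threshold are illustrative assumptions, not a recommended policy.

```python
from statistics import mean, stdev

# A user's recent daily query counts (hypothetical baseline data).
history = [102, 98, 110, 95, 105, 99, 101]

def is_anomalous(count, baseline, threshold=3.0):
    """Flag counts more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(count - mu) > threshold * sigma

normal_day = is_anomalous(103, history)    # near the baseline
strange_day = is_anomalous(500, history)   # a roughly fivefold spike
```

Real monitoring systems track many such signals per layer and account for seasonality and drift, but the core pattern is the same: establish what normal looks like, then alert on meaningful deviation.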
At a fundamental level, some data scientists, somewhere, must be fully trusted if they are to do their jobs successfully, and observability is the last and best defense available to organizations: it ensures that any compromise is at least detected, even when it cannot be prevented.
Note that observability is only protective in layers. Organizations must monitor each layer and the interactions between them to fully understand their threat environment and to protect their data and analytics. For example, anomalous activity at the query layer might be explained by activity at the user layer (is it the user’s first day on the job?) or by changes to the data itself (has the data drifted so significantly that a more expansive query was needed to determine how it changed?). Only by understanding how changes and patterns at each layer interact can organizations develop a rich enough understanding of their data to implement a zero-trust approach that still permits data science in practice.
What’s next?
Admittedly, adopting a zero-trust approach to data science environments is not easy. To some, applying the principles of minimization, distributed data, and high observability to these environments might seem impossible, at least in practice. But failing to secure data science environments now will only make adopting zero trust harder over time, leaving entire data science programs and AI systems fundamentally insecure. That means now is the time to get started, even if the path ahead is not entirely clear.
Matthew Carroll is the CEO of Immuta.