Is Your Approach to Data Management Holding Your Scientists Back?

As the leader of a team of research scientists, Elena faces a growing problem: data. In her organization, vast amounts of scientific data are being generated at an increasing rate. The real obstacle, however, lies in the data’s accessibility and usability. Her team is often torn between making research decisions based on incomplete data and spending considerable time collecting and joining data from disparate informatics systems and lab instruments. Neither option is ideal, but Elena doesn’t have the luxury of time, so she ends up making project decisions she’s not fully confident in.

These challenges, faced by Elena and countless others in the scientific community, signal the need for a more integrated approach to scientific data management that prioritizes scientists. While making data FAIR (Findable, Accessible, Interoperable, and Reusable) is a crucial part of this process, it alone is insufficient. A science-aware approach is also needed to ensure that disparate data is consolidated, accessible, and usable for all scientists—not just data scientists—to boost efficiency and accelerate research. In other words, scientists should be able to search, analyze, and visualize data in one place with full scientific context.

Getting to the point where a scientist has all the contextualized scientific data they need when they need it is easier said than done. Nevertheless, labs can move closer to this goal by understanding the hurdles associated with adopting a science-aware approach to data consolidation and taking practical steps to overcome them.

The scientific data management challenge

In most scientific organizations, data is constantly being generated by lab instruments and informatics platforms and stored in a variety of systems and formats. Currently, many labs rely on a patchwork of tools and custom connectors to transfer this data back and forth. However, the vast number of instruments and applications within larger organizations, often sourced from multiple vendors, creates data silos that impede analysis and collaboration. Even organizations with extensive infrastructure for gathering and organizing scientific data struggle to consolidate, join, enrich, and unify this data.

Scientific organizations take several approaches to tackle these issues, yet few effectively address the data challenges scientists face. For example, many organizations have employed scientific data management systems (SDMS), which offer a more integrated approach to capturing, cataloging, and archiving instrument data. However, an SDMS is not designed to solve the problem of siloed data at the application level. Because the instrument data stays disconnected from its scientific context, its scientific value is limited and meaningful analysis becomes harder.

Some organizations have turned to data lakes, warehouses, or lakehouses to consolidate data. Still, these approaches often overlook the needs of scientists. Data lakes are favored for their ability to consolidate vast amounts and varieties of data in its native, unstructured format, but because these systems are designed for data scientists, accessibility and usability can suffer. Conversely, data warehouses store structured data, which enables scientists to view and compare it. However, they are expensive to maintain and lack flexibility. Data lakehouses blend these models, combining lakes’ flexibility and scalability with warehouses’ structured querying capability and scientific awareness. But lakehouses are equally costly to maintain and fail to deliver optimal flexibility and full scientific context. Additionally, they typically lack scientific analytics and must be coupled to an SDMS to solve the instrument data ingestion problem.

Data stored within these systems typically requires extensive and costly coding to be made accessible and usable, which further restricts scientists’ ability to retrieve information, such as details on all experiments conducted for a set of samples. These systems are not meant to be accessed by scientists, so additional investments must be made for each scientific application that requires access. While developing these integrations can cost millions of dollars, labs still won’t end up with a truly scientific application.
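To make the kind of retrieval described above concrete, here is a minimal Python sketch of the lookup "all experiments conducted for a set of samples" when sample records and experiment records live in two separate systems. The record layouts, field names, and the function experiments_for_samples are hypothetical and chosen purely for illustration; they do not reflect any particular vendor's schema or API.

```python
from collections import defaultdict

# Hypothetical exports from two separate systems (illustrative data only).
lims_samples = [
    {"sample_id": "S-001", "material": "plasma"},
    {"sample_id": "S-002", "material": "serum"},
]
eln_experiments = [
    {"experiment_id": "E-10", "sample_id": "S-001", "assay": "PCR"},
    {"experiment_id": "E-11", "sample_id": "S-001", "assay": "ELISA"},
    {"experiment_id": "E-12", "sample_id": "S-003", "assay": "PCR"},
]


def experiments_for_samples(samples, experiments, wanted_ids):
    """Join sample records with experiment records on sample_id."""
    # Index experiments by the sample they were run on.
    by_sample = defaultdict(list)
    for exp in experiments:
        by_sample[exp["sample_id"]].append(exp)

    # Keep only the requested samples and attach their experiments.
    result = {}
    for sample in samples:
        sid = sample["sample_id"]
        if sid in wanted_ids:
            result[sid] = {
                "sample": sample,
                "experiments": by_sample.get(sid, []),
            }
    return result


history = experiments_for_samples(lims_samples, eln_experiments, {"S-001", "S-002"})
for sid, entry in history.items():
    assays = [exp["assay"] for exp in entry["experiments"]]
    print(sid, entry["sample"]["material"], assays)
```

Even this toy join assumes both systems share a clean sample identifier. In practice, much of the integration cost tends to come from reconciling identifiers and formats before a join like this is possible.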

But for scientists, the issue is more fundamental, as Kevin Cramer, founder and chief executive officer of Sapio Sciences, explained: “If you don’t know the history of what’s been done, such as experimental work, then you’re likely to repeat something, wasting a scientist’s valuable time and an organization’s resources.” This limited view of the data can adversely affect the decision-making process and stifle innovation.

Moreover, these systems primarily focus on archiving specific data types or formats without consideration for the long-term needs of the broader research team, including who will need access to the data, how it can be integrated with other data for comprehensive analysis, and how it can support broader research objectives. Such challenges are characteristic of “unFAIR” data, which prolongs the time spent searching for information and diverts attention from more productive work.

Another hurdle emerges when scientists attempt to leverage artificial intelligence (AI) and machine learning (ML) technologies. These tools have the potential to greatly accelerate the pace of scientific discovery by enabling researchers to identify patterns, predict outcomes, and make data-driven decisions. However, AI and ML require access to large volumes of high-quality, well-structured data, which traditional data handling methods often fail to provide. This misalignment hinders the effective utilization of AI and ML, limiting potential innovation.

Ultimately, SDMSs and data lakes, warehouses, and lakehouses are simply data repositories, lacking the tools and capabilities scientists need to derive meaningful insights. To truly support scientific discovery, data management solutions must become science-aware with improved data access and usability, along with features that enable scientists to search, visualize, and analyze data within a single platform. Cramer underscored the importance of this evolution, stating, “Humans are like neural networks in that the more data we receive, the smarter we become. So, in theory, the more scientific data I have, the more I can analyze and visualize, [which makes me] a better scientist both on a day-to-day basis and in the longer term.”

Embracing a science-aware solution

Transitioning to a comprehensive and science-aware data management solution, such as Jarvis from Sapio Sciences, offers numerous advantages for research organizations, enabling scientists to search, analyze, and visualize data with full scientific context. “We created Jarvis just to solve the problem scientists are having [with data management],” added Cramer. “It’s a platform you can give directly to scientists, immediately allowing them to search data and perform meaningful analytics.” These benefits stem from Jarvis’s science-aware approach to data consolidation, empowering scientists to engage with their data on a single, integrated platform.

[Image: Young female scientist analyzes data on a holographic display. Credit: iStock, metamorworks]

One foundation of Jarvis lies in its capacity to collect and parse data from hundreds of instrument types. It then automatically syncs this data with metadata on samples, experiments, and projects from multiple ELN and LIMS applications, regardless of vendor. This automation is achieved through unique no-code pipeline rules, easily configured to detect and import new raw data. By harmonizing diverse data types into a unified knowledge graph and connecting data with necessary context, Jarvis creates well-structured and high-quality data that scientists can actually use. For example, they can search based on scientific queries, like molecular substructures, chemical properties, or assay results. Having data structured in a science-aware way also means it can be used to train high-quality AI and ML models. This approach to scientific data alleviates the burden of manual data management, allowing scientists to redirect their focus toward their primary research activities.
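The general pattern described above (a rule detects a new raw file, the file is parsed, and each result is linked to sample, experiment, and project metadata) can be sketched in a few lines of Python. This is an illustrative sketch only, not Jarvis's implementation or API: the rule format, the metadata lookup, the edge labels, and the ingest function are all assumptions made for the example.

```python
import csv
import fnmatch
import io

# Hypothetical rule: which files to pick up and which instrument they come from.
PIPELINE_RULES = [{"pattern": "*_pcr.csv", "instrument": "qPCR"}]

# Hypothetical metadata, as it might be exported from an ELN or LIMS.
SAMPLE_METADATA = {"S-001": {"experiment": "E-10", "project": "P-7"}}


def match_rule(filename):
    """Return the first rule whose pattern matches the file name, if any."""
    for rule in PIPELINE_RULES:
        if fnmatch.fnmatch(filename, rule["pattern"]):
            return rule
    return None


def ingest(filename, raw_text):
    """Parse a raw CSV export and return graph edges linking results to context."""
    rule = match_rule(filename)
    if rule is None:
        return []  # No rule applies, so the file is ignored.

    edges = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        sid = row["sample_id"]
        result_node = ("result", f"{filename}:{sid}")
        properties = {"ct": float(row["ct"]), "instrument": rule["instrument"]}
        edges.append((("sample", sid), "HAS_RESULT", result_node, properties))

        # Attach scientific context when the sample is known to the ELN/LIMS.
        meta = SAMPLE_METADATA.get(sid)
        if meta:
            experiment = ("experiment", meta["experiment"])
            project = ("project", meta["project"])
            edges.append((("sample", sid), "PART_OF", experiment, {}))
            edges.append((experiment, "PART_OF", project, {}))
    return edges


raw = "sample_id,ct\nS-001,21.5\nS-009,30.0\n"
for edge in ingest("plate1_pcr.csv", raw):
    print(edge)
```

In this sketch, a result for an unknown sample (S-009) is still ingested but carries no experiment or project context, which is the situation a science-aware platform aims to avoid.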

Jarvis goes beyond “passive” science awareness by integrating Sapio’s AI-powered scientific assistant, ELaiN, further simplifying how researchers access, search, and analyze data. ELaiN employs large language model technology that has been trained to run Jarvis on behalf of a user. This means a scientist can pose questions in natural language about their experiments, assays, workflows, samples, materials, inventory, and almost any other data they’re working on within the platform. For instance, a researcher can simply ask, “Show me results from all PCR assays in the last month,” and quickly receive an organized display of relevant data, without having to manually search for it or break down the statement for the software to act on.

Beyond data retrieval, Jarvis and ELaiN enhance data interaction through integrated scientific visualization tools, such as charts and tables. These features augment analysis by enabling interactive exploration of biological and chemical entities, including 2D compounds, 3D biologics, and annotated sequence-based entities. This allows scientists to visualize data using standard scientific tools without leaving the Sapio Jarvis platform.

Additionally, the platform provides built-in statistics, analytics, and reporting tools, eliminating dependencies on external software. These tools support scientists in conducting thorough analyses like flow cytometry gating and curve fitting, running bioinformatics and cheminformatics methods, and driving advanced AI and ML capabilities, including neural networks. Notably, the performance of these neural networks improves significantly when trained on large, consolidated datasets—like those found in Jarvis—leading to more sophisticated insights and predictions.

“If scientists want to get the most out of their data, and take advantage of revolutionary AI and ML models, they need to change the way they think about managing that data,” explained Cramer. “It isn’t enough to simply consolidate it. If a scientist can’t then easily search, visualize, and analyze that data in a science-aware way, you’re ultimately going to be holding back the pace of your research efforts.”

Learn more about Sapio Jarvis and how it can accelerate your research: https://www.sapiosciences.com/products/sapio-jarvis-demo/