How to get the most out of your AI/ML investments: Start with your data infrastructure
The Big Data era has democratized information, generated a wealth of data, and grown revenue at technology-driven companies. But for all this intelligence, we aren’t getting the level of insight one might expect from machine learning, because many companies struggle to create useful machine learning (ML) projects. A successful AI/ML program doesn’t start with a large group of data scientists. It starts with robust data infrastructure. Data needs to be reliable, accessible across systems, and ready for analysis so data scientists can quickly run comparisons and deliver business results. This is the challenge many companies face when starting a data science program.
The problem is that many companies rush into data science, hire expensive data scientists, and then discover they don’t have the tools or infrastructure those scientists need to succeed. Highly paid researchers end up spending their time sorting, validating, and preparing data instead of searching for insights. This infrastructure work is important, but it also squanders the chance for data scientists to apply their most valuable skills where they matter most.
Challenges with data management
When leaders examine why a data science project succeeded or failed (and 87% of projects never make it into production), they often find that the company tried to get ahead of the curve without first building a reliable data foundation. Without that foundation, data engineers can spend up to 44% of their time maintaining data pipelines as APIs and data structures change. Creating an automated data integration workflow gives engineers that time back and helps ensure the company has all the data it needs for accurate machine learning. It also cuts costs and maximizes efficiency as companies build out their data science capabilities.
Narrow data delivers narrow insights
Machine learning is finicky: if there are gaps in the data, or the data isn’t formatted properly, machine learning either fails to work or, worse, gives incorrect results.
When companies are uncertain about their data, most ask the data science team to manually label the dataset as part of supervised machine learning, but this is a time-consuming process that introduces additional risk to the project. Worse, when the training examples are cut down too far because of data problems, the narrow scope often means the ML model can only tell us what we already know.
The solution is to ensure the team can pull from a central, comprehensive data warehouse that spans a variety of sources and provides a shared understanding of the data. This improves the potential ROI of ML models by giving them more consistent data to work with. A data science program can only grow if it is built on reliable, consistent data and a clear understanding of the confidence bars on its results.
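As a concrete, hedged illustration of what “pulling from a central warehouse” can look like in practice, the sketch below reads one governed view and runs basic consistency checks before any modeling begins. The connection string, table, and column names (analytics.customer_features, customer_id, renewed) are hypothetical placeholders, not a prescribed schema.

```python
# Minimal sketch: read one governed warehouse view and verify it is
# consistent enough to hand to data scientists. All names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@warehouse.example.com:5432/analytics")
df = pd.read_sql("SELECT * FROM analytics.customer_features", engine)

# Basic readiness checks: column coverage, duplicate keys, label completeness.
null_share = df.isna().mean()
sparse_columns = null_share[null_share > 0.05]
if not sparse_columns.empty:
    raise ValueError(f"Columns with more than 5% nulls:\n{sparse_columns}")
if df["customer_id"].duplicated().any():
    raise ValueError("Duplicate customer_id rows in the warehouse view")
if df["renewed"].isna().any():
    raise ValueError("Missing labels would silently shrink the training set")
```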
Big data vs. valuable data
One of the biggest challenges to a successful data science program is balancing the volume and the value of the data behind a prediction. A social media company that analyzes billions of interactions a day can rely on a large volume of relatively low-value actions (e.g., someone swiping up or sharing an article) to generate reliable predictions. An organization trying to determine which customers are likely to renew their contracts at the end of the year, however, is working with far smaller datasets in which each outcome carries big consequences. Since it can take a year to learn whether the recommended actions led to success, this creates major limitations for a data science program.
In these situations, companies need to break down their internal data silos and combine all the data they have in order to make the best recommendations. This may include first-party website data, data from customer interactions with the product, success outcomes, support tickets and reviews, customer satisfaction surveys, and even unstructured data such as user feedback. All of these sources contain clues about whether a customer will renew. By combining data across business teams in the warehouse, metrics can be normalized and gain enough depth and breadth to generate reliable predictions.
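As a rough sketch of what combining those sources can look like, the snippet below joins hypothetical per-customer extracts (website sessions, product usage, support interactions) onto a renewals table and fits a simple baseline classifier. Every file, column, and model choice here is illustrative only.

```python
# Hedged sketch: assemble a per-customer feature table for renewal
# prediction from several hypothetical source extracts, then fit a
# simple baseline model. All names are illustrative placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

renewals = pd.read_parquet("renewals.parquet")        # customer_id, renewed (0/1 label)
web = pd.read_parquet("web_sessions.parquet")         # first-party website data
usage = pd.read_parquet("product_usage.parquet")      # in-product interaction events
support = pd.read_parquet("support_tickets.parquet")  # tickets and satisfaction scores

features = (
    renewals
    .merge(web.groupby("customer_id", as_index=False)["session_minutes"].sum(),
           on="customer_id", how="left")
    .merge(usage.groupby("customer_id", as_index=False)["event_id"].count()
                .rename(columns={"event_id": "usage_events"}),
           on="customer_id", how="left")
    .merge(support.groupby("customer_id", as_index=False)["csat_score"].mean(),
           on="customer_id", how="left")
    .fillna(0)
)

X = features.drop(columns=["customer_id", "renewed"])
model = GradientBoostingClassifier().fit(X, features["renewed"])
```

The point of the join, not the model, is the payoff: each extra source adds breadth to a dataset that is otherwise too small and too slow-moving to predict renewals on its own.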
To avoid eroding trust in, and returns from, their ML/AI programs, companies can take the following steps.
- Recognize where you are – Does your business have a clear understanding of how ML contributes to the business? Is your infrastructure ready? Don’t try to add fancy gilding on top of murky data; be clear about where you’re starting so you don’t overreach.
- Get all your data in one place – Make sure you have a central cloud data warehouse or data lake defined and integrated. Once everything is centralized, you can start acting on the data and spot any discrepancies in reliability.
- Crawl-walk-run – Follow the proper order of operations as you build your data science program. Focus on data analytics and business intelligence first, then build out data engineering, and only then the data science team.
- Don’t forget the basics – Once you’ve combined, cleaned, and validated all of your data, you’re ready to do data science. But don’t forget the “housekeeping” work required to keep that foundation delivering results. These essential tasks include investing in data sanitization and cataloging, making sure the metrics you target actually improve the customer experience, and maintaining the data connections between systems, whether manually or through managed infrastructure services; a minimal sketch of such housekeeping checks follows this list.
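One way to keep that housekeeping honest is to codify it. The sketch below shows a small, recurring set of data-quality checks that could run after each pipeline sync and block downstream dashboards or models when a table looks unhealthy. The table and column names (customer_id, contract_value, synced_at) are assumptions for illustration, not part of any specific product.

```python
# Hypothetical post-sync "housekeeping" checks. Column names are placeholders.
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> list:
    """Return a list of human-readable data-quality failures."""
    failures = []
    if df["customer_id"].duplicated().any():
        failures.append("duplicate customer_id rows")
    if (df["contract_value"] < 0).any():
        failures.append("negative contract_value")
    staleness = pd.Timestamp.now(tz="UTC") - pd.to_datetime(df["synced_at"], utc=True).max()
    if staleness > pd.Timedelta(days=1):
        failures.append(f"latest sync is {staleness} old")
    return failures

table = pd.read_parquet("customers.parquet")
issues = run_quality_checks(table)
if issues:
    raise RuntimeError("Data quality checks failed: " + "; ".join(issues))
```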
By building the right infrastructure for data science, companies can see what matters to the business and where the blind spots are. Laying the foundation up front can deliver solid ROI, but more importantly, it sets up the data science team to have a significant impact. Securing funding for a flashy data science program is relatively easy, but remember: the majority of such projects fail. It’s harder to budget for “boring” infrastructure tasks, yet data management creates the foundation for data scientists to deliver the most meaningful business impact.
Alexander Lovell is head of product at Fivetran.
DataDecisionMakers
Welcome to the VentureBeat community!
DataDecisionMakers is a place where professionals, including technical people who work with data, can share data-related insights and innovations.
If you want to read about cutting-edge ideas and updates, best practices, and the future of data and data technology, join us at DataDecisionMakers.
You might even consider contributing an article of your own!