Using Their Own Data to Train AI Systems: How Organizations Can Harness Generative AI Effectively
Organizations looking to harness the power of generative artificial intelligence (AI) should prioritize using their own data to train AI systems. By using foundation models as a starting point and incorporating their own data, organizations can provide more relevant context and alleviate concerns about potential risks, such as inaccuracy and intellectual property infringements.
Accuracy is particularly important for companies like Jiva, an agritech vendor that uses AI in its mobile app, Crop Doctor. The app applies image processing and computer vision to identify crop diseases and recommend treatments. Jiva also employs AI to assess the creditworthiness of farmers who request cash advances ahead of a harvest, with loans repaid when the harvest pays out. To train its AI models, Jiva uses a range of AI and machine learning tools, including Pinecone, OpenAI, scikit-learn, TensorFlow, and Vertex AI. The company operates in Singapore, Indonesia, and India.
Jiva collects thousands of annotated images for each disease, sourced from its field teams and from farmers who use its AgriCentral app. Field teams and agronomy experts annotate the images, which are then added to the training dataset. For crops Jiva's team is less familiar with, the company leverages platforms such as Plantix, which provide extensive image-recognition datasets and diagnosis information.
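Jiva has not published its data pipeline, but a common first step with annotated images like these is shuffling the labeled pairs into training and validation splits. The file names and disease labels below are invented for illustration, not Jiva's actual schema:

```python
import random

def train_val_split(samples, val_fraction=0.2, seed=42):
    """Shuffle annotated (image_path, label) pairs and split them
    into training and validation sets."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - val_fraction))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical annotations from field teams: image path -> disease label.
annotations = [
    (f"images/leaf_{i:04d}.jpg", "corn_rust" if i % 2 else "leaf_blight")
    for i in range(1000)
]
train_set, val_set = train_val_split(annotations)
```

Fixing the shuffle seed keeps the split reproducible, so every retraining run evaluates against the same held-out images.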
Ensuring the accuracy of information is crucial, as it directly affects farmers’ harvests and livelihoods. Jiva’s CTO Tejas Dinkar emphasizes the importance of using only datasets sourced and vetted by Jiva to maintain data veracity. Additionally, Jiva’s chatbot is engineered to ignore any pretrained farming knowledge embedded in large language models (LLMs), so it either gives responses grounded in Jiva’s data or acknowledges when it cannot identify a crop disease.
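Dinkar does not describe the implementation, but the behavior he describes, answering only from vetted material and admitting when nothing matches, can be sketched as a retrieval gate placed in front of the model. The knowledge-base entries here are invented examples:

```python
# Vetted, company-sourced entries only (contents are hypothetical).
VETTED_KB = {
    "corn rust": "Apply a registered fungicide and rotate crops next season.",
    "leaf blight": "Remove infected residue and improve field drainage.",
}

def answer_from_vetted_data(query: str) -> str:
    """Answer only from the vetted knowledge base; never fall back to
    an LLM's pretrained notions about farming."""
    for disease, advice in VETTED_KB.items():
        if disease in query.lower():
            return advice
    # Acknowledge uncertainty rather than guess.
    return "Unable to identify this crop disease from vetted data."
```

The key design choice is the explicit fallback string: the system prefers admitting ignorance over letting the model improvise an answer that could damage a harvest.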
Jiva also uses its image library to enhance existing platforms like Plantix. While such models serve as a good baseline, they may not be adequately trained on region- or market-specific data. Jiva has trained models for crops common in Indonesia and India, such as corn, and these have outperformed off-the-shelf products like Plantix, underscoring the importance of localization in AI models.
While foundation models can provide a quick start with generative AI, organizations should aim to fine-tune these models with their own data for better results. Olivier Klein, Amazon Web Services’ (AWS) Asia-Pacific chief technologist, says that organizations that put in the effort to properly fine-tune AI models will progress faster in their implementations. Incorporating generative AI into an organization’s data strategy and platform is also crucial for success.
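As a toy illustration of what fine-tuning means in practice, reusing a frozen "foundation" feature extractor and training only a small head on the organization's own data, here is a pure-Python sketch. The features, data points, and learning rate are all invented for the example:

```python
import math

# Frozen "foundation" feature extractor (stand-in for pretrained layers):
# its internals are fixed and never updated during fine-tuning.
def base_features(x):
    return [x, x * x]

# Hypothetical domain-specific data: inputs with binary labels.
data = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]

# Trainable head: one weight per feature plus a bias, fit by SGD.
w, b = [0.0, 0.0], 0.0
lr = 0.5
for _ in range(500):
    for x, y in data:
        f = base_features(x)
        z = sum(wi * fi for wi, fi in zip(w, f)) + b
        p = 1 / (1 + math.exp(-z))   # sigmoid activation
        err = p - y                  # gradient of log loss w.r.t. z
        w = [wi - lr * err * fi for wi, fi in zip(w, f)]
        b -= lr * err

def predict(x):
    """Probability that x belongs to the positive class."""
    z = sum(wi * fi for wi, fi in zip(w, base_features(x))) + b
    return 1 / (1 + math.exp(-z))
```

Only the small head is updated, which is why fine-tuning needs far less data and compute than training the whole stack from scratch.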
One challenge organizations face is whether they have enough data of their own to train an AI model. Klein notes, however, that data quantity does not equate to data quality. Proper data annotation and contextualization are essential if AI models are to generate industry-specific responses; annotating individual components of the training data helps the system understand the data and identify what matters.
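What "annotating individual components" can look like in practice is a structured record per training sample. The fields below are illustrative assumptions for a crop-image workflow, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class AnnotatedSample:
    """One training example with component-level annotations."""
    image_path: str
    crop: str
    disease: str
    # Bounding boxes (x, y, width, height) around individual lesions.
    regions: list = field(default_factory=list)
    annotator: str = "unknown"
    vetted: bool = False  # has a domain expert reviewed this label?

# Hypothetical record produced by an agronomy team.
sample = AnnotatedSample(
    image_path="images/leaf_0001.jpg",
    crop="corn",
    disease="corn_rust",
    regions=[(120, 80, 40, 40)],
    annotator="agronomy_team",
    vetted=True,
)
```

Carrying the annotator and vetting status alongside each label is what lets a small, well-curated dataset outperform a large unvetted one.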
Klein dispels the misconception that all AI systems are alike, stressing that organizations need to tweak AI models for their specific use cases and verticals. Generative AI has drawn attention in call centers, where it can help agents respond better on the fly and improve customer service. Call center operators can train the model on their own knowledge base, including chatbot and customer interactions.
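A call-center assistant grounded in the operator's own knowledge base can be sketched as retrieve-then-respond. The word-overlap scoring here is a naive stand-in for a real retriever, and the knowledge-base entries are invented:

```python
# Hypothetical operator knowledge base built from past interactions.
KNOWLEDGE_BASE = [
    "To reset your password, open Settings and choose Security.",
    "Refunds are processed within five business days of approval.",
    "Premium plans include priority support and a dedicated agent.",
]

def suggest_response(customer_query: str) -> str:
    """Suggest the knowledge-base entry sharing the most words
    with the customer's query, for the agent to adapt on the fly."""
    query_words = set(customer_query.lower().split())

    def overlap(entry):
        return len(query_words & set(entry.lower().split()))

    best = max(KNOWLEDGE_BASE, key=overlap)
    return best if overlap(best) > 0 else "Escalate to a human agent."
```

Because suggestions come only from the operator's own material, the agent sees answers consistent with company policy rather than generic model output.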
Fine-tuning an existing LLM with domain-specific content requires significantly less data than building a new foundation model from scratch. The approach is less computationally intensive and can be more cost-effective, although it still requires data science expertise. Notably, not all LLM providers permit users to fine-tune on top of their models.
Using their own data also addresses concerns about data control and privacy. Organizations want to retain control over the data used to train AI models and ensure it remains within their own environments. This approach fosters transparency, establishes responsible AI adoption, and avoids the “black box” problem.
In conclusion, organizations can effectively harness generative AI by using their own data to train AI systems. By starting with foundation models and incorporating their own data, organizations can provide more relevant context, improve accuracy, and address concerns about data control. Fine-tuning AI models and contextualizing them for specific industries and use cases further enhances the effectiveness of generative AI.