Microsoft offers guide to pirating Harry Potter series for LLM training

Microsoft SQL now supports native vector search capabilities in Azure SQL and SQL database in Microsoft Fabric. We have also released the langchain-sqlserver package, which lets you use SQL Server as a vector store in LangChain. In this step-by-step tutorial, we will show you how to add generative AI features to your own applications with just a few lines of code using Azure SQL DB, LangChain, and LLMs.
The Harry Potter series, written by J.K. Rowling, is a globally beloved collection of seven books that follow the journey of a young wizard, Harry Potter, and his friends as they battle the dark forces led by the evil Voldemort. Its captivating plot, rich characters, and imaginative world have made it one of the most famous and cherished series in literary history.
By using a well-known dataset, we can create engaging and relatable examples that resonate with a wide audience.
This sample dataset from Kaggle contains seven .txt files, one for each of the seven Harry Potter books. For this demo we will only use the first book, Harry Potter and the Sorcerer’s Stone.
Use cases:
Whether you’re a tech enthusiast or a Potterhead, we have two exciting use cases to explore: a Q&A system over the books and a fan fiction generator.
The code lives in the integration package langchain-sqlserver.
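As a sketch of what wiring this up can look like: the constructor below follows the langchain-sqlserver package's SQLServer_VectorStore class, but the parameter names may differ in your installed version, and the server, database, and table names are placeholders, not values from this tutorial.

```python
def make_vector_store(conn_str, embeddings):
    """Sketch: create a LangChain vector store backed by Azure SQL.
    Import is deferred so this file runs without the package installed;
    parameter names are assumptions -- check the package docs."""
    from langchain_sqlserver import SQLServer_VectorStore

    return SQLServer_VectorStore(
        connection_string=conn_str,
        embedding_function=embeddings,
        embedding_length=1536,              # dimension of the embedding model
        table_name="harry_potter_chunks",   # hypothetical table name
    )

# Building the ODBC connection string is plain Python; the server and
# database below are placeholders to replace with your own.
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<your-server>.database.windows.net,1433;"
    "Database=<your-db>;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)
```

The vector store then handles creating the table, storing embeddings, and running the distance computation inside SQL.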
In this example, we will use a dataset consisting of text files from the Harry Potter books, which are stored in Azure Blob Storage.
LangChain integrates seamlessly with Azure Blob Storage, making it easy to load documents directly from it.
Additionally, LangChain provides a way to split long text into smaller chunks via the langchain-text-splitters package, which is essential because Azure OpenAI embedding models have an input token limit.
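To make the chunk/overlap idea concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. This is only an illustration of the principle; langchain-text-splitters does this far more carefully, preferring paragraph and sentence boundaries over raw character offsets.

```python
def chunk_text(text: str, chunk_size: int = 400, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share
    chunk_overlap characters so context at a boundary is not lost."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - chunk_overlap  # step forward, keeping the overlap
    return chunks
```

In the notebook itself you would use RecursiveCharacterTextSplitter(chunk_size=..., chunk_overlap=...) from langchain-text-splitters instead.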
In this example we use Azure OpenAI to generate embeddings of the split documents; however, you can use any of the embedding models supported by LangChain.
After splitting the long text files of Harry Potter books into smaller chunks, you can generate vector embeddings for each chunk using the Text Embedding Model available through AzureOpenAI. Notice how we can accomplish this in just a few lines of code!
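The embedding client really is only a few lines. This sketch assumes the AzureOpenAIEmbeddings class from the langchain-openai package; the deployment name, endpoint, and API version below are placeholders, and the function is defined but not called here because it needs real credentials.

```python
def get_embeddings():
    """Sketch: build an Azure OpenAI embedding client.
    Deployment, endpoint, and api_version are placeholder assumptions."""
    from langchain_openai import AzureOpenAIEmbeddings

    return AzureOpenAIEmbeddings(
        azure_deployment="text-embedding-ada-002",          # your deployment name
        azure_endpoint="https://<your-resource>.openai.azure.com/",
        api_version="2024-02-01",
    )
```

Passing this object to the vector store as its embedding function means each chunk is embedded when documents are added and each query is embedded at search time.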
Once your vector store has been created and the relevant documents have been added, you can perform similarity searches.
The vectorstore also supports a set of filters that can be applied against the metadata fields of the documents. By applying filters based on specific metadata attributes, users can limit the scope of their searches, concentrating only on the most relevant data subsets.
A simple similarity search can be performed with the similarity_search_with_score method:
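Under the hood this is a nearest-neighbour search over the stored embeddings. The following self-contained sketch illustrates the idea with toy vectors, cosine similarity, and a metadata filter dict; it is not the package's actual implementation, where the distance is computed inside SQL.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def toy_search_with_score(query_vec, docs, k=3, filter=None):
    """docs: list of (text, vector, metadata) triples.
    Returns the top-k (text, score) pairs, optionally keeping only docs
    whose metadata matches every key/value in the filter dict."""
    candidates = [
        (text, cosine_sim(query_vec, vec))
        for text, vec, meta in docs
        if filter is None or all(meta.get(key) == val for key, val in filter.items())
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]
```

With the real vector store you call vectorstore.similarity_search_with_score(query, k=..., filter=...) and get back documents with their distance scores.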
The Q&A function allows users to ask specific questions about the story, characters, and events, and get concise, context-rich answers. This not only enhances their understanding of the books but also makes them feel like they’re part of the magical universe.
The LangChain vector store simplifies building sophisticated Q&A systems by enabling efficient similarity searches to find the top 10 most relevant documents for the user’s query.
You can read more about LangChain’s RAG tutorials and the terminology mentioned above in the LangChain documentation.
We can now ask a user question and receive a response from the Q&A system:
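Stripped of the LLM call, the retrieval-augmented pattern is: retrieve the top chunks, paste them into a prompt, and ask the model to answer only from that context. Here is a minimal sketch of the prompt-assembly step; the template wording is an assumption, not the notebook's exact prompt.

```python
def build_qa_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a grounded Q&A prompt from chunks returned by the
    similarity search, numbered so the answer can cite them."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The assembled prompt is then sent to a chat model (e.g. an Azure OpenAI chat deployment), with the retrieved chunks coming from the similarity search described earlier.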
Potterheads are known for their creativity and passion for the series. With this tool they can craft their own stories from a given prompt, explore new adventures, and even create alternate endings. Whether it’s imagining a new duel between Harry and Voldemort or crafting a personalized Hogwarts bedtime story for your kiddo, the possibilities are endless.
The fan fiction function uses the embeddings in the vector store to generate new stories.
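A sketch of how such a function might assemble its creative prompt while keeping track of which passages inspired it. The wording and the (chunk, source) convention are assumptions for illustration; the notebook's own prompt will differ.

```python
def build_fanfic_prompt(user_prompt: str, inspiration: list[tuple[str, str]]):
    """inspiration: (chunk_text, source_label) pairs retrieved from the
    vector store. Returns (prompt, sources) so the inspiring passages
    can be listed alongside the generated story."""
    sources = [src for _, src in inspiration]
    excerpts = "\n".join(f"- {text}" for text, _ in inspiration)
    prompt = (
        "Write a short Harry Potter fan fiction story in the style of the "
        f"excerpts below.\n\nExcerpts:\n{excerpts}\n\nStory idea: {user_prompt}\n"
    )
    return prompt, sources
```

Returning the sources alongside the prompt is what lets the demo show which passages from the vector store inspired the generated story.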
Let’s imagine we prompt with this:
Don’t miss the discussion around the “Vector Support in SQL – Public Preview” by young Davide on the Hogwarts Express! Even the Wizards are excited!
As you can see, along with generating the story, it also cites the sources of inspiration from the vector store above.
Combining the Q&A system with the fan fiction generator offers a unique and immersive reading experience.
If users come across a puzzling moment in the books, they can ask the Q&A system for clarification. If they’re inspired by a particular scene, they can use the fan fiction generator to expand on it and create their own version of events. This interactive approach makes reading more engaging and enjoyable.
You can find this notebook in the GitHub Repo along with other samples: https://github.com/Azure-Samples/azure-sql-db-vector-search.
We’d love to hear your thoughts on this feature! Please share how you’re using it in the comments below and let us know any feedback or suggestions for future improvements. If you have specific requests, don’t forget to submit them through the Azure SQL and SQL Server feedback portal, where other users can also contribute and help us prioritize future developments. We look forward to hearing your ideas!
Alex Chen
Senior Tech Editor. Covering the latest in consumer electronics and software updates. Obsessed with clean code and cleaner desks.