
A couple of weeks back I was reading a dynamic book on “Feature Engineering” by Max Kuhn & Kjell Johnson.

I find it quite amusing when PDFs are filled with highlights and notes. Being an individual engrossed in Data Science, I was taking notes and highlighting text all over the book.

At that very moment I thought of extracting all my highlighted text and creating a mini PDF out of it, just for me: my version of the book with all the ideas and concepts that I found crucial and creative. As usual, I started googling extraction methods, and I was genuinely surprised that I could not find enough material on extracting highlighted text from PDFs.

There were tons of articles, code samples, and projects on extracting tables, images, and text from PDFs using libraries like PyPDF2, PDFMiner, and tabula, but very few on extracting highlighted text.

So, I started looking into all kinds of libraries for handling PDFs and came across PyMuPDF.

To be honest, the library looks complicated in the beginning, but there is a lot of amazing stuff you can do with its tools.

Now, without any further delay, let’s start exploring PyMuPDF.

In this article, I am going to talk about the extraction of highlighted text; for more details, see the PyMuPDF documentation. For demonstration, I am going to use a sample PDF that I created with some text from Wikipedia.

So, let’s start by installing the required packages.

Once you have the packages installed, it’s time to jump into the concepts.
