The first and easiest way to extract information from a PDF file is to copy and paste the content you need by hand. But if the PDF runs to hundreds or thousands of pages, manually copying and pasting the required information becomes a really hectic task. The next easiest way is to do it with a short piece of code.

PDF (Portable Document Format), like a Word document, stores an electronic version of a document, but it is designed to preserve the layout for printing rather than to expose the structure of the content. That makes it a poor format for information extraction. The task becomes easier when the information you want to extract has a fixed format, as in the image below.

Question format in pdf

As you can see, I had to extract all of the questions above from a PDF file into an Excel or CSV file in which each row contains the question number, question, options and answer. To extract the textual information, we first convert the complete text of the PDF file into a single text file or HTML file, depending on the requirement and usage. I did this in Python. Using pdfminer's TextConverter or HTMLConverter class, we can easily convert the PDF content into text or HTML. Below is the code for the conversion.

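# Assumes Python 3 with the pdfminer.six package installed.
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    """Return the text of every page of the PDF at `path` as one string."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()    # in-memory buffer that collects the extracted text
    laparams = LAParams()  # default layout-analysis parameters
    device = TextConverter(rsrcmgr, retstr, codec='utf-8', laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    password = ""
    maxpages = 0       # 0 means no page limit
    caching = True
    pagenos = set()    # an empty set means all pages

    # open() replaces the Python 2-only file() builtin.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text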

If you replace TextConverter with HTMLConverter in the code above, you get HTML output instead. I extracted the information from the text output. Looking at the text, everything related to a question and its options lies between “(Q” and “Answer >>”, so extracting the text between those two markers gives us each question together with its options.
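As a rough illustration of pulling out everything between “(Q” and “Answer >>”, here is a minimal sketch that uses a regular expression instead of plain string searching. The sample text and the name extract_blocks are made up for the example; the real layout depends on your PDF.

```python
import re

# Hypothetical sample of the converted text; the real layout depends on the PDF.
sample_text = (
    "(Q1) What is 2 + 2?\n"
    "A) 3\nB) 4\nC) 5\nD) 6\n"
    "Answer >> B\n"
    "(Q2) Which planet is closest to the sun?\n"
    "A) Venus\nB) Mars\nC) Mercury\nD) Earth\n"
    "Answer >> C\n"
)


def extract_blocks(text):
    """Return every span that starts at "(Q" and ends just before "Answer >>"."""
    # re.DOTALL lets "." match newlines; ".*?" stops at the nearest "Answer >>".
    pattern = re.compile(r"\(Q.*?(?=Answer >>)", re.DOTALL)
    return [match.group(0).strip() for match in pattern.finditer(text)]


blocks = extract_blocks(sample_text)
for block in blocks:
    print(block)
    print("-" * 20)
```

Each element of blocks then holds one question with its options, ready to be split further.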

  • After extracting a question along with its options, we need to separate the question from the options.
  • For the question, extract the text between “(Q” and “A)” (the start of option A).
  • In the same way, for option A extract the text between “A)” and “B)” (the start of option B). The other options can be extracted similarly.
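The steps above can be sketched roughly as follows. The helper names between and split_block, the four option labels and the sample block are assumptions made for the example, not part of the original script.

```python
def between(text, start, end=None):
    """Return the text after `start` and before `end` (or up to the end of text)."""
    begin = text.find(start)
    if begin == -1:
        return ""
    begin += len(start)
    stop = text.find(end, begin) if end is not None else -1
    if stop == -1:
        stop = len(text)
    return text[begin:stop].strip()


def split_block(block, labels=("A)", "B)", "C)", "D)")):
    """Split one question block into the question text and a list of options."""
    question = between(block, "(Q", labels[0])
    options = []
    for i, label in enumerate(labels):
        # The last option has no following label, so it runs to the end of the block.
        next_label = labels[i + 1] if i + 1 < len(labels) else None
        options.append(between(block, label, next_label))
    return question, options


# Hypothetical block, assumed to have exactly four options labelled A) to D).
block = "(Q1) What is 2 + 2?\nA) 3\nB) 4\nC) 5\nD) 6"
question, options = split_block(block)
print(question)
print(options)
```

This simple approach breaks if a label such as “A)” also appears inside the question text, so it is worth spot-checking the output against the PDF.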

Below is a small generator function that yields the position of every occurrence of the substring we are looking for.

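def findall(string, sub_string):
    """Yield the index of each non-overlapping occurrence of sub_string."""
    print('Entered the findall function')
    s = 0
    while True:
        s = string.find(sub_string, s)
        if s == -1:
            return
        yield s
        s += len(sub_string)  # continue searching after this match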
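One possible way to use this generator is to collect the positions of the start and end markers and slice the text between each pair. The sketch below repeats the function (without the debug print) so that it runs on its own; the sample string is invented.

```python
def findall(string, sub_string):
    """Yield the index of each non-overlapping occurrence of sub_string."""
    start = 0
    while True:
        start = string.find(sub_string, start)
        if start == -1:
            return
        yield start
        start += len(sub_string)


# Hypothetical, shortened sample of the converted text.
text = "(Q1) first Answer >> A (Q2) second Answer >> B"

starts = list(findall(text, "(Q"))       # positions where a question begins
ends = list(findall(text, "Answer >>"))  # positions where its answer marker begins

# Pair each start with the matching end and cut out the text in between.
chunks = [text[s:e].strip() for s, e in zip(starts, ends)]
print(starts, ends)
print(chunks)
```

Pairing with zip assumes every “(Q” has exactly one matching “Answer >>” after it; if the counts differ, the extra markers are silently ignored.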

When you write the CSV file, always write it in UTF-8 encoding; otherwise spaces and other characters may show up as strange special characters that make the file unreadable. On Linux, the problem can also be resolved by choosing UTF-8 as the encoding when opening the CSV file. We generally use csv.writer to write CSV files. As an alternative, you can use the xlwt library, which writes Excel (.xls) workbooks rather than CSV files and so sidesteps the CSV encoding issue.
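For the plain CSV route, a minimal sketch on Python 3 might look like this. The rows, column names and output path are invented for the example; the important part is passing encoding="utf-8" to open().

```python
import csv
import os
import tempfile

# Hypothetical rows: question number, question, options A to D, answer.
header = ["number", "question", "A", "B", "C", "D", "answer"]
rows = [
    ["1", "What is 2 + 2?", "3", "4", "5", "6", "B"],
    ["2", "Capital of Côte d'Ivoire?", "Abidjan", "Yamoussoukro", "Accra", "Lomé", "B"],
]


def write_csv(path, header, rows):
    # encoding="utf-8" keeps non-ASCII characters intact;
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


# Written to the system temp directory only to keep the example self-contained.
csv_path = os.path.join(tempfile.gettempdir(), "questions.csv")
write_csv(csv_path, header, rows)
```

If the file will mostly be opened in Excel, encoding="utf-8-sig" may be worth trying, since it adds a byte order mark that helps Excel detect UTF-8.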

References:

  1. https://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text