Processing PDF Documents in Lucene

alias 2007. 9. 2. 00:08

* Processing PDF documents with PDFBox
1. PDFBox
PDFBox can be downloaded from http://www.pdfbox.org/ . As of August 2007 the current version is 0.7.3. The zip file contains Windows executables for processing PDFs, along with the jar, the war, and the Java sources. Add PDFBox-0.7.3.jar to the CLASSPATH.
The javadoc for the jar is available at http://www.pdfbox.org/javadoc/index.html .
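
A quick way to confirm that the jar really is on the CLASSPATH is to try loading one of its classes. The sketch below is only an illustration: ClasspathCheck and isOnClasspath are hypothetical names, not part of PDFBox, and the class name org.pdfbox.pdmodel.PDDocument is assumed to match the 0.7.3 package layout.

```java
// Hypothetical helper (not part of PDFBox): reports whether a class can be loaded.
class ClasspathCheck {

    // Returns true if the named class is visible to this class's class loader.
    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed class name for PDFBox 0.7.3; adjust if your version differs.
        String pdfBoxClass = "org.pdfbox.pdmodel.PDDocument";
        System.out.println(pdfBoxClass + " available: " + isOnClasspath(pdfBoxClass));
    }
}
```

If this prints false, the jar is most likely missing from the CLASSPATH of the process that runs the check.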

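2. Using the LucenePDFDocument class
When you do not need to control how the Document is built, that is, when the default set of Fields is good enough, this class is the simplest option. The following code uses LucenePDFDocument to turn a PDF file into a Lucene Document.

Document doc = LucenePDFDocument.getDocument(new File("filePath"));
The fields are added as follows: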
addTextField( document, "Author", info.getAuthor() );
addTextField( document, "CreationDate", info.getCreationDate() );
addTextField( document, "Creator", info.getCreator() );
addTextField( document, "Keywords", info.getKeywords() );
addTextField( document, "ModificationDate", info.getModificationDate() );
addTextField( document, "Producer", info.getProducer() );
addTextField( document, "Subject", info.getSubject() );
addTextField( document, "Title", info.getTitle() );
addTextField( document, "Trapped", info.getTrapped() );
The Document is created with the fields named inside the quotation marks above.

3. Using the IndexFiles class
IndexFiles takes a path, parses the PDF files found under that path, and also indexes them.
It is used as follows:

IndexFiles indexFiles = new IndexFiles();
indexFiles.index(new File("File Directory"),true,"indexDirectory");
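
To make the "find the PDF files under a path" step concrete, here is a minimal sketch that uses only the JDK. It is not how IndexFiles is implemented. PdfFileCollector and collectPdfFiles are hypothetical names, and matching on a case-insensitive ".pdf" extension is an assumption made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: collects the PDF files that an indexer would have to visit.
class PdfFileCollector {

    // Walks 'root' recursively and returns every regular file whose name ends in ".pdf".
    static List<Path> collectPdfFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".pdf"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Uses the first argument as the directory, or the current directory by default.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path pdf : collectPdfFiles(root)) {
            System.out.println(pdf);
        }
    }
}
```

Each path returned here could then be passed to LucenePDFDocument.getDocument(...) and added to an index, which is roughly the work IndexFiles takes off your hands.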

4. Extracting and indexing PDF text
The following function takes a PDF file and extracts its Author, Title, Keywords, Subject, and Content.

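import java.io.InputStream;

import org.pdfbox.cos.COSDocument;
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDDocumentInformation;
import org.pdfbox.util.PDFTextStripper;

// Returns {Author, Title, Keywords, Subject, Content}; an entry stays null if it could not be extracted.
public String[] getPDFDocumentString(InputStream is) {
    COSDocument cosDoc = null;
    PDDocument pdDoc = null;
    String[] returnResult = new String[5];
    try {
        // 1. Create the class that parses and processes the PDF document
        PDFParser parser = new PDFParser(is);
        parser.parse();
        // 2. Get the in-memory representation of the PDF document
        cosDoc = parser.getDocument();
        // 3. Check whether the document is encrypted
        if (cosDoc.isEncrypted()) {
            System.out.println("This PDF document is encrypted.");
            // Return the empty result instead of calling System.exit(1),
            // so one encrypted file does not stop the whole program.
            return returnResult;
        }
        // 4. Convert to a PDDocument
        pdDoc = new PDDocument(cosDoc);
        try {
            // 5. Extract the text from the PDF document (using the PDDocument)
            PDFTextStripper stripper = new PDFTextStripper();
            returnResult[4] = stripper.getText(pdDoc);
        } catch (Exception e) {
            // Keep the document open so the metadata below can still be read
            e.printStackTrace();
        }
        try {
            // 6. Get the document information object and read each metadata value
            PDDocumentInformation docInfo = pdDoc.getDocumentInformation();
            returnResult[0] = docInfo.getAuthor();
            System.out.println("Author:" + returnResult[0]);
            returnResult[1] = docInfo.getTitle();
            System.out.println("Title:" + returnResult[1]);
            returnResult[2] = docInfo.getKeywords();
            System.out.println("Keywords:" + returnResult[2]);
            returnResult[3] = docInfo.getSubject();
            System.out.println("Subject:" + returnResult[3]);
            System.out.println("Content:" + returnResult[4]);
        } catch (Exception e) {
            e.printStackTrace();
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Close exactly once, on every path. Closing the PDDocument also closes
        // the COSDocument it wraps, so cosDoc is closed directly only when no
        // PDDocument was created.
        try {
            if (pdDoc != null) {
                pdDoc.close();
            } else if (cosDoc != null) {
                cosDoc.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return returnResult;
}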
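
Because getPDFDocumentString returns a bare String[5], the caller has to remember that index 0 is Author, 1 is Title, 2 is Keywords, 3 is Subject, and 4 is Content. The following sketch shows one possible way to give those positions names before building index fields from them. PdfResultFields and toFieldMap are hypothetical names introduced here, the sample values in main are made up, and skipping null entries is an assumption about what an indexer would want.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: names the positions of the array returned by getPDFDocumentString.
class PdfResultFields {

    // Field names in the same order in which getPDFDocumentString fills the array.
    static final String[] FIELD_NAMES = {"Author", "Title", "Keywords", "Subject", "Content"};

    // Converts the positional result into an ordered name-to-value map, skipping null values.
    static Map<String, String> toFieldMap(String[] result) {
        if (result == null || result.length != FIELD_NAMES.length) {
            throw new IllegalArgumentException("expected an array of length " + FIELD_NAMES.length);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < FIELD_NAMES.length; i++) {
            if (result[i] != null) {
                fields.put(FIELD_NAMES[i], result[i]);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for a real extraction result.
        String[] sample = {"Some Author", "Sample title", null, null, "Body text"};
        for (Map.Entry<String, String> field : toFieldMap(sample).entrySet()) {
            System.out.println(field.getKey() + ": " + field.getValue());
        }
    }
}
```

Each map entry can then be added to a Lucene Document as a field of the same name, which keeps the field names consistent with the ones LucenePDFDocument uses.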
