Panda Project

  Hwang, Inan (2006-04-02 07:50:31, Hit: 11304, Vote: 162)
 [Biz Material] Scraping Programming - Delphi

Simple HTML page scraping with Delphi | Related Info  2005/07/01 14:03  

This article will show you the techniques needed to download an HTML page from the Internet, do some page scraping (using pattern matching) and present the information in a more *situation-friendly* manner.  

As you may already know, About Delphi Programming covers all aspects of Delphi/Kylix development, including articles, chat, a forum, an RTL reference, a glossary, free VCL code and much more.
This site is as dynamic as it can be. New features added to the site on a day-by-day basis range from tutorials for Delphi beginners to more advanced articles, including code for faster, better, more robust development.

NOTE: We've recently changed the way new postings are presented on this site. The "What's New and Hot" section has been transformed into "In the Spotlight" blog entries. Even though the ideas in this article are still valid, if you want to grab the *Current Headlines* from the About Delphi Programming site, please visit the "About Delphi Programming *Current Headlines* sticker" page!

There's one section of this site that brings you the latest news: a compilation of hot, new and updated materials on the About Delphi Programming site. This page is located at "What's New and Hot". Every time you visit the site, I strongly suggest you open that page and see whether any new items have been added.

To enable webmasters of Delphi-related sites to add this valuable information to their pages, I've developed the "About Delphi Programming *NEW and HOT* sticker". It's a JavaScript file that webmasters need to include in their web pages to give their visitors the ultimate source of professional Delphi/Kylix programming content, updated frequently.

This JavaScript file (.js) is created by a utility developed in Delphi. The application downloads the What's New and Hot page from this site, then uses some regular expressions to extract the relevant data from the page and finally creates a .js file using the extracted data.

The idea of this article is to show you the techniques used to download a page from the Internet, do some page scraping and finally present the information in a more "situation-friendly" manner.

The key to the data extraction methods described in this article is converting the existing HTML document into a more "situation-friendly" source. These are the steps we'll be discussing:

Retrieval of HTML source documents
Processing the HTML document, removing the unneeded data
Transforming the result to string type variables
Displaying the information extracted in a ListView
Note: The Sticker JavaScript file described above uses the same techniques we are about to discuss in this article - it is just enriched with HTML tags, because the JavaScript document.write method adds well-formatted HTML code to an existing document. What's important is that the core of the Sticker .js is the data I'll show you how to extract.

Preparing the Delphi Project
To follow along with the article, I suggest you start Delphi and create a new project with one blank form. On this form, place one TButton (Standard palette) and one TListView (Win32 palette) component. Leave the default component names as Delphi suggests: Button1 for the button and ListView1 for the list view component. You'll use Button1 to get the file from the Internet, do the information retrieval and show the result in ListView1. Also, make sure to add 4 columns to ListView1: Title, URL, Description, When/Where. The ViewStyle of ListView1 should be set to vsReport.

Retrieval of HTML source documents
Before we start extracting data from an HTML file, we need to make sure we have one locally.
Your first task is to create a Delphi function used to download a file from the Internet. One way of achieving this is to use WinInet API calls. Delphi gives us full access to the WinInet API (wininet.pas), which we can use to connect to and retrieve files from any Web site that uses either Hypertext Transfer Protocol (HTTP) or File Transfer Protocol (FTP). I've already written an article that describes this technique: Get File From the Net.
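As a rough illustration of the WinInet approach (a minimal sketch of my own, not the code from the "Get File From the Net" article; the function name, agent string and buffer size are assumptions), such a download function could look like this:

function DownloadViaWinInet(const sURL, sLocalFileName: string): Boolean;
// requires: uses WinInet, Classes;
// Downloads sURL into sLocalFileName. A minimal sketch: no proxy
// handling, timeouts or detailed error reporting.
var
  hInet, hURL: HINTERNET;
  Buffer: array[0..1023] of Byte;
  BytesRead: DWORD;
  OutFile: TFileStream;
begin
  Result := False;
  hInet := InternetOpen('DelphiScraper', INTERNET_OPEN_TYPE_PRECONFIG, nil, nil, 0);
  if hInet = nil then Exit;
  try
    hURL := InternetOpenUrl(hInet, PChar(sURL), nil, 0, INTERNET_FLAG_RELOAD, 0);
    if hURL = nil then Exit;
    try
      OutFile := TFileStream.Create(sLocalFileName, fmCreate);
      try
        repeat
          // read the next chunk from the connection and append it to the file
          if not InternetReadFile(hURL, @Buffer, SizeOf(Buffer), BytesRead) then
            Exit;
          OutFile.WriteBuffer(Buffer, BytesRead);
        until BytesRead = 0;
        Result := True;
      finally
        OutFile.Free;
      end;
    finally
      InternetCloseHandle(hURL);
    end;
  finally
    InternetCloseHandle(hInet);
  end;
end;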

Another approach, if you have Delphi 6, is to use the TDownloadURL object. The TDownloadURL object, defined in ExtActns.pas unit, is designed for saving the contents of a specified URL to a file. Here's the code that uses the TDownloadURL to download the "What's New and Hot" page from this site.

function Download_HTM(const sURL, sLocalFileName: string): Boolean;
// requires: uses ExtActns;
begin
  with TDownLoadURL.Create(nil) do
  try
    URL := sURL;
    Filename := sLocalFileName;
    // ExecuteTarget raises an exception on failure, so you may want
    // to wrap this call in a try..except block
    ExecuteTarget(nil);
    Result := FileExists(sLocalFileName);
  finally
    Free;
  end;
end;

This function, Download_HTM, downloads a file from the URL specified in the sURL parameter and saves it locally under the name given in sLocalFileName. The function returns True if it succeeds, False otherwise. Of course, this function is to be called from the Button1 OnClick event handler. You can see the code below. Note that, locally, the file is saved as c:\temp_adp.newandhot.

procedure TForm1.Button1Click(Sender: TObject);
begin
  // TmpFileName holds the local file name, 'c:\temp_adp.newandhot'
  if NOT Download_HTM(ADPNEWHOTURL, TmpFileName) then
  begin
    ShowMessage('Error in HTML file download');
    Exit;
  end;

  // ... more code to be added (extraction and display, discussed below)
end;


Note: In the process of downloading a file, the TDownloadURL periodically generates an OnDownloadProgress event, so that you can provide users with feedback about the process. I'll leave this for you to implement.

Now that we have the HTM page locally on disk, we can process it using the standard Object Pascal techniques for handling ASCII files.

Processing the HTML document
The next step is to locate the interesting data inside the HTML document and extract it. Since the HTML document is just a pure ASCII file, you can use the set of routines designed to work with text files in Delphi.
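For example, the downloaded page can be loaded into a single string variable with a TStringList (a minimal sketch of my own; TmpFileName is the local file name constant used in the Button1Click handler above):

var
  Source: TStringList;
  sHTML: string;
begin
  Source := TStringList.Create;
  try
    // load the locally saved HTML page into one string variable
    Source.LoadFromFile(TmpFileName);
    sHTML := Source.Text;
  finally
    Source.Free;
  end;
  // sHTML now holds the entire HTML document, ready for pattern matching
end;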

Before we move on
It's important to keep in mind that the techniques described in this article are somewhat dated compared with "newer" intelligent information retrieval techniques, such as converting HTML to XML using XSLT. If you do not know what I'm talking about, don't worry.
To successfully extract data from a web page, some kind of pattern matching is required. In particular, this means that you will be able to do page scraping if and only if you know the structure of the HTML document. This is not a big problem if you are the one who creates the web page. Even if you are not the person behind a web page, you can still use pattern matching, but you must be sure to check your code occasionally, since HTML content is dynamic and a document's structure can change very often due to various banner-ad systems and dynamic server-side scripting engines.

In situations where pattern matching does not give results, you can turn to more intelligent solutions, like transforming the HTML document to XML - a standard for marking up structured documents; however, this is not something we are going to discuss here.

If you open the downloaded file with Notepad, you should notice that the information we want to extract is enclosed in a recognizable pair of HTML tags. After you extract that part, you have to make sure any server- or client-side scripting is excluded - such text usually appears between <script> and </script> tags. What remains is HTML code, with 10 items formatted like:

/library/weekly/aa061802a.htm">A Beginner's Guide to Delphi Programming: Chapter 5

06/18 in BEGINNERS COURSE. Take a closer look at exactly what each keyword means by examining each line of the Delphi form unit source code. Interface, implementation, uses and other keywords explained in easy language!

Now, this "item" holds 4 pieces of interesting information. In the code, the item is held in the ItemBuf string variable. The URL of this particular news item is the /library/weekly/aa061802a.htm part; the title is "A Beginner's Guide to Delphi Programming: Chapter 5"; the description is the longer paragraph; and the date and location is "06/18 in BEGINNERS COURSE".

To get the particular element of information, you can use the following code:

//find the title - the actual start/end markers (sStartTag, sEndTag)
//depend on the page's HTML structure
iStart := Pos(sStartTag, ItemBuf) + Length(sStartTag);
iStop  := Pos(sEndTag, ItemBuf);
sTitle := Copy(ItemBuf, iStart, iStop - iStart);
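The same Pos/Copy idea can be wrapped in a small helper (an illustrative function of my own, not part of the original project; the start and end markers you pass in depend on the page's HTML structure):

function ExtractBetween(const Source, StartTag, EndTag: string): string;
// Returns the text between the first occurrence of StartTag and the
// first occurrence of EndTag that follows it; '' if either is missing.
var
  iStart, iStop: Integer;
begin
  Result := '';
  iStart := Pos(StartTag, Source);
  if iStart = 0 then Exit;
  iStart := iStart + Length(StartTag);
  // search for the end tag only in the part after the start tag
  iStop := Pos(EndTag, Copy(Source, iStart, Length(Source)));
  if iStop = 0 then Exit;
  Result := Copy(Source, iStart, iStop - 1);
end;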

Finally, you transform each item into 4 string-type variables and display the information in the ListView.
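Displaying one extracted item in the four-column ListView1 prepared earlier then comes down to something like this (a sketch; the variable names sTitle, sURL, sDescription and sWhenWhere are illustrative):

// add one news item to the report-style list view
with ListView1.Items.Add do
begin
  Caption := sTitle;            // first column: Title
  SubItems.Add(sURL);           // second column: URL
  SubItems.Add(sDescription);   // third column: Description
  SubItems.Add(sWhenWhere);     // fourth column: When/Where
end;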

I'm not going to bother you with the project details here; be sure to see the entire code - you'll have plenty to play with.

This is the project at run-time:

If you have any questions or comments to this article, please post them on the Delphi Programming Forum.

Push Technology | Related Info  2005/05/23 14:08  

Web information retrieval and storage using scraping is, as already noted, a technique for collecting and storing information available on the Internet.

But can information have real value as information just because we have the technology to collect and store it?
No, it cannot.

First, the information must be processed.

Only by analyzing and classifying the collected information can we deliver accurate information to where it is wanted.

The method of analyzing and classifying information inevitably has to be configured "differently case by case", depending on the nature of the information and the preferences of its users.

Furthermore, how can the vast amount of information scraped from the web be delivered to its consumers?

The answer is "Push Technology", which can be briefly summarized as follows.

Push technology is a technique that semi-forcibly transmits information to its consumers, who then view it at whatever time they need it. By using technical means that go beyond ordinary web browsing, it can implement various services such as the following.


1. Turning Internet sites into broadcast channels

Internet users will no longer have to visit websites one by one to gather information, as they do now, because they will be able to receive headlines of major news, stock quotes and so on through their computer's screen saver.

2. News mail

An individual cannot search through the vast amount of news produced every day. So a service that automatically collects news/blog content and delivers it to information consumers is also possible.

3. Others

Automatic monitoring of transaction changes on a specific account, a service that automatically displays news on a screen saver, and so on.

For reference, there is no fixed standard for obtaining information through such push technology. Accordingly, there have been discussions, led by Internet and software companies, about standardizing how information is accessed using push technology.

However, since the commercial interests of the various companies are complicated, standardization still seems a long way off. I can't help wondering when the day will come when a true IT world arrives in which ordinary people can use such good technology and services.

Perhaps it will materialize within 3 years?


Event Analysis Technology | Scraping Techniques  2005/05/23 13:53  

01. Definition

   : A technique for extracting or controlling data transmission/reception and events at the operating-system level

02. Core technologies

   : Data Extract Technology
   : Window Event Capturing (capturing Windows events)
   : Protocol Data Capturing (capturing communication data)

03. Application areas

   : Window control and program control

Screen Recognition Technology | Scraping Techniques  2005/05/23 13:48  

01. Definition

   : A technique for extracting, entering or executing data displayed in an application program

02. Core technologies

   : Screen Position Capturing (extracting screen positions)
   : Data Extract Technology
   : Data entry through the application program

03. Application areas

   : Real-estate listing registration (reduces the hassle of entering duplicate data)
   : Pharmacy prescription registration (scanned data)
   : Data entry for bulk document digitization

Scraping Technology | Scraping Techniques  2005/05/23 13:46  

01. Definition

   : A technique for extracting the needed data from the data displayed on an Internet screen

02. Core technologies of web scraping

   : Collecting the data
   : Converting it into a fixed format

03. Classification by architecture

   : Server-dependent
   : Client-dependent
   : Hybrid

04. Application areas

   : Integrated account management (consolidated management of asset information such as inquiries and transfers at each financial institution)
   : Integrated e-mail viewing (checking several web-mail accounts at once)
   : Use of reward programs such as hotel, airline, rental-car and gas-station mileage
   : Tracking auction progress in e-commerce
   : Tracking delivery information in logistics services


Copyright 1999-2023 Zeroboard / skin by zero
  copyright ⓒ 2005 ZIP365.COM All rights reserved