Panda Project


  Hwang, Inan (2006-04-02 07:50:31)
 [Biz Resources] Scraping Programming - Delphi

http://blog.naver.com/sommer04.do

Simple HTML page scraping with Delphi | Related information  2005/07/01 14:03
--------------------------------------

http://blog.naver.com/sommer04/60014537285

This article will show you the techniques needed to download an HTML page from the Internet, do some page scraping (using regular expressions for pattern matching) and present the information in a more *situation-friendly* manner.

As you already know, About Delphi Programming covers all aspects of Delphi/Kylix development, including articles, chat, forum, RTL reference, glossary, free code VCL and much more.
This site is as dynamic as it can be. New features are added to the site day by day, ranging from tutorials for Delphi beginners to more advanced articles including code for faster, better, more robust development.

--------------------------------------------------------------------------------
NOTE: We've recently changed the way new postings are presented on this site. The "What's new and Hot" section has been transformed to "In the Spotlight" BLOG entries. Even though the ideas in this article are still "valid", if you want to grab the *Current Headlines* from the About Delphi Programming site, please visit the "About Delphi Programming *Current Headlines* sticker" page!
--------------------------------------------------------------------------------

There's one section of this site that brings you the latest news: a compilation of hot, new and updated materials on the About Delphi Programming site. This page is located at "What's New and Hot". Every time you visit the site, I strongly suggest that you open that page and see whether new items have been added.

To enable web masters of Delphi programming related sites to add this valuable information to their pages, I've developed the "About Delphi Programming *NEW and HOT* sticker". It's a JavaScript file that web masters include in their web pages to give their viewers the ultimate source of professional Delphi / Kylix programming content, updated frequently.


This JavaScript file (.js) is created by a utility developed in Delphi. This application downloads the What's New and Hot page from this site, then uses some regular expressions to extract the correct data from the page and finally creates a .js file using the extracted data.

The idea of this article is to show you the techniques used to download a page from the Internet, do some page scraping and finally present the information in a more "situation-friendly" manner.

The key to the data extraction methods described in this article is to convert the existing HTML document to a more "situation-friendly" source. These are the steps we'll be discussing:

1. Retrieval of HTML source documents
2. Processing the HTML document, removing the unneeded data
3. Transforming the result to string type variables
4. Displaying the information extracted in a ListView

Note: The Sticker JavaScript file described above uses the same techniques we are about to discuss in this article - it is just enriched with HTML tags, because the JavaScript document.write method adds well-formatted HTML code to an existing document. What's important is that the core of the Sticker .js is the data I'll show you how to extract.
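
To make the note above a little more concrete, here is a minimal sketch of how one extracted item could be turned into a document.write line for a .js file. The function names JsEscape and StickerLine and the HTML layout of the output are assumptions made for this illustration; the markup the real Sticker emits is not shown in this article.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF} // only needed when compiling with Free Pascal
uses
  SysUtils;

// Escape backslashes and single quotes so the text can sit inside
// a single-quoted JavaScript string literal.
function JsEscape(const S: string): string;
begin
  Result := StringReplace(S, '\', '\\', [rfReplaceAll]);
  Result := StringReplace(Result, '''', '\''', [rfReplaceAll]);
end;

// Build one document.write(...) line for a single news item.
// The <a ...>...</a><br> layout is hypothetical.
function StickerLine(const sTitle, sURL: string): string;
begin
  Result := 'document.write(''<a href="' + JsEscape(sURL) + '">' +
            JsEscape(sTitle) + '</a><br>'');';
end;
```

Writing one such line per item to a text file would give a script that a web page can include. Note that this sketch escapes only for the JavaScript string; it does not HTML-encode the title or the URL.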

Preparing the Delphi Project
To keep up with the article, I suggest you start Delphi and create a new project with one blank form. On this form place one TButton (Standard palette) and one TListView (Win32 palette) component. Leave the default component names as Delphi suggests: Button1 for the button and ListView1 for the list view. You'll use Button1 to get the file from the Internet, do the information retrieval and show the result in ListView1. Also, make sure to add 4 columns to ListView1: Title, URL, Description, When/Where. The ViewStyle of ListView1 should be set to vsReport.

Retrieval of HTML source documents
Before we start extracting data from an HTML file, we need to make sure we have one locally.
Your first task is to create a Delphi function used to download a file from the Internet. One way of achieving this task is to use the WinInet API calls. Delphi gives us full access to the WinInet API (wininet.pas) which we can use to connect to and retrieve files from any Web site that uses either Hypertext Transfer Protocol (HTTP) or File Transfer Protocol (FTP). I've already written an article that describes this technique: Get File From the Net.

Another approach, if you have Delphi 6, is to use the TDownloadURL object. The TDownloadURL object, defined in ExtActns.pas unit, is designed for saving the contents of a specified URL to a file. Here's the code that uses the TDownloadURL to download the "What's New and Hot" page from this site.

function Download_HTM(const sURL, sLocalFileName:string): boolean;
begin
  Result:=True;
  with TDownLoadURL.Create(nil) do
  try
    URL:=sURL;
    Filename:=sLocalFileName;
    try
      ExecuteTarget(nil);
    except
      Result:=False
    end;
  finally
    Free;
  end;
end;


This function, Download_HTM, downloads a file from the URL specified in the sURL parameter and saves it locally under the name given in sLocalFileName. The function returns True if it succeeds, False otherwise. Of course, this function is to be called from the Button1 OnClick event handler. You can see the code below. Note that, locally, the file is saved as c:\temp_adp.newandhot.

procedure TForm1.Button1Click(Sender: TObject);
const
  ADPNEWHOTURL='http://delphi.about.com/cs/newandhot/index.htm';
  TmpFileName='c:\temp_adp.newandhot';
begin
  if NOT Download_HTM(ADPNEWHOTURL,TmpFileName) then
  begin
    ShowMessage('Error in HTML file download');
    Exit;
  end;

  {
  more code to be added
  }

end;


Note: In the process of downloading a file, the TDownloadURL periodically generates an OnDownloadProgress event, so that you can provide users with feedback about the process. I'll leave this for you to implement.

Now that we have the HTML page locally on disk, we can use the techniques for handling ASCII files from Object Pascal code.

Processing the HTML document
The next step is to locate the interesting data inside the HTML document and extract it. Since the HTML document is just a plain ASCII file, you can use the set of routines designed to work with text files in Delphi.
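
One simple way to get at the text, sketched here under my own assumptions rather than taken from the project's source, is to read the whole file into a single string so that Pos and Copy can be applied to it. The function name LoadFileToString is made up for this example.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF} // only needed when compiling with Free Pascal
uses
  SysUtils, Classes;

// Read the whole file into one string.
// Returns an empty string if the file does not exist.
// Note: TStringList.Text normalizes line breaks and ends with one.
function LoadFileToString(const sFileName: string): string;
var
  Lines: TStringList;
begin
  Result := '';
  if not FileExists(sFileName) then Exit;
  Lines := TStringList.Create;
  try
    Lines.LoadFromFile(sFileName);
    Result := Lines.Text;
  finally
    Lines.Free;
  end;
end;
```

With the file name used in this article, a call could look like HTMLBuf := LoadFileToString('c:\temp_adp.newandhot'); where HTMLBuf is a hypothetical string variable.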

Before we move on
It's important to keep in mind that the techniques described in this article are somewhat superseded by "new" intelligent information retrieval techniques, like converting HTML to XML and using XSLT. If you do not know what I'm talking about, don't worry.
To be able to successfully extract the data from a web page, the use of some kind of regular expressions for pattern matching is required. This in particular means that you will be able to do page scraping if and only if you know the structure of the HTML document. This is not a big problem if you are the one who creates the web page. Even if you are not the person behind a web page, you can use pattern matching, but you must be sure to check your code occasionally, since HTML is dynamic content and a document's structure can change very often due to the various banner ad systems and dynamic server-side scripting engines.

In situations when pattern matching is not giving results you can turn to more intelligent solutions like transforming the HTML document to XML - a standard for marking up structured documents; however this is not something we are to discuss here.

If you open the downloaded file in Notepad, you should notice that the information we want to extract sits between a particular pair of opening and closing tags. After you extract that part, you have to make sure any server-side or client-side scripting is excluded; such text usually appears between the script tags. What remains is HTML code, with 10 items formatted like:
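
As a rough sketch of the "exclude the scripting" step, the function below deletes every block that starts with one marker and ends with another. The name RemoveBlocks and the markers '<script' and '</script>' used in the usage note are my assumptions, not taken from the project's code. The match is case-sensitive.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF} // only needed when compiling with Free Pascal
uses
  SysUtils;

// Remove every block that begins with sOpen and ends with sClose
// (both markers included). Unbalanced markup is left untouched.
function RemoveBlocks(const S, sOpen, sClose: string): string;
var
  iStart, iStop: Integer;
  sTail: string;
begin
  Result := S;
  iStart := Pos(sOpen, Result);
  while iStart > 0 do
  begin
    // look for the closing marker only after the opening one
    sTail := Copy(Result, iStart + Length(sOpen), MaxInt);
    iStop := Pos(sClose, sTail);
    if iStop = 0 then Break; // no closing marker: stop here
    Delete(Result, iStart, Length(sOpen) + (iStop - 1) + Length(sClose));
    iStart := Pos(sOpen, Result);
  end;
end;
```

For example, RemoveBlocks(HTMLBuf, '<script', '</script>') would drop script blocks whether or not the opening tag carries attributes; HTMLBuf is a hypothetical variable holding the page text.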

/library/weekly/aa061802a.htm">A Beginner's Guide to Delphi Programming: Chapter 5

06/18 in BEGINNERS COURSE. Take a closer look at exactly what each keyword means by examining each line of the Delphi form unit source code. Interface, implementation, uses and other keywords explained in easy language!

Now, this "item" holds 4 pieces of interesting information: the URL of the particular news item (/library/weekly/aa061802a.htm), its title (A Beginner's Guide to Delphi Programming: Chapter 5), the description, and the date and location (06/18 in BEGINNERS COURSE). In the code, this item is held in the ItemBuf string variable.

To get the particular element of information, you can use the following code:

//find the title
iStart:=Pos('',ItemBuf) + Length('');
iStop:=Pos('',ItemBuf);
sTitle:= Copy(ItemBuf, iStart, iStop-iStart);
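
The same Pos/Copy idea can be wrapped in a small helper. This is only a sketch: the function name ExtractBetween is hypothetical, and the delimiters '">' and '</a>' in the usage note are assumptions based on the sample item shown earlier, not the strings used in the original project. The helper searches for the closing delimiter only after the opening one and returns an empty string when either is missing, which makes it easier to notice when the page structure has changed.

```pascal
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF} // only needed when compiling with Free Pascal
uses
  SysUtils;

// Return the text between the first sStartTag and the next sStopTag
// that follows it, or '' if either delimiter is not found.
function ExtractBetween(const S, sStartTag, sStopTag: string): string;
var
  iStart, iStop: Integer;
  sTail: string;
begin
  Result := '';
  iStart := Pos(sStartTag, S);
  if iStart = 0 then Exit;
  // everything after the opening delimiter
  sTail := Copy(S, iStart + Length(sStartTag), MaxInt);
  iStop := Pos(sStopTag, sTail);
  if iStop = 0 then Exit;
  Result := Copy(sTail, 1, iStop - 1);
end;
```

Under those assumptions, the title could be read with sTitle := ExtractBetween(ItemBuf, '">', '</a>');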

Finally, you transform each item into 4 string type variables and display the information in the ListView.

I'm not going to bother you with the project details here. Be sure to see the entire code; you'll have plenty to play with.

This is the project at run-time:

If you have any questions or comments to this article, please post them on the Delphi Programming Forum.


Push Technology | Related information  2005/05/23 14:08
----------------------------

http://blog.naver.com/sommer04/60013138258


As already explained, the technology for searching and storing web information through scraping is a technology for collecting and storing information found on the Internet.

But does information acquire real value as information simply because the technology to collect and store it exists?
It does not.

First, the information has to be processed.

Only when the collected information has been analyzed and classified can accurate information be sent to where it is wanted.

The way information is analyzed and classified inevitably has to be designed case by case, according to the nature of the information and the tastes (?) of its users.

In addition, how can the vast amount of information scraped from the web be delivered to the people who consume it?

The answer is 'Push Technology', which can be summarized briefly as follows.


Push technology sends information to its recipients semi-forcibly, and the recipients view it at the time they need it. Using technical means that go beyond ordinary web browsing, it can therefore be used to implement a variety of services such as the following.



For example:

1. Turning Internet sites into broadcast channels

The inconvenience of having to visit websites one by one to gather information, as Internet users do today, will disappear, because the headlines of key information, stock quotes and the like can be received through the computer's screen saver.


2. News mail

An individual cannot search through all of the vast volume of news produced every day. A service that automatically collects news and blog content and delivers it to recipients is therefore also possible.


3. Other

Automatic checks for changes in the transaction history of a particular account, services that automatically show news on the screen saver, and so on.


Note that there is no fixed standard for how information is obtained through push technology. For that reason there is discussion, led by Internet and software companies, about standardizing the way information is retrieved using push technology.

However, because the interests of the individual companies are complicated, standardization still seems a long way off. I do wonder when the day will come for a true IT world in which ordinary people can use such good technology and services.

Perhaps it will take shape within three years?


Scraping, SEM, Push Technology

Event Analysis Technology | Scraping technology  2005/05/23 13:53
----------------------------

http://blog.naver.com/sommer04/60013137799


01. Definition

   : A technology for extracting or controlling data transmission and reception, events, and the like at the operating system level

02. Core technologies

   : Data Extract Technology (data extraction)
   : Window Event Capturing (capturing window events)
   : Protocol Data Capturing (capturing communication data)

03. Application areas

   : Window control and program control


Screen Recognition Technology | Scraping technology  2005/05/23 13:48
--------------------------------

http://blog.naver.com/sommer04/60013137674


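01. Definition

   : A technology for extracting the data displayed in an application program, or for entering data into it and executing it

02. Core technologies

   : Screen Position Capturing (extracting screen positions)
   : Data Extract Technology (data extraction)
   : Data entry through the application program

03. Application areas

   : Registering real-estate listings (avoiding the hassle of entering the same data repeatedly)
   : Registering pharmacy prescription dispensing (scanned data)
   : Data entry for large-scale document digitization work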


Scraping Technology | Scraping technology  2005/05/23 13:46
---------------------

http://blog.naver.com/sommer04/60013137624


01. Definition

   : A technology for extracting the needed data from among the data displayed on an Internet screen

02. Core technologies of web scraping

   : Collecting the data
   : Converting it to a fixed format

03. Classification by architecture

   : Server-dependent
   : Client-dependent
   : Hybrid

04. Application areas

   : Account aggregation (integrated management of asset information such as balance inquiries and transfers across financial institutions)
   : Unified e-mail checking (checking several web mail accounts in one place)
   : Making use of reward programs such as hotel, airline, rental car and gas station mileage
   : Tracking the progress of auctions in e-commerce
   : Tracking delivery information for logistics services





