Extracting web data from a URL using JSoup

A large amount of data is available on the Web today. This data may be structured, semi-structured, or unstructured, so different techniques are needed to extract it. One of the easiest and handiest ways is to use an external Java library named JSoup. This recipe uses a handful of the methods offered by JSoup to extract web data.

Getting ready

To perform this recipe, we will require the following:

  1. Go to https://jsoup.org/download and download the jsoup-1.9.2.jar file. Add the JAR file to your Eclipse project as an external library.
  2. If you prefer Maven, follow the instructions on the download page to add JSoup to your Eclipse project as a dependency instead.

How to do it...

  1. Create a method named extractDataWithJsoup(String href). The parameter is the URL of the web page, which you supply when you call the method. We will be extracting web data from this URL:
            public void extractDataWithJsoup(String href){  
    
  2. Call the connect() method with the URL that we want to connect to (and extract data from), and then chain a few more methods to it. First, chain the timeout() method, which takes the timeout in milliseconds as its parameter. The next two methods set the user-agent name for this connection and specify whether HTTP error responses (such as 404 or 500) are ignored instead of raising an exception. The last method in the chain is get(), which returns a Document object. Therefore, we will hold this returned object in doc, a variable of the Document class:
            doc = Jsoup.connect(href)
                       .timeout(10*1000)
                       .userAgent("Mozilla")
                       .ignoreHttpErrors(true)
                       .get();
  3. As this code throws IOException, we will be using a try...catch block as follows:
            Document doc = null; 
            try { 
                doc = Jsoup.connect(href)
                           .timeout(10*1000)
                           .userAgent("Mozilla")
                           .ignoreHttpErrors(true)
                           .get(); 
            } catch (IOException e) { 
                //Your exception handling here 
            } 
    
    Tip

    We are not used to reading times in milliseconds. Therefore, it is good practice to write 10*1000 to denote 10 seconds when the time unit is milliseconds. This makes the code more readable.

  4. A large number of methods are available on a Document object. If you want to extract the title of the page, you can use the title() method as follows:
             if(doc != null){ 
              String title = doc.title(); 
    
  5. To extract only the textual part of the web page, we can chain the body() method with the text() method of a Document object, as follows:
            String text = doc.body().text();
    
  6. If you want to extract all the hyperlinks in a web page, you can use the select() method of a Document object with the a[href] parameter. This gives you all the links at once:
            Elements links = doc.select("a[href]"); 
    
  7. Perhaps you want to process the links in a web page individually? That is easy, too. Iterate over the links to process each one in turn:
            for (Element link : links) { 
                String linkHref = link.attr("href"); 
                String linkText = link.text(); 
                String linkOuterHtml = link.outerHtml(); 
                String linkInnerHtml = link.html(); 
                System.out.println(linkHref + "\t" + linkText + "\t" + 
                    linkOuterHtml + "\t" + linkInnerHtml); 
            } 
    
  8. Finally, close the if-condition with a brace. Close the method with a brace:
        } 
        }  
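
The extraction steps above can be tried without a network connection. The following is a minimal sketch, assuming the JSoup JAR from the Getting ready section is on the classpath. The class name OfflineJsoupSketch, the inline HTML, and the http://example.com/ base URI are made up for illustration and are not part of the recipe. It parses an in-memory string with Jsoup.parse() instead of fetching a page with connect(), and then applies the same title(), body().text(), and select("a[href]") calls. It also uses absUrl("href"), which resolves a relative link against the base URI.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical class name; not part of the recipe.
final class OfflineJsoupSketch {

    // Made-up sample page so that the example needs no network access.
    static final String SAMPLE_HTML =
        "<html><head><title>Sample page</title></head>"
      + "<body><p>Hello, world.</p>"
      + "<a href=\"/about\">About</a> "
      + "<a href=\"http://example.org/docs\">Docs</a>"
      + "</body></html>";

    // Parses the sample; the second argument is the base URI
    // (an assumed value) used to resolve relative links.
    static Document parseSample() {
        return Jsoup.parse(SAMPLE_HTML, "http://example.com/");
    }

    // Collects the target of every hyperlink as an absolute URL.
    static List<String> absoluteLinks(Document doc) {
        List<String> urls = new ArrayList<>();
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        Document doc = parseSample();
        System.out.println("Title: " + doc.title());
        System.out.println("Text:  " + doc.body().text());
        for (String url : absoluteLinks(doc)) {
            System.out.println("Link:  " + url);
        }
    }
}
```

Keeping the parsing and link collection in small static methods makes it easy to swap parseSample() for a Document obtained from connect(...).get() once the selectors behave as expected.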

The complete method, its class, and the driver method are as follows:

import java.io.IOException; 
import org.jsoup.Jsoup; 
import org.jsoup.nodes.Document; 
import org.jsoup.nodes.Element; 
import org.jsoup.select.Elements; 
 
public class JsoupTesting { 
   public static void main(String[] args){ 
      JsoupTesting test = new JsoupTesting(); 
      test.extractDataWithJsoup("Website address preceded by http://"); 
   } 
 
   public void extractDataWithJsoup(String href){ 
      Document doc = null; 
      try { 
         doc = Jsoup.connect(href)
                    .timeout(10*1000)
                    .userAgent("Mozilla")
                    .ignoreHttpErrors(true)
                    .get(); 
      } catch (IOException e) { 
         //Your exception handling here 
      } 
      if(doc != null){ 
         String title = doc.title(); 
         String text = doc.body().text(); 
         Elements links = doc.select("a[href]"); 
         for (Element link : links) { 
            String linkHref = link.attr("href"); 
            String linkText = link.text(); 
            String linkOuterHtml = link.outerHtml(); 
            String linkInnerHtml = link.html(); 
            System.out.println(linkHref + "\t" + linkText + "\t" + 
                linkOuterHtml + "\t" + linkInnerHtml); 
         } 
      } 
   } 
} 
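
The 10*1000 idiom passed to timeout() can also be written with the JDK's java.util.concurrent.TimeUnit, which names the unit explicitly. The sketch below is one possible alternative, not something the recipe itself uses. The class name Timeouts and the helper secondsToMillis are hypothetical. Because timeout() in JSoup takes an int, the helper converts the long returned by TimeUnit with Math.toIntExact, which throws an ArithmeticException on overflow instead of silently truncating.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper class; not part of the recipe.
final class Timeouts {

    // Converts whole seconds to milliseconds as an int,
    // failing loudly if the result does not fit in an int.
    static int secondsToMillis(long seconds) {
        return Math.toIntExact(TimeUnit.SECONDS.toMillis(seconds));
    }

    public static void main(String[] args) {
        int timeoutMillis = secondsToMillis(10);
        // Same value as the 10*1000 literal used in the recipe.
        System.out.println(timeoutMillis == 10 * 1000); // prints true
        // The value could then be passed to JSoup, for example:
        // Jsoup.connect(href).timeout(timeoutMillis)
    }
}
```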