
(JAVA) JSOUP - Java HTML Parser

by gomdeng 2024. 11. 25.

🐶 JSOUP

1. An open-source Java library for parsing HTML documents and extracting the data they contain.
2. Mainly used for crawling static pages.

 

🐶 Crawling

Crawling here means fetching an HTML page and extracting the data you need from it.

1. Dynamic pages cannot be crawled with jsoup alone: it fetches the HTML source as served, before any JavaScript has run, so content rendered in the browser is missing.
2. One way to crawl dynamic pages is Selenium.
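
To make the static/dynamic distinction concrete, here is a minimal sketch that parses an HTML string offline. The markup, the class name `StaticVsDynamicDemo`, and its helper methods are hypothetical and exist only for illustration. It assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticVsDynamicDemo {

    // Hypothetical page: in a real browser the <ul> would be filled in by the script.
    public static final String HTML =
            "<html><body>"
          + "<h1 id=\"title\">Headlines</h1>"
          + "<ul id=\"news\"></ul>"
          + "<script>document.getElementById('news').innerHTML = '<li>Loaded by JS</li>';</script>"
          + "</body></html>";

    // Text of the heading that is present in the static HTML.
    public static String staticTitle(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("#title").text();
    }

    // Number of <li> elements jsoup can see inside #news.
    // jsoup does not execute the script, so the list stays empty.
    public static int visibleItems(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("#news > li").size();
    }

    public static void main(String[] args) {
        System.out.println("title: " + staticTitle(HTML));
        System.out.println("items seen by jsoup: " + visibleItems(HTML));
    }
}
```

The heading is found because it is part of the served HTML, while the list item exists only as text inside the `<script>` element, so the selector matches nothing.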

 

🐶 Usage example

1. Add the library: https://mvnrepository.com/artifact/org.jsoup/jsoup

2. Add the dependency with a build tool such as Maven or Gradle, or download the jar and add it to the classpath.

 

🐶 Source

import java.io.IOException;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTest {
    public static void main(String[] args) {

        // [Test URL] : SBS News
        String URL = "https://news.sbs.co.kr/news/newsflash.do?plink=SNB&cooper";

        // [Document] : holds the HTML fetched by Jsoup
        Document doc;

        try {
            // Fetch the full HTML document at the URL
            doc = Jsoup.connect(URL).get();
        } catch (IOException e) {
            // Without a document there is nothing to parse, so stop here
            // instead of hitting a NullPointerException further down
            e.printStackTrace();
            return;
        }
        
        // Output variables
        String title;
        String content;
        String date;
        String writer;

        // [Element]  : an HTML element of the Document
        // [Elements] : a collection of Element objects
        // Extracting values: use CSS selector syntax to find tags and read their data.
        List<Element> elements = doc.select(".w_news_list ul > li");

        // Print the results
        for (Element element : elements) {
            title   = "^title : "   + element.getElementsByClass("sub").text();
            content = "^content : " + element.getElementsByClass("read").text();
            date    = "^date : "    + element.getElementsByClass("date").text();
            writer  = "^writer : "  + element.getElementsByClass("name").text();

            System.out.print(title + "\n" + content + "\n" + date + "\n" + writer + "\n");
            System.out.println("-------------------------------------------------");
        }
    }
}
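
The selector logic above can also be tried without network access by parsing an HTML string. The sketch below is only an illustration: the markup is a hypothetical stand-in that reuses the class names from the code (`w_news_list`, `sub`, `date`, `name`), and the real page may be structured differently or change over time. The class `JsoupSelectorSketch` and its `extract` method are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectorSketch {

    // Hypothetical markup that only imitates the class names used above.
    public static final String SAMPLE_HTML =
            "<div class=\"w_news_list\"><ul>"
          + "<li><strong class=\"sub\">First headline</strong>"
          + "<span class=\"read\">First summary</span>"
          + "<span class=\"date\">2024.11.25</span>"
          + "<span class=\"name\">Reporter A</span></li>"
          + "<li><strong class=\"sub\">Second headline</strong>"
          + "<span class=\"read\">Second summary</span>"
          + "<span class=\"date\">2024.11.24</span>"
          + "<span class=\"name\">Reporter B</span></li>"
          + "</ul></div>";

    // Returns one "title | date | writer" line per list item.
    public static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        // Same CSS selector as in the main example: each <li> directly under the <ul>
        for (Element item : doc.select(".w_news_list ul > li")) {
            String title = item.getElementsByClass("sub").text();
            String date = item.getElementsByClass("date").text();
            String writer = item.getElementsByClass("name").text();
            rows.add(title + " | " + date + " | " + writer);
        }
        return rows;
    }

    public static void main(String[] args) {
        extract(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Keeping the parsing in a method that takes the HTML as a string makes it easy to check selectors against a saved copy of a page before pointing `Jsoup.connect` at the live site.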

 

🐶 Result

 

🐶 References

Java HTML parser, getting the values you want with Jsoup - Basics
partnerjun.tistory.com

[Java] - Crawling with Jsoup (feat. Inflearn)
zzang9ha.tistory.com