Java Jsoup Crawler Basics (1)

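Here's a simple tutorial on scraping page data. I don't know how to scrape things like accounts and passwords, and I wouldn't dare learn, haha. I do hope everyone respects network security rules; scraping public page data is generally fine. One small request, though: please don't crawl my site. It gets little traffic and runs on a low-spec server, so I'm afraid it would crash under a crawler. Practice on other blogs instead, haha.

OK, enough talk, let's get started.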

pom.xml dependencies

<dependencies>

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.0.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpcore</artifactId>
        <version>4.0.1</version>
    </dependency>

    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpmime</artifactId>
        <version>4.0.1</version>
    </dependency>

    <dependency>
        <groupId>commons-codec</groupId>
        <artifactId>commons-codec</artifactId>
        <version>1.3</version>
    </dependency>

    <dependency>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
        <version>1.1.1</version>
    </dependency>

    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>1.4</version>
    </dependency>

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.11.3</version>
    </dependency>

    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.9</version>
    </dependency>

    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>

</dependencies>

Demo

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;

public class Demo {

    // Crawler basics
    @Test
    public void text01() throws Exception {
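        // Target site to crawl
        String targetURL = "http://dt2008.cn";
        // Create the connection
        Connection connection = Jsoup.connect(targetURL);
        // Forge the request headers: replace each empty string with the value copied from the browser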
        connection.header("Accept", "");
        connection.header("Accept-Encoding", "");
        connection.header("Accept-Language", "");
        connection.header("Cache-Control", "");
        connection.header("Connection", "");
        connection.header("Cookie", "");
        connection.header("Host", "");
        connection.header("User-Agent", "");
        connection.ignoreHttpErrors(true);
        // Execute the request
        Connection.Response execute = connection.method(Connection.Method.GET).execute(); // GET, like a normal page load in the browser
        // Get the response body
        String body = execute.body();
        System.out.println(body);

    }
}
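
The Demo imports Document, Element and Elements but only prints the raw body. As a minimal sketch of a possible next step, the block below parses an HTML string with Jsoup and lists its links. The class name LinkExtractor, the method extractLinks, the sample HTML and the example.com base URL are all assumptions made for illustration; they are not part of the original Demo.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original Demo
class LinkExtractor {

    // Returns one "link text -> absolute URL" entry for every <a href> in the HTML
    static List<String> extractLinks(String html, String baseUri) {
        // The base URI lets Jsoup resolve relative links such as "/post/1"
        Document document = Jsoup.parse(html, baseUri);
        Elements anchors = document.select("a[href]");
        List<String> links = new ArrayList<>();
        for (Element anchor : anchors) {
            links.add(anchor.text() + " -> " + anchor.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up HTML standing in for the string returned by execute.body()
        String html = "<html><head><title>Sample</title></head><body>"
                + "<a href=\"/post/1\">First post</a>"
                + "<a href=\"http://example.com/about\">About</a>"
                + "</body></html>";
        for (String link : extractLinks(html, "http://example.com/")) {
            System.out.println(link);
        }
    }
}
```

In the Demo, the body string and targetURL could be passed to extractLinks in place of the sample values.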

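How do you forge the request headers?

Open the browser's developer tools (F12) – Network – select All – find the request whose Cause is document – click it

This panel shows the request headers. Write them into the code one by one as key-value pairs. Note that you don't need to write all of them; just copy the values for the headers the code sets.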
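
Copying the headers one at a time gets tedious, so here is a rough sketch of one way to turn header text pasted from the browser into key-value pairs. It uses only the Java standard library. The class name RawHeaders, the method parse and the sample header values are hypothetical; use whatever your own browser shows.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original Demo
class RawHeaders {

    // Parses "Name: value" lines into a map that keeps the original order
    static Map<String, String> parse(String raw) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : raw.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // skips blank lines and HTTP/2 pseudo-headers such as ":authority"
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (!name.isEmpty() && !value.isEmpty()) {
                headers.put(name, value); // headers with an empty value are left out
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        // Made-up sample values; paste the real ones from the browser instead
        String copied = "Accept: text/html,application/xhtml+xml\n"
                + "Accept-Language: zh-CN,zh;q=0.9\n"
                + "User-Agent: Mozilla/5.0 (example value)\n";
        parse(copied).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Each entry of the returned map could then be passed to connection.header(name, value) in the Demo, replacing the empty strings.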

Result

Running the test prints the HTML source of the target page to the console.

Unless otherwise stated, please credit the source when reposting articles from this site:
东泰博客 » Java Jsoup Crawler Basics (1)
