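phpSimpleHtmlDom scraping library: jQuery-style selectors
Learning on your own is a long and hard road. I sometimes wish someone would point the way, so that time does not slip away for nothing.
Basic code. Fetching the page with cURL is recommended; by attaching POST data you can log in first and then scrape.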
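<?php
require_once('./simple_html_dom.php');
$url='http://www.w3cschool.cc/';
$Curl=curl_init();// create a cURL handle
curl_setopt($Curl, CURLOPT_URL, $url);// set the target URL
curl_setopt($Curl, CURLOPT_RETURNTRANSFER, 1);// 1 returns the response as a string; 0 would print it directly
curl_setopt($Curl, CURLOPT_HEADER, 0);// 0 keeps the response headers out of the result; 1 would prepend them to the HTML
$result=curl_exec($Curl);// run the cURL session
curl_close($Curl);// close the cURL session
if ($result === false) {
    exit('Request failed');
}
$html = str_get_html($result);// build the DOM (returns false for empty or oversized input)
if (!$html) {
    exit('Could not parse the page');
}
foreach($html->find('#leftcolumn a') as $element) {
    echo $element->href . '<br>';// the link URL
    echo $element->plaintext . '<br>';// the link text
}
$html->clear();// release the DOM to free memory
unset($html);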
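The remark above about attaching POST data to log in before scraping could look roughly like the sketch below. It is only an illustration: the helper names `build_login_options` and `login_and_fetch`, the example.com URLs and the form field names are all assumptions, and a real site may also require hidden form tokens or extra headers.

```php
<?php
// Hypothetical helper: cURL options for submitting a login form by POST.
function build_login_options(string $loginUrl, array $fields): array
{
    return [
        CURLOPT_URL            => $loginUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // URL-encoded form body
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEFILE     => '', // empty string enables in-memory cookie handling
    ];
}

// Hypothetical helper: log in, then fetch a page on the same handle so that
// the session cookies set by the login response are sent with the second request.
function login_and_fetch(string $loginUrl, array $fields, string $pageUrl)
{
    $curl = curl_init();
    curl_setopt_array($curl, build_login_options($loginUrl, $fields));
    if (curl_exec($curl) === false) {
        curl_close($curl);
        return false;
    }
    curl_setopt($curl, CURLOPT_HTTPGET, true); // switch back from POST to GET
    curl_setopt($curl, CURLOPT_URL, $pageUrl);
    $page = curl_exec($curl);
    curl_close($curl);
    return $page; // HTML string, or false on failure
}

// Assumed field names; inspect the real login form to find the right ones.
$loginOptions = build_login_options(
    'http://example.com/login',
    ['username' => 'demo', 'password' => 'secret']
);
```

The string returned by `login_and_fetch()` can then be passed to `str_get_html()` exactly like `$result` in the basic example.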
Chinese-language manual (author: S.C. Chen): http://www.ecartchina.com/php-simple-html-dom/index.htm
Taobao scraping test
<?php
require_once('simple_html_dom.php');
set_time_limit(0);// no execution time limit ("time_limit" is not a valid ini setting)
ini_set("memory_limit","512M");
$memory=memory_get_usage();
echo 'memory:'.($memory/1024).'KB<br/>';
echo 'time:'.date('H:i:s',time()).'<br/>';
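function curl_get_content($url){
    $Curl=curl_init();// create a cURL handle
    curl_setopt($Curl, CURLOPT_URL, $url);// set the target URL
    curl_setopt($Curl, CURLOPT_RETURNTRANSFER, 1);// 1 returns the response as a string; 0 would print it directly
    curl_setopt($Curl, CURLOPT_HEADER, 0);// 0 keeps the response headers out of the result
    $result=curl_exec($Curl);// run the cURL session
    curl_close($Curl);// close the cURL session
    return $result;// page content, or false on failure
}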
$cateUrl='http://the-seventh-sense.taobao.com/';
$cateCon=curl_get_content($cateUrl);
$cateHtml = str_get_html($cateCon);// build the DOM
if (!$cateHtml) {
    exit('Could not load the category page');
}
$CateList=array();
// top-level category links whose href contains "category"
foreach($cateHtml->find('.J_TAllCatsTree li .fst-cat-hd a[href*=category]') as $element) {
    $CateList[]=array(
        'url'=>urldecode($element->href),// category URL
        'name'=>$element->plaintext,// category name
    );
}
$cateHtml->clear();
unset($cateHtml);
$i=0;
$goodsList=array();// initialize so count() works even when nothing is found
foreach ($CateList as $goodsUrl) {
    $goodsCon=curl_get_content($goodsUrl['url']);
    $goodsHtml = str_get_html($goodsCon);// build the DOM
    if (!$goodsHtml) {
        continue;// skip category pages that could not be fetched or parsed
    }
    $goodsBlock=$goodsHtml->find('.shop-hesper-bd .item');
    foreach($goodsBlock as $goodsElement ) {
        $goodsList[$i]['name']=$goodsElement->find(".detail .item-name",0)->plaintext;
        $goodsList[$i]['price']=$goodsElement->find(".detail .c-price",0)->plaintext;
        $goodsList[$i]['img']=$goodsElement->find(".photo a img",0)->src;
        $goodsList[$i]['catename']=$goodsUrl['name'];
        $i++;
    }
    $goodsHtml->clear();
    unset($goodsHtml);
}
echo '<hr/>';
$n1=count($CateList);
$n2=count($goodsList);
echo 'Scraped '.$n1.' categories and '.$n2.' products<br/>';
$memory=memory_get_usage();
echo 'memory:'.($memory/1024).'KB<br/>';
echo 'time:'.date('H:i:s',time()).'<br/>';
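One fragile spot in the product loop is that `find($selector, 0)` returns `null` when nothing matches, so reading `->plaintext` or `->src` on the result fails as soon as the shop changes its markup. A minimal sketch of a guard follows; `node_text` and `node_attr` are hypothetical helpers written for this post, not part of simple_html_dom.

```php
<?php
// Hypothetical helper: text of a node, or a default when the node is missing.
function node_text($node, string $default = ''): string
{
    return $node === null ? $default : trim((string) $node->plaintext);
}

// Hypothetical helper: attribute of a node, or a default when the node
// or the attribute is missing.
function node_attr($node, string $name, string $default = ''): string
{
    if ($node === null) {
        return $default;
    }
    $value = $node->$name ?? false; // ?? avoids a warning on a missing property
    return $value === false ? $default : (string) $value;
}
```

Inside the loop, a line such as `$goodsList[$i]['price']=node_text($goodsElement->find(".detail .c-price",0), 'n/a');` would then record `n/a` instead of stopping the whole run.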
beginmemory:971.953125KB
begintime:05:30:19
overmemory:1352.890625KB
overtime:05:30:39
Elapsed time: 20 s. Successfully scraped 9 categories and 127 products.
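The 20 s figure above comes from subtracting two `date('H:i:s')` stamps, which only resolve to whole seconds. As a rough alternative, the sketch below wraps a task and reports elapsed time with `microtime(true)` together with memory figures; `measure` is a hypothetical helper, and the sleeping closure merely stands in for the scraping run.

```php
<?php
// Hypothetical helper: run a task and report how long it took and how much memory it used.
function measure(callable $task): array
{
    $startTime = microtime(true);      // float seconds with microsecond precision
    $startMemory = memory_get_usage();
    $result = $task();
    return [
        'result'    => $result,
        'seconds'   => microtime(true) - $startTime,
        'memory_kb' => (memory_get_usage() - $startMemory) / 1024, // growth during the task
        'peak_kb'   => memory_get_peak_usage() / 1024,             // peak for the whole script
    ];
}

// Stand-in task: sleep 50 ms and return some data.
$stats = measure(function () {
    usleep(50000);
    return range(1, 1000);
});
printf("time: %.3f s, peak memory: %.1f KB<br/>", $stats['seconds'], $stats['peak_kb']);
```

Peak memory is usually the more useful number for a scraper, because the DOM objects are released with `clear()` before the final `memory_get_usage()` call.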
BY 悠悠山雨