2016-03-29 · 22 views

Scraping Google Analytics with Scrapy

I have been trying to use Scrapy to get data out of Google Analytics, and despite being a complete Python beginner I have made some progress. I can now log in to Google Analytics through Scrapy, but I need to make an AJAX request to get the data I want. I tried replicating my browser's HTTP request headers with the code below, but it doesn't seem to work; my error log says

too many values to unpack

Can anyone help? I have been working on this for two days; I feel I'm very close, but I'm also very confused.

Here is the code:

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 
from scrapy.http import FormRequest, Request 
from scrapy.selector import Selector 
import logging 
from super.items import SuperItem 
from scrapy.shell import inspect_response 
import json 

class LoginSpider(BaseSpider): 
    name = 'super' 
    start_urls = ['https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier'] 

    def parse(self, response): 
        return [FormRequest.from_response(response, 
                                          formdata={'Email': 'Email'}, 
                                          callback=self.log_password)] 

    def log_password(self, response): 
        return [FormRequest.from_response(response, 
                                          formdata={'Passwd': 'Password'}, 
                                          callback=self.after_login)] 

    def after_login(self, response): 
        if "authentication failed" in response.body: 
            self.log("Login failed", level=logging.ERROR) 
            return 
        else: 
            # We've successfully authenticated, let's have some fun! 
            print("Login Successful!!") 
            return Request(url="https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0", 
                           method='POST', 
                           headers=[{'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8', 
                                     'Galaxy-Ajax': 'true', 
                                     'Origin': 'https://analytics.google.com', 
                                     'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1', 
                                     'User-Agent': 'My-user-agent', 
                                     'X-GAFE4-XSRF-TOKEN': 'Mytoken'}], 
                           callback=self.parse_tastypage, dont_filter=True) 


    def parse_tastypage(self, response): 
        jsonResponse = json.loads(response.body) 

        inspect_response(response, self) 
        item = SuperItem() 
        yield item 

And here is part of the log:

2016-03-28 19:11:39 [scrapy] INFO: Enabled item pipelines: 
[] 
2016-03-28 19:11:39 [scrapy] INFO: Spider opened 
2016-03-28 19:11:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-03-28 19:11:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-03-28 19:11:40 [scrapy] DEBUG: Crawled (200) <GET https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr#identifier> (referer: None) 
2016-03-28 19:11:46 [scrapy] DEBUG: Crawled (200) <POST https://accounts.google.com/AccountLoginInfo> (referer: https://accounts.google.com/ServiceLogin?service=analytics&passive=true&nui=1&hl=fr&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&followup=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr) 
2016-03-28 19:11:50 [scrapy] DEBUG: Redirecting (302) to <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> from <POST https://accounts.google.com/ServiceLoginAuth> 
2016-03-28 19:11:57 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> 
2016-03-28 19:12:01 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo) 
Login Successful!! 
2016-03-28 19:12:01 [scrapy] ERROR: Spider error processing <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo) 
Traceback (most recent call last): 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/twisted/internet/defer.py", line 577, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
    File "/Users/aminbouraiss/super/super/spiders/mySuper.py", line 42, in after_login 
    callback=self.parse_tastypage, dont_filter=True) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/request/__init__.py", line 35, in __init__ 
    self.headers = Headers(headers or {}, encoding=encoding) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/http/headers.py", line 12, in __init__ 
    super(Headers, self).__init__(seq) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 193, in __init__ 
    self.update(seq) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 229, in update 
    super(CaselessDict, self).update(iseq) 
    File "/Library/Python/2.7/site-packages/Scrapy-1.1.0rc3-py2.7.egg/scrapy/utils/datatypes.py", line 228, in <genexpr> 
    iseq = ((self.normkey(k), self.normvalue(v)) for k, v in seq) 
ValueError: too many values to unpack 
2016-03-28 19:12:01 [scrapy] INFO: Closing spider (finished) 
2016-03-28 19:12:01 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 6419, 
'downloader/request_count': 5, 
'downloader/request_method_count/GET': 3, 
'downloader/request_method_count/POST': 2, 
'downloader/response_bytes': 75986, 
'downloader/response_count': 5, 
'downloader/response_status_count/200': 3, 
'downloader/response_status_count/302': 2, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 3, 28, 23, 12, 1, 824033), 
'log_count/DEBUG': 6, 
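The `ValueError` in the traceback above can be reproduced without Scrapy. `Headers` (a `CaselessDict`) iterates its argument as `(key, value)` pairs: a plain dict yields pairs, but a list wrapping the dict yields the dict itself, and tuple-unpacking a dict with more than two keys fails. The helper below is a simplified stand-in for Scrapy's internal update logic, not its actual code:

```python
def build_headers(seq):
    # Simplified sketch of scrapy.utils.datatypes.CaselessDict.update():
    # the input is consumed as an iterable of (key, value) pairs.
    if isinstance(seq, dict):
        seq = seq.items()
    return {k.lower(): v for k, v in seq}

hdrs = {'Galaxy-Ajax': 'true',
        'Origin': 'https://analytics.google.com',
        'X-GAFE4-XSRF-TOKEN': 'Mytoken'}

print(build_headers(hdrs))   # a plain dict works

try:
    build_headers([hdrs])    # a list wrapping the dict does not
except ValueError as err:
    print(err)               # too many values to unpack
```

Iterating `[hdrs]` yields `hdrs` itself as the first item, and unpacking `k, v` from a three-key dict iterates its keys and raises the exact error from the log.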
no, use the API functions –

I'm trying to get some data that I can't get through the API –

Answer


Your error happens because headers must be a dict, not a list wrapped around a dict:

headers={'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8', 
         'Galaxy-Ajax': 'true', 
         'Origin': 'https://analytics.google.com', 
         'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1', 
         'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36', 
         }, 

That will fix the current error, but you will then get a 411 status, because you also need to specify the content length. If you add what you want to extract, I can show you how. You can see the result below:

2016-03-29 14:02:11 [scrapy] DEBUG: Redirecting (302) to <GET https://www.google.com/analytics/web/?hl=fr> from <GET https://accounts.google.com/CheckCookie?hl=fr&checkedDomains=youtube&pstMsg=0&chtml=LoginDoneHtml&service=analytics&continue=https%3A%2F%2Fwww.google.com%2Fanalytics%2Fweb%2F%3Fhl%3Dfr&gidl=CAA> 
2016-03-29 14:02:13 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/analytics/web/?hl=fr> (referer: https://accounts.google.com/AccountLoginInfo) 
Login Successful!! 
2016-03-29 14:02:14 [scrapy] DEBUG: Crawled (411) <POST https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0> (referer: https://analytics.google.com/analytics/web/?hl=fr&pli=1) 
2016-03-29 14:02:14 [scrapy] DEBUG: Ignoring response <411 https://analytics.google.com/analytics/web/getPage?id=trafficsources-all-traffic&ds=a5425w87291514p94531107&hl=fr&authuser=0>: HTTP status code is not handled or not allowed 
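Beyond the dict fix, the 411 ("Length Required") can be avoided by giving the POST an explicit body so Scrapy can compute `Content-Length` itself. The helper below is a hypothetical sketch, not part of the answer: it only assembles the keyword arguments (URL and token values copied from the question; `body` is an empty placeholder, where the real endpoint presumably expects a form payload):

```python
def build_getpage_request_kwargs(xsrf_token, body=''):
    """Return keyword arguments for scrapy.http.Request (sketch)."""
    return {
        'url': ('https://analytics.google.com/analytics/web/getPage'
                '?id=trafficsources-all-traffic'
                '&ds=a5425w87291514p94531107&hl=fr&authuser=0'),
        'method': 'POST',
        # headers as a plain dict -- not a list -- per the answer above
        'headers': {
            'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
            'Galaxy-Ajax': 'true',
            'Origin': 'https://analytics.google.com',
            'Referer': 'https://analytics.google.com/analytics/web/?hl=fr&pli=1',
            'X-GAFE4-XSRF-TOKEN': xsrf_token,
        },
        # An explicit body (even an empty string) lets Scrapy set
        # Content-Length, avoiding the 411 response seen in the log.
        'body': body,
        'dont_filter': True,
    }

kwargs = build_getpage_request_kwargs('Mytoken')
print(kwargs['method'])
```

Inside `after_login` these could then be passed along as `Request(callback=self.parse_tastypage, **kwargs)`.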
Thanks Padraic, I owe you a beer! I changed the HTTP request headers and it finally worked. –

@gerardbaste, no problem, glad you got it sorted, happy parsing. –