Conversation
```python
rsp = session.get(url, allow_redirects=follow_redirects)

if rsp.status_code // 100 == 3:
    continue
```
I think we should still yield a response, even if it is a redirect. For instance, a user of the library might want to know about all requests that are redirected. We should also extract any URLs from the Location header of a 3xx response.
Please could you:
- remove these two lines, so that we continue to yield all responses;
- add a check in this section so that if the response is a 3xx, we try to extract a URL from the Location header.
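For illustration, something along these lines (just a sketch: the loop, queue, and function name here are made up for the example, not the library's actual code):

```python
from urllib.parse import urljoin

import requests

def crawl(start_url, follow_redirects=False):
    """Sketch: yield every response, including redirects, and queue Location targets."""
    session = requests.Session()
    to_visit = [start_url]
    seen = set()

    while to_visit:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)

        rsp = session.get(url, allow_redirects=follow_redirects)
        yield rsp  # yield redirect responses too, not just final ones

        if rsp.status_code // 100 == 3:
            # Try to extract the redirect target from the Location header
            location = rsp.headers.get("Location")
            if location:
                to_visit.append(urljoin(url, location))
```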
Does that make sense?
Once you've done that, you should also add yourself to AUTHORS.txt
> add a check in this section so that if the response is a 3xx, we try to extract a URL from the Location header.
@inglesp By doing this, wouldn't we essentially be following redirects? I am confused.
If we are proceeding in this manner, I think it would be a better approach to also scrape the body of the 3xx HTTP response, in addition to adding the URL from the Location header.
Thinking about this a bit more, follow_redirects is not the right name for the new argument, since setting it to False suggests that 3xx responses shouldn't be followed at all.
Can you think of a better name for it?
@inglesp How about pause_at_redirects or scrap_redirects?
Any thoughts on this?
> I think it would be a better approach to also scrape the body of the 3xx HTTP response, in addition to adding the URL from the Location header.
Hi @rkrp. I'm so sorry it's taken me so long to reply to you.
What about auto_follow_redirects?
And yes, we should also scrape the body of the 3xx response.
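Putting both together, something like this (again just a sketch; parse_links and the fetch wrapper are placeholders I've made up here, not the real implementation):

```python
import re
from urllib.parse import urljoin

import requests

def parse_links(html, base_url):
    """Crude placeholder for the library's real link extraction."""
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def fetch(session, url, auto_follow_redirects=False):
    """Sketch: fetch one URL, returning the response plus any URLs to visit next."""
    rsp = session.get(url, allow_redirects=auto_follow_redirects)

    # Scrape the body of every response, including 3xx responses
    new_urls = parse_links(rsp.text, base_url=url)

    if not auto_follow_redirects and rsp.status_code // 100 == 3:
        # Also queue the redirect target from the Location header, if present
        location = rsp.headers.get("Location")
        if location:
            new_urls.append(urljoin(url, location))

    return rsp, new_urls
```

The caller would then yield rsp as usual and add new_urls to its queue; when auto_follow_redirects is True, requests follows the redirect chain itself and only the final response is seen.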
Added a feature to choose whether or not to follow redirects.