<!DOCTYPE HTML>
<HTML>
<head>
Notes on Python (D-Lab Workshops/Seminars)
</head>

<p>
There are two ways of getting data:
1) Web scraping--> Every scraping job you do is inefficient because it relies on downloading a lot of data. People are generally encouraged not to do this, especially if there is another way.
2) API (Application Programming Interface)--> You get structured data back, as JSON. Please check whether there is an API first.
--> API data access might be limited.
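
A minimal sketch of what "structured data back" means in practice. The payload, the field names ("items", "title") and the helper names are made up for illustration; a real API documents its own shape. The download helper uses urllib from the standard library; requests.get(url).json() would do the same job.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and decode its body as JSON (assumes the API returns JSON)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_titles(payload):
    """Pull the 'title' field out of each item in the hypothetical payload."""
    return [item["title"] for item in payload["items"]]


# Hypothetical response body, standing in for what an API might send back.
sample_body = '{"items": [{"id": 1, "title": "First"}, {"id": 2, "title": "Second"}]}'
sample_payload = json.loads(sample_body)

# No HTML parsing is needed: the fields are already named and nested.
titles = extract_titles(sample_payload)
print(titles)
```

The point of the sketch: with an API the data arrives already labeled, so extraction is a dictionary lookup rather than a search through markup.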

Rule of Thumb:
If there is no API, you have to scrape, and then scraping etiquette applies. APIs furnish structured data.

--Check to see if the site lets you scrape (for example, in its robots.txt or terms of use).
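
One way to check permission from code is the robots.txt parser in the standard library. This is only a sketch: the robots.txt content, the bot name and the URLs are hypothetical, and a site's terms of use may add rules that robots.txt does not express.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at the /robots.txt path of the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


allowed_public = is_allowed(ROBOTS_TXT, "my-notes-bot", "https://example.com/public/page.html")
allowed_private = is_allowed(ROBOTS_TXT, "my-notes-bot", "https://example.com/private/page.html")
print(allowed_public, allowed_private)
```

Against a live site, RobotFileParser can also download the file itself: call set_url() with the robots.txt address, then read(), then can_fetch().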

When web scraping, most of the time we can find what we want through the HTML; sometimes we can find it through the CSS.

CSS makes it easier to find the key elements that we want.
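
A small sketch of the difference between looking things up by HTML tag and by CSS selector, using BeautifulSoup. The snippet and its class and id names are invented for the example, and the built-in "html.parser" is used here only so that no extra parser package is required.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the class and id names are made up.
html = (
    '<span class="title">Intro to scraping</span>'
    '<span class="title">Using an API</span>'
    '<a id="next" href="/page/2">Next</a>'
)

# "html.parser" ships with Python, so nothing beyond bs4 is needed here.
soup = BeautifulSoup(html, "html.parser")

# Plain HTML lookup: every span tag, whatever its class.
all_spans = soup.find_all("span")

# CSS selectors: by class with ".title", by id with "#next".
titles = [tag.get_text() for tag in soup.select("span.title")]
next_href = soup.select_one("a#next")["href"]
print(titles, next_href)
```

On real pages the class or id is usually what separates the element you want from dozens of others with the same tag name.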
</p>

<p>
Jupyter Notebook

Markdown and HTML are integrated with the code, so things look a lot more presentable... Things can look nice when you program.
Specific cells can be run on their own--> You can save stages within your kernel. Better than writing...

You need the requests library and bs4 (BeautifulSoup)--> Import both.

JS Problem--> Some sites do not populate elements until things happen (that is, until JavaScript runs), so the HTML you download may not contain them.

BeautifulSoup(src, 'lxml'), where src is the HTML source of the page and 'lxml' is the parser.
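
A rough sketch of how the download and the parse step might fit together. The helper names (fetch_source, make_soup, link_targets) are hypothetical, and the demonstration runs on a made-up snippet with the built-in parser so that it works offline; 'lxml' as the parser requires the lxml package to be installed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_source(url):
    """Download a page and return its HTML text (raises on HTTP errors)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def make_soup(src, parser="lxml"):
    """Parse HTML source; the 'lxml' default needs the lxml package."""
    return BeautifulSoup(src, parser)


def link_targets(soup):
    """Collect the href of every link that has one."""
    return [a["href"] for a in soup.find_all("a", href=True)]


# Offline demonstration with a made-up snippet and the built-in parser.
src = '<a href="/one">One</a><a href="/two">Two</a><a>No target</a>'
soup = make_soup(src, parser="html.parser")
links = link_targets(soup)
print(links)
```

With a real page the flow would be src = fetch_source(url) followed by make_soup(src).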

DO NOT FORGET time.sleep(x), where x is a positive number of seconds. DO NOT FORGET TO SLEEP: given Python's speed, constantly pulling data could get you blocked by the server.
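
A minimal sketch of pausing between requests. The function name fetch_politely and the stand-in fake_fetch are hypothetical; in real use the fetch argument would be whatever function downloads a page, and the delay would be closer to a second or more than the short value used here to keep the demonstration quick.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request after the first
        results.append(fetch(url))
    return results


def fake_fetch(url):
    """Stand-in for a real download function, so the sketch runs offline."""
    return "contents of " + url


start = time.monotonic()
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.05,
)
elapsed = time.monotonic() - start
print(len(pages), round(elapsed, 2))
```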

Please take a timestamped snapshot--> If you do scraping, write code that saves the raw result to a text file so that you have all the original data... --> You never want to scrape the same thing twice or multiple times.
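
One possible way to keep the original data, sketched with the standard library only. The function names, the file-name pattern and the UTC timestamp format are assumptions made for the example; the demonstration writes into a temporary directory, whereas a real job would use a folder that is kept.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_snapshot(raw_text, directory, name):
    """Write raw scraped text to a file whose name carries a UTC timestamp."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(directory) / f"{name}_{stamp}.txt"
    path.write_text(raw_text, encoding="utf-8")
    return path


def load_snapshot(path):
    """Read a saved snapshot back, instead of scraping the page again."""
    return Path(path).read_text(encoding="utf-8")


# Demonstration in a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_snapshot("raw page text", tmp, "example_page")
    saved_name = saved.name
    restored = load_snapshot(saved)
print(saved_name, restored)
```

Later analysis can then start from load_snapshot() and never touch the server again.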
</p>

</HTML>