<!DOCTYPE HTML>
<HTML>
<head>
Notes on Python (D-Lab Workshops/Seminars)
</head>
<p>
There are two ways of getting data:
1) Web scraping: every scraping job you do is inefficient because it relies on downloading a lot of data. We generally encourage people not to do this, especially if there is another way.
2) API (Application Programming Interface): you get structured data back, typically as JSON. Please see if there's an API first.
   Note: API data access might be limited (for example, by rate limits).
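To illustrate what "structured data" means here, the sketch below parses a JSON payload of the kind an API might return. The payload and its field names are invented for illustration and do not come from any real API.

```python
import json

# Hypothetical API response body; the field names are made up for this example.
payload = '{"results": [{"title": "First post", "likes": 3}, {"title": "Second post", "likes": 7}]}'

# JSON text becomes ordinary Python dicts and lists, no HTML parsing needed.
data = json.loads(payload)

titles = [item["title"] for item in data["results"]]
total_likes = sum(item["likes"] for item in data["results"])

print(titles)
print(total_likes)
```

Because the fields are already named, there is nothing to hunt for in the page markup, which is the main reason to prefer an API when one exists.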
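Rule of thumb:
If there is no API, you have to scrape. APIs furnish structured data; scraped pages have to be picked apart yourself.
Scraping etiquette: check whether the site lets you scrape (for example, in its robots.txt file or terms of service) before you start.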
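One way to "check to see if they let you" is to read the site's robots.txt. The sketch below uses the standard-library urllib.robotparser on a made-up robots.txt, so it runs without a network call. The rules, the "my-scraper" user agent and the example.com URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at the site root,
# e.g. https://example.com/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether our (hypothetical) user agent may fetch each URL.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/articles/1")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

# Some sites also state how long to wait between requests, in seconds.
delay = parser.crawl_delay("my-scraper")

print(allowed_public, allowed_private, delay)
```

Against a live site you would call parser.set_url(...) and parser.read() instead of parse(). robots.txt is only one signal; the site's terms of service may add further restrictions.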
When web scraping, most of the time we can find what we want through the HTML structure; sometimes we can find it through CSS.
CSS classes and ids make it easier to find the key elements that we want.
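As a small sketch of both approaches, the example below uses BeautifulSoup (bs4) on an inline HTML snippet: once searching by tag, once by CSS selector. The markup and class names are hypothetical, and "html.parser" is used only because it ships with Python; 'lxml' should behave the same here if it is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, inlined so the example needs no network access.
src = """
<div class="post"><h2 class="title">First post</h2><a href="/posts/1">Read</a></div>
<div class="post"><h2 class="title">Second post</h2><a href="/posts/2">Read</a></div>
<div class="ad"><h2 class="title">Buy now</h2></div>
"""

soup = BeautifulSoup(src, "html.parser")

# Searching by HTML tag: every <a> element in the document.
links = [a["href"] for a in soup.find_all("a")]

# Searching by CSS selector: only titles inside elements with class "post",
# which skips the similarly tagged title inside the "ad" block.
titles = [h2.get_text(strip=True) for h2 in soup.select("div.post h2.title")]

print(links)
print(titles)
```

The selector version is shorter and more precise when the page uses meaningful class names, which is the case the note about CSS has in mind.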
</p>
<p>
Jupyter Notebook
Markdown and HTML are integrated with code, so things look a lot more presentable. Things can look nice when you program.
Specific cells can be run on their own, and you can save stages within your kernel. Better than writing
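Need the requests library and bs4: import both.
JS problem: some sites do not populate elements until things happen (for example, until JavaScript runs in the browser), so the HTML you download may not contain them.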
BeautifulSoup(src, 'lxml'), src exq
DO NOT FORGET time.sleep(x), where x is a positive number of seconds. DO NOT FORGET TO SLEEP: given Python's speed, constantly pulling data could get you blocked by the server.
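A minimal sketch of a polite download loop follows. The function name fetch_all, its parameters and the stand-in fetcher are assumptions made for this example; in real use, fetch would be something like a function wrapping requests.get(url).text.

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request after the first
        pages[url] = fetch(url)
    return pages

# Stand-in fetcher so the example runs offline (hypothetical).
def fake_fetch(url):
    return "page for " + url

pages = fetch_all(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```

Passing sleep in as a parameter is just a convenience for testing; the default is the real time.sleep.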
Please take a timestamped snapshot: if you do scraping, write code that saves what you download to a text file so that you have all the original data. You never want to scrape things twice or multiple times.
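One possible way to do this is sketched below: save each downloaded page to disk under a timestamped file name, and reuse the saved copy on later runs instead of hitting the server again. The function name fetch_once, the raw_pages folder and the file-naming scheme are assumptions for illustration, not a prescribed layout.

```python
import hashlib
import time
from pathlib import Path

def fetch_once(url, fetch, cache_dir="raw_pages"):
    """Return the page for url, downloading it only if no saved copy exists."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)

    # Short, filesystem-safe key derived from the URL.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]

    # Reuse the most recent saved copy if there is one.
    saved = sorted(folder.glob(f"{key}_*.txt"))
    if saved:
        return saved[-1].read_text(encoding="utf-8")

    # Otherwise download once and keep the raw text with a timestamp.
    text = fetch(url)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    (folder / f"{key}_{stamp}.txt").write_text(text, encoding="utf-8")
    return text
```

Here fetch is any function that takes a URL and returns the page text. Parsing can then be rerun as often as needed against the saved files without touching the site again.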
</p>
</HTML>