Web Crawler
What is a Web Crawler?
It is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).
It copies pages for processing by a search engine, which indexes the downloaded pages so that users can search more efficiently.
There are three types of crawling covered here (a minimal sketch of each pattern follows this list):

- JSON: requests; the response is handled as a dictionary
- XML: requests; uses a dictionary & BeautifulSoup
- HTML: requests or selenium; uses BeautifulSoup
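As a rough sketch, the three patterns look like this (the URLs are placeholders, not real endpoints, and the XML parser assumes lxml is installed):

import requests
from bs4 import BeautifulSoup

# JSON: the response parses directly into a Python dictionary
json_data = requests.get('https://example.com/api.json').json()

# XML: parse the response text with BeautifulSoup's XML parser
xml_soup = BeautifulSoup(requests.get('https://example.com/feed.xml').text, 'xml')

# HTML: parse the page source with BeautifulSoup's HTML parser
html_soup = BeautifulSoup(requests.get('https://example.com/page.html').text, 'html.parser')

Selenium comes in when a page is rendered by JavaScript, so requests alone cannot see the final HTML.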
I will go over these three types of crawling using several example sources.
1. JSON
- Crawling the information from the "Sunrise API"
- Using the latitudes and longitudes of selected cities
1) Google "Sunrise API"
I will apply the latitude and longitude of Seoul in this case.
2) Find the Appropriate URL on the Site
3) Use the requests Module to Get the Information from the URL
# latitude of Seoul -> 37.5, longitude of Seoul -> 127.0
import requests
url = 'https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date=2021-02-08'
# We can customize the URL by changing the numbers following "lat=" and "lng="
data = requests.get(url).json()
data
# The times in this data are given in Coordinated Universal Time (UTC)
{'results': {'sunrise': '10:28:21 PM',
'sunset': '9:03:58 AM',
'solar_noon': '3:46:09 AM',
'day_length': '10:35:37',
'civil_twilight_begin': '10:01:20 PM',
'civil_twilight_end': '9:30:58 AM',
'nautical_twilight_begin': '9:30:26 PM',
'nautical_twilight_end': '10:01:53 AM',
'astronomical_twilight_begin': '8:59:55 PM',
'astronomical_twilight_end': '10:32:24 AM'},
'status': 'OK'}
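The response also carries a 'status' field, so a small guard before using the results can catch failed requests. A minimal sketch (not part of the original code):

response = requests.get(url).json()
if response['status'] == 'OK':
    results = response['results']
else:
    print('Request failed with status:', response['status'])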
# pulling out the nested dictionary stored under the 'results' key
data['results']
{'sunrise': '10:28:21 PM',
'sunset': '9:03:58 AM',
'solar_noon': '3:46:09 AM',
'day_length': '10:35:37',
'civil_twilight_begin': '10:01:20 PM',
'civil_twilight_end': '9:30:58 AM',
'nautical_twilight_begin': '9:30:26 PM',
'nautical_twilight_end': '10:01:53 AM',
'astronomical_twilight_begin': '8:59:55 PM',
'astronomical_twilight_end': '10:32:24 AM'}
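Because the times are in UTC, they can be shifted to Seoul time (UTC+9). A minimal sketch with the standard library; the '%I:%M:%S %p' format matches the strings shown above:

from datetime import datetime, timedelta

# parse the 12-hour UTC string and add 9 hours for KST
sunrise_utc = datetime.strptime(data['results']['sunrise'], '%I:%M:%S %p')
sunrise_kst = sunrise_utc + timedelta(hours=9)
print(sunrise_kst.strftime('%I:%M:%S %p'))  # 10:28:21 PM UTC -> 07:28:21 AM KST (next day)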
4) Select a Date
date = '2020-08-01'
url = 'https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date='
data = requests.get(url + date).json()
data['results']
{'sunrise': '8:36:41 PM',
'sunset': '10:39:55 AM',
'solar_noon': '3:38:18 AM',
'day_length': '14:03:14',
'civil_twilight_begin': '8:07:52 PM',
'civil_twilight_end': '11:08:44 AM',
'nautical_twilight_begin': '7:32:38 PM',
'nautical_twilight_end': '11:43:59 AM',
'astronomical_twilight_begin': '6:54:35 PM',
'astronomical_twilight_end': '12:22:01 PM'}
5) Define a Function
def by_date(date):
    url = 'https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date='
    data = requests.get(url + date).json()
    return data['results']
by_date('2020-02-01')  # example usage of the function
{'sunrise': '10:35:35 PM',
'sunset': '8:55:25 AM',
'solar_noon': '3:45:30 AM',
'day_length': '10:19:50',
'civil_twilight_begin': '10:08:08 PM',
'civil_twilight_end': '9:22:53 AM',
'nautical_twilight_begin': '9:36:51 PM',
'nautical_twilight_end': '9:54:10 AM',
'astronomical_twilight_begin': '9:06:05 PM',
'astronomical_twilight_end': '10:24:55 AM'}
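The same function generalizes to other cities by making the coordinates parameters too. A sketch of that idea (coordinates for other cities would need to be looked up separately):

def by_date_and_location(date, lat, lng):
    url = 'https://api.sunrise-sunset.org/json?lat={}&lng={}&date={}'.format(lat, lng, date)
    return requests.get(url).json()['results']

by_date_and_location('2020-02-01', 37.5, 127.0)  # same result as by_date for Seoul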
6) DataFrame and CSV
# Let's try it for 3 days
import pandas as pd
sample_list = []
sample_list.append(by_date('2020-01-01'))
sample_list.append(by_date('2020-01-02'))
sample_list.append(by_date('2020-01-03'))
df = pd.DataFrame(sample_list)
df.to_csv('sample.csv', index = False)
df
|   | sunrise | sunset | solar_noon | day_length | civil_twilight_begin | civil_twilight_end | nautical_twilight_begin | nautical_twilight_end | astronomical_twilight_begin | astronomical_twilight_end |
|---|---|---|---|---|---|---|---|---|---|---|
0 | 10:46:31 PM | 8:24:16 AM | 3:35:23 AM | 09:37:45 | 10:17:30 PM | 8:53:17 AM | 9:44:49 PM | 9:25:58 AM | 9:13:03 PM | 9:57:44 AM |
1 | 10:46:40 PM | 8:25:03 AM | 3:35:52 AM | 09:38:23 | 10:17:40 PM | 8:54:03 AM | 9:45:01 PM | 9:26:42 AM | 9:13:16 PM | 9:58:27 AM |
2 | 10:46:47 PM | 8:25:52 AM | 3:36:19 AM | 09:39:05 | 10:17:49 PM | 8:54:50 AM | 9:45:11 PM | 9:27:27 AM | 9:13:27 PM | 9:59:11 AM |
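To confirm the file was saved correctly, it can be read back with pandas:

pd.read_csv('sample.csv')  # should show the same three rows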
7) Collect One Month of Data Using a Loop
def by_date_2(date):
    url = 'https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date='
    data = requests.get(url + date).json()['results']
    data['date'] = date  # adding date information
    return data
import time
sample_list_2 = []
for date in pd.date_range('2020-01-01', '2020-01-31'):
    date = str(date)[:10]  # keep only the YYYY-MM-DD part of the timestamp
    sample_list_2.append(by_date_2(date))
    time.sleep(0.2)  # short pause between requests so the API is not hammered
df_new = pd.DataFrame(sample_list_2)
df_new  # this took quite a long time, since each date is a separate request
|   | sunrise | sunset | solar_noon | day_length | civil_twilight_begin | civil_twilight_end | nautical_twilight_begin | nautical_twilight_end | astronomical_twilight_begin | astronomical_twilight_end | date |
|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10:46:31 PM | 8:24:16 AM | 3:35:23 AM | 09:37:45 | 10:17:30 PM | 8:53:17 AM | 9:44:49 PM | 9:25:58 AM | 9:13:03 PM | 9:57:44 AM | 2020-01-01 |
1 | 10:46:40 PM | 8:25:03 AM | 3:35:52 AM | 09:38:23 | 10:17:40 PM | 8:54:03 AM | 9:45:01 PM | 9:26:42 AM | 9:13:16 PM | 9:58:27 AM | 2020-01-02 |
2 | 10:46:47 PM | 8:25:52 AM | 3:36:19 AM | 09:39:05 | 10:17:49 PM | 8:54:50 AM | 9:45:11 PM | 9:27:27 AM | 9:13:27 PM | 9:59:11 AM | 2020-01-03 |
3 | 10:46:52 PM | 8:26:42 AM | 3:36:47 AM | 09:39:50 | 10:17:56 PM | 8:55:38 AM | 9:45:20 PM | 9:28:14 AM | 9:13:37 PM | 9:59:56 AM | 2020-01-04 |
4 | 10:46:54 PM | 8:27:33 AM | 3:37:14 AM | 09:40:39 | 10:18:01 PM | 8:56:27 AM | 9:45:26 PM | 9:29:01 AM | 9:13:45 PM | 10:00:42 AM | 2020-01-05 |
5 | 10:46:55 PM | 8:28:26 AM | 3:37:41 AM | 09:41:31 | 10:18:03 PM | 8:57:18 AM | 9:45:31 PM | 9:29:50 AM | 9:13:52 PM | 10:01:29 AM | 2020-01-06 |
6 | 10:46:54 PM | 8:29:19 AM | 3:38:07 AM | 09:42:25 | 10:18:04 PM | 8:58:09 AM | 9:45:35 PM | 9:30:39 AM | 9:13:57 PM | 10:02:17 AM | 2020-01-07 |
7 | 10:46:50 PM | 8:30:14 AM | 3:38:32 AM | 09:43:24 | 10:18:04 PM | 8:59:01 AM | 9:45:36 PM | 9:31:29 AM | 9:13:59 PM | 10:03:05 AM | 2020-01-08 |
8 | 10:46:45 PM | 8:31:10 AM | 3:38:58 AM | 09:44:25 | 10:18:01 PM | 8:59:54 AM | 9:45:35 PM | 9:32:20 AM | 9:14:01 PM | 10:03:54 AM | 2020-01-09 |
9 | 10:46:37 PM | 8:32:07 AM | 3:39:22 AM | 09:45:30 | 10:17:56 PM | 9:00:48 AM | 9:45:33 PM | 9:33:12 AM | 9:14:00 PM | 10:04:44 AM | 2020-01-10 |
10 | 10:46:28 PM | 8:33:05 AM | 3:39:46 AM | 09:46:37 | 10:17:49 PM | 9:01:43 AM | 9:45:28 PM | 9:34:04 AM | 9:13:57 PM | 10:05:35 AM | 2020-01-11 |
11 | 10:46:16 PM | 8:34:03 AM | 3:40:10 AM | 09:47:47 | 10:17:40 PM | 9:02:39 AM | 9:45:22 PM | 9:34:57 AM | 9:13:53 PM | 10:06:26 AM | 2020-01-12 |
12 | 10:46:03 PM | 8:35:02 AM | 3:40:33 AM | 09:48:59 | 10:17:30 PM | 9:03:36 AM | 9:45:14 PM | 9:35:51 AM | 9:13:47 PM | 10:07:18 AM | 2020-01-13 |
13 | 10:45:47 PM | 8:36:03 AM | 3:40:55 AM | 09:50:16 | 10:17:17 PM | 9:04:33 AM | 9:45:04 PM | 9:36:45 AM | 9:13:39 PM | 10:08:10 AM | 2020-01-14 |
14 | 10:45:29 PM | 8:37:04 AM | 3:41:16 AM | 09:51:35 | 10:17:02 PM | 9:05:30 AM | 9:44:52 PM | 9:37:40 AM | 9:13:30 PM | 10:09:03 AM | 2020-01-15 |
15 | 10:45:10 PM | 8:38:05 AM | 3:41:37 AM | 09:52:55 | 10:16:46 PM | 9:06:29 AM | 9:44:39 PM | 9:38:36 AM | 9:13:18 PM | 10:09:57 AM | 2020-01-16 |
16 | 10:44:48 PM | 8:39:07 AM | 3:41:58 AM | 09:54:19 | 10:16:27 PM | 9:07:28 AM | 9:44:23 PM | 9:39:32 AM | 9:13:05 PM | 10:10:51 AM | 2020-01-17 |
17 | 10:44:24 PM | 8:40:10 AM | 3:42:17 AM | 09:55:46 | 10:16:07 PM | 9:08:27 AM | 9:44:06 PM | 9:40:28 AM | 9:12:49 PM | 10:11:45 AM | 2020-01-18 |
18 | 10:43:58 PM | 8:41:14 AM | 3:42:36 AM | 09:57:16 | 10:15:45 PM | 9:09:27 AM | 9:43:47 PM | 9:41:25 AM | 9:12:32 PM | 10:12:40 AM | 2020-01-19 |
19 | 10:43:31 PM | 8:42:17 AM | 3:42:54 AM | 09:58:46 | 10:15:20 PM | 9:10:28 AM | 9:43:25 PM | 9:42:23 AM | 9:12:13 PM | 10:13:35 AM | 2020-01-20 |
20 | 10:43:01 PM | 8:43:22 AM | 3:43:11 AM | 10:00:21 | 10:14:54 PM | 9:11:28 AM | 9:43:02 PM | 9:43:20 AM | 9:11:53 PM | 10:14:30 AM | 2020-01-21 |
21 | 10:42:30 PM | 8:44:26 AM | 3:43:28 AM | 10:01:56 | 10:14:26 PM | 9:12:29 AM | 9:42:38 PM | 9:44:18 AM | 9:11:30 PM | 10:15:26 AM | 2020-01-22 |
22 | 10:41:56 PM | 8:45:31 AM | 3:43:44 AM | 10:03:35 | 10:13:57 PM | 9:13:31 AM | 9:42:11 PM | 9:45:17 AM | 9:11:06 PM | 10:16:22 AM | 2020-01-23 |
23 | 10:41:21 PM | 8:46:37 AM | 3:43:59 AM | 10:05:16 | 10:13:25 PM | 9:14:33 AM | 9:41:42 PM | 9:46:15 AM | 9:10:39 PM | 10:17:18 AM | 2020-01-24 |
24 | 10:40:44 PM | 8:47:42 AM | 3:44:13 AM | 10:06:58 | 10:12:52 PM | 9:15:35 AM | 9:41:12 PM | 9:47:14 AM | 9:10:11 PM | 10:18:15 AM | 2020-01-25 |
25 | 10:40:05 PM | 8:48:48 AM | 3:44:27 AM | 10:08:43 | 10:12:16 PM | 9:16:37 AM | 9:40:40 PM | 9:48:13 AM | 9:09:42 PM | 10:19:12 AM | 2020-01-26 |
26 | 10:39:25 PM | 8:49:54 AM | 3:44:39 AM | 10:10:29 | 10:11:39 PM | 9:17:39 AM | 9:40:06 PM | 9:49:12 AM | 9:09:10 PM | 10:20:08 AM | 2020-01-27 |
27 | 10:38:42 PM | 8:51:00 AM | 3:44:51 AM | 10:12:18 | 10:11:00 PM | 9:18:42 AM | 9:39:30 PM | 9:50:12 AM | 9:08:37 PM | 10:21:06 AM | 2020-01-28 |
28 | 10:37:58 PM | 8:52:06 AM | 3:45:02 AM | 10:14:08 | 10:10:20 PM | 9:19:44 AM | 9:38:53 PM | 9:51:11 AM | 9:08:01 PM | 10:22:03 AM | 2020-01-29 |
29 | 10:37:12 PM | 8:53:13 AM | 3:45:12 AM | 10:16:01 | 10:09:38 PM | 9:20:47 AM | 9:38:14 PM | 9:52:11 AM | 9:07:24 PM | 10:23:00 AM | 2020-01-30 |
30 | 10:36:24 PM | 8:54:19 AM | 3:45:22 AM | 10:17:55 | 10:08:54 PM | 9:21:50 AM | 9:37:33 PM | 9:53:10 AM | 9:06:46 PM | 10:23:58 AM | 2020-01-31 |
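With a month of rows collected, the string columns can be converted to proper pandas types for analysis. A sketch using the columns of df_new above:

df_new['date'] = pd.to_datetime(df_new['date'])
df_new['day_length'] = pd.to_timedelta(df_new['day_length'])

# day length grows by roughly 40 minutes over January 2020
print(df_new['day_length'].iloc[-1] - df_new['day_length'].iloc[0])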