Web Crawler
What is a Web Crawler?
It is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).
It copies pages for processing by a search engine, which indexes the downloaded pages so that users can search more efficiently.
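There are three types of crawling, distinguished by the format of the response:
JSON - fetched with requests
parsed as a dictionary
XML - fetched with requests
parsed with a dictionary & BeautifulSoup
HTML - fetched with requests or selenium
parsed with BeautifulSoup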
I will go over these three types of crawling using several example sources.
Crawling data from the "Sunrise API"
Using the latitudes and longitudes of selected cities
1) Google "Sunrise API"
I will apply the latitude and longitude of Seoul in this case.
2) Find the Appropriate URL from the site
3) Use the Requests module - Get the information from the URL
# latitude of Seoul -> 37.5, longitude of Seoul -> 127.0
import requests
url = 'https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date=2021-02-08'
# We can customize the URL by changing the numbers following "lat=" and "lng="
data = requests.get(url).json()
# The times in this data are given in Coordinated Universal Time (UTC)
data
{'results': {'sunrise': '10:28:21 PM',
'sunset': '9:03:58 AM',
'solar_noon': '3:46:09 AM',
'day_length': '10:35:37',
'civil_twilight_begin': '10:01:20 PM',
'civil_twilight_end': '9:30:58 AM',
'nautical_twilight_begin': '9:30:26 PM',
'nautical_twilight_end': '10:01:53 AM',
'astronomical_twilight_begin': '8:59:55 PM',
'astronomical_twilight_end': '10:32:24 AM'},
'status': 'OK'}
# Pull out the inner dictionary stored under the 'results' key
data['results']
{'sunrise': '10:28:21 PM',
'sunset': '9:03:58 AM',
'solar_noon': '3:46:09 AM',
'day_length': '10:35:37',
'civil_twilight_begin': '10:01:20 PM',
'civil_twilight_end': '9:30:58 AM',
'nautical_twilight_begin': '9:30:26 PM',
'nautical_twilight_end': '10:01:53 AM',
'astronomical_twilight_begin': '8:59:55 PM',
'astronomical_twilight_end': '10:32:24 AM'}
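Because the times above are in UTC, a sunrise of '10:28:21 PM' looks odd for Seoul until it is shifted to local time. The following is a minimal sketch of that conversion, assuming Seoul uses Korea Standard Time (a fixed UTC+9 offset). The helper name utc_to_kst_time is hypothetical and not part of the API. It converts only the time of day, since the response does not say which calendar day each time belongs to.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Seoul local time is KST, a fixed UTC+9 offset.
KST = timezone(timedelta(hours=9), name="KST")

def utc_to_kst_time(time_str):
    """Convert a 12-hour UTC time string such as '10:28:21 PM' to a KST time of day."""
    # strptime fills in a placeholder date; only the time of day is used afterwards.
    utc_dt = datetime.strptime(time_str, "%I:%M:%S %p").replace(tzinfo=timezone.utc)
    return utc_dt.astimezone(KST).strftime("%H:%M:%S")

print(utc_to_kst_time("10:28:21 PM"))  # 07:28:21
print(utc_to_kst_time("9:03:58 AM"))   # 18:03:58
```

With the shift applied, the sunrise and sunset values read as ordinary morning and evening times in Seoul.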
4) Select a Date
date = '2020-08-01'
url = 'https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date='
data = requests.get(url + date).json()
data['results']
{'sunrise': '8:36:41 PM',
'sunset': '10:39:55 AM',
'solar_noon': '3:38:18 AM',
'day_length': '14:03:14',
'civil_twilight_begin': '8:07:52 PM',
'civil_twilight_end': '11:08:44 AM',
'nautical_twilight_begin': '7:32:38 PM',
'nautical_twilight_end': '11:43:59 AM',
'astronomical_twilight_begin': '6:54:35 PM',
'astronomical_twilight_end': '12:22:01 PM'}
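Joining the URL and the date by hand works, but it gets error-prone once the latitude and longitude also vary. As a rough sketch, the query string can be assembled from a dictionary with the standard library. The names BASE_URL and build_url are hypothetical and chosen only for this example.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.sunrise-sunset.org/json"

def build_url(lat, lng, date):
    """Build the request URL from latitude, longitude, and a 'YYYY-MM-DD' date."""
    query = urlencode({"lat": lat, "lng": lng, "date": date})
    return f"{BASE_URL}?{query}"

print(build_url(37.5, 127.0, "2020-08-01"))
# https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date=2020-08-01
```

The requests module can do the same job directly: requests.get(BASE_URL, params={...}) accepts the dictionary and encodes it into the URL.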
5) Define a Function
def by_date(date):
    url = 'https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date='
    data = requests.get(url + date).json()
    return data['results']
by_date('2020-02-01')  # example usage of the function
{'sunrise': '10:35:35 PM',
'sunset': '8:55:25 AM',
'solar_noon': '3:45:30 AM',
'day_length': '10:19:50',
'civil_twilight_begin': '10:08:08 PM',
'civil_twilight_end': '9:22:53 AM',
'nautical_twilight_begin': '9:36:51 PM',
'nautical_twilight_end': '9:54:10 AM',
'astronomical_twilight_begin': '9:06:05 PM',
'astronomical_twilight_end': '10:24:55 AM'}
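The response also carries a 'status' field ('OK' in the output shown earlier), which by_date ignores. A minimal sketch of checking it before returning the results might look like this. The helper name extract_results is hypothetical, and the 'INVALID_DATE' value in the example is only an illustrative non-OK status.

```python
def extract_results(payload):
    """Return payload['results'], but only when the API reports status 'OK'."""
    status = payload.get("status")
    if status != "OK":
        raise ValueError(f"API returned status {status!r}")
    return payload["results"]

ok_payload = {"results": {"sunrise": "10:35:35 PM"}, "status": "OK"}
print(extract_results(ok_payload))  # {'sunrise': '10:35:35 PM'}

try:
    extract_results({"results": "", "status": "INVALID_DATE"})
except ValueError as err:
    print(err)  # API returned status 'INVALID_DATE'
```

Inside by_date, this would replace the last line with something like return extract_results(data), so a bad date fails loudly instead of ending up as an odd row in the DataFrame.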
6) DataFrame and CSV
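# Let's try it for 3 days (the dates match the rows in the output below)
import pandas as pd
sample_list = []
for date in ['2020-01-01', '2020-01-02', '2020-01-03']:
    sample_list.append(by_date(date))
df = pd.DataFrame(sample_list)
df.to_csv('sample.csv', index=False)
df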
index | sunrise | sunset | solar_noon | day_length | civil_twilight_begin | civil_twilight_end | nautical_twilight_begin | nautical_twilight_end | astronomical_twilight_begin | astronomical_twilight_end |
0 | 10:46:31 PM | 8:24:16 AM | 3:35:23 AM | 09:37:45 | 10:17:30 PM | 8:53:17 AM | 9:44:49 PM | 9:25:58 AM | 9:13:03 PM | 9:57:44 AM |
1 | 10:46:40 PM | 8:25:03 AM | 3:35:52 AM | 09:38:23 | 10:17:40 PM | 8:54:03 AM | 9:45:01 PM | 9:26:42 AM | 9:13:16 PM | 9:58:27 AM |
2 | 10:46:47 PM | 8:25:52 AM | 3:36:19 AM | 09:39:05 | 10:17:49 PM | 8:54:50 AM | 9:45:11 PM | 9:27:27 AM | 9:13:27 PM | 9:59:11 AM |
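Every column in the table above is a string, so values such as day_length cannot be compared or subtracted directly. As a small sketch, the 'HH:MM:SS' strings can be turned into seconds first. The helper name day_length_seconds is hypothetical. The three inputs are the day_length values from the table.

```python
def day_length_seconds(day_length):
    """Convert an 'HH:MM:SS' day length string into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in day_length.split(":"))
    return hours * 3600 + minutes * 60 + seconds

lengths = ["09:37:45", "09:38:23", "09:39:05"]
secs = [day_length_seconds(value) for value in lengths]
# Day-to-day change in seconds between consecutive rows
gains = [later - earlier for earlier, later in zip(secs, secs[1:])]

print(secs)   # [34665, 34703, 34745]
print(gains)  # [38, 42]
```

In early January the days in Seoul get longer by roughly 40 seconds per day, which is easy to see once the column is numeric.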
7) Collect 1 Month of Data Using a Loop
def by_date_2(date):
    url = 'https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date='
    data = requests.get(url + date).json()['results']
    data['date'] = date  # add the date to each record
    return data
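import time
sample_list_2 = []
for date in pd.date_range('2020-01-01', '2020-01-31'):
    date = str(date)[:10]  # keep only the 'YYYY-MM-DD' part of the timestamp
    sample_list_2.append(by_date_2(date))
    time.sleep(1)  # assumed pause between requests; the original delay is not shown
df_new = pd.DataFrame(sample_list_2)
df_new  # This took quite a long time...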
index | sunrise | sunset | solar_noon | day_length | civil_twilight_begin | civil_twilight_end | nautical_twilight_begin | nautical_twilight_end | astronomical_twilight_begin | astronomical_twilight_end | date |
0 | 10:46:31 PM | 8:24:16 AM | 3:35:23 AM | 09:37:45 | 10:17:30 PM | 8:53:17 AM | 9:44:49 PM | 9:25:58 AM | 9:13:03 PM | 9:57:44 AM | 2020-01-01 |
1 | 10:46:40 PM | 8:25:03 AM | 3:35:52 AM | 09:38:23 | 10:17:40 PM | 8:54:03 AM | 9:45:01 PM | 9:26:42 AM | 9:13:16 PM | 9:58:27 AM | 2020-01-02 |
2 | 10:46:47 PM | 8:25:52 AM | 3:36:19 AM | 09:39:05 | 10:17:49 PM | 8:54:50 AM | 9:45:11 PM | 9:27:27 AM | 9:13:27 PM | 9:59:11 AM | 2020-01-03 |
3 | 10:46:52 PM | 8:26:42 AM | 3:36:47 AM | 09:39:50 | 10:17:56 PM | 8:55:38 AM | 9:45:20 PM | 9:28:14 AM | 9:13:37 PM | 9:59:56 AM | 2020-01-04 |
4 | 10:46:54 PM | 8:27:33 AM | 3:37:14 AM | 09:40:39 | 10:18:01 PM | 8:56:27 AM | 9:45:26 PM | 9:29:01 AM | 9:13:45 PM | 10:00:42 AM | 2020-01-05 |
5 | 10:46:55 PM | 8:28:26 AM | 3:37:41 AM | 09:41:31 | 10:18:03 PM | 8:57:18 AM | 9:45:31 PM | 9:29:50 AM | 9:13:52 PM | 10:01:29 AM | 2020-01-06 |
6 | 10:46:54 PM | 8:29:19 AM | 3:38:07 AM | 09:42:25 | 10:18:04 PM | 8:58:09 AM | 9:45:35 PM | 9:30:39 AM | 9:13:57 PM | 10:02:17 AM | 2020-01-07 |
7 | 10:46:50 PM | 8:30:14 AM | 3:38:32 AM | 09:43:24 | 10:18:04 PM | 8:59:01 AM | 9:45:36 PM | 9:31:29 AM | 9:13:59 PM | 10:03:05 AM | 2020-01-08 |
8 | 10:46:45 PM | 8:31:10 AM | 3:38:58 AM | 09:44:25 | 10:18:01 PM | 8:59:54 AM | 9:45:35 PM | 9:32:20 AM | 9:14:01 PM | 10:03:54 AM | 2020-01-09 |
9 | 10:46:37 PM | 8:32:07 AM | 3:39:22 AM | 09:45:30 | 10:17:56 PM | 9:00:48 AM | 9:45:33 PM | 9:33:12 AM | 9:14:00 PM | 10:04:44 AM | 2020-01-10 |
10 | 10:46:28 PM | 8:33:05 AM | 3:39:46 AM | 09:46:37 | 10:17:49 PM | 9:01:43 AM | 9:45:28 PM | 9:34:04 AM | 9:13:57 PM | 10:05:35 AM | 2020-01-11 |
11 | 10:46:16 PM | 8:34:03 AM | 3:40:10 AM | 09:47:47 | 10:17:40 PM | 9:02:39 AM | 9:45:22 PM | 9:34:57 AM | 9:13:53 PM | 10:06:26 AM | 2020-01-12 |
12 | 10:46:03 PM | 8:35:02 AM | 3:40:33 AM | 09:48:59 | 10:17:30 PM | 9:03:36 AM | 9:45:14 PM | 9:35:51 AM | 9:13:47 PM | 10:07:18 AM | 2020-01-13 |
13 | 10:45:47 PM | 8:36:03 AM | 3:40:55 AM | 09:50:16 | 10:17:17 PM | 9:04:33 AM | 9:45:04 PM | 9:36:45 AM | 9:13:39 PM | 10:08:10 AM | 2020-01-14 |
14 | 10:45:29 PM | 8:37:04 AM | 3:41:16 AM | 09:51:35 | 10:17:02 PM | 9:05:30 AM | 9:44:52 PM | 9:37:40 AM | 9:13:30 PM | 10:09:03 AM | 2020-01-15 |
15 | 10:45:10 PM | 8:38:05 AM | 3:41:37 AM | 09:52:55 | 10:16:46 PM | 9:06:29 AM | 9:44:39 PM | 9:38:36 AM | 9:13:18 PM | 10:09:57 AM | 2020-01-16 |
16 | 10:44:48 PM | 8:39:07 AM | 3:41:58 AM | 09:54:19 | 10:16:27 PM | 9:07:28 AM | 9:44:23 PM | 9:39:32 AM | 9:13:05 PM | 10:10:51 AM | 2020-01-17 |
17 | 10:44:24 PM | 8:40:10 AM | 3:42:17 AM | 09:55:46 | 10:16:07 PM | 9:08:27 AM | 9:44:06 PM | 9:40:28 AM | 9:12:49 PM | 10:11:45 AM | 2020-01-18 |
18 | 10:43:58 PM | 8:41:14 AM | 3:42:36 AM | 09:57:16 | 10:15:45 PM | 9:09:27 AM | 9:43:47 PM | 9:41:25 AM | 9:12:32 PM | 10:12:40 AM | 2020-01-19 |
19 | 10:43:31 PM | 8:42:17 AM | 3:42:54 AM | 09:58:46 | 10:15:20 PM | 9:10:28 AM | 9:43:25 PM | 9:42:23 AM | 9:12:13 PM | 10:13:35 AM | 2020-01-20 |
20 | 10:43:01 PM | 8:43:22 AM | 3:43:11 AM | 10:00:21 | 10:14:54 PM | 9:11:28 AM | 9:43:02 PM | 9:43:20 AM | 9:11:53 PM | 10:14:30 AM | 2020-01-21 |
21 | 10:42:30 PM | 8:44:26 AM | 3:43:28 AM | 10:01:56 | 10:14:26 PM | 9:12:29 AM | 9:42:38 PM | 9:44:18 AM | 9:11:30 PM | 10:15:26 AM | 2020-01-22 |
22 | 10:41:56 PM | 8:45:31 AM | 3:43:44 AM | 10:03:35 | 10:13:57 PM | 9:13:31 AM | 9:42:11 PM | 9:45:17 AM | 9:11:06 PM | 10:16:22 AM | 2020-01-23 |
23 | 10:41:21 PM | 8:46:37 AM | 3:43:59 AM | 10:05:16 | 10:13:25 PM | 9:14:33 AM | 9:41:42 PM | 9:46:15 AM | 9:10:39 PM | 10:17:18 AM | 2020-01-24 |
24 | 10:40:44 PM | 8:47:42 AM | 3:44:13 AM | 10:06:58 | 10:12:52 PM | 9:15:35 AM | 9:41:12 PM | 9:47:14 AM | 9:10:11 PM | 10:18:15 AM | 2020-01-25 |
25 | 10:40:05 PM | 8:48:48 AM | 3:44:27 AM | 10:08:43 | 10:12:16 PM | 9:16:37 AM | 9:40:40 PM | 9:48:13 AM | 9:09:42 PM | 10:19:12 AM | 2020-01-26 |
26 | 10:39:25 PM | 8:49:54 AM | 3:44:39 AM | 10:10:29 | 10:11:39 PM | 9:17:39 AM | 9:40:06 PM | 9:49:12 AM | 9:09:10 PM | 10:20:08 AM | 2020-01-27 |
27 | 10:38:42 PM | 8:51:00 AM | 3:44:51 AM | 10:12:18 | 10:11:00 PM | 9:18:42 AM | 9:39:30 PM | 9:50:12 AM | 9:08:37 PM | 10:21:06 AM | 2020-01-28 |
28 | 10:37:58 PM | 8:52:06 AM | 3:45:02 AM | 10:14:08 | 10:10:20 PM | 9:19:44 AM | 9:38:53 PM | 9:51:11 AM | 9:08:01 PM | 10:22:03 AM | 2020-01-29 |
29 | 10:37:12 PM | 8:53:13 AM | 3:45:12 AM | 10:16:01 | 10:09:38 PM | 9:20:47 AM | 9:38:14 PM | 9:52:11 AM | 9:07:24 PM | 10:23:00 AM | 2020-01-30 |
30 | 10:36:24 PM | 8:54:19 AM | 3:45:22 AM | 10:17:55 | 10:08:54 PM | 9:21:50 AM | 9:37:33 PM | 9:53:10 AM | 9:06:46 PM | 10:23:58 AM | 2020-01-31 |
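The month-long run is slow because it makes one HTTP request per day, one after another, with any pause added on top. The sketch below separates the loop from the network call so the loop can be tried out instantly with a stand-in. The names date_strings, collect, and fake_fetch are hypothetical, and in real use by_date_2 would be passed as fetch.

```python
import time
from datetime import date, timedelta

def date_strings(start, end):
    """Yield 'YYYY-MM-DD' strings from start to end, inclusive."""
    current = start
    while current <= end:
        yield current.isoformat()
        current += timedelta(days=1)

def collect(dates, fetch, delay=1.0):
    """Call fetch(day) for each date string, pausing `delay` seconds after each call."""
    rows = []
    for day in dates:
        rows.append(fetch(day))
        time.sleep(delay)  # be polite to the server between requests
    return rows

def fake_fetch(day):
    # Stand-in for by_date_2 so the loop runs without network access
    return {"date": day, "sunrise": "placeholder"}

rows = collect(date_strings(date(2020, 1, 1), date(2020, 1, 31)), fake_fetch, delay=0)
print(len(rows))         # 31
print(rows[0]["date"])   # 2020-01-01
print(rows[-1]["date"])  # 2020-01-31
```

If the real run is still slow, reusing a single requests.Session for all calls may help, since it keeps the connection open instead of reconnecting for every date.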