Web Crawler(1) - JSON

Web Crawler

 

What is a Web Crawler?

A web crawler is an Internet bot that systematically browses the World Wide Web, typically for web indexing (also called web spidering).
It copies pages for processing by a search engine, which indexes the downloaded pages so that users can search more efficiently.

 

In these notes, crawling is split into three types, by the format of the response:

  • JSON - requests

    parsed into a dictionary

  • XML - requests

    parsed with a dictionary & BeautifulSoup

  • HTML - requests, selenium

    parsed with BeautifulSoup

I will go over these three types of crawling using several example sources.
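
As a minimal sketch of why the JSON case "uses dictionary": JSON text maps directly onto Python dictionaries. The sample string below is abbreviated from the Sunrise API response used in these notes, and the standard-library json module stands in for what requests does when .json() is called.

```python
import json

# Abbreviated sample of the JSON text returned by the API used in these notes
raw = '{"results": {"sunrise": "10:28:21 PM", "sunset": "9:03:58 AM"}, "status": "OK"}'

data = json.loads(raw)             # JSON text -> Python dict
print(type(data).__name__)         # the top level is a dict
print(data["results"]["sunrise"])  # nested values are reached key by key
```

Nested JSON objects become nested dictionaries, so the same indexing pattern works at every level.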

 

1. JSON

  • Crawling data from the "Sunrise API"

  • Using the latitudes and longitudes of selected cities

1) Search Google for "Sunrise API"

  • I will use the latitude and longitude of Seoul in this case.


2) Find the Appropriate URL on the Site

3) Use the Requests module - Get the information from the URL

# latitude of Seoul -> 37.5, longitude of Seoul -> 127.0
import requests
url = 'https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date=2021-02-08'
# The URL can be customized by changing the numbers following "lat=" and "lng="
data = requests.get(url).json()
data
# The times in this data are in Coordinated Universal Time (UTC)
{'results': {'sunrise': '10:28:21 PM',
  'sunset': '9:03:58 AM',
  'solar_noon': '3:46:09 AM',
  'day_length': '10:35:37',
  'civil_twilight_begin': '10:01:20 PM',
  'civil_twilight_end': '9:30:58 AM',
  'nautical_twilight_begin': '9:30:26 PM',
  'nautical_twilight_end': '10:01:53 AM',
  'astronomical_twilight_begin': '8:59:55 PM',
  'astronomical_twilight_end': '10:32:24 AM'},
 'status': 'OK'}
# Pull out the inner dictionary stored under the 'results' key
data['results'] 
{'sunrise': '10:28:21 PM',
 'sunset': '9:03:58 AM',
 'solar_noon': '3:46:09 AM',
 'day_length': '10:35:37',
 'civil_twilight_begin': '10:01:20 PM',
 'civil_twilight_end': '9:30:58 AM',
 'nautical_twilight_begin': '9:30:26 PM',
 'nautical_twilight_end': '10:01:53 AM',
 'astronomical_twilight_begin': '8:59:55 PM',
 'astronomical_twilight_end': '10:32:24 AM'}
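
Because the times above are in UTC, a sunrise of 10:28:21 PM looks odd for Seoul. The sketch below shifts a clock time to Korea Standard Time, assuming a fixed UTC+9 offset. The helper name utc_to_kst is hypothetical, and it only converts the time of day; it does not track which calendar date the result falls on.

```python
from datetime import datetime, timedelta

KST_OFFSET = timedelta(hours=9)  # assumption: Korea Standard Time is UTC+9, no DST


def utc_to_kst(time_str):
    """Convert a UTC clock time like '10:28:21 PM' to a 24-hour KST clock time."""
    utc_time = datetime.strptime(time_str, '%I:%M:%S %p')
    return (utc_time + KST_OFFSET).strftime('%H:%M:%S')


print(utc_to_kst('10:28:21 PM'))  # sunrise from the response above
print(utc_to_kst('9:03:58 AM'))   # sunset from the response above
```

With these inputs the sunrise comes out as 07:28:21 and the sunset as 18:03:58 local time.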

4) Select a Date

date = '2020-08-01'
url = 'https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date='
data = requests.get(url + date).json()
data['results']
{'sunrise': '8:36:41 PM',
 'sunset': '10:39:55 AM',
 'solar_noon': '3:38:18 AM',
 'day_length': '14:03:14',
 'civil_twilight_begin': '8:07:52 PM',
 'civil_twilight_end': '11:08:44 AM',
 'nautical_twilight_begin': '7:32:38 PM',
 'nautical_twilight_end': '11:43:59 AM',
 'astronomical_twilight_begin': '6:54:35 PM',
 'astronomical_twilight_end': '12:22:01 PM'}
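
The string concatenation above can be generalized so that the coordinates are parameters too. This is a sketch using the standard-library urlencode; build_url and BASE_URL are names introduced here for illustration, not part of the API.

```python
from urllib.parse import urlencode

BASE_URL = 'https://api.sunrise-sunset.org/json'


def build_url(lat, lng, date):
    """Assemble the request URL from its three query parameters."""
    query = urlencode({'lat': lat, 'lng': lng, 'date': date})
    return f'{BASE_URL}?{query}'


print(build_url(37.5, 127.0, '2020-08-01'))
```

The requests library can also do this encoding itself when the dictionary is passed as requests.get(BASE_URL, params={...}).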

5) Define a Function

def by_date(date):
    url = 'https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date='
    data = requests.get(url + date).json()
    return data['results']

by_date('2020-02-01') # example usage of the function
{'sunrise': '10:35:35 PM',
 'sunset': '8:55:25 AM',
 'solar_noon': '3:45:30 AM',
 'day_length': '10:19:50',
 'civil_twilight_begin': '10:08:08 PM',
 'civil_twilight_end': '9:22:53 AM',
 'nautical_twilight_begin': '9:36:51 PM',
 'nautical_twilight_end': '9:54:10 AM',
 'astronomical_twilight_begin': '9:06:05 PM',
 'astronomical_twilight_end': '10:24:55 AM'}
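
The function above assumes every request succeeds. Since the response carries a 'status' field next to 'results', one possible safeguard is to check it before returning. The helper extract_results and the failure payload are hypothetical, written only to illustrate the check.

```python
def extract_results(payload):
    """Return payload['results'], but only if the API reported success."""
    status = payload.get('status')
    if status != 'OK':
        raise ValueError(f'API returned status {status!r}')
    return payload['results']


ok = {'results': {'sunrise': '10:35:35 PM'}, 'status': 'OK'}
print(extract_results(ok))

bad = {'results': '', 'status': 'INVALID_DATE'}  # hypothetical failure payload
try:
    extract_results(bad)
except ValueError as err:
    print(err)
```

Inside by_date, this would replace the final line: return extract_results(data).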

6) DataFrame and CSV

# Let's try it for 3 days 

import pandas as pd
sample_list = []
sample_list.append(by_date('2020-01-01'))
sample_list.append(by_date('2020-01-02'))
sample_list.append(by_date('2020-01-03'))
df = pd.DataFrame(sample_list)
df.to_csv('sample.csv', index=False)
df
  sunrise sunset solar_noon day_length civil_twilight_begin civil_twilight_end nautical_twilight_begin nautical_twilight_end astronomical_twilight_begin astronomical_twilight_end
0 10:46:31 PM 8:24:16 AM 3:35:23 AM 09:37:45 10:17:30 PM 8:53:17 AM 9:44:49 PM 9:25:58 AM 9:13:03 PM 9:57:44 AM
1 10:46:40 PM 8:25:03 AM 3:35:52 AM 09:38:23 10:17:40 PM 8:54:03 AM 9:45:01 PM 9:26:42 AM 9:13:16 PM 9:58:27 AM
2 10:46:47 PM 8:25:52 AM 3:36:19 AM 09:39:05 10:17:49 PM 8:54:50 AM 9:45:11 PM 9:27:27 AM 9:13:27 PM 9:59:11 AM
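
For comparison, the same list-of-dictionaries shape can be written to CSV without pandas. This is a sketch with the standard-library csv module, not the method used in these notes. The rows are trimmed to three columns from the table above, and an in-memory buffer stands in for the file.

```python
import csv
import io

# Two rows trimmed from the table above (three of the ten columns)
rows = [
    {'sunrise': '10:46:31 PM', 'sunset': '8:24:16 AM', 'day_length': '09:37:45'},
    {'sunrise': '10:46:40 PM', 'sunset': '8:25:03 AM', 'day_length': '09:38:23'},
]

buffer = io.StringIO()  # stands in for open('sample.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()    # dictionary keys become the header row
writer.writerows(rows)  # one CSV line per dictionary
csv_text = buffer.getvalue()
print(csv_text)
```

Each dictionary key becomes a column, which is the same mapping pd.DataFrame applies to sample_list.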

7) Collect 1 Month of Data Using a Loop

def by_date_2(date):
    url = 'https://api.sunrise-sunset.org/json?lat=37.5&lng=127.0&date='
    data = requests.get(url + date).json()['results']
    data['date'] = date # Adding date information
    return data

import time
sample_list_2 = []

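for date in pd.date_range('2020-01-01', '2020-01-31'):
    date = date.strftime('%Y-%m-%d') # Timestamp -> 'YYYY-MM-DD' string
    sample_list_2.append(by_date_2(date))
    time.sleep(0.2) # short pause between requests to avoid hammering the API

df_new = pd.DataFrame(sample_list_2)
df_new # This took a long time: 31 sequential requests, each followed by a 0.2 s pause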
  sunrise sunset solar_noon day_length civil_twilight_begin civil_twilight_end nautical_twilight_begin nautical_twilight_end astronomical_twilight_begin astronomical_twilight_end date
0 10:46:31 PM 8:24:16 AM 3:35:23 AM 09:37:45 10:17:30 PM 8:53:17 AM 9:44:49 PM 9:25:58 AM 9:13:03 PM 9:57:44 AM 2020-01-01
1 10:46:40 PM 8:25:03 AM 3:35:52 AM 09:38:23 10:17:40 PM 8:54:03 AM 9:45:01 PM 9:26:42 AM 9:13:16 PM 9:58:27 AM 2020-01-02
2 10:46:47 PM 8:25:52 AM 3:36:19 AM 09:39:05 10:17:49 PM 8:54:50 AM 9:45:11 PM 9:27:27 AM 9:13:27 PM 9:59:11 AM 2020-01-03
3 10:46:52 PM 8:26:42 AM 3:36:47 AM 09:39:50 10:17:56 PM 8:55:38 AM 9:45:20 PM 9:28:14 AM 9:13:37 PM 9:59:56 AM 2020-01-04
4 10:46:54 PM 8:27:33 AM 3:37:14 AM 09:40:39 10:18:01 PM 8:56:27 AM 9:45:26 PM 9:29:01 AM 9:13:45 PM 10:00:42 AM 2020-01-05
5 10:46:55 PM 8:28:26 AM 3:37:41 AM 09:41:31 10:18:03 PM 8:57:18 AM 9:45:31 PM 9:29:50 AM 9:13:52 PM 10:01:29 AM 2020-01-06
6 10:46:54 PM 8:29:19 AM 3:38:07 AM 09:42:25 10:18:04 PM 8:58:09 AM 9:45:35 PM 9:30:39 AM 9:13:57 PM 10:02:17 AM 2020-01-07
7 10:46:50 PM 8:30:14 AM 3:38:32 AM 09:43:24 10:18:04 PM 8:59:01 AM 9:45:36 PM 9:31:29 AM 9:13:59 PM 10:03:05 AM 2020-01-08
8 10:46:45 PM 8:31:10 AM 3:38:58 AM 09:44:25 10:18:01 PM 8:59:54 AM 9:45:35 PM 9:32:20 AM 9:14:01 PM 10:03:54 AM 2020-01-09
9 10:46:37 PM 8:32:07 AM 3:39:22 AM 09:45:30 10:17:56 PM 9:00:48 AM 9:45:33 PM 9:33:12 AM 9:14:00 PM 10:04:44 AM 2020-01-10
10 10:46:28 PM 8:33:05 AM 3:39:46 AM 09:46:37 10:17:49 PM 9:01:43 AM 9:45:28 PM 9:34:04 AM 9:13:57 PM 10:05:35 AM 2020-01-11
11 10:46:16 PM 8:34:03 AM 3:40:10 AM 09:47:47 10:17:40 PM 9:02:39 AM 9:45:22 PM 9:34:57 AM 9:13:53 PM 10:06:26 AM 2020-01-12
12 10:46:03 PM 8:35:02 AM 3:40:33 AM 09:48:59 10:17:30 PM 9:03:36 AM 9:45:14 PM 9:35:51 AM 9:13:47 PM 10:07:18 AM 2020-01-13
13 10:45:47 PM 8:36:03 AM 3:40:55 AM 09:50:16 10:17:17 PM 9:04:33 AM 9:45:04 PM 9:36:45 AM 9:13:39 PM 10:08:10 AM 2020-01-14
14 10:45:29 PM 8:37:04 AM 3:41:16 AM 09:51:35 10:17:02 PM 9:05:30 AM 9:44:52 PM 9:37:40 AM 9:13:30 PM 10:09:03 AM 2020-01-15
15 10:45:10 PM 8:38:05 AM 3:41:37 AM 09:52:55 10:16:46 PM 9:06:29 AM 9:44:39 PM 9:38:36 AM 9:13:18 PM 10:09:57 AM 2020-01-16
16 10:44:48 PM 8:39:07 AM 3:41:58 AM 09:54:19 10:16:27 PM 9:07:28 AM 9:44:23 PM 9:39:32 AM 9:13:05 PM 10:10:51 AM 2020-01-17
17 10:44:24 PM 8:40:10 AM 3:42:17 AM 09:55:46 10:16:07 PM 9:08:27 AM 9:44:06 PM 9:40:28 AM 9:12:49 PM 10:11:45 AM 2020-01-18
18 10:43:58 PM 8:41:14 AM 3:42:36 AM 09:57:16 10:15:45 PM 9:09:27 AM 9:43:47 PM 9:41:25 AM 9:12:32 PM 10:12:40 AM 2020-01-19
19 10:43:31 PM 8:42:17 AM 3:42:54 AM 09:58:46 10:15:20 PM 9:10:28 AM 9:43:25 PM 9:42:23 AM 9:12:13 PM 10:13:35 AM 2020-01-20
20 10:43:01 PM 8:43:22 AM 3:43:11 AM 10:00:21 10:14:54 PM 9:11:28 AM 9:43:02 PM 9:43:20 AM 9:11:53 PM 10:14:30 AM 2020-01-21
21 10:42:30 PM 8:44:26 AM 3:43:28 AM 10:01:56 10:14:26 PM 9:12:29 AM 9:42:38 PM 9:44:18 AM 9:11:30 PM 10:15:26 AM 2020-01-22
22 10:41:56 PM 8:45:31 AM 3:43:44 AM 10:03:35 10:13:57 PM 9:13:31 AM 9:42:11 PM 9:45:17 AM 9:11:06 PM 10:16:22 AM 2020-01-23
23 10:41:21 PM 8:46:37 AM 3:43:59 AM 10:05:16 10:13:25 PM 9:14:33 AM 9:41:42 PM 9:46:15 AM 9:10:39 PM 10:17:18 AM 2020-01-24
24 10:40:44 PM 8:47:42 AM 3:44:13 AM 10:06:58 10:12:52 PM 9:15:35 AM 9:41:12 PM 9:47:14 AM 9:10:11 PM 10:18:15 AM 2020-01-25
25 10:40:05 PM 8:48:48 AM 3:44:27 AM 10:08:43 10:12:16 PM 9:16:37 AM 9:40:40 PM 9:48:13 AM 9:09:42 PM 10:19:12 AM 2020-01-26
26 10:39:25 PM 8:49:54 AM 3:44:39 AM 10:10:29 10:11:39 PM 9:17:39 AM 9:40:06 PM 9:49:12 AM 9:09:10 PM 10:20:08 AM 2020-01-27
27 10:38:42 PM 8:51:00 AM 3:44:51 AM 10:12:18 10:11:00 PM 9:18:42 AM 9:39:30 PM 9:50:12 AM 9:08:37 PM 10:21:06 AM 2020-01-28
28 10:37:58 PM 8:52:06 AM 3:45:02 AM 10:14:08 10:10:20 PM 9:19:44 AM 9:38:53 PM 9:51:11 AM 9:08:01 PM 10:22:03 AM 2020-01-29
29 10:37:12 PM 8:53:13 AM 3:45:12 AM 10:16:01 10:09:38 PM 9:20:47 AM 9:38:14 PM 9:52:11 AM 9:07:24 PM 10:23:00 AM 2020-01-30
30 10:36:24 PM 8:54:19 AM 3:45:22 AM 10:17:55 10:08:54 PM 9:21:50 AM 9:37:33 PM 9:53:10 AM 9:06:46 PM 10:23:58 AM 2020-01-31
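
Once a month of data is collected, the day_length strings can be turned into numbers for comparison. The sketch below uses a hypothetical helper, day_length_seconds, to measure how much daylight Seoul gained between the first and last rows of the table above.

```python
def day_length_seconds(day_length):
    """Convert an 'HH:MM:SS' duration string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in day_length.split(':'))
    return hours * 3600 + minutes * 60 + seconds


first = day_length_seconds('09:37:45')  # day_length on 2020-01-01
last = day_length_seconds('10:17:55')   # day_length on 2020-01-31
gain = last - first
print(f'{gain // 60} min {gain % 60} s')
```

For these two rows the gain is 2410 seconds, or 40 min 10 s. In pandas, pd.to_timedelta(df_new['day_length']) should give an equivalent column of durations, assuming every value follows this HH:MM:SS format.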