Python Script: Scraping YouTube Autocomplete Suggestions

You may have heard that web scraping has almost no limits. Today, many businesses scrape the data they need to get started, and they keep operating and growing by processing the data they collect.

Training and developing the algorithms used in artificial intelligence projects is especially important, and the most basic requirement for developing and testing those algorithms is data. Although there are many ways to obtain that data, the cheapest and easiest one is web scraping.

What Is Web Scraping?

Web scraping is, at its core, the extraction of data from the web. With web scraping, applications can easily obtain the data they target from websites with the help of a bot or tool. One of its biggest advantages is that it can be automated: a scraper can quickly collect hundreds or even thousands of records from the web and save them to a target database.
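To make the idea concrete, here is a minimal sketch of that workflow using the requests library together with parsel (the same selector library we will use later in this article). The URL and the CSS selector are illustrative placeholders only, not part of the YouTube example.

# A minimal sketch of web scraping: fetch a page, then extract data with a CSS selector.
# The URL and the "h1::text" selector are illustrative placeholders only.
import requests
from parsel import Selector

response = requests.get("https://example.com")
selector = Selector(response.text)

# Collect the text of every <h1> heading on the page.
headings = [heading.strip() for heading in selector.css("h1::text").getall()]
print(headings)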

In today’s article, we will look at scraping YouTube autocomplete suggestions, a data source frequently used in artificial intelligence projects. So let’s get started.

Scraping YouTube Autocomplete

We will scrape YouTube autocomplete data using the Python programming language. First, let’s create a Python project, and inside it a file named ‘index.py’ where we will write our code.

Then, let’s install the packages we will use in our application by running the following command:

pip install parsel selenium webdriver-manager

Now let’s paste the code below into our file named ‘index.py’.

import re, json, time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector

search_items = ["rihanna ", "lana del rey ", "eminem "]
youtube_url = "https://www.youtube.com/"
autocomplete_results = []

def scrape_youtube():
    chrome_driver = Service(ChromeDriverManager().install())
    options = get_options()
    for item in search_items:
        execute_scraping(chrome_driver=chrome_driver, options=options, item=item)

    print(json.dumps(autocomplete_results, indent=4))

def get_options():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--lang=en")
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
    return options

def get_driver_settings(chrome_driver, options):
    driver = webdriver.Chrome(service=chrome_driver, options=options)
    driver.get(youtube_url)
    return driver

def execute_scraping(chrome_driver, options, item):
    driver = get_driver_settings(chrome_driver=chrome_driver, options=options)

    # Wait until the page body is visible before interacting with it.
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))

    send_keys(driver=driver, item=item)

    # Give the autocomplete dropdown a moment to render.
    time.sleep(1)

    cssSelector = Selector(driver.page_source)

    results = get_autocomplete_result(cssSelector=cssSelector)

    append_list_autocomplete_results(item=item, results=results)

    finish_driver(driver=driver)

def send_keys(driver, item):
    searchItem = driver.find_element(By.XPATH, '//input[@id="search"]')
    searchItem.click()
    searchItem.send_keys(item)

def get_autocomplete_result(cssSelector):
    # Pull the suggestion text out of each autocomplete element.
    results = [
        re.search(r'">(.*)</b>', result).group(1).replace("<b>", "")
        for result in cssSelector.css('.sbqs_c').getall()
    ]
    return results

def append_list_autocomplete_results(item, results):
    autocomplete_results.append({
        "item": item.strip(),
        "autocomplete_results": results
    })

def finish_driver(driver):
    driver.quit()

scrape_youtube()

When we examine the code, the following fields are defined as static values. The words in search_items are the terms we want YouTube to autocomplete.

search_items = ["rihanna ", "lana del rey ", "eminem "]
youtube_url = "https://www.youtube.com/"
autocomplete_results = []
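Of course, these terms do not have to be hard-coded. As an illustrative sketch, assuming a plain-text file named keywords.txt with one search term per line, the list could be built like this (the trailing space mirrors the hard-coded items above):

# Hypothetical alternative: read search terms from "keywords.txt", one per line.
# The trailing space mirrors the hard-coded list above.
with open("keywords.txt", encoding="utf-8") as f:
    search_items = [line.strip() + " " for line in f if line.strip()]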

The following function is the entry point of the application’s flow. It first sets up the Chrome driver service and options, then calls the execute_scraping function for each item in search_items, one by one.

def scrape_youtube():
    chrome_driver = Service(ChromeDriverManager().install())
    options = get_options()
    for item in search_items:
        execute_scraping(chrome_driver=chrome_driver, options=options, item=item)

    print(json.dumps(autocomplete_results, indent=4))

The execute_scraping function runs for each target word. After the driver is set up, the word is typed into the YouTube search box and the page source is captured. The target data is then extracted from that source with a CSS selector and appended to the result list. Finally, the driver is shut down.

def execute_scraping(chrome_driver, options, item):
    driver = get_driver_settings(chrome_driver=chrome_driver, options=options)

    # Wait until the page body is visible before interacting with it.
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))

    send_keys(driver=driver, item=item)

    # Give the autocomplete dropdown a moment to render.
    time.sleep(1)

    cssSelector = Selector(driver.page_source)

    results = get_autocomplete_result(cssSelector=cssSelector)

    append_list_autocomplete_results(item=item, results=results)

    finish_driver(driver=driver)
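As a side note, the extraction step does not have to parse the raw HTML with a regular expression. The sketch below is a hypothetical alternative to get_autocomplete_result that asks Selenium for the rendered text of each suggestion element directly; it assumes the same ‘.sbqs_c’ class used in the selector above.

from selenium.webdriver.common.by import By

# Hypothetical alternative: let Selenium return the rendered text of each
# suggestion element, so no regex over the raw page source is needed.
# Assumes the same ".sbqs_c" suggestion class used in the article.
def get_autocomplete_result_from_driver(driver):
    suggestion_elements = driver.find_elements(By.CSS_SELECTOR, ".sbqs_c")
    return [element.text.strip() for element in suggestion_elements]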

When we run the application, the following output is printed to the console.

[
    {
        "item": "rihanna",
        "autocomplete_results": [
            "rihanna lift me up",
            "rihanna songs",
            "rihanna lift me up video",
            "rihanna diamonds",
            "rihanna umbrella",
            "rihanna work",
            "rihanna we found love",
            "rihanna pon de replay",
            "rihanna love on the brain",
            "rihanna stay",
            "rihanna live",
            "rihanna what's my name",
            "rihanna woo",
            "rihanna man down"
        ]
    },
    {
        "item": "lana del rey",
        "autocomplete_results": [
            "lana del rey summertime sadness",
            "lana del rey young and beautiful",
            "lana del rey unreleased",
            "lana del rey doin time",
            "lana del rey playlist",
            "lana del rey west coast",
            "lana del rey dark paradise",
            "lana del rey video games",
            "lana del rey live",
            "lana del rey sped up",
            "lana del rey cola",
            "lana del rey love",
            "lana del rey serial killer",
            "lana del rey high by the beach"
        ]
    },
    {
        "item": "eminem",
        "autocomplete_results": [
            "eminem mockingbird",
            "eminem lose yourself",
            "eminem superman",
            "eminem stan",
            "eminem lyrics",
            "eminem without me",
            "eminem hall of fame 2022",
            "eminem sing for the moment",
            "eminem rap god",
            "eminem venom",
            "eminem godzilla",
            "eminem till i collapse",
            "eminem beautiful",
            "eminem slim shady"
        ]
    }
]
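If you want to persist these results rather than only printing them, a short addition at the end of the script is enough. The snippet below is a minimal sketch; the filename is an arbitrary choice.

# Illustrative only: save the collected results to a JSON file so they can be
# imported into a database or reused later. The filename is an arbitrary choice.
with open("youtube_autocomplete_results.json", "w", encoding="utf-8") as f:
    json.dump(autocomplete_results, f, indent=4, ensure_ascii=False)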

Conclusion

In this example, we looked at how to scrape YouTube autocomplete suggestions with the Python programming language. If you don’t want to deal with code integration, the Zenserp API is for you. It lets you scrape many websites that are otherwise difficult to scrape, especially Google and YouTube, without writing any code. Check out Zenserp’s documentation for more.
