5 minute read

Introduction

In a previous post, I explained how to scrape the most frequently used English vocabulary. But the final purpose is to build an Anki deck to memorize those words based on their French translations. So I was looking for a way to automate the translation of more than 5,000 words…

My first idea was to retrieve the result from Google Translate. One could use alternatives such as DeepL (which is a really decent solution for sentences, by the way) or Larousse (the online service of a famous French dictionary).

Let’s try the first idea here :)

Google Translate

Everyone knows Google Translate. According to Wikipedia:
it is a multilingual neural machine translation service developed by Google, to translate text, documents and websites […] It offers a website interface, a mobile app, and an API (paid service). […] it supports 109 languages at various levels […] more than 100 billion words translated daily. Google Translate can translate multiple forms of text and media, which includes text, speech, and text within still or moving images.

It is interesting to note that Google Translate is coupled with other machine learning services in order to identify text in pictures taken by users, or to translate text that is handwritten on the phone screen or drawn on a virtual keyboard, without the support of a physical keyboard.

When you take a look at its web interface, you quickly see that it is dynamic, so libraries such as BeautifulSoup are useless here. We are going to give it a try with Selenium.

Selenium

source: scrapingbee.com - Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including Python.

The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.

At the beginning of the project, it was mostly used for cross-browser, end-to-end testing (acceptance tests).

Now it is still used for testing, but it is also used as a general browser automation platform. And of course, it is used for web scraping!

Selenium is useful when you have to perform an action on a website such as:

  • Clicking on buttons
  • Filling forms
  • Scrolling
  • Taking a screenshot

It is also useful for executing JavaScript code. Let’s say that you want to scrape a Single Page Application and you haven’t found an easy way to directly call the underlying APIs; in that case, Selenium might be what you need.
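As a quick illustration, here is a minimal sketch of those interactions, assuming Selenium and a browser driver are already installed (covered in the next section). The URL and the selectors below are made-up placeholders, not taken from a real page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get("https://example.com")  # placeholder URL

# fill a form field and press Enter (the field name "q" is hypothetical)
driver.find_element(By.NAME, "q").send_keys("selenium scraping", Keys.ENTER)

# click a button (the CSS selector is hypothetical)
driver.find_element(By.CSS_SELECTOR, "button.submit").click()

# scroll to the bottom of the page by executing JavaScript
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# take a screenshot of the current viewport
driver.save_screenshot("page.png")

driver.quit()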

Pre-requisites

The easiest way to install Selenium in a Python environment is through the pip or conda package managers:

pip install selenium
conda install selenium

While the installation of Selenium makes the functionality available to you, you need additional drivers for it to be able to interface with a chosen web browser. The download links for the drivers are available here: Chrome, Edge, Firefox, and Safari. Once the driver matching your browser version is downloaded, you are good to go; I will use ChromeDriver here.

If you only plan to locally test Selenium, downloading the package and drivers should suffice. However, if you would like to set Selenium up on a remote server, you would additionally need to install the Selenium Server. Selenium Server is written in Java, and you need to have JRE 1.6 or above to install it on your server. It is available on Selenium’s download page.
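For the remote case, a minimal sketch could look like the following; it assumes a Selenium Server is already running, and the host name selenium-host and port 4444 are placeholders for your own setup:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# connect to a remote Selenium Server instead of driving a local browser
remote_browser = webdriver.Remote(
    command_executor='http://selenium-host:4444/wd/hub',
    desired_capabilities=DesiredCapabilities.CHROME,
)
remote_browser.get('https://translate.google.com')
print(remote_browser.title)
remote_browser.quit()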

You have to add the name and the path of the driver to the environment PATH of your OS. For Windows, just put the downloaded binary wherever it suits your needs, update the system settings, and check it from the command line:

cmd.exe > chromedriver 
... chromedriver has successfully started ...

For linux:

# https://chromedriver.chromium.org/downloads
!wget https://chromedriver.storage.googleapis.com/2.41/chromedriver_linux64.zip
!unzip chromedriver_linux64.zip
!sudo mv chromedriver /usr/bin/chromedriver
!sudo chown root:root /usr/bin/chromedriver
!sudo chmod +x /usr/bin/chromedriver

sudo vim /home/username/.profile
# add this line at the end of the file:
export PATH=$PATH:/pathtodriver/webdriver
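A quick sanity check from Python (just a convenience, not required) to make sure the driver is actually visible on your PATH:

import shutil

# prints the full path of the binary (e.g. /usr/bin/chromedriver),
# or None if it is not on the PATH yet
print(shutil.which("chromedriver"))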

Let’s scrape it baby !

You have to use the following parameters in your code :

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(chrome_options=chrome_options)

The “--no-sandbox” parameter lets Chrome run with root privileges.
The “--headless” parameter prevents the graphical interface from opening.

You can add these parameters to get a better experience:
chrome_options.add_argument('--blink-settings=imagesEnabled=false')
chrome_options.add_argument('--disable-gpu')

from selenium import webdriver

DRIVER_PATH = '/home/pathto/chromedriver' #'/usr/bin/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://google.com')

The driver.page_source property returns the full HTML code of the rendered page.

Here are two other interesting WebDriver properties:

  • driver.title gets the page’s title
  • driver.current_url gets the current URL (this can be useful when there are redirections on the website and you need the final URL)
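Continuing with the driver created above (which already loaded google.com), a quick look at those properties might be:

print(driver.title)             # page title, e.g. "Google"
print(driver.current_url)       # final URL, after any redirects
print(len(driver.page_source))  # size of the rendered HTML, in characters
driver.quit()                   # close the browser when you are done with it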

One can launch Chrome in headful mode (like a regular Chrome window, controlled by your Python code); in that case you should see a message stating that the browser is controlled by automated software. But for automation purposes it is better to run Chrome in headless mode (without any graphical user interface).

I’ve found a lovely snippet by Walber Nunes on GitHub that does the job pretty well! Here it is:

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
#from fake_useragent import UserAgent
from random import randrange
from datetime import datetime
import time

#ua = UserAgent()
#userAgent = ua.random
userAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"
options = webdriver.ChromeOptions()
options.add_argument(f"user-agent={userAgent}")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(executable_path=r"chromedriver.exe", options=options)

def translate_term(text, target_lang, source_lang="en"):
    translation = 'N/A'

    try:
        # open Google Translate with source/target languages and the query pre-filled
        browser.get(f"https://translate.google.com/?sl={source_lang}&tl={target_lang}&text={text}&op=translate")
        # wait (up to 10 seconds) for the translated span to appear in the DOM
        condition = EC.presence_of_element_located((By.XPATH, f"//span[@data-language-for-alternatives='{target_lang}']/span[1]"))
        element = WebDriverWait(browser, 10).until(condition)
        translation = element.text
        # small random pause (0 or 1 second) between requests
        time.sleep(randrange(2))
    except Exception:
        # log the time of the failure and keep the 'N/A' placeholder
        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        print("Something went wrong at: ", current_time)

    return translation

Then you can print the result of various words’ translations:

print(translate_term(text="abolish", target_lang="fr"))
print(translate_term(text="nice", target_lang="fr"))
print(translate_term(text="mouse", target_lang="fr"))
print(translate_term(text="zone", target_lang="fr"))
print(translate_term(text="flight", target_lang="fr"))

and the results are:

abolir
Something went wrong at: 21:04:10
N/A
Souris
zone
voyage en avion
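To come back to the original goal (the Anki deck), a possible way to batch-translate the whole word list is to loop over it with translate_term and write the pairs to a tab-separated file that Anki can import. This is only a sketch; the file names english_words.txt and anki_deck.tsv are placeholders:

import csv

# hypothetical input file: one English word per line
with open("english_words.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

# Anki can import tab-separated files: front<TAB>back
with open("anki_deck.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    for word in words:
        writer.writerow([word, translate_term(text=word, target_lang="fr")])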

After that, you have to close the headless browser.

browser.close()
# or browser.quit() to also end the WebDriver session

End note & References

To be honest, this is just a first look at Selenium’s capabilities, but you can do much more with it. I like the idea of automating stuff on webpages; it is really powerful. If you want to dive a little deeper into it, here are a few interesting resources for you: