The Hadoop Distributed File System (HDFS) is a great way to store vast amounts of tabular data. To access this data as conveniently as possible, tools like Hive or Impala can be used. These offer an SQL-like interface to handle the data stored in HDFS. However, when it is necessary to run the same queries recurrently, it might be useful to automate these jobs with the help of a Python script. This is where the library impyla comes into play. When I first worked on some ETL scripts, I was disappointed with the existing official documentation of impyla. However, after getting used to the basic workflow, it is a convenient way to run recurring ETL tasks.
The basic setup is quite simple. With "pip install impyla", the necessary Python library can be installed. To establish a connection to your Impala database, you need to import the connect function from impala.dbapi. By calling the connect function with parameters like host, port, database name, username and password, the connection is initialized and a connection object is returned.
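A minimal sketch of this step could look like the following; the host, port, database and credentials are placeholders you would replace with your own environment's details (depending on your cluster's setup, you may also need to pass additional arguments such as an authentication mechanism):

```python
from impala.dbapi import connect

# Placeholder connection details - replace with your own environment.
conn = connect(
    host='impala.example.com',   # hypothetical Impala host
    port=21050,                  # default port for the HiveServer2 protocol
    database='my_database',      # hypothetical database name
    user='my_user',
    password='my_password',
)
```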
To run queries, the impyla connection object offers a cursor. If you call connection_object.cursor(), a cursor object is returned. With this cursor object, SQL-like statements can be executed. By using the cursor_object.execute(SQL_STATEMENT) function, the SQL statement is sent to the database server. For queries that don't read data from the Impala database, this is already the final step. For read queries, however, impala.util provides the function as_pandas(). This function takes your cursor object and returns the respective data in the form of a pandas DataFrame.
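As a sketch, assuming the connection object from above and a hypothetical table called sales, executing statements and reading results into pandas could look like this:

```python
from impala.util import as_pandas

cursor = conn.cursor()

# Statements that don't return data (e.g. DDL) only need execute().
cursor.execute('CREATE TABLE IF NOT EXISTS sales_backup LIKE sales')

# For read queries, execute the statement and then convert the
# result set into a pandas DataFrame via as_pandas().
cursor.execute('SELECT * FROM sales LIMIT 100')
df = as_pandas(cursor)
print(df.head())
```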
Finally, please remember to always close your connection to the Impala database using connection_object.close(), even if errors occur during query execution. Your database admin will thank you ;)
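One simple way to guarantee this is a try/finally block; the connection details and the query below are again placeholders:

```python
from impala.dbapi import connect

conn = connect(host='impala.example.com', port=21050)  # placeholder details
try:
    cursor = conn.cursor()
    cursor.execute('SELECT COUNT(*) FROM sales')  # placeholder query
    print(cursor.fetchall())
finally:
    # Close the connection whether or not the query succeeded.
    conn.close()
```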
I hope this short tutorial will be helpful to you. Feel free to reach out in case of any questions.