PySpark Read from Hive Tables Using JDBC Driver

Thrift JDBC/ODBC Server (aka Spark Thrift Server or STS) is Spark SQL's port of Apache Hive's HiveServer2 that allows JDBC/ODBC clients to execute SQL queries over JDBC and ODBC protocols on Apache Spark.

With Spark Thrift Server, business users can work with their shiny Business Intelligence (BI) tools, e.g. Tableau or Microsoft Excel, and connect to Apache Spark using the ODBC interface. That brings the in-memory distributed capabilities of Spark SQL's query engine (with all the Catalyst query optimizations you surely like very much) to environments that were initially "disconnected".

Besides, SQL queries in Spark Thrift Server share the same SparkContext, which helps further improve performance of SQL queries using the same data sources.
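For example (a sketch, not from the original text; it assumes a people table exists), a table cached in one JDBC session is served from the shared in-memory cache in every other session, precisely because all sessions run in the one SparkContext:

    -- in one beeline session
    CACHE TABLE people;
    -- in another session against the same server: reads the shared cache
    SELECT COUNT(*) FROM people;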

Spark Thrift Server is a Spark standalone application that you start using the start-thriftserver.sh and stop using the stop-thriftserver.sh shell scripts.

Spark Thrift Server has its own tab in web UI — JDBC/ODBC Server available at /sqlserver URL.

Figure 1. Spark Thrift Server's web UI

Spark Thrift Server can work in HTTP or binary transport modes.

Use the beeline command-line tool, SQuirreL SQL Client, or Spark SQL's DataSource API to connect to Spark Thrift Server through the JDBC interface.

Spark Thrift Server extends spark-submit's command-line options with --hiveconf [prop=value].
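For example, a sketch of a customized startup (hive.server2.thrift.port is a standard HiveServer2 property; the value 10001 is just an illustration):

    ./sbin/start-thriftserver.sh \
      --hiveconf hive.server2.thrift.port=10001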

Important

You have to enable the hive-thriftserver build profile to include Spark Thrift Server in your build.

    ./build/mvn -Phadoop-2.7,yarn,mesos,hive,hive-thriftserver -DskipTests clean install

Tip

Enable INFO or DEBUG logging levels for org.apache.spark.sql.hive.thriftserver and org.apache.hive.service.server loggers to see what happens inside.

Add the following lines to conf/log4j.properties:

    log4j.logger.org.apache.spark.sql.hive.thriftserver=DEBUG
    log4j.logger.org.apache.hive.service.server=INFO

Starting Thrift JDBC/ODBC Server — start-thriftserver.sh

You can start Thrift JDBC/ODBC Server using the ./sbin/start-thriftserver.sh shell script.

With INFO logging level enabled, when you execute the script you should see the following INFO messages in the logs:

    INFO HiveThriftServer2: Started daemon with process name: [pid]@[hostname]
    INFO HiveThriftServer2: Starting SparkContext
    ...
    INFO HiveThriftServer2: HiveThriftServer2 started

Internally, the start-thriftserver.sh script submits the org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 standalone application for execution (using spark-submit).

              $ ./bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2            

Tip

Using the more explicit spark-submit approach to start Spark Thrift Server makes it easier to trace execution, since the logs are printed to the standard output and hence the terminal directly.

Using Beeline JDBC Client to Connect to Spark Thrift Server

beeline is a command-line tool that allows you to access Spark Thrift Server using the JDBC interface on the command line. It is included in the Spark distribution in the bin directory.

    $ ./bin/beeline
    Beeline version 1.2.1.spark2 by Apache Hive
    beeline>

You can connect to Spark Thrift Server using the !connect command as follows:

              beeline> !connect jdbc:hive2://localhost:10000            

When connecting in non-secure mode, simply enter the username on your machine and a blank password.

    beeline> !connect jdbc:hive2://localhost:10000
    Connecting to jdbc:hive2://localhost:10000
    Enter username for jdbc:hive2://localhost:10000: jacek
    Enter password for jdbc:hive2://localhost:10000: [press ENTER]
    Connected to: Spark SQL (version 2.3.0)
    Driver: Hive JDBC (version 1.2.1.spark2)
    Transaction isolation: TRANSACTION_REPEATABLE_READ
    0: jdbc:hive2://localhost:10000>

Once connected, you can send SQL queries (as if Spark SQL were a JDBC-compliant database).

    0: jdbc:hive2://localhost:10000> show databases;
    +---------------+--+
    | databaseName  |
    +---------------+--+
    | default       |
    +---------------+--+
    1 row selected (0.074 seconds)

Connecting to Spark Thrift Server using SQuirreL SQL Client 3.7.1

Spark Thrift Server allows for remote access to Spark SQL using the JDBC protocol.

SQuirreL SQL Client is a Java SQL client for JDBC-compliant databases.

Run the client using java -jar squirrel-sql.jar.

Figure 2. SQuirreL SQL Client

You first have to configure a JDBC driver for Spark Thrift Server. Spark Thrift Server uses the org.spark-project.hive:hive-jdbc:1.2.1.spark2 dependency, which is the JDBC driver (and also pulls in transitive dependencies).
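If you prefer to resolve the driver with a build tool, the coordinates above translate into an sbt dependency (a sketch; it assumes your resolvers host the org.spark-project.hive artifacts):

    // build.sbt sketch: the Hive JDBC driver used by Spark Thrift Server
    libraryDependencies += "org.spark-project.hive" % "hive-jdbc" % "1.2.1.spark2"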

Tip

The Hive JDBC Driver, i.e. hive-jdbc-1.2.1.spark2.jar, and other jar files are in the jars directory of the Apache Spark distribution (or assembly/target/scala-2.11/jars for local builds).
Table 1. SQuirreL SQL Client's Connection Parameters

    Parameter         Description
    Name              Spark Thrift Server
    Example URL       jdbc:hive2://localhost:10000
    Extra Class Path  All the jar files of your Spark distribution
    Class Name        org.apache.hive.jdbc.HiveDriver

Figure 3. Adding Hive JDBC Driver in SQuirreL SQL Client

With the Hive JDBC Driver defined, you can connect to Spark SQL Thrift Server.

Figure 4. Adding Hive JDBC Driver in SQuirreL SQL Client

Since you did not specify the database to use, Spark SQL's default is used.

Figure 5. SQuirreL SQL Client Connected to Spark Thrift Server (Metadata Tab)

Below is the show tables SQL query in SQuirreL SQL Client executed in Spark SQL through Spark Thrift Server.

Figure 6. show tables SQL Query in SQuirreL SQL Client using Spark Thrift Server
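The same connection parameters work from any JDBC-capable program, not just SQuirreL. Below is a minimal Scala sketch (not from the original text) that connects via java.sql.DriverManager; it assumes the Hive JDBC driver jars from the Spark distribution are on the classpath and the server runs in non-secure mode:

    import java.sql.DriverManager

    object ThriftServerSmokeTest extends App {
      // Same driver class as in SQuirreL's configuration
      Class.forName("org.apache.hive.jdbc.HiveDriver")

      // Non-secure mode: your username and a blank password
      val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "jacek", "")
      try {
        val rs = conn.createStatement().executeQuery("show databases")
        while (rs.next()) println(rs.getString(1)) // prints: default
      } finally conn.close()
    }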

Using Spark SQL's DataSource API to Connect to Spark Thrift Server

What might seem a quite artificial setup at first is accessing Spark Thrift Server using Spark SQL's DataSource API, i.e. DataFrameReader's jdbc method.

Tip

When executed in local mode, Spark Thrift Server and spark-shell will try to access the same Hive Warehouse directory, which will inevitably lead to an error. Use spark.sql.warehouse.dir to point spark-shell to another directory:

                        ./bin/spark-shell --conf spark.sql.warehouse.dir=/tmp/spark-warehouse                      

You should also not share the same home directory between them since metastore_db becomes an issue.

    // Inside spark-shell
    // Paste in :paste mode
    val df = spark
      .read
      .option("url", "jdbc:hive2://localhost:10000")  // (1)
      .option("dbtable", "people")                    // (2)
      .format("jdbc")
      .load
  1. Connect to Spark Thrift Server at localhost on port 10000

  2. Use the people table. It assumes that the people table is available.
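An equivalent call (a sketch; same assumption about the people table) uses DataFrameReader's jdbc method directly instead of the option/format/load chain:

    // Inside spark-shell; jdbc(url, table, properties) is the direct form
    import java.util.Properties
    val df2 = spark.read.jdbc("jdbc:hive2://localhost:10000", "people", new Properties)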

ThriftServerTab — web UI's Tab for Spark Thrift Server

ThriftServerTab is…​FIXME

Stopping Thrift JDBC/ODBC Server — stop-thriftserver.sh

You can stop a running instance of Thrift JDBC/ODBC Server using the ./sbin/stop-thriftserver.sh shell script.

With DEBUG logging level enabled, you should see the following messages in the logs:

    ERROR HiveThriftServer2: RECEIVED SIGNAL TERM
    DEBUG SparkSQLEnv: Shutting down Spark SQL Environment
    INFO HiveServer2: Shutting down HiveServer2
    INFO BlockManager: BlockManager stopped
    INFO SparkContext: Successfully stopped SparkContext

Tip

You can also send the SIGTERM signal to the process of Thrift JDBC/ODBC Server, i.e. kill [PID], which triggers the same sequence of shutdown steps as stop-thriftserver.sh.

Transport Mode

Spark Thrift Server can be configured to listen in two modes (aka transport modes):

  1. Binary mode — clients should send thrift requests in binary

  2. HTTP mode — clients send thrift requests over HTTP.

You can control the transport mode using the HIVE_SERVER2_TRANSPORT_MODE environment variable (e.g. http) or the hive.server2.transport.mode property, which can be binary (default) or http.
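For example, two equivalent sketches of starting the server in HTTP mode (using the environment variable and the Hive property named above):

    # via the environment variable
    HIVE_SERVER2_TRANSPORT_MODE=http ./sbin/start-thriftserver.sh
    # or via --hiveconf
    ./sbin/start-thriftserver.sh --hiveconf hive.server2.transport.mode=http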

main method

Thrift JDBC/ODBC Server is a Spark standalone application that you…

HiveThriftServer2Listener


Source: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-thrift-server.html
