Подключение к Hive из Apache Spark

У меня есть простая программа, которую я запускаю на автономной виртуальной машине Cloudera. Я создал управляемую таблицу в Hive, которую хочу прочитать в Apache spark, но начальное подключение к улью не устанавливается. Пожалуйста, порекомендуйте.

Я запускаю эту программу в IntelliJ, я скопировал hive-site.xml из моего /etc/hive/conf в /etc/spark/conf, даже тогда spark-job не подключается к метахранилищу Hive.

 public static void main(String[] args) throws AnalysisException {
         String master = "local[*]";

         SparkSession sparkSession = SparkSession
                 .builder().appName(ConnectToHive.class.getName())
                 .config("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
                 .enableHiveSupport()
                 .master(master).getOrCreate();

         SparkContext context = sparkSession.sparkContext();
         context.setLogLevel("ERROR");

         SQLContext sqlCtx = sparkSession.sqlContext();

         HiveContext hiveContext = new HiveContext(sparkSession);
         hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse");

         hiveContext.sql("SHOW DATABASES").show();
         hiveContext.sql("SHOW TABLES").show();

         sparkSession.close();
     }

Вывод показан ниже, где ожидается увидеть «Таблицу сотрудников», чтобы я мог запросить. Поскольку я работаю на Standa-alone, хранилище метаданных куста находится на локальном сервере mySQL.

 +------------+
 |databaseName|
 +------------+
 |     default|
 +------------+

 +--------+---------+-----------+
 |database|tableName|isTemporary|
 +--------+---------+-----------+
 +--------+---------+-----------+

jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true — это конфигурация хранилища метаданных Hive.

 hive> show databases;
 OK
 default
 sxm
 temp
 Time taken: 0.019 seconds, Fetched: 3 row(s)
 hive> use default;
 OK
 Time taken: 0.015 seconds
 hive> show tables;
 OK
 employee
 Time taken: 0.014 seconds, Fetched: 1 row(s)
 hive> describe formatted employee;
 OK
 # col_name             data_type               comment             

 id                     string                                      
 firstname              string                                      
 lastname               string                                      
 addresses              array<struct<street:string,city:string,state:string>>                       

 # Detailed Table Information        
 Database:              default                  
 Owner:                 cloudera                 
 CreateTime:            Tue Jul 25 06:33:01 PDT 2017     
 LastAccessTime:        UNKNOWN                  
 Protect Mode:          None                     
 Retention:             0                        
 Location:              hdfs://quickstart.cloudera:8020/user/hive/warehouse/employee     
 Table Type:            MANAGED_TABLE            
 Table Parameters:       
    transient_lastDdlTime   1500989581          

 # Storage Information       
 SerDe Library:         org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe  
 InputFormat:           org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    
 OutputFormat:          org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat   
 Compressed:            No                       
 Num Buckets:           -1                       
 Bucket Columns:        []                       
 Sort Columns:          []                       
 Storage Desc Params:        
    serialization.format    1                   
 Time taken: 0.07 seconds, Fetched: 29 row(s)
 hive> 

Добавлены журналы искры

 log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
 log4j:WARN Please initialize the log4j system properly.
 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
 17/07/25 11:38:30 INFO SparkContext: Running Spark version 2.1.0
 17/07/25 11:38:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
 17/07/25 11:38:30 INFO SecurityManager: Changing view acls to: cloudera
 17/07/25 11:38:30 INFO SecurityManager: Changing modify acls to: cloudera
 17/07/25 11:38:30 INFO SecurityManager: Changing view acls groups to: 
 17/07/25 11:38:30 INFO SecurityManager: Changing modify acls groups to: 
 17/07/25 11:38:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(cloudera); groups with view permissions: Set(); users  with modify permissions: Set(cloudera); groups with modify permissions: Set()
 17/07/25 11:38:31 INFO Utils: Successfully started service 'sparkDriver' on port 55232.
 17/07/25 11:38:31 INFO SparkEnv: Registering MapOutputTracker
 17/07/25 11:38:31 INFO SparkEnv: Registering BlockManagerMaster
 17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
 17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
 17/07/25 11:38:31 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-eb1e611f-1b88-487f-b600-3da1ff8353db
 17/07/25 11:38:31 INFO MemoryStore: MemoryStore started with capacity 1909.8 MB
 17/07/25 11:38:31 INFO SparkEnv: Registering OutputCommitCoordinator
 17/07/25 11:38:31 INFO Utils: Successfully started service 'SparkUI' on port 4040.
 17/07/25 11:38:31 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.2.15:4040
 17/07/25 11:38:31 INFO Executor: Starting executor ID driver on host localhost
 17/07/25 11:38:31 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41433.
 17/07/25 11:38:31 INFO NettyBlockTransferService: Server created on 10.0.2.15:41433
 17/07/25 11:38:31 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
 17/07/25 11:38:31 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
 17/07/25 11:38:31 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.2.15:41433 with 1909.8 MB RAM, BlockManagerId(driver, 10.0.2.15, 41433, None)
 17/07/25 11:38:31 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.2.15, 41433, None)
 17/07/25 11:38:31 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.2.15, 41433, None)
 17/07/25 11:38:32 INFO SharedState: Warehouse path is 'file:/home/cloudera/works/JsonHive/spark-warehouse/'.
 17/07/25 11:38:32 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
 17/07/25 11:38:32 INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
 17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
 17/07/25 11:38:32 INFO deprecation: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
 17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
 17/07/25 11:38:32 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
 17/07/25 11:38:32 INFO deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
 17/07/25 11:38:32 INFO deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
 17/07/25 11:38:32 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
 17/07/25 11:38:32 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
 17/07/25 11:38:32 INFO ObjectStore: ObjectStore, initialize called
 17/07/25 11:38:32 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
 17/07/25 11:38:32 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
 17/07/25 11:38:34 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
 17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
 17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
 17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
 17/07/25 11:38:35 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
 17/07/25 11:38:35 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
 17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
 17/07/25 11:38:35 INFO ObjectStore: Initialized ObjectStore
 17/07/25 11:38:36 INFO HiveMetaStore: Added admin role in metastore
 17/07/25 11:38:36 INFO HiveMetaStore: Added public role in metastore
 17/07/25 11:38:36 INFO HiveMetaStore: No user is added in admin role, since config is empty
 17/07/25 11:38:36 INFO HiveMetaStore: 0: get_all_databases
 17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr  cmd=get_all_databases   
 17/07/25 11:38:36 INFO HiveMetaStore: 0: get_functions: db=default pat=*
 17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr  cmd=get_functions: db=default pat=* 
 17/07/25 11:38:36 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
 17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/76258222-81db-4ac1-9566-1d8f05c3ecba_resources
 17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
 17/07/25 11:38:36 INFO SessionState: Created local directory: /tmp/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba
 17/07/25 11:38:36 INFO SessionState: Created HDFS directory: /tmp/hive/cloudera/76258222-81db-4ac1-9566-1d8f05c3ecba/_tmp_space.db
 17/07/25 11:38:36 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/home/cloudera/works/JsonHive/spark-warehouse/
 17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: default
 17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr  cmd=get_database: default   
 17/07/25 11:38:36 INFO HiveMetaStore: 0: get_database: global_temp
 17/07/25 11:38:36 INFO audit: ugi=cloudera ip=unknown-ip-addr  cmd=get_database: global_temp   
 17/07/25 11:38:36 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
 +------------+
 |databaseName|
 +------------+
 |     default|
 +------------+

 +--------+---------+-----------+
 |database|tableName|isTemporary|
 +--------+---------+-----------+
 +--------+---------+-----------+


 Process finished with exit code 0

ОБНОВИТЬ

/usr/lib/hive/conf/hive-site.xml не было в пути к классам, поэтому он не читал таблицы, после добавления его в путь к классам он работал нормально ... Поскольку я работал с IntelliJ, у меня есть эта проблема. .в продакшене папка spark-conf будет иметь ссылку на hive-site.xml...


person Dave    schedule 25.07.2017    source источник
comment
Вам больше не нужно создавать HiveContext. Позвонить enableHiveSupport на SparkSession должно быть достаточно. Попробуйте позвонить sparkSession.sql("SHOW DATABASES").show();   -  person nonsleepr    schedule 25.07.2017
comment
Не повезло. Я пробовал это.   -  person Dave    schedule 25.07.2017
comment
Что, если вы уроните .config("spark.sql.warehouse.dir", ...)? Spark должен подобрать правильную конфигурацию самостоятельно. Если нет, можете поделиться логами исполнения?   -  person nonsleepr    schedule 25.07.2017
comment
Добавлены искровые логи в вопрос   -  person Dave    schedule 25.07.2017
comment
Влияет ли "spark.sql.warehouse.dir" на журнал? SharedState: Warehouse path is 'file:/home/cloudera/works/JsonHive/spark-warehouse/'. MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is file:/home/cloudera/works/JsonHive/spark-warehouse/ Искра   -  person nonsleepr    schedule 25.07.2017
comment
Попробуйте добавить /etc/hive/conf в путь к классам, чтобы Spark мог найти там hive-site.xml.   -  person nonsleepr    schedule 25.07.2017


Ответы (1)


 17/07/25 11:38:35 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY

Это намек на то, что вы не подключены к удаленному хранилищу метаданных куста (которое вы установили как MySQL), и XML-файл неправильно указан в вашем пути к классам.

Вы можете сделать это программно без XML, прежде чем создавать SparkSession.

System.setProperty("hive.metastore.uris", "thrift://METASTORE:9083");

Как программно подключиться к хранилищу метаданных Hive в SparkSQL?< /а>

person OneCricketeer    schedule 28.07.2017
comment
спасибо... в spark_home у меня есть ссылка на hive-site.xml, где я указал все метасторы и другие детали. /usr/lib/hive/conf/hive-site.xml - person Dave; 29.07.2017
comment
Дайте ссылку или скопируйте, конечно, но это не обязательно. Альтернативой является определение переменной окружения HADOOP_CONF_DIR - person OneCricketeer; 29.07.2017