hadoop - Presto failing to query Hive table
On EMR I created a dataset in Parquet using Spark and stored it on S3. I am able to create an external table and query it using Hive, but when I try to perform the same query using Presto I obtain an error (the part of the file referred to changes at every run).
2016-11-13T13:11:15.165Z ERROR remote-task-callback-36 com.facebook.presto.execution.StageStateMachine Stage 20161113_131114_00004_yp8y5.1 failed
com.facebook.presto.spi.PrestoException: Error opening Hive split s3://my_bucket/my_table/part-r-00013-b17b4495-f407-49e0-9d15-41bb0b68c605.snappy.parquet (offset=1100508800, length=68781800): null
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:475)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.<init>(ParquetHiveRecordCursor.java:247)
    at com.facebook.presto.hive.parquet.ParquetRecordCursorProvider.createHiveRecordCursor(ParquetRecordCursorProvider.java:96)
    at com.facebook.presto.hive.HivePageSourceProvider.getHiveRecordCursor(HivePageSourceProvider.java:129)
    at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:107)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
    at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:48)
    at com.facebook.presto.operator.TableScanOperator.createSourceIfNecessary(TableScanOperator.java:268)
    at com.facebook.presto.operator.TableScanOperator.isFinished(TableScanOperator.java:210)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:375)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
    at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:529)
    at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:665)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:420)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.lambda$createParquetRecordReader$0(ParquetHiveRecordCursor.java:416)
    at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
    at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:416)
    ... 16 more
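For context on where that EOFException comes from: ParquetFileReader.readFooter seeks to the last 8 bytes of the file (a 4-byte little-endian footer length followed by the "PAR1" magic) using the length reported by the filesystem. A minimal sketch of that tail read in Python, with a fabricated local file standing in for the S3 object (the file layout here is only the trailing 8 bytes of a real Parquet file, which is all this read touches):

```python
import os
import struct
import tempfile

PARQUET_MAGIC = b"PAR1"

def read_footer_length(path, listed_length):
    """Mimic the tail read in ParquetFileReader.readFooter: the last
    8 bytes are a 4-byte little-endian footer length plus 'PAR1'."""
    with open(path, "rb") as f:
        f.seek(listed_length - 8)
        tail = f.read(8)
    if len(tail) < 8:
        # This is the point where DataInputStream.readFully
        # raises EOFException in the Java stack above.
        raise EOFError("short read of parquet tail")
    footer_len = struct.unpack("<I", tail[:4])[0]
    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("not a parquet file (magic mismatch)")
    return footer_len

# Fabricate a file ending in a valid tail: 32-byte footer, length, magic.
path = os.path.join(tempfile.mkdtemp(), "part-r-00000.parquet")
with open(path, "wb") as f:
    f.write(b"\x00" * 32 + struct.pack("<I", 32) + PARQUET_MAGIC)

print(read_footer_length(path, os.path.getsize(path)))  # 32
```

As long as `listed_length` matches the real file size, this read succeeds; the error in the question suggests that precondition is violated.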
The Parquet location is made up of 128 parts; the data is stored on S3 and encrypted using client-side encryption with KMS. Presto uses a custom encryption-materials provider (specified via presto.s3.encryption-materials-provider) that returns a KMSEncryptionMaterials object initialized with the master key. I am using EMR 5.1.0 (Hive 2.1.0, Spark 2.0.1, Presto 0.152.3).
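For reference, wiring such a provider into Presto's Hive connector looks roughly like the catalog fragment below. This is a sketch, not a verified configuration: the provider class name is a placeholder, and depending on the Presto release the setting may be exposed as hive.s3.encryption-materials-provider in the catalog file rather than as presto.s3.encryption-materials-provider in the Hadoop configuration, so check the documentation for your version.

```properties
# etc/catalog/hive.properties (sketch; the provider class is a placeholder)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore-host:9083
# Custom client-side-encryption materials provider that returns
# KMSEncryptionMaterials initialized with the master key
hive.s3.encryption-materials-provider=com.example.MyKmsMaterialsProvider
```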
Does this surface when encryption is turned off?
There was a bug report surfaced against the ASF S3A client (not the EMR one) about things breaking when the filesystem's listed length != the actual file length. That is: because of encryption, the file length in the listing is greater than the length available in a read.
We couldn't reproduce it in our tests, and our conclusion anyway was "filesystems must not do that" (indeed, it's a fundamental requirement of the Hadoop FS spec: the listed length must equal the actual length). If the EMR code is getting this wrong, it's a bug in the driver which downstream code cannot be expected to handle.
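The failure mode described in this answer can be reproduced in a few lines: if the listing reports a length larger than the bytes actually readable, the footer seek lands past the real end of the object and the 8-byte tail read comes up short, which is exactly what DataInputStream.readFully turns into an EOFException. A toy sketch using a plain local file (the file name and the inflated "listed" size of 116 bytes are made up for illustration):

```python
import os
import tempfile

def read_tail(path, listed_length):
    """Read the 8-byte parquet tail at listed_length - 8, as a footer
    reader would; a short read means the listing overstated the size."""
    with open(path, "rb") as f:
        f.seek(listed_length - 8)
        tail = f.read(8)
    if len(tail) < 8:
        raise EOFError("listed length %d > actual length %d"
                       % (listed_length, os.path.getsize(path)))
    return tail

path = os.path.join(tempfile.mkdtemp(), "object")
with open(path, "wb") as f:
    f.write(b"x" * 100)  # the actual readable object is 100 bytes

print(read_tail(path, 100))  # works: returns the last 8 bytes
# With client-side encryption the listing may report the padded
# ciphertext length, e.g. 116 bytes, while only 100 are readable:
try:
    read_tail(path, 116)
except EOFError as e:
    print("EOF:", e)
```

This is why the spec requirement matters: every consumer that locates data relative to end-of-file (Parquet footers being the canonical case) silently depends on the listed length being exact.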