{"id":5975,"date":"2021-08-13T08:36:13","date_gmt":"2021-08-13T08:36:13","guid":{"rendered":"https:\/\/northbaysolutions.com\/?p=5975"},"modified":"2025-03-20T11:29:52","modified_gmt":"2025-03-20T11:29:52","slug":"amazon-athena-beyond-the-basics-part-2","status":"publish","type":"post","link":"https:\/\/northbaysolutions.com\/big-data-data-lake-analytics\/amazon-athena-beyond-the-basics-part-2\/","title":{"rendered":"Amazon Athena: Beyond The Basics \u2013 Part 2"},"content":{"rendered":"<div class=\"fusion-fullwidth fullwidth-box fusion-builder-row-1 fusion-flex-container nonhundred-percent-fullwidth non-hundred-percent-height-scrolling\" style=\"--awb-border-radius-top-left:0px;--awb-border-radius-top-right:0px;--awb-border-radius-bottom-right:0px;--awb-border-radius-bottom-left:0px;--awb-padding-top:0px;--awb-padding-right:0px;--awb-padding-bottom:0px;--awb-padding-left:0px;--awb-flex-wrap:wrap;\" ><div class=\"fusion-builder-row fusion-row fusion-flex-align-items-flex-start fusion-flex-content-wrap\" style=\"max-width:1310.4px;margin-left: calc(-4% \/ 2 );margin-right: calc(-4% \/ 2 );\"><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-0 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-overflow:hidden;--awb-bg-size:cover;--awb-border-color:var(--awb-color4);--awb-border-top:1px;--awb-border-right:1px;--awb-border-bottom:1px;--awb-border-left:1px;--awb-border-style:solid;--awb-border-radius:10px 10px 10px 10px;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:30px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-center fusion-content-layout-row\"><div 
class=\"fusion-image-element \" style=\"--awb-caption-title-font-family:var(--h2_typography-font-family);--awb-caption-title-font-weight:var(--h2_typography-font-weight);--awb-caption-title-font-style:var(--h2_typography-font-style);--awb-caption-title-size:var(--h2_typography-font-size);--awb-caption-title-transform:var(--h2_typography-text-transform);--awb-caption-title-line-height:var(--h2_typography-line-height);--awb-caption-title-letter-spacing:var(--h2_typography-letter-spacing);\"><span class=\" fusion-imageframe imageframe-none imageframe-1 hover-type-none\"><img loading=\"lazy\" decoding=\"async\" width=\"492\" height=\"358\" title=\"amazon\" src=\"https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/amazon.png\" class=\"img-responsive wp-image-6023\" srcset=\"https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/amazon-200x146.png 200w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/amazon-400x291.png 400w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/amazon.png 492w\" sizes=\"auto, (max-width: 640px) 100vw, 492px\" alt=\"\"><\/span><\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-1 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-1 fusion-text-no-margin text-style\" style=\"--awb-font-size:28px;--awb-text-color:#000000;--awb-margin-bottom:50px;\"><p class=\"vc_custom_heading b_heading 
first_heading\"><strong>Optimizing file formats and compression<\/strong><\/p>\n<\/div><div class=\"fusion-text fusion-text-2 fusion-text-no-margin\" style=\"--awb-margin-bottom:0px;\"><p>In the earlier blog post, Athena: Beyond the Basics \u2013 Part 1, we examined working with Twitter data and executing complex queries using Athena. In this article, we look at the pricing model, experiment with different file formats and compression techniques, analyze the results, and decide on the best price-to-performance solution for the current use case.<\/p>\n<\/div><div class=\"fusion-text fusion-text-3 fusion-text-no-margin\" style=\"--awb-font-size:28px;--awb-margin-bottom:50px;\"><p><strong>Athena Pricing<\/strong><\/p>\n<\/div><div class=\"fusion-text fusion-text-4 fusion-text-no-margin\" style=\"--awb-margin-bottom:50px;\"><p>Athena is priced by the amount of data scanned when running queries. Selecting the appropriate format and compression, and balancing those factors against query response time, can yield considerable cost savings for the expected response times.<\/p>\n<p>https:\/\/aws.amazon.com\/athena\/pricing\/<\/p>\n<p>The following section describes the process of converting the data to multiple file formats and compression codecs, so that we can measure cost and response time for the Twitter use case.<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-2 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper 
fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-5 fusion-text-no-margin text-style\" style=\"--awb-font-size:28px;--awb-text-color:#000000;--awb-margin-bottom:50px;\"><p><strong>Converting File Formats and Compression<\/strong><\/p>\n<\/div><div class=\"fusion-text fusion-text-6 fusion-text-no-margin\" style=\"--awb-text-color:#000000;--awb-margin-bottom:30px;\"><p>Athena currently doesn\u2019t support inserting\/updating data on S3. An EMR cluster or similar has to be used to create a new data set in the preferred file format from the existing data set. We are going to add EMR to the previous architecture to convert the data to different file formats and compression techniques.<\/p>\n<\/div><div class=\"fusion-image-element \" style=\"text-align:center;--awb-margin-bottom:30px;--awb-max-width:100%;--awb-caption-title-font-family:var(--h2_typography-font-family);--awb-caption-title-font-weight:var(--h2_typography-font-weight);--awb-caption-title-font-style:var(--h2_typography-font-style);--awb-caption-title-size:var(--h2_typography-font-size);--awb-caption-title-transform:var(--h2_typography-text-transform);--awb-caption-title-line-height:var(--h2_typography-line-height);--awb-caption-title-letter-spacing:var(--h2_typography-letter-spacing);\"><span class=\" fusion-imageframe imageframe-none imageframe-2 hover-type-none\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"251\" title=\"Athena_Blog_Graphics2-02\" src=\"https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/Athena_Blog_Graphics2-02-800x251.png\" class=\"img-responsive wp-image-5979\" srcset=\"https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/Athena_Blog_Graphics2-02-200x63.png 200w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/Athena_Blog_Graphics2-02-400x126.png 400w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/Athena_Blog_Graphics2-02-600x188.png 600w, 
https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/Athena_Blog_Graphics2-02-800x251.png 800w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/Athena_Blog_Graphics2-02-1200x377.png 1200w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/Athena_Blog_Graphics2-02.png 1458w\" sizes=\"auto, (max-width: 640px) 100vw, 800px\" alt=\"\"><\/span><\/div><div class=\"fusion-text fusion-text-7 fusion-text-no-margin\" style=\"--awb-margin-bottom:30px;\"><p>The following link covers a great deal about converting to multiple file formats:<br \/>\nhttp:\/\/docs.aws.amazon.com\/athena\/latest\/ug\/convert-to-columnar.html<\/p>\n<p>The approach for converting to different file formats and querying from Athena is as follows:<\/p>\n<ul>\n<li>Create a Hive table with the relevant target format and compression<\/li>\n<\/ul>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-3 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#c6c6c6;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-8 fusion-text-no-margin blog-text-style\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>CREATE EXTERNAL TABLE ([column - data type])<br \/>\nSTORED AS<br \/>\nLOCATION<br \/>\ntblproperties(\u201c)<\/p>\n<\/div><\/div><\/div><div 
class=\"fusion-layout-column fusion_builder_column fusion-builder-column-4 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-9 fusion-text-no-margin\" style=\"--awb-margin-top:20px;--awb-margin-bottom:20px;\"><ul>\n<li>Populate the data into new storage\/compression format<\/li>\n<\/ul>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-5 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#bababa;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-10 fusion-text-no-margin\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>INSERT INTO SELECT * from<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-6 fusion_builder_column_1_1 1_1 
fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-11 fusion-text-no-margin blog-text-style\" style=\"--awb-margin-bottom:20px;\"><ul>\n<li>Create an Athena table based on the new data set stored on S3. Currently, the Athena catalog manager doesn\u2019t share the Hive catalog.<\/li>\n<\/ul>\n<p>The following code snippets are used to create multiple versions of the same data set for experimenting with Athena.<\/p>\n<\/div><div class=\"fusion-text fusion-text-12 fusion-text-no-margin\" style=\"--awb-font-size:20px;--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p><strong>JSON FORMAT:<\/strong><\/p>\n<\/div><div class=\"fusion-text fusion-text-13 fusion-text-no-margin\" style=\"--awb-margin-bottom:20px;\"><ul>\n<li>To convert from JSON to Snappy compression, we execute these commands in Hive:<\/li>\n<\/ul>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-7 fusion_builder_column_1_1 1_1 fusion-flex-column\" 
style=\"--awb-bg-size:cover;--awb-border-color:#bababa;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-14 fusion-text-no-margin blog-text-style\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>SET hive.exec.compress.output=true;<br \/>\nSET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;<br \/>\nSET mapred.output.compression.type=BLOCK;<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-8 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#c9c9c9;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-15 fusion-text-no-margin blog-text-style\" style=\"--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p>Then we create the snappy external table exactly the same as we\u2019ve created it as json but with 
a different name (tweets_snappy, for example) and in a different location, and insert from the JSON table into the Snappy table.<\/p>\n<ul>\n<li>To convert from JSON to gzip compression, we execute these commands in Hive:<\/li>\n<\/ul>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-9 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#c1c1c1;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-16 fusion-text-no-margin blog-text-style\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>SET hive.exec.compress.output=true;<br \/>\nSET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;<br \/>\nSET mapred.output.compression.type=BLOCK;<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-10 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow 
fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-17 fusion-text-no-margin\" style=\"--awb-margin-top:20px;--awb-margin-bottom:30px;\"><p>Then we create the gzip external table with the same definition as the JSON table, but with a different name (tweets_gzip, for example) and in a different location, and insert from the JSON table into the gzip table.<\/p>\n<p><strong>TIP:<\/strong> Hive runs faster when copying from a compressed table, because it has to read less data and run fewer map tasks.<\/p>\n<p>PARQUET FORMAT:<\/p>\n<ul>\n<li>To convert from parquet to parquet_snappy compression, we add the following to the end of the CREATE statement for the parquet table:<\/li>\n<\/ul>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-11 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#c4c4c4;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:30px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-18 fusion-text-no-margin\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:50px;--awb-margin-left:20px;\"><p>TBLPROPERTIES (\"PARQUET.COMPRESS\"=\"SNAPPY\");<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-12 fusion_builder_column_1_1 1_1 fusion-flex-column\" 
style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-19 fusion-text-no-margin blog-text-style\" style=\"--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p>Then we insert into the parquet_snappy table from any existing table that already holds the data (json, json_snappy, parquet\u2026).<\/p>\n<ul>\n<li>To convert from parquet to parquet_gzip compression, we add the following to the end of the CREATE statement for the parquet table:<\/li>\n<\/ul>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-13 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#cccccc;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-20 fusion-text-no-margin\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>TBLPROPERTIES (\"PARQUET.COMPRESS\"=\"GZIP\");<\/p>\n<\/div><\/div><\/div><div 
class=\"fusion-layout-column fusion_builder_column fusion-builder-column-14 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#e8e8e8;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-21 fusion-text-no-margin blog-text-style\" style=\"--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p>Then we insert into the parquet_gzip table from any existing table that already holds the data (json, json_snappy, parquet\u2026).<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-15 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#e8e8e8;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-22 fusion-text-no-margin\" style=\"--awb-font-size:20px;--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p><strong>ORC FORMAT:<\/strong><\/p>\n<\/div><div class=\"fusion-text fusion-text-23 fusion-text-no-margin\" 
style=\"--awb-margin-top:20px;--awb-margin-bottom:20px;\"><ul>\n<li>To convert from orc to orc_snappy compression we add the following to the end of CREATE ORC table statement:<\/li>\n<\/ul>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-16 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#c4c4c4;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-24 fusion-text-no-margin\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>TBLPROPERTIES (\"ORC.COMPRESS\"=\"SNAPPY\");<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-17 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#e8e8e8;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-25 
fusion-text-no-margin\" style=\"--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p>Then we insert into the orc_snappy table from any existing table that already holds the data (json, json_snappy, parquet\u2026).<\/p>\n<ul>\n<li>To convert from orc to orc_gzip compression, we add the following to the end of the CREATE statement for the ORC table:<\/li>\n<\/ul>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-18 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#c9c9c9;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-26 fusion-text-no-margin\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>TBLPROPERTIES (\"ORC.COMPRESS\"=\"GZIP\");<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-19 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#dbdbdb;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div 
class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-27 fusion-text-no-margin blog-text-style\" style=\"--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p>Then we insert into the orc_gzip table from any existing table that already holds the data (json, json_snappy, parquet\u2026).<\/p>\n<p>Using the code snippets above, we create JSON, ORC, and Parquet tables, each without compression and with Snappy and gzip compression, for a total of 9 tables.<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-20 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#dbdbdb;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-28 fusion-text-no-margin\" style=\"--awb-font-size:28px;--awb-margin-bottom:30px;\"><p><strong>Queries for Analysis<\/strong><\/p>\n<\/div><div class=\"fusion-text fusion-text-29 fusion-text-no-margin\" style=\"--awb-margin-bottom:20px;\"><p>The following queries from the earlier blog post are used to compare data scanned and response times.<\/p>\n<\/div><div class=\"fusion-text fusion-text-30 fusion-text-no-margin\" style=\"--awb-font-size:20px;--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p><strong>Query 1: Total records in the data set<\/strong><\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-21 fusion_builder_column_1_1 
1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#cecece;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:50px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-31 fusion-text-no-margin\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>Select count(*) from tweets<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-22 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-32 fusion-text-no-margin\" style=\"--awb-font-size:20px;--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p><strong>Query 2: Get all records<\/strong><\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-23 fusion_builder_column_1_1 1_1 fusion-flex-column\" 
style=\"--awb-bg-size:cover;--awb-border-color:#d8d8d8;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:50px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-33 fusion-text-no-margin\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>select * from tweets<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-24 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-34 fusion-text-no-margin\" style=\"--awb-font-size:20px;--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p><strong>Query 3: Top Hashtags with at least 100 occurrences<\/strong><\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-25 fusion_builder_column_1_1 1_1 fusion-flex-column\" 
style=\"--awb-bg-size:cover;--awb-border-color:#d8d8d8;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:50px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-35 fusion-text-no-margin\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>SELECT ht.text,<br \/>\ncount(*)<br \/>\nFROM tweets<br \/>\nCROSS JOIN UNNEST (entities.hashtags) AS t(ht)<br \/>\nGROUP BY ht.text<br \/>\nHAVING count(*)&gt;100<br \/>\nORDER by count(*) desc;<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-26 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-36 fusion-text-no-margin\" style=\"--awb-font-size:20px;--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p><strong>Query 4: Number of tweets from verified accounts with most followers<\/strong><\/p>\n<\/div><\/div><\/div><div 
class=\"fusion-layout-column fusion_builder_column fusion-builder-column-27 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#d3d3d3;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:50px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-37 fusion-text-no-margin blog-text-style\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>SELECT user.screen_name,user.name,max(user.followers_count),count(*)<br \/>\nFROM tweets<br \/>\nWHERE user.verified='true'<br \/>\nGROUP BY user.screen_name,user.name<br \/>\nORDER BY cast(max(user.followers_count) as integer) DESC<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-28 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-38 fusion-text-no-margin\" 
style=\"--awb-font-size:20px;--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p><strong>Query 5: Top URL mentions in Tweets<\/strong><\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-29 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#d3d3d3;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:50px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-39 fusion-text-no-margin blog-text-style\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>SELECT url_extract_host(u.expanded_url),<br \/>\ncount(*)<br \/>\nFROM tweets<br \/>\nCROSS JOIN UNNEST (entities.urls) AS t(u)<br \/>\nGROUP BY url_extract_host(u.expanded_url)<br \/>\nHAVING count(*)&gt;100<br \/>\nORDER by count(*) desc;<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-30 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow 
fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-40 fusion-text-no-margin\" style=\"--awb-font-size:20px;--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p><strong>Query 6: Hashtags tweeted along with \u201cAmazon\u201d<\/strong><\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-31 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#d6d6d6;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:50px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-41 fusion-text-no-margin blog-text-style\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>WITH ht_list AS<br \/>\n(SELECT entities.hashtags<br \/>\nFROM tweets<br \/>\nCROSS JOIN UNNEST (entities.hashtags) AS t(ht)<br \/>\nWHERE ht.text LIKE 'amazon')<br \/>\nSELECT t AS \"hashtag\",count(*) AS \"occurrences\" FROM ht_list<br \/>\nCROSS JOIN UNNEST (hashtags) AS t(t)<br \/>\nGROUP BY t<br \/>\nORDER BY count(*) desc;<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-32 fusion_builder_column_1_1 1_1 fusion-flex-column\" 
style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-42 fusion-text-no-margin\" style=\"--awb-font-size:20px;--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p><strong>Query 7: Find the number of tweets by language and sensitive media content<\/strong><\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-33 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-border-color:#e2e2e2;--awb-border-top:2px;--awb-border-right:2px;--awb-border-bottom:2px;--awb-border-left:2px;--awb-border-style:solid;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:50px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-43 fusion-text-no-margin blog-text-style\" style=\"--awb-margin-top:20px;--awb-margin-right:20px;--awb-margin-bottom:20px;--awb-margin-left:20px;\"><p>SELECT lang, possibly_sensitive, count(*)<br \/>\nFROM tweets<br \/>\nGROUP BY lang, possibly_sensitive<\/p>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-34 
fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-44 fusion-text-no-margin text-style\" style=\"--awb-font-size:28px;--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p><strong>Query execution metrics<\/strong><\/p>\n<\/div><div class=\"fusion-text fusion-text-45\" style=\"--awb-margin-top:20px;\"><p>The queries were executed on Athena (from the Athena Query manager). Response time and data scanned were captured for each execution. Although the data scanned was the same across repeated executions of a given query and table combination, the response time varied by a few milliseconds and sometimes by a couple of seconds. 
The results below show the response time averaged over multiple executions.<\/p>\n<\/div><div class=\"fusion-image-element \" style=\"--awb-margin-top:20px;--awb-margin-bottom:20px;--awb-caption-title-font-family:var(--h2_typography-font-family);--awb-caption-title-font-weight:var(--h2_typography-font-weight);--awb-caption-title-font-style:var(--h2_typography-font-style);--awb-caption-title-size:var(--h2_typography-font-size);--awb-caption-title-transform:var(--h2_typography-text-transform);--awb-caption-title-line-height:var(--h2_typography-line-height);--awb-caption-title-letter-spacing:var(--h2_typography-letter-spacing);\"><span class=\" fusion-imageframe imageframe-none imageframe-3 hover-type-none\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"303\" title=\"blog1\" src=\"https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/blog1-800x303.png\" class=\"img-responsive wp-image-5982\" srcset=\"https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/blog1-200x76.png 200w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/blog1-400x152.png 400w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/blog1-600x227.png 600w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/blog1-800x303.png 800w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/blog1.png 1179w\" sizes=\"auto, (max-width: 640px) 100vw, 800px\" alt=\"\"><\/span><\/div><div class=\"fusion-image-element \" 
style=\"--awb-margin-top:20px;--awb-margin-bottom:50px;--awb-caption-title-font-family:var(--h2_typography-font-family);--awb-caption-title-font-weight:var(--h2_typography-font-weight);--awb-caption-title-font-style:var(--h2_typography-font-style);--awb-caption-title-size:var(--h2_typography-font-size);--awb-caption-title-transform:var(--h2_typography-text-transform);--awb-caption-title-line-height:var(--h2_typography-line-height);--awb-caption-title-letter-spacing:var(--h2_typography-letter-spacing);\"><span class=\" fusion-imageframe imageframe-none imageframe-4 hover-type-none\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"300\" title=\"blog2\" src=\"https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/blog2-800x300.png\" class=\"img-responsive wp-image-5983\" srcset=\"https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/blog2-200x75.png 200w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/blog2-400x150.png 400w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/blog2-600x225.png 600w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/blog2-800x300.png 800w, https:\/\/northbaysolutions.com\/wp-content\/uploads\/2021\/08\/blog2.png 1188w\" sizes=\"auto, (max-width: 640px) 100vw, 800px\" alt=\"\"><\/span><\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-35 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div 
class=\"fusion-text fusion-text-46 fusion-text-no-margin\" style=\"--awb-font-size:28px;--awb-margin-top:50px;--awb-margin-bottom:50px;\"><p><strong>Testing Analysis<\/strong><\/p>\n<\/div><div class=\"fusion-text fusion-text-47 fusion-text-no-margin\" style=\"--awb-margin-bottom:50px;\"><ul>\n<li>For the current use case and the identified set of queries, the ORC format provides the best price-to-performance ratio ($0.55 to execute all queries, with an aggregate response time within 30 seconds). This can be attributed to the columnar storage format and the kinds of queries used in this blog post<\/li>\n<li>Converting the data to ORC or PARQUET format can reduce Athena costs by over 90%<\/li>\n<li>Compressing raw JSON with Snappy or Gzip also reduces costs significantly, by over 80%, but the response time did not improve<\/li>\n<li>Snappy\/Gzip compression on PARQUET and ORC has limited benefit because these file formats are already compressed<\/li>\n<li>Avoid querying the dataset in raw text format with Athena<\/li>\n<\/ul>\n<\/div><\/div><\/div><div class=\"fusion-layout-column fusion_builder_column fusion-builder-column-36 fusion_builder_column_1_1 1_1 fusion-flex-column\" style=\"--awb-bg-size:cover;--awb-width-large:100%;--awb-margin-top-large:0px;--awb-spacing-right-large:1.92%;--awb-margin-bottom-large:0px;--awb-spacing-left-large:1.92%;--awb-width-medium:100%;--awb-order-medium:0;--awb-spacing-right-medium:1.92%;--awb-spacing-left-medium:1.92%;--awb-width-small:100%;--awb-order-small:0;--awb-spacing-right-small:1.92%;--awb-spacing-left-small:1.92%;\"><div class=\"fusion-column-wrapper fusion-column-has-shadow fusion-flex-justify-content-flex-start fusion-content-layout-column\"><div class=\"fusion-text fusion-text-48 fusion-text-no-margin\" style=\"--awb-font-size:28px;--awb-margin-bottom:50px;\"><p><strong>Conclusion<\/strong><\/p>\n<\/div><div class=\"fusion-text fusion-text-49\" style=\"--awb-margin-top:20px;\"><p>The blog post is intended 
to share the approach of testing and experimenting with different file formats and compression techniques to optimize for cost and performance. The findings for the current use case may not apply to other queries and use cases. We recommend testing the different options before finalizing the format and compression for very large data sets and data lake scenarios.<\/p>\n<\/div><div class=\"fusion-text fusion-text-50 fusion-text-no-margin\" style=\"--awb-text-color:#a3a3a3;--awb-margin-bottom:30px;\"><p>\u201cAthena: Beyond the Basics \u2013 Part 2\u201d<\/p>\n<\/div><div class=\"fusion-text fusion-text-51 fusion-text-no-margin\" style=\"--awb-font-size:20px;--awb-margin-top:20px;--awb-margin-bottom:20px;\"><p><strong>Additional references:<\/strong><\/p>\n<\/div><div class=\"fusion-text fusion-text-52\" style=\"--awb-margin-top:20px;\"><p>List of SQL statements supported by Athena<\/p>\n<p>https:\/\/aws.amazon.com\/blogs\/big-data\/analyzing-data-in-s3-using-amazon-athena\/<br \/>\nhttps:\/\/orc.apache.org\/<br 
\/>\nhttps:\/\/parquet.apache.org\/<\/p>\n<\/div><\/div><\/div><\/div><\/div>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":3,"featured_media":6023,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[55,38],"tags":[57,88],"class_list":["post-5975","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog","category-big-data-data-lake-analytics","tag-all-industries","tag-exclude"],"_links":{"self":[{"href":"https:\/\/northbaysolutions.com\/wp-json\/wp\/v2\/posts\/5975","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/northbaysolutions.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/northbaysolutions.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/northbaysolutions.com\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/northbaysolutions.com\/wp-json\/wp\/v2\/comments?post=5975"}],"version-history":[{"count":3,"href":"https:\/\/northbaysolutions.com\/wp-json\/wp\/v2\/posts\/5975\/revisions"}],"predecessor-version":[{"id":24345,"href":"https:\/\/northbaysolutions.com\/wp-json\/wp\/v2\/posts\/5975\/revisions\/24345"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/northbaysolutions.com\/wp-json\/wp\/v2\/media\/6023"}],"wp:attachment":[{"href":"https:\/\/northbaysolutions.com\/wp-json\/wp\/v2\/media?parent=5975"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/northbaysolutions.com\/wp-json\/wp\/v2\/categories?post=5975"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/northbaysolutions.com\/wp-json\/wp\/v2\/tags?post=5975"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}