Part of the blog series https://www.shi-gmbh.com/herausforderung-customer-centricity/
Author: Dr. Eduardo Torres Schumann, SHI GmbH
Customers are to be described by their purchases in different product categories. To this end, all transactions of a customer are examined and the total purchase value per category is determined. In addition to the spending per category, the average number of products and value of the shopping basket as well as the average time in days until a repeat purchase are used to characterize a customer.
We consider all customers who made purchases at the end of 2016.
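The goal is one feature row per customer, roughly with the following columns (a sketch with invented numbers, only to illustrate the layout that is built step by step below):
# hypothetical example of the customer feature row assembled below (all values are invented)
example_row = {
    "fullVisitorId": "1234567890",
    "avg_revenue": 25.0e6, "sum_revenue": 50.0e6,    # GA revenues are exported in micros
    "total_transactions": 2, "avg_time": 14.0,       # avg_time = days until repurchase
    "Apparel": 30.0e6, "Bags": 0.0, "Brand": 0.0, "Drinkware": 20.0e6,
    "Electronics": 0.0, "Lifestyle": 0.0, "Office": 0.0
}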
import findspark
findspark.init('/usr/local/Cellar/apache-spark/2.3.2/libexec/')
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.window import Window
sc = SparkSession.builder.appName("TestGA").getOrCreate()
# before ga_sessions_201611* the product category is mostly "(not set)"
# PySpark sometimes misbehaves with Java > 8
ga_data = sc.read.json('/Users/eduardoschumann/Documents/Projekte/E-CommerceAnalytics/ga_data/ga_sessions_2016*.json.gz')
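Since the GA session export is deeply nested, it can help to inspect the schema before picking fields (an optional check):
# optional: inspect the nested structure of the GA export before selecting fields
ga_data.printSchema()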
Filtering the 'hits' that represent a transaction:
# explode the hits array first; the alias 'hit' can only be referenced in a subsequent select
transactionsInfo = ga_data.select('fullVisitorId', 'visitId', 'visitNumber',
                                  expr("totals.transactions").cast('integer'),
                                  expr("totals.transactionRevenue").cast('integer'),
                                  explode('hits').alias('hit'))\
    .select('*',
            'hit.contentGroup',
            'hit.page',
            'hit.isExit',
            'hit.isInteraction',
            'hit.product',
            'hit.transaction',
            'hit.transaction.transactionId',
            'hit.product.productQuantity',
            'hit.type')\
    .fillna(0)\
    .filter("transactions > 0")\
    .filter(col('hit.transaction.transactionId').isNotNull())\
    .filter(col('hit.isExit').isNull())\
    .drop('hit')
transactionsInfo.show(3,truncate=False)
Product revenue in the transaction 'hits':
# explode the product array of each transaction hit, then extract the product fields
productTransactions = transactionsInfo.select('fullVisitorId', 'transactionRevenue',
                                              'transaction.transactionId',
                                              explode('product').alias('sproduct'))\
    .select('*',
            'sproduct.v2ProductName',
            'sproduct.v2ProductCategory',
            expr('sproduct.productPrice').cast('integer').alias('price'),
            expr('sproduct.productQuantity').cast('integer').alias('quantity'),
            (col("sproduct.productPrice") * col("sproduct.productQuantity")).alias('productSales'))\
    .filter(col('quantity').isNotNull())\
    .drop('sproduct')
productTransactions.show(truncate=False)
customerCategories = productTransactions.groupBy('fullVisitorId', 'v2ProductCategory').agg(sum('productSales').alias('categorySales'))
customerCategories.show(truncate=False)
customerCategories.groupBy("v2ProductCategory").count().show(30)
Grouping the categories into top-level categories
Top-level categories:
drinkware_regex = "Housewares|Drinkware|Bottles|Tumblers"
bags_regex = "Backpacks|Bags|More Bags"
office_regex = "Office|Notebooks & Journals|Gift Cards"
brand_regex = "Waze|Google|Android"
electronics_regex = "Electronics"
lifestyle_regex = "Accessories|Lifestyle|Fun"
apparel_regex = "Apparel|Headgear"
toplevelCategories = customerCategories.filter(~((col('v2ProductCategory') == '${productitem.product.origCatName}') |
                                                 (col('v2ProductCategory') == '(not set)')))\
    .withColumn("category",
                regexp_replace(
                    regexp_replace(
                        regexp_replace(
                            regexp_replace(
                                regexp_replace(
                                    regexp_replace(
                                        regexp_replace(col("v2ProductCategory"),
                                                       apparel_regex, "Apparel"),
                                        lifestyle_regex, "Lifestyle"),
                                    electronics_regex, "Electronics"),
                                brand_regex, "Brand"),
                            office_regex, "Office"),
                        bags_regex, "Bags"),
                    drinkware_regex, "Drinkware"))
toplevelCategories.show()
#groupBy("category").count().show(30)
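For readability, the same grouping could also be expressed with when/rlike instead of nested regexp_replace. Note that this variant assigns exactly one top-level label per row rather than replacing matching substrings (a sketch, not the code used above):
# sketch: map every category string to a single top-level label via regex matching
toplevelCategoriesAlt = customerCategories\
    .filter(~col('v2ProductCategory').isin('${productitem.product.origCatName}', '(not set)'))\
    .withColumn("category",
                when(col("v2ProductCategory").rlike(apparel_regex), "Apparel")
                .when(col("v2ProductCategory").rlike(lifestyle_regex), "Lifestyle")
                .when(col("v2ProductCategory").rlike(electronics_regex), "Electronics")
                .when(col("v2ProductCategory").rlike(brand_regex), "Brand")
                .when(col("v2ProductCategory").rlike(office_regex), "Office")
                .when(col("v2ProductCategory").rlike(bags_regex), "Bags")
                .when(col("v2ProductCategory").rlike(drinkware_regex), "Drinkware")
                .otherwise(col("v2ProductCategory")))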
Summing up the spending per category and pivoting the table
Pivoting DataFrames in Spark: https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-apache-spark.html
visitorRevenueByCategory = toplevelCategories.groupBy('fullVisitorId').pivot('category').agg(sum('categorySales')).fillna(0.0)
visitorRevenueByCategory.show()
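The pivot avoids an extra pass over the data and gets a deterministic column order if the distinct values are passed explicitly (a sketch, assuming only the seven top-level labels occur, which is what the VectorAssembler below expects anyway):
# sketch: pivot with an explicit list of category values
categories = ["Apparel", "Bags", "Brand", "Drinkware", "Electronics", "Lifestyle", "Office"]
visitorRevenueByCategoryExplicit = toplevelCategories.groupBy('fullVisitorId')\
    .pivot('category', categories).agg(sum('categorySales')).fillna(0.0)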
transactions = ga_data.select('fullVisitorId',
                              to_date("date", "yyyyMMdd").alias("pdate"),
                              expr("totals.transactions").cast('integer'),
                              expr("totals.totalTransactionRevenue").cast('float'))\
    .fillna(0).filter("transactions > 0")
#transactions.show(50)
# days since the customer's previous transaction (null for the first purchase, filled with 0)
timeintertrans = transactions.withColumn("time_intertrans",
                                         datediff(transactions.pdate,
                                                  lag(transactions.pdate, 1).over(Window.partitionBy("fullVisitorId").orderBy("pdate"))))\
    .fillna(0)
timeintertrans.show()
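To make the window function concrete, here is a toy example with made-up data (hypothetical values, not from the GA export); each row gets the gap in days to the same customer's previous purchase date, 0 for the first purchase:
# hypothetical mini example of lag() + datediff() over a per-customer window
demo = sc.createDataFrame([("A", "2016-11-01"), ("A", "2016-11-15"), ("B", "2016-12-02")],
                          ["fullVisitorId", "pdate"]).withColumn("pdate", to_date("pdate"))
w = Window.partitionBy("fullVisitorId").orderBy("pdate")
demo.withColumn("time_intertrans", datediff("pdate", lag("pdate", 1).over(w))).fillna(0).show()
# customer A: 0 and 14 days, customer B: 0 days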
visitorTransactionKPIs = timeintertrans.groupBy('fullVisitorId').agg(avg("totalTransactionRevenue").alias("avg_revenue"),
                                                                     sum("totalTransactionRevenue").alias("sum_revenue"),
                                                                     sum("transactions").alias("total_transactions"),
                                                                     avg("time_intertrans").alias("avg_time"))
customerMatrix = visitorTransactionKPIs.join(visitorRevenueByCategory,"fullVisitorId")
customerMatrix.limit(15).toPandas()
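The default join is an inner join, so customers that have transaction KPIs but no revenue in any of the seven top-level categories are dropped; if those customers should be kept, a left join plus fillna is one option (a sketch that deviates from the matrix used below):
# sketch: keep customers without matched category revenue as all-zero rows
customerMatrixAll = visitorTransactionKPIs.join(visitorRevenueByCategory, "fullVisitorId", "left").fillna(0)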
For the ML algorithm, the features must be encoded as a vector of numbers. Since the basket values and the category spendings lie in different ranges, they are scaled.
from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import VectorAssembler
from pyspark.ml import Pipeline
# scaling (withMean=False avoids centering, which would densify sparse vectors)
sScaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                         withStd=True, withMean=False)
# the input to the algorithm must be a single vector column
vectorAssembler = VectorAssembler()\
    .setInputCols(["avg_revenue", "sum_revenue", "total_transactions", "avg_time",
                   "Apparel", "Bags", "Brand", "Drinkware", "Electronics", "Lifestyle", "Office"])\
    .setOutputCol("features")
# assembling, scaling
transformationPipeline = Pipeline().setStages([vectorAssembler,sScaler])
# collect statistics for scaling
fittedPipeline = transformationPipeline.fit(customerMatrix)
# actually transform
transformedMatrix = fittedPipeline.transform(customerMatrix)
# put it into cache so it can be accessed many times efficiently
transformedMatrix.cache()
transformedMatrix.select("features", "scaledFeatures").show()
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
# Trains a k-means model.
# Test with k = 10
kmeans = KMeans().setK(10).setSeed(1)
model = kmeans.fit(transformedMatrix)
# Make predictions
predictions = model.transform(transformedMatrix)
model.computeCost(transformedMatrix)
predictions
# Evaluate clustering by computing Silhouette score
# The Silhouette is a measure of the consistency within clusters.
# It ranges between -1 and 1, where a value close to 1 means that the points in a cluster are close to the other
# points in the same cluster and far from the points of the other clusters.
# https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
evaluator = ClusteringEvaluator()
silhouette = evaluator.evaluate(predictions)
print("Silhouette with squared euclidean distance = " + str(silhouette))
# Shows the result.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
print(center)
bestK = 0
bestSilhouette = 0.0
bestModel = None
evaluator = ClusteringEvaluator()
for k in range(5, 16):
    kmeans = KMeans().setK(k).setSeed(1)
    model = kmeans.fit(transformedMatrix)
    cost = model.computeCost(transformedMatrix)
    predictions = model.transform(transformedMatrix)
    silhouette = evaluator.evaluate(predictions)
    print("k: ", k, "silhouette: ", silhouette)
    if silhouette > bestSilhouette:
        bestK = k
        bestSilhouette = silhouette
        bestModel = model
print("best k: ", bestK, "silhouette: ", bestSilhouette)
bestModel.summary.clusterSizes
predictions = bestModel.transform(transformedMatrix)
predictions
# average cluster profile; revenue figures are converted from micros into currency units
clusterProperties = predictions.groupBy('prediction').agg(avg(expr('avg_revenue / 1.0E6')).alias('avgRev'),
                                                          avg(expr('sum_revenue / 1.0E6')).alias('sumRev'),
                                                          avg('total_transactions').alias('transactions'),
                                                          avg('avg_time').alias('days'),
                                                          avg(expr('Apparel / 1.0E6')).alias('Apparel'),
                                                          avg(expr('Bags / 1.0E6')).alias('Bags'),
                                                          avg(expr('Brand / 1.0E6')).alias('Brand'),
                                                          avg(expr('Drinkware / 1.0E6')).alias('Drinkware'),
                                                          avg(expr('Electronics / 1.0E6')).alias('Electronics'),
                                                          avg(expr('Lifestyle / 1.0E6')).alias('Lifestyle'),
                                                          avg(expr('Office / 1.0E6')).alias('Office'),
                                                          count('prediction').alias('size')).orderBy('prediction')
clusterProperties.show()
import numpy as np
import plotly.plotly as py
import plotly.graph_objs as go
import matplotlib.pyplot as plt
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode()
clusterSizes = clusterProperties.select(col('prediction').cast('string'), 'size').toPandas()
iplot(go.Figure(layout=dict(title='Cluster Sizes', width=500, height=500),
                data=[go.Bar(y=clusterSizes["size"],
                             x=clusterSizes["prediction"],
                             text=clusterSizes["size"],
                             textposition='auto')]))
# column maxima, used to choose the axis ranges of the radar charts below
clusterProperties.select(max('avgRev'),
                         max('sumRev'),
                         max('transactions'),
                         max('days'),
                         max('Apparel'),
                         max('Bags'),
                         max('Brand'),
                         max('Drinkware'),
                         max('Electronics'),
                         max('Lifestyle'),
                         max('Office')).toPandas()
# modified from https://www.kaggle.com/fabiendaniel/customer-segmentation
def _scale_data(data, ranges):
    (x1, x2) = ranges[0]
    d = data[0]
    return [(d - y1) / (y2 - y1) * (x2 - x1) + x1 for d, (y1, y2) in zip(data, ranges)]
class RadarChart():
    def __init__(self, fig, location, sizes, variables, ranges, n_ordinate_levels=6):
        angles = np.arange(0, 360, 360. / len(variables))
        ix, iy = location[:]
        size_x, size_y = sizes[:]
        axes = [fig.add_axes([ix, iy, size_x, size_y], polar=True,
                             label="axes{}".format(i)) for i in range(len(variables))]
        _, text = axes[0].set_thetagrids(angles, labels=variables)
        for txt, angle in zip(text, angles):
            if angle > -1 and angle < 181:
                txt.set_rotation(angle - 90)
            else:
                txt.set_rotation(angle - 270)
        for ax in axes[1:]:
            ax.patch.set_visible(False)
            ax.xaxis.set_visible(False)
            ax.grid(False)
        for i, ax in enumerate(axes):
            grid = np.linspace(*ranges[i], num=n_ordinate_levels)
            grid_label = [""] + ["{:.0f}".format(x) for x in grid[1:-1]]
            ax.set_rgrids(grid, labels=grid_label, angle=angles[i])
            ax.set_ylim(*ranges[i])
        self.angle = np.deg2rad(np.r_[angles, angles[0]])
        self.ranges = ranges
        self.ax = axes[0]

    def plot(self, data, *args, **kw):
        sdata = _scale_data(data, self.ranges)
        self.ax.plot(self.angle, np.r_[sdata, sdata[0]], *args, **kw)

    def fill(self, data, *args, **kw):
        sdata = _scale_data(data, self.ranges)
        self.ax.fill(self.angle, np.r_[sdata, sdata[0]], *args, **kw)

    def legend(self, *args, **kw):
        self.ax.legend(*args, **kw)

    def title(self, title, *args, **kw):
        self.ax.text(0, 1.1, title, transform=self.ax.transAxes, *args, **kw)
merged_df = clusterProperties.toPandas()
n_clusters = 6
fig = plt.figure(figsize=(12,12))
attributes = ['avgRev', 'sumRev', 'transactions', 'days', 'Apparel', 'Bags', 'Brand', 'Drinkware', 'Electronics', 'Lifestyle', 'Office']
#to-do
ranges = [[0, 7100], [0, 7100], [0, 20], [0, 10], [0, 1500], [0, 1500], [0, 10], [0, 7100], [0, 1500], [0, 1500], [0, 1500]]
#7003.500032 7003.500032 18.5 9.873274 1072.115 1224.825 3.741908 6996.5 305.6405 225.35 1048.96
index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
n_groups = n_clusters ; i_cols = 3
i_rows = n_groups//i_cols
size_x, size_y = (1/i_cols), (1/i_rows)
for ind in range(n_clusters):
    cluster_name = merged_df.loc[index[ind], 'prediction']
    cluster_size = merged_df.loc[index[ind], 'size']
    ix = ind % 3 ; iy = i_rows - ind // 3
    pos_x = ix * (size_x + 0.1) ; pos_y = iy * (size_y + 0.0)
    location = [pos_x, pos_y] ; sizes = [size_x, size_y]
    #______________________________________________________
    data = np.array(merged_df.loc[index[ind], attributes])
    radar = RadarChart(fig, location, sizes, attributes, ranges)
    radar.plot(data, color='b', linewidth=2.0)
    radar.fill(data, alpha=0.2, color='b')
    radar.title(title='cluster {}, size = {}'.format(cluster_name, cluster_size), color='r', size=14)
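Depending on the notebook setup, the figure may have to be rendered or saved explicitly (optional lines, assuming the usual matplotlib workflow):
# optional: display the radar charts, or write them to a file
plt.show()
# fig.savefig('cluster_profiles.png', bbox_inches='tight')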